Time Series with Categories

Hey Guys,
So I want to perform anomaly detection on the Veremi dataset. It is an autonomous car dataset. There are 2 parts to this dataset: one is the ground truth and the other is the log files from different cars. A lot of machine learning algorithms have been used on this dataset, and what they basically do is train on the ground truth and test on the incoming log files. I want to do something similar with HTM.

Ground truth has 6 major fields: Timestamp, Sender, Position, Speed, Acceleration and Distance.
The timestamp is a continuous field, but the sender is a categorical field (the car ids), and the other fields are continuous values that are only meaningful within a particular sender. So it’s like each sender has 4 time series, named Position, Speed, Acceleration and Distance.

I can’t train an individual HTM model for each sender id because I have only around 600-1000 data points per car/sender. So I have 2 options:

  1. Rearrange the data points so that each car id appears contiguously, basically sorting by sender id. Right now the senders are interleaved: one message is received from one car, then another from a second car, and then again from the first car. After sorting, the other columns are continuous within each sender id, and discontinuities only happen where the sender id changes.

  2. Pass the data as is, sorted timestamp wise.

I want to do the second option, but will the HTM be able to keep track of the 4 data columns according to the sender ids?
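
To make the two orderings concrete, here is a minimal sketch in plain Python. The rows, field layout, and the helper `sender_switches` are made up for illustration and are not taken from the actual dataset files.

```python
# Hypothetical rows: (timestamp, sender, speed). Values are invented for illustration.
rows = [
    (0.0, "car_1", 10.0),
    (0.5, "car_2", 20.0),
    (1.0, "car_1", 10.5),
    (1.5, "car_2", 19.5),
]

# Option 1: group rows by sender, keeping time order within each sender.
by_sender = sorted(rows, key=lambda r: (r[1], r[0]))

# Option 2: pure timestamp order, so senders stay interleaved.
by_time = sorted(rows, key=lambda r: r[0])

def sender_switches(ordered):
    """Count how often two consecutive rows come from different senders."""
    return sum(a[1] != b[1] for a, b in zip(ordered, ordered[1:]))

print(sender_switches(by_sender))  # 1
print(sender_switches(by_time))    # 3
```

In option 1 the sequence only jumps once per sender boundary, while in option 2 almost every step can be a jump between cars.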

all the data columns are not visible :frowning:

Hey @thos1996, so the categories you’re trying to distinguish are the different drivers (senders)? Like in practice, your system will receive these data (Position, Speed, Acceleration, Distance) and decide which sender it’s coming from?

My approach would be to train separate models for each sender, then feed the incoming data to each model and compare the anomaly scores. Basically the sender whose model has the lowest anomaly scores (over some recent time span) is the chosen sender.
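
A rough sketch of that selection logic is below. `StandInModel` is a hypothetical placeholder that scores a value by its distance from the training mean; it is NOT an HTM model and only stands in for "something that returns an anomaly score in [0, 1)". The sender names and values are invented.

```python
from statistics import mean, pstdev

class StandInModel:
    """Placeholder for a per-sender model (not an HTM implementation)."""

    def __init__(self, training_values):
        self.mu = mean(training_values)
        # Fall back to 1.0 so a constant training series does not divide by zero.
        self.sigma = pstdev(training_values) or 1.0

    def anomaly_score(self, value):
        # Distance from the training mean in standard deviations, squashed to [0, 1).
        z = abs(value - self.mu) / self.sigma
        return z / (1.0 + z)

def most_likely_sender(models, recent_values):
    """Return the sender whose model has the lowest mean anomaly score."""
    avg = {
        sender: mean(model.anomaly_score(v) for v in recent_values)
        for sender, model in models.items()
    }
    return min(avg, key=avg.get)

# One model per sender, each trained only on that sender's (made-up) speeds.
models = {
    "car_1": StandInModel([10.0, 10.5, 9.8, 10.2]),
    "car_2": StandInModel([20.0, 19.5, 20.4, 20.1]),
}
print(most_likely_sender(models, [19.9, 20.2, 20.0]))  # car_2
```

With real HTM models the same loop would apply, only with each model's own anomaly score in place of the stand-in.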

I understand the data per sender is limited, but I would definitely try it out. If the signal to noise ratio is high enough the models could still potentially learn enough to be effective. I’ve actually had some success with this approach using similar amounts of data – tho from a different domain where the signal/noise ratio was sufficient. I’d do some exploratory analysis to gauge this, and maybe consider some de-noising preprocessing like differencing.
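
As a sketch of the differencing idea, assuming rows of the hypothetical form (timestamp, sender, value): each value is replaced by its change since the same sender's previous message, so the differences are never taken across two different cars.

```python
def difference_per_sender(rows):
    """Return (timestamp, sender, delta) rows in timestamp order.

    delta is the value minus that sender's previous value; the first
    message of each sender has no predecessor and is dropped.
    """
    last_value = {}
    out = []
    for ts, sender, value in sorted(rows, key=lambda r: r[0]):
        if sender in last_value:
            out.append((ts, sender, value - last_value[sender]))
        last_value[sender] = value
    return out

# Invented example rows, interleaved by sender.
rows = [
    (0.0, "car_1", 10.0),
    (0.5, "car_2", 20.0),
    (1.0, "car_1", 10.5),
    (1.5, "car_2", 19.5),
]
print(difference_per_sender(rows))
# [(1.0, 'car_1', 0.5), (1.5, 'car_2', -0.5)]
```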

I’m not sure exactly what you mean by “keep track” here, but any single HTM will learn the transitions between successive data points. So if a data point from sender 3 follows one from sender 1, the model will learn to predict sender 3 at time (t+1) based on sender 1 at time (t).

Given that, I don’t see how having 1 model for all senders’ data will be able to distinguish 1 sender from another.

If you do option 1 (sorting the data by sender and sending it all into 1 model) the model could learn the 1st sender well enough to raise an anomaly when the data source changes to the 2nd sender, but then the same model will have to learn sender 2’s behavior also, in order to later distinguish the 3rd sender from the prior 2. The more total senders you have, the more that 1 model will have to bear, so I don’t think it would scale well.

Best of luck and hope this helps!


Hey @sheiser1, thanks for the insights, to answer your questions.

I want to classify whether the whole log file is anomalous or not. Log files are the data that the car received from another car, so each log file corresponds to a sender ID.

The problem with training multiple models is that, when doing the actual prediction, a log file can contain just 10 entries from another car, and I need to pass those different entries to different models, matching them by sender id.
Also, there should be around 500 cars/sender ids in the whole dataset.

There should be an easier way to do that.

Also, it makes sense to use the sender ids as a category and encode them via an SDR encoder, right?
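
For reference, this is roughly what a category encoding of sender ids looks like. The class below is a hand-rolled sketch, not the API of any HTM library: `SimpleCategoryEncoder` and its parameter `w` are hypothetical names, and each id simply gets its own block of `w` active bits so that two different ids share no bits.

```python
class SimpleCategoryEncoder:
    """Hand-rolled sketch: one non-overlapping block of `w` bits per category."""

    def __init__(self, categories, w=5):
        self.w = w
        unique = list(dict.fromkeys(categories))  # drop duplicates, keep order
        self.index = {c: i for i, c in enumerate(unique)}
        self.size = w * len(unique)

    def encode(self, category):
        """Return a list of 0/1 bits with exactly `w` ones for a known category."""
        start = self.index[category] * self.w
        bits = [0] * self.size
        bits[start:start + self.w] = [1] * self.w
        return bits

enc = SimpleCategoryEncoder(["car_1", "car_2", "car_3"], w=5)
a, b = enc.encode("car_1"), enc.encode("car_2")
overlap = sum(x & y for x, y in zip(a, b))
print(sum(a), overlap)  # 5 0
```

With around 500 sender ids and `w=5`, this layout would be 2500 bits wide, and an id that was not in the training list cannot be encoded at all (here it raises `KeyError`).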


Anomalous relative to all senders seen thus far, you mean? If you don’t need to identify a particular sender, you could train 1 single model with all the data, then each log file would be anomalous or not relative to all senders.
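
One way to turn per-entry anomaly scores from a single model into a verdict for a whole log file is sketched below. The averaging rule and the `threshold` value are assumptions for illustration, not something established in this thread, and the scores are invented.

```python
from statistics import mean

def log_file_is_anomalous(scores, threshold=0.5):
    """Flag a log file when its mean per-entry anomaly score exceeds `threshold`.

    scores: anomaly scores in [0, 1], one per log entry, from one shared model.
    threshold: hypothetical cut-off that would need tuning on labelled data.
    """
    if not scores:
        raise ValueError("log file has no scored entries")
    return mean(scores) > threshold

print(log_file_is_anomalous([0.1, 0.0, 0.2]))  # False
print(log_file_is_anomalous([0.9, 0.8, 1.0]))  # True
```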

I’m just skeptical that one model trained on all senders’ data could distinguish between the senders. I get that training separate models would be a logistical mess, I’m just not sure how else to distinguish between many individual senders – if that is a goal. Tho I’m certainly open to other ideas and curious to see any findings from any approach!