Anomaly detection and feature selection

Hello everyone,
I have a question that I’m having trouble formulating clearly to my colleagues, so I thought I would try it here.
I’m using a single layer of HTM to perform anomaly detection. My problem is that I have 44+ features and encoding all of them into a single SP leads to bad results (as expected and mentioned here).
So my idea was to try some Feature selection/extraction/Weighting techniques to reduce the amount of features to a reasonable amount.
But the thing is that all these techniques won’t be able to provide any “formal” result but more like a statistical optimum.
Thus, I was wondering whether there is a way to know how the proximal connections map onto the input SDR, meaning: is there a way to know which features a particular mini-column is connected to?
The idea would be to find which feature caused, or was mostly responsible for, the anomaly in the first place.

I don’t want to deduce which values were responsible for the anomalies but only which features.

Does that make sense?

Thanks


You can track a minicolumn’s proximal input to a subset of the input space, but that subset spans the entire input space, so each minicolumn is encoding aspects of many features at once. There is no way to unpack the features once they’ve been distributed through the SP that I know of.
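To make that concrete, here is a rough, untested sketch of how you could inspect one minicolumn’s proximal synapses and bucket them by encoder field. It assumes the classic nupic APIs (SpatialPooler.getConnectedSynapses plus a MultiEncoder whose getDescription() returns (fieldName, bitOffset) pairs), so check the names against your version:

```python
# Rough sketch (untested): see which encoder fields one minicolumn's proximal
# receptive field touches. API names (getConnectedSynapses, getDescription,
# getWidth, getNumInputs) are assumptions based on classic nupic.
import numpy as np

def field_bit_ranges(encoder):
    """Map each encoder field name to its (start, end) bit range."""
    desc = encoder.getDescription()          # [(fieldName, bitOffset), ...]
    width = encoder.getWidth()
    ranges = {}
    for i, (name, offset) in enumerate(desc):
        end = desc[i + 1][1] if i + 1 < len(desc) else width
        ranges[name] = (offset, end)
    return ranges

def column_field_overlap(sp, column, ranges):
    """Count connected proximal synapses per encoder field for one minicolumn."""
    connected = np.zeros(sp.getNumInputs(), dtype=np.uint32)
    sp.getConnectedSynapses(column, connected)
    return {name: int(connected[start:end].sum())
            for name, (start, end) in ranges.items()}
```

If you run that over a few columns, you will typically see non-zero counts in most fields, which is the “spans the entire input space” part.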

Another suggestion, if you have the compute power, is to create 44+ models, one for each feature. This could tell you which features are the most valuable to include in a final model based on how well they turn up the anomalies you are looking for.
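Very loosely, the ranking could look something like this sketch, where run_anomaly_model() is a hypothetical placeholder for whatever builds and runs one of your single-feature models and returns an anomaly score per timestep:

```python
# Sketch: rank features by how well a single-feature model surfaces the
# anomalies you care about. run_anomaly_model() is hypothetical; swap in your
# own model runner and scoring (recall/precision/F1, etc.).
import numpy as np

def feature_recall(scores, labels, threshold=0.9):
    """Fraction of labeled anomaly timesteps flagged by one model."""
    flagged = np.asarray(scores) >= threshold
    labels = np.asarray(labels, dtype=bool)
    return float(flagged[labels].mean()) if labels.any() else 0.0

def rank_features(data, labels):
    """data: dict of feature name -> 1-D series; labels: ground-truth flags."""
    scored = {name: feature_recall(run_anomaly_model(series), labels)
              for name, series in data.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```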


Hey Matt,
I already tried to create a single model for each feature but it did not prove useful as there was no linear relationship between the results.
That is to say, if I had the following features: A, B, C, D

I actually obtained better results using Model({A,B,C,D}) than by combining the results of the individual models (Model(A) + Model(B) …). I also tried combining the features that led to the best results when modeled alone, without any success. HTM seems to pick up correlations between specific features, which is why I wanted to extract these relationships from the SP.

In my mind, since I used a different encoder for each feature in the model, I thought it was possible to deduce which bits inside the SDR encode each feature, as in this table representing the SP, where the colors represent the bits used to encode each feature. A mini-column would then connect to bits from several features, and I would be able to determine the proportion of each feature that was relevant to any anomaly detection.

So if I understand correctly, are you saying that the encoding process prevents us from tracking this knowledge, or that the current nupic implementation does not provide tools to get at the information?

How are you evaluating results? It makes sense that a combined model does better because HTM does find the correlations as you said.

Without topology, the SP is going to give every minicolumn a chance to connect anywhere in the global input space. In this case it is going to be impossible to decode from the SP. From this point on, the semantics are internal to the system. They mean something to the HTM system itself, but not to us. (For some philosophy on this, read Agency, Identity and Knowledge Transfer.)

As Mark just mentioned in HTM and Reversibility, if you have topology enabled, then the SP breaks up minicolumn RFs into local chunks. In this case, if you have separated your input features to match the SP topology, you might be able to do some decoding. But AFAIK we have not tested this.


Thanks again for your reply Matt, my evaluation is simply based on recall, precision and F1. However, I have different “types” of anomalies, so I’m specifically trying to get a model that will be able to detect as many different “types” as possible.

I understand that at initialization of the SP the connections are randomly created for each mini-column. However, the algorithm has to keep track of these connections in order to act on them (permanence increments and decrements), right?
Thus, theoretically this information is stored somewhere. The example I’m reminded of is the one you showed in HTM School, where we could clearly see where the bits “reserved” for encoding each feature (dates, etc.) were located.

Moreover, the same goes for the encoding process, which to me is rather like a non-cryptographic hash function that should be reversible (at least to some extent), given that encoders should not satisfy the strict avalanche criterion.

However, I don’t want to fully reverse the encoding process as mentioned in Mark’s post (plus, it is unclear to me how I could separate the input features to match the SP topology).

I just need to locate the bits that are reserved for each feature, and from there, once I get a high anomaly score, check under the hood the state of the SP and the active links to see which features were most represented during the anomaly, or at least which ones were not part of the anomaly at all.
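Roughly, what I have in mind is something like this untested sketch, reusing the field_bit_ranges() idea from above (the SpatialPooler.getConnectedSynapses call is an assumption about the nupic API):

```python
# Sketch: rough per-feature "share" of the active columns at an anomalous
# timestep, computed from the connected proximal synapses. This only gives
# proportions of proximal input per encoder field, not a real attribution.
import numpy as np

def feature_proportions(sp, active_columns, ranges):
    counts = dict.fromkeys(ranges, 0)
    connected = np.zeros(sp.getNumInputs(), dtype=np.uint32)
    for col in active_columns:
        sp.getConnectedSynapses(col, connected)
        for name, (start, end) in ranges.items():
            counts[name] += int(connected[start:end].sum())
    total = float(sum(counts.values())) or 1.0
    return {name: c / total for name, c in counts.items()}
```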

Thanks again for reading!

That is a very hard problem, because the SP mashes all the features together. Each minicolumn represents multiple features, and from an isolated activation, you can’t tell which features are being represented. It is true you could hard-code a history somewhere so you can retrieve it, but that is going to be super compute-heavy and non-biological. This gets at the same issues underlying the Exploring the “Repeating Inputs” problem thread.

When a minicolumn activates, you have to trace the predictive cells from the previous step to see the context in which it activated. This context could be one or many contexts combined (e.g. sometimes E follows B, sometimes it follows C). These contexts are also within minicolumns, and they activated because they were successfully predictive in the previous time step. So now you have a tree of contexts that expands exponentially as you go backwards in time. It is the same problem whether the context is time or something else.
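Just to make one step of that trace concrete, here is an untested sketch assuming nupic’s TemporalMemory API (getActiveCells, getPredictiveCells, columnForCell); every level further back repeats this per context and fans out:

```python
# Sketch: one backwards step of the context trace. prev_predictive is the set
# of cells returned by tm.getPredictiveCells() that you saved at the previous
# timestep; each predicted cell stands for one learned context.
def predicted_contexts(tm, active_columns, prev_predictive):
    """For each active column, which of its cells were predicted last step."""
    hits = {}
    for cell in tm.getActiveCells():
        col = tm.columnForCell(cell)
        if col in active_columns and cell in prev_predictive:
            hits.setdefault(col, []).append(cell)
    return hits
```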

Thanks again. I don’t know why, but I totally brain-lagged about the sequence memory and the fact that, of course, the HTM creates its own representation of what I’m feeding it. Thus, I understand all your points, except why that would be a problem, since it is all done in a deterministic way.

My idea would be to perform a “brain scan” of the SP, creating some kind of heat map of the SP over a single run to see which connections were active before an anomaly occurs, and then deduce the features from that representation. However, this still requires knowing where the features end up in the SP.
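As a first cut, the heat-map part could be as simple as this sketch (the window length and anomaly threshold are just guesses to tune); mapping the hot columns back to features is the part that still needs the bit locations:

```python
# Sketch: count how often each minicolumn was active in the window leading up
# to a flagged anomaly ("brain scan" over a single run).
from collections import deque
import numpy as np

def anomaly_heatmap(steps, num_columns, window=10, threshold=0.9):
    """steps: iterable of (active_columns, anomaly_score) per timestep."""
    heat = np.zeros(num_columns, dtype=np.int64)
    recent = deque(maxlen=window)
    for active_columns, score in steps:
        recent.append(list(active_columns))
        if score >= threshold:
            for cols in recent:
                heat[cols] += 1
    return heat
```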

But then, do you have any suggestions on feature selection/extraction/weighting techniques that work well for HTM models?

Thanks again.


Consider this: keep a buffer window of the source data (a parallel stream), and if the HTM farm kicks out an anomalous value, delta the stream against the delayed version to see what changed.
A slightly more elaborate version is to feed the two streams into a classic RNN with the anomaly detection as a training signal. Depending on how you set it up, you could learn directly what is considered an anomaly.
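A minimal sketch of the first idea, with the delay length and per-field tolerance as knobs to tune:

```python
# Sketch: keep the raw rows in a short parallel buffer; when the model flags an
# anomaly, diff the current row against the delayed one to see which fields moved.
from collections import deque

def make_delta_checker(delay=5, tolerance=0.0):
    buffer = deque(maxlen=delay)

    def check(row, is_anomalous):
        """row: dict of feature -> numeric value; returns the fields that changed."""
        changed = {}
        if is_anomalous and len(buffer) == buffer.maxlen:
            old = buffer[0]
            changed = {k: (old[k], v) for k, v in row.items()
                       if abs(v - old[k]) > tolerance}
        buffer.append(row)
        return changed

    return check
```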


This is an interesting idea.

I’ll give it a try, although I’m unsure how I could infer from the delta which specific features were responsible, as it could very well be the combination of one feature changing and another staying the same.
The second idea (if I understand correctly) won’t fit into my work, though.

I’ll keep you posted on the first one; in the meantime I am open to any other ideas!

Thank you guys.


What do you mean exactly by ‘combining’ here?

I use a multivariate anomaly detection strategy of having n parallel models (44 in your case) and flagging time steps when x% of them breach 0.99 anomaly likelihood. Or, if you know when the “true” anomalies are, you could look at the recent likelihood values of each metric leading up to those specific times – maybe make a histogram of the recent likelihood values for each, and see which metrics’ values fall at the higher end.
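The voting part is just a threshold over the likelihood matrix, something like this sketch (min_fraction is the x% knob; 0.99 is the level mentioned above):

```python
# Sketch: flag a timestep when at least min_fraction of the parallel
# single-feature models breach the 0.99 anomaly-likelihood level.
import numpy as np

def flag_timesteps(likelihoods, min_fraction=0.25, level=0.99):
    likelihoods = np.asarray(likelihoods)          # shape (T, n_metrics)
    breaches = likelihoods >= level
    return breaches.mean(axis=1) >= min_fraction   # one boolean per timestep
```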

Hello Sam,
In this case I meant a parallel execution of all the models (44) vs. an execution of a single model with the 44 features. I have access to ground truth for my evaluation and have already tried something similar to what you’re suggesting, although not the comparison of recent likelihood values. I noticed that the only times I was successful in detecting all anomalies were when combining several features. I’ll give it another try comparing recent values and see if anything good comes up.
Cheers.


Alright cool! Curious to hear what comes of it.

Interesting. I think the approach of comparing the metrics’ anomaly likelihoods leading up to failure times could inform which features are most telling. Then you could combine these choice features into a single model – or several multi-feature models. I think it’s still useful to run the parallel single-feature models as well, because if, say, a 10-feature model shows an anomaly, there’s no way to interpret the individual roles of the features within it.
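For the “which metrics are most telling” part, a simple score could be each metric’s mean likelihood in a window before the known failure times (the window length here is just a guess):

```python
# Sketch: average each metric's anomaly likelihood in the window leading up to
# the labeled failures; higher scores suggest more telling features.
import numpy as np

def pre_failure_scores(likelihoods, failure_steps, window=20):
    """likelihoods: (T, n_metrics) array; failure_steps: true anomaly indices."""
    likelihoods = np.asarray(likelihoods)
    per_failure = []
    for t in failure_steps:
        start = max(0, t - window)
        per_failure.append(likelihoods[start:t + 1].mean(axis=0))
    return np.mean(per_failure, axis=0)            # one score per metric
```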

Hello Quense - any update on how you were able to zero in on the “most significant subset” of the initial set of 44 features?


I ended up trying several feature selection methods (filter-based) instead of relying on the average prediction error of each individual feature.
I’m actually writing a paper based on these results and I’ll share it with you once it is finished.
The best we found was selecting a small number of features (3 to 7) using ExtraTrees.
We used the SP and TM parameters from this paper: https://arxiv.org/abs/1510.03336
Another solution would be to use Particle Swarm Optimization to optimize the HTM parameters along with the feature set. We decided not to, since it would have required too much computing time, only to end up with a possibly overfitted model.
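The selection step itself was nothing exotic; roughly something like this sketch with scikit-learn (n_estimators and k here are illustrative, not the values from the paper):

```python
# Sketch: rank features with ExtraTrees against the anomaly labels and keep the
# top handful (3 to 7 worked best for us).
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def select_features(X, y, feature_names, k=5, n_estimators=200):
    """X: (samples, features); y: anomaly labels; returns the k top names."""
    forest = ExtraTreesClassifier(n_estimators=n_estimators, random_state=0)
    forest.fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1]
    return [feature_names[i] for i in order[:k]]
```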


Thanks Quense - will wait to see your paper.
In the meantime, I am having a tough time installing nupic on my local machine and on GCP.

Can you point me to a smooth way of installing nupic on GCP, please?