Multiple features and binary output


#1

I am currently analysing a dataset of about 15 000 samples. Each sample is made of 50 floating point values which represent the moments (timestamps) when a certain event happens. We don’t care about the event itself, as it’s always the same, so what really matters here are the timestamps.

The 50 values typically range from 0.001 seconds to 45 seconds and they are always in increasing order (it’s time passing). For each sample I also have the corresponding state of the system: it’s either 0 or 1. This is what needs to be predicted.

I have already read the various messages about having multiple values as inputs to HTM (here the 50 timestamps), so I’m pretty comfortable with how to do that. I have 2 additional questions though:

  • Is the sparse representation going to gracefully handle those timestamp values, which span roughly 4 orders of magnitude, or do I need to apply some kind of pre-processing (like putting them on a log scale)?

  • Second, how shall I use HTM to predict the 0 or 1 state? It’s literally a binary classification problem and I was wondering how to handle this with HTM. It turns out that state 1 samples are overall 4 times less frequent than state 0 samples, so could those state 1 samples be considered “anomalies”? Any other approach you can think of would be appreciated.

Thanks a lot for your help.


#2

IMO.

  • Remember that HTM is a memory-based system. It doesn’t care about the scale, only about which bits are on. As long as different inputs are represented differently, you’re in the clear. But 0.001~45s is a pretty big range, so you might want to use a log scale (if that fits your data). Or try the new GridCell. Grid cells can encode a vast range of values in great detail without overlapping nearby values too much (which is what the ScalarEncoder does).

  • You can always use a neural network. I’m guessing that you want to predict the 0 or 1 state based on the state the TM predicted. You can feed a NN the predicted SDR and out comes a prediction. If that is not an option (too slow to execute, overfitting, etc.), send the predicted SDR into a SpatialPooler and use a Nearest Neighbor Classifier to classify the result. It has been shown to be effective on unbalanced datasets.
    Or better, you can abuse TemporalMemory as a supervised learning algorithm. TM learns the relation between t(n) and t(n+1). So by sending TM x at t0 and y at t1, and by resetting every time after t1, you turn TM into a supervised learning algorithm! (I have experience doing that. It’s not doable using the Network API; you need direct access to the layer to do so, but it’s quite amazing when you see it working.)
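
As a rough illustration of the nearest-neighbor idea above, here is a minimal sketch in plain Python. It is not NuPIC's own classifier: it assumes each SDR is available as a set of active bit indices and uses overlap (the number of shared active bits) as the similarity measure. The helper names and the toy SDRs are made up for the example.

```python
from collections import Counter

def overlap(a, b):
    """Number of active bits shared by two SDRs (given as sets of indices)."""
    return len(a & b)

def knn_classify(query, stored, k=3):
    """Majority label among the k stored SDRs that overlap the query most.

    `stored` is a list of (sdr, label) pairs. Hypothetical helper, not a NuPIC API.
    """
    ranked = sorted(stored, key=lambda item: overlap(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training set: two SDRs per state.
stored = [
    ({1, 2, 3, 4}, 0),
    ({2, 3, 4, 5}, 0),
    ({10, 11, 12, 13}, 1),
    ({11, 12, 13, 14}, 1),
]
prediction = knn_classify({11, 12, 13, 20}, stored, k=3)
```

With a 4:1 class imbalance, the choice of k and whether to weight votes by overlap would probably need some experimentation.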


#3

Hi Marty,

Thanks a lot for your answer. When you mention the “new GridCell”, is this a new encoder already available in NuPIC? Unless I missed something, I could not find anything related to GridCell in the NuPIC GitHub repo.

I am not sure I understand how to apply the so-called “TM abuse” approach to my case. What would x and y represent in this case?

Thank you again for your perspective on this.


#4

If I understand the description correctly, you have just two values that you can work with – the “state” (binary zero or one) and the timestamps at which the “one” state occurred. I assume there is no other metadata (besides the timestamps themselves) that could be used to infer when the “one” states occur.

If we assume there is a temporal pattern in the data that can be learned, that pattern must be in the spacing between the “one” states. My initial thought is that you could encode the deltas between “one” state timestamps and use those as your inputs. If patterns can be learned, then the system should eventually start predicting when the next “one” state will occur. This also sounds like what @marty1885 is suggesting in his first bullet point: the specific data and its required sensitivity (i.e. the amount of overlap between adjacent encodings) is going to constrain how you encode the timestamp deltas.

EDIT:
A question on the data – do the “one” states tend to have a duration, or do they occur within a single timestep/ sample? If you need to predict not only the start of the “one” state but also its duration, that would require a somewhat different encoding. One way might be to encode the timestamp deltas between when the state toggles between “zero” and “one”. Essentially:

(time since state changed to “one”) -> (time since state changed to “zero”) -> (time since state changed to “one”)…

Of course if the duration of the “one” state is always the same, then it would be more efficient to simply input the deltas between the starts of the “one” state and leave out the duration.


#5

Hi Paul,

Thank you for your reply. I think you are not getting the dataset right, so to be clearer, here is a concrete example of my data. Below are 2 samples (out of the 15 000 I have available). Each one comes with 50 values (timestamps). From those 50 values one must predict the state (0 or 1). Is this clearer? (Another way of looking at it is to say that it’s like a time series, but the timestamps themselves are the values of the time series.)

Knowing this, does your recommendation still hold ?

Thanks

STATE, timestamp_0,  timestamp_1,  timestamp_2,  ..., timestamp_47, timestamp_48, timestamp_49
0,     0.1662620174, 0.170520294,  0.176371474,  ..., 16.99610066,  17.01998751,  17.05640285
1,     0.2006139242, 0.2571713918, 0.2881374048, ..., 16.88379463,  17.05233967,  19.28992181

#6

I’m still a little confused. One set is a sequence of timestamps, each one having a 0/1 state. Taken together they represent an event occurring over time and the state of the system at each point in time.

Some events are very short and some are orders of magnitude longer (you said 0.001s to 45s?). Are the very short events structurally different from the longer events? Do they always contain the same number of samples?


#7

I probably need to understand the problem space better. A few initial questions:

Is the system ever in both state “0” and state “1” simultaneously?
Is the goal to predict what state the system will be at any arbitrary given point in time in the future?
Is the goal to predict when the next state “1” will occur?
Where do the timestamps come from? Are these random samplings that are just not evenly spaced? Or are these driven by events? If event driven, what are the events?


#8

No, I don’t think Numenta has put GridCells into NuPIC yet. But they are trivial to implement. Please refer to this video.

Ohh. Let me explain it in more detail. By x and y I mean the input and output of the TM (as x and y are commonly used in ML terms).
What a TM does is predict a possible SDR based on the current state. So let’s assume the following pseudocode.

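# x and y are SDRs of the same shape: x is the input, y the desired output
for x, y in zip(input, desired_output):
    tm.compute(x, True)  # t0: present the input, with learning on
    tm.compute(y, True)  # t1: present the desired output, with learning on
    tm.reset()           # clear the context before the next pair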

TM has no way to know that the values we send to it aren’t temporally coherent. So it learns how to map the input into desired_output. To avoid TM learning useless relations between different pairs of our inputs, we reset the TM every time. Sure, the two SDRs have to have the same shape. But since TM doesn’t grow connections to cells that haven’t fired, that’s not a problem. Just fill the empty spaces with 0.

And to generate predictions.

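def genPrediction(tm, x):
    tm.compute(x, False)            # run the input with learning off
    pred = tm.getPredictiveCells()  # cells predicted for the next step
    tm.reset()                      # clear the context before the next query
    return pred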
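
One way to read “fill the empty spaces with 0” is to give the input and the label their own disjoint blocks of columns inside one shared SDR space. The sketch below is plain Python under that assumption; the block sizes, the helper names, and the cell-to-column mapping by integer division are all illustrative choices, not something confirmed in this thread.

```python
N_INPUT = 1024    # columns reserved for the encoded input (assumed size)
LABEL_WIDTH = 20  # columns reserved per label (assumed size)
N_COLUMNS = N_INPUT + 2 * LABEL_WIDTH  # total width shared by x and y

def label_columns(label):
    """Active column indices for label 0 or 1, placed after the input block."""
    start = N_INPUT + label * LABEL_WIDTH
    return set(range(start, start + LABEL_WIDTH))

def cells_to_columns(cells, cells_per_column):
    """Map cell indices to column indices, assuming cells are numbered column by column."""
    return {cell // cells_per_column for cell in cells}

def decode_label(predicted_columns):
    """Pick the label whose block overlaps the predicted columns most (ties go to 0)."""
    predicted = set(predicted_columns)
    scores = [len(label_columns(label) & predicted) for label in (0, 1)]
    return scores.index(max(scores))
```

Under these assumptions, y for a sample would be `label_columns(state)`, and the predicted cells would be passed through `cells_to_columns` and then `decode_label` to read off the state.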

#9

I have updated my previous post with better-formatted samples. I think this will help. To answer your questions:

@rhyolight, @Paul_Lamb

  • The event happening at each timestamp is the same regardless of where you are in the 50 timestamps.
  • You can think of it this way: imagine that a short beep is playing at each timestamp. When hearing this sequence of 50 beeps over time you must guess if the system emitting this particular sequence of beeps is in state 1 or 0.
  • The duration of the series of 50 beeps can vary. In the 2 examples above they both finish after about 17 seconds, but in some cases it can take up to 43 seconds to emit the 50 beeps.
  • Each sequence of 50 timestamps can only be associated with a single state. It’s either 0 or 1.

Is that clearer?


#10

Ok, I think I understand now. Essentially a full set of 50 timestamps goes into the definition of “State 0” or “State 1”. The timestamps are sorted, and although there are always 50 of them, they can range anywhere between 0.001 and 45.0. One could think of the state (0 or 1) as an object, and the timestamps as that object’s features.

In this case, I would alter the strategy a bit. If we assume there is a temporal pattern to the timestamps that can be learned, then you could encode the deltas between timestamps as your input. Once the TM layer learned the pattern, the activity in that layer would then represent a specific feature (timestamp) in a specific object (state). Using a strategy like Numenta described in the Columns Paper (adding an output layer), activity in the TM layer would activate the proper representation for State 0 or State 1 in the output layer.

I realize that is a tad theoretical. You’d probably have to be a little creative to set up the system in NuPIC.
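
The delta step, at least, is easy to make concrete. The sketch below is plain Python and assumes each row has the form shown earlier in the thread (state first, then the sorted timestamps); the function names and the toy row are made up for illustration.

```python
def parse_row(row):
    """Split a 'state, t0, t1, ...' line into (state, list of timestamps)."""
    fields = [field.strip() for field in row.split(",")]
    return int(fields[0]), [float(field) for field in fields[1:]]

def to_deltas(timestamps):
    """Gaps between consecutive timestamps; n timestamps give n - 1 deltas."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]

# Toy row with 4 timestamps instead of 50.
state, timestamps = parse_row("1, 0.5, 0.75, 1.25, 3.25")
deltas = to_deltas(timestamps)
```

A real sample with 50 timestamps would yield 49 deltas, which would then be fed to the encoder one at a time.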


#11

Hi @laurent,

What if you had 2 NuPIC models, one trained on all the samples of state 0 and the other on all the samples of state 1? Then when a new sample sequence comes in, you could run it through both, get total anomaly scores, and classify it as whichever state’s model has the lower anomaly score.

If there are many more sequences of state 0, then that model will have more training data and should usually be harder to surprise than the other model. This could imply that if a given sequence is getting a lower total anomaly score from the state 1-model then it could be distinguishing itself as state-1.

Of course this approach, like any other, is sensitive to the encoding scheme, so I’d definitely advise doing some exploration there first to make sure the values at the different scales have appropriate amounts of overlap.
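
The decision rule for the two-model idea could look roughly like the sketch below. It is plain Python: the two scoring functions are stand-ins for trained models that return one anomaly score in [0, 1] per step, and everything named here is hypothetical.

```python
def classify_by_anomaly(sequence, score_state0, score_state1):
    """Return the state whose model is less surprised by the sequence.

    Each score_* argument is a callable that takes the sequence and returns
    a list of per-step anomaly scores. Ties go to state 0.
    """
    total0 = sum(score_state0(sequence))
    total1 = sum(score_state1(sequence))
    return 0 if total0 <= total1 else 1

# Stand-in "models": surprise grows with the distance from a typical value.
def fake_model_state0(sequence):
    return [min(1.0, abs(value - 1.0)) for value in sequence]

def fake_model_state1(sequence):
    return [min(1.0, abs(value - 5.0)) for value in sequence]

label_a = classify_by_anomaly([1.0, 1.5, 0.5], fake_model_state0, fake_model_state1)
label_b = classify_by_anomaly([5.0, 4.5], fake_model_state0, fake_model_state1)
```

Because the state 0 model sees far more training data, comparing raw totals may be biased toward state 0; normalizing each model's scores could be worth trying.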


#12

Thank you all for your suggestions.
@rhyolight: with the above clarification on my dataset, any suggestion on your side ?


#13

The easiest way to try HTM on this would be to train only on state “0” data. How many sequences of this data do you have? If several thousand, I would encode the times as deltas. Just send in one delta after the next until the sequence is through, then reset the TM at the end of every sequence.

Once the model has seen as much training data as you can give it, disable learning. Now give it some state “1” sequences and see how anomaly scores compare vs state “0”.
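
To compare the two groups, one simple option is to reduce each sequence to its mean anomaly score. The sketch below is plain Python with made-up scores; the midpoint threshold is a naive assumption for illustration, not a recommendation from this thread.

```python
from statistics import mean

def sequence_score(step_scores):
    """Summarize one sequence by the mean of its per-step anomaly scores."""
    return mean(step_scores)

# Made-up per-step anomaly scores for held-out sequences of each state.
state0_scores = [[0.0, 0.25, 0.0], [0.25, 0.0, 0.5]]
state1_scores = [[0.5, 1.0, 0.75], [1.0, 0.5, 0.5]]

mean0 = mean(sequence_score(scores) for scores in state0_scores)
mean1 = mean(sequence_score(scores) for scores in state1_scores)

# Naive threshold halfway between the two group means (assumption).
threshold = (mean0 + mean1) / 2

def looks_like_state1(step_scores):
    """Flag a sequence whose mean anomaly score is above the threshold."""
    return sequence_score(step_scores) > threshold
```

If the two distributions overlap heavily, that would suggest revisiting the encoding before tuning the threshold.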


#14

Thanks for the advice. As I have about 13 000 samples of state 0, I think this is what I’m going to try first.

Also as the deltas span roughly 4 orders of magnitude (0.001 to 10) I was wondering what would be the most appropriate encoder to use or if applying a log transformation before the training would be in order.

Kr


#15

I think you will need to experiment, but a logarithmic encoding of the deltas is certainly something I would try.
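
A log transform of the deltas might look like the sketch below, which maps each delta onto [0, 1] on a log10 scale before it goes to a scalar-style encoder. The range bounds come from the figures quoted in this thread (0.001 to 10); the function name and the clipping behavior are assumptions.

```python
import math

MIN_DELTA = 0.001  # smallest expected delta in seconds (figure from the thread)
MAX_DELTA = 10.0   # largest expected delta in seconds (figure from the thread)

def log_scale(delta):
    """Map a delta to [0, 1] on a log10 scale, clipping to the expected range.

    Clipping also guards against log10(0) when two timestamps coincide.
    """
    clipped = min(max(delta, MIN_DELTA), MAX_DELTA)
    span = math.log10(MAX_DELTA) - math.log10(MIN_DELTA)
    return (math.log10(clipped) - math.log10(MIN_DELTA)) / span

scaled = [log_scale(delta) for delta in [0.001, 0.1, 10.0]]
```

On this scale each order of magnitude takes up the same share of the encoder's range, so millisecond-level gaps are no longer squeezed into a single bucket.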


#16

Hi Matt,

I’m currently trying the strategy you suggested. What I’m observing, though, is that the processing time to learn a new sequence seems to grow as training goes on (e.g. an additional 20 sequences take 30 seconds to learn after 200 sequences, 45 seconds after 400 sequences, 55 seconds after 600, …). Is that expected behavior? (Note: there is no memory swapping happening.)

Thanks.


#17

How do you define “learn a new sequence”?

This is typical of the TM when there are no resets. If you want to decrease model size, there is some good advice in “HTM anomaly detection model too large for streaming analysis”.


#18

A sequence is a set of 50 values the model is learning. Then a reset happens (the sequence ID changes in the CSV file fed into the OPF model, declared as an ‘S’ column), then comes another sequence of 50 values, and so on.

So what you are saying here is that the reset may not be happening?


#19

No, it should be happening, but I don’t know of any tests that look at how the model size grows with resets vs no resets. I thought that models that don’t reset would perform worse over time, but my intuition could be wrong. @Paul_Lamb do you have any thoughts on how well models perform (reset vs no reset)?


#20

If you do not use resets, the learned sequence grows longer the more often it is repeated. For example, if sequence A-B-C-D is repeated a number of times, the sequence of representations becomes:

With reset: A -> B’ -> C’ -> D’
Without a reset: A -> B’ -> C’ -> D’ -> A’ -> B’’ -> C’’ -> D’’ -> A’’ -> B’’’ -> C’’’ -> D’’’ …

In the second case (no reset), each time it reaches the end of the known sequence, the minicolumns for the next input burst, and a new input is added to the end of the sequence. Because of the bursting, the next input activates a union of that input’s representations from all iterations of the repeating sequence. For example, if A burst at the end of my example above, the next timesteps would have the following activity:

A -> B’+B’’+B’’’ -> C’+C’’+C’’’ -> D’+D’’+D’’’ -> A’+A’’+A’’’ -> B’’+B’’’ -> C’’+C’’’ -> D’’+D’’’ …

The unions become sparser on each cycle through the repeating sequence. Since predictions are generated based on active cells, having all of these unions of activity that become denser and denser over time could definitely impact performance.