Here is a very typical data problem. You have a complex system, and you have access to events occurring within it over time. You can monitor for certain events and get streams of information with associations to “things” that actually exist (conceptually or physically). Here is an example from a recent NuPIC question.
| senddatehour | channelid | countryid | volume |
| --- | --- | --- | --- |
| 14.5.2018 21:00 | 42344 | 100 | 2380.0 |
| 14.5.2018 22:00 | 42344 | 100 | 1372.0 |
| 14.5.2018 23:00 | 42344 | 100 | 761.0 |
| 15.5.2018 0:00 | 42344 | 100 | 410.0 |
| 15.5.2018 1:00 | 42344 | 100 | 229.0 |
| 15.5.2018 2:00 | 42344 | 100 | 204.0 |
| 15.5.2018 3:00 | 42344 | 100 | 285.0 |
Most folks looking to analyze this data with an HTM like NuPIC will try to encode each column in the table above as a different field with its own encoder. This makes sense in a way, but it won't work. Why not? Because this data is not one continuous stream; it is a confluence of hundreds of smaller streams.
This data has around 200 different `countryid`s, each with 4 `channelid`s. Each combination of country and channel is a stream. All of those streams get dumped into the data shown above, which is why I called it a confluence.
Break Apart The Streams
In order to process these streams, you need to create an HTM model for each one. You can't use one model to process the entire confluence. So you first have to write code to split the confluence up into its component streams, however you define them in your data set.
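Here is a minimal sketch of that split using pandas, assuming the confluence lives in a CSV with the column names from the table above (`volumes.csv` is a placeholder filename):

```python
import pandas as pd

# Load the confluence. "volumes.csv" is a placeholder; the column
# names match the table above. dayfirst=True handles "14.5.2018".
df = pd.read_csv("volumes.csv", parse_dates=["senddatehour"], dayfirst=True)

# Each (countryid, channelid) combination is one stream. groupby
# splits the confluence back into its component streams, and sorting
# by time keeps each stream in the order an HTM model expects.
streams = {
    key: group.sort_values("senddatehour")
    for key, group in df.groupby(["countryid", "channelid"])
}

print("%d streams" % len(streams))  # ~200 countries x 4 channels each
```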
Run a few models first
It probably looks daunting to run hundreds of HTM models at once, so I suggest you don't, at least not until you've run one or two and validated that HTM is actually giving you valuable anomaly detection results.
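Here is a rough sketch of validating a single stream with NuPIC's OPF `ModelFactory`. `MODEL_PARAMS` stands in for a swarmed or hand-tuned parameter dict (not shown), `model_params` is a hypothetical module holding it, and the exact import path varies slightly across NuPIC releases:

```python
from nupic.frameworks.opf.model_factory import ModelFactory
# (older NuPIC releases: nupic.frameworks.opf.modelfactory)

from model_params import MODEL_PARAMS  # hypothetical module holding a
                                       # swarmed/hand-tuned parameter dict

# Validate on one stream before scaling out to hundreds of models.
key = (100, 42344)  # (countryid, channelid) from the sample rows above
stream = streams[key]

model = ModelFactory.create(MODEL_PARAMS)
model.enableInference({"predictedField": "volume"})

for row in stream.itertuples():
    result = model.run({
        "timestamp": row.senddatehour.to_pydatetime(),
        "volume": float(row.volume),
    })
    print("%s  anomaly=%.3f" % (row.senddatehour,
                                result.inferences["anomalyScore"]))
```

If the anomaly scores on one or two streams look useful, the same loop can simply be repeated over the rest of the `streams` dict, one model per stream.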