Data aggregation

Hi! I’m new to HTM, NuPIC and all things ML.

I’m using the OPF to do anomaly detection on time series, and I’m confused about the aggregationInfo parameters in the model. Should I pre-aggregate my data using those same parameters before giving it to the model, or can/should I aggregate it differently (e.g. aggregate every second, but configure the OPF to use a 1-minute window)?

Thanks in advance

I would advise against using the aggregationInfo part of the OPF; instead, aggregate your own data the way you want it beforehand.
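Concretely, if you pre-aggregate yourself, you can leave the OPF’s aggregation effectively disabled. A minimal sketch of that part of the model params (the key names follow the hotgym example’s model_params.py, but treat them as an assumption and check against your own copy):

MODEL_PARAMS = {
    # Zeroed-out aggregationInfo: the OPF passes your pre-aggregated rows
    # through unchanged instead of re-bucketing them.
    "aggregationInfo": {
        "fields": [],
        "years": 0, "months": 0, "weeks": 0, "days": 0,
        "hours": 0, "minutes": 0, "seconds": 0,
        "milliseconds": 0, "microseconds": 0,
    },
    # ... rest of the model params unchanged ...
}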

We could talk about aggregation strategies, which are very data-dependent. What does your data represent? Is there a timestamp involved?

Hey rhyolight, do you think you could very briefly explain why the data needs to be aggregated? In a discussion I had with another Numenta engineer, the idea was that NuPIC might flag “local anomalies” as real anomalies because of noise in the data, is that right?

When you think of temporal patterns, you can think of them occurring on many levels. For example, you may have a signal with a pattern that recurs every 10ms. That’s fine if your data frequency is 1000 points a second; you will be able to plot that and see the pattern. But you won’t be able to see a pattern that occurs every 10 hours, because you are zoomed into the data too much. In order to see this larger-scale temporal pattern, you need to aggregate your data so you have maybe 100 data points per hour.
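For example, a quick sketch of that kind of aggregation with pandas (the file and column names are just placeholders for whatever your raw data has):

import pandas as pd

# Sketch: collapse high-frequency rows into coarser time buckets so that
# slower patterns become visible to the model.  Assumes the remaining
# columns are numeric.
df = pd.read_csv("raw_data.csv", parse_dates=["timestamp"])

aggregated = (
    df.set_index("timestamp")
      .resample("1min")     # pick a bucket size matching the time scale
      .mean()               # of the patterns you care about
      .fillna(0)
)

aggregated.to_csv("aggregated_1min.csv")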

Hey @rhyolight, I just took the model_params.py from https://github.com/numenta/nupic/blob/master/examples/opf/clients/hotgym/anomaly/model_params.py and started from there.

My data comes from firewall logs, so I’m working with number of connections, number of bytes, etc. And yes, the logs include timestamps.

What is your goal? Do you want to predict something, if so what? Or do you just want to know when something is awry?

I want to find anomalies in connection patterns, with the aim to detect potential security incidents.

At the moment I’m doing so on the network as a whole, but I’m trying to figure out how to properly encode IP addresses, so I can get more meaningful results.

I’m not all that familiar with networking, but is there a “connection pattern” you can match individual user patterns against? I always want to create a model for each user on the system and treat patterns as grouped by users, but that’s usually too many models. So maybe try to understand what global temporal patterns are indicative of an anomalous situation. Then figure out how to encode it.

If you are trying to encode an IP, I think you’d need to do an IP lookup to get meaningful information.

It’s not about users but about servers themselves. Say I have server A, which is a database server. It constantly receives connections from server B, a web application, and periodically connects to server C to make a backup of the database, and to server D to fetch system updates. All that happens in the internal network (private IPs), so if for some reason it starts connecting to or receiving connections from the Internet (any public IP), that’d be an anomaly. Or if it receives connections from an employee’s workstation. Or if it stops receiving connections from the web application.

Given that over 90% of the traffic I see is from internal networks (private IPs), there’s not much I could find with a lookup. But I don’t really need it anyway.

An easy way to encode IPs would be to use a category encoder, assigning a category to each IP. The problem with that is that I’d lose valuable semantic meaning: the subnet to which the IP belongs. For example, if I had connections coming from 192.168.10.33 and 192.168.10.123, I’d like to encode them in a way that makes them different but somewhat overlapping, because they’re part of the same subnet.
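Just to illustrate that, here’s a quick sketch with NuPIC’s CategoryEncoder (I’m assuming the constructor takes w, categoryList and forced; worth checking against the encoder docs): two IPs from the same subnet end up with zero overlapping bits.

import numpy as np
from nupic.encoders.category import CategoryEncoder

# Sketch only: whole IPs as opaque categories -> no shared bits at all,
# so the subnet relationship is completely lost.
ips = ["192.168.10.33", "192.168.10.123"]
encoder = CategoryEncoder(w=21, categoryList=ips, forced=True)

a = encoder.encode("192.168.10.33")
b = encoder.encode("192.168.10.123")
print("overlapping bits:", np.sum(a * b))   # expected: 0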

I also thought of encoding it as four separate category values (192, 168, 10, 123; category encoded because 192 and 193 are completely different). There would be exact matches then, instead of overlaps, but they could happen in ways that have no semantic meaning, e.g. 192.168.10.33 and 10.10.10.33 would have two matching components, yet it means nothing because they’re completely different networks.

That’s what I’d call meaningful information in the IPs: whether or not they’re equal, whether they’re part of the same subnet, and whether they’re public or private, which is the only easy thing to encode :smile:


For the encoding issue, how about doing the 4 categories as you said, but giving much more weight (higher ‘n’ values) to the first 2 categories, since they determine whether the subnets are the same. That way

192.168.10.33 and 192.168.10.123

would highly overlap, but

192.168.10.33 and 10.10.10.33

hardly would, since the categories that match carry little weight.
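Roughly, the weighting could look something like this sketch, with each octet getting its own category encoder and the first two octets getting more active bits (I’m using a bigger w as the “weight” here rather than n, and the encoder arguments are an assumption to check against your NuPIC version):

import numpy as np
from nupic.encoders.category import CategoryEncoder

# Sketch: encode each octet as its own category, concatenate the outputs,
# and give the first two octets more active bits so subnet-level matches
# dominate the overlap.
OCTETS = [str(i) for i in range(256)]
WEIGHTS = [21, 21, 5, 5]   # made-up weights: heavier for the network part

encoders = [CategoryEncoder(w=w, categoryList=OCTETS, forced=True)
            for w in WEIGHTS]

def encode_ip(ip):
    parts = ip.split(".")
    return np.concatenate([enc.encode(part)
                           for enc, part in zip(encoders, parts)])

def overlap(ip1, ip2):
    return int(np.sum(encode_ip(ip1) * encode_ip(ip2)))

print(overlap("192.168.10.33", "192.168.10.123"))  # same subnet: high overlap
print(overlap("192.168.10.33", "10.10.10.33"))     # different networks: low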

As for the anomaly detection itself, it seems the first two examples you mention are essentially spatial anomalies, since receiving connections from any public IP or an employee workstation is anomalous regardless of the timing. In these cases it seems just using basic if statements could track that.
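For those spatial cases, a plain rule check could be as simple as this sketch (Python 3’s ipaddress module, or the pip backport for Python 2; the workstation subnet is made up):

import ipaddress

# Sketch of a rule-based "spatial" check that needs no HTM at all:
# flag any connection that touches a public IP, or that comes from the
# (hypothetical) workstation subnet.
WORKSTATION_NET = ipaddress.ip_network("10.200.0.0/16")   # made-up subnet

def is_suspicious(src_ip, dst_ip):
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    if not src.is_private or not dst.is_private:
        return True        # any public endpoint is anomalous for this server
    if src in WORKSTATION_NET:
        return True        # workstations shouldn't be hitting this server
    return False

print(is_suspicious("10.1.1.61", "8.8.8.8"))     # True: public destination
print(is_suspicious("10.1.1.61", "10.1.2.23"))   # False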

The next example seems to be a better fit for the TM, which would detect when Server A stops receiving connections from the web app or when the rhythm of input from the app changes. If this rhythm is generally volatile and chaotic, that could produce some false positive anomalies as well. Maybe these sequences are not so periodic, but it’s not what you’d consider an anomaly as long as Server A still receives them regularly enough overall. You could also have false positive anomalies depending on the regularity and periodicity of the connections to servers B and C. If these connections happen every once in a while, somewhat at random, they could look anomalous to the TM even if they’re not anomalous for your purposes. It all depends on how much sequential structure there is in the data.

I know nothing about your field but would certainly be very curious what you find. The TM could at least tell you how periodic and predictable the incoming sequences are. From what you describe with servers A, B, C and D it seems you could have 3 fields, encoding what A is getting from B, C and D. If you wanted to post a snippet of data that would of course help us think about it more concretely.
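If you went the 3-field route, the aggregated rows you feed the model might look something like this (purely hypothetical field names and counts):

timestamp,conns_from_B,conns_from_C,conns_from_D
2018-03-01T11:00:00Z,42,0,1
2018-03-01T11:01:00Z,38,0,0
2018-03-01T11:02:00Z,45,1,0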


Are there few enough servers that you might create an HTM model for each one?

Oh, that sounds like a fantastic idea! Thank you!

Those were very simple examples, but reality is more complex and it wouldn’t be possible to do it by hand with a few if statements. This network is pretty complex, I don’t know it very well because it’s not my own, and it’s actually not the only one I’ll have to deal with, so I need a flexible solution. I agree they are spatial anomalies, though.

Here’s another example I can think of: there’s server X, an Active Directory server, which receives a bunch of connections every morning when everyone logs in. From there on it may receive some random connections from people rebooting their systems or coming in late to work. If someone tried to log in repeatedly, trying different passwords, that would generate a lot of traffic, which would be an anomaly at any time except when everyone else is logging in. It could be like that, or connections could be randomly distributed over the day; I really don’t know, and that’s why I need the TM.

Here’s a snippet with modified IPs:

timestamp,src_ip,dst_ip,src_port,dst_port
2018-03-01T11:00:00Z,10.1.1.61,10.1.2.23,51237,53
2018-03-01T11:00:00Z,10.1.1.61,10.1.2.23,51237,53
2018-03-01T11:00:00Z,10.2.3.179,10.1.2.17,1243,53
2018-03-01T11:00:00Z,10.2.3.179,10.1.2.17,1243,53
2018-03-01T11:00:00Z,10.3.1.68,10.1.2.23,5681,53
2018-03-01T11:00:00Z,10.3.1.68,10.1.2.23,5681,53
2018-03-01T11:00:00Z,10.4.4.26,192.168.0.1,58066,53
2018-03-01T11:00:00Z,10.4.4.26,192.168.0.1,58066,53
2018-03-01T11:00:00Z,10.10.10.80,10.2.3.31,1893,8080

At the moment I’m ignoring source and destination ports to make things simpler. I’ll miss some real anomalies (I can’t tell an SSH connection from a DNS request), but it’s fine for the moment.

That’d be the best solution IMO, but we’re talking about around a thousand machines on this network. From my point of view, each server could run its own model and report anomalies to a central system, but my hands are tied here, and I have to run everything on a single machine.
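On a single machine, what I’m picturing is roughly this (a sketch assuming the usual OPF ModelFactory API; the field name and params module are placeholders, and the import path differs between NuPIC versions), though I’d have to test whether that many models fit in memory:

from nupic.frameworks.opf.model_factory import ModelFactory
from model_params import MODEL_PARAMS   # my existing params (hypothetical module)

models = {}   # one OPF model per server IP, created lazily

def get_model(server_ip):
    if server_ip not in models:
        model = ModelFactory.create(MODEL_PARAMS)
        # "connection_count" is a placeholder; it has to match a field
        # defined in the model params' encoders.
        model.enableInference({"predictedField": "connection_count"})
        models[server_ip] = model
    return models[server_ip]

def process_row(server_ip, timestamp, connection_count):
    result = get_model(server_ip).run({"timestamp": timestamp,
                                       "connection_count": connection_count})
    return result.inferences["anomalyScore"]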

Thank you for your help, guys. :heart: