HTM with Multiple Fields

Hi,

I am trying to create my first HTM model and I have some questions. I'd appreciate any help.
. My model will have 5 features: datetime, clientId, number of transactions aggregated daily, day of week, and the predicted value;
. I have a huge number of transactions, something like 2 million a day, so data is not a problem;
. I intend to train my model with two weeks of aggregated transactions. Should I pre-aggregate the number of transactions, or let the HTM framework do that for me?
. The predictions must be done by clientId.

What do you think would be the best way to structure this kind of model:

. One model for each client;
. Order the data by ClientId and use the Reset method to tell the HTM to restart the learning process when the ClientId changes;
. Should I use the OPF or the Network API, and in that case, what would be the best architecture?

I presume that Reset can be used for this kind of processing (see the sketch below), but as I am new to HTM I am not sure.
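
Roughly, this is what I have in mind. This is a minimal sketch: MODEL_PARAMS and the `records` iterable are placeholders, the records are assumed to be sorted by ClientId and then timestamp, and the ModelFactory import path varies slightly across NuPIC versions.

```python
from nupic.frameworks.opf.model_factory import ModelFactory

# MODEL_PARAMS is a placeholder for a full OPF model params dict.
model = ModelFactory.create(MODEL_PARAMS)
model.enableInference({"predictedField": "scoring"})

lastClientId = None
for record in records:  # placeholder: dicts sorted by clientId, then timestamp
    if record["clientId"] != lastClientId:
        # Start a new sequence so one client's transitions
        # don't bleed into the next client's.
        model.resetSequenceStates()
        lastClientId = record["clientId"]
    result = model.run({
        "timestamp": record["timestamp"],
        "transactions": record["transactions"],
        "scoring": record["scoring"],
    })
```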

Can you help me?

Thanks in advance


Yes, I think you have to do this. How many clients are there? If there are thousands, it will be an interesting scaling problem. Let’s talk this issue out before addressing your other questions, because it might mean HTM is not a good fit. I have never had any success trying to have one NuPIC model process input containing data aggregated from several disparate “contexts”. It works much better with one model per “context”.
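
To be concrete, "one model per context" means keeping a separate model instance per client, along these lines. This is a hedged sketch: `createModelForClient` is a hypothetical helper that would build an OPF model from shared params, and `stream` stands in for your record source.

```python
models = {}

def getModel(clientId):
    # Lazily create one NuPIC model per client ("context").
    if clientId not in models:
        models[clientId] = createModelForClient(clientId)  # hypothetical helper
    return models[clientId]

for record in stream:  # assumed: records from many clients, interleaved
    getModel(record["clientId"]).run(record)
```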

Hi,

Yes, we have a huge number of clients. Today we have a full Bayesian network that works well, but I would like to test HTM in this context.

We have something like 65K clients and 2 million transactions (total) a day. If we aggregate the transactions grouped by ClientId, we'll have 65K rows a day.

In the NYC Traffic example you trained separate models for each route; that's why I was wondering if we should have a specific model per client.

I will have 5 fields in my model, at least initially. If I use separate models, how large should my training data be for each?

Thanks

I think, realistically, you can only expect to run hundreds of NuPIC models at a time even after optimizing and scaling. That being said, you might be able to get considerable value by sampling a small percentage of clients and building out models for them. If that is something interesting, we should start talking about what these 5 fields represent and how they should be encoded and aggregated.
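
For example, sampling could be as simple as this sketch. It assumes you have the full list of client IDs available as `allClientIds`; the sample fraction is arbitrary.

```python
import random

SAMPLE_FRACTION = 0.005  # ~325 of 65K clients; arbitrary, tune to your capacity
sampleSize = int(len(allClientIds) * SAMPLE_FRACTION)  # allClientIds is assumed
sampledClients = set(random.sample(allClientIds, sampleSize))

# Only build models / feed records for the sampled clients.
records = (r for r in stream if r["clientId"] in sampledClients)
```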

It’s all training data. NuPIC is an online learning system, we don’t differentiate between training data and other data.


I would like to test with a few clients first, and depending on the accuracy I can try to run all of them.

I hope and believe that HTM can solve some of our problems. We have really big servers, so if this works well for some of our clients, maybe we can scale the process. It is definitely not easy, but I want to try.

Can you help me?

Thanks

First thing is to give as much info as you can about each field, especially the one you’re trying to predict. Also, what are your goals for prediction? How can prediction provide value in this data set?

Field List:

  1. ClientId: Numeric String;
  2. DateTime of Transaction;
  3. Day of Week;
  4. Daily Number of Transactions grouped by ClientId: Integer, range 0 to 20,000;
  5. Scoring: float, range 1 to 100;

Field 5 will be the predicted value. This field gives a risk score based on the transactional profile of the client. I would like to predict this value, especially 5 days ahead.
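
As a rough sketch, these fields might map to the `encoders` section of the OPF model params like this. The encoder types are real NuPIC encoders, but every parameter value here is illustrative, not tuned.

```python
ENCODERS = {
    "clientId": {
        "fieldname": "clientId",
        "name": "clientId",
        "type": "SDRCategoryEncoder",  # treats the numeric string as a category
        "n": 2048,
        "w": 21,
    },
    "timestamp_dayOfWeek": {
        "fieldname": "timestamp",
        "name": "timestamp_dayOfWeek",
        "type": "DateEncoder",
        "dayOfWeek": (21, 1),  # covers field 3, so no separate day-of-week field
    },
    "transactions": {
        "fieldname": "transactions",
        "name": "transactions",
        "type": "ScalarEncoder",
        "minval": 0,
        "maxval": 20000,
        "n": 400,
        "w": 21,
        "clipInput": True,
    },
    "scoring": {  # the predicted field
        "fieldname": "scoring",
        "name": "scoring",
        "type": "ScalarEncoder",
        "minval": 1,
        "maxval": 100,
        "n": 400,
        "w": 21,
        "clipInput": True,
    },
}
```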

One more piece of information: I would like to persist the models in a database.
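
On persistence: `model.save()` and `ModelFactory.loadFromCheckpoint()` are the standard OPF calls, but they work on a checkpoint directory, so storing models in a database means packing that directory into a blob yourself. One possible (not the only) approach:

```python
import io
import os
import tarfile

from nupic.frameworks.opf.model_factory import ModelFactory

def modelToBlob(model, checkpointDir):
    """Checkpoint the model, then pack the directory into bytes for a DB column."""
    model.save(checkpointDir)  # OPF call: writes checkpoint files to a directory
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(checkpointDir, arcname="model")
    return buf.getvalue()

def blobToModel(blob, extractDir):
    """Unpack the stored blob and reload the model from its checkpoint."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
        tar.extractall(extractDir)
    return ModelFactory.loadFromCheckpoint(os.path.join(extractDir, "model"))
```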

You said earlier that you had 65K rows of data a day. I assume that is across all your clients, right? About how many per client per day? I’m asking because I don’t understand the field above without this information. If a field is a “Daily Number” of anything, then I must assume there are 65K clients each with one row of data per day? Is that right?

It will be 1 row per client per day. The aggregation is used to reduce the number of rows to be processed.

And you want to predict the risk of a client 5 days in the future?

Yes. This will allow us to define priorities and try to discover problems in advance.

My gut feeling is that you need more than daily data points to uncover this level of pattern. If you are trying to identify risk in human behaviors, then "time of day" can be just as important as "day of week". (Also, you don't need a special 'day of week' field; we can configure that in the DateEncoder.)
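
For example, with the standalone encoder class it looks like this. The width/radius values below are the common defaults from the NuPIC examples, not tuned for your data.

```python
import datetime

from nupic.encoders.date import DateEncoder

dateEncoder = DateEncoder(
    dayOfWeek=(21, 1),    # 21 bits per day-of-week bucket
    timeOfDay=(21, 9.5),  # 21 bits, ~9.5-hour radius over the day
)

# Encode a single datetime into the bit array the SP consumes.
bits = dateEncoder.encode(datetime.datetime(2017, 3, 15, 14, 30))
```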

Is there any way you can get data on 15 minute intervals from clients?

Yes, no problem; we have transactions throughout the day. In fact, we have 2 million transactions a day, and we can group them however we want.
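
For example, a 15-minute grouping with pandas would look something like this. The file name and column names are assumptions about our schema.

```python
import pandas as pd

df = pd.read_csv("transactions.csv", parse_dates=["timestamp"])  # assumed schema

counts = (
    df.set_index("timestamp")
      .groupby("clientId")
      .resample("15T")         # 15-minute buckets
      .size()                  # transaction count per client per bucket
      .rename("transactions")
      .reset_index()
)
```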

That’s great! I would suggest you focus more at this time scale. One thing NuPIC does pretty well is identify patterns based on human time structures because of the way they are encoded. I have some details about this in this video if you haven’t seen it.

Now what types of information can you get at this time scale? Can you get latitude / longitude of an event?

Oh, and another thing I should warn you about: predictions, especially at 5 days out, are not going to be very good. However, you might get value out of our anomaly detection capabilities; you might get decent anomalies from this data. Do you happen to have any data where you know an event occurred that you would want to identify in the future? Something like that is really useful to know when building out a model.


Yes. I have the scoring calculated for the last three months, so I can go back and check how well the predictions work.

I believe that will give us a very good sense of the precision of the 1-step and 5-step predictions.

I’m not talking about prediction precision. That won’t be very good, I can assure you. But, anomaly detection could work well, depending on the data stream. You might be able to test if it works for you if you can get some data into a simple format for HTM Studio, which is a demo app we use to run temporal data through NuPIC and see how well it picks up anomalies.

What exactly do you mean by "not very good"?

There are a couple of problems with NuPIC’s temporal memory system when used as a prediction engine:

  1. Temporal pooling is not implemented within the 1-layer SP/TM structure. We don't understand exactly how to identify which sequences a spatial input has been observed within. The information may be there in the connections between the cells, but walking through the neurons to uncover it is prohibitively expensive. Your brain can't be working that way either, so there must be some other way, and that's one of the things we've been researching for the past couple of years. In our current research work, we refer to it as a "pooling layer", and we implement this type of feature with a 2-layer structure. That seems to be how it is done in the brain.

    All our NuPIC examples of temporal anomaly detection use the 1-layer SP/TM configuration, where temporal pooling doesn't work. That is why predictions are not as good as they should be: the model is never informed about which possible sequences it is currently within at any given time.

  2. Sequence start/stop must be manually marked. This is a problem we haven’t solved yet. It may have to do with attention, it may require apical feedback.

That being said, NuPIC is good at anomaly detection, because we can extract anomaly scores from the cellular states without needing #1 or #2 above. An indication that behavior is strange might be more valuable than a somewhat-accurate prediction.
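
Concretely, with a TemporalAnomaly OPF model you can read the raw anomaly score from each result and smooth it with `AnomalyLikelihood`. This is a sketch: `model`, `ts`, and `value` are carried over from the earlier examples, and the 0.9999 threshold is the convention from the NuPIC examples rather than a hard rule.

```python
from nupic.algorithms.anomaly_likelihood import AnomalyLikelihood

likelihood = AnomalyLikelihood()

result = model.run({"timestamp": ts, "scoring": value})
rawScore = result.inferences["anomalyScore"]  # 0.0 (expected) .. 1.0 (surprising)
prob = likelihood.anomalyProbability(value, rawScore, ts)
if prob > 0.9999:
    print("anomalous behavior at %s" % ts)
```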


Hi,

I am working on a pilot with the Network API. I will let you know how things are going.

Thanks
