I am trying to create my first HTM model and I have some questions. Any help would be appreciated.
. My model will have 5 features: datetime, clientId, number of transactions aggregated daily, day of week, and the predicted value;
. I have a huge number of transactions, something like 2 million a day, so data is not a problem;
. I intend to train my model with two weeks of aggregated transactions. Should I pre-aggregate the number of transactions, or let the HTM framework do that for me?
. The predictions must be done by clientId.
What do you think would be the best way to process this kind of model:
. One model for each client;
. Order the data by ClientId and use the Reset method to tell the HTM to restart the sequence when the ClientId changes;
. Should I use the OPF or the Network API? In that case, what would be the best architecture?
I presume that Reset can be used for this kind of processing, but as I am new to HTM I am not sure.
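The reset-per-client idea could be sketched as below. `StubModel` is a stand-in so the sketch runs on its own; a real OPF model exposes `resetSequenceStates()` and `run()` with similar roles, but the surrounding loop is the point here, not the model internals:

```python
from itertools import groupby
from operator import itemgetter

class StubModel(object):
    """Stand-in for a NuPIC OPF model; only records calls for illustration."""
    def __init__(self):
        self.resets = 0
        self.rows_run = 0
    def resetSequenceStates(self):
        # On a real OPF model this marks a sequence break for the TM.
        self.resets += 1
    def run(self, row):
        self.rows_run += 1

def run_with_resets(model, rows):
    """Feed rows grouped by clientId, resetting sequence state at each boundary."""
    rows = sorted(rows, key=itemgetter("clientId"))
    for client_id, group in groupby(rows, key=itemgetter("clientId")):
        model.resetSequenceStates()  # new client => new sequence
        for row in group:
            model.run(row)

rows = [{"clientId": "A", "transactions": 10},
        {"clientId": "B", "transactions": 7},
        {"clientId": "A", "transactions": 12}]
model = StubModel()
run_with_resets(model, rows)
print(model.resets)  # 2 -- one reset per distinct client
```

Note that this still feeds all clients into one model; whether that beats one model per client is exactly the scaling question discussed below.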
Yes, I think you have to do this. How many clients are there? If there are thousands, it will be an interesting scaling problem. Let's talk this issue out before addressing your other questions, because it might mean HTM is not a good fit. I have never had any success trying to have one NuPIC model process input containing data aggregated from several disparate "contexts". It works much better with one model per "context".
Yes, we have a huge number of clients. Today we have a full Bayesian network working well, but I would like to test HTM in this context.
We have something like 65K clients and 2 million transactions (total) a day. If we aggregate the transactions grouping by ClientId, we'll have 65K rows a day.
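Pre-aggregating outside NuPIC is straightforward; a minimal stdlib sketch of collapsing raw transaction events into one count per client per day (the tuple layout is an assumption for illustration):

```python
from collections import Counter

def daily_counts(transactions):
    """transactions: iterable of (clientId, date) pairs, one per raw event.
    Returns {(clientId, date): count} -- one aggregated row per client per day."""
    counts = Counter()
    for client_id, date in transactions:
        counts[(client_id, date)] += 1
    return counts

txns = [("A", "2016-01-01"), ("A", "2016-01-01"),
        ("B", "2016-01-01"), ("A", "2016-01-02")]
rows = daily_counts(txns)
print(rows[("A", "2016-01-01")])  # 2
```

At 2 million events a day this fits comfortably in memory, so doing the aggregation upstream and feeding NuPIC the 65K daily rows is a reasonable split of work.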
In the NYC Traffic example you trained separate models for each route; that's why I was wondering if we should have a specific model per client.
I will have 5 fields in my model, at least initially. If I use separate models, what would be the size of my training data?
I think, realistically, you can only expect to run hundreds of NuPIC models at a time even after optimizing and scaling. That being said, you might be able to get considerable value by sampling a small percentage of clients and building out models for them. If that is something interesting, we should start talking about what these 5 fields represent and how they should be encoded and aggregated.
It's all training data. NuPIC is an online learning system; we don't differentiate between training data and other data.
I would like to test with a few clients and, depending on the accuracy, decide whether to run all of them.
I hope and believe that HTM can solve some of our problems. We have really large servers, so if this works well for some of our clients maybe we can scale the process. It is definitely not easy, but I want to try.
First thing is to give as much info as you can about each field, especially the one you're trying to predict. Also, what are your goals for prediction? How can prediction provide value in this data set?
Daily Number of Transactions grouped by ClientId: Integer, range 0 to 20,000;
Scoring: float, range 1 to 100;
Field 5 will be the predicted value. This field gives a risk score based on the transactional profile of the client. I would like to predict this value, especially 5 days ahead.
One more piece of information: I would like to persist the models in a database.
You said earlier that you had 65K rows of data a day. I assume that is across all your clients, right? About how many per client per day? I'm asking because I don't understand the field above without this information. If a field is a "Daily Number" of anything, then I must assume there are 65K clients, each with one row of data per day? Is that right?
My gut feeling is that you need more than daily data points to uncover this level of pattern. If you are trying to identify risk in human behaviors, then "time of day" can be just as important as "day of week". (Also, you don't need a special "day of week" field; we configure that in the DateEncoder.)
Is there any way you can get data on 15 minute intervals from clients?
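For reference, the day-of-week (and time-of-day) behavior mentioned above is configured on the DateEncoder rather than supplied as a separate input field. A sketch of what the encoder section of OPF model params might look like; the widths and radii below are typical example values from NuPIC sample configs, not tuned for this data:

```python
# Illustrative "encoders" section for OPF model params.
# (width, radius) pairs are example values, not tuned settings.
ENCODER_PARAMS = {
    "timestamp_dayOfWeek": {
        "fieldname": "timestamp",
        "name": "timestamp_dayOfWeek",
        "type": "DateEncoder",
        "dayOfWeek": (21, 1),    # encode day of week directly from the datetime
    },
    "timestamp_timeOfDay": {
        "fieldname": "timestamp",
        "name": "timestamp_timeOfDay",
        "type": "DateEncoder",
        "timeOfDay": (21, 9.5),  # becomes useful once data is at 15-minute resolution
    },
}
```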
That's great! I would suggest you focus more at this time scale. One thing NuPIC does pretty well is identify patterns based on human time structures, because of the way they are encoded. I have some details about this in this video if you haven't seen it.
Now what types of information can you get at this time scale? Can you get latitude / longitude of an event?
Oh, and another thing I should warn you about: predictions, especially at 5 days out, are not going to be very good. However, you might get value out of our anomaly detection capabilities. You might get decent anomalies from this data. Do you happen to have any data where you know an event occurred that you want to be able to identify in the future? Something like that is really useful to know when building out a model.
I'm not talking about prediction precision. That won't be very good, I can assure you. But anomaly detection could work well, depending on the data stream. You might be able to test whether it works for you if you can get some data into a simple format for HTM Studio, which is a demo app we use to run temporal data through NuPIC and see how well it picks up anomalies.
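Getting data into shape for a tool like HTM Studio mostly means a flat CSV of timestamped values per stream. A sketch of exporting one client's aggregated counts; the column names here are assumptions, so check the app's documentation for its exact expected format:

```python
import csv
import io

# One client's aggregated stream: (timestamp, transaction count).
stream = [("2016-01-01 00:00:00", 120),
          ("2016-01-02 00:00:00", 95),
          ("2016-01-03 00:00:00", 143)]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["timestamp", "transactions"])  # header row (assumed names)
for ts, count in stream:
    writer.writerow([ts, count])

csv_text = buf.getvalue()
print(csv_text.splitlines()[0])  # timestamp,transactions
```

One file per client keeps each stream a single "context", matching the one-model-per-context advice earlier in the thread.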
There are a couple of problems with NuPIC's temporal memory system when used as a prediction engine:
1. Temporal pooling is not implemented within the 1-layer SP/TM structure. We don't understand exactly how to identify which sequences a spatial input has been observed within. The information may be there in the data between the cells, but walking through the neurons to uncover it is prohibitively expensive. Your brain can't be working that way either, so there must be some other way, and that's one of the things we've been researching for the past couple of years. In our current research work we refer to it as a "pooling layer", and we implement this type of feature using a 2-layer structure. That seems to be how it is done in the brain. All our NuPIC examples of temporal anomaly detection use the 1-layer SP/TM configuration, where temporal pooling doesn't work. That is why predictions are not as good as they should be: we are not informed about the possible sequences we are currently within at any given time.
2. Sequence start/stop must be manually marked. This is a problem we haven't solved yet. It may have to do with attention, or it may require apical feedback.
That being said, NuPIC is good at anomaly detection, because we can extract anomaly scores from the cellular states without needing #1 or #2 above. An indication that behavior is strange might be more valuable than a somewhat accurate prediction.
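The score extraction mentioned here is cheap: the raw anomaly score is the fraction of currently active columns that were not predicted at the previous timestep. A standalone sketch of that formula, outside any NuPIC machinery:

```python
def raw_anomaly_score(active_columns, predicted_columns):
    """Fraction of active columns that were NOT predicted at the previous
    timestep: 0.0 means fully expected input, 1.0 means fully novel."""
    active = set(active_columns)
    if not active:
        return 0.0
    unpredicted = active - set(predicted_columns)
    return len(unpredicted) / float(len(active))

print(raw_anomaly_score({1, 2, 3, 4}, {1, 2}))  # 0.5 -- half the input was unexpected
print(raw_anomaly_score({1, 2}, {1, 2, 3}))     # 0.0 -- everything active was predicted
```

Because this only compares column states, it sidesteps the temporal pooling and sequence-marking problems above.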