Hi,
I started experimenting with Numenta a few weeks ago and was very impressed with its approach to making predictions and detecting anomalies in real time.
I am trying to apply NuPIC to a simulated data set in order to detect fraudulent transactions. The full data set was published on Kaggle, as part of an effort to generate fraud-specific data sets: https://www.kaggle.com/ntnu-testimon/paysim1
This is the data set used in my experiment (4900 records): https://gist.github.com/mirsci/23002ef151855d780970bd0e3951854e
datetime,string,float,string,float,float,string,float,float,int,int
T,
2017-9-11 3:32:18,PAYMENT,9839.64,C1231006816,170136,160296.36,M1979787155,0,0,0,0
2017-9-11 3:32:17,PAYMENT,1864.28,C1666544295,21249,19384.72,M2044282225,0,0,0,0
2017-9-11 3:32:16,TRANSFER,181,C1305486145,181,0,C553264065,0,0,1,0
2017-9-11 3:32:15,CASH_OUT,181,C840083671,181,0,C38997010,21182,0,1,0
2017-9-11 3:32:14,PAYMENT,11668.14,C2048537720,41554,29885.86,M1230701703,0,0,0,0
2017-9-11 3:32:13,PAYMENT,7817.71,C90045638,53860,46042.29,M573487274,0,0,0,0
2017-9-11 3:32:12,PAYMENT,7107.77,C154988899,183195,176087.23,M408069119,0,0,0,0
2017-9-11 3:32:11,PAYMENT,7861.64,C1912850431,176087.23,168225.59,M633326333,0,0,0,0
2017-9-11 3:32:10,PAYMENT,4024.36,C1265012928,2671,0,M1176932104,0,0,0,0
2017-9-11 3:32:09,DEBIT,5337.77,C712410124,41720,36382.23,C195600860,41898,40348.79,0,0
2017-9-11 3:32:08,DEBIT,9644.94,C1900366749,4465,0,C997608398,10845,157982.12,0,0
2017-9-11 3:32:07,PAYMENT,3099.97,C249177573,20771,17671.03,M2096539129,0,0,0,0
2017-9-11 3:32:06,PAYMENT,2560.74,C1648232591,5070,2509.26,M972865270,0,0,0,0
2017-9-11 3:32:05,PAYMENT,11633.76,C1716932897,10127,0,M801569151,0,0,0,0
2017-9-11 3:32:04,PAYMENT,4098.78,C1026483832,503264,499165.22,M1635378213,0,0,0,0
2017-9-11 3:32:03,CASH_OUT,229133.94,C905080434,15325,0,C476402209,5083,51513.44,0,1
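For anyone who wants to reproduce this, here is a minimal sketch of how rows like the ones above could be parsed outside NuPIC, using only the type row shown in the sample. The helper names `parse_row` and `is_time_ordered` are hypothetical and not part of NuPIC. The ordering check is there because the timestamps in the sample decrease from one row to the next, which may matter for a model that learns sequences.

```python
from datetime import datetime

# Type row copied from the sample above.
TYPE_ROW = "datetime,string,float,string,float,float,string,float,float,int,int"

# One converter per type name that appears in the type row.
CONVERTERS = {
    "datetime": lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"),
    "string": str,
    "float": float,
    "int": int,
}


def parse_row(line, type_row=TYPE_ROW):
    """Convert one comma-separated data row into typed Python values."""
    types = type_row.split(",")
    values = line.strip().split(",")
    if len(values) != len(types):
        raise ValueError("expected %d fields, got %d" % (len(types), len(values)))
    return [CONVERTERS[t](v) for t, v in zip(types, values)]


def is_time_ordered(rows):
    """Return True if the timestamp in the first column never decreases."""
    times = [row[0] for row in rows]
    return all(a <= b for a, b in zip(times, times[1:]))


# First two data rows of the sample above.
SAMPLE = [
    "2017-9-11 3:32:18,PAYMENT,9839.64,C1231006816,170136,160296.36,M1979787155,0,0,0,0",
    "2017-9-11 3:32:17,PAYMENT,1864.28,C1666544295,21249,19384.72,M2044282225,0,0,0,0",
]

rows = [parse_row(line) for line in SAMPLE]
# The second timestamp is earlier than the first, so this prints False.
print(is_time_ordered(rows))
```

Reversing the list (`rows[::-1]`) would put these two rows in chronological order, if that turns out to be what the model needs.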
And its associated model parameters:
The anomaly likelihood and anomaly score results are captured in the following chart, where a Class label of 0 means no fraud and 1 means fraud:
The regions of high anomaly likelihood do not seem to match the regions that contain the truly fraudulent transactions.
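To put a number on this mismatch instead of judging it from the chart, one option is to flag every record whose anomaly likelihood reaches a threshold and compare the flags with the Class label. The sketch below is only an illustration: the function name `precision_recall`, the 0.9 threshold and the toy values are all assumptions, not results from my model.

```python
def precision_recall(likelihoods, labels, threshold):
    """Compare thresholded anomaly likelihoods with 0/1 fraud labels.

    A record is flagged when its likelihood is >= threshold.
    Returns (precision, recall); each is 0.0 when it is undefined.
    """
    flagged = [lk >= threshold for lk in likelihoods]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    precision = float(tp) / (tp + fp) if (tp + fp) else 0.0
    recall = float(tp) / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy values for illustration only (not taken from the experiment).
toy_likelihoods = [0.2, 0.95, 0.99, 0.4, 0.97]
toy_labels = [0, 1, 0, 0, 1]

precision, recall = precision_recall(toy_likelihoods, toy_labels, threshold=0.9)
print(precision, recall)
```

Sweeping the threshold over a range of values would show whether any setting separates the fraud records, or whether the likelihood carries little signal for this label at all.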
Would you know if there is a different way to optimize this model?
Thanks in advance for your help!