Fraud detection model with synthetic data set

Hi,

I started experimenting with Numenta a few weeks ago and was very impressed by its approach to making predictions and detecting anomalies in real time.

I am trying to apply NuPIC to a simulated data set to detect fraudulent transactions. The full data set has been published on Kaggle as part of PaySim, a project for generating fraud-specific synthetic data sets: https://www.kaggle.com/ntnu-testimon/paysim1

This is the data set used in my experiment (4900 records): https://gist.github.com/mirsci/23002ef151855d780970bd0e3951854e

datetime,string,float,string,float,float,string,float,float,int,int
T,,,,,,,,,,
2017-9-11 3:32:18,PAYMENT,9839.64,C1231006816,170136,160296.36,M1979787155,0,0,0,0
2017-9-11 3:32:17,PAYMENT,1864.28,C1666544295,21249,19384.72,M2044282225,0,0,0,0
2017-9-11 3:32:16,TRANSFER,181,C1305486145,181,0,C553264065,0,0,1,0
2017-9-11 3:32:15,CASH_OUT,181,C840083671,181,0,C38997010,21182,0,1,0
2017-9-11 3:32:14,PAYMENT,11668.14,C2048537720,41554,29885.86,M1230701703,0,0,0,0
2017-9-11 3:32:13,PAYMENT,7817.71,C90045638,53860,46042.29,M573487274,0,0,0,0
2017-9-11 3:32:12,PAYMENT,7107.77,C154988899,183195,176087.23,M408069119,0,0,0,0
2017-9-11 3:32:11,PAYMENT,7861.64,C1912850431,176087.23,168225.59,M633326333,0,0,0,0
2017-9-11 3:32:10,PAYMENT,4024.36,C1265012928,2671,0,M1176932104,0,0,0,0
2017-9-11 3:32:09,DEBIT,5337.77,C712410124,41720,36382.23,C195600860,41898,40348.79,0,0
2017-9-11 3:32:08,DEBIT,9644.94,C1900366749,4465,0,C997608398,10845,157982.12,0,0
2017-9-11 3:32:07,PAYMENT,3099.97,C249177573,20771,17671.03,M2096539129,0,0,0,0
2017-9-11 3:32:06,PAYMENT,2560.74,C1648232591,5070,2509.26,M972865270,0,0,0,0
2017-9-11 3:32:05,PAYMENT,11633.76,C1716932897,10127,0,M801569151,0,0,0,0
2017-9-11 3:32:04,PAYMENT,4098.78,C1026483832,503264,499165.22,M1635378213,0,0,0,0
2017-9-11 3:32:03,CASH_OUT,229133.94,C905080434,15325,0,C476402209,5083,51513.44,0,1

And its associated model parameters:

The anomaly likelihood and anomaly score results are captured in the following chart, where a class label of 0 means no fraud and 1 means fraud:

The regions of higher anomaly likelihood do not seem to match the regions of truly fraudulent transactions.

Do you know of a different way to optimize this model?

Thanks in advance for your help!


It looks like there is only about an hour of data in the data set. At this time scale, encoding a timestamp doesn’t help you. The data set also looks artificial, given the perfectly even distribution of transactions (exactly one per second). Since the entries arrive at a regular interval (artificial or not), you should be fine removing the timestamp_timeOfDay field altogether (make it None in the params).

Also, I see you are using a ScalarEncoder for your binary isFraud value. I would suggest using a category encoder instead, as I believe it more evenly distributes the bits and ensures there is no overlap between values.
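
For illustration, here is a minimal sketch of what the relevant entries in the encoders section of the model params could look like after both changes. The field names match the sample data in this thread; the SDRCategoryEncoder choice and the n/w values are illustrative assumptions, not tuned settings:

# A sketch of the encoders section of the model params, assuming the
# standard NuPIC model-params layout and this thread's field names.
ENCODERS = {
    # ... other field encoders (amount, type, etc.) stay as they were ...

    # Drop the time-of-day encoding entirely, as suggested above:
    'timestamp_timeOfDay': None,

    # Encode the binary label as a category instead of a scalar, so the two
    # values get non-overlapping bit patterns (n and w are untuned guesses):
    'isFraud': {
        'fieldname': 'isFraud',
        'name': 'isFraud',
        'type': 'SDRCategoryEncoder',
        'n': 121,
        'w': 21,
    },
}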

Many thanks, Matt. I included your suggestions in the encoders, and the results seem to be slightly better than the previous run (although not by much):

Would you suggest including more data in this data set (for example, a day or more worth of transactions), so NuPIC has more input to detect anomalies?
Should the data granularity be defined at minute intervals instead (the data is synthetic, so there is some room for pre-processing)?

Thanks once more!

More data is almost always better. A NuPIC model should have seen at least 1,000 data points before I would rely at all on its output. IMO 3,000 is much better.

How much total data is available for the contest? Hours, days, months? If there are months of data, then yes, definitely aggregate somehow and include the timestamp (at that scale it could be very important). But the aggregation might make it impossible to identify the exact input that caused an anomaly; it would just give you a time range.
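
If you do go the aggregation route, a pre-processing step along these lines could work. This is a sketch using pandas; the file name and the timestamp/amount/isFraud column names are assumptions matching the sample data above:

import pandas as pd

# Load the raw per-second transactions (assumes a plain one-row header; the
# NuPIC three-row header would need to be stripped or re-added separately).
df = pd.read_csv('transactions.csv', parse_dates=['timestamp'])

# Aggregate into 1-minute buckets: sum the transaction amounts and take the
# max of the fraud label, so any bucket containing a fraud is labeled 1.
agg = (df.set_index('timestamp')
         .resample('1T')
         .agg({'amount': 'sum', 'isFraud': 'max'})
         .dropna())

agg.to_csv('transactions_1min.csv')

Note that after this step the isFraud label only tells you that a fraud occurred somewhere inside that minute, not which transaction caused it.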


Hey @rhyolight,

Is it also possible to get the anomaly likelihood to kick in after fewer time steps by changing the ‘WINDOW_SIZE’ in the nupic_anomaly_output.py file? I have a limited amount of data, and the current WINDOW of 300 leads to anomaly likelihood values of 0.5 for nearly the first 400 time steps. If I’d like to see likelihood values by time step 100, for instance, should I change WINDOW to ~100? Thanks!

Makes sense to me. Ideally this would be configurable, but this class is not really a “first class citizen” of NuPIC because it’s not a core algorithm. You might get better values by changing the settings for your data.
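
For example, you can also bypass the helper file and construct the likelihood calculator directly, since nupic.algorithms.anomaly_likelihood.AnomalyLikelihood exposes the warm-up knobs as constructor arguments. A sketch (the parameter values here are illustrative, not recommended settings):

from nupic.algorithms.anomaly_likelihood import AnomalyLikelihood

# Shrinking learningPeriod and estimationSamples makes the likelihood move
# off 0.5 after fewer records, at the cost of a noisier early estimate.
likelihood = AnomalyLikelihood(learningPeriod=50, estimationSamples=50)

# rows: fill with (timestamp, metric_value, raw_anomaly_score) tuples
# collected from your model run (placeholder).
rows = []
for timestamp, value, raw_score in rows:
    prob = likelihood.anomalyProbability(value, raw_score, timestamp)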


Ok, so when you say ‘settings for your data’ do you mean things like parameter values or the sampling rate?

I was just generalizing, honestly. I had nothing specific in mind. :man_shrugging: