Fraud detection model with synthetic data set

Hi,

I started experimenting with Numenta a few weeks ago and was very impressed by its approach to making predictions and detecting anomalies in real time.

I am trying to apply NuPIC to a simulated data set to detect fraudulent transactions. The full data set has been published on Kaggle as part of PaySim, a project for generating fraud-specific synthetic data sets: https://www.kaggle.com/ntnu-testimon/paysim1

This is the data set used in my experiment (4900 records): https://gist.github.com/mirsci/23002ef151855d780970bd0e3951854e

datetime,string,float,string,float,float,string,float,float,int,int
T,,,,,,,,,,
2017-9-11 3:32:18,PAYMENT,9839.64,C1231006816,170136,160296.36,M1979787155,0,0,0,0
2017-9-11 3:32:17,PAYMENT,1864.28,C1666544295,21249,19384.72,M2044282225,0,0,0,0
2017-9-11 3:32:16,TRANSFER,181,C1305486145,181,0,C553264065,0,0,1,0
2017-9-11 3:32:15,CASH_OUT,181,C840083671,181,0,C38997010,21182,0,1,0
2017-9-11 3:32:14,PAYMENT,11668.14,C2048537720,41554,29885.86,M1230701703,0,0,0,0
2017-9-11 3:32:13,PAYMENT,7817.71,C90045638,53860,46042.29,M573487274,0,0,0,0
2017-9-11 3:32:12,PAYMENT,7107.77,C154988899,183195,176087.23,M408069119,0,0,0,0
2017-9-11 3:32:11,PAYMENT,7861.64,C1912850431,176087.23,168225.59,M633326333,0,0,0,0
2017-9-11 3:32:10,PAYMENT,4024.36,C1265012928,2671,0,M1176932104,0,0,0,0
2017-9-11 3:32:09,DEBIT,5337.77,C712410124,41720,36382.23,C195600860,41898,40348.79,0,0
2017-9-11 3:32:08,DEBIT,9644.94,C1900366749,4465,0,C997608398,10845,157982.12,0,0
2017-9-11 3:32:07,PAYMENT,3099.97,C249177573,20771,17671.03,M2096539129,0,0,0,0
2017-9-11 3:32:06,PAYMENT,2560.74,C1648232591,5070,2509.26,M972865270,0,0,0,0
2017-9-11 3:32:05,PAYMENT,11633.76,C1716932897,10127,0,M801569151,0,0,0,0
2017-9-11 3:32:04,PAYMENT,4098.78,C1026483832,503264,499165.22,M1635378213,0,0,0,0
2017-9-11 3:32:03,CASH_OUT,229133.94,C905080434,15325,0,C476402209,5083,51513.44,0,1

And its associated model parameters:

The anomaly likelihood and anomaly score results are captured in the following chart, where a class label of 0 means no fraud and 1 means fraud:

The regions of higher anomaly likelihood do not seem to match the regions of truly fraudulent transactions.

Do you know of a different way to optimize this model?

Thanks in advance for your help!


It looks like there is only about an hour of data in the data set. At this time scale, encoding a timestamp doesn’t help you. The data set also looks artificial, given the perfectly even distribution of transactions (exactly one per second). Since the entries arrive at a regular interval (artificial or not), you should be fine removing the timestamp_timeOfDay field altogether (make it None in the params).

Also, I see you are using a ScalarEncoder for your binary isFraud value. I would suggest using a category encoder instead, as I believe it more evenly distributes the bits and ensures there is no overlap between values.
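
For illustration, here is a minimal sketch of what the relevant entries in the encoders section of the model params could look like after both changes. The field names match the sample data in this thread; the SDRCategoryEncoder choice and the n/w values are illustrative assumptions, not tuned settings:

# A sketch of the encoders section of the model params, assuming the
# standard NuPIC model-params layout and this thread's field names.
ENCODERS = {
    # ... other field encoders (amount, type, etc.) stay as they were ...

    # Drop the time-of-day encoding entirely, as suggested above:
    'timestamp_timeOfDay': None,

    # Encode the binary label as a category instead of a scalar, so the two
    # values get non-overlapping bit patterns (n and w are untuned guesses):
    'isFraud': {
        'fieldname': 'isFraud',
        'name': 'isFraud',
        'type': 'SDRCategoryEncoder',
        'n': 121,
        'w': 21,
    },
}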

Many thanks, Matt. I included your suggestions in the encoders, and the results seem to be slightly better than the previous run (although not by much):

Would you suggest including more data in this data set (for example, a day or more worth of transactions), so NuPIC has more input to detect anomalies?
Should the data granularity be defined at minute intervals instead (the data is synthetic, so there is some room for pre-processing)?

Thanks once more!

More data is almost always better. A NuPIC model should have seen at least 1,000 data points before I would rely at all on its output. IMO 3,000 is much better.

How much total data is available for the contest? Hours, days, months? If there are months of data, then yes, definitely aggregate somehow and include the timestamp (at that scale it could be very important). But the aggregation might make it impossible to identify the exact input that caused an anomaly; it would just give you a time range.
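
If you do go the aggregation route, a pre-processing step along these lines could work. This is a sketch using pandas; the file name and the timestamp/amount/isFraud column names are assumptions matching the sample data above:

import pandas as pd

# Load the raw per-second transactions (assumes a plain one-row header; the
# NuPIC three-row header would need to be stripped or re-added separately).
df = pd.read_csv('transactions.csv', parse_dates=['timestamp'])

# Aggregate into 1-minute buckets: sum the transaction amounts and take the
# max of the fraud label, so any bucket containing a fraud is labeled 1.
agg = (df.set_index('timestamp')
         .resample('1T')
         .agg({'amount': 'sum', 'isFraud': 'max'})
         .dropna())

agg.to_csv('transactions_1min.csv')

Note that after this step the isFraud label only tells you that a fraud occurred somewhere inside that minute, not which transaction caused it.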


Hey @rhyolight,

Is it also possible to get the anomaly likelihood to kick in after fewer time steps by changing the ‘WINDOW_SIZE’ in the nupic_anomaly_output.py file? I have a limited amount of data, and the current WINDOW of 300 leads to anomaly likelihood values of 0.5 for nearly the first 400 time steps. If I’d like to see likelihood values by time step 100, for instance, should I change WINDOW to ~100? Thanks!

Makes sense to me. Ideally this would be configurable, but this class is not really a “first class citizen” of NuPIC because it’s not a core algorithm. You might get better values by changing the settings for your data.
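
For example, you can also bypass the helper file and construct the likelihood calculator directly, since nupic.algorithms.anomaly_likelihood.AnomalyLikelihood exposes the warm-up knobs as constructor arguments. A sketch (the parameter values here are illustrative, not recommended settings):

from nupic.algorithms.anomaly_likelihood import AnomalyLikelihood

# Shrinking learningPeriod and estimationSamples makes the likelihood move
# off 0.5 after fewer records, at the cost of a noisier early estimate.
likelihood = AnomalyLikelihood(learningPeriod=50, estimationSamples=50)

# rows: fill with (timestamp, metric_value, raw_anomaly_score) tuples
# collected from your model run (placeholder).
rows = []
for timestamp, value, raw_score in rows:
    prob = likelihood.anomalyProbability(value, raw_score, timestamp)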


Ok, so when you say ‘settings for your data’ do you mean things like parameter values or the sampling rate?

I was just generalizing, honestly. I had nothing specific in mind. :man_shrugging: