I came across the ECG dataset on Kaggle and thought that I might be able to approach this as an anomaly detection problem. The dataset consists of data in 5 categories: 1 for normal ECGs and 4 for abnormal ones. But since I’m approaching it as anomaly detection, I’ll be training on the “normal” set of data only and using the anomaly score to decide whether a given ECG is “normal” or not.
The setup is simple. Read data from the CSV file (ptbdb_normal.csv), feed it into a scalar encoder, then send the SDR into a Temporal Memory to learn the pattern of a normal heartbeat. Then load data from a test set (ptbdb_abnormal.csv and the part of ptbdb_normal.csv not used during training), run each ECG record through the TM, and count how many anomalies it contains. All records containing a large number of anomalies are considered abnormal.
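For anyone who wants to reproduce this, here is a minimal sketch of that pipeline using NuPIC’s Python API. The parameter values are illustrative placeholders (not my tuned ones), and the exact signatures may differ between NuPIC versions:

```python
import numpy as np
from nupic.encoders import ScalarEncoder
from nupic.algorithms.temporal_memory import TemporalMemory
from nupic.algorithms.anomaly import computeRawAnomalyScore

N = 512  # SDR length; w = 23 gives roughly the 4.6% density mentioned below
encoder = ScalarEncoder(n=N, w=23, minval=0.0, maxval=1.0, forced=True)
tm = TemporalMemory(columnDimensions=(N,))

def run_record(record, learn):
    """Feed one 187-sample ECG record through the TM; return its anomaly scores."""
    tm.reset()  # each record is treated as an independent sequence
    scores, prev_predicted = [], np.array([])
    for value in record:
        active = np.nonzero(encoder.encode(value))[0]
        tm.compute(active, learn=learn)
        scores.append(computeRawAnomalyScore(active, prev_predicted))
        prev_predicted = np.unique(
            [tm.columnForCell(c) for c in tm.getPredictiveCells()])
    return scores

# Train on normal ECGs only; the last column of the CSV is the category label.
normal = np.loadtxt("ptbdb_normal.csv", delimiter=",")[:, :-1]
for record in normal:
    run_record(record, learn=True)
```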
The results are… interesting. I can get at most 65% of all abnormal records recognized as abnormal (i.e. true positives) while 10% of normal records are recognized as abnormal (false positives). I consider this a good result, as this is really a classification problem, not anomaly detection. However, HTM seems to be very sensitive to hyper-parameters. Set PermanenceIncrement off by 0.05 and the true positive rate drops by 20%. The same happens when ConnectedPermanence is set too low or too high. Tuning the hyper-parameters can be really frustrating.
These are the best-tuned params we have found for anomaly detection on generic streaming scalar data. (Although the “time of day” part will not help you and you can remove that encoder entirely).
Thanks! I tried the values from getScalarMetricWithTimeOfDayAnomalyParams, but they don’t work as well as the ones I ended up with.
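For anyone else trying this, the call looks roughly like the sketch below. The encoder key name (“c0_timeOfDay”) and the field name (“c1”) are assumptions from one NuPIC version and may differ in yours:

```python
from nupic.frameworks.opf.common_models.cluster_params import (
    getScalarMetricWithTimeOfDayAnomalyParams)
from nupic.frameworks.opf.model_factory import ModelFactory

params = getScalarMetricWithTimeOfDayAnomalyParams(
    metricData=[0],  # only used to infer min/max when they are not given
    minVal=0.0, maxVal=1.0)

# As suggested above, drop the "time of day" encoder; ECG samples carry no
# wall-clock timestamps. The key name is an assumption for this version.
encoders = params["modelConfig"]["modelParams"]["sensorParams"]["encoders"]
encoders.pop("c0_timeOfDay", None)

model = ModelFactory.create(params["modelConfig"])
model.enableInference({"predictedField": "c1"})
```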
Also, I tried to visualize what HTM is predicting. It seems that sometimes HTM predicts nothing (the gap on the left side of the graph). (The blue line is the value fed to HTM, the orange dots are HTM’s predictions, and the red line is the anomaly score.) (Parameters: SDR length = 512, density = 4.6%.)
Is there any way to fix this?
@marty1885 I am also interested in testing with this dataset. I plotted the data in “ptbdb_normal.csv”, but it is so noisy that I cannot find any time window showing a good QRST complex like your blue curve.
Did you do any preprocessing beforehand? Thanks
@marty1885: When I load this CSV file, I get the data as a (4046 x 188) matrix. As I understood the data, each column describes an ECG with 4046 recorded time points. To my eyes, the heartbeats are not easily found when I plot them. I am not sure if we are talking about the same thing.
Could you please plot the first column for the first 100 time points? Thanks
The exact opposite is true. Each ECG record contains 187 samples (plus a final value that is the category label), and the file contains 4046 such records. Unlike in mathematics, computers tend to be row-major; the order in which matrix indices are applied is reversed.
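To make the layout concrete, something like this (with numpy and matplotlib) should plot one heartbeat per row:

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt("ptbdb_normal.csv", delimiter=",")
print(data.shape)      # (4046, 188): 4046 records of 187 samples + 1 label
record = data[0, :-1]  # the first ECG record, with the label column dropped
plt.plot(record)
plt.xlabel("sample index")
plt.ylabel("normalized amplitude")
plt.show()
```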
Both the anomaly score and the anomaly likelihood are between 0.0 and 1.0. rawScore is the anomaly score; the anomaly likelihood algorithm needs it to compute the likelihood. anomalyScore is poorly named; I think it represents a container for scores related to anomalies. Inspect it to see what properties it gives you. There are some better docs here.
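A quick sketch of turning raw scores into likelihoods with NuPIC’s AnomalyLikelihood helper (the random arrays are dummy stand-ins for a real model run; verify the method names against your NuPIC version):

```python
import numpy as np
from nupic.algorithms.anomaly_likelihood import AnomalyLikelihood

# Dummy stand-ins for a real run: the metric values and the raw anomaly
# scores (rawScore) the model produced at each step.
values = np.random.rand(500)
raw_scores = np.random.rand(500)

likelihood_helper = AnomalyLikelihood()
for value, raw_score in zip(values, raw_scores):
    likelihood = likelihood_helper.anomalyProbability(value, raw_score)
    # Like the raw score, the likelihood is in [0.0, 1.0]; values near 1.0
    # mean the recent anomaly scores are unusual relative to history.
```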
@marty1885 I have just looked at your source code and am not sure whether your calculation of the success rate is correct.
You accumulate the anomaly scores at every measuring point and compare them to a threshold (14).
Where does the value 14 come from?
In my test, I train the HTM on the first 100 patterns and then test on the remaining data.
If I consider the classification at each ECG measuring point, I get a success rate of over 65% for detecting normal ECGs.
But if I calculate the prediction errors within a class, the success rate is higher…
14 is an arbitrary value that I found (by brute force) to give the best result (the largest difference between the true-positive and false-positive rates). I should have put it in the hyper-parameter list as well.
I calculate the final accuracy of the model as the number of ECG records classified correctly divided by the total number of records.
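Concretely, the per-record classification, the brute-force threshold search, and the accuracy calculation look roughly like this (a sketch; `anomaly_counts` and `labels` are assumed to come from running the test records through the TM):

```python
import numpy as np

# Assumed inputs: anomaly_counts[i] is the number of high-anomaly time steps
# in test record i; labels[i] is 1 if record i is truly abnormal, else 0.
def classify(anomaly_counts, threshold):
    """A record is called abnormal when it contains too many anomalies."""
    return anomaly_counts > threshold

def accuracy(anomaly_counts, labels, threshold):
    """Fraction of records classified correctly out of the total."""
    return np.mean(classify(anomaly_counts, threshold) == labels)

def best_threshold(anomaly_counts, labels):
    """Brute-force the threshold (14 in my case) maximizing TPR - FPR."""
    best, best_gap = 0, -1.0
    for t in range(int(anomaly_counts.max()) + 1):
        predicted = classify(anomaly_counts, t)
        tpr = np.mean(predicted[labels == 1])  # true-positive rate
        fpr = np.mean(predicted[labels == 0])  # false-positive rate
        if tpr - fpr > best_gap:
            best, best_gap = t, tpr - fpr
    return best
```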
@marty1885: Now I can improve the detection rate to 92.42%. To do this, I use the first 300 patterns of both the normal and abnormal ECGs, then classify the remaining ~14000 patterns into the two classes.
I will test HTM for MNIST classification next week…