Learning from real world data (ECG Heartbeat Categorization) as anomaly detection using HTM

marty1885 · October 8, 2018, 6:27am

I came across the ECG dataset on Kaggle and thought that I might be able to approach it as an anomaly detection problem. The dataset consists of data in 5 categories: 1 for normal ECG and 4 for abnormal. But since I’m approaching it as anomaly detection, I’ll be training on the “normal” set of data only and using the anomaly score to decide whether a given ECG is “normal” or not.

The setup is simple. Read data from the CSV file (ptbdb_normal.csv), feed it into a scalar encoder, then send the SDR into a Temporal Memory to learn the pattern of a normal heartbeat. Then load data from a test set (ptbdb_abnormal.csv and the part of ptbdb_normal.csv that was not used during training). Run each ECG record through the TM and count how many anomalies it contains. Records containing a large number of anomalies are then considered abnormal.
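For readers who want to see the two pieces around the Temporal Memory, here is a minimal sketch in plain Python. It is not the code from the repository: `encode_scalar` and `raw_anomaly_score` are hypothetical names, the input range of [0, 1] is an assumption about the data, and the SDR size and number of active bits are illustrative values. The encoder maps a value to a contiguous run of active bits, and the raw anomaly score is the fraction of active bits that were not predicted.

```python
def encode_scalar(value, min_val=0.0, max_val=1.0, size=512, active_bits=24):
    """Return the set of active bit indices for `value` (one contiguous run)."""
    if not min_val <= value <= max_val:
        raise ValueError("value outside encoder range")
    # Map the value onto the first index of a run of `active_bits` ones.
    span = size - active_bits
    start = int(round((value - min_val) / (max_val - min_val) * span))
    return set(range(start, start + active_bits))


def raw_anomaly_score(active, predicted):
    """Fraction of active bits that were not predicted (0 = expected, 1 = novel)."""
    if not active:
        return 0.0
    return 1.0 - len(active & predicted) / len(active)


a = encode_scalar(0.50)
b = encode_scalar(0.51)  # nearby value: SDR overlaps heavily with `a`
c = encode_scalar(0.90)  # distant value: SDR shares no bits with `a`
```

Nearby values share most of their active bits, which is what lets the TM treat slightly different heartbeats as the same pattern. In the real pipeline the predicted set would come from the TM's state at the previous time step.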

The results are… interesting. At best, 65% of abnormal records are recognized as abnormal (i.e. true positives), while 10% of normal records are recognized as abnormal (false positives). I consider this a good result, since this is really a classification problem, not anomaly detection. However, HTM seems to be very sensitive to hyper-parameters. Set PermanenceIncrement off by 0.05 and the true positive rate drops by 20%. The same happens when ConnectedPermanence is set too low or too high. Tuning the hyper-parameters can be really frustrating.

Result

And here is the result:
[Image]

Source code: https://github.com/marty1885/heartbeat-htm/tree/master
(Please forgive me for being lazy and not writing a CMake/build script.)

Happy HTM hacking.

This is actually my presentation for an introductory ML class at my college. But I decided to also share my experiences and results here.

rhyolight · October 8, 2018, 4:10pm

This is great! Thanks for sharing!

I suggest you use getScalarMetricWithTimeOfDayAnomalyParams() like NAB does:

github.com/numenta/NAB/blob/master/nab/detectors/numenta/numentaTM_detector.py#L54-L60

          modelParams = getScalarMetricWithTimeOfDayAnomalyParams(
            metricData=[0],
            minVal=self.inputMin-rangePadding,
            maxVal=self.inputMax+rangePadding,
            minResolution=0.001,
            tmImplementation="tm_cpp"
          )["modelConfig"]

These are the best-tuned params we have found for anomaly detection on generic streaming scalar data. (Although the “time of day” part will not help you and you can remove that encoder entirely).

marty1885 · October 10, 2018, 4:16am

Thanks! I tried the values from getScalarMetricWithTimeOfDayAnomalyParams. But they don’t work as well as the ones I ended up with.

Also, I tried to visualize what HTM is predicting. It seems that sometimes HTM predicts nothing (the gap on the left side of the graph). The blue line is the value fed to HTM, the orange dots are HTM’s predictions, and the red line is the anomaly score. (Parameters: SDR length = 512, density = 4.6%)
Is there any way to fix this?

Jonathan_Mackenzie · October 10, 2018, 4:55am

If you want to optimise hyperparameters for your task/dataset, I wrote a script to do it here using hyperopt/hyperas: https://github.com/JonnoFTW/htm-models-adelaide/blob/master/engine/vs_model/optimize_htm.py

The script should be simple to modify for your use case. You can even do it in a distributed fashion if you have access to many different machines.

thanh-binh.to · October 22, 2018, 9:03am

@marty1885 I am also interested in testing with this dataset. I plotted the data in “ptb_normal.csv”, but it is so noisy that I cannot find any time window with a good QRST shape like your blue curve.
Did you do any preprocessing beforehand? Thanks

marty1885 · October 22, 2018, 11:03am

No. The data I have shown is in ptbdb_normal.csv. It should be around the 21st entry in the CSV.

–
I haven’t released the visualization code yet. It is very ugly and hacky right now.

thanh-binh.to · October 22, 2018, 1:05pm

@marty1885: When I load this CSV file, I get the data as a (4046 x 188) matrix. As I understand this data, each column describes an ECG with 4046 recorded time points. To my eyes, the heartbeats are not easy to find when I plot them. I am not sure if we are talking about the same thing.
Could you please plot the 1st column for the first 100 time points? Thanks

marty1885 · October 22, 2018, 1:46pm

Exactly the opposite is true. Each ECG record contains 188 values (187 samples, with the last value being the category), and the file contains 4046 such records. Unlike in mathematics, computers tend to be row-major, so the order in which matrix indices are applied is reversed.
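A minimal sketch of reading the file one row at a time, so that each row is treated as one record with the category as its last value. The two rows below are made-up stand-ins with only 3 samples each, not values from the dataset, and `fake_csv` and `records` are hypothetical names.

```python
import csv
import io

# Two tiny made-up records standing in for rows of the real file:
# each row is one ECG record, samples first, category label last.
fake_csv = "0.1,0.5,0.9,0.0\n0.2,0.6,0.8,1.0\n"

records = []
for row in csv.reader(io.StringIO(fake_csv)):
    values = [float(x) for x in row]
    # Split off the trailing category from the signal samples.
    samples, label = values[:-1], int(values[-1])
    records.append((samples, label))
```

With the real file, replacing `io.StringIO(fake_csv)` with an opened file handle should give one `(samples, label)` pair per ECG record.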

Sure. Here you go.

thanh-binh.to · October 23, 2018, 2:41pm

@marty1885 thanks

thanh-binh.to · October 25, 2018, 7:43pm

@rhyolight: I have just looked at the code of Numenta NAB as you mentioned here:

github.com/numenta/NAB/blob/master/nab/detectors/numenta/numenta_detector.py

and at lines 98-105:

if self.useLikelihood:
  # Compute log(anomaly likelihood)
  anomalyScore = self.anomalyLikelihood.anomalyProbability(
    inputData["value"], rawScore, inputData["timestamp"])
  logScore = self.anomalyLikelihood.computeLogLikelihood(anomalyScore)
  finalScore = logScore
else:
  finalScore = rawScore

As I understand it, the function anomalyProbability() returns the likelihood, and anomalyScore should be (1.0 - likelihood).
My questions:

Am I right here?
Why is the logScore provided by AnomalyLikelihoodAlg comparable to rawScore? I understood that anomalyScore should be comparable to rawScore.

Could you please explain? Thanks

rhyolight · October 25, 2018, 8:16pm

Both the anomaly score and the anomaly likelihood are between 0.0 and 1.0. rawScore is the anomaly score. The anomaly likelihood algorithm needs it to compute the likelihood. The anomalyScore is poorly named; I think it represents a container for scores related to anomalies. Inspect it to see what properties it gives you. Some better docs here.
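To illustrate how a log-scaled likelihood can stay in the same 0.0 to 1.0 range as a raw score, here is a minimal sketch of one such mapping. It is an assumption for illustration, not a copy of the NuPIC implementation: `log_scale` and `eps` are hypothetical names, and the exact constants used by the library may differ.

```python
import math


def log_scale(likelihood, eps=1e-10):
    """Map a likelihood in [0, 1] onto a log scale that is still in [0, 1]."""
    # `eps` keeps the argument of the log strictly positive at likelihood = 1.
    return math.log(1.0 + eps - likelihood) / math.log(eps)


# Likelihoods close to 1.0 are spread out instead of being bunched together.
for p in (0.0, 0.9, 0.99, 0.9999, 1.0):
    print(p, round(log_scale(p), 4))
```

The mapping is monotonic, so a higher likelihood still gives a higher score. It only stretches the region near 1.0, which makes thresholding easier.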

thanh-binh.to · October 26, 2018, 8:48am

@rhyolight: In my experiments, if I use

for testing with a sine wave,
then the anomalyScore increases to 0.9, even with perfect 1-step prediction.
What could be wrong here?

thanh-binh.to · October 26, 2018, 11:53am

I checked again and again, and I am pretty sure that we have to change line 103 to
finalScore = 1 - logScore
so that it works correctly…

marty1885 · October 27, 2018, 1:02am

Hi all.
I just uploaded the slides for my presentation. Feel free to use them for any purpose if you find them useful.

marty1885 · October 27, 2018, 1:05am

@thanh-binh.to
Have you had any luck classifying ECGs? I truly think the 65% accuracy can be improved, but I haven’t found a way to make that happen.

thanh-binh.to · October 27, 2018, 7:06am

@marty1885: I am sure that I can improve it. I am looking at more classes of anomalies, i.e. different illnesses.

thanh-binh.to · October 27, 2018, 1:55pm

@marty1885 I have just looked at your source code and am not sure whether your calculation of the success rate is correct.
You accumulate the anomaly scores over all measuring points and compare the sum to a threshold (14).
Where does the value 14 come from?
In my test, I train HTM with the first 100 patterns and then test with the remaining data.
If I consider the classification for each ECG measuring point, I get a success rate of over 65% for detecting normal ECGs.
But if I calculate the prediction errors within a class, the success rate will be higher…

marty1885 · October 27, 2018, 1:59pm

14 is an arbitrary value that I found (by brute force) to give me the best result, meaning the largest difference between the true-positive and false-positive rates. I should have also put that into the hyper-parameter list.

I calculate the final accuracy of the model as the number of ECG records classified correctly divided by the total number of records.
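A minimal sketch of that brute-force search and the accuracy calculation, in plain Python rather than the repository's own code. The function names are hypothetical and the scores below are made up for illustration; each score stands for the anomaly score accumulated over one ECG record.

```python
def rates(normal_scores, abnormal_scores, threshold):
    """Return (true-positive rate, false-positive rate) at `threshold`."""
    tp = sum(s > threshold for s in abnormal_scores)
    fp = sum(s > threshold for s in normal_scores)
    return tp / len(abnormal_scores), fp / len(normal_scores)


def best_threshold(normal_scores, abnormal_scores, candidates):
    """Brute-force the candidate that maximizes TPR - FPR."""
    def gap(t):
        tpr, fpr = rates(normal_scores, abnormal_scores, t)
        return tpr - fpr
    return max(candidates, key=gap)


def accuracy(normal_scores, abnormal_scores, threshold):
    """Correctly classified records divided by all records."""
    correct = sum(s <= threshold for s in normal_scores)
    correct += sum(s > threshold for s in abnormal_scores)
    return correct / (len(normal_scores) + len(abnormal_scores))


# Made-up accumulated anomaly scores, one per ECG record.
normal = [3.0, 5.5, 8.0, 16.0]
abnormal = [9.0, 15.0, 18.0, 22.0, 30.0]
t = best_threshold(normal, abnormal, range(0, 31))
```

Note that a threshold picked this way is tuned on the same records it is evaluated on, so it is worth checking it against held-out records as well.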

thanh-binh.to · October 28, 2018, 12:13pm

@marty1885: I can now improve the detection rate to 92.42%. To do this, I use the first 300 patterns of the normal and abnormal ECGs, then classify the remaining 14000 patterns in the 2 databases.
I will test HTM on MNIST classification next week…

marty1885 · October 28, 2018, 5:07pm

@thanh-binh.to
That’s amazing. Can you share the code? I think that I can learn a lot from it.
