I am building an anomaly detector using the HTM community fork (htm.core) and Spark. As a first step, I am running this anomaly detector over NAB to see whether it can catch the anomalies in the data.
I am not able to decide what parameters to set for the encoders, spatial pooler, and temporal memory. After some digging, I found this post https://discourse.numenta.org/t/dont-swarm-for-anomaly-models/915 by @rhyolight. Essentially, it sets minValue, maxValue, and resolution for the encoder and uses this file to set the parameters for the spatial pooler and temporal memory. Does this mean that we can start anomaly detection with the above-mentioned parameters on any data set?
Even after copying the parameters from above, the anomaly detection is not working as intended. How should I go about it?
PS: I intend to catch temporal as well as spatial anomalies.
We have found that it works very well on the vast majority of scalar data sets.
Are you using the raw anomaly score, or are you also calculating an anomaly likelihood? The likelihood gives you something you can apply a threshold to, which lets you classify what counts as an anomaly.
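For example, a minimal sketch of thresholding the likelihood (the threshold value here is just an assumption, not a recommended setting; NAB converts the likelihood into a log score first, which the code further down also does):
import math

LIKELIHOOD_THRESHOLD = 0.99999  # hypothetical operating point

def is_anomalous(likelihood):
    return likelihood >= LIKELIHOOD_THRESHOLD

# NAB-style alternative: convert to a log score first, which spreads out
# likelihood values that crowd up near 1.0, then threshold that instead.
def log_score(likelihood):
    return math.log(1.0000000001 - likelihood) / math.log(1.0 - 0.9999999999)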
What are you expecting, vs what is really happening?
I am using the log anomaly likelihood.
As mentioned in this file, do we have to keep the radius for timeOfDay at 9.49? Shouldn't this represent the time difference (in hours) between two consecutive data points? For example, all the AWSCloudwatch readings in the NAB dataset are 5 minutes apart, so I was keeping the radius for timeOfDay at 0.08. Also, do we always keep dayOfWeek and weekend null?
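For concreteness, this is how the two radius choices would look with the htm.core DateEncoder (whether 0.08 is the right value is exactly what I am asking; the radius controls how far apart, in hours, two timestamps must be before their encodings stop overlapping):
from htm.encoders.date import DateEncoder

# timeOfDay is (active bits, radius in hours): timestamps closer together
# than the radius share overlapping bits; timestamps further apart do not.
coarseEncoder = DateEncoder(timeOfDay=(21, 9.49))  # ~9.5 h of overlap
fineEncoder = DateEncoder(timeOfDay=(21, 0.08))    # ~5 min of overlap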
Please see the following screenshot:
Graph1: Highlights the anomaly region as provided by NAB result data.
Graph2: Plots the anomaly scores as provided by NAB result data.
Graph3: Plots the anomaly scores calculated by my anomaly detector.
I am using the following code to generate the results:
from pyspark import SparkContext
import math
import datetime
import numpy as np
import matplotlib.pyplot as plt
from htm.bindings.sdr import SDR, Metrics
from htm.encoders.rdse import RDSE, RDSE_Parameters
from htm.bindings.encoders import ScalarEncoder, ScalarEncoderParameters
from htm.encoders.date import DateEncoder
from htm.bindings.algorithms import SpatialPooler
from htm.bindings.algorithms import TemporalMemory
from htm.algorithms.anomaly_likelihood import \
    AnomalyLikelihood  # FIXME use TM.anomaly instead, but it gives worse results than the py.AnomalyLikelihood now
from htm.bindings.algorithms import Predictor
sc = SparkContext("local", "Anomaly Detection")
parameters = {
  'enc': {
    "rdse":
      {'resolution': 0.001, 'activeBits': 21, 'size': 700, 'sparsity': 0.02},
    "scalar":
      {'minimum': 0, 'maximum': 100, 'activeBits': 21, 'resolution': 0.001},
    "time":
      {'timeOfDay': (21, 9.49)}
  },
  'predictor': {'sdrc_alpha': 0.035828933612157998},
  'sp': {'boostStrength': 0.0,
         'columnCount': 2048,
         'localAreaDensity': 0.01953125,
         'seed': 1956,
         'potentialPct': 0.8,  # fraction of input bits each SP column can potentially connect to
         'synPermActiveInc': 0.003,
         'synPermConnected': 0.2,
         'synPermInactiveDec': 0.0005},
  'tm': {'activationThreshold': 13,
         'cellsPerColumn': 32,
         'initialPerm': 0.21,
         'seed': 1960,
         'maxSegmentsPerCell': 128,
         'maxSynapsesPerSegment': 32,
         'minThreshold': 10,
         'newSynapseCount': 20,
         'permanenceDec': 0.1,
         'permanenceInc': 0.1},
  'anomaly': {
    'likelihood': {
      # 'learningPeriod': int(math.floor(self.probationaryPeriod / 2.0)),
      # 'probationaryPeriod': self.probationaryPeriod - default_parameters["anomaly"]["likelihood"]["learningPeriod"],
      'learningPeriod': 500,
      'reestimationPeriod': 100
    }
  }
}
useRdseEncoder = True
useScalarEncoder = False
if useRdseEncoder and useScalarEncoder:
    raise Exception("Enable only one of the RDSE and scalar encoders.")
elif (not useScalarEncoder) and (not useRdseEncoder):
    raise Exception("Enable at least one encoder.")
dateEncoder = DateEncoder(timeOfDay=parameters["enc"]["time"]["timeOfDay"])
rdseEncoderParams = RDSE_Parameters()
rdseEncoderParams.size = parameters["enc"]["rdse"]["size"]
# rdseEncoderParams.sparsity = parameters["enc"]["rdse"]["sparsity"]
rdseEncoderParams.resolution = parameters["enc"]["rdse"]["resolution"]
rdseEncoderParams.activeBits = parameters["enc"]["rdse"]["activeBits"]
rdseEncoder = RDSE(rdseEncoderParams)
scalarEncoderParams = ScalarEncoderParameters()
scalarEncoderParams.minimum = parameters["enc"]["scalar"]["minimum"]
scalarEncoderParams.maximum = parameters["enc"]["scalar"]["maximum"]
scalarEncoderParams.activeBits = parameters["enc"]["scalar"]["activeBits"]
scalarEncoderParams.resolution = parameters["enc"]["scalar"]["resolution"]
scalarEncoder = ScalarEncoder(scalarEncoderParams)
if useScalarEncoder:
    valueEncoder = scalarEncoder
else:
    valueEncoder = rdseEncoder
encodingWidth = (dateEncoder.size + valueEncoder.size)
enc_info = Metrics([encodingWidth], 999999999)
# Make the HTM. SpatialPooler & TemporalMemory & associated tools.
spParams = parameters["sp"]
sp = SpatialPooler(
    inputDimensions=(encodingWidth,),
    columnDimensions=(spParams["columnCount"],),
    potentialPct=spParams["potentialPct"],
    potentialRadius=encodingWidth,
    globalInhibition=True,
    localAreaDensity=spParams["localAreaDensity"],
    seed=spParams["seed"],
    synPermInactiveDec=spParams["synPermInactiveDec"],
    synPermActiveInc=spParams["synPermActiveInc"],
    synPermConnected=spParams["synPermConnected"],
    boostStrength=spParams["boostStrength"],
    wrapAround=True
)
sp_info = Metrics(sp.getColumnDimensions(), 999999999)
tmParams = parameters["tm"]
tm = TemporalMemory(
    columnDimensions=(spParams["columnCount"],),
    cellsPerColumn=tmParams["cellsPerColumn"],
    seed=tmParams["seed"],
    activationThreshold=tmParams["activationThreshold"],
    initialPermanence=tmParams["initialPerm"],
    connectedPermanence=spParams["synPermConnected"],
    minThreshold=tmParams["minThreshold"],
    maxNewSynapseCount=tmParams["newSynapseCount"],
    permanenceIncrement=tmParams["permanenceInc"],
    permanenceDecrement=tmParams["permanenceDec"],
    predictedSegmentDecrement=0.0,
    maxSegmentsPerCell=tmParams["maxSegmentsPerCell"],
    maxSynapsesPerSegment=tmParams["maxSynapsesPerSegment"]
)
tm_info = Metrics([tm.numberOfCells()], 999999999)
# setup likelihood
anParams = parameters["anomaly"]["likelihood"]
# probationaryPeriod = anParams["probationaryPeriod"]
learningPeriod = anParams["learningPeriod"] #int(math.floor(probationaryPeriod / 2.0))
anomaly_history = AnomalyLikelihood(learningPeriod=learningPeriod,
                                    reestimationPeriod=anParams["reestimationPeriod"])
print("======================================================")
print("MODELS CREATED")
print("======================================================")
# Iterate through every datum in the dataset, record the inputs & outputs.
inputs = []
anomaly = []
anomalyProb = []
anomalyProb_fromResult = []
label_fromResult = []
SPATIAL_TOLERANCE = 0.05
minValueSeenSoFar = None
maxValueSeenSoFar = None
records = (sc.textFile('NAB/results/numenta/realAWSCloudwatch/numenta_rds_cpu_utilization_e47b3b.csv')
             .map(lambda line: tuple(line.split(',')[:5]))
             .collect())
print("======================================================")
print("DATA LOADED")
print("======================================================")
count = 0
for record in records:
    if count > 0:  # skip the CSV header row
        # Convert the date string into a Python datetime object.
        timestamp = datetime.datetime.strptime(record[0], "%Y-%m-%d %H:%M:%S")
        cpu = float(record[1])
        inputs.append(cpu)
        anomalyProb_fromResult.append(float(record[2]))
        label_fromResult.append(float(record[4]))

        # Call the encoders to create bit representations for each value. These are SDR objects.
        dateBits = dateEncoder.encode(timestamp)
        cpuBits = valueEncoder.encode(cpu)

        # Concatenate these encodings into one large encoding for Spatial Pooling.
        encoding = SDR(encodingWidth).concatenate([cpuBits, dateBits])
        enc_info.addData(encoding)

        # Create an SDR to represent active columns. It is populated by the
        # compute method below and must match the Spatial Pooler's dimensions.
        activeColumns = SDR(sp.getColumnDimensions())

        # Execute the Spatial Pooling algorithm over the input space.
        sp.compute(encoding, True, activeColumns)
        sp_info.addData(activeColumns)

        # Execute the Temporal Memory algorithm over the active mini-columns.
        tm.compute(activeColumns, learn=True)
        tm_info.addData(tm.getActiveCells().flatten())

        # Update the min/max values seen so far and check for a spatial anomaly:
        # a value falling more than SPATIAL_TOLERANCE outside the observed range.
        spatialAnomaly = False
        if minValueSeenSoFar != maxValueSeenSoFar:  # both None (first row) or equal: no range to compare against yet
            tolerance = (maxValueSeenSoFar - minValueSeenSoFar) * SPATIAL_TOLERANCE
            maxValueExpected = maxValueSeenSoFar + tolerance
            minValueExpected = minValueSeenSoFar - tolerance
            if cpu > maxValueExpected or cpu < minValueExpected:
                spatialAnomaly = True
        if maxValueSeenSoFar is None or cpu > maxValueSeenSoFar:
            maxValueSeenSoFar = cpu
        if minValueSeenSoFar is None or cpu < minValueSeenSoFar:
            minValueSeenSoFar = cpu

        # Convert the raw TM anomaly score into a likelihood, then into the
        # NAB-style log score; the constant is log(1.0 - 0.9999999999) ~ log(1e-10).
        anomalyLikelihood = anomaly_history.anomalyProbability(cpu, tm.anomaly)
        logAnomalyScore = math.log(1.0000000001 - anomalyLikelihood) / -23.02585084720009
        if spatialAnomaly:
            anomalyProb.append(1.0)
        else:
            anomalyProb.append(logAnomalyScore)
    count = count + 1
print("======================================================")
print("MODEL HAS BEEN FED WITH THE DATA")
print("======================================================")
# Normalize the inputs once so every panel shares a 0-1 scale.
inputs = np.array(inputs) / max(inputs)

plt.subplot(3, 1, 1)
plt.title("Label - Result")
plt.xlabel("Time")
plt.ylabel("Value")
plt.plot(np.arange(len(inputs)), inputs, 'red',
         np.arange(len(inputs)), label_fromResult, 'blue')
plt.legend(labels=('Input', 'Label - Result'))

plt.subplot(3, 1, 2)
plt.title("Anomaly Scores - Result")
plt.xlabel("Time")
plt.ylabel("Value")
plt.plot(np.arange(len(inputs)), inputs, 'red',
         np.arange(len(inputs)), anomalyProb_fromResult, 'blue')
plt.legend(labels=('Input', 'Anomaly Scores - Result'))

plt.subplot(3, 1, 3)
plt.title("Log Anomaly Scores - Detection")
plt.xlabel("Time")
plt.ylabel("Value")
plt.plot(np.arange(len(inputs)), inputs, 'red',
         np.arange(len(inputs)), anomalyProb, 'blue',
         np.arange(len(inputs)), np.array([0.3] * len(inputs)), 'green')
plt.legend(labels=('Input', 'Log Anomaly Scores - Detection', 'Threshold'))
plt.show()
Feel free to change these things. They are generic settings. You'll do a better job tuning them to your data if you know the dataset.
Regarding your results, I agree you should see an anomaly where you are not seeing one. However, you are comparing two different HTM implementations. When the community forked NuPIC, we at Numenta were clear that we could not vet that the algorithms in the fork could be compared directly with the algorithms in NuPIC (which are the ones run in NAB). Any model parameter change or difference, or a default setting in htm.core, could be altering results. There are a lot of places to troubleshoot, and I'm not sure where to start helping you, especially since I have not worked on htm.core.
At this point, I have decided to move back to NuPIC since it has better support from you guys.
Is my understanding right that the radius should be the time difference in hours between two data points?
I read in one of your posts that one TM detector cannot learn short-term and long-term patterns. So what would be the ideal frequency of data for learning the long-term patterns? My data is expected to have daily patterns, with weekdays differing from weekends.
The maxVal and minVal in getScalarMetricWithTimeOfDayAnomalyParams are used to calculate the resolution for the RDSE. Should minVal and maxVal be the extreme ends of the values the data can take, or the min and max of the dataset? Example: let's say we are running the detector on CPU utilization data, so the min and max values this data can take are 0 and 100 respectively, but the dataset might only contain values between 30 and 79. In this case, what should minVal and maxVal be?
I treat this as a hyper-parameter to the whole system: sample a few of the first rows to get a rough sense of what min/max granularity makes sense, e.g. using percentiles. Here's the source for the getScalarMetricWithTimeOfDayAnomalyParams RDSE param generator:
It sets the min/max range for the resolution by widening the max/min found in metricData by one sigma on each side, with a 0.001 floor on the resolution.
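A minimal sketch of that heuristic, assuming NuPIC's defaults (numBuckets = 130 is the helper's default as far as I recall, so treat it as an assumption):
import numpy as np

def rdse_resolution(metric_data, num_buckets=130.0):
    # Widen the observed range by one standard deviation on each side,
    # then divide it into buckets, with a 0.001 floor on the resolution.
    values = np.asarray(metric_data, dtype=float)
    sigma = values.std()
    min_val = values.min() - sigma
    max_val = values.max() + sigma
    return max(0.001, (max_val - min_val) / num_buckets)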
You should read this: NuPIC & Python 2 end of life
@rhyolight thanks for the heads-up. I have been using Docker for my experiments, so I guess I am good for now until I think of a new strategy.
Is there any documentation on htm.core that I can refer to for better understanding?
Any comments on this?
What post? I don't think this is true. One TM can learn all kinds of patterns and sub-patterns; it just doesn't know when any of them actually ends unless you manually tell it with a reset().
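For example, a minimal sketch of marking sequence boundaries (the sequences iterable here is hypothetical; tm is a TemporalMemory as in the code above):
# 'sequences' is a hypothetical iterable of lists of active-column SDRs.
for sequence in sequences:
    for activeColumns in sequence:
        tm.compute(activeColumns, learn=True)
    tm.reset()  # mark the end of the pattern; nothing is learned across the boundary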
I meant that it cannot learn both at the same time. Can it?
Yes, it can learn long and short patterns at the same time.
Hi, thanks for the clarification!
What does the numBuckets parameter represent there? I can't find anything in the docs...
Any hint is highly appreciated!
The encoders quantize the input into buckets; numBuckets would be the number of distinct values the encoder can produce.
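As a rough illustration (this mirrors the classic NuPIC scalar-encoder bucket math, not necessarily an htm.core API):
# With a fixed range, the bucket count is roughly range/resolution + 1.
min_val, max_val, resolution = 0.0, 100.0, 0.001    # values from the params above
num_buckets = (max_val - min_val) / resolution + 1  # ~100001 distinct encodings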
I assume you are using htm.core. Where did you find that parameter? I don't see it in the TM.