This plot shows the number of Twitter posts over time for a particular user. The anomaly likelihood and anomaly score plots seem very counter-intuitive: the periods where the post count spikes appear to be exactly the periods where the anomaly likelihood is at its minimum.
The min/max range of my data is roughly 0-80,000, but most of the values fall between 0 and 1,000.
The encoder resolution I am using is between 1.0 and 2.0.
What could be the reason for this behavior of the NuPIC anomaly detection on this data?
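(For context, here is a minimal sketch of how the RandomDistributedScalarEncoder resolution determines which values get similar encodings; the specific values are illustrative only, not taken from the data above.)

from nupic.encoders.random_distributed_scalar import RandomDistributedScalarEncoder

# With resolution=1.0, inputs less than 1.0 apart land in the same bucket and
# get (nearly) identical encodings, while inputs many resolutions apart share
# few or no active bits.
encoder = RandomDistributedScalarEncoder(resolution=1.0)
a = encoder.encode(100)
b = encoder.encode(100.4)   # same bucket as 100
c = encoder.encode(300)     # 200 buckets away from 100
print "overlap(100, 100.4):", int((a * b).sum())   # full overlap
print "overlap(100, 300):  ", int((a * c).sum())   # close to zero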
What do the rest of your model parameters look like?
maxBoost for spatial params is 0.1
global inhibition is 0
potentialPct for spatial params is 0.85
temporal params permanence is 0.15
remaining params are default.
encoder is RandomDistributedScalarEncoder
What about time encoders?
These are the model parameters I am using currently:
'modelParams': {u'anomalyParams': {u'anomalyCacheRecords': None,
                                   u'autoDetectThreshold': None,
                                   u'autoDetectWaitRecords': 5030},
                u'clParams': {u'alpha': 0.035828933612158,
                              u'regionName': u'SDRClassifierRegion',
                              u'steps': u'1',
                              u'verbosity': 0},
                u'inferenceType': u'TemporalAnomaly',
                u'sensorParams': {u'encoders': {'_classifierInput': {'classifierOnly': True,
                                                                     'fieldname': 'post_count',
                                                                     'name': '_classifierInput',
                                                                     'resolution': 1.0,
                                                                     'type': 'RandomDistributedScalarEncoder'},
                                                'post_count': {'fieldname': 'post_count',
                                                               'name': 'post_count',
                                                               'resolution': 1.0,
                                                               'type': 'RandomDistributedScalarEncoder'}},
                                  u'sensorAutoReset': None,
                                  u'verbosity': 0},
                u'spEnable': True,
                u'spParams': {u'columnCount': 2048,
                              u'globalInhibition': 0,
                              u'inputWidth': 0,
                              'maxBoost': 0.1,
                              u'numActiveColumnsPerInhArea': 40,
                              u'potentialPct': 0.85,
                              u'seed': 1956,
                              u'spVerbosity': 0,
                              u'spatialImp': u'cpp',
                              u'synPermActiveInc': 0.02,
                              u'synPermConnected': 0.2,
                              u'synPermInactiveDec': 0.005},
                u'tpEnable': True,
                u'tpParams': {u'activationThreshold': 13,
                              u'cellsPerColumn': 32,
                              u'columnCount': 2048,
                              u'globalDecay': 0.0,
                              u'initialPerm': 0.21,
                              u'inputWidth': 2048,
                              u'maxAge': 0,
                              u'maxSegmentsPerCell': 128,
                              u'maxSynapsesPerSegment': 32,
                              u'minThreshold': 10,
                              u'newSynapseCount': 20,
                              u'outputType': u'normal',
                              u'pamLength': 3,
                              u'permanenceDec': 0.15,
                              u'permanenceInc': 0.15,
                              u'seed': 1960,
                              u'temporalImp': u'cpp',
                              u'verbosity': 0},
                u'trainSPNetOnlyIfRequested': False},
u'predictAheadTime': None,
u'version': 1}
Also, initializing the model params requires minVal and maxVal as inputs. What happens when future input values fall well outside those limits? For example, in this case the input range is initially around 0-1,000 but later grows to about 80,000. Are the model params updated as the input changes?
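(As far as I know, minVal and maxVal are only used once, at model-creation time, to pick an RDSE resolution; the encoder itself has no hard limits, so later values outside that range are still encoded, just with coarser discrimination, and the model params are not updated automatically. A rough sketch of the kind of calculation involved; the constants numBuckets and minResolution are assumptions about the defaults, not values taken from this model:)

minVal, maxVal = 0.0, 1000.0   # hypothetical initial range
numBuckets = 130.0             # assumed default number of buckets
minResolution = 0.001          # assumed lower bound on resolution

# The resolution is derived once from the initial range; a later jump to
# 80,000 does not change it.
resolution = max(minResolution, (maxVal - minVal) / numBuckets)
print resolution               # ~7.7: values closer than this share an encoding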
How are the tweet counts aggregated? What does the tweet count represent? Only one user’s tweets? That’s a lot of tweets, even if this is a daily aggregation.
Ideally, the aggregation would be small enough that you could incorporate time of day into the encoding. So a 15 minute aggregation is usually good.
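(If it helps, here is a minimal sketch of a 15-minute aggregation, assuming the raw tweets live in a CSV with a timestamp column; pandas is not part of NuPIC, just a convenient way to do the bucketing, and the file and column names are hypothetical:)

import pandas as pd

# One row per tweet, with a "timestamp" column (assumed layout).
raw = pd.read_csv("tweets.csv", parse_dates=["timestamp"])

# Count tweets in 15-minute buckets; empty buckets become zero counts.
counts = (raw.set_index("timestamp")
             .resample("15T")
             .size()
             .rename("post_count"))
counts.to_csv("post_counts_15min.csv", header=True)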
I think I’ve been confused about some of the things I’ve told you about RDSE resolution… stand by, I will probably have corrections.
Right now the aggregation is hourly (it may also be changed to daily or 15 minutes). Actually, it is not for a single user; it is more like tweets related to a particular topic from multiple users.
How do I include time of day in the encoding?
I did figure out how to encode the date and time. The parameters for that are as follows:
time of day: (21, 6)
day of week: (21, 3)
season: (21, 4)
Now the plot that I get is as follows:
It does look much better than the previous one. But I observed that after the model processes a lot of data, the anomaly score becomes very low. I think that is due to the date encoder parameters I’ve used. How should I decide these parameters?
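(As a reference point, here is a minimal sketch of what those (width, radius) tuples mean for nupic's DateEncoder; the example datetime is arbitrary:)

from datetime import datetime
from nupic.encoders.date import DateEncoder

# Each tuple is (number of bits, radius). The radius is in hours for timeOfDay
# and in days for dayOfWeek and season; a larger radius means coarser buckets.
date_encoder = DateEncoder(timeOfDay=(21, 6),
                           dayOfWeek=(21, 3),
                           season=(21, 4))
bits = date_encoder.encode(datetime(2017, 3, 15, 14, 30))
print date_encoder.getWidth(), int(bits.sum())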
In the model params you pasted earlier, your spParams.inputWidth is 0, which is not right. That should be the number of bits in the encoding. How did you come up with your model params?
To find out what spParams.inputWidth should be, I think you might be able to get this number by calling:
model._getEncoder().getWidth()
I am using the getScalarMetricWithTimeOfDayAnomalyParams function from nupic.frameworks.opf.common_models.cluster_params and then configuring some of the parameters as needed. I didn’t change the inputWidth; that was returned by this function. The encoder width is 400, as given by model._getEncoder().getWidth().
There may be a problem with that function, or maybe you are using it wrong. Here is the code example from the docs:
from nupic.frameworks.opf.model_factory import ModelFactory
from nupic.frameworks.opf.common_models.cluster_params import (
    getScalarMetricWithTimeOfDayAnomalyParams)

params = getScalarMetricWithTimeOfDayAnomalyParams(
    metricData=[0],
    tmImplementation="cpp",
    minVal=0.0,
    maxVal=100.0)

model = ModelFactory.create(modelConfig=params["modelConfig"])
model.enableLearning()
model.enableInference(params["inferenceArgs"])
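(Continuing from that example, here is a rough sketch of how the anomaly score and anomaly likelihood are typically pulled out of the OPF model on each record. The field names "c0" and "c1" are what I believe this params function uses by default, so treat them as an assumption and match them to the encoder fieldnames in your modelConfig; the sample counts are made up:)

from datetime import datetime, timedelta
from nupic.algorithms.anomaly_likelihood import AnomalyLikelihood

likelihood_helper = AnomalyLikelihood()
start = datetime(2017, 1, 1)
sample_counts = [120, 95, 103, 4000, 110]   # made-up hourly post counts

for i, count in enumerate(sample_counts):
    timestamp = start + timedelta(hours=i)
    # "model" is the OPF model created above; "c0"/"c1" must match the
    # encoder fieldnames in modelConfig.
    result = model.run({"c0": timestamp, "c1": float(count)})
    anomaly_score = result.inferences["anomalyScore"]
    likelihood = likelihood_helper.anomalyProbability(count, anomaly_score,
                                                      timestamp)
    print timestamp, anomaly_score, likelihood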
That is exactly how I am using it. The only difference is that I modify some of the parameters from params before I call ModelFactory.create(modelConfig=params["modelConfig"]).
I assume you are using maxVal=50000.0 (or whatever the max actually is)?
Yes, I am using the max value and min value of the actual data for that.
Then it returns params that include datetime encoder configurations, right? Your first set of model params did not have those. You should use the datetime encoder configurations the function returns.
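(One quick way to check, assuming params is the dict returned by getScalarMetricWithTimeOfDayAnomalyParams, is to print the encoder section and confirm it contains the time-of-day entries alongside the scalar one:)

import json

# Dump the encoder configurations that the params function returned.
encoders = params["modelConfig"]["modelParams"]["sensorParams"]["encoders"]
print json.dumps(encoders, indent=2)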
I tried with the default datetime encoder parameters as returned by the function. I get the plot as follows:
It still looks like it would label a lot of data points as anomalies.
The anomaly likelihood levels look much better. Now you can flag anomalies by setting a 0.9999 threshold on the anomaly likelihood. Adjust this value for higher or lower sensitivity to anomalies.
It looks like it is finding things that are not directly attributed to the spikes in tweets. It would be interesting to see some of these plots closer up with the dates displayed.
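(For what it's worth, a minimal sketch of that thresholding step; the likelihood values below are hypothetical stand-ins for the per-record anomaly likelihoods computed by the model:)

THRESHOLD = 0.9999

# Hypothetical (timestamp, anomalyLikelihood) pairs, just to make this runnable.
likelihoods = [("2017-01-01 00:00", 0.42),
               ("2017-01-01 01:00", 0.99995),
               ("2017-01-01 02:00", 0.87)]

for timestamp, likelihood in likelihoods:
    if likelihood >= THRESHOLD:
        print "anomaly at", timestamp, "likelihood =", likelihood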
In this case, the anomaly likelihood values seem to be pretty high, but that can change over time. With an earlier signal, I was getting significantly lower anomaly likelihood values, and there I would need to select a lower threshold. If we do not know the range of the input data beforehand, or if it is probable that the range of the input stream will change significantly over time, would the same anomaly likelihood threshold work for the complete time series?