How Should I Represent My Input with Multiple fields?

encoders
anomaly-detection

#1

I have some data that’s basically the vehicle flow over a bunch of road lanes in a 5-minute period with a timestamp (e.g. lane 1: 10 cars, lane 2: 5, lane 3: 15, etc.). How should I best represent this as an input to my model using the OPF? My aim is to detect anomalies over the whole input. At first I thought I could just sum them all, but that gave poor results (see my last post). Plus, it’s intuitive to think that congestion in one lane will affect the flow in another lane (e.g. a crash in a lane makes that lane slower and the others relatively higher), and these are the sorts of anomalies that I want to capture.

So I thought I could provide each sensor to my model as an input, but then I need a predicted field. I also tried something similar a while ago, using the highest-traffic sensor as the predicted field, with pretty poor results.

So should I create my own encoder that encodes each lane sensor into a single value? How do I go about implementing this and then using it? I figured it would just be a bunch of scalar encoders concatenated. Is it possible to pass through my own encoded data to a CLA model?


#2

I like your idea of creating a custom encoder. Here’s how I would do it.

Assumptions:

  • all lanes have a min of 0 and share a logical max value
  • lane traffic has spatial correlations
  • no classifiers are needed because you just want an overall anomaly indication

I would create a ScalarEncoder for each lane, but limit the number of bits (n) for each one so the output encoding is not so large. For example, n=40, w=3. You may want to play around with these numbers to get the encoding resolution you desire.

Even with 10 lanes, the concatenated encoding would have only 400 bits (the same as a typical scalar encoder) and would still contain the semantics you need. This could be much more manageable for the spatial pooler.
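To make the idea concrete, here is a minimal pure-Python sketch of "one small scalar encoding per lane, concatenated". It is a simplified stand-in, not NuPIC's ScalarEncoder: the helper names `encode_scalar` and `encode_lanes` are made up for illustration, and the range 0 to 70 is just an assumed shared max.

```python
def encode_scalar(value, minval=0, maxval=70, n=40, w=3):
    """Return n bits with one contiguous run of w active bits.

    Simplified illustration only; NuPIC's ScalarEncoder has more options.
    """
    # Clip out-of-range values to the encoder's range.
    value = max(minval, min(maxval, value))
    # There are n - w + 1 possible start positions for the active run.
    start = int(round(float(value - minval) / (maxval - minval) * (n - w)))
    bits = [0] * n
    for i in range(start, start + w):
        bits[i] = 1
    return bits


def encode_lanes(counts, **params):
    """Concatenate one scalar encoding per lane, in a fixed lane order."""
    encoding = []
    for lane in sorted(counts):
        encoding.extend(encode_scalar(counts[lane], **params))
    return encoding


# Hypothetical 5-minute sample: lane id -> vehicle count.
counts = {"16": 4, "17": 6, "18": 8, "19": 13, "20": 3, "21": 0}
sdr = encode_lanes(counts)
print(len(sdr), sum(sdr))  # total bits, total active bits
```

With these numbers each lane gets 38 distinct positions, so neighbouring counts share bits while distant counts do not. Changing n and w trades resolution against total input size.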

Are you able to give me a data sample? Now I’m curious how well this would work, and I have the tools to set up an experiment to show how well the SP would handle this if I had a data sample.


#3

Have you considered using the MultiEncoder? It will concatenate multiple inputs into a single SDR exactly as you suggest.


#4

So the MultiEncoder is always used by the OPF to concatenate the outputs of the encoders you specify in your model parameters. So for example you could specify different encoder params for each field inside of modelParams.sensorParams.encoders like this:

'field-one-encoder': {
  'clipInput': True,
  'fieldname': 'field-one',
  'maxval': 70,
  'minval': 0,
  'n': 40,
  'name': 'field-one',
  'type': 'ScalarEncoder',
  'w': 3
},
'field-two-encoder': {
  'clipInput': True,
  'fieldname': 'field-two',
  'maxval': 70,
  'minval': 0,
  'n': 40,
  'name': 'field-two',
  'type': 'ScalarEncoder',
  'w': 3
},
...

This is essentially what you want in order to experiment with different encoder resolutions via the OPF.
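If there are many lanes, the per-field entries can be generated rather than written by hand. The sketch below builds plain dicts in the same shape as the snippet above; `lane_encoders` is a hypothetical helper (not part of NuPIC), and the default range and sizes are simply copied from the example.

```python
def lane_encoders(lane_ids, minval=0, maxval=70, n=40, w=3):
    """Build one ScalarEncoder description per lane field."""
    encoders = {}
    for lane_id in lane_ids:
        field = str(lane_id)
        encoders[field] = {
            "clipInput": True,
            "fieldname": field,  # must match the key used in each data row
            "name": field,
            "type": "ScalarEncoder",
            "minval": minval,
            "maxval": maxval,
            "n": n,
            "w": w,
        }
    return encoders


encoders = lane_encoders([16, 17, 18, 19, 20, 21])
print(sorted(encoders))
```

The resulting dict would be merged into modelParams.sensorParams.encoders alongside any timestamp encoders.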


#5

@rhyolight Here’s the data: https://www.dropbox.com/s/0w5xfh7pp38y6la/lane_data.csv.7z?dl=0

I should also have mentioned that there will be at most 5 lanes together (unless you’ve ever seen a road with 6 or more lanes in a single direction). I’m having trouble getting the MultiEncoder to work as an encoder: when I run the data through my model I get the following exception (this is using the version of nupic from the .whl file on Ubuntu, not built from source):

File "/home/CSEM/mack0242/.local/lib/python2.7/site-packages/nupic/frameworks/opf/clamodel.py", line 748, in _handleCLAClassifierMultiStep
    self._classifierInputEncoder = encoderList[self._predictedFieldIdx]
TypeError: list indices must be integers, not NoneType

My input looks like:

{'timestamp': datetime.datetime(2011, 12, 31, 23, 55), 
 'lanes': {'20': 3, '21': 0, '17': 6, '16': 4, '19': 13, '18': 8}}

The offending nupic code is here: https://github.com/numenta/nupic/blob/master/src/nupic/frameworks/opf/clamodel.py#L730-L734

Here’s my sensorParams:

{
    "sensorAutoReset": null, 
    "encoders": {
        "lanes": {
            "fieldname": "lanes", 
            "type": "MultiEncoder",
            "encoderDescriptions": {
                "20": {
                    "resolution": 0.8, 
                    "fieldname": "20", 
                    "name": "20", 
                    "w": 21, 
                    "type": "RandomDistributedScalarEncoder"
                }, 
                "21": {
                    "resolution": 0.8, 
                    "fieldname": "21", 
                    "name": "21", 
                    "w": 21, 
                    "type": "RandomDistributedScalarEncoder"
                }, 
                "17": {
                    "resolution": 0.8, 
                    "fieldname": "17", 
                    "name": "17", 
                    "w": 21, 
                    "type": "RandomDistributedScalarEncoder"
                }, 
                "16": {
                    "resolution": 0.8, 
                    "fieldname": "16", 
                    "name": "16", 
                    "w": 21, 
                    "type": "RandomDistributedScalarEncoder"
                }, 
                "19": {
                    "resolution": 0.8, 
                    "fieldname": "19", 
                    "name": "19", 
                    "w": 21, 
                    "type": "RandomDistributedScalarEncoder"
                }, 
                "18": {
                    "resolution": 0.8, 
                    "fieldname": "18", 
                    "name": "18", 
                    "w": 21, 
                    "type": "RandomDistributedScalarEncoder"
                }
            }
        }, 
        "timestamp_timeOfDay": {
            "type": "DateEncoder", 
            "timeOfDay": [ 51,  9.49], 
            "fieldname": "timestamp", 
            "name": "timestamp_timeOfDay"
        }, 
        "timestamp_weekend": {
            "weekend": [  51, 9], 
            "fieldname": "timestamp", 
            "name": "timestamp_weekend", 
            "type": "DateEncoder"
        }
    }, 
    "verbosity": 0
}

Additionally, has there been any progress on this issue: https://github.com/numenta/nupic/issues/1712 , because I think my problem is a case where multiple predicted fields would really shine.


#6

I don’t think you need to specify the MultiEncoder in the OPF model params. It is always used to concatenate the encoders together. Just take the hotgym example, remove the “consumption” encoder and add the encoders you want keyed by the labels you want (in my example above I used field-one-encoder, field-two-encoder, but you can name them by any unique string). So if you used names of 20 & 21, for example, you’d pass a row of data like this:

row = {
  "timestamp": datetime.datetime(2011, 12, 31, 23, 55),
  "20": 3,
  "21": 0,
}
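If your records are currently nested like the input shown in post #5, a small helper can flatten them into this shape before they are passed to the model. This is only a sketch; `flatten_row` is a hypothetical name, and it assumes the lane ids do not collide with the "timestamp" key.

```python
import datetime


def flatten_row(record):
    """Turn {'timestamp': ..., 'lanes': {...}} into one flat dict."""
    row = {"timestamp": record["timestamp"]}
    # Each lane id becomes a top-level field, matching its encoder's fieldname.
    row.update(record["lanes"])
    return row


record = {
    "timestamp": datetime.datetime(2011, 12, 31, 23, 55),
    "lanes": {"20": 3, "21": 0, "17": 6, "16": 4, "19": 13, "18": 8},
}
row = flatten_row(record)
print(sorted(row))
```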

As for progress on issue #1712: no, and there likely will not be from Numenta. It is not a high priority given our current research objectives.


#7

Then how do I select a predicted field? Is the anomaly output for the predicted field or the entire input?


#8

Anomaly output is for the entire input. Predicted field doesn’t matter.
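One way to see why: as far as I understand it, the raw anomaly score is the fraction of currently active columns that were not predicted at the previous step, and those columns come from the whole concatenated encoding. The function below is a simplified sketch of that calculation, not NuPIC's own code; the name `raw_anomaly_score` and the column indices are illustrative.

```python
def raw_anomaly_score(active_columns, predicted_columns):
    """Fraction of active columns that were not predicted.

    0.0 means fully predicted, 1.0 means fully unexpected.
    """
    active = set(active_columns)
    if not active:
        # Nothing is active, so there is nothing to be surprised by.
        return 0.0
    predicted = set(predicted_columns)
    return 1.0 - float(len(active & predicted)) / len(active)


print(raw_anomaly_score([1, 2, 3, 4], [1, 2, 3, 4]))
print(raw_anomaly_score([1, 2, 3, 4], [1, 2]))
print(raw_anomaly_score([1, 2], []))
```

No individual field appears anywhere in this calculation, which is why the choice of predicted field does not change the anomaly output.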


#9

Hey,

Quick question for Matt on the suggestion for dealing with the 10 fields.
If I have it right the idea is to encode each field with sort of a
mini-encoder (n=40, w=3), that way 10 of them will fit inside the normal
scalar encoder size of n=400. My question is, what’s the advantage(s) of
doing this rather than increasing the size of the encoder vector, to say
2000 instead of 400 (by giving each field a sub-encoding of n=200 and
w=15), and using local inhibition to handle the larger encoding vector?

It makes sense that keeping the encoding vector the same size would allow
the SP to continue using global inhibition and not get overwhelmed, though
it seems that compressing each field down into a mini-encoder wouldn’t
scale far past 10 fields or so. If it were 20 fields for instance each
field would only be allocated n=20. I remember you mentioning in another
post that imposing topology w/local inhibition had a negative effect on
results and I’m curious what may be lost doing it this way? It seems to
me that single regions higher up the cortical hierarchy may be looking at
multiple sub-regions, meaning that their SPs may be dealing with larger
encoding vectors. I’m really curious to know just how you see this.

Thanks again, and to Jonathan for starting this thread,

– Sam


#10

The bigger the input space of the spatial pooler, the longer it takes to process the input. I’m trying to keep the input smaller. We know from experience that just 4 scalar input fields (1600 bits) can slow the SP down quite a bit. Local inhibition slows things down a lot more, and doesn’t make sense to use unless the input data is topological.

With local inhibition on non-topological input, the SP will be looking for topological associations in the input space where there are none.

Remember we are not dealing with hierarchy yet.
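For reference, the input sizes being compared in the last two posts can be tallied with a few lines of arithmetic. The helper name `layout` is hypothetical; the numbers are the ones quoted in the thread.

```python
def layout(fields, n, w):
    """Summarise a concatenated encoding of `fields` equal-sized encoders."""
    return {
        "total_bits": fields * n,
        "active_bits": fields * w,
        "sparsity": float(w) / n,
    }


mini = layout(fields=10, n=40, w=3)     # the compact suggestion
large = layout(fields=10, n=200, w=15)  # the larger alternative
print(mini)
print(large)
```

Both layouts have the same per-field sparsity (w/n); what differs is the total width the SP has to process, 400 bits versus 2000.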


#11

Ok I can see why it’s best to use global inhibition unless the input data is topological. The MNIST digit you showed was a good example, but I wonder how do you know when input data is going to be topological? Would it be incorrect or incomplete to say that the data is topological if the active bits within the encoding space have causal relationships between each other? I can see how this surely wouldn’t be the case if the incoming streams are independent, though I’m just trying to develop an intuition for when data will be topological other than in the vision modality. Thanks again!!


#12

For instance, take the data streams that Jonathan is using in his post about handling multiple fields. With 10 lanes of highway, it seems that the metric for each lane (amount of traffic) must affect the others, since a blockage in one lane would cause traffic to spill over from it into the other lanes. Would this count as topological associations between the different mini-encoders for each lane that comprise the total encoding vector?