Encoding vision for HTM

Regarding encoders, I completely agree. I also think using today’s ML techniques for encoding is a good idea and will probably be useful. I’ve already floated an idea Frank Carey had at a previous hackathon: use DL to extract features from frames of video and create an SDR stream in which each bit is a feature. Add weighting to that, or convert those features into semantic fingerprints? Who knows. There is so much potential here.

I tested a lot of stuff like that, including the OpenCV retina.
Convolutional network features worked better.

Which output of the OpenCV retina do you put into HTM: the parvo or the magno image?

I didn’t get good results from this, so I don’t know if my experiments will illuminate anything, but I fed downsampled images into the retina, and took the top 5% of the activations of each of the channels as my SDRs. Then I concatenated the parvocellular with the magnocellular SDRs and that was my input.
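
For readers who want to try the encoding described above, here is a minimal sketch. It assumes the parvo and magno outputs are already available as NumPy arrays (random arrays stand in for them here), and the function names `top_k_sdr` and `retina_sdr` are made up for illustration. It is not the code used in the experiments above.

```python
import numpy as np

def top_k_sdr(activations, sparsity=0.05):
    """Binary SDR with the top `sparsity` fraction of activations set to 1."""
    flat = np.asarray(activations, dtype=float).ravel()
    k = max(1, int(round(sparsity * flat.size)))
    # argpartition leaves the indices of the k largest values in the last
    # k slots; ties are broken arbitrarily.
    active = np.argpartition(flat, -k)[-k:]
    sdr = np.zeros(flat.size, dtype=np.uint8)
    sdr[active] = 1
    return sdr

def retina_sdr(parvo, magno, sparsity=0.05):
    """Threshold each channel separately, then concatenate parvo and magno."""
    parvo = np.asarray(parvo)
    # Assumption: a 2-D parvo image is one channel; otherwise the last
    # axis indexes the color channels.
    if parvo.ndim == 2:
        channels = [parvo]
    else:
        channels = [parvo[..., c] for c in range(parvo.shape[-1])]
    channels.append(np.asarray(magno))
    return np.concatenate([top_k_sdr(ch, sparsity) for ch in channels])

# Stand-in data: a 32x32 three-channel parvo image and a 32x32 magno image.
rng = np.random.default_rng(0)
parvo = rng.random((32, 32, 3))
magno = rng.random((32, 32))
sdr = retina_sdr(parvo, magno)
print(sdr.size, int(sdr.sum()))  # 4096 204
```

Thresholding per channel keeps the sparsity fixed at about 5% in every part of the concatenated SDR, so no single bright channel can dominate the input.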

I didn’t pursue this much further because the retina is intrinsically foveated, and controlling saccades for active foveation was far out of scope for the place recognition task I was working on. Without active foveation I think the nonuniform sampling of my imagery was probably throwing away a lot of useful information.

@jakebruce OK. Do you have any results comparing CNN+NuPIC against pure DL on any recognition task?

Not on hand. This was very early in my exploration and I pruned this path off quickly because it wasn’t performing well. To give you an idea without specific numbers, some of the state-of-the-art place recognition algorithms simply accumulate the sum of absolute differences (SAD) between images over a sequence, and use that as the difference metric. As simplistic as that is, it was outperforming the temporal memory representation on every dataset by a wide margin. This shouldn’t be surprising: the temporal memory representation has very poor invariance properties. It’s a pattern separator, not a pattern completer.
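
To make the SAD-over-a-sequence idea concrete, here is a rough sketch. It assumes grayscale frames of equal size and a one-to-one frame alignment between the query and the reference traverse; real systems may add steps such as contrast normalization or a search over relative speeds, which are left out. The function names are illustrative only.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally sized images."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.abs(a - b).sum())

def sequence_difference(query, reference):
    """Accumulate SAD over two frame-aligned sequences of equal length."""
    if len(query) != len(reference):
        raise ValueError("sequences must have the same length")
    return sum(sad(q, r) for q, r in zip(query, reference))

def best_alignment(query, reference):
    """Slide the query along the reference; return (start index, score)
    of the window with the lowest accumulated difference."""
    n = len(query)
    if n == 0 or n > len(reference):
        raise ValueError("query must be non-empty and no longer than reference")
    scores = [sequence_difference(query, reference[i:i + n])
              for i in range(len(reference) - n + 1)]
    start = int(np.argmin(scores))
    return start, scores[start]

# Stand-in data: 20 random reference frames; the query is a noisy copy
# of frames 7..11.
rng = np.random.default_rng(1)
reference = [rng.random((16, 16)) for _ in range(20)]
query = [f + 0.01 * rng.standard_normal(f.shape) for f in reference[7:12]]
start, score = best_alignment(query, reference)
print(start)  # 7
```

A single noisy frame can match the wrong place, but the accumulated score over several consecutive frames is much harder to fool, which is the point about the sequence search doing the work.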

I also wasn’t using NuPIC, I have my own implementations. So you may or may not find that useful.

Note however, “pure DL” is not the state of the art on place recognition. Using DL features in the framework of accumulating differences across sequences can help with invariance, but it’s the sequence search that does the work.

@jakebruce have you had any success?

I’m not sure what you mean. I’ve had success using SAD and CNN features. I haven’t been using temporal memory features for place recognition recently.

@jakebruce how can you put SAD features into your HTM? Do you need an encoder to convert the features into binary?

RatSLAM has done scene recognition using an FFT-filtered strip view.

I find the results to be surprisingly effective in a cluttered office environment.

@Bitking but it is not clear to me how to use it with HTM.

@thanh-binh.to Sorry, I might not have been clear. I am not using HTM for this. SAD is a measure of difference between images, so that’s not what you’d input to an HTM. It’s a way to compare two images, not a way to encode an image.

The RatSLAM style image encoding would be reasonably easy to use in HTM. It gives you a vector of intensities, and you could take the top 5% or something to make an SDR.
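
A quick sketch of what that could look like. The column-averaging step is my assumption about how a strip becomes a vector of intensities, and `strip_profile` and `profile_to_sdr` are hypothetical names, not functions from any RatSLAM codebase.

```python
import numpy as np

def strip_profile(image, row_start, row_end):
    """Average a horizontal band of a grayscale image to one value per column."""
    band = np.asarray(image, dtype=float)[row_start:row_end, :]
    return band.mean(axis=0)

def profile_to_sdr(profile, sparsity=0.05):
    """Set the bits at the brightest `sparsity` fraction of columns."""
    profile = np.asarray(profile, dtype=float)
    k = max(1, int(round(sparsity * profile.size)))
    sdr = np.zeros(profile.size, dtype=np.uint8)
    sdr[np.argsort(profile)[-k:]] = 1
    return sdr

# Stand-in data: a 40x200 random image, using rows 10..29 as the strip.
rng = np.random.default_rng(2)
image = rng.random((40, 200))
profile = strip_profile(image, 10, 30)
sdr = profile_to_sdr(profile)
print(profile.shape, int(sdr.sum()))  # (200,) 10
```

One bit per column gives a fairly small SDR; whether that carries enough semantic overlap between nearby views is something you would have to test.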

Have you examined how they encode strips to get a data set?
The RatSLAM GitHub repositories have the source code, which you can examine to see how they did it.

I don’t know how strong your background is in algorithms and various coding techniques but I will try to keep this as general as possible considering the source material.

RatSLAM borrows some of the observed organization of the hippocampus.

The body of the hippocampus and related entorhinal cortex are known to encode the environment as responses in “grid cells” - cells that fire in relation to where a critter is in a given environment. This grid pattern varies in spacing as you go from front to back.

In HTM we would generally call this an encoder; in this case, an encoder of spatial features.

If you were to put lots of SDR dendrites through this tissue, they would be sampling features of the environment. These samples are collected in an area commonly called “place cells”, which is a store of significant places (good or bad) inside this environment. If something good or bad happens, you trigger learning of this place.

This is not strictly based on visual cues: a rat traveling through dark tunnels also updates these grid cells. In the RatSLAM model, we also keep track of head/body orientation and use the distance traveled to get a rough idea of where we are.
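
The orientation-plus-distance bookkeeping is plain dead reckoning, which can be sketched in a few lines. This is only an illustration of the idea, not RatSLAM's own mechanism; the `integrate_pose` name and the (x, y, heading) tuple are my assumptions.

```python
import math

def integrate_pose(pose, distance, turn):
    """Update an (x, y, heading) pose: turn first, then move `distance`
    along the new heading. Angles are in radians."""
    x, y, heading = pose
    heading = (heading + turn) % (2 * math.pi)
    return (x + distance * math.cos(heading),
            y + distance * math.sin(heading),
            heading)

# Drive around a unit square: one step forward, then three left turns
# of 90 degrees, each followed by one step.
pose = (0.0, 0.0, 0.0)
steps = [(1.0, 0.0), (1.0, math.pi / 2), (1.0, math.pi / 2), (1.0, math.pi / 2)]
for distance, turn in steps:
    pose = integrate_pose(pose, distance, turn)
print(pose)  # back near the origin, heading 3*pi/2
```

With noisy odometry the error in such an estimate grows without bound, which is why the visual grounding described next matters.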

RatSLAM does use vision to ground the location in observations. The observed environment in this model is a set of visual strips covering the entire local environment. Presenting these directly to a memory would mostly add noise, because the raw strips contain lots of high-frequency edges, so they do a rather clever trick: they analyze the frequency content of the image. This translates spatial data into frequency data. (This is the Fourier domain they are speaking of.) A large object contributes a low-frequency component in that part of the image; a smaller object shows up as a higher-frequency blip at its location. Two close views of the same scene would have the same frequency blips in about the same places.
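
Here is a small sketch of the frequency idea as I read it. To keep the "in that part of the image" information, it runs a short-window FFT along a 1-D strip profile; the window size, hop, and Hann taper are my choices for illustration. A later reply in this thread reports finding no FFT in the RatSLAM source, so treat this as a sketch of the concept and not as RatSLAM's code.

```python
import numpy as np

def windowed_spectrum(profile, window=32, hop=16):
    """Magnitude spectrum of each overlapping window along a 1-D profile.
    Returns an array of shape (n_windows, window // 2 + 1)."""
    profile = np.asarray(profile, dtype=float)
    taper = np.hanning(window)
    rows = []
    for start in range(0, profile.size - window + 1, hop):
        segment = profile[start:start + window]
        # Remove the local mean so overall brightness does not dominate.
        segment = (segment - segment.mean()) * taper
        rows.append(np.abs(np.fft.rfft(segment)))
    return np.array(rows)

# Stand-in strip: a broad, slow variation on the left half ("large object")
# and a fine ripple on the right half ("small objects").
x = np.arange(128)
profile = np.where(x < 64,
                   np.sin(2 * np.pi * x / 32),   # 1 cycle per window
                   np.sin(2 * np.pi * x / 4))    # 8 cycles per window
spec = windowed_spectrum(profile)
print(spec.shape)                                       # (7, 17)
print(int(spec[0].argmax()), int(spec[-1].argmax()))    # 1 8
```

Each row is a local frequency fingerprint, so two nearby views of the same scene should give similar rows at similar positions. Taking the strongest few percent of entries would turn this into an SDR.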

A different way to think about grid cells!
(For you fellow theorists: the variable-sized grid cell patterns in the hippocampus as you travel along the dorsal-ventral axis sure look a lot like a spatial FFT plot. The body already does this encoding trick in the ear, which performs a mechanical FFT from waveform to frequencies. (The brain does chords in processing.)

Speculation: For the neural equivalent of the inner-ear labyrinth, think of these long curled limbic structures with oscillation going on. The length vs width must have a resonant frequency in terms of propagation, just as it would in the sound domain. Phase encoding against the predominant 10 Hz wave could allow some interesting computational tricks.)

This is a very crude but effective object recognition method which also encodes relative arrangement between two objects.

A new view can be compared against the stored views to select the closest match.

The various metrics (orientation, distance traveled, scenes) can be fused into a real-time map of the environment. One would hope that they are stored in a way where the scene location key is close to the other position data keys in the map as it would be in the brain.

I hope this gives you the tools to read the ratslam paper and understand what they are doing.

For bonus points consider a closely related structure - the amygdala. It also samples the encoded information in the hippocampus and entorhinal cortex. It reacts to things like faces and expressions and various archetypal things like snakes and looming overhead objects. These sensations are colored with emotional tones that drive action and memory. Think about what this means - somehow this soup of encoded features can be recognized as high-level objects. These are hardwired primitives in our lizard brain.


@Bitking thanks for your explanation. In the meantime, I have read some papers and even the RatSLAM source code to better understand the visual encoding. So far, I cannot find any use of FFT for encoding vision in the source code. Maybe the FFT was only an early idea, and in the final version and the software implementation it turned out to be more practical to use very simple algorithms for calculating the match quality between two images than expensive methods. I believe RatSLAM combines visual sensing with sensor motion and simulates the function/perception of the entorhinal cortex…

@Bitking I greatly appreciate your detailed explanation; it really helps me think about my new project: real-time decision making for a robot with a simple goal (like scanning or cleaning all of the accessible area in a room) while moving through a familiar or novel environment.

I’ve only read the abstract of the RatSLAM paper so far, so I expect to learn a lot from it. I just wanted to ask plainly whether you see a natural fit between the data it produces and HTM (and ideally NuPIC). Thanks.


I have spent considerable time thinking about this exact question: What exactly is it that the brain learns about what it sees and what does it do with this learning?

The recognition of objects and encoding of space and navigation to important landmarks in this space are some of the oldest problems that neural systems had to solve. This solution seems to be located in the subcortical structures and they do not match the general plan of the six-layer cortical arrangement.

I don’t know if Nupic will be useful here. Nupic has an intense focus on the function of the cortex and the special features of its six-layer arrangement.

There is (at this time) a very weak nod to the hierarchical organization of HTM but at this point, I don’t see a solid theoretical framework to show how information is distributed in this arrangement of parts. Going further - the arrangement of information that was originally proposed in the HTM theory is in flux.

Please note that Jeff Hawkins & crew have started to pay more attention to the hippocampus region of the brain in regards to the encoding of space. I would not be at all surprised if this theoretical focus trickles down to something usable in the Nupic universe.


Yes, there are multiple implementations of the RatSLAM project. I suggest that you look through the links on the front page of the GitHub repository I linked. A basic Google search for “ratslam” pulls up a wealth of projects & papers where each researcher has added a personal take on implementation.

Even if you don’t use the FFT method to encode vision (listed in one of the links), most of the implementations use some sort of visual feedback to learn the ground truth of where the robot is actually located.

I find the one where they strapped a MacBook to a car roof to be very amusing.


@Bitking: I have looked at different implementations of RatSLAM (Python, C++, Matlab, etc.), and all of them use very simple image matching algorithms.
I am not sure if the FFT method is effective for encoding vision. I will quickly test it in Matlab…
Currently I am investing a little time in Numenta’s lateral pooler, because I think it looks quite good.


Please report on what you find. I am also very interested in how well HTM works on image recognition.
Learning what works and does not work helps to avoid duplication of effort.

I am asking about this in a different but related thread.