Encoding vision for HTM



I see where you are coming from, especially in the long run. However, I am having a hard time imagining a use case that would be practical given the current abilities of HTM. Can you describe a case where learning repeated inputs is actually useful? Off the top of my head, if I give the agent a delayed positive reward when it encounters a specific pattern, I guess learning a repeated sequence would help, because it would be seeing the same thing over and over and the reward would arrive at, for example, the 5th identical input. But I am not sure this is a practical case or a central ability, considering the complexities and implications it introduces into the system. As I said, maybe we are using the wrong tool for timing. Moreover, it seems to me that the important tasks that require timing would also require a lot more than just current HTM + timing (which I think is why you said “partially”), but I probably lack your experience in the timing department.

Repeating input as timing mechanism

I just realized that I did not give an actual answer to your question. Have you tried adjusting the stimulusThreshold variable of the Spatial Pooler, for example setting it to zero? This would produce active columns even if only a single input bit is on.

If you want activity at all times, you could set that variable to 0 and add an extra, constant overlap to all columns so that the ones getting the least use (highest boost value) become active. This would ensure activation even without any input, but I am not sure about its usefulness because I have tried it in the past.
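
A rough sketch of that idea, in case it helps. This is not NuPIC code: the function name, its arguments, and the choice to add the constant before boosting are all my own assumptions for illustration. Only stimulusThreshold corresponds to a real Spatial Pooler parameter.

```python
import numpy as np

def active_columns(raw_overlap, boost, num_active,
                   constant_overlap=1.0, stimulus_threshold=0):
    """Pick winning columns (hypothetical helper, not the NuPIC API)."""
    raw_overlap = np.asarray(raw_overlap, dtype=float)
    boost = np.asarray(boost, dtype=float)
    # Assumption: the constant is added before boosting, so with no input
    # at all the score of a column is proportional to its boost value.
    score = (raw_overlap + constant_overlap) * boost
    # With a threshold of 0 every column is eligible, even with zero overlap.
    eligible = np.flatnonzero(raw_overlap >= stimulus_threshold)
    # Highest-scoring eligible columns win; ties keep their index order.
    order = eligible[np.argsort(-score[eligible], kind="stable")]
    return np.sort(order[:num_active])

# No input bits are on, yet the two most-boosted columns still activate.
winners = active_columns(np.zeros(5), [1.0, 3.0, 2.0, 1.5, 1.0], num_active=2)
print(winners)
```

With real input, the constant only matters for columns whose raw overlap is small, so it mostly acts as a tie-breaker in favour of rarely used columns.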


I’ve started another topic HERE to discuss, since it is getting off topic.


I had an idea on how to improve this color discretization step. I have implemented and am currently experimenting with these ideas, with marginal success.

Two requirements for encoded image SDRs are that every bit of information represents a range of possible values, and that for every input a certain fraction of bits activate in the SDR. The effects of these requirements are that each output bit is inaccurate and redundant. This implementation makes every output bit receptive to a unique range of inputs. These ranges are uniformly distributed through the input space, and their widths are the size of the input space multiplied by the target density (the fraction of bits that should be active). This meets both requirements for being an SDR and has more representational power than if many of the bits represented the same ranges. This design makes all of those redundancies add useful information.


Don’t you lose some semantic similarity this way? I may have misunderstood it. Can you describe it in other words?


I will write out the steps. First, the setup steps, which are performed at program start-up:

  1. Assume the encoder will accept a grey-scale image with (M x N) pixels to encode into an SDR with dimensions (M x N x C) bits, where C is the number of bits in the output SDR for each input pixel.
  2. Let BIN_CENTERS = Array[M x N x C]
    This array stores a (floating-point) number for each bit in the output SDR. Fill BIN_CENTERS with uniform random numbers in the same range as the pixel intensity (typically 0 to 255, but depends on the input data format).
  3. Let BIN_WIDTH = data-range * (1 - target-sparsity)
    Where data-range is the theoretical range of pixel intensities. For an 8 bit image this is 256, but it depends on the input data format.
    Where target-sparsity is the desired fraction of zeros in the output SDR.
  4. Let BIN_LOWER_BOUNDS = BIN_CENTERS - BIN_WIDTH / 2 and BIN_UPPER_BOUNDS = BIN_CENTERS + BIN_WIDTH / 2
    The shapes of both of these arrays are (M x N x C). Together these arrays describe the ranges of input values which each output bit will be responsive to.

Steps to encode an image:

  1. Let IMAGE = Array[M x N], this is the input, it is real valued (aka floating point).
  2. Let SDR = Array[M x N x C], this is the output, it is boolean.
  3. Iterate through the SDR using the indexes (x, y, z), and set every bit of the SDR according to step 4.
  4. Let SDR[x, y, z] = BIN_LOWER_BOUNDS[x, y, z] <= IMAGE[x, y] and IMAGE[x, y] <= BIN_UPPER_BOUNDS[x, y, z].
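
The setup and encoding steps above can be sketched in NumPy roughly as follows. This is a minimal illustration rather than the actual implementation: the class name, the argument names, and the seed are made up here, and the lower and upper bounds are assumed to be the bin centers minus and plus half the bin width.

```python
import numpy as np

class GreyscaleBinEncoder:
    """Hypothetical sketch of the random-bin image encoder described above."""

    def __init__(self, shape, bits_per_pixel, fraction_zeros,
                 data_range=256.0, seed=0):
        rng = np.random.default_rng(seed)
        m, n = shape
        # One random bin center per output bit, drawn from the pixel range.
        self.centers = rng.uniform(0.0, data_range, size=(m, n, bits_per_pixel))
        # BIN_WIDTH = data-range * (1 - target-sparsity)
        self.width = data_range * (1.0 - fraction_zeros)
        # Assumption: bounds are centered on BIN_CENTERS.
        self.lower = self.centers - self.width / 2.0
        self.upper = self.centers + self.width / 2.0

    def encode(self, image):
        """Map an (M x N) real-valued image to an (M x N x C) boolean SDR."""
        image = np.asarray(image, dtype=float)
        pixels = image[:, :, np.newaxis]  # broadcast each pixel over its C bits
        return (self.lower <= pixels) & (pixels <= self.upper)

encoder = GreyscaleBinEncoder(shape=(50, 50), bits_per_pixel=8, fraction_zeros=0.875)
sdr = encoder.encode(np.full((50, 50), 128.0))
print(sdr.shape, sdr.mean())
```

Broadcasting replaces the explicit (x, y, z) loop from step 3. Note that bins whose centers fall near the ends of the pixel range extend partly outside it, so very dark or very bright pixels will activate fewer bits than mid-range ones.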

To encode color images, create a separate encoder for each color channel. Then recombine the output SDRs into a single monolithic SDR by multiplying them together. Multiplication is equivalent to logical “and” in this situation. Notice that the combined SDR’s sparsity is different: the fraction of bits which are active in the combined SDR is the product of the fractions of bits which are active in each of the input SDRs. For example, to recreate the original poster’s example with 8 bits per pixel and a density of 1/8, create three encoders with 8 bits per pixel and a density of 1/2.
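
Here is a small sketch of the color case, using the 8 bits per pixel and density of 1/2 per channel from the example. The function names and the test image are illustrative only, and each bit is treated as active when the pixel lies within half a bin width of that bit's center.

```python
import numpy as np

def encode_channel(channel, centers, width):
    """Boolean (M x N x C) SDR for one channel; a bit is on if the pixel
    value lies within half a bin width of that bit's center."""
    pixels = np.asarray(channel, dtype=float)[:, :, np.newaxis]
    return np.abs(centers - pixels) <= width / 2.0

def encode_rgb(image, centers_per_channel, width):
    """Combine the per-channel SDRs with logical 'and'."""
    sdr = np.ones(centers_per_channel[0].shape, dtype=bool)
    for c, centers in enumerate(centers_per_channel):
        sdr &= encode_channel(image[:, :, c], centers, width)
    return sdr

rng = np.random.default_rng(0)
# Three independent encoders, 8 bits per pixel each.
centers_per_channel = [rng.uniform(0.0, 256.0, size=(64, 64, 8)) for _ in range(3)]
width = 256.0 * 0.5                   # density of 1/2 per channel
image = np.full((64, 64, 3), 128.0)   # mid-grey test image
combined = encode_rgb(image, centers_per_channel, width)
density = combined.mean()             # should land near (1/2)**3 = 1/8
print(density)
```

The product rule only holds because the three sets of bin centers are drawn independently. Reusing the same centers for every channel would correlate the bits and change the combined density.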

What follows is a discussion of this encoder’s semantic similarity properties. Semantic similarity happens when two inputs which are similar have similar SDR representations. This encoder design does two things to cause semantic similarity: (1) SDR bits are responsive to a range of input values, and (2) topology allows nearby bits to represent similar things.

  1. Effects of thresholds:
    Many encoders apply thresholds to real valued input data, which converts it into boolean output. In this encoder, the thresholds are ranges which are referred to as ‘bins’. A small change in the input value might cause some of the output bits to change, and a large change in input value will change all of the bits in the output. How sensitive the output bits are to changes in the input value (that is, the semantic similarity) is determined by the sizes of the bins. The sizes of the bins are in turn determined by the sparsity, as follows:
    Assume that the inputs are distributed in a uniform random way throughout the input range. The size of the bins then determines the probability that an input value will fall inside of a bin. This means that the sparsity is related to the size of the bins, which in turn means that the sparsity is related to the amount of semantic similarity. This may seem counter-intuitive but this same property holds true for all encoders which use bins to convert real numbers to discrete bits.

  2. Effects of topology:
    This encoder relies on topology in the input image, the idea that adjacent pixels in the input image are likely to show the same thing.
    2A) If an area of the output SDR cannot represent a color because it did not generate the bins needed to, then it may still be near a pixel which does represent the color. In the case where there is fewer than one active output bit per input pixel, multiple close-together outputs can work together to represent the input.
    2B) If an output changes in response to a small change in input, then some semantic similarity is lost. Topology allows nearby outputs to represent the same thing, and since each output uses random bins they will not all change when the input reaches a single threshold.
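
To make the threshold effect from item 1 concrete, here is a quick numerical check of how much two input values share. The function and its parameters are hypothetical. For uniformly random bin centers and values away from the ends of the range, the shared fraction of active bits should come out near max(0, 1 - |a - b| / width).

```python
import numpy as np

def overlap_fraction(a, b, width, data_range=256.0, num_bits=100_000, seed=0):
    """Fraction of the bits active for value a that are also active for b,
    estimated with num_bits random bins (illustrative helper)."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(0.0, data_range, size=num_bits)
    active_a = np.abs(centers - a) <= width / 2.0
    active_b = np.abs(centers - b) <= width / 2.0
    shared = np.count_nonzero(active_a & active_b)
    return shared / max(np.count_nonzero(active_a), 1)

# Same pair of inputs, two bin widths: wider bins keep more bits in common.
print(overlap_fraction(100.0, 116.0, width=32.0))
print(overlap_fraction(100.0, 116.0, width=64.0))
# Inputs further apart than one bin width share nothing.
print(overlap_fraction(100.0, 140.0, width=32.0))
```

So a denser SDR (wider bins) tolerates larger changes in intensity before two inputs stop looking alike, which is the link between sparsity and semantic similarity described above.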

Repo for merging various Encoders

If you look at the OgmaNeo software implemented by Eric Laukien, you will find some ideas for encoding images, such as the STDP encoder, the chunk encoder, etc.
Moreover, OpenCV also has a bio-inspired retina, so you can get motion information from images and feed it into the SP.
I believe those encoders are good enough for initial testing, before trying a CNN or something else from deep learning.
Any idea or comment?


I tested a lot of stuff like that, including the OpenCV retina. Convolutional network features worked better. This shouldn’t be surprising, since they effectively “cheat” by using backpropagation to learn the best possible representations (or at least locally optimal ones).

Maybe you want to try purely biological representations first, that’s fine. But another approach would be to say “how well can I do even if I cheat?”

And I’ve found that it’s still hard to get very far on any hard problem. So starting from a less powerful technique just to maintain biological plausibility seems like a bad strategy.


Regarding encoders, I completely agree. I also think using today’s ML techniques for encoding is a good idea and will probably be useful. I’ve already thrown around the idea Frank Carey had at a previous hackathon: using DL to do feature extraction from frames of video in order to create an SDR stream, where each bit is a feature. Add weighting to that, or convert those features into semantic fingerprints? Who knows. There is so much potential here.

I tested a lot of stuff like that, including the OpenCV retina. 
Convolutional network features worked better

Which output of the OpenCV retina do you put into HTM? The Parvo or the Magno image?


I didn’t get good results from this, so I don’t know if my experiments will illuminate anything, but I fed downsampled images into the retina, and took the top 5% of the activations of each of the channels as my SDRs. Then I concatenated the parvocellular with the magnocellular SDRs and that was my input.
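
For anyone wanting to try the same thing, the "top 5% of activations" step might look roughly like this. It is only a sketch: the function names are mine, the retina outputs are stood in for by plain arrays, and no OpenCV calls are shown.

```python
import numpy as np

def top_fraction_sdr(activations, fraction=0.05):
    """Flat boolean SDR with the strongest `fraction` of activations set."""
    flat = np.asarray(activations, dtype=float).ravel()
    k = max(1, int(round(fraction * flat.size)))
    # argpartition places the k largest values in the last k positions.
    winners = np.argpartition(flat, -k)[-k:]
    sdr = np.zeros(flat.size, dtype=bool)
    sdr[winners] = True
    return sdr

def retina_sdr(parvo, magno, fraction=0.05):
    """Concatenate the parvocellular and magnocellular SDRs."""
    return np.concatenate([top_fraction_sdr(parvo, fraction),
                           top_fraction_sdr(magno, fraction)])

# Stand-in arrays; in practice these would be the two retina output images.
parvo = np.arange(100, dtype=float).reshape(10, 10)
magno = parvo[::-1, ::-1]
sdr = retina_sdr(parvo, magno)
print(sdr.size, int(sdr.sum()))
```

Thresholding each output separately keeps the density fixed at the chosen fraction in both halves of the SDR, regardless of how strong the motion or detail response happens to be in a given frame.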

I didn’t pursue this much further because the retina is intrinsically foveated, and controlling saccades for active foveation was far out of scope for the place recognition task I was working on. Without active foveation I think the nonuniform sampling of my imagery was probably throwing away a lot of useful information.


@jakebruce OK. Do you have any results comparing CNN+NuPIC with pure DL on any recognition task?


Not on hand. This was very early in my exploration and I pruned this path off quickly because it wasn’t performing well. To give you an idea without specific numbers, some of the state of the art place recognition algorithms simply accumulate the sum of absolute differences between images (SAD) over a sequence, and use that as the difference metric. And as simplistic as that is, it was outperforming the temporal memory representation on every dataset by a wide margin. And this shouldn’t be surprising, the temporal memory representation has very poor invariance properties: it’s a pattern separator, not a pattern completer.
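
To show how simple the SAD baseline is, here is a stripped-down sketch. The helper names are invented for this example, and it leaves out any image normalization or speed search that a real place recognition system might add.

```python
import numpy as np

def sad(image_a, image_b):
    """Sum of absolute differences between two equally sized images."""
    a = np.asarray(image_a, dtype=float)
    b = np.asarray(image_b, dtype=float)
    return float(np.abs(a - b).sum())

def sequence_difference(query_frames, reference_frames):
    """Accumulate SAD over two aligned sequences of frames."""
    if len(query_frames) != len(reference_frames):
        raise ValueError("sequences must have the same length")
    return sum(sad(q, r) for q, r in zip(query_frames, reference_frames))

def best_match(query_frames, reference_frames):
    """Slide the query over the reference sequence and return the start
    index with the smallest accumulated difference, plus all scores."""
    n = len(query_frames)
    scores = [sequence_difference(query_frames, reference_frames[i:i + n])
              for i in range(len(reference_frames) - n + 1)]
    return int(np.argmin(scores)), scores

# Toy data: six constant "images"; the query is frames 2 and 3 of the route.
reference = [np.full((2, 2), float(i)) for i in range(6)]
query = reference[2:4]
index, scores = best_match(query, reference)
print(index, scores)
```

Summing over a sequence is what gives the robustness: a single frame can match badly in many places, but a run of consecutive frames rarely does.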

I also wasn’t using NuPIC, I have my own implementations. So you may or may not find that useful.

Note however, “pure DL” is not the state of the art on place recognition. Using DL features in the framework of accumulating differences across sequences can help with invariance, but it’s the sequence search that does the work.


@jakebruce did you have any success?


I’m not sure what you mean. I’ve had success using SAD and CNN features. I haven’t been using temporal memory features for place recognition recently.


@jakebruce how do you put SAD features into your HTM? Do you need an encoder to convert the features into binary?


RatSLAM has done scene recognition using an FFT-filtered strip view.

I find the results to be surprisingly effective in a cluttered office environment.


@Bitking but it is not clear to me how to use it with HTM.


@thanh-binh.to Sorry, I might not have been clear. I am not using HTM for this. SAD is a measure of difference between images, so that’s not what you’d input to an HTM. It’s a way to compare two images, not a way to encode an image.

The RatSLAM style image encoding would be reasonably easy to use in HTM. It gives you a vector of intensities, and you could take the top 5% or something to make an SDR.


Have you examined how they encode strips to get a data set?
The RatSLAM GitHub repositories have the source code, which you can examine to see how they did it.