Encoding vision for HTM

Did you simulate the function of an event-based camera on top of an RGB sensor? It looks like you are mostly detecting edges and generating the corresponding images. Do you use a Difference of Gaussians filter? There is an interesting project, “Virtual Retina”, whose software can generate retina spikes from static images.


@thanh-binh.to Yes, it takes the RGB sensor output and checks for color changes above a threshold x; a change above the threshold turns the corresponding pixel on, and that is the event itself. I did not use anything else, so computationally it literally costs nothing on top of the RGB sensor.
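For clarity, here is a minimal NumPy sketch of that idea (the threshold value and frame layout are illustrative, not the exact code):

```python
import numpy as np

def rgb_events(prev_frame, frame, threshold=16):
    """Crude event-camera simulation: a pixel emits an 'event' when any of
    its color channels changes by more than `threshold` between frames.
    Frames are uint8 arrays of shape (H, W, 3)."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > threshold).any(axis=-1)  # boolean event map, shape (H, W)
```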

@sunguralikaan By using an event-based camera, we often run into the unpleasant situation of having very few events, or even no events at all. In those cases the number of active cells in the spatial pooler drops dramatically.

@sunguralikaan @ycui do you know how to avoid this problem? Thanks

I tried using boostStrength, but it is very time-consuming and the SP runs too slowly!
I tried changing other parameters, but with no success.

I became interested in event-based sensors because I thought maybe we should look at the problem differently. Normal sensors communicate the state of the input. Event-based sensors communicate the change in the input. As you said, there is no communication when there is no transition. On the other hand, HTM learns the transitions/changes of the input too. So if HTM learns transitions, why should it learn anything when there are no input transitions? In other words, do we really need to learn, for example, A->B->B->B->C? Why is learning that sequence as A->B->C (or as A->B and B->C, as separate sequences) not enough? Maybe there is another solution for the need to learn A->B->B->B->C (if there is any such need), and maybe we are using the wrong tool to make up for this. The autonomous agent that I work on learns sequences with parts that have no input transitions. This just leads to unnecessary stalls and redundant action selections. Why can’t it learn only the transitions, the stuff that actually changes?

From this perspective, event-based sensors are a perfect match for HTM. Having no active columns at all when there are no transitions makes sense, and maybe the learning should be designed around this. Or maybe this is all wrong :slight_smile:

Bonus idea: Think about the functionality of manual reset tags in current HTM theory. No input change and as a result, no active neurons would be the reset itself. A sequence would reset itself naturally when the data stops changing. For some reason, this sounds so right.


Timing is going to be very important for a game agent. One way to partially accomplish timing in HTM today (until a more scalable, biologically inspired timing mechanism is implemented) is with repeating inputs. This is one of the reasons I have spent significant time exploring ways of handling repeating inputs in sequence memory (I am also concerned with being able to exhibit some degree of timing).


I see where you are coming from, especially in the long run, although I am having a hard time imagining an actual use case that would be practical for the current abilities of HTM. Can you describe a case where learning repeated inputs is actually useful? Off the top of my head, if I give the agent a positive reward with a delay when it encounters a specific pattern, I guess learning a repeated sequence would help, because it would be seeing the same thing over and over and the reward would come, for example, at the 5th identical input. But I am not sure if this is a practical case or a central ability, considering the complexities and implications it introduces into the system. As I said, maybe we are using the wrong tool for timing. Moreover, it seems to me that the important tasks that would require timing would also require a lot more than just current HTM + timing (which is why I think you said “partially”), but I probably lack your experience in the timing department.

I just realized that I did not give an actual answer to your question. Have you tried adjusting the stimulusThreshold variable of the Spatial Pooler? For example, setting it to zero. This would result in active columns even if you have a single on input bit.

If you want activity at all times, you could set that variable to 0 and add an extra, constant overlap to all columns so that the ones getting the least use (highest boost value) become active. This would ensure activation even without any input, but I am not sure about its usefulness, even though I have tried it in the past.
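For reference, a minimal sketch of the stimulusThreshold suggestion, assuming NuPIC’s Python SpatialPooler API (the dimensions here are illustrative):

```python
import numpy as np
from nupic.algorithms.spatial_pooler import SpatialPooler

sp = SpatialPooler(
    inputDimensions=(1024,),
    columnDimensions=(2048,),
    stimulusThreshold=0,  # columns can become active even with a single on input bit
    boostStrength=0.0,    # raise only if you can afford the slowdown mentioned above
)

input_vector = np.zeros(1024, dtype=np.uint32)
input_vector[0] = 1                      # a single active input bit
active = np.zeros(2048, dtype=np.uint32)
sp.compute(input_vector, True, active)   # still produces active columns
```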

I’ve started another topic HERE to discuss, since it is getting off topic.


I had an idea on how to improve this color discretization step. I have implemented and am currently experimenting with these ideas, with marginal success.

Two requirements for encoded image SDRs are that every bit of information represents a range of possible values, and that for every input a certain fraction of the bits in the SDR activate. The effects of these requirements are that each output bit is inaccurate and redundant. This implementation makes every output bit receptive to a unique range of inputs. These ranges are uniformly distributed through the input space, and their widths are the size of the input space multiplied by the target fraction of active bits. This meets both requirements for being an SDR and has more representational power than if many of the bits represented the same ranges. This design makes all of those redundancies add useful information.

Don’t you lose some semantic similarity this way? I may have misunderstood it. Can you describe it in other words?

I will write out the steps, first the set-up steps, which are performed at program start-up (a NumPy sketch of both phases follows the encoding steps below):

  1. Assume the encoder will accept a grey-scale image with (M x N) pixels to encode into an SDR with dimensions (M x N x C) bits, where C is the number of bits in the output SDR for each input pixel.
  2. Let BIN_CENTERS = Array[M x N x C]
    This array stores a (floating-point) number for each bit in the output SDR. Fill BIN_CENTERS with uniform random numbers in the same range as the pixel intensity (typically 0 to 255, but depends on the input data format).
  3. Let BIN_WIDTH = data-range * (1 - target-sparsity)
    Where data-range is the theoretical range of pixel intensities. For an 8 bit image this is 256, but it depends on the input data format.
    Where the target-sparsity is the desired fraction of zeros in the output SDR.
  4. Let BIN_LOWER_BOUNDS = BIN_CENTERS - BIN_WIDTH/2
    Let BIN_UPPER_BOUNDS = BIN_CENTERS + BIN_WIDTH/2
    The shapes of both of these arrays are (M x N x C). Together these arrays describe the ranges of input values which each output bit will be responsive to.

Steps to encode an image:

  1. Let IMAGE = Array[M x N], this is the input, it is real valued (aka floating point).
  2. Let SDR = Array[M x N x C], this is the output, it is boolean.
  3. Iterate through the SDR using the indexes (x, y, z), and set every bit of the SDR according to step 4.
  4. Let SDR[x, y, z] = BIN_LOWER_BOUNDS[x, y, z] <= IMAGE[x, y] and IMAGE[x, y] <= BIN_UPPER_BOUNDS[x, y, z].
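
Here is a minimal NumPy sketch of the steps above (the class name and default values are mine, for illustration):

```python
import numpy as np

class RandomBinEncoder:
    """Each output bit responds to a random range ('bin') of pixel
    intensities, as described in the set-up steps above."""

    def __init__(self, M, N, C, sparsity=7 / 8, data_range=256.0, seed=None):
        rng = np.random.default_rng(seed)
        # Step 2: one random bin center per output bit.
        centers = rng.uniform(0.0, data_range, size=(M, N, C))
        # Step 3: bin width from the target sparsity (fraction of zeros).
        width = data_range * (1.0 - sparsity)
        # Step 4: per-bit lower and upper bounds.
        self.lower = centers - width / 2.0
        self.upper = centers + width / 2.0

    def encode(self, image):
        """image: (M, N) array of intensities -> (M, N, C) boolean SDR."""
        pix = image[:, :, np.newaxis]  # broadcast against the C bins per pixel
        return (self.lower <= pix) & (pix <= self.upper)

enc = RandomBinEncoder(64, 64, 8, seed=42)
gray = np.random.randint(0, 256, size=(64, 64))
sdr = enc.encode(gray)  # roughly 1/8 of bits active (slightly fewer near the
                        # extremes of the range, due to edge effects)
```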

To encode color images, create separate encoders for each color channel, then recombine the output SDRs into a single monolithic SDR by multiplying them together (multiplication is equivalent to logical “and” in this situation). Notice that the combined SDR’s sparsity is different: the fraction of bits which are active in the combined SDR is the product of the fractions of bits which are active in the input SDRs. For example, to recreate the original poster’s example with 8 bits per pixel and a density of 1/8, create three encoders with 8 bits per pixel and a density of 1/2.
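Continuing the sketch above, the per-channel combination looks like this; the logical AND multiplies the densities (1/2 * 1/2 * 1/2 = 1/8):

```python
import numpy as np

# One encoder per color channel, each with density 1/2 (sparsity 1/2).
enc_r = RandomBinEncoder(64, 64, 8, sparsity=0.5, seed=1)
enc_g = RandomBinEncoder(64, 64, 8, sparsity=0.5, seed=2)
enc_b = RandomBinEncoder(64, 64, 8, sparsity=0.5, seed=3)

rgb = np.random.randint(0, 256, size=(64, 64, 3))

# Elementwise AND of the channel SDRs; because the bins are independently
# random, about 1/8 of the combined bits end up active.
combined = (enc_r.encode(rgb[:, :, 0])
            & enc_g.encode(rgb[:, :, 1])
            & enc_b.encode(rgb[:, :, 2]))
```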

What follows is a discussion of this encoder’s semantic similarity properties. Semantic similarity happens when two similar inputs have similar SDR representations. This encoder design does two things to cause semantic similarity: (1) SDR bits are responsive to a range of input values, and (2) topology allows nearby bits to represent similar things.

  1. Effects of thresholds:
    Many encoders apply thresholds to real-valued input data to convert it into boolean output. In this encoder, the thresholds are ranges, referred to as ‘bins’. A small change in the input value might cause some of the output bits to change, and a large change in the input value will change all of the bits in the output. How sensitive the output bits are to changes in the input value -the semantic similarity- is determined by the sizes of the bins. The sizes of the bins are in turn determined by the sparsity, as follows:
    Assume that the inputs are distributed in a uniform random way throughout the input range. The size of the bins then determines the probability that an input value will fall inside a bin. This means that the sparsity is related to the size of the bins, which in turn means that the sparsity is related to the amount of semantic similarity (see the worked example after this list). This may seem counter-intuitive, but the same property holds true for all encoders which use bins to convert real numbers to discrete bits.

  2. Effects of topology:
    This encoder relies on topology in the input image, the idea that adjacent pixels in the input image are likely to show the same thing.
    2A) If an area of the output SDR cannot represent a color because it did not generate the bins needed to, it may still be near a pixel which does represent that color. In the case where there is fewer than one active output bit per input pixel, multiple close-together outputs can work together to represent the input.
    2B) If an output changes in response to a small change in input, then some semantic similarity is lost. Topology allows nearby outputs to represent the same thing, and since each output uses random bins, they will not all change when the input crosses a single threshold.
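
As a worked example of point 1, using the definitions from the set-up steps: with uniformly distributed inputs, the probability that a given output bit turns on is

```latex
P(\text{bit active})
  = \frac{\mathrm{BIN\_WIDTH}}{\text{data-range}}
  = \frac{\text{data-range} \cdot (1 - \text{target-sparsity})}{\text{data-range}}
  = 1 - \text{target-sparsity}
```

So for an 8-bit image (data-range = 256) with a target sparsity of 7/8: BIN_WIDTH = 32, and each bit is active with probability 32/256 = 1/8. (This ignores edge effects for bins centered near the ends of the range, which slightly reduce the true density.)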


If you look at the OgmaNeo software implemented by Eric Laukien, you will find some ideas for encoding images, like the STDP encoder, chunk encoder, etc.
Moreover, OpenCV also has a bio-inspired retina, so you can get motion information from images and feed it into the SP.
I believe those encoders are good enough for initial testing before trying a CNN or something else from deep learning.
Any ideas or comments?

I tested a lot of stuff like that, including the OpenCV retina. Convolutional network features worked better. This shouldn’t be surprising, since they effectively “cheat” by using backpropagation to learn the best possible representations (or at least locally optimal ones).

Maybe you want to try purely biological representations first; that’s fine. But another approach would be to ask: “how well can I do even if I cheat?”

And I’ve found that it’s still hard to get very far on any hard problem. So starting from a less powerful technique just to maintain biological plausibility seems like bad strategy.


Regarding encoders, I completely agree. I also think using today’s ML techniques for encoding is a good idea and will probably be useful. I’ve already thrown around an idea Frank Carey had at a previous hackathon about using DL to do feature extraction from frames of video in order to create an SDR stream, where each bit is a feature. Add weighting to that, or convert those features into semantic fingerprints? Who knows. There is so much potential here.

@jakebruce wrote:

“I tested a lot of stuff like that, including the OpenCV retina. Convolutional network features worked better.”

Which output of the OpenCV retina do you put into HTM? The Parvo or the Magno image?

I didn’t get good results from this, so I don’t know if my experiments will illuminate anything, but I fed downsampled images into the retina, and took the top 5% of the activations of each of the channels as my SDRs. Then I concatenated the parvocellular with the magnocellular SDRs and that was my input.
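
For anyone curious, the pipeline looked roughly like this (a sketch assuming the opencv-contrib cv2.bioinspired module; the 5% figure matches what I described, the image size is illustrative):

```python
import cv2
import numpy as np

def top_fraction_sdr(activations, fraction=0.05):
    """Binarize by keeping the top `fraction` of activations."""
    flat = activations.astype(np.float32).ravel()
    k = max(1, int(len(flat) * fraction))
    threshold = np.partition(flat, -k)[-k]  # k-th largest activation
    return flat >= threshold

frame = cv2.imread("frame.png")
small = cv2.resize(frame, (160, 120))  # downsampled input

retina = cv2.bioinspired.Retina_create((160, 120))  # (width, height)
retina.run(small)

parvo = retina.getParvo()  # detail/color channel
magno = retina.getMagno()  # motion channel

# Concatenate the parvocellular and magnocellular SDRs into one input.
sdr = np.concatenate([top_fraction_sdr(parvo), top_fraction_sdr(magno)])
```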

I didn’t pursue this much further because the retina is intrinsically foveated, and controlling saccades for active foveation was far out of scope for the place recognition task I was working on. Without active foveation I think the nonuniform sampling of my imagery was probably throwing away a lot of useful information.

@jakebruce OK. Do you have any comparison results between CNN+NuPIC and pure DL on any recognition task?

Not on hand. This was very early in my exploration, and I pruned this path off quickly because it wasn’t performing well. To give you an idea without specific numbers: some of the state-of-the-art place recognition algorithms simply accumulate the sum of absolute differences (SAD) between images over a sequence, and use that as the difference metric. And as simplistic as that is, it was outperforming the temporal memory representation on every dataset by a wide margin. This shouldn’t be surprising; the temporal memory representation has very poor invariance properties: it’s a pattern separator, not a pattern completer.
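To illustrate that baseline, a minimal sketch (not any particular published implementation):

```python
import numpy as np

def sequence_sad(query_seq, ref_seq):
    """Accumulate the sum of absolute differences (SAD) between two image
    sequences and use the total as the difference metric. Each sequence is
    an array of shape (T, H, W), already downsampled and normalized."""
    return sum(np.abs(q.astype(np.float32) - r.astype(np.float32)).sum()
               for q, r in zip(query_seq, ref_seq))

# The best-matching place is the reference sequence with the lowest score.
```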

I also wasn’t using NuPIC, I have my own implementations. So you may or may not find that useful.

Note however, “pure DL” is not the state of the art on place recognition. Using DL features in the framework of accumulating differences across sequences can help with invariance, but it’s the sequence search that does the work.

@jakebruce Have you had any success?

I’m not sure what you mean. I’ve had success using SAD and CNN features. I haven’t been using temporal memory features for place recognition recently.