Encoding vision for HTM

@sunguralikaan I hope you don’t mind I pulled this into a new topic since it seems like it could spawn a whole discussion of its own. I’ve tried a few things for encoding visual input to HTM; here are my thoughts.

(1 - color discretization) The simplest starting point that I’ve used for a visual encoder is to bin each pixel into one of 8 categories based on the RGB intensity: low-red/low-green/low-blue, low-red/low-green/high-blue, low-red/high-green/low-blue, low-red/high-green/high-blue, and so on.
Advantages: fixed sparsity, no pretraining, invariant to small changes in color.
Disadvantages: increases the dimensionality of the input by a factor of 8, doesn’t play nice with colors near category boundaries.
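
A minimal sketch of option 1, assuming an 8-bit RGB image and a single low/high cut at 128 per channel (the function name and the threshold value are illustrative choices, not taken from any particular library):

```python
import numpy as np

def encode_rgb_octants(image, threshold=128):
    """One-hot encode each RGB pixel into one of 8 low/high categories.

    image: uint8 array of shape (H, W, 3).
    Returns a boolean array of shape (H, W, 8) with exactly one active bit per pixel.
    """
    high = image >= threshold                        # (H, W, 3) low/high flags
    # Pack the three flags into a category index 0..7 (red is the most significant bit).
    category = high[..., 0] * 4 + high[..., 1] * 2 + high[..., 2] * 1
    # Compare each pixel's category against 0..7 to get a one-hot code.
    return category[..., None] == np.arange(8)

image = np.array([[[0, 0, 0], [255, 255, 255]],
                  [[200, 10, 10], [10, 10, 200]]], dtype=np.uint8)
sdr = encode_rgb_octants(image)
print(sdr.shape, sdr.sum(axis=2))
```

Every pixel activates exactly one of its 8 bits, which is where the fixed sparsity comes from. A pixel sitting right at the cut flips category with a one-step change in intensity, which is the boundary problem mentioned above.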

(2 - edge detection) I’ve also tried feeding in edges from an edge detector as a binary image. I think the color-changes approach you mention (similar to event-based cameras) would be similar to this.
Advantages: no pretraining, doesn’t increase the dimensionality of the input.
Disadvantages: edge detectors are not very reliable so the same scene can look very different based on small changes in the image (your access to the geometry may reduce this problem), they require hand-tuning of their parameters, their sparsity is not fixed, and they don’t represent flat textures.

(3 - Gabor filters) Another approach that seemed reasonable was to convolve the input with a set of Gabor filters and threshold the output to a binary image for each filter.
Advantages: represents edges in different orientations as different features, no pretraining, you can choose the top N% of activations to fix the sparsity, and you can adjust the stride of the filters (or do max-pooling) to reduce the degree to which they blow up the input dimensionality.
Disadvantages: hand-tuning of thresholds, and like edge detectors they don’t represent flat regions well.
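
Here is a rough NumPy-only sketch of option 3, assuming a greyscale input. The kernel size, wavelength, stride and target density are illustrative values, and the function names are made up for this example. It keeps roughly the top fraction of absolute filter responses, so the sparsity is fixed instead of hand-thresholded:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def gabor_kernel(size=9, sigma=2.0, theta=0.0, wavelength=4.0):
    """Real part of a Gabor filter: a Gaussian envelope times a cosine grating."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    kernel = envelope * np.cos(2.0 * np.pi * x_rot / wavelength)
    return kernel - kernel.mean()    # zero mean: flat regions give (near) zero response

def encode_gabor(image, n_orientations=4, size=9, stride=2, density=0.05):
    """Return a boolean (H', W', n_orientations) array with about `density` active bits."""
    # All size x size patches, subsampled by `stride` to limit the output dimensionality.
    windows = sliding_window_view(image.astype(float), (size, size))[::stride, ::stride]
    thetas = np.arange(n_orientations) * np.pi / n_orientations
    kernels = np.stack([gabor_kernel(size, theta=t) for t in thetas])
    # Correlate every patch with every kernel (same as convolution for these symmetric kernels).
    responses = np.abs(np.einsum('ijab,kab->ijk', windows, kernels))
    # Keep the strongest responses; ties at the cutoff can add a few extra bits.
    n_active = max(1, int(round(density * responses.size)))
    cutoff = np.partition(responses.ravel(), -n_active)[-n_active]
    return responses >= cutoff

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32))
sdr = encode_gabor(image)
print(sdr.shape, sdr.mean())
```

Note that on a completely flat image every response is about zero, so the top-N cutoff becomes meaningless there. That is the same flat-region weakness listed under the disadvantages.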

(4 - learned local features) Along the same lines as Gabor filters, and similar to some of the feature descriptors in computer vision (like ORB), I’ve used learned local features at every point in the image. I’ve trained these features for autoencoding (similar to the trainable universal encoder I posted), and I’ve also learned them with sparse coding methods, or with just a simple histogram over encountered image patches.
Advantages: the features are better tuned to represent the kinds of patterns that actually exist in your data, and you can adjust the stride or max-pool to reduce dimensionality.
Disadvantages: requires pretraining, hand-tuning of thresholds.

(5 - pretrained networks) And I’ve briefly experimented with using pretrained networks like AlexNet, ResNet, VGGNet, which have already learned good features (they won the ImageNet competition) and then thresholding those features as input to the HTM.
Advantages: good features, the ability to choose features at multiple scales (since we don’t have hierarchy in HTM yet).
Disadvantages: very expensive to compute compared to the other methods here, and if your data doesn’t have much in common with ImageNet images then the features may not be appropriate.

Things I haven’t tried include global image descriptors like GIST, or greyscaled and adaptive-histogram-equalized images, both of which would have to be thresholded to binary and would not be fixed in sparsity. Another global encoding approach is the trainable universal encoder I posted. You could also just compute the descriptors for existing handcrafted features from computer vision (SIFT, SURF, ORB, BRISK, BRIEF, LBP) in a dense sampling over the image and binarize those somehow.

Unfortunately I don’t have performance curves for these evaluations, as most of them performed too poorly to merit further investigation. I suspect this was due to the other weaknesses in HTM on the problem I was working on (visual place recognition and robot navigation) as opposed to particular problems with the encoding styles.

Let me know if anyone has any other ideas for encoding visual input to HTM. I think it’s definitely an area with lots of unexploited potential.

This would be worth trying, but I anticipate problems. If you had some feature map of scalars you were encoding in the input, then I can see how this would be useful, as a sort of non-maximal-suppression approach. If you’ve got edges, or +/- intensity events, then I’m not sure you’d want to do this, because it could wipe out pixels that do actually contain edges (or events).

My current sensor is closer to this but with 125 categories. It is fixed sparsity but you cannot control the sparsity easily. You would have to include more categories if you wanted more sparsity, which is what I did.

I would eliminate option 1 because I want to also reduce the dimensionality that the SP needs to capture as well as being able to control the level of sparsity in an easier way, like some sort of a threshold.

Option 2 is a contender but I have to do image processing to detect “edges” in flat textures as you said. The event based camera would also take care of this, if I am not mistaken?

Option 3, is this possible in real time?

Option 4, I am not sure about a pretrained sensor. Not that it wouldn’t work or it wouldn’t be bioplausible. It’s just that I would want to be able to quickly modify the way I construct the image from the POV of the agent and expect it to work. Is this real time again?

I’ve considered option 5 previously but the costs are kind of counterproductive for me and pretrained again.

You mean the options you provided right? I am closer to trying the event based approach as it encodes the change in data not the actual image which kind of maps perfectly with an HTM based agent that learns behaviors according to the sensory change an action results in.

Yes but would that be significant enough to cancel out the advantage that you get from being able to fix your sparsity by controlling the inhibition?

In a sense. If the agent is moving, then events are due to either moving objects, or edges (intensity gradients) in the world. If the agent is stationary, then events are only due to moving objects.

Yes, this is similar to how an edge detector works, and it’s very fast (1000+ Hz depending on your image size). It’s even faster on the GPU, but a GPU is not required.

Yes, this would be the same computational cost as the Gabor filters (if the local features are the same size as the Gabor filters, which is reasonable).

Yes, plus some others that I didn’t mention (such as random projections [see Johnson-Lindenstrauss lemma], etc). Keep in mind this was on a totally different task than yours, so don’t take this as evidence that it wouldn’t work for you.

It’s a good question. I don’t know, but both seem like reasonable options.

We need saccadic movement, i.e. transforming the image into a temporal sequence. This has to be done according to past experience (prediction).

The fovea might be much simpler. Perhaps option 1 is enough. In any case, it seems like static images should be considered before jumping into motion.

@jakebruce I prefer the Gabor filter option too.

So here is an early implementation of an event-based sensor compared to the default RGB sensor. It is kind of simpler and works better than I expected. You can control its sparsity via a threshold. You can even fix the sparsity if you introduce some sort of inhibition among pixels. It also has a significantly lower dimensionality compared to the RGB sensor; a pixel is either on or off.

The main limiting factor is that the image motion speed affects everything. The current agent movement that rotates and jumps from Voronoi cell to Voronoi cell (0:15 in thesis video) needs to be altered accordingly. This may prove to be a good constraint in the long run though.

So what would be the preferred way of handling the static images though? What happens if the agent stops; should it not really see anything?

Did you simulate the function of an event-based camera on top of the RGB sensor? It looks like you mostly detect edges and generate the right images. Do you use a Difference of Gaussians filter? There is an interesting project, “Virtual Retina”, whose software can generate retina spikes from static images.

@thanh-binh.to Yes, it takes the RGB sensor and checks for color changes above a threshold x; a change above x turns the corresponding pixel on, and that is the event itself. I did not use anything else, so computationally it literally costs nothing on top of the RGB sensor.
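
For anyone who wants to try this, a minimal sketch of that kind of thresholded frame difference could look like the following. Taking the largest per-channel change is an assumption made for this example (the actual sensor may measure color change differently), and the function name and threshold value are illustrative:

```python
import numpy as np

def event_frame(previous, current, threshold=30):
    """Boolean (H, W) image: True where a pixel's color changed by more than `threshold`.

    previous, current: uint8 RGB frames of shape (H, W, 3).
    """
    # Widen to a signed type first so the subtraction cannot wrap around.
    diff = np.abs(current.astype(np.int16) - previous.astype(np.int16))
    # Assumption: a pixel fires if its largest per-channel change exceeds the threshold.
    return diff.max(axis=2) > threshold

previous = np.zeros((4, 4, 3), dtype=np.uint8)
current = previous.copy()
current[1, 2] = (0, 100, 0)     # large change: should fire
current[3, 3] = (10, 10, 10)    # small change: below threshold
events = event_frame(previous, current)
print(events.sum())
```

Raising the threshold lowers the number of active bits, and two identical frames give an empty output, so a stationary agent in a static scene produces no events at all.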

@sunguralikaan When using an event-based camera we often face unpleasant situations with very few events, or even no events at all. In those cases the number of active cells in the spatial pooler goes down dramatically.

@sunguralikaan @ycui do you know how to avoid this problem? Thanks

I have tried using boostStrength, but it is very time consuming and the SP runs too slowly!
I have tried changing other parameters, but with no success.

I became interested in event based sensors because I thought maybe we should look at the problem differently. Normal sensors communicate the state of input. Event based sensors communicate the change in the input. As you said, there is no communication when there is no transition. On the other hand, HTM learns the transitions/change of the input too. So if HTM learns transitions, why should it learn anything when there are no input transitions? In other words, do we really need to learn, for example, A->B->B->B->C? Why is learning that sequence as (A->B->C) or (A->B and B->C as separate sequences) not enough? Maybe there is another solution for the need to learn A->B->B->B->C (if there is any) and maybe we are using the wrong tool to make up for this. The autonomous agent that I work on learns sequences with parts that have no input transitions. This just leads to unnecessary stalls and redundant action selections. Why can’t it learn only the transitions, the stuff that actually changes?

From this perspective, event based sensors are a perfect match for HTM. Having no active columns at all when there are no transitions makes sense and maybe the learning should be designed around this. Or maybe, this is all wrong :)

Bonus idea: Think about the functionality of manual reset tags in current HTM theory. No input change and as a result, no active neurons would be the reset itself. A sequence would reset itself naturally when the data stops changing. For some reason, this sounds so right.

Timing is going to be very important for a game agent. One way to partially accomplish timing in HTM today (until there is a more scalable, biologically-inspired timing mechanism implemented) is with repeating inputs. This is one of the reasons I have spent some significant time exploring ways of handling repeating inputs in sequence memory (I am also concerned with being able to exhibit some degree of timing)

I see where you are coming from, especially in the long run. Although I am having a hard time imagining an actual use case that would be practical for the current ability of HTM. Can you describe a case where learning repeated inputs is actually useful? Off the top of my head, if I give the agent a positive reward with a delay when it encounters a specific pattern, I guess learning a repeated sequence would help, because it would be seeing the same thing over and over and the reward would come, for example, at the 5th identical input. But I am not sure if this is a practical case or a central ability, considering the complexities and implications it introduces into the system. As I said, maybe we are using the wrong tool to have timing. Moreover, it seems to me that the important tasks that would require timing would also require a lot more than just current HTM + timing (which is why I think you said partially), but I probably lack your experience in the timing department.

I just realized that I did not give an actual answer to your question. Have you tried adjusting the stimulusThreshold variable of the Spatial Pooler? For example, setting it to zero. This would result in active columns even if you have a single on input bit.

If you want activity at all times, you could set that variable to 0 and add an extra but constant overlap to all columns so that the ones that are getting the least use (highest boost value) become active. This would ensure activation even without any inputs but I am not sure about its usefulness because I have tried it in the past.

I’ve started another topic HERE to discuss, since it is getting off topic.

I had an idea on how to improve this color discretization step. I have implemented and am currently experimenting with these ideas, with marginal success.

Two requirements for encoded image SDRs are that every bit of information represents a range of possible values, and that for every input a certain fraction of bits activate in the SDR. The effects of these requirements are that each output bit is inaccurate and redundant. This implementation makes every output bit receptive to a unique range of inputs. These ranges are uniformly distributed through the input space, and their width is the size of the input space multiplied by the target fraction of active bits. This meets both requirements for being an SDR and has more representational power than if many of the bits represented the same ranges. This design makes all of those redundancies add useful information.

Don’t you lose some semantic similarity this way? I may have misunderstood it. Can you describe it in other words?

I will write the steps, first the set up steps which are performed at program start up:

  1. Assume the encoder will accept a grey-scale image with (M x N) pixels to encode into an SDR with dimensions (M x N x C) bits, where C is the number of bits in the output SDR for each input pixel.
  2. Let BIN_CENTERS = Array[M x N x C]
    This array stores a (floating-point) number for each bit in the output SDR. Fill BIN_CENTERS with uniform random numbers in the same range as the pixel intensity (typically 0 to 255, but depends on the input data format).
  3. Let BIN_WIDTH = data-range * (1 - target-sparsity)
    Where data-range is the theoretical range of pixel intensities. For an 8 bit image this is 256, but it depends on the input data format.
    Where the target-sparsity is the desired fraction of zeros in the output SDR.
  4. Let BIN_LOWER_BOUNDS = BIN_CENTERS - BIN_WIDTH/2
    Let BIN_UPPER_BOUNDS = BIN_CENTERS + BIN_WIDTH/2
    The shapes of both of these arrays are (M x N x C). Together these arrays describe the ranges of input values which each output bit will be responsive to.

Steps to encode an image:

  1. Let IMAGE = Array[M x N], this is the input, it is real valued (aka floating point).
  2. Let SDR = Array[M x N x C], this is the output, it is boolean.
  3. Iterate through the SDR using the indexes (x, y, z), and set every bit of the SDR according to step 4.
  4. Let SDR[x, y, z] = BIN_LOWER_BOUNDS[x, y, z] <= IMAGE[x, y] and IMAGE[x, y] <= BIN_UPPER_BOUNDS[x, y, z].
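
The setup and encoding steps above can be sketched in NumPy roughly as follows. The class name, the seed argument and the use of broadcasting in place of an explicit loop over (x, y, z) are choices made for this example; the array names follow the steps:

```python
import numpy as np

class RandomBinEncoder:
    """Greyscale image -> boolean SDR of shape (M, N, C) using random per-bit bins."""

    def __init__(self, shape, bits_per_pixel, target_sparsity, data_range=256.0, seed=0):
        rng = np.random.default_rng(seed)
        m, n = shape
        # Setup step 2: one random bin center per output bit.
        bin_centers = rng.uniform(0.0, data_range, size=(m, n, bits_per_pixel))
        # Setup step 3: target_sparsity is the desired fraction of zeros.
        bin_width = data_range * (1.0 - target_sparsity)
        # Setup step 4: the range of inputs each output bit responds to.
        self.bin_lower_bounds = bin_centers - bin_width / 2.0
        self.bin_upper_bounds = bin_centers + bin_width / 2.0

    def encode(self, image):
        # Add a trailing axis so each pixel is compared against all C of its bins.
        pixels = np.asarray(image, dtype=float)[..., None]
        return (self.bin_lower_bounds <= pixels) & (pixels <= self.bin_upper_bounds)

encoder = RandomBinEncoder(shape=(16, 16), bits_per_pixel=8, target_sparsity=0.875)
image = np.random.default_rng(1).integers(0, 256, size=(16, 16))
sdr = encoder.encode(image)
print(sdr.shape, sdr.mean())
```

One thing to be aware of: bins whose centers fall near either end of the intensity range extend past it, so for uniformly distributed inputs the measured fraction of active bits comes out somewhat below the nominal target.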

To encode color images, create a separate encoder for each color channel. Then recombine the output SDRs into a single monolithic SDR by multiplying them together. Multiplication is equivalent to logical “and” in this situation. Notice that the combined SDR’s sparsity is different: the fraction of bits which are active in the combined SDR is the product of the fractions of bits which are active in each of the input SDRs. For example, to recreate the original poster’s example with 8 bits per pixel and a density of 1/8: create three encoders with 8 bits per pixel and a density of 1/2.
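
As a rough illustration of the color case, here is a self-contained sketch that builds one set of random bins per channel and combines the three channel SDRs with a logical “and”. The helper names are made up for this example, and `density` here means the target fraction of active bits (one minus the sparsity):

```python
import numpy as np

def make_bins(shape, bits_per_pixel, density, rng, data_range=256.0):
    """Random bin bounds for one channel; `density` is the target fraction of active bits."""
    centers = rng.uniform(0.0, data_range, size=tuple(shape) + (bits_per_pixel,))
    half_width = data_range * density / 2.0
    return centers - half_width, centers + half_width

def encode_channel(channel, bins):
    """Boolean (M, N, C) SDR for one (M, N) channel."""
    lower, upper = bins
    values = np.asarray(channel, dtype=float)[..., None]
    return (lower <= values) & (values <= upper)

def encode_rgb(image, bins_per_channel):
    """AND together the per-channel SDRs of an (M, N, 3) image."""
    channel_sdrs = [encode_channel(image[..., c], bins)
                    for c, bins in enumerate(bins_per_channel)]
    return np.logical_and.reduce(channel_sdrs)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3))
bins_per_channel = [make_bins((32, 32), 8, 0.5, rng) for _ in range(3)]
sdr = encode_rgb(image, bins_per_channel)
channel_densities = [encode_channel(image[..., c], b).mean()
                     for c, b in enumerate(bins_per_channel)]
print(sdr.mean(), np.prod(channel_densities))
```

With independent random channels the combined density tracks the product of the three channel densities. Because bins near the ends of the range are partly cut off, each channel lands a little under 1/2 in practice, so the combined value is a little under 1/8.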

What follows is a discussion of this encoder’s semantic similarity properties. Semantic similarity happens when two inputs which are similar have similar SDR representations. This encoder design does two things to cause semantic similarity: (1) SDR bits are responsive to a range of input values, and (2) topology allows nearby bits to represent similar things.

  1. Effects of thresholds:
    Many encoders apply thresholds to real valued input data which converts it into boolean output. In this encoder, the thresholds are ranges which are referred to as ‘bins’. A small change in the input value might cause some of the output bits to change and a large change in input value will change all of the bits in the output. How sensitive the output bits are to changes in the input value -the semantic similarity- is determined by the sizes of the bins. The sizes of the bins are in turn determined by the sparsity, as follows:
    Assume that the inputs are distributed in a uniform random way throughout the input range. The size of the bins then determines the probability that an input value will fall inside of a bin. This means that the sparsity is related to the size of the bins, which in turn means that the sparsity is related to the amount of semantic similarity. This may seem counter-intuitive but this same property holds true for all encoders which use bins to convert real numbers to discrete bits.

  2. Effects of topology:
    This encoder relies on topology in the input image, the idea that adjacent pixels in the input image are likely to show the same thing.
    2A) If an area of the output SDR cannot represent a color because it did not generate the bins needed to, then it may still be near a pixel which does represent the color. In the case where there is fewer than one active output bit per input pixel, multiple close-together outputs can work together to represent the input.
    2B) If an output changes in response to a small change in input, then some semantic similarity is lost. Topology allows nearby outputs to represent the same thing, and since each output uses random bins they will not all change when the input reaches a single threshold.

If you look at the OgmaNeo software implemented by Eric Laukien, you will find some ideas for encoding images, like the STDP encoder, the chunk encoder, etc.
Moreover, OpenCV also has a bio-inspired retina, so you can get motion information from images and feed it into the SP.
I believe those encoders are good enough for initial testing, before trying CNNs or something else from deep learning.
Any ideas or comments?

I tested a lot of stuff like that, including the OpenCV retina. Convolutional network features worked better. This shouldn’t be surprising, since they effectively “cheat” by using backpropagation to learn the best possible representations (or at least locally optimal ones).

Maybe you want to try purely biological representations first, that’s fine. But another approach would be to say “how well can I do even if I cheat?”

And I’ve found that it’s still hard to get very far on any hard problem. So starting from a less powerful technique just to maintain biological plausibility seems like a bad strategy.
