@sunguralikaan I hope you don’t mind I pulled this into a new topic since it seems like it could spawn a whole discussion of its own. I’ve tried a few things for encoding visual input to HTM; here are my thoughts.
(1 - color discretization) The simplest starting point that I’ve used for a visual encoder is to bin each pixel into one of 8 categories based on its RGB intensities: low-red/low-green/low-blue, low-red/low-green/high-blue, low-red/high-green/low-blue, low-red/high-green/high-blue, and so on.
Advantages: fixed sparsity, no pretraining, invariant to small changes in color.
Disadvantages: increases the dimensionality of the input by a factor of 8, doesn’t play nice with colors near category boundaries.
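For concreteness, here’s a minimal NumPy sketch of that binning; the mid-point threshold of 128 and the function name are just illustrative:

```python
import numpy as np

def encode_rgb_octants(image, threshold=128):
    """Bin each pixel of an HxWx3 uint8 RGB image into one of 8 low/high
    categories and return a flat binary vector with 8 bits per pixel.
    The threshold of 128 is an arbitrary mid-point, not a tuned value."""
    bits = (image >= threshold).astype(np.uint8)                    # HxWx3 of 0/1
    category = bits[..., 0] * 4 + bits[..., 1] * 2 + bits[..., 2]   # 0..7 per pixel
    h, w = category.shape
    encoding = np.zeros((h, w, 8), dtype=bool)
    encoding[np.arange(h)[:, None], np.arange(w)[None, :], category] = True
    return encoding.reshape(-1)   # 8x the pixel count, fixed sparsity of 1/8
```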
(2 - edge detection) I’ve also tried feeding in edges from an edge detector as a binary image. I think the color-change approach you mention (similar to event-based cameras) would end up behaving much like this.
Advantages: no pretraining, doesn’t increase the dimensionality of the input.
Disadvantages: edge detectors are not very reliable, so the same scene can look very different based on small changes in the image (your access to the geometry may reduce this problem); they require hand-tuning of their parameters; their sparsity is not fixed; and they don’t represent flat textures.
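A minimal version of this with OpenCV’s Canny detector; the two thresholds are exactly the hand-tuned parameters I complained about, and the values here are placeholders:

```python
import cv2

def encode_edges(image_gray, low=50, high=150):
    """Binary edge image from Canny on a greyscale uint8 image.
    The low/high thresholds need hand-tuning; sparsity varies per frame."""
    edges = cv2.Canny(image_gray, low, high)   # uint8, 255 on edge pixels
    return (edges > 0).reshape(-1)             # flat binary vector, same size as the input
```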
(3 - Gabor filters) Another approach that seemed reasonable was to convolve the input with a set of Gabor filters and threshold each filter’s output to a binary image.
Advantages: represents edges in different orientations as different features, no pretraining, you can choose the top N% of activations to fix the sparsity, and you can adjust the stride of the filters (or do max-pooling) to reduce the degree to which they blow up the input dimensionality.
Disadvantages: hand-tuning of thresholds, and like edge detectors they don’t represent flat regions well.
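A rough sketch of the Gabor-bank version, keeping the top N% of responses so the sparsity is fixed; the kernel parameters and the 2% sparsity are illustrative, not tuned values:

```python
import cv2
import numpy as np

def encode_gabor(image_gray, n_orientations=4, sparsity=0.02):
    """Convolve with a small Gabor bank and keep the top `sparsity`
    fraction of responses as active bits (fixed sparsity by construction)."""
    responses = []
    for i in range(n_orientations):
        theta = i * np.pi / n_orientations
        # ksize, sigma, lambda, gamma, psi below are placeholder values
        kernel = cv2.getGaborKernel((9, 9), 2.0, theta, 6.0, 0.5, 0)
        resp = cv2.filter2D(image_gray.astype(np.float32), cv2.CV_32F, kernel)
        responses.append(np.abs(resp))
    stacked = np.stack(responses).reshape(-1)   # one response plane per orientation
    k = max(1, int(sparsity * stacked.size))
    active = np.argsort(stacked)[-k:]           # indices of the top-k responses
    encoding = np.zeros(stacked.size, dtype=bool)
    encoding[active] = True
    return encoding                             # n_orientations x pixel count bits
```

Max-pooling the response maps before the top-k step (not shown) would shrink the output, as mentioned above.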
(4 - learned local features) Along the same lines as Gabor filters, and similar to some of the feature descriptors in computer vision (like ORB), I’ve used learned local features at every point in the image. I’ve trained these features for autoencoding (similar to the trainable universal encoder I posted), and I’ve also learned them with sparse coding methods, or just by building a simple histogram over encountered image patches.
Advantages: the features are better tuned to represent the kinds of patterns that actually exist in your data, and you can adjust the stride or max-pool to reduce dimensionality.
Disadvantages: requires pretraining, hand-tuning of thresholds.
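The autoencoder and sparse-coding versions are longer, but as a stand-in here is the simplest flavour of the idea: learn the features by clustering image patches (scikit-learn here; the patch size, stride, and feature count are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.image import extract_patches_2d

def learn_patch_features(images_gray, patch_size=8, n_features=64):
    """Pretraining step: cluster random patches from greyscale training images."""
    patches = np.concatenate([
        extract_patches_2d(img, (patch_size, patch_size), max_patches=500)
        for img in images_gray
    ]).reshape(-1, patch_size * patch_size).astype(np.float32)
    return MiniBatchKMeans(n_clusters=n_features).fit(patches)

def encode_with_features(image_gray, model, patch_size=8, stride=4):
    """Encoding step: one-hot encode each strided patch with its nearest feature."""
    h, w = image_gray.shape
    patches = np.array([image_gray[r:r + patch_size, c:c + patch_size].ravel()
                        for r in range(0, h - patch_size + 1, stride)
                        for c in range(0, w - patch_size + 1, stride)],
                       dtype=np.float32)
    labels = model.predict(patches)
    encoding = np.zeros((len(patches), model.n_clusters), dtype=bool)
    encoding[np.arange(len(patches)), labels] = True
    return encoding.reshape(-1)   # fixed sparsity: one active bit per patch
```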
(5 - pretrained networks) And I’ve briefly experimented with using pretrained networks like AlexNet, ResNet, and VGGNet, which have already learned good features (they won the ImageNet competition), and then thresholding those features as input to the HTM.
Advantages: good features, the ability to choose features at multiple scales (since we don’t have hierarchy in HTM yet).
Disadvantages: very expensive to compute compared to the other methods here, and if your data doesn’t have much in common with ImageNet images then the features may not be appropriate.
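For what it’s worth, here’s a hedged sketch of the pretrained-network route with torchvision; ResNet-18’s last convolutional feature map and the 2% top-k threshold are example choices, and the layer you tap is exactly the “multiple scales” knob mentioned above:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pretrained backbone with the pooling and classifier layers removed,
# so the output is the last convolutional feature map.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1).eval()
backbone = torch.nn.Sequential(*list(model.children())[:-2])

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def encode_pretrained(pil_image, sparsity=0.02):
    """Threshold the top `sparsity` fraction of feature activations to bits."""
    with torch.no_grad():
        features = backbone(preprocess(pil_image).unsqueeze(0)).flatten()
    k = max(1, int(sparsity * features.numel()))
    encoding = torch.zeros(features.numel(), dtype=torch.bool)
    encoding[features.topk(k).indices] = True
    return encoding.numpy()   # fixed-sparsity binary vector for the spatial pooler
```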
Things I haven’t tried include global image descriptors like GIST, or greyscaled and adaptive-histogram-equalized images, both of which would have to be thresholded to binary and would not be fixed in sparsity. Another global encoding approach is the trainable universal encoder I posted. You could also just compute the descriptors for existing handcrafted features from computer vision (SIFT, SURF, ORB, BRISK, BRIEF, LBP) in a dense sampling over the image and binarize those somehow.
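Of those, the greyscale + adaptive-histogram-equalization one is easy to try with OpenCV’s CLAHE; the clip limit, tile size, and binarization threshold below are just guesses, and as noted the sparsity won’t be fixed:

```python
import cv2

def encode_clahe(image_bgr, threshold=200):
    """Greyscale the image, apply CLAHE, and binarize by a fixed threshold.
    Sparsity is not fixed and depends on the scene and the threshold."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return (clahe.apply(gray) > threshold).reshape(-1)
```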
Unfortunately I don’t have performance curves for these evaluations, as most of them performed too poorly to merit further investigation. I suspect this was due to other weaknesses in HTM on the problem I was working on (visual place recognition and robot navigation) rather than to particular problems with the encoding styles.
Let me know if anyone has any other ideas for encoding visual input to HTM. I think it’s definitely an area with lots of unexploited potential.