Vision, even when limited to object recognition, is a huge topic. A detailed description of computer vision, the visual system, or the past literature is well beyond the scope of this wiki page! The goal of this writeup is to document how we might create CLA networks for object recognition using NuPIC. We’ve limited the focus to network structure, experimental setup, evaluation, and expected results. The details currently reflect experiments done at Numenta through 2011. Over time we hope the community will add to this page and improve the details.
The challenge of object recognition
The goal of object recognition is to reliably recognize objects in images. This is a task that humans perform effortlessly but is notoriously difficult for computers. One of the main difficulties is the problem of invariance. No two images of the same object will be identical. It is very likely that every pixel will be at least slightly different. Even small changes in the physical object (such as a slight shift) can cause dramatic changes in the pixel representation. Ideally we would like to recognize objects under a large number of transformations. This includes pixel noise, changes in lighting, translation (shifts), rotations (2D and 3D), scale changes, occlusions, and clutter. We even want to recognize objects after complex non-rigid deformations. For example, we would like to label a hand as a hand irrespective of finger positioning.
Invariant representations: Suppose we could process an input image and create an output representation that was very stable (i.e. invariant) with respect to changes in the object. This would help solve the problem. Unfortunately, an invariant representation alone is insufficient. In fact it is trivial to create a useless invariant representation: simply output a constant.
Discriminant representations: To really solve object recognition, the representation must meet a second property: it must also discriminate accurately between objects.
A raw pixel representation is perfectly discriminant: every object has a different representation. A constant representation is perfectly invariant. Somewhere in between is a representation that is invariant to the changes we don’t care about (such as lighting), but discriminates between changes that we do care about (changing the object). It turns out this tradeoff is actually a useful way to view the problem. This characterization leads to concrete measurable goals (see discussion on “onion plots” below).
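The tradeoff can be made concrete with a toy measurement. The sketch below is illustrative only: the two 4x4 images, the function names, and the simple match score are assumptions made for this example and are not part of NuPIC. It compares a representation of an object against the representation of a shifted copy (we want a high match) and against a different object (we want a low match).

```python
def shift_right(image, n=1):
    """Shift each row of a binary image right by n pixels, padding with 0."""
    return [[0] * n + row[:len(row) - n] for row in image]

def raw_pixels(image):
    """Perfectly discriminant representation: the flattened pixels."""
    return tuple(p for row in image for p in row)

def constant(image):
    """Perfectly invariant but useless representation: always the same output."""
    return (1,)

def match(a, b):
    """Fraction of positions where two equal-length representations agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Two hypothetical toy objects.
VERTICAL_BAR = [[0, 1, 0, 0],
                [0, 1, 0, 0],
                [0, 1, 0, 0],
                [0, 1, 0, 0]]
HORIZONTAL_BAR = [[0, 0, 0, 0],
                  [1, 1, 1, 1],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]]

def score(represent):
    """Return (same-object match, different-object match) for a representation."""
    same = match(represent(VERTICAL_BAR), represent(shift_right(VERTICAL_BAR)))
    different = match(represent(VERTICAL_BAR), represent(HORIZONTAL_BAR))
    return same, different

print("raw pixels:", score(raw_pixels))
print("constant:  ", score(constant))
```

With raw pixels, the shifted bar matches the original less well than a completely different object does, which is the invariance problem in miniature. The constant representation matches everything equally, so it cannot discriminate at all. A useful representation would score high on the first number and low on the second.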
Humans are extremely good at this task and the visual cortex has solved this representation problem. For example, there are face-specific neurons in the regions V4 and IT that are highly invariant to changes in the input but only respond to faces. (One experiment even detected a Bill Clinton neuron, specific only to pictures of the President.)
In principle, an accurate theory of the neocortex should be able to achieve the same result. In practice, this is an extremely challenging task. The sections below describe an initial design and experimental approaches for object recognition with CLA-based networks.
CLA network design
This section can serve as a starting point for designing CLA networks for object recognition.
CLA network structure and topology
Images are 2D and the visual cortex is arranged in a matching 2D topology. A CLA network for vision should similarly be arranged in a 2D topology. In such a setup, each column has an implicit center point on the image.
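As a minimal sketch of this layout, the function below spaces a 2D sheet of columns evenly over an image and returns each column's implicit center point. The function name and the even-spacing rule are assumptions made for illustration; they are not taken from the NuPIC code.

```python
def column_centers(grid_w, grid_h, image_w, image_h):
    """Map each column in a grid_w x grid_h sheet to a pixel center on the image.

    Returns a dict {(col, row): (cx, cy)} using even spacing (an assumption).
    """
    centers = {}
    for row in range(grid_h):
        for col in range(grid_w):
            # Place the column at the middle of its cell, rounded down to a pixel.
            cx = int((col + 0.5) * image_w / grid_w)
            cy = int((row + 0.5) * image_h / grid_h)
            centers[(col, row)] = (cx, cy)
    return centers

# Example: a 4x4 sheet of columns over a 32x32 image.
centers = column_centers(4, 4, 32, 32)
print(centers[(0, 0)], centers[(3, 3)])
```

If the column sheet has more columns than the image has pixels along an axis, several columns simply land on the same pixel under this rule.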
At the lowest level, each column in the spatial pooler receives input from pixels around that center point (i.e. each column has a “receptive field”). It is ok for multiple columns to have the same center point. The extent of the receptive field is a parameter that needs to be determined. In theory this is actually not an important parameter: it should be ok to make the receptive field large (need rationale for this). In practice the computational requirements increase as the receptive field size increases. Another important practical consideration is the initial permanence values for each column. Although in general any input within the receptive field may be connected, columns generally tend to end up connecting to inputs “close” to their center. As such we found the networks converged faster if permanences were initialized such that they were more likely to be connected near the center of the receptive field.
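The center-biased initialization can be sketched as follows. This is a hedged illustration, not the NuPIC implementation: the names `receptive_field` and `init_permanences`, the square receptive field, the linear falloff, and the values of `connected_perm` and `spread` are all assumptions chosen for the example.

```python
import math
import random

def receptive_field(center, radius, width, height):
    """Pixels within a square of the given radius around center, clipped to the image."""
    cx, cy = center
    return [(x, y)
            for y in range(max(0, cy - radius), min(height, cy + radius + 1))
            for x in range(max(0, cx - radius), min(width, cx + radius + 1))]

def init_permanences(center, radius, width, height,
                     connected_perm=0.2, spread=0.1, rng=None):
    """Assign an initial permanence to every potential input of one column.

    Permanences near the center are biased above connected_perm and those
    near the edge below it, so nearby inputs are more likely to start connected.
    """
    rng = rng or random.Random(0)
    cx, cy = center
    max_dist = math.hypot(radius, radius) or 1.0  # avoid dividing by zero
    perms = {}
    for (x, y) in receptive_field(center, radius, width, height):
        dist = math.hypot(x - cx, y - cy)
        # Linear bias: +spread/2 at the center, -spread/2 at the far corner.
        bias = spread * (0.5 - dist / max_dist)
        noise = rng.uniform(-spread / 2, spread / 2)
        perms[(x, y)] = min(1.0, max(0.0, connected_perm + bias + noise))
    return perms

# Example: one column centered at (5, 5) on a 10x10 image.
perms = init_permanences((5, 5), radius=2, width=10, height=10)
connected = [pos for pos, p in perms.items() if p >= 0.2]
print(len(perms), "potential inputs,", len(connected), "initially connected")
```

Every input in the receptive field remains a potential synapse, so learning can still connect distant inputs later; the bias only changes where the column starts.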
The columns in the temporal pooler line up with the columns in the spatial pooler. The temporal pooler also creates synapses from cells around the center point (controlled by learningRadius in the pseudocode). In practice we found it desirable to make the learningRadius larger than the spatial pooler’s inhibitionRadius. (Please see discussion below regarding the number of cells per column and the use of pooling.)
In order to learn invariant representations across a large scale, the network must be arranged in a hierarchy. (to be filled in).
Inputs and encoders
There are a couple of possible ways to feed in image data to a CLA:
a) the simplest approach is to use binary images and feed pixel data directly to the spatial pooler. Here each pixel is either 0 or 1. Although this method is limited and not likely to lead to practical vision systems, experiments with binary images are much easier to construct and analyze. It is easier to hypothesize what should happen and compare against what actually happened. As such, this approach is very useful for understanding some of the fundamentals as well as debugging code.
b) a more sophisticated approach is to use greyscale images followed by some filtering. There are a number of different filtering schemes that can be used. We have had good success using Gabor filters in the past. One approach is to use a number of different Gabor filters at each location, followed by a threshold. Each filter would be sensitive to a slightly different orientation. Filters should overlap in orientation discrimination so that, for any given edge, you would have multiple neighboring Gabor filters with responses above threshold. This leads to a coarse distributed input suitable for the spatial pooler.
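Option (b) can be sketched in plain Python as a small bank of Gabor filters followed by a threshold. This is an illustration under stated assumptions, not the encoder used in the Numenta experiments: the function names, the kernel parameters (`size`, `wavelength`, `sigma`), the use of the absolute response, and the threshold value are all hypothetical choices.

```python
import math

def gabor_kernel(size, theta, wavelength=4.0, sigma=2.0):
    """Return a size x size (size odd) zero-mean Gabor kernel at orientation theta."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the cosine carrier varies along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + yr * yr) / (2 * sigma * sigma))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    # Subtract the mean so uniform image regions give zero response.
    mean = sum(map(sum, kernel)) / (size * size)
    return [[v - mean for v in row] for row in kernel]

def filter_response(image, kernel, cx, cy):
    """Response of one kernel centered at pixel (cx, cy); off-image pixels count as 0."""
    half = len(kernel) // 2
    total = 0.0
    for ky, row in enumerate(kernel):
        for kx, weight in enumerate(row):
            y, x = cy + ky - half, cx + kx - half
            if 0 <= y < len(image) and 0 <= x < len(image[0]):
                total += weight * image[y][x]
    return total

def gabor_encode(image, cx, cy, num_orientations=8, size=7, threshold=2.0):
    """One bit per orientation: 1 if the absolute filter response exceeds threshold."""
    bits = []
    for i in range(num_orientations):
        theta = math.pi * i / num_orientations
        response = filter_response(image, gabor_kernel(size, theta), cx, cy)
        bits.append(1 if abs(response) > threshold else 0)
    return bits

# Example: a one-pixel-wide vertical bar in a 9x9 greyscale image.
image = [[1.0 if x == 4 else 0.0 for x in range(9)] for y in range(9)]
print(gabor_encode(image, 4, 4))
```

For the vertical bar in this example, the filter aligned with the bar and its two nearest neighbors in orientation exceed the threshold, while the remaining filters stay off. That overlap between neighboring orientations is what produces the coarse distributed input described above. Concatenating these bit vectors over all locations would give the input to the spatial pooler.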
Experimental design: what do we want the CLA to learn?
[Discussion of onion plots]
Hannah - This dataset is based on the movie “Hannah and Her Sisters” by Woody Allen. The full movie (153,825 frames) has been manually annotated by a single annotator for several types of audio and visual information. Audio annotation indicates speech segments and associated speaker identification (consistent with face identification). Visual annotation covers all shot boundaries and all identified face tracks within shots.
MPI Sintel Dataset - A naturalistic open source movie for optical flow evaluation.