VISION, by David Marr

Published posthumously, this tome defined the genre of computational neuroscience.
I’ve read through it and I’d like to summarize and implement some of these ideas.
Although it was intimidating before I started, I found it quite enjoyable to read.

The book begins by defining the purpose of the visual system, and Marr relates everything to this purpose: “Vision is the process of discovering from images what is present in the world, and where it is.”

Each section of the book states a problem to be solved and then analyses it at several levels. First Marr looks at the computational theory: what is the goal to be accomplished and how to achieve it at an algorithmic level. Then he looks at what information needs to be represented and how the inputs are transformed into the outputs. Finally he looks at how the process can be realized physically, either by a computer or by the brain.

Inside of the Retina

The first stages of visual processing happen inside of the retina. The retina detects light and immediately applies mexican-hat shaped filters to it. The filtered image is transmitted to the brain, not the raw light intensities. The choice of mexican-hat filter is well justified, both by theory and by biology. This filter responds to variations in the input, but not to areas of constant intensity or to linear gradients. It is also only sensitive to input features of a similar size to the filter itself. There are at least four different sizes of mexican-hat shaped filters, sized at powers of two of each other, so the retina can detect features over a broad range of scales.
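As a rough sketch, the mexican-hat filter can be approximated by a difference of Gaussians. The 1.6x center/surround ratio and the use of SciPy here are my own assumptions, not from the book:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mexican_hat(image, scale):
    # Difference-of-Gaussians approximation of the mexican-hat filter.
    # The surround is ~1.6x wider than the center (an assumed ratio).
    center = gaussian_filter(image, sigma=scale)
    surround = gaussian_filter(image, sigma=1.6 * scale)
    return center - surround

# Apply the filter at four scales, each a power of two apart.
image = np.random.rand(64, 64)  # stand-in for a greyscale image
filtered = [mexican_hat(image, 2.0 ** k) for k in range(4)]
```

Note that an area of constant intensity produces zero response, since both Gaussians blur it to the same constant.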

These transformations preserve almost all of the incoming visual information and it is possible to mostly reconstruct the original image from the filtered outputs. The notable exception is that while the relative differences between pixels are preserved, the absolute magnitude of the image’s light intensity is lost.

To demonstrate these transforms, I will apply them to this test image:

Converted to greyscale and with the mexican-hat filters applied:

Colors are processed by taking the difference between the color channels before applying the mexican-hat filters. The retina subtracts (red - green) and (blue - yellow). Here is a false-color representation of the result:
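A minimal sketch of the opponent-color computation, treating yellow as the mean of the red and green channels (that definition is my assumption):

```python
import numpy as np

def opponent_channels(rgb):
    # rgb: H x W x 3 float array. Returns the two opponent channels,
    # which would then be passed through the mexican-hat filters.
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    yellow = (r + g) / 2.0           # assumed definition of yellow
    return r - g, b - yellow         # (red - green), (blue - yellow)
```

For a grey image all three channels are equal, so both opponent channels are zero.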

Finally, the retina takes the derivative of the filtered greyscale image. This will be useful later for detecting motion.

For a more recent and in-depth review of the retina’s biology see: . However, it’s worth reading VISION first because that review does not attempt to explain the computations of the retina.


The Primal Sketch

The next stage of processing seeks to understand the 2D image in a very literal way. The image is decomposed into primitive features, such as edges, lines, or blobs of color. Marr calls this the primal sketch. The purpose of these primitive tokens is to represent the aspects of the 2D image which correspond to the 3D structure of physical objects.

The most basic primitives are the “zero-crossings” in the retinal inputs. The zero crossings are caused by the mexican-hat filters passing over sudden changes in light intensity. They are essentially a form of “edge detection”. At this point Marr discusses the nature of edges in 2D images versus in the 3D world. All (detectable) edges in the 3D world will have some kind of corresponding edge in the 2D image, and these 2D edges will almost certainly be visible at multiple scales. However there are also edges in the 2D image which do not correspond to an edge in the 3D world.
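A zero crossing is simply a place where the filtered image changes sign between adjacent pixels. A minimal detector (my own sketch, not Marr's algorithm) might look like:

```python
import numpy as np

def zero_crossings(filtered):
    # Mark pixels where the sign of the filtered response differs
    # from a horizontal or vertical neighbour.
    sign = filtered > 0
    crossings = np.zeros(filtered.shape, dtype=bool)
    crossings[:, :-1] |= sign[:, :-1] != sign[:, 1:]  # horizontal
    crossings[:-1, :] |= sign[:-1, :] != sign[1:, :]  # vertical
    return crossings
```

Running this on the outputs of several filter sizes would show which edges are visible at multiple scales.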

Then the zero crossings are built up into various larger tokens, such as line segments, curves, circular blobs, etc. Of particular importance are the terminations (end points) of line segments, because they are highly recognizable and can be located very precisely, which makes them good features. In addition, these tokens can carry a lot of associated information, for example about color, contrast, size, or orientation.

The brain groups together primitives which are similar to each other and in close proximity. These groups can form “virtual lines” where you perceive a line connecting the primitives, even though no line truly exists. They can also form “virtual edges” at the boundaries of the group. Groups of primitives are also primitive features which can be composed into larger groups, in a recursive fashion.

Figure 2-34

At this point we have detected all of the low-level 2D image features which we will use. The next stage of processing is to sort through the features and piece together an understanding of the physical surfaces which generated these image features.


In 1958 Hubel and Wiesel discovered “simple cells”, which are neurons in the primary visual cortex (V1) that respond to edges. These cells appear to detect the zero crossings that Marr writes about. Since then, more cells have been found in V1 which detect other features that Marr writes about.


Simulating Simple Cells

I used a spatial pooler to simulate the “simple cells” of the primary visual cortex, to detect the primitive features and to form the primal sketch. This post describes how I did it and analyzes the results.

Encoding Positive & Negative Inputs

The retina transmits positive and negative values along different axons; it contains two separate pathways for applying the mexican-hat shaped filters for this purpose. This encoding scheme makes it very easy for downstream neurons to detect the zero crossings in the filtered image.
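A sketch of this ON/OFF encoding, assuming the filtered responses arrive as a signed array:

```python
import numpy as np

def split_on_off(filtered):
    # Separate positive and negative responses into two non-negative
    # channels, mimicking the retina's separate ON and OFF pathways.
    on = np.maximum(filtered, 0.0)
    off = np.maximum(-filtered, 0.0)
    return on, off
```

The original signed response is recoverable as `on - off`, and a zero crossing shows up wherever the ON and OFF channels are active at adjacent locations.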

Rate-Coded Inputs

The retina uses rate-coding to encode real values, and I modified the spatial pooler algorithm to support this. I replaced the regular binary synapses with weighted synapses, and I replaced the learning rule with:
delta_weight = learning_rate * presynaptic_input * postsynaptic_activity


  • The postsynaptic activity is still a binary value (0 or 1), so only active cells learn.
  • Weights are always positive; this rule only ever increases the weights.

After applying the learning rule, the sum of synaptic weights into each cell is normalized to one. This implements the synaptic decrement for inactive synapses, and it also controls the positive feedback caused by Hebbian learning.
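Putting the learning rule and the normalization together, a minimal dense-matrix sketch (real spatial pooler implementations use sparse data structures, so this is only illustrative):

```python
import numpy as np

def update_weights(weights, presynaptic_input, postsynaptic_activity,
                   learning_rate=0.1):
    # weights: one row of synaptic weights per cell.
    # postsynaptic_activity is binary, so only active cells' rows change.
    weights = weights + learning_rate * np.outer(postsynaptic_activity,
                                                 presynaptic_input)
    # Normalize each cell's weights to sum to one; this decrements the
    # inactive synapses and keeps Hebbian positive feedback in check.
    return weights / weights.sum(axis=1, keepdims=True)
```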

Convolution for Consistent Representations

Another key modification to the spatial pooler algorithm is how it handles topology. The spatial pooler algorithm, as described by Numenta, spreads its cells over the input space, and each of those cells has synapses to the nearby inputs. This allows the population of cells to cover the whole input space despite the fact that each cell has a small local receptive field. Instead, I modified the spatial pooler to use a single set of cells which is repeated at every location in the input space. This is the convolution trick borrowed from convolutional neural networks. The primary advantage of convolving a single set of cells across the image (instead of having different cells at each location) is that the outputs are comparable everywhere because they use the same synaptic weights: given identical image patches, the convolved cells will have the same response. Note that this modification is likely incompatible with foveated eyes; it only works with regular flat 2D images.

This change also enables significant run-time optimization and drastically reduces the memory required to store the synapse data.
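The convolution trick can be sketched by cross-correlating each cell's weight kernel with the image (the names and shapes here are illustrative, not from the actual implementation):

```python
import numpy as np
from scipy.signal import correlate2d

def convolved_overlaps(image, cell_weights):
    # Slide every cell's weight kernel across the image. Because the
    # same weights are used at every location, identical image patches
    # always produce identical overlaps.
    return np.stack([correlate2d(image, w, mode='valid')
                     for w in cell_weights])

cells = np.random.rand(8, 5, 5)            # 8 cells, 5 x 5 receptive fields
overlaps = convolved_overlaps(np.random.rand(16, 16), cells)
winners = overlaps.argmax(axis=0)          # winning cell at each location
```

Only one small set of weight kernels is stored, which is also where the memory savings come from.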


I trained my modified spatial pooler on natural imagery; I used images from the documentary “Planet Earth” by David Attenborough. The cells’ receptive fields are 5 x 5 pixels.

I reverse engineered the synaptic weights to find the images which maximally activate each cell, and I found many cells that detect the features described by Marr and found experimentally by Hubel and Wiesel. In addition, there are cells tuned to a variety of different colors, sizes, and angles.

Cells that respond to edges:


Cells that respond to lines:


Cells that respond to line terminations:


Cells that respond to blobs:


There are also some cells that respond to miscellaneous/uncategorized inputs.



But I still don’t get why they call it spatial pooling, instead of the far more suggestive and established term lateral inhibition?
Also, what’s interesting here is that lateral inhibition in the retina forms driving outputs, while in the cortex it’s supposed to only suppress them. Not that the latter is not needed, but the former must happen in the cortex too; some gradients are meaningful even in sparse representations?

Update: I’ve begun porting my work thus far onto a graphics card with the goal of running as fast as the brain runs. Evidence suggests that the brain processes visual inputs at about 10 frames per second (for more info see: Alpha wave - Wikipedia). I’m using CUDA because it’s honestly the best solution for this kind of work. Devil take my soul; NVIDIA take my money. So far I’ve converted the retina’s code, and its runtime went from ~100 Hz to ~1,000 Hz even though I haven’t yet made a serious effort to optimize it. Because the retina processes images, and because graphics cards are designed to process images, I was able to use some of the card’s special-purpose hardware features to improve performance. In theory all of the computations should be “embarrassingly parallel”, so using a graphics card should be both easy and highly effective. When the project is a bit more finished I will publicly release the source code; if I don’t release it and you’re interested, just PM me.


If you want to get a deeper understanding of the retina, I’d like to recommend the very interesting Virtual Retina project from INRIA, which models the Magnocellular and Parvocellular pathways.
I used it some years ago for encoding images for HTM (spatial pooler or GridCell).

Here are the links:


VIRTUAL RETINA allows large-scale simulations of biologically-plausible retinas, with customizable parameters, and different possible biological features:

  • Spatio-temporal linear filter implementing the basic Center/Surround organization of retinal filtering.
  • Non-linear contrast gain control mechanism providing instantaneous adaptation to the local level of contrast. This stage is modelled through dynamical adaptation conductances in the membranes of bipolar cells; the resulting model reproduces contrast-dependent amplitude and phase non-linearities, as measured in real mammalian retinas by Shapley & Victor 78.
  • Spike generation by one or several layers of ganglion cells paving the visual field. Magnocellular and Parvocellular pathways can be modelled in the same framework according to the parameters chosen. Large-scale simulations can be pursued on up to 100,000 spiking cells.
  • Possibility of a global radial inhomogeneity modeling the foveated organization of mammalian retinas. In this case, the spatial scales of filtering, and the density of spiking cells, both depend on the eccentricity from the center of the retina.
  • Possibility to include a basic microsaccades generator at the input of the retina, to account for fixational eye movements.

Hi, @dmac, interesting pursuit you have here.

  • What kind of image input / SP output do you use at 100 Hz? I mean things like input resolution, colors, output SDR size and solidity.
  • One problem with any vision (sub)system is evaluating its worth. “How good is it” depends on what we mean by good. Do you have in mind any metrics for measuring it?
  • One limited, crude, but cheap such metric would be using the generated SDRs (aka embeddings) in an ML task like MNIST digit classification. You can use e.g. htm.bindings.algorithms.Classifier - not the best, but quite fast. The purpose isn’t to break MNIST records but to compare the “worth” of the embeddings generated by different types of encoders.
  • E.g. I’ve seen previous trials here which SDR-fied DL vision autoencoder embeddings, but they also lacked any subsequent evaluation metric.

I improved my retina by normalizing the contrast.

Basically, in my previous model the output was a linear combination of the inputs. One issue with linear filters is that the output of the filter is linearly proportional to the magnitude of the input stimulus.

In my improved model there is an additional non-linear filter which normalizes the contrast of each patch of the linear filters’ output. One result is that weak stimuli are amplified and strong stimuli are weakened. This allows the retina to represent a larger range of stimulus strengths with a smaller range of output values. Another result is that the retina encodes the relative strengths of nearby stimuli, because the filter looks at large patches of the image: adjacent stimuli attenuate each other.

The contrast normalization filter is implemented in two steps:

  1. First it averages together all of the linear responses in an area. I use a binomial filter instead of the more standard Gaussian filter because it’s simpler to implement and runs faster.
  2. The linear response of each pixel is transformed by the equation:
    linear_response / (1/gain_factor + normalization_factor*average)
    Where gain_factor and normalization_factor are user controlled constants.
    Where average is the average of the linear responses in the neighborhood of the pixel.
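The two steps above can be sketched as follows. The number of binomial passes and the use of the absolute response for the local average are my assumptions, not stated in the post:

```python
import numpy as np
from scipy.ndimage import correlate1d

def binomial_blur(x, passes=4):
    # Approximate a Gaussian blur by repeatedly applying the separable
    # binomial kernel [1, 2, 1] / 4 along both axes.
    kernel = np.array([0.25, 0.5, 0.25])
    for _ in range(passes):
        x = correlate1d(x, kernel, axis=0)
        x = correlate1d(x, kernel, axis=1)
    return x

def normalize_contrast(linear_response, gain_factor=10.0,
                       normalization_factor=2.0):
    # Divisive normalization: weak stimuli are amplified and strong or
    # crowded stimuli are attenuated by their neighbourhood average.
    average = binomial_blur(np.abs(linear_response))
    return linear_response / (1.0 / gain_factor
                              + normalization_factor * average)
```

With the parameters shown, a uniform weak response of 0.01 is boosted several-fold while a uniform response of 1.0 is pushed below 0.5, which matches the amplify-weak / attenuate-strong behaviour described above.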

I was inspired by this prior model of the retina which also does contrast normalization. Thank you @Thanh-binh for posting it!

  • Virtual Retina: A biological retina model and simulator, with contrast gain control
    Adrien Wohrer & Pierre Kornprobst, 2009
    DOI 10.1007/s10827-008-0108-4


I ran the retina with the following parameters:
No contrast normalization: gain_factor = 5, normalization_factor = 0.
With contrast normalization: gain_factor = 10, normalization_factor = 2.

Left: No Contrast Normalization | Right: With Contrast Normalization

Notice that with contrast normalization you can see the color of the pizza in the pizza box, more details on the cat’s face, and the light switch in the top-right corner. The faint stimuli are amplified by about 2x, and yet the strong stimuli are not saturated.

@dmac If you are interested in contrast gain control, then Virtual Retina is a very good choice, because it models it in the Parvocellular pathway of the LGN (specifically, Parasol cells). Virtual Retina can also generate spikes from the Magnocellular and Parvocellular pathways (in the M-Spi… and P-Spi… pictures), which allows us to encode streaming video in any spike type. In my experiment, I used hexagonal neurons for the whole image, watched how each neuron generates spikes, and formed them together into a gridcell encoder (the bottom picture).


@dmac I want to test your algo. Could you please share your source code? Thanks

Just out of curiosity, were you able to get the Virtual Retina repository to compile? When I try it, the build keeps failing to find the mvaspike repository. The original site no longer exists, and I can’t seem to find a backup of the site or its files anywhere. (Caution: links are broken.)


@CollinsEM There are two ways to solve this problem:

  1. Contact any INRIA author related to Virtual Retina, or the BioVision Team:
    Contact & venue – Biovision Lab
    to get the complete software.
  2. If you use Ubuntu, I will check whether any pre-compiled library is available on my machine. In that case, please send me an email.
    I hope that helps.

Really interesting post; I have been looking at the retina recently as well. The computation performed in the retina is quite remarkable and seems a natural starting point for more accurate biological models of intelligence for visual input.

I have a question about your original post; apologies if I have just failed to understand. Are you applying predefined filters to images as encoding schemes for the pooler, or are you inputting pixel values and then using a modified pooler to perform the computation?

Also, you say that you model the pooler as consisting of a repeated set of neurons to simplify computation. Do you have any more information about the biological justification for this?

In your post about the modified learning rule, what is the “presynaptic_input” term? Is it a binary value, a pixel value, or something else?