Book Review: Sparse Distributed Memory by Pentti Kanerva - April 7, 2021

Our research intern Alex Cuozzo discusses the book Sparse Distributed Memory by Pentti Kanerva. He first explores a few concepts related to high-dimensional vectors mentioned in the book, such as rotational symmetry and the distribution of distances. He then talks about the key properties of the Sparse Distributed Memory model and how it relates to a biological one. Lastly, he gives his thoughts and explores some follow-up work that aims to convert dense factors to sparse distributed activations.

“Sparse Distributed Memory” by Pentti Kanerva
“An Alternative Design for a Sparse Distributed Memory” by Louis Jaeckel
“A Class of Designs for a Sparse Distributed Memory” by Louis Jaeckel
“Comparison between Kanerva’s SDM and Hopfield-type neural networks” by James Keeler
“Notes on implementation of sparsely distributed memory” by James Keeler et al.


I have a project based on Kanerva's BSC vectors, an item-based Cleanup Memory, and even a simple Prolog interpreter built on VSA.

In the past I also implemented SDM … it is easy, just two arrays: one randomized binary array for the locations/addresses in 2^n space (say, 10,000 addresses out of 2^1000) and one with the content vectors, which collect the sums. The homing algo is a bit harder, but not much … etc.
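The two-array scheme above can be sketched in a few lines. This is my own toy illustration, not the poster's code: the dimensions, activation radius, and the bipolar counter update are all assumptions scaled down so it runs quickly.

```python
# Minimal SDM sketch: a fixed random address array plus a counter array
# that accumulates written content (toy sizes; the post uses 10,000
# addresses in a 2^1000 space).
import random

random.seed(42)
N = 256    # vector dimensionality (toy size; the post uses 1000+ bits)
M = 1000   # number of hard locations
R = 112    # activation radius in Hamming distance (tuned for N=256)

def rand_vec(n=N):
    return [random.randint(0, 1) for _ in range(n)]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

addresses = [rand_vec() for _ in range(M)]   # fixed random hard locations
counters  = [[0] * N for _ in range(M)]      # content array: collects sums

def write(addr, data):
    for loc, cnt in zip(addresses, counters):
        if hamming(addr, loc) <= R:          # location is activated
            for i, bit in enumerate(data):
                cnt[i] += 1 if bit else -1   # bipolar counter update

def read(addr):
    sums = [0] * N
    for loc, cnt in zip(addresses, counters):
        if hamming(addr, loc) <= R:
            for i in range(N):
                sums[i] += cnt[i]
    return [1 if s > 0 else 0 for s in sums] # majority vote per bit

v = rand_vec()
write(v, v)              # autoassociative store
assert read(v) == v      # clean recall
```

The iterative "homing" the poster mentions would repeatedly feed `read`'s output back in as the next address until it converges.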

If memory serves me correctly … the performance on time series was comparable to TM … maybe!! … it was 7-8 years ago.
Also, performance degraded as you went below 300 bits.

And AFAIK the ideas for VSA started out of Holographic Reduced Representations (HRR), which are real-valued vectors of size 100-300, same logic … Tony Plate; there is a book.

What is Bi ?

When you use programming languages, you depend on the symbols of the language having a specific meaning; it is very rare for a programming language to allow fuzzy, context-based meaning, because that complicates the implementation. In natural languages, concepts/symbols are both discrete and fuzzy at the same time.

To make programming languages more natural we have to embrace this dichotomy… One way to make symbols behave this way is to represent symbols and/or context as vectors, achieving fuzziness while still preserving discreteness.

Something like having your cake and eating it too.

That is why Bi is built on top of the so-called VSA (Vector Symbolic Architecture).

github.com/vsraptor/bi/blob/master/docs/Bi_framework.ipynb

Local vs Distributed

Information in computers is stored locally, that is, in records with fields. For example, let's say we want to encode the colors red, violet, orange, and green. If we use a local representation, we have to use 4 bits, one for each 'feature'.

In a local representation, one unit is used per concept. This is also common in neural nets.

The alternative is to distribute information from many sources over shared units. Thus, in our previous color example, we could use 2 or 3 bits to represent those colors, using multiple overlapping bits to represent a 'feature'.

Dense vs Sparse

Representation could also be Dense or Sparse.

For example, Numenta's Grok SDRs use 2048-bit binary vectors with a 2% sparse representation, i.e. 2% of the bits are ONE and the rest are ZERO.

SDPs, on the other hand, are dense: 10,000-bit binary vectors with 50% of the bits set to ONE.

SDPs are sparse in a different sense: they sparsely occupy the space they are allowed to occupy, as you will see in a minute.


The other feature of vectors that is of utmost importance is the similarity between vectors. Real-valued vectors normally employ geometric measures like Euclidean distance. In our case we are interested in binary vectors, and here again SDR and SDP differ.

SDR uses “Overlap” i.e. count_ones(A AND B)

SDP uses “Hamming distance” i.e. count_ones(A XOR B)
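Both measures are one-liners on Python integers used as bit vectors (my sketch, not the Bi API):

```python
# The two similarity measures: overlap for SDRs, Hamming distance for SDPs.
def overlap(a, b):
    """SDR-style similarity: count of shared 1-bits, count_ones(A AND B)."""
    return bin(a & b).count("1")

def hamming(a, b):
    """SDP-style distance: count of differing bits, count_ones(A XOR B)."""
    return bin(a ^ b).count("1")

a = 0b11001010
b = 0b10011010
assert overlap(a, b) == 3   # 1-bits in common
assert hamming(a, b) == 2   # bit positions that differ
assert hamming(a, a) == 0   # identical vectors are 0 bits apart
```

Note that overlap is a similarity (bigger means closer) while Hamming distance is a distance (smaller means closer).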

Kanerva : BSC (Binary Spatter Code)

In Bi, at the base of it all are binary hyper-vectors, or in Bi lingo Semantic Distributed Pointers (SDP). In the current implementation an SDP hyper-vector is a 10_000-bit binary vector (the size can be changed easily, but according to Kanerva that is the optimal size, and who am I to opine).

Why binary ?

I’ve been pursuing a binary approach to machine intelligence (I make a distinction between MI and AI) for a long time, because it is a no-brainer. It simplifies implementation, and when people finally wake from the “real-valued” hegemony ;) we will see some tremendous CPU acceleration in MI programs. Can you imagine there are not even CPU-optimized software libraries for large bitarrays in almost any computer language!!! I’ve checked ;(. BTW, FYI, computers are machines built on binary!! Go figure …

Why so big ?

The reason is that at those sizes we get the required characteristics for doing clever stuff.

The main characteristic of large vector spaces is ORTHOGONALITY. (Very important!) (If you don’t know what orthogonal means, think “perpendicular” in 2D space.)

If we randomly create a 10_000-bit vector, where every bit has a 50/50 chance of being 0 or 1, we are guaranteed that every time we do so the vector will be nearly orthogonal to all the previously created vectors. Wow, isn’t that amazing? (Kanerva mentioned somewhere that this holds even if you create millions of vectors.)

(Or we can talk about normally distributed real-valued vectors with 100-300 dimensions: HRR.)

This is HUGE, think about that for a minute ?!!

Every time we create a new SDP, it is guaranteed to be far away from any previously created one. Perfect for defining new terms (data atoms, symbols) to which we can attach new meaning.
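The near-orthogonality claim is easy to check empirically. A quick sketch (the sample count of 20 vectors is an arbitrary choice of mine):

```python
# Random 10,000-bit vectors: every pair lands at roughly half the bits
# apart, i.e. near-orthogonal in Hamming terms.
import random

random.seed(1)
N = 10_000

def hamming(a, b):
    return bin(a ^ b).count("1")

vecs = [random.getrandbits(N) for _ in range(20)]
dists = [hamming(vecs[i], vecs[j])
         for i in range(20) for j in range(i + 1, 20)]

# all 190 pairwise distances sit close to N/2 = 5000 bits
assert all(4700 < d < 5300 for d in dists)
```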

Let’s do the following thought experiment …

Imagine a sphere, with you standing at the north pole. If we assign that point the meaning of the word “car”, then points closer to the north pole will be more similar to “car”: e.g. “sports car”, “mazda”, “blue car”, “car with 3 wheels”, “bus”, rather than concepts like “chair”, “sky”, “grass” or “computer”, which will be further away.

The majority of the points on the sphere will be on the equator, i.e. all the car-unrelated terms. In fact the points around the equator are 99.XX% of all the points; that is why they are so orthogonal. But not to worry: even less than 1% of 2^10_000 points is a huge number, i.e. you can still have billions of them to represent things similar to “car”.

In a 10_000-bit space only a billionth of the points are closer than 4700 bits, and another billionth are further than 5300 bits away, i.e. essentially all points lie between 4700 and 5300 bits of Hamming distance (and 99.7% lie within the even narrower 4850-5150 band).
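Those figures follow from the binomial distribution: the distance between two random 10_000-bit vectors is Binomial(n, 1/2), with mean n/2 = 5000 and standard deviation sqrt(n)/2 = 50. A small check via the normal approximation (my derivation, not from the book):

```python
# Distance distribution for random n-bit vectors: mean n/2, sd sqrt(n)/2.
# 150 bits is 3 sigma (the classic 99.7%); 300 bits is 6 sigma (tail
# probability on the order of a billionth).
from math import erf, sqrt

n = 10_000
sd = sqrt(n) / 2                       # 50 bits

def frac_within(bits):
    """Fraction of random pairs within +/- bits of the mean distance."""
    return erf(bits / sd / sqrt(2))    # normal approximation

assert abs(frac_within(150) - 0.997) < 0.001   # 4850..5150 band
tail = (1 - frac_within(300)) / 2              # points closer than 4700
assert tail < 2e-9                             # about a billionth
```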

Make any point/concept your north pole and you will see the same distribution relative to this new position.

From this metaphor we can see the benefit of highly orthogonal spaces. (BTW, in reality the space is a hypercube.)

Creating the knowledge “point” (SDP) is the most basic operation we can do to instantiate new meaning/knowledge/data-point/symbol. It is our starting point, from which everything blossoms. At the same time, being distributed and large guarantees that it will be noise resistant.

In traditional programming languages, when two pointers point to the same place we say they are identical, but if we change even a single bit they become totally different things. Not so with SDPs: similar vectors imply similar meanings. The SDP is both the pointer/address and the content/meaning at the same time.

… more in the docs


BTW … at the end of the video somebody mentioned VSA. You should know you can do VSA stuff with SDRs.

I have a project for that too :wink:

BASIC Operations

There are three basic operations we can apply to SDPs/SDRs: union, concatenation and thinning.

Here is what distinguishes them:

  1. UNION : as the name implies, this op combines all the 1-bits; it is the logical OR operation. What is important to remember is that vsize is preserved but sparsity decreases (i.e. the SDP becomes denser, with more 1-bits).
  2. CONCATENATION : joins the SDPs, but also shifts the bit positions, so that every operand “moves” into its own segment. As a result, vsize increases and sparsity stays the same.
  3. THINNING : randomly picks bits from the operands with equal probability. vsize and sparsity stay the same. The other option is to use Context Dependent Thinning (CDT), isdp.cdt().
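The three operations can be sketched on plain bit-lists. This is illustrative only; the actual Bi/isdp API surely differs, and the per-position random choice is just one way to read the thinning description above:

```python
# Union, concatenation and thinning on bit-lists, showing how each
# affects vsize (vector length) and sparsity (fraction of 1-bits).
import random

random.seed(7)

def union(*vs):
    # OR the operands: same vsize, more 1-bits (sparsity decreases)
    return [int(any(bits)) for bits in zip(*vs)]

def concat(*vs):
    # join the operands, each in its own segment: vsize grows,
    # overall sparsity is preserved
    return [b for v in vs for b in v]

def thin(*vs):
    # per position, keep the bit of one randomly chosen operand:
    # vsize and expected sparsity stay the same
    return [random.choice(bits) for bits in zip(*vs)]

a = [1, 0, 0, 1, 0, 0, 0, 0]   # 8 bits, 25% sparse
b = [0, 0, 1, 0, 0, 1, 0, 0]   # 8 bits, 25% sparse

assert sum(union(a, b)) == 4   # denser than either operand
assert len(concat(a, b)) == 16 # twice the vsize, still 25% sparse
assert len(thin(a, b)) == 8    # same vsize as the operands
```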