The power of bit pair expansion


I’m not sure this is a good term for it. Other term could be bit pair addressing.

One can think of it as a simple function that:

Takes any SDR (or vector of bits with a majority of bits 0) of size N with P 1 bits, P << N, and transforms it into a much larger and sparser SDR of size N*(N-1)/2 and P*(P-1)/2 one bits.

The output space represents all possible bit pairings in the input N bit space, and the 1 bit positions in the output vector are formed by all possible pairings of bits of value 1 in the input space.

A short example, having an input space of N=10 bits and P=4 (four active bits), or 4/10 input like:

Input in dense format: [0,1,0,1,1,0,0,0,1,0]
Input in sparse format: [1,3,4,8]
Output pairs: [(1,3), (1,4), (1,8), (3,4), (3,8), (4,8)]
We see there are 6 output pairs which is 4*(4-1)/2
The total number of possible pairs for N=10 is 10*(10-1)/2 = 45 possible pairs in the output space.

So the bit pair expansion transforms the 4/10 input to an 12/45, so sparsity decreases from 0.4 to 0.26 while the storage space increases proportional to N**2/2

When we work with more… “meaningful” SDRs the expansion is significant, e.g. a 50/1000 bit SDR expands to a “huge” 1225/499500 bit SDR.

Ok, I hope the above is a clear enough, the obvious following question is why such a redundant, wasteful encoding would be more “powerful” than the original one in any way?
If you try to feed that into a Spatial Pooler or Temporal Memory almost certainly would slow them dramatically, assuming the available RAM is not exhausted first.

Well, let’s begin with the SDR Classifier, from the HTM core package.

When trained directly against MNIST images (represented as a 784 bit SDR), the classifier plateaus at almost 92% accuracy.
When trained on the bit pair encodings of the same images, the same classifier will plateau at over 97.8% accuracy.
Otherwise said, failure rate drops almost four times from 8% to 2.2%

Just to be clear - bit pair expansion only creates an intermediate very wide (~300 kbits) and very sparse hidden layer between input and classifier, without weights, biases, no convolutions nor data augmenting, no parameters to tune, so no time wasted to train a first layer of ~240M potential weights, just a simple and fast expansion function, resulting in accuracy loss dropping from 8% to 2.2%

I think this result alone is reason enough to motivate a few of you to further research/investigate towards two key questions:

  1. why is this happening? like… what properties hidden in the input data are exposed, and which an output single layer classifier finds so useful.
  2. what other interesting SDR/ML machinery can be made based on this function: bit pair expansion

this bit pair expansion kinda resembles the outer product binding operation on those vector symbolic architectures, but you are binding the vector with itself.

I’m not familiar with that.
However it is possible and easy to compute expand bit pairs between two SDRs as in triadic memory.
In @POv 's terminology, “diadic” refers to auto correlating bit pairs within a single SDR and “triadic” to bit pair correlations between two SDRs.
In the original post I outlined only the diadic expansion to keep message short and also because I consider it the most general form of the two. Bit pairing between two SDRs is only a subset of what you get by applying bit pairing to a concatenation of the same SDRs.

To further expand the subject, I reached the point of two questions -

  1. why is bit pair interesting or useful, what properties makes it that way.
  2. what can it be used for.
    I shall add:
  3. what are its drawbacks and limitation.

I will begin with 2. because we already have quite a few experimental results. Other use cases might be invented.

Here is a compact enumeration of what I know of:

  • significant classification improvement when a classifier is trained on the bit pair expanded SDR instead of the original SDR (already mentioned in the first message above)

  • triadic and diadic associative memories extensively discussed in another thread
    These are SDR equivalent of either key->value or (key,key)->value repositories, having the associative property which means

    1. with imperfect/degraded/noisy versions of the keys, chances of correct value retrieval is still high
    2. even when retrieved value is different than the stored one it has a high bit overlap to the original stored value.
  • Based on the triadic memory @POv also implemented a sequential memory which predicts following SDRs in time series with promising initial results.

  • ValueMap is an associative memory that maps SDRs to a vector of scalars. So far it proved useful in two ML tasks:

    1. CartPole RL problem, where a SDR is used to encode state and values stored for different states encode a Q-value of agent’s choice of moving the cart pole left vs right.
    2. MNIST classification problem applied to the un-encoded B/W, flattened, 784 bit size MNIST image, a makeshift classifier in which a 10 value vector encode chances of a SDR being ‘0’…‘9’ digit class, plateaus at ~97.6 accuracy.
      Slightly higher results, around 97.8% are reached using a 2d or fly hash encoding
  • ID (or pointer) map. In this case the SDR is mapped to a set of potential IDs. An ID is a simple integer. So an ID map can connect SDRs to arbitrary data. This sounds a bit abstract, here are a few potential use cases:

    1. retrieving the exact most recent moments in time (time steps) when a current time SDR eventually occured
    2. a variant of near neighbor indexer, that was tested promising results in (again) MNIST digit classification.
    3. Sequential memory / next time step prediction.

Ok, I’ll pause here, I guess this is already TLDR :slight_smile:

I view this bit expansion as a form of pattern sepparation, maybe thats why its increasing performance on tasks that would benefit from better sepparated sdrs.

in a way, you may be doing something similar to cerebellum, which also does this kind of expansion but with a 4bits fan-in on average.

I was also looking for a computationally inexpensive way to perform pattern separation on CPU and I came up with a method which I call “hash bind”, it basically replaces thousands of random projections and WTA compartments with a few xor operations on bit indexes to decide the final winner. I believe if you tweak the parameters it will have a very similar result to bit pair expansion.

could you test it and do a comparison on how it performs compared to bit pair expansion?

from typing import List
from random import sample, seed

def hash_2d(i: int, j: int) -> int:
    return hash((i, j,))

def hash_bind_op(sdr: List[int], mix_rounds: int, sdr_size: int, p: int) -> List[int]:
    compartments = [0xffffffff] * p
    for i in sdr:
        for j in range(mix_rounds):
            compartments[hash_2d(i, j) % p] ^= hash_2d(~i, j)

    return list(set(i % sdr_size for i in compartments))

input_sdr = sample(range(1000), 10)
input_sdr2 = input_sdr.copy()
input_sdr2[0] = -4
# input_sdr is identical to input_sdr2 except for a single bit

out_sdr = hash_bind_op(input_sdr, mix_rounds=4, sdr_size=100000, p=20)
print('input sdr:', input_sdr)
print('hashed sdr:', out_sdr)
# input sdr: [864, 394, 776, 911, 430, 41, 265, 988, 523, 497]
# hashed sdr: [64513, 28043, 13069, 25749, 35225, 77601, 10148, 82099, 19510, 87110, 5326, 7893, 90838, 58075, 23518, 67295, 12643, 22123, 94448, 6135]

out_sdr2 = hash_bind_op(input_sdr2, mix_rounds=4, sdr_size=100000, p=20)
print('input sdr:', input_sdr2)
print('hashed sdr:', out_sdr2)
# input sdr: [-4, 394, 776, 911, 430, 41, 265, 988, 523, 497]
# hashed sdr: [64513, 69382, 28043, 13069, 19476, 35225, 39834, 10148, 82099, 19510, 5691, 87110, 90838, 41943, 58075, 23518, 67295, 94448, 6135]

print('difference:', set(out_sdr) ^ set(out_sdr2))
# difference: {69382, 5326, 19476, 25749, 41943, 7893, 39834, 77601, 12643, 22123, 5691}

Possible but I think it is more to it. If two sdrs overlap 50% of their 1 bits, their respective bit pair expansions overlaps on 25%
Which I assume it matters.

Testing a bit your hash_bind_op() it spread similar SDRs too far apart. I didn’t attempt to run a full MNIST classification on that encoding, what baffled me was how do I control the output size?

>>> l2 = hash_bind_op(sdr2, 500, 306936, 4950)
>>> len(l2) 

sdr2 is an P/N = 100/784 sdr

If mix_rounds is too small the result is even shorter.
If it is large (to get close to the desired solidity) then it gets too much mixed, hashing SDRs with high overlap produces encodings with quite small overlap.

bit pair expansion produces exactly P*(P-1)/2 = 4950 bits by expanding a SDR of length P = 100
And the operation is fully reversible, the original SDR can be restored from its bit pair expansion.

def address(sdr):
    l = [] 
    for iy in range(1,len(sdr)): 
        y = sdr[iy]
        for ix in range(iy): 
            x = sdr[ix]
            l.append(y*(y-1)//2 + x)  
    return l 

yea, I didn’t care about the exact output size of the SDR as they are robust to those kinds of errors and I dont expect the brain to have an exact density either, I was trying to find a fast way to bind two SDRs in a way that is invariant to bit order.

Ok, if by binding you mean a sort of random (but deterministic) projection of sdr A to a different N/P ‘dimension’ then yeah it makes sense, with the caveat that at first glance similarity is not very well preserved.
e.g. A and A’ have high overlap, a “good” projection would produce B and B’ also having a significant overlap.

PS the address() example above assumes the sdr is sorted, otherwise same sdr with different order produces a different bit pairing.

well the binding operation I was referring to is one that is needed in VSA machines, the operation is kinda defined as multiplication of two representations, C = A ⊗ B where C is not similar to either A or B but still contains information about both in a reversible manner.

I was going to implement the reverse operation using an associative memory.

Its used to produce compound representations for example: C1 = A ⊗ B and C2 = A ⊗ D in this case even though C1 and C2 contain information about A they still overlap very little because of the other operand’s overlap

I mean, I could use a full blown HTM to do that but I think its overkill.

also, if you use this line instead you’ll have constant size output SDRs, I know its a ugly looking hack but works. I also tweaked the mix_rounds to have 2.0 average fan-in so the overlaps will be hopefully higher.

l2 = hash_bind_op(sdr2, 140, 306936, 7000)[:4950]
1 Like

Ok when the intention is to minimize overlap between C1 and C2, despite having both “mother” A, then yeah it should be fine.
On classification tasks… maybe not that much. Imagine you end up with a dataset with 60000 low overlapping SDRs each derived from its own original image.
A classifier may overfit to 100% but would yield poor results on test sdrs.

At least that’s my intuition, I could be wrong.
Overlapping parts contain common “features” which classifiers cling on.

yea, that may happen with images but I think this can be mitigated by replacing the naive hash function with a smarter random projection like your 2D_FH encoder does.

In the VSA framework, Triadic Memory plays the role of multiplication. It’s reversible and distributes over addition (bit-wise OR). To multiply A and B, you first query if this product is already known. If not, you generate a random SDR C and store the triple {A,B,C}. As a random hypervector, C will be nearly orthogonal to both A and B.

It’s… almost random. Nearby pixels have close projections

PS fly hash encoder however, despite not having the 2D constraint on adjacent pixels still has a high degree of preserving similarity.

Because it does not mix the outputs so… hard.

yea, thats what I mean, my hash function is totally random, there’s no topology built into it so far ends of the image will interfer with eachother.

also, I asume that when you do the bit pair expansion, you use a randomized bit order, have you tried using something more topological? like maybe a hilbert curve?

No the bit order is strict, the whole bit pairing concept assumes sorted sparse SDRs. see the address() function example above, if you shuffle the input sdr you get a different bit pair addresses output

I mean random but determininstic, if you assume sorted bit indexes, it means that in image space its actually row or column order and since close pixels in the image are correlated, it means that the bit pairs are also correlated but the results could be different if the oder of the bits was remapped.

yea, but to be completely honest, one thing I dont like about triadic memory is its N³ ram usage to store a comparativelly small amount of information. we gotta find a more efficient way to build associative memories.

The SDR Classifier results I mentioned in first post I just flattened a 28x28 image, converted it to 784 vectors of booleans for all pixels values > 127 , and refer to it as “DIRECT” which means no encoding/projection/expansion yet. You can think of it as “original”

This original was either:

  • encoded via FH, FH_2D, SpatialPooler which all involve some randomness and some controlled output
  • bitpair expanded - which derives strictly all bit pairs from original
  • both of the above.

I did so many it can be confusing.

It’s large but not much empty - at least on random SDRs the triadic memory on its own has a capacity of ~
(N/P) ** 3 writes before messing things up.
E.G. 1M entries for 10/1000 SDRs . Which considering these are 1M vectors 1000 bits each (1G bits) isn’t that bad.

What I personally find redundant is the need to store in memory an (A,B)->C relationship with a random C for every possible A,B when C can be generated with your C = hash_bind_op(concatenate(A,B),…)

ok so the bitpair expansion is really being performed on row-major oder bits, so they may indeed be picking up really tight topological correlations in the image. in a sense, most of those pairs have a 1x2 receptive fiield and a few jump accros edges of the image.


but I’m curious to see what would happen if we were to remap those pixels order just enough to make pairs sample pixels that are further apart and uniformly oriented interms of verical/horizontal jumps