Here are a few observations about using dense scalar vectors in the context of SDR (Sparse Distributed Representation) based machinery.
Dense intermediate vectors are basically what DL works with: each hidden layer computes a vector of floats for the following layer.
If you think that “life doesn't like floats”, please bear with me. First, I'm not suggesting doing any learning on the actual scalar vectors, only using them as intermediaries in at least two cases:
- encoding input scalar values to SDRs via dense vectors.
- “transcribing” SDRs to other SDRs with different properties (size, sparsity)
Of course, the obvious question is why we would need such a thing.
The short answer is composability, and the follow-up question is: what does that mean?
Let's postpone the encoding of inputs and start with composition of SDRs. The most basic example is overlap: we can use the union of the bits of A and B to represent the pair C = (A, B).
Very simple and efficient except for the problems it generates:
- decreasing sparsity. If we want to combine C with D, which is itself the pair (E, F), we create a much harder problem for downstream processing and exhaust the bit space pretty quickly.
In associative memories decreased sparsity is a big problem.
- Bit overlap can represent sets of SDRs but not sequences: we can't tell whether A came before B or vice versa.
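To make the sparsity problem concrete, here's a quick numpy sketch (sizes, seed, and names are my own arbitrary choices) showing how unions inflate the number of active bits:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 1000, 20  # SDR size and number of active bits (2% sparsity)

def random_sdr():
    """A random SDR represented as a set of active-bit indices."""
    return set(rng.choice(N, size=P, replace=False).tolist())

A, B, E, F = (random_sdr() for _ in range(4))
C = A | B    # pair (A, B) as a union of bits: up to 2*P active bits
D = E | F
CD = C | D   # combining pairs keeps inflating density

for name, s in [("A", A), ("C = A|B", C), ("C|D", CD)]:
    print(f"{name}: {len(s)} active bits, density {len(s)/N:.1%}")
```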
Using an intermediate dense vector would allow us not only to compound a larger number of SDRs but also to control both the sparsity and the size of the resulting SDR.
The basic methodology is simple:
- take each SDR with P active bits out of size N and project it to its own dense vector of size N’ (N’ could be N or any other desired compound SDR size)
- normalize each vector to the [0,1] range if it uses floats, or to an integer maximum if we are striving for efficiency.
- sum up the resulting dense vectors.
- use the top (or bottom) ranking P’ values to output the compound SDR, where P’ is chosen to achieve the desired P’/N’ sparsity of the resulting SDR.
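The steps above can be sketched in numpy; a minimal version, assuming one fixed random projection matrix per input slot (the matrices and all names are my own assumptions, any fixed random projection would do):

```python
import numpy as np

rng = np.random.default_rng(42)
N  = 1000   # input SDR size
Np = 1000   # N': size of the compound SDR (could differ from N)
Pp = 20     # P': active bits in the compound SDR (2% sparsity)

# Fixed random projections, values in [0, 1); one per input "slot".
def make_projection():
    return rng.random((N, Np))

def compound(sdrs, projections, k=Pp):
    """Project each SDR to a dense vector, normalize, sum, keep top-P'."""
    dense = np.zeros(Np)
    for sdr, W in zip(sdrs, projections):
        v = W[list(sdr)].sum(axis=0)  # dense image of this SDR's active bits
        v /= v.max()                  # normalize to [0, 1]
        dense += v
    return set(np.argsort(dense)[-k:].tolist())  # rank, take the top P'

A = set(rng.choice(N, size=20, replace=False).tolist())
B = set(rng.choice(N, size=20, replace=False).tolist())
W_A, W_B = make_projection(), make_projection()
C = compound([A, B], [W_A, W_B])
print(len(C), "active bits ->", len(C) / Np, "sparsity")
```

Note the result always has exactly P’ active bits, regardless of how many SDRs went in: that is the sparsity-control point.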
OK, the algorithm above doesn't address the ordering issue; that will follow after we talk about scalar encoding.
Until then, let's notice that we (or the system itself) can use “knobs” to control how much influence each input SDR has on the output SDR. An exploration algorithm compounding A, B, and C could keep each at full strength or tune one down simply by multiplying its intermediate dense vector by a value between 0 and 1.
Compounding (A, B, 0.5*C) would land somewhere between full (A, B) and full (A, B, C).
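The knobs are just per-input scalar weights on the intermediate dense vectors. A sketch, assuming numpy and stand-in random dense intermediates (in practice these would come from projecting each SDR as in the recipe above):

```python
import numpy as np

rng = np.random.default_rng(1)
Np, Pp = 1000, 20  # compound SDR size N' and active-bit count P'

# Stand-in dense intermediates for A, B and C.
vA, vB, vC = rng.random(Np), rng.random(Np), rng.random(Np)

def top_k_sdr(dense, k=Pp):
    """Rank the dense vector and keep the indices of the k largest values."""
    return set(np.argsort(dense)[-k:].tolist())

full   = top_k_sdr(vA + vB + vC)        # (A, B, C) at full strength
half_c = top_k_sdr(vA + vB + 0.5 * vC)  # (A, B, 0.5*C): C's knob at 0.5
no_c   = top_k_sdr(vA + vB)             # (A, B)

# The knobbed result should overlap both extremes.
print(len(half_c & no_c), len(half_c & full))
```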
Now about scalars. We already have a few scalar encoders; here's another one:
- have a fixed, random encoding vector of size N with values in [0,1]
- map the scalar to be encoded to the [0,1] interval too.
- add it to the encoding vector and take the result modulo 1
- take the top (or bottom) P values, and there is our P/N SDR result.
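A sketch of this encoder in numpy (sizes, seed, and the `encode_scalar` name are assumptions). A side effect worth noticing: nearby scalars end up sharing many bits, distant ones few.

```python
import numpy as np

rng = np.random.default_rng(7)
N, P = 1000, 20

# Fixed random encoding vector with values in [0, 1).
base = rng.random(N)

def encode_scalar(x, lo=0.0, hi=1.0):
    """Encode a scalar into a P/N SDR via the shifted-modulo trick."""
    u = (x - lo) / (hi - lo)      # map the scalar into [0, 1]
    dense = (base + u) % 1.0      # shift every entry, wrap around 1
    return set(np.argsort(dense)[-P:].tolist())  # keep the top-P values

s1 = encode_scalar(0.50)
s2 = encode_scalar(0.51)  # nearby scalar: large bit overlap with s1
s3 = encode_scalar(0.90)  # distant scalar: little overlap with s1
print(len(s1 & s2), len(s1 & s3))
```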
Nothing interesting yet except a seemingly complicated way to encode scalars.
The nice properties stem, again, from compounding.
When we want to represent four scalars in a SDR, we can add the intermediate dense representations before thresholding the max values to obtain a resulting SDR.
It is not obvious why such a composition would be better than simply concatenating four quarter-size SDRs, one for each scalar. Yet in the CartPole tests I already mentioned, this encoding outperforms the simple concatenation of four sub-SDRs by a wide margin.
The only explanation I have is that this encoding exposes potentially useful correlations between certain combinations of values to the learning algorithm, which in many ways is a simple form of regressor.
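For illustration, compounding four scalars into one SDR might look like this (a sketch assuming one fixed random base vector per scalar channel, CartPole-style; names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 1000, 40  # one compound SDR, not four concatenated quarter-SDRs

# One fixed random encoding vector per scalar channel.
bases = rng.random((4, N))

def encode_state(xs):
    """Compound four scalars (already mapped to [0,1]) into one P/N SDR."""
    dense = ((bases + np.asarray(xs)[:, None]) % 1.0).sum(axis=0)
    return set(np.argsort(dense)[-P:].tolist())

s = encode_state([0.1, 0.5, 0.52, 0.9])
print(len(s), "active bits out of", N)
```

Because every output bit sums contributions from all four channels before the thresholding, combinations of values (not just individual values) can shape which bits win.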
Another potentially useful property of the two cases above is that we can compound not only SDRs with SDRs or scalars with scalars, but also arbitrary combinations of scalars and SDRs. E.g., we can have an encoding for “5 times chicken combined with 1 times egg”, and thus we can specify not only which came first but also how many days it takes for one to transform into the other.
A couple more notes and I’ll hopefully stop here:
- the scalar encoder can encode cyclic values (e.g. angles, where 0 and 2*pi should map to the same SDR) seamlessly
- performance concerns: it seems like (and is) a whole bunch of computation, but the fashion nowadays is TPUs, NPUs and the like on lightweight devices to accelerate inference. E.g. the newest Rockchip RK3588 has an NPU capable of 6 TOPS on integer vectors; it will almost surely saturate memory bandwidth before complaining that it has too many numbers to add.
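The cyclic property noted above can be checked directly: feeding the angle through `theta/(2*pi) mod 1` before the shifted-modulo encoder makes 0 and 2*pi land on the identical SDR, while angles just past the wrap still overlap heavily (a numpy sketch, names assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
N, P = 1000, 20
base = rng.random(N)  # fixed random encoding vector

def encode_angle(theta):
    """Cyclic scalar encoding: theta and theta + 2*pi yield the same SDR."""
    u = (theta / (2 * np.pi)) % 1.0  # wrap the angle into [0, 1)
    dense = (base + u) % 1.0
    return set(np.argsort(dense)[-P:].tolist())

a = encode_angle(0.0)
b = encode_angle(2 * np.pi)  # identical to a: no seam at the wrap
c = encode_angle(0.05)       # just past 0: still shares many bits with a
print(a == b, len(a & c))
```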
PS sorry for spelling it’s inference not interference
Very neat way to combine SDRs.
You’re right that this is better than concatenation, because concatenation dilutes the semantic information and also does not provide a superposition of the component SDR information. You need some way to make the resulting composite SDR a distinct superposition of the inputs. For example, given two similar components like (1,1,1) and (1,1,0), we want their composites to have small similarity. If the composite were simply a concatenation, the similarity would be large, around 66%. Your approach probably gets it down to around 10%.
I know that in some VSA literature, they achieve a superposition by taking the cross product of the vectors to get a composite which is orthogonal to the inputs. This way, you can use the composite vector to recover either of the original vectors if you know its companion. They call this binding, and it is a solution to the binding problem. You can read about it in Kanerva’s paper on hyperdimensional representations.
Also, recognize that in both your scalar-encoding and SDR-composition approaches, you use a k-WTA (k-winners-take-all) activation layer to produce the resulting SDR, where k = P. The prevalence of k-WTA in Binary Pattern Neural Networks is one of their key features, and it has interesting semantic properties. One of these is that it discards irrelevant information automatically.
For instance, if you are recognizing a “cat”, you will represent only cat features, and discard information about how likely it is a “chair” or a “boulder”. The relevance of each neuron's output corresponds to its activity magnitude; irrelevant info is automatically discarded by inhibition from the k-WTA algorithm.
Your SDR combination approach creates salient bits where the sum of random projections produces larger vector elements. These larger elements are sufficiently distinct to provide the necessary discriminative properties. All of the smaller-valued elements of the composite dense vector are discarded as irrelevant information.
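A k-WTA layer is small enough to spell out (a generic numpy sketch, not any specific library's API): everything below the k-th largest activation is inhibited to zero.

```python
import numpy as np

def k_wta(activations, k):
    """Keep only the k strongest activations; inhibit (zero out) the rest."""
    out = np.zeros_like(activations)
    winners = np.argsort(activations)[-k:]  # indices of the k largest values
    out[winners] = activations[winners]
    return out

a = np.array([0.9, 0.1, 0.85, 0.05, 0.6, 0.2])
print(k_wta(a, k=2))  # only the two largest activations survive
```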
Another potential benefit of the intermediate scalar vector is that one can use either the top %sparsity of values, the bottom ones, or any narrow “band” in between.
That could help e.g.
- ensemble solvers each “seeing” different compounded features
- using the “low” band of bits to sparsely route a problem towards 50 (out of 1000) solvers, while each solver sees the “top” 50/1000 bits, which are “disconnected” from the ones used for routing
Or maybe not, it’s just an idea.
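For what it's worth, the band selection is cheap to sketch (numpy, arbitrary sizes and names): rank the compound dense vector once, then slice different rank windows for different consumers.

```python
import numpy as np

rng = np.random.default_rng(9)
Np = 1000
dense = rng.random(Np)  # stand-in for a compound dense vector

def band_sdr(dense, start, width):
    """Take a 'band' of the ranked values; start=0 is the very top."""
    order = np.argsort(dense)[::-1]  # indices sorted by descending value
    return set(order[start:start + width].tolist())

top_band = band_sdr(dense, 0, 50)    # bits a selected solver actually reads
low_band = band_sdr(dense, 950, 50)  # bits used only for routing
print(len(top_band & low_band))      # disjoint by construction
```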