# Visualizing Properties of Encoders

One of the biggest unspoken challenges I faced when exploring HTM theory and trying to build applications was: how do you choose and configure encoders to translate your raw input data into SDRs? The problem has been widely acknowledged, and there is a paper or two on encoders, but very little analysis has been done on them.

I’ve wanted to understand the problem of representations in HTM-like Binary Population Neural Networks: how information is represented and transformed through the system. However, we must start at the beginning, with how data is encoded in the first place.

What has frustrated me from the beginning is that the outputs of encoders are not SDRs, according to the definition. They are not always sparse, and they are usually localist, not distributed. In fact, under the classical definition of a distributed representation (every element participates in representing the content), it’s questionable whether a binary array can ever be distributed, since each element has only one state in which it expresses information: “1”. A “0” represents “no information”, or rather “no evidence of the thing I represent”.

So what kind of properties can we derive from an encoder’s output? Here I show some visualizations of a scalar encoder on the unit interval. In particular, I call this a Fixed Weight Encoder, since the output is guaranteed to have the same number of active bits.

There’s quite a lot going on in this figure, and I want to break it down piece by piece. This is the simplest possible scalar encoder. Here, the parameter w=1, which means 1 bit is active at any time. With w=1, the scalar encoder becomes a one-hot encoder.
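To make the setup concrete, here is a minimal sketch of a fixed-weight scalar encoder in Python. The function name, the parameter names, and the contiguous-run bucket layout are my own assumptions for illustration, not the exact code behind the figures:

```python
import numpy as np

def fixed_weight_encode(x, n=20, w=1):
    """Encode a scalar x in [0, 1] as an n-bit array with exactly w active bits."""
    # One distinct code per possible start position of the active run of w bits.
    buckets = n - w + 1
    # Map x to the start index of the active run; clamp x = 1.0 into the last bucket.
    start = min(int(x * buckets), buckets - 1)
    code = np.zeros(n, dtype=np.uint8)
    code[start:start + w] = 1
    return code

print(fixed_weight_encode(0.0, n=7, w=1))  # -> [1 0 0 0 0 0 0]
print(fixed_weight_encode(0.5, n=7, w=1))  # -> [0 0 0 1 0 0 0]
print(fixed_weight_encode(1.0, n=7, w=1))  # -> [0 0 0 0 0 0 1]
```

With n=7 and w=1 this gives seven bins of width roughly 0.14, matching the definition of b_0 below; raising w widens the active run while keeping the weight constant.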

The x-axis indicates every possible input value to the encoder along the unit interval [0,1]. Each subplot is aligned along the x-axis, so you can see how a changing input affects the properties shown in each subplot.

Starting from the bottom:

Row 1 shows the encoded binary array for each value of x. The array is oriented sideways so we can see it change as x changes.

Row 2 shows two properties: the weight of the encoding, which is the total number of 1-bits in the output, and the crossover points, the locations along the x-axis where the code changes, and by how many bits. Here the weight is constant, since we are using a Fixed Weight Encoder, and the crossover points are evenly spaced, with only one crossover occurring at a time. As we complicate the types and numbers of encoders, this plot will get more interesting.

Row 3 shows how similarity changes between two randomly chosen values. This evaluates how well the encoder adheres to the principle of “similar inputs produce similar outputs”. Given this is a one-hot encoding, a value is only similar to others within the discrete region it falls into. With higher w and additional encoders, more graduated relations of similarity emerge between similar inputs.

Row 4 shows discrete “bin” representations of each of the bits. If the input value falls into a bin, the corresponding bit is activated. This gives a well-defined meaning to each of the encoder’s output bits. In particular, the meaning of the 0th bit, b_0, would be something like:

$$
b_0 = \begin{cases} 1 & \text{if } 0 \leq x \leq 0.14 \\ 0 & \text{otherwise} \end{cases}
$$

Here’s a plot with w=4.

We can see a much smoother transition in the similarity plots, since the encoding retains some of the “similar inputs, similar outputs” property. Let’s call this property SISO.

Another way to view SISO is through a self-similarity heatmap:

Along the diagonal, we see the maximum possible amount of similarity.

To clarify, our metric of similarity is the number of common bits between two representations. This is equivalent to the dot product of the two vectors (and, for fixed-weight codes, proportional to their cosine similarity). For binary arrays, it can be computed by an AND() operation between the two arrays, followed by a POPCOUNT() operation, which counts the number of set bits.
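A small illustration of this metric, on numpy arrays and on plain Python ints used as bit arrays (the helper name is mine):

```python
import numpy as np

def overlap(a, b):
    """Similarity as the number of common set bits: POPCOUNT(a AND b)."""
    return int(np.count_nonzero(a & b))

a = np.array([0, 1, 1, 0, 1, 0], dtype=np.uint8)
b = np.array([0, 1, 0, 0, 1, 1], dtype=np.uint8)
print(overlap(a, b))  # -> 2, same as the dot product a @ b

# The same computation on plain ints: AND, then count the set bits.
x, y = 0b011010, 0b010011
print(bin(x & y).count("1"))  # -> 2
```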

I hope people find this interesting, since I have a lot more visuals and variations. This extends to a couple of variations of the scalar encoder, place cells, grid cells, and generalized periodic encoders. It also extends to 2D and n-D, but I think focusing on 1D for the moment gives us the opportunity to understand how these representations work.


@jacobeverist to me the interesting bit must be the principle on which the encoder is built:

In the ideas I’ve been presenting here (and now exploring with @complyue), the metric of similarity is the number of common contexts between two representations. This can be seen as equivalent to a kind of cross product for vectors. So, contrasting with yours, which is a form of dot product, this might be seen as a form of cross product.

What is interesting is that a cross product generates new vectors, and so new structure, compared to a dot product, which only compares what is known.

There are other ways of relating elements to build “encoders” which expand the data rather than just compressing it. For instance, I think this talk by Chris Domas, which I referenced earlier, describes a few:

Domas has representations which make different kinds of data visually distinct. “Locality” was one (a Hilbert-curve transformation), and adjacency(?) was another, I think, which made plain text visually distinct from encrypted binary. So he could easily see where encrypted areas, or computer viruses, might be in computer code: identifying general properties of viruses, rather than specific viruses as former tools had.

The common feature of Domas’s representations, and why I liked them, was because they were all “expanding”. They all generated new structure, which was meaningful even though new, because of the way it related elements, instead of just comparing with known structure.

For me the interesting question is whether a representation of a given set of data is finite and compresses the data. Or if it is infinite and expands ever more structure out of the data like mine, or Domas’s.

I think representations which expand ever more structure out of a given set of data are going to be the ones which turn out being central for the study of cognition.

Has that contrast between encodings which compress data, and encodings which expand structure from data, been something you have looked at in your thinking about encoders?


I’m unfamiliar with this “cross product” you mention. I’ve looked at a lot of mathematics that transforms and expands the representation in a lot of interesting ways. However, I’ve been mostly focused on ways of compacting and clarifying semantics.

You are right that contextual information is key to a lot of computation. For that reason, I think it’s important to understand the difference between denotation and connotation in semantics. In linguistics, denotation is the dictionary definition of a word, and connotation is the variations and cultural understandings that a word invokes. In this work, I would say the direct encoders above are a form of denotation, and other operations that work in context, such as mini-columns, are a form of connotation.

For now, I’m trying to be as simple and as small as possible, to create representational primitives that we can all understand, from which we can scale up to more complicated representations and computations. Understanding the qualities of representations while we still know what all the inputs mean is key, before we transition to systems where it’s impossible to keep track of the semantic meaning of every single bit.


Well, I started with a very loose idea of “cross product”. Basically… a relationship which combines elements to create new elements?

I’ve since found parallels in maths: representation more generally as tensors, quantum maths, and category theory. Bob Coecke might be the best example of that:

From quantum foundations via natural language meaning to a theory of everything

"In this paper we argue for a paradigmatic shift from ‘reductionism’ to ‘togetherness’.

(Edit: to link this to where I refer to it in earlier threads)

Coecke emphasizes the entangled, contradictory aspect, of language in particular. I like that. But actually I think Coecke misses a trick, because quantum maths is top down. It captures the entangled quality, but not so much an expansive quality (new particles?).

Linas Vepstas has also been doing work for years trying to find a grammar for language. And he also came to something of a category theory view.
Sheaves: A Topological Approach to Big Data

I actually think both those approaches are limited by starting with the maths. I’m reminded of Chaitin’s assessment of Goedel’s proof of mathematical incompleteness. That the proof relied on a superior power of computation over mathematical formalism:

“If you look at Gödel’s original paper you see what to me looks like LISP, it’s very close to LISP, the paper begs to be rewritten in LISP!”

A Century of Controversy Over the Foundations of Mathematics
G.J. Chaitin’s 2 March 2000 Carnegie Mellon University School of Computer Science Distinguished Lecture.
http://arxiv.org/html/nlin/0004007

But as a way of summarizing a contrast with the status quo in AI today (which I think you have summarized well as a kind of dot product): I think the maths of cross products, and more generally the way Coecke, Vepstas, and others relate it to tensor maths and category theory, at least goes part of the way to a full power of expansion, perhaps only summarizable by computation.

You could make that distinction. By that definition “denotation” is limited to where you have a dictionary. I see that as the fundamental problem with current AI. We are attempting to encode the world as a large dictionary. The dictionary keeps on getting bigger and bigger. You have to decide when you want to stop thinking of it as a dictionary, and just keep the building process. You might call that a shift to emphasis on connotation.

I disagree. I think it is key that we are unable to keep track of the semantic meaning of every single bit. The ambiguity of elements, bits, is the key power of the system we need. I’m sure that’s why words in language are ambiguous. It’s not nature being lazy. Ambiguity is a more powerful way to represent meaning. It means you can create more meanings. If the elements depend for their meaning on the whole, then the same elements can “mean” a near infinity of combinations.

This, I think, is a doomed endeavour.

Read Chaitin’s very entertaining talk about the attempts to find a logical basis for maths at the turn of the last century. It failed, he says. But that failure was the invention of the power of general computing… Even maths was demonstrated not to have a complete set of representational primitives capturing all meaning. But that hints at a greater power of computation.

There’s something in combinatorial power which we are only beginning to discover.


> the metric of similarity is the number of common contexts between two representations.

Beware: contexts, like anything else, rarely match perfectly. Which raises the question: what is the metric of similarity between contexts?


For reference, here are Numenta’s relevant publications on this topic:


The HTM School video series is another great resource for learning about Numenta’s theories.
These videos are intended for people with little to no pre-existing knowledge of computational neuroscience.


@robf

I mostly agree with you, whilst you seem to disagree with me.

So I think this would be basically any type of “outer product”, versus my dot product, which is a form of “inner product” reducing to a single value. I can think of a couple that apply right off the top of my head:

1. A truth table of two binary arrays with an element-wise AND/OR/XOR operation. One array is in the top row, and the other array is in the first column, like so:

```
&   0  1  1  0
0   0  0  0  0
1   0  1  1  0
1   0  1  1  0
```

This creates an m \times n binary matrix.

2. Another kind of “outer product” could be the Cartesian product of two arrays, which, instead of single bits, would create a matrix of bit 2-tuples:

```
×   0      1      1      0
0   (0,0)  (0,1)  (0,1)  (0,0)
1   (1,0)  (1,1)  (1,1)  (1,0)
1   (1,0)  (1,1)  (1,1)  (1,0)
```

I’ve trawled a lot of mathematics to find different transformation operations, but they often don’t really apply to binary arrays. A lot of their expressiveness only finds root in continuous vector spaces, tensor spaces, complex spaces, etc.

Neat paper. I’m going to study it more closely. I’ve been following the wash of papers that add “quantum” to their titles and use quantum formalisms to describe things. Usually they are only applicable if you add a Bayesian formalism to your analysis, which I’m avoiding. Perhaps I may add it in the future, once I understand the discrete behavior first.

I completely agree here. I’m not interested in assigning definitions to bits so much as I’m interested in developing a theory of mutable semantics. I want to understand the shape and quality of that ambiguity: how it changes, how it forms, what its bounds are, and how it interacts. How are semantics actually formed from scratch?

With this goal in mind, it’s useful to start analysis with the case where the semantics of the bits are known. Only then do we proceed to the case where semantics are unknown, ambiguous, contradictory, overloaded, etc.

I’m not sure why you’re so pessimistic. Just because I can’t create a formal system that adequately describes itself doesn’t mean I can’t build some heuristics that give understanding and clues about how to build larger-scale systems. I’m trying to get a computer scientist’s understanding of how binary patterns represent things and are transformed, so that we can successfully build larger computational workflows with Binary Pattern Neural Networks (BPNNs), similar to how computational theories were developed to enable building algorithms and computer hardware.


Here are some further visualizations.

Here we show the case of multiple differently configured encoders concatenated together. In the first case, each encoder has w=1.

We can actually see some interesting behavior in the crossings count. Because each encoder is partitioned into equally sized bins, the bin counts of the encoders share some common factors. As a result, some crossover points coincide, and you get large disjoint changes in the similarity value, seen as the cliff-drop in the point similarity plot. It’s even more obvious in the similarity heatmap.

Setting w=2 for the same encoders changes the quality drastically:

We now have much smoother drop-off in the similarity value.
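Concatenation itself is straightforward to sketch. This is a minimal illustration assuming simple contiguous-run fixed-weight encoders; the sizes in `configs` are made-up examples, not the exact configurations behind the plots:

```python
import numpy as np

def fixed_weight_encode(x, n, w):
    """Encode x in [0, 1] as n bits with a contiguous run of w active bits."""
    buckets = n - w + 1
    start = min(int(x * buckets), buckets - 1)
    code = np.zeros(n, dtype=np.uint8)
    code[start:start + w] = 1
    return code

def concat_encode(x, configs):
    """Concatenate several fixed-weight encoders; configs is a list of (n, w)."""
    return np.concatenate([fixed_weight_encode(x, n, w) for n, w in configs])

code = concat_encode(0.5, [(12, 2), (20, 2)])
print(len(code), int(code.sum()))  # -> 32 4  (total weight = sum of the per-encoder w's)
```

If the encoders' bucket counts share factors, their crossover points line up at the same x values, which is exactly where the similarity cliffs appear.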

Really, properly configuring your encoders comes down to getting the right shape and quality of those slopes in the similarity plot, shown 2nd from the bottom. However, we still don’t know what a measure of “goodness” would be.

Things get even more interesting when we consider periodic encoders such as grid cells. Since grid cells work beyond any finite interval, off to infinity, we need to consider whether an encoding is sufficiently unique within a bounded locality.

More to come…


Sure, figuring out the best metric of similarity between contexts is an issue. Hence “attention”.

But the real game changer is that making similarity between contexts the basis means the values themselves can be superficially unrelated to each other, and can actually form a completely new set.

Before, I compared it to Google Pixel low-light photography, where (I believe) they take a number of photos and match contexts to better estimate a particular pixel. In one photo the pixel identified by the context might be complete noise; in another it might be black.

That’s not the only way to generate new sets. I think it’s the one human cognition uses. (Chris Domas had others based on locality preservation and such. Useful for different data, amplifying different signals, which human cognition is generally blind to. But they shared the property of generating new sets.)


Hmm. This still sounds to me like saying, “I understand a lead balloon will not fly, but I want to build a lead balloon first, and only then proceed to ones that fly…”

Or, maybe, “I understand the weather is turbulent flow, but I want to build a laminar flow weather prediction model first…”

Well, maybe that can help some things.

Anyway, just giving you my perspective on the problem. If at any point you feel you are having difficulty finding fixed semantic values for bits, you can think about examining the combinatorial route again.


So, I’m a little interested in this cross-product idea. I’m just curious what properties you would like it to have, and then how you would like to use it? What would it be useful for? You probably said somewhere and I missed it.

Anyway, the first part is easy enough to answer. Define a function X(A,B) such that:

1. X(A,A) = 0
2. X(A,B) = - X(B,A)
3. for some similarity function sim(A,B), we have sim(A, X(A,B)) = sim(B, X(A,B)) = 0

Which should serve as an initial definition of a cross product function.
OK. So how would we now find such a function? Well, one quite general definition is:

• X(A,B) = 1/2 (P(A,B) - P(B,A))

where P(A,B) is some kind of product of A and B.

I know it can be implemented in my toy semantic db, since I have a working version, but can an interesting X(A,B) be defined for binary vectors? Or even vectors of floats that don’t have explicit basis labels (instead the basis labels are implicitly defined by their position in the array of values)?

This is where the SDB is a little more general, because all vectors have explicit basis labels. So we have:

• a single element in an SDB vector is given by c|s> where c is some float, and s is some string
• a full vector (which we call a superposition) in the SDB is given by: c1|s1> + c2|s2> + … + cn|sn>

I may as well give my code, but it probably doesn’t add too much to the conversation:

```
-- first define our product function:
P {the |sp 1>, the |sp 2>} #=>
    unlearn[the] |result>
    for( the |ket 1> in the |sp 1>):
        for( the |ket 2> in the |sp 2>):
            the |new basis> => clean the |ket 1> __ |*> __ clean the |ket 2>
            the |new value> => (extract-value push-float the |ket 1>) ** (extract-value push-float the |ket 2>)
            the |new ket> => pop-float (the |new basis> :_ the |new value>)
            the |result> +=> the |new ket>
        end:
    end:
    the |result>

-- now our drop-zero operator that drops terms with coeff == 0:
drop-zero |*> #=>
    if( extract-value push-float |__self> == |0>):
        |>
    else:
        |__self>
    end:

-- now our normalize basis elements operator:
-- eg: |alpha * beta> maps to |alpha * beta>
-- and: |beta * alpha> maps to - |alpha * beta>
normalize-basis |*> #=>
    the |basis sp> => drop-zero split[" * "] |__self>
    the |coeff> => extract-value push-float |__self>
    if( how-many the |basis sp> != |number: 2>):
        |>
    else:
        if( sp2seq the |basis sp> == sp2seq natural-sort the |basis sp>):
            |__self>
        else:
            pop-float ( smerge[" * "] natural-sort the |basis sp> :_ times-by[-1] the |coeff> )
        end:
    end:

-- Now finally, our cross-product function:
X {our |sp 1>, our |sp 2>} #=> 1/2 normalize-basis ( P(our |sp 1>, our |sp 2>) |> - P(our |sp 2>, our |sp 1>) |> )
```


And these satisfy properties (1) and (2) defined above. Here is a quick demonstration:

```
-- define a couple of toy superpositions/vectors:
the |one> => 2|alpha> + 3|beta>
the |two> => 5|alpha> + 7|beta>

-- now print out the results:
sprint["one: "] the |one>
sprint["two: "] the |two>
print | >
sprint["P: one vs one: "] P(the|one>, the|one>)
print | >
sprint["X: one vs one: "] X(the|one>, the |one>)
sprint["X: two vs two: "] X(the|two>, the |two>)
print | >
sprint["X: one vs two: "] X(the|one>, the |two>)
sprint["X: two vs one: "] X(the|two>, the |one>)
```


with the result:

```
one: 2|alpha> + 3|beta>
two: 5|alpha> + 7|beta>

P: one vs one: 4|alpha * alpha> + 6|alpha * beta> + 6|beta * alpha> + 9|beta * beta>

X: one vs one: |>
X: two vs two: |>

X: one vs two: - |alpha * beta>
X: two vs one: |alpha * beta>
```


Likewise, if we define:

```
the |three> => 11|alpha> + 13|beta>
the |four> => 17|alpha> + 19|beta>
```


we get:

```
X(the |three>, the |four>) == -12|alpha * beta>
X(the |four>, the |three>) == 12|alpha * beta>
```


Finally, a brief mention of a similarity measure. The SDB uses something pretty close to:

```
simm {the|sp1>, the|sp2>} #=> measure-currency intersection(normalize the |sp1>, normalize the |sp2>)
```


But since the output of X() has no basis elements in common with the input vectors, the similarity is trivially 0, and hence property (3) above is also satisfied.

One final note: the X() function is not associative, i.e. X(A, X(B,C)) != X(X(A,B), C). So the above SDB code doesn’t even try to handle the X(A, X(B,C)) case. But with a little work, we could probably make it work.
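For what it’s worth, the same construction can be sketched in plain Python, with dicts standing in for labeled superpositions. The helper names are hypothetical, and this is only an analogue of the SDB code, not the SDB itself:

```python
def P(A, B):
    """Outer product of labeled vectors (dicts): basis labels become ordered pairs."""
    return {(s1, s2): c1 * c2 for s1, c1 in A.items() for s2, c2 in B.items()}

def X(A, B):
    """X(A, B) = 1/2 (P(A, B) - P(B, A)), with pair labels put into canonical
    (sorted) order, flipping the sign whenever a pair is swapped."""
    p, q = P(A, B), P(B, A)
    out = {}
    for (s1, s2) in set(p) | set(q):
        v = 0.5 * (p.get((s1, s2), 0.0) - q.get((s1, s2), 0.0))
        key, sign = ((s1, s2), 1.0) if s1 <= s2 else ((s2, s1), -1.0)
        out[key] = out.get(key, 0.0) + sign * v
    return {k: c for k, c in out.items() if c != 0.0}  # drop zero terms

one = {"alpha": 2.0, "beta": 3.0}
two = {"alpha": 5.0, "beta": 7.0}
print(X(one, one))  # -> {}  (property 1)
print(X(one, two))  # -> {('alpha', 'beta'): -1.0}
print(X(two, one))  # -> {('alpha', 'beta'): 1.0}  (property 2: antisymmetric)
```

Since the output lives entirely on new pair labels, it shares no basis elements with the inputs, which is property (3).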


It occurred to me that we can tighten up our implementation of the product function P(A,B). We make use of a property of the string merge operator:

```
c1|s1> __ c2|s2>
```

returns:

```
c1*c2 |s1 s2>
```

noting that there is a space char between the strings s1 and s2.


And hence we can now do:

```
P {the |sp 1>, the |sp 2>} #=>
    unlearn[the] |result>
    for( the |ket 1> in the |sp 1>):
        for( the |ket 2> in the |sp 2>):
            the |result> +=> the |ket 1> __ |*> __ the |ket 2>
        end:
    end:
    the |result>
```


Noting that |*> here has no explicit coefficient, and hence has value 1. So the final result of our merge step is to merge the basis elements (with a * in the middle), but also to multiply their corresponding coefficients.

Anyway, just a small point.


If you look at what Coecke et al did initially, they went to tensors, for a more general “outer product”. But they got dimensional explosion, because of something which looks like your equation:

This might be the first paper I found on this. Around 2007(?):

A Compositional Distributional Model of Meaning

As I recall the initial papers mentioned the combinatorial explosion of dimensions problem. And later papers appeared to try and solve that using maths of approximation from quantum mechanics, and category theory.

Here, I found a video presentation in my notes:
Stephen Clark: “Compositional and distributional models of meaning for natural language”

I forget how Coecke described the history of this collaboration. I think he said he was working on category theory, and his friend Clark, who’s a linguist, said that looks like my grammar maths (with a grammar formalism which dated back to… forget the name.)

I recall Coecke talks about the history in this podcast.

I did not have the combinatorial explosion problem, because I was working bottom up from… embeddings of natural language text, basically. So I just generalized combinations on the embedding.

See a pre-print of that here:

Parsing using a grammar of word association vectors

I now think that generalization on the embedding was a mistake. Even though my vector product was generating new structure, I had a fixed embedding frozen in the definition of my vectors.

I now think that is a problem inherent to vectors. The only way to keep all the information is to keep it in a network. Hence what I’ve been talking about trying to do in recent threads here, projecting out “embeddings” directly from a network.

In the network formulation the “cross-product” becomes just a recombination of elements based on an “embedding” which varies at run-time.

By reference to my pre-print above, I would say… yes. You get an interesting combination relation. It was enough for me to generate a parse structure. But with the caveats above.

Basically, now, I think you have to keep the network, and find the relations to recombine your components, at run time.

SDB?? Semantic Data Base??


Cool, thanks! I will follow up on those links when I get some time tomorrow.

Also, the SDB is, as you may recall, a kind of associative memory that basically implements a directed network. As part of that, I have recently been looking at the idea of “spreading activation”, where each element in the result sequence is the output of the n-th step through the network. Then we run our similarity function on these sequences. In brief testing, merging all the steps into a single superposition produces terrible results, since in that case almost everything is similar to everything. It is the number of steps from the source that is interesting, not the object itself. But I have a lot more testing to do with that idea.

Finally, now you mention it, yes we did have conversations a long while back. The SDB has evolved quite a bit since then! Amongst other things it now has a GUI, and a bunch of other additions and improvements.

Also, here is a quick example of the network structure and the spreading activation idea:

```
age |Fred> => |37>
friends |Fred> => |Sam> + |Eric> + |Emma> + |Liz>
mother |Fred> => |Trude>
father |Fred> => |Robert>
```


So, from here we can start to construct a sequence:

• |Fred>
• |37> + |Sam> + |Eric> + |Emma> + |Liz> + |Trude> + |Robert>

The next step is then what we know about 37, Sam, Eric, Emma, Liz, Trude and Robert.
Then the next steps are generated in similar fashion.

Or we can restrict the spreading activation to particular operator types. Say just the “friends” operator? Or maybe “mother” + “father”, and so on.
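A rough Python sketch of this kind of spreading activation, with a dict-of-dicts standing in for the directed network. The names and structure are illustrative, not the SDB’s internals:

```python
def spread(network, start, steps, operators=None):
    """n-step spreading activation: layer k holds everything reachable in
    exactly k steps from `start`, optionally restricted to some operators."""
    layers = [{start}]
    for _ in range(steps):
        frontier = set()
        for node in layers[-1]:
            for op, targets in network.get(node, {}).items():
                if operators is None or op in operators:
                    frontier |= targets
        layers.append(frontier)
    return layers

network = {
    "Fred": {"age": {"37"},
             "friends": {"Sam", "Eric", "Emma", "Liz"},
             "mother": {"Trude"},
             "father": {"Robert"}},
}

# Everything one step from Fred:
print(sorted(spread(network, "Fred", 1)[1]))
# -> ['37', 'Emma', 'Eric', 'Liz', 'Robert', 'Sam', 'Trude']

# Restricted to just the "friends" operator:
print(sorted(spread(network, "Fred", 1, operators={"friends"})[1]))
# -> ['Emma', 'Eric', 'Liz', 'Sam']
```

Keeping the layers separate (rather than merging them into one set) preserves the step count from the source, which is the part that carries the interesting signal.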


You might have better luck with something like a Geometric Algebra, also sometimes referred to as Clifford Algebra. GA generalizes linear and vector spaces much better than traditional vector algebra/calculus. In particular, the concept of the cross-product is generalized for arbitrary dimensions to the outer product (not the same as tensor product) which represents the subspace spanned by the arguments to the product. (e.g. two linearly independent vectors, u and v, will span a 2D subspace which is represented by a bivector B = u ^ v; three linearly independent vectors will span a 3D subspace and be represented by a trivector C = u ^ v ^ w; etc.)

To take this a step further, the geometric product actually produces a graded algebra that includes many relations between the entities represented by its arguments, not just the inner and outer products. The geometric product subsumes the inner product, the outer product, and much more. For example, the product of an m-vector with an n-vector produces a linear combination of entities with grades: | m - n |, | m - n + 2 |, …, | m + n - 2 |, | m + n |. The inner product between the two entities results in the lowest grade component, while the outer product results in the highest grade component. All of the intermediate components are potentially relevant geometric quantities; however, most of them have not been extensively studied, because we are typically used to working with algebras of six or fewer dimensions (in which case there may be at most 3 components resulting from the geometric product of two pure-grade entities).
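For a quick feel for the outer product in coordinates: the bivector u ^ v can be represented by its antisymmetric component matrix B[i, j] = u[i]v[j] - u[j]v[i]. This is a sketch in numpy, not a full GA implementation:

```python
import numpy as np

def wedge(u, v):
    """Bivector components of u ^ v: antisymmetric matrix, zero exactly
    when u and v are linearly dependent (they span no 2D subspace)."""
    return np.outer(u, v) - np.outer(v, u)

u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])
B = wedge(u, v)
print(B)  # e1 ^ e2: B[0, 1] = 1, B[1, 0] = -1, all else 0

print(wedge(u, 2 * u))  # parallel vectors wedge to zero
```

In 3D the three independent components (B[0,1], B[0,2], B[1,2]) carry the same information as the classical cross product (up to sign and duality), but unlike the cross product this construction works in any number of dimensions.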

Geometric Algebra solves a lot of long standing issues with incompatible representations across multiple science, engineering and mathematics disciplines. There is a fairly active community working with GA on applications as diverse as relativity, quantum mechanics, robotics, computer graphics, signal processing, and even some preliminary work on neural networks and other machine learning approaches.

As for visualization, you may want to take a look at some of these demos. In particular, this one.

Let me know if you find any of this interesting or useful.


Thanks for the suggestion.

I’m basically stuck on the idea that you can’t reduce the representation below a network of examples, at the moment. Bob Coecke and the like have got into some fairly sophisticated maths. But more than anything I think the sophisticated formalism might limit them, because formalisms are always top down. I think they miss simplifications which are apparent physically. So, for instance, it strikes me that Coecke is breaking a sweat to squeeze the body of examples in a language into a quantum mechanical formalism, and then immediately burns a hole in the planet trying to use quantum computing to pick it out again!

The maths captures (a lot of?) the complexity. But why bother? If the complexity is in the body of examples, why not use that directly?

Much simpler just to stay with the examples!

That’s my thinking at the moment, anyway.

Of course Coecke has got a lot of funding from quantum computing sponsors. They have a big outfit centered around… Cambridge? 500 people? Working on it. So he’s heavily incentivized to find a reason to do it that way.


Where was that? Was that on NuPIC?


#### edit: forgot to add self-similarity

This is similar behavior to the scalar hash encoder, which can theoretically support an unlimited range.

```python
from matplotlib import pyplot as plt
import numpy as np

from math import log2

def hash_2(x, y):
    # saw something like this hash function on stackoverflow but don't remember the source
    # there are better magic numbers but those are just random primes
    mask = 2 ** 64 - 1
    x = (((x or 61214540001836966713) ^ (x >> 32)) * 38456183908924439717) & mask
    x += y or 61607333722841864273
    x = ((x ^ (x >> 32)) * 25347052591918980319) & mask
    x = ((x ^ (x >> 32)) * 89644427646567325781) & mask
    return x

def scalar_hash_encoder(value, min_resolution=0.01, p=50, n=1000, scaling=1.1, self_similar=True):
    output_sdr = set()
    if self_similar:
        # not sure if math is correct but eyeballing the graph tells me it probably is.
        value = log2(abs(value) + 1) * [-1, 1][value > 0]  # preserve the sign of the input

    scale = min_resolution
    for i in range(1, p + 1):
        output_sdr.add(hash_2(round(value / scale), i) % n)
        scale *= scaling
    return output_sdr

representations = []
X = 100
for i in range(X):
    representations.append(scalar_hash_encoder(i / X))

similarities = np.zeros((X, X), dtype=np.int32)

for i, a in enumerate(representations):
    for j, b in enumerate(representations):
        similarities[i, j] = len(a & b)

plt.imshow(similarities)
plt.show()
```
