Visualizing Properties of Encoders

Yes, this is the correct analogy. Before doing the hard problem, try to do the less hard problem first. Of course, we would discover that different methods apply to each case, along with different limitations on the questions that can be answered. Just being able to say, and even prove, “no, we cannot answer this type of question,” is still valuable.

Baby steps…

1 Like

Yes. I found that good encodings are closely related to hashing functions and even Locality-Sensitive Hashing (LSH). The challenge with the hashing-function approaches in the literature is that they usually focus on tuning or training the hashing function to a dataset that is already available, producing an “optimal” hashing function for that data.

I think with these binary pattern encoders, neural networks, and the brain, it’s a matter of providing options, anticipating novel data, and doing the best with what you got. In practice, encoders are more like sub-optimal, redundant hashing functions.

2 Likes

Here is the visualization of 100 randomly generated 1D place cells.

In many ways, the previous Fixed Weight Scalar Encoder is just an efficient layout of place cells, whereas this plot shows a set of randomly placed place cells.
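
For anyone who wants to play with the idea, here is a minimal Python sketch of such a random place-cell encoder. The centre and width ranges are my own assumptions, not the exact configuration behind this plot:

    import numpy as np

    rng = np.random.default_rng(0)

    # 100 place cells with random centres and receptive-field radii on [0, 1].
    n_cells = 100
    centres = rng.uniform(0.0, 1.0, n_cells)
    radii = rng.uniform(0.02, 0.10, n_cells)

    def encode(x):
        # A cell fires (bit = 1) when x falls inside its receptive field.
        return (np.abs(x - centres) <= radii).astype(np.uint8)

    def overlap(a, b):
        # Similarity as the number of shared active bits.
        return int(np.sum(a & b))

    # Note the weights differ: the encoding weight is not constant over the input space.
    code_a, code_b = encode(0.21), encode(0.25)
    print(code_a.sum(), code_b.sum(), overlap(code_a, code_b))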

Their inefficiency can be seen dramatically in the mismatched similarity peaks on the 2nd row from the top. We can also see the highly random distribution of crossover points on the 3rd row from the top. For the first time we also see an encoding whose weight is not constant but changes over the input space (also visible on the 3rd row).

We also finally see some visual difference between the bins on the 1st row and the binary encodings on the 4th row. Some of the bins straddle the unit-interval boundary, yet the corresponding encoding strictly adheres to the interval boundaries in this particular configuration.

One saving grace of this particular approach is that any similarity is guaranteed to be strictly local, as seen in the similarity heatmap: any non-zero similarity exists solely on or near the diagonal.

1 Like

I’m still more interested in how to learn those kinds of encoders from arbitrary vector data, and preferably to have decoders too.

1 Like

It was over email.

1 Like

Further to the point of mapping a set of “learn rules” to a sequence using the idea of “spreading activation”, we come to the problem of how to generate the learn rules in the first place (we certainly don’t want to do it by hand!).

I.e., given the concept X, how do we automatically build the following structure from some corpus?

op1 |X> => sp1
op2 |X> => sp2
...
opn |X> => spn

where each of the above learn rules captures some information about |X>. The simplest approach is to consider the words immediately preceding and following a word in the corpus (which I see Rob has considered in his paper as the function Con(w), the context of w).

So for example, maybe:

pre |man> => |old> + |young> + |very old> + |hungry> + |cold> + |tired> + ...
post |man> => |sat> + |watched> + |ate> + |was> + |is> + ...

Or, going further, with coefficients representing the counts of each occurrence of each word. I don’t remember my exact algorithm, but a while back I generated so-called “word classes” just using the similarity of the pre and post superpositions. And indeed, it worked quite well at finding words that could be substituted for each other. But is there a sensible way to improve even further on this idea?
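
A minimal reconstruction of that pre/post superposition idea might look something like this in Python; the similarity measure here is a simple stand-in, not necessarily the one originally used:

    from collections import Counter, defaultdict

    def context_superpositions(tokens):
        # Count the words immediately preceding (pre) and following (post) each word.
        pre, post = defaultdict(Counter), defaultdict(Counter)
        for i, w in enumerate(tokens):
            if i > 0:
                pre[w][tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                post[w][tokens[i + 1]] += 1
        return pre, post

    def similarity(c1, c2):
        # Overlap of two count vectors, normalised to [0, 1].
        shared = sum(min(c1[k], c2[k]) for k in c1.keys() & c2.keys())
        return shared / max(sum(c1.values()), sum(c2.values()), 1)

    tokens = "the old man sat down the young man watched the old dog sat".split()
    pre, post = context_superpositions(tokens)
    # Words with similar pre and post contexts are candidates for the same "word class".
    print(similarity(pre["man"], pre["dog"]), similarity(post["man"], post["dog"]))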

1 Like

You can do it the way a transformer does, and “learn” which contexts to attend to with “attention”. Ten billion dollars of investment from Microsoft in OpenAI says clearly that it works quite well.

Or you can leave all the information in a network, and let the runtime context select what it wants to attend to.

To expand on that second solution…

I think the bigger problem, bigger than identifying context to generalize on, and even bigger than the entanglement of generalizations which Coecke et al. have focused on, is the fact that these generalizations actually act like expansions of the data, and there appears to be no limit to them. So any fixed set will always be incomplete, and the only way to capture them completely is to find the ones relevant to a given situation at run time. I went into this a lot in the “How Your Brain Organizes Information” thread. This post might be a good summary of (the first part of) that:

I gave a lot of examples from the history of linguistics, maths, and philosophy to support the idea that meaning can’t be completely abstracted, and that in fact “learning” acts like an expansion of the data. But I thought the data that… @cezar_t presented was also good support: that doing more training over transformers acts much like just using more data:

So that’s the first option: “learn” the context to “attend” to, and capture the entangled, quantum quality of generalizations by having a black box where no one is sure what the structure is.

But I say it runs into the bigger problem that generalizations expand. They are not only entangled, as Coecke et al. see, but they get forever bigger. The current solution seems to be to just try to make it as big as possible. (Which also means that only the biggest entities can get involved at all, and the little guy is reduced to begging, or paying, to get access to big company APIs.)

The real flaw with that is that no matter how big you make it, it will never be as “big” as human performance, and you’re forever chasing an asymptote:

The second option, the one that deals with this expansion and the one I’m now focused on, is to leave all the information in a network, leave it “embodied” in a set of data essentially, and let the runtime context select what it wants to attend to. As I said earlier in this thread:

This explanation to @JarvisGoBrr might be a good summary:

Together with this:

1 Like

OK. I gave it some thought, and here is one way to construct a basic graph from some text using the SDB. Basically I built a simple parser. The structure is as follows:

  • map layer
  • equality layer
  • merge layer
  • less than layer
  • cast layer
  • learn layer

In particular, putting it all together we invoke it using:
(yeah, once you have defined your operators, operator sequences can be quite clean!)

our |result> => learn-layer sdrop cast-layer less-than-layer^2 merge-layer equality-layer map ssplit[" "] the |sentence>

Now for a brief discussion of the details of this algo:
First, the map-layer maps words to tagged/typed kets (the tags are essentially parts of speech).
eg:

map |frog> => |noun: frog>
map |green> => |colour: green>
map |seven> => |number: seven>

The code does this by defining a bunch of word lists and then prepending the relevant type to the objects in each list. Note that this toy code does not handle the case where a word has more than one type. I’m not sure of the cleanest way to handle that problem! And in the real world, that is almost always the case.
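
A rough Python analogue of this map layer might look as follows; the word lists are tiny and purely illustrative:

    # Hypothetical Python analogue of the map layer: tag each word with a type
    # drawn from hand-defined word lists (ambiguous words are not handled here either).
    WORD_LISTS = {
        "noun":   {"frog", "apple", "fish", "sharks"},
        "colour": {"green", "red", "blue"},
        "number": {"seven"},
    }

    def map_layer(word):
        for word_type, words in WORD_LISTS.items():
            if word in words:
                return (word_type, word)      # e.g. ("noun", "frog") ~ |noun: frog>
        return ("unknown", word)

    print(map_layer("frog"))                  # ('noun', 'frog')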

Next, the equality-layer merges adjacent objects with the same type:
(If more than two in a row, merge them all together)

|A: alpha> . |A: beta> maps to |A: alpha beta>

eg:

|proper-noun: Fred> . |proper-noun: Smith> maps to |proper-noun: Fred Smith>

Next, the merge-layer:

|noun> . |of> . |noun> maps to |noun>
|number> . |and> . |number> maps to |number>
|colour> . |and> . |colour> maps to |colour>

eg:

|colour: red> . |and: and> . |colour: green> maps to |colour: red and green>

Next, the less-than layer:
(see the appendix where we define the relations for types)

If A < B then:
|A: alpha> . |B: beta> maps to |B: beta>
and we learn: A |beta> +=> |alpha>

eg:

|colour: green> . |noun: apple> maps to |noun: apple>
and we learn: colour |apple> +=> |green>
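
A rough Python analogue of the less-than layer, treating the stream as a list of (type, word) pairs; the type ranking below is an assumption, and the equality and merge layers can be written as similar adjacent-token passes:

    # Hypothetical Python analogue of the less-than layer over (type, word) pairs.
    # When the previous token's type ranks lower than the current one's, absorb it
    # and record a learn rule such as: colour |apple> +=> |green>
    RANK = {"pre": 0, "adj": 1, "colour": 1, "number": 1, "noun": 2}

    def less_than_layer(tokens, learned):
        out = []
        for t in tokens:
            if out and RANK.get(out[-1][0], 99) < RANK.get(t[0], -1):
                lower = out.pop()
                learned.append((lower[0], t[1], lower[1]))
            out.append(t)
        return out

    learned = []
    print(less_than_layer([("colour", "green"), ("noun", "apple")], learned))
    print(learned)    # [('colour', 'apple', 'green')]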

Next, the cast-layer:

|noun> . |comma> . |noun> maps to |protolist>
|protolist> . |comma> . |noun> maps to |protolist>
|protolist> . |and> . |noun> maps to |protolist>
|noun> . |verb> . |noun> maps to |NVN>

Finally, the learn-layer, using some if-then machines:

If we see the pattern: |*> . |is> . |*>
then learn: is |alpha> +=> |gamma>

If we see the pattern |*> . |was> . |*>
then learn: was |alpha> +=> |gamma>

If we see the pattern |proper-noun> . |comma> . |proper-noun>
then learn: where |alpha> +=> |gamma>

And so on for other patterns of interest.
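
A rough Python analogue of these if-then machines, matching three-token windows over (type, word) pairs and emitting learn rules; only the word-anchored patterns are shown:

    # Hypothetical Python analogue of the learn layer: scan three-token windows of
    # (type, word) pairs and record a learn rule for each matching pattern.
    PATTERNS = [
        (("*", "is", "*"), "is"),
        (("*", "was", "*"), "was"),
    ]

    def matches(pattern, window):
        return all(p == "*" or p == w[1] for p, w in zip(pattern, window))

    def learn_layer(tokens):
        learned = []
        for i in range(len(tokens) - 2):
            window = tokens[i:i + 3]
            for pattern, op in PATTERNS:
                if matches(pattern, window):
                    learned.append((op, window[0][1], window[2][1]))
        return learned

    toks = [("noun", "capital city of Western Australia"), ("is", "is"),
            ("proper-noun", "Perth")]
    print(learn_layer(toks))   # [('is', 'capital city of Western Australia', 'Perth')]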

With this basic parser in place, we have results such as:

the |sentence> => |The capital city of Western Australia is Perth #EOS#>

produces:

pre|capital city of Western Australia> => |The>
is|capital city of Western Australia> => |Perth>

And given this sentence:

the |sentence> => |Sydney comma New South Wales #EOS#>

produces:

map|Sydney> => |proper-noun: Sydney>
where|Sydney> => |New South Wales>

And finally, given this sentence:

the |sentence> => |The example listy is one thousand two hundred and thirty two red and blue green fish comma thirty three sharks comma eel and turtle #EOS#>

produces:

map |listy> => |noun: listy>
adj |listy> => |example>
pre |listy> => |The>
is |listy> => |fish comma sharks comma eel and turtle>

map |fish> => |noun: fish>
colour |fish> => |red and blue green>
number |fish> => |one thousand two hundred and thirty two>

map|sharks> => |noun: sharks>
number |sharks> => |thirty three>

Anyway, just a quick taste of an idea.

Appendix: Here are our grammar rules, just add more as needed:

relation-1 |noun __ noun> => |equal>
relation-1 |proper-noun __ proper-noun> => |equal>
relation-1 |adj __ adj> => |equal>
relation-1 |number __ number> => |equal>
relation-1 |colour __ colour> => |equal>

relation-2 |adj __ noun> => |less than>
relation-2 |pre __ noun> => |less than>
relation-2 |number __ noun> => |less than>
relation-2 |colour __ noun> => |less than>
relation-2 |title __ proper-noun> => |less than>

type-1 |number __ and __ number> => |number>
type-1 |colour __ and __ colour> => |colour>
type-1 |noun __ of __ noun> => |noun>
type-1 |noun __ of __ proper-noun> => |noun>

type-2 |noun __ comma __ noun> => |protolist>
type-2 |protolist __ comma __ noun> => |protolist>
type-2 |protolist __ and __ noun> => |protolist>
type-2 |noun __ verb __ noun> => |NVN>

And here are our if-then machines:
(again, define more as required)

template |node: 1: 1> => |*> . |is> . |*>
then |node: 1: *> #=>
    is sselect[1,1] the |values> +=> sselect[-1,-1] the |values>

template |node: 2: 1> => |*> . |was> . |*>
then |node: 2: *> #=>
    was sselect[1,1] the |values> +=> sselect[-1,-1] the |values>

template |node: 3: 1> => |proper-noun> . |comma> . |proper-noun>
then |node: 3: *> #=>
    where sselect[1,1] the |values> +=> sselect[-1,-1] the |values>
1 Like

Here is an encoder that uses “Periodic Cells”, which can be thought of as 1D grid cells. They respond according to some period and receptive-field width. I’ve arranged them in order of increasing period and increasing bin width to give the plot a nice visual appeal.

Note the big discontinuity and complete gap in the middle, where there is no activation at all. Also note that the first value, 0.21, gets a good activation while the second value, 0.75, gets hardly any activation at all; that makes this particular encoder arrangement very asymmetric in its sensitivity. Also note that this kind of encoder extends beyond the boundaries of a finite interval.

We can see that using these types of cells with periodic receptive fields is particularly sensitive to the arrangement of all the cells in aggregate. We either need to be deliberate in how we lay out the receptive bins or use a huge number of random cells so that the problem is solved by brute force.
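
For concreteness, here is a minimal sketch of such periodic cells; the periods, widths, and phases are assumptions, not the configuration plotted above:

    import numpy as np

    rng = np.random.default_rng(0)

    # 20 periodic cells, arranged with increasing period and increasing field width.
    periods = np.linspace(0.1, 0.5, 20)
    widths = np.linspace(0.01, 0.05, 20)
    phases = rng.uniform(0.0, 1.0, 20) * periods

    def encode(x):
        # A cell fires when x, taken modulo its period, lands inside its field.
        offset = np.mod(x - phases, periods)
        return (offset <= widths).astype(np.uint8)

    # The activation weight varies strongly with x, and any x (even outside [0, 1])
    # still produces a code because of the modulo.
    print(encode(0.21).sum(), encode(0.75).sum(), encode(3.7).sum())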

4 Likes

Some important properties of encoders (which might not be easy to visualize) are related to how they influence downstream learning:

  • capability to generalize
  • sample efficiency (how fast their attached models can learn)
  • learner complexity - e.g. a deep MLP learner is more complex than another based on linear regression.
  • computing cost - for both the encoding and learning/inference stages. This is hardware dependent, but currently most of us can only afford CPUs and GPUs.

PS: Regarding

We can use an evolutionary search - start with a brute-force encoding, then search for some “significance metrics” for both efficiency and improvement, by (a rough sketch follows the list):

  • pruning the bins that are irrelevant for the desired outcome
  • moving relevant ones to better “positions”.
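
Here is a toy sketch of that loop; everything in it, including the stand-in “significance metric”, is an assumption for illustration only:

    import numpy as np

    rng = np.random.default_rng(0)

    def fitness(centres, xs, ys):
        # Stand-in "significance metric": average purity of the nearest-bin partition.
        ids = np.abs(xs[:, None] - centres[None, :]).argmin(axis=1)
        purities = [max(ys[ids == i].mean(), 1 - ys[ids == i].mean())
                    for i in np.unique(ids)]
        return float(np.mean(purities))

    xs = rng.uniform(0, 1, 200)
    ys = (xs > 0.5).astype(float)        # toy downstream target
    centres = rng.uniform(0, 1, 16)      # start from brute-force random bins

    # Move relevant bins to better "positions".
    for _ in range(200):
        candidate = centres + rng.normal(0, 0.02, centres.shape)
        if fitness(candidate, xs, ys) >= fitness(centres, xs, ys):
            centres = candidate

    # Prune bins that are irrelevant for the desired outcome.
    for i in reversed(range(len(centres))):
        trial = np.delete(centres, i)
        if len(trial) > 1 and fitness(trial, xs, ys) >= fitness(centres, xs, ys):
            centres = trial

    print(len(centres), "bins remain")
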
2 Likes

Well, then you are solving the Locality-Sensitive Hashing (LSH) problem. If you know the prior distribution of the inputs, then you can optimize your binning/hashing function to get the desired result.
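
For reference, the core of random-projection LSH fits in a few lines; tuning it to a known input distribution is where the real work lies. This sketch is generic and untuned:

    import numpy as np

    rng = np.random.default_rng(0)

    # 16 random hyperplanes hash 8-dimensional inputs to 16 bits; nearby inputs
    # tend to share bits, which is the locality property under discussion.
    planes = rng.normal(size=(16, 8))

    def lsh(v):
        return (planes @ v > 0).astype(np.uint8)

    a = rng.normal(size=8)
    b = a + 0.05 * rng.normal(size=8)    # a nearby point
    c = rng.normal(size=8)               # an unrelated point
    print((lsh(a) == lsh(b)).sum(), (lsh(a) == lsh(c)).sum())   # matching bits out of 16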

For more brain-oriented approaches, you want to be able to switch how you process based on the current context, and you want to be able to process and accommodate entirely novel inputs. Of course, there’s a limit to what you can process (unheard sound frequencies and unseen light frequencies), but you want to figure out how to lay out those encoders so that downstream processing can adapt to any situation it may find itself in.

2 Likes

Now we show the PeriodicScalarEncoder. This is essentially a 1D HTM “grid module”.

Here we show an encoder with n=7 bins, w=1 (no overlap), and a period of L=1.0. The shaded area is what we call the “fundamental region”, which is where the original bins are placed, and the faded bins in the unshaded regions are the so-called “congruent bins”. Each fundamental bin forms a “congruence class”, of which each of the congruent bins is a member. This is just fancy math terminology meaning that each faded bin is a copy of its corresponding unshaded bin.
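
A minimal sketch of such a periodic scalar encoder, with parameter names following the post (the implementation details are assumed):

    import numpy as np

    def periodic_encode(x, n=7, w=1, L=1.0):
        # n bins tile one period L; a value activates w consecutive bins, wrapping
        # around, so the code repeats at every multiple of L.
        start = int(np.floor((x % L) / L * n))
        code = np.zeros(n, dtype=np.uint8)
        code[(start + np.arange(w)) % n] = 1
        return code

    # Congruent values (differing by a multiple of L) produce identical codes.
    print(periodic_encode(0.21), periodic_encode(1.21))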

We can see this effect in its similarity heatmap. Instead of the single diagonal of a fixed-interval encoder, we have repeating diagonals at every multiple of L.

This particular encoder is the closest analogue to cortical grid cells and probably deserves the most analysis.

3 Likes

Now if we use multiple PeriodicScalarEncoders with different variations of their parameters, we start to get even more interesting effects. This is analogous to aggregating the codes of multiple 1D HTM grid modules. Here we only consider the case where w=1.
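
As a concrete sketch of what aggregating multiple modules means, here is one way to concatenate several periodic encoders and compute the pairwise-overlap heatmap; the module parameters are made up, chosen only to mirror the varied-period, prime-n flavour of the figure:

    import numpy as np

    def periodic_encode(x, n, w, L):
        start = int(np.floor((x % L) / L * n))
        code = np.zeros(n, dtype=np.uint8)
        code[(start + np.arange(w)) % n] = 1
        return code

    def multi_encode(x, params):
        # Concatenate the codes of several modules, one (n, w, L) triple per module.
        return np.concatenate([periodic_encode(x, n, w, L) for (n, w, L) in params])

    modules = [(5, 1, 0.5), (7, 1, 0.7), (11, 1, 1.1), (13, 1, 1.3)]   # prime n, varied L
    xs = np.linspace(-1.0, 2.0, 301)
    codes = np.stack([multi_encode(x, modules) for x in xs]).astype(int)
    heatmap = codes @ codes.T            # pairwise overlap over the interval of interest
    print(heatmap.shape)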

This figure is a 2x2 multi-plot of varying different types of parameters.

For the left column, we set all bin sizes to be equal and let the periods vary.

For the right column, we set all periods to be equal and let the bin sizes vary.

For the top row, we set the number of bins per encoder to be multiples of 4: n = {4, 8, 12, 16}.

For the bottom row, we set the number of bins per encoder to be primes: n = {5, 7, 11, 13}.

In the top row, the plots show the synchronous and discontinuous effects caused by each n being a multiple of 4. The bottom row’s prime distribution avoids these synchronous effects and gives a more gradual change in similarity.

We also see in the left column that fixing the bin size leads to a discontinuous, discrete similarity metric, while letting the bin size vary in the right column leads to a more graduated similarity metric.

Conversely, allowing the period to vary in the left column creates a unique local maximum for the similarity metric. Of course, by “unique” we mean within our bounded interval of interest, because a sufficiently large interval will eventually see every pattern repeat.

By forcing the periods to be equal for each encoder in the right column, the similarity metric has repeated local maxima at each cycle of the period.

Here are the associated similarity heatmaps showing these effects over the entire interval of interest, [-1.0, 2.0].

2 Likes

Doing the same plots as above, we now set w=3, so that at every point each encoder’s value intersects 3 bins at a time.

This continues the trends we identified when varying bin sizes and periods. In the left column, when varying periods, we see local uniqueness of a point’s similarity. In the right column, when varying bin size, we see a gradual, smooth decline in similarity as the value differs.

Furthermore, we see the same trends for encoders with n being multiples of 4 versus n being primes. In the top row (multiples of 4) we see synchronization artifacts and steep drops, also indicated by the many crossover-count dots at multiple points. The bottom row (primes) mostly avoids these synchronization artifacts. However, note that in both cases the synchronization artifacts are only avoided in the right column, where the bin sizes are allowed to vary. If the bin sizes are the same across all encoders, then the encoders all transition at the same time.

The mini-squares shown in the left column of this heatmap demonstrate the synchronization effect caused by having the same bin sizes, whereas the right column shows smoother similarity gradients from varying the bin sizes.

The macro-squares seen in the right column also indicate the transition points with high crossover counts. The bottom one only has the discontinuous boundary at every period, whereas the multiples-of-4 case on top shows 4 squares per period.

Some questions to ask yourself until the next post:

Can we get rid of these discontinuities in an easy way? Can we combine the benefits of local similarity on the left with smooth gradients on the right? Is selecting each encoder’s n to be prime the way to go, or is this an artifact of our experiments?

1 Like