Scaled dot product attention mechanism in cortical columns?

Hi, I’m wondering if there is any evidence of something similar to a scaled dot-product attention mechanism in the cortical column, or in Numenta’s proposed architecture for intelligent machines?
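
For reference, by “scaled dot-product attention” I mean the standard transformer operation from “Attention Is All You Need” - a quick NumPy sketch (toy shapes, nothing brain-related) of what that computes:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query to every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted mixture of the value vectors

# toy example: 4 tokens, 8-dimensional queries/keys/values
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```
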
Thanks for any pointers here.

1 Like

I’m not claiming that this is how it actually works in the brain, but GLOM, a deep learning model designed by Geoffrey Hinton, incorporates self-attention to achieve what is basically the “lateral voting” of Numenta’s Thousand Brains Theory. So you could say there are some connections, from some perspectives (reference frames :wink:).

2 Likes

Even if there is an analogous mechanism (which I doubt), the entire point of the attention mechanism is to dynamically approximate any function by deriving its weights from the inputs themselves - a form of meta-learning, to put it simply.
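
To make the “weights from the inputs themselves” point concrete, here’s a toy contrast (my own illustrative sketch, toy shapes, learned projection matrices omitted): an ordinary layer applies a fixed weight matrix, while self-attention derives its token-mixing matrix from the current input.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # 4 tokens, 8 features
W = rng.standard_normal((8, 8))   # fixed, learned weights

# Ordinary layer: the mixing weights W are the same for every input.
static_out = X @ W

# Self-attention: the token-mixing matrix is computed *from X itself*,
# so it changes with every new input sequence ("dynamic weights").
scores = X @ X.T / np.sqrt(X.shape[-1])
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)   # softmax over tokens
dynamic_out = A @ X
```
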

So HTM/TBT doesn’t need attention, because that’s not what happens in the brain, at least not in a biologically corresponding way. Deep learning, on the other hand, favours attention because it yields a much more scalable and robust system capable of impressive results.

Whether Attention is superior to biological mechanisms is another question.

In my understanding, attention in transformers is simply pretraining on a larger dataset than the one for specific use cases, to add more general embeddings. And there could be multiple orders of pre-pretraining to add embeddings of embeddings, etc. Then these nested embeddings form a hierarchy of generalization, similar to the hierarchy in HTM / neocortex.

No, that’s Masked Language Modelling. Attention is a mechanism, not a process; and pre-training is not always a requirement.

It’s adding nothing “general” - that’s the wrong terminology. It’s a manipulation of vector spaces, but we have nothing concrete to go on after that.

Again, that sounds like you’re confusing different aspects. There is no such recursive pre-training, nor is there any recursive aspect in an LLM at all.

Again, there are no nested embeddings. The closest analogy you could come up with would be that the layers of multi-head attention are similar to a hierarchy.

However, that structure of understanding is only verified for convolutional neural networks, which extract more and more abstract features the higher up they go. Generalizing it to attention may be highly misleading to anyone giving this topic a cursory glance.
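
For what it’s worth, this is roughly what “layers of multi-head attention” amounts to in code - a stripped-down sketch (the learned W_q/W_k/W_v projections, residuals and MLPs are deliberately omitted, and the head count and sizes are made up). The same attention operation runs over several feature subspaces (“heads”), and the block is stacked layer on layer, which is the only hierarchy-like structure there.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    A = softmax(X @ X.T / np.sqrt(X.shape[-1]))
    return A @ X

def multi_head(X, n_heads=4):
    # split features into heads, attend within each subspace, re-concatenate
    heads = np.split(X, n_heads, axis=-1)
    return np.concatenate([self_attention(h) for h in heads], axis=-1)

X = np.random.default_rng(0).standard_normal((4, 16))  # 4 tokens, 16 features
for _ in range(6):   # "layers" = stacking the same block repeatedly
    X = multi_head(X)
```
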

Well, how do you get the embeddings then?

Embeddings are more stable than use-case-specific “soft weights”, thus more general.

That’s not saying anything.

I did not say there is, only that there could be. Same for nested embeddings.

Pre-training and embeddings are entirely separate concepts.

There is no measure of stability; they’re simply a representation of your sequence. There’s nothing inherently general about embeddings; as I said, you’re misinterpreting them.

There’s an entire field of XAI devoted to exactly that. You’re welcome to add your own findings and breakthroughs to it.

There is no such thing as nested embeddings. The embeddings simply go through a series of non-linear transformations, at the end of which we project them into our target space and hope the model has learnt the high-dimensional distribution.
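
Schematically, “a series of non-linear transformations ending in a projection to the target space” looks like this (a deliberately minimal sketch with made-up sizes, no attention, residuals or normalisation):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, n_layers = 100, 32, 4

E = rng.standard_normal((vocab, d_model))                    # embedding table
layers = [rng.standard_normal((d_model, d_model)) for _ in range(n_layers)]
W_out = rng.standard_normal((d_model, vocab))                # projection to target space

tokens = np.array([5, 17, 42])
h = E[tokens]              # look up embeddings
for W in layers:
    h = np.tanh(h @ W)     # one non-linear transformation per layer
logits = h @ W_out         # back to the target (vocabulary) space
```
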

How do you get these embeddings / attention heads if not by previous more general training?

As I said before, you’re mixing concepts.

You don’t ‘get’ attention heads by any prior training. They’re a deceptively simple mechanism to allow for dynamic weights in a language model.

You don’t need to train anything to obtain embeddings - a randomly initialized matrix works just as well. Embeddings aren’t required to have any relational dependencies on their neighbours - the term itself allows for any representation. You can tune those embeddings the way word2vec does, but that’s a different matter.
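
Something like this (toy sizes): a freshly initialised random matrix already counts as an embedding table - any training, whether word2vec-style or end-to-end, only tunes it afterwards.

```python
import numpy as np

vocab_size, dim = 10_000, 64
E = np.random.default_rng(0).standard_normal((vocab_size, dim))  # no training at all

token_ids = [12, 7, 981]
embeddings = E[token_ids]   # perfectly valid embeddings, just not "meaningful" yet
```
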

How is it different from training a regular MLP? You have some kind of squared randomness: randomized embeddings and randomized regular weights, trained together?

I may well be mixing things up, because it’s all so perverse: random shit piled on top of some other random shit.

…that is exactly what an MLP is - there is no difference in how the two are initialized and trained. The difference is in the computation itself, which is why interpreting self-attention as a plain linear projection is so incorrect.

You may think so, but that’s because you clearly have little idea of what you’re talking about.

I have little idea about neuroscience, and to me TBT, too, looks like random mechanisms strung together with little chance of working. Some other people here might have a better idea, thanks to their knowledge of how it might all operate together, but that doesn’t mean I can claim HTM/TBT is all nonsense. I think it has merit, otherwise I wouldn’t be here in the first place.

Ok, I did assume that, but I thought the embeddings were trained before the regular weights. Thanks for clarifying - so both are trained together, via the same backprop? What makes them more dynamic, then? Would it be accurate to distinguish them as lateral vs. vertical weights? You mentioned layers of attention - is that lateral over a greater distance? Why is that mutually exclusive with pretraining?

I do have some idea about neuroscience, but it doesn’t make me like the brain either. It’s also loaded with random shit and evolutionary constraints, on all levels. But that’s all we got at the moment, plus those deep NNs.

My own approach is totally different: it’s a strictly functional design. But I have one steady collaborator, vs. maybe a million neuroscientists and ML researchers around the world. So, that’s a plan B, but even if they get there first, their AI will then switch to my scheme on its own :). Because it makes sense.

1 Like

Usually. In self-attention, both are directly modified, because the embeddings are mixed with the weights (the W_q, W_k, W_v matrices).
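
A quick PyTorch sketch (arbitrary names and sizes) of what “both directly modified” means in practice: the embedding table and the W_q/W_k/W_v projections sit in the same computation graph and receive gradients from the same backward pass.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(100, 16)            # token embedding table
W_q = nn.Linear(16, 16, bias=False)    # the W_q, W_k, W_v projections
W_k = nn.Linear(16, 16, bias=False)
W_v = nn.Linear(16, 16, bias=False)

x = emb(torch.tensor([[3, 14, 15]]))   # (1, 3, 16)
q, k, v = W_q(x), W_k(x), W_v(x)
attn = torch.softmax(q @ k.transpose(-2, -1) / 16 ** 0.5, dim=-1)
out = attn @ v

out.sum().backward()                   # one backward pass
print(emb.weight.grad is not None)     # True - the embeddings get gradients
print(W_q.weight.grad is not None)     # True - so do the projections
```
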

I don’t follow; all weights are processed vertically. This is not a gated Mixture-of-Experts type architecture with lateral weights (and even those are effectively vertical too).

Attention has no bearing on pre-training; self-supervised learning, weakly supervised learning, etc. are part of that paradigm. You can pre-train LSTMs too if you want.

They are all trained vertically, but attention weights are assigned to lateral connections: between tokens within an input vector. Which I think makes that input vector a graph, and then a transformer is a type of graph neural network: Transformers are Graph Neural Networks | by Chaitanya K. Joshi | Towards Data Science. Regular weights, by contrast, are assigned to vertical connections, between nodes of different layers.
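
Here’s the picture I have in mind (my own toy sketch, no learned projections): the softmaxed score matrix can be read as a dense weighted adjacency matrix over the token “nodes”, recomputed for every new input.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

tokens = np.random.default_rng(0).standard_normal((5, 8))   # 5 token nodes
adjacency = softmax(tokens @ tokens.T / np.sqrt(8))          # edge weight i -> j
new_tokens = adjacency @ tokens   # each node aggregates its neighbours' features
```
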

So, are these lateral key/value connections within the input vector only, or can they also be between nodes within higher layers, @neel_g?

Just because you can re-write something as something else doesn’t really mean it neatly falls into that category. Treating self-attention as a graph, though, can be very illuminating when you compare it against the multitude of attention variants.

You can have lateral computation with any vectors you supply, whether they come from a prior layer or from a different stack entirely - the latter usually meant for conditioning on another distribution (cross-attention).
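
A minimal sketch of that cross-attention case (arbitrary shapes): the queries come from one stack, the keys/values from another.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
decoder_states = rng.standard_normal((3, 8))   # queries come from this stack
encoder_states = rng.standard_normal((7, 8))   # keys/values come from a different one

scores = decoder_states @ encoder_states.T / np.sqrt(8)
out = softmax(scores) @ encoder_states         # condition on the other distribution
```
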

1 Like