Scaled dot product attention mechanism in cortical columns?

Hi, I’m wondering if there is any evidence of something similar to a scaled dot-product attention mechanism in the cortical column, or in Numenta’s proposed architecture for intelligent machines?
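
For reference, by “scaled dot-product attention” I mean the standard transformer operation from “Attention Is All You Need” - a quick NumPy sketch (toy shapes, nothing brain-related) of what that computes:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query to every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted mixture of the value vectors

# toy example: 4 tokens, 8-dimensional queries/keys/values
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```
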
Thanks for any pointers here.

1 Like

I’m not claiming that this is how it actually works in the brain, but GLOM, a deep learning model designed by Geoffrey Hinton, incorporates self-attention to achieve what is basically the “lateral voting” of Numenta’s Thousand Brains Theory. So you could say there are some connections, from some perspectives (reference frames :wink:).

2 Likes

Even if there is an analogous mechanism (which I doubt), the entire point of the attention mechanism is to dynamically approximate any function by deriving its weights from the inputs themselves - a form of meta-learning, to put it simply.
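
To make the “weights from the inputs themselves” point concrete, here’s a toy contrast (my own illustrative sketch, toy shapes, learned projection matrices omitted): an ordinary layer applies a fixed weight matrix, while self-attention derives its token-mixing matrix from the current input.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # 4 tokens, 8 features
W = rng.standard_normal((8, 8))   # fixed, learned weights

# Ordinary layer: the mixing weights W are the same for every input.
static_out = X @ W

# Self-attention: the token-mixing matrix is computed *from X itself*,
# so it changes with every new input sequence ("dynamic weights").
scores = X @ X.T / np.sqrt(X.shape[-1])
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)   # softmax over tokens
dynamic_out = A @ X
```
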

So HTM/TBT doesn’t need attention, because that’s not what happens in the brain, at least not in a biologically corresponding way. Deep learning, on the other hand, favours attention because it yields a much more scalable and robust system capable of impressive results.

Whether Attention is superior to biological mechanisms is another question.

In my understanding, attention in transformers is simply pretraining on a larger dataset than the one for specific use cases, to add more general embeddings. And there could be multiple orders of pre-pretraining to add embeddings of embeddings, etc. Then these nested embeddings form a hierarchy of generalization, similar to the hierarchy in HTM / neocortex.

No, that’s Masked Language Modelling. Attention is a mechanism, not a process; and pre-training is not always a requirement.

It’s adding nothing “general” - that’s the wrong terminology. It’s a manipulation of vector spaces, but we have nothing concrete to go on after that.

Again, that sounds like you’re confusing different aspects. There is no such recursive pre-training, nor is there any recursive aspect in an LLM at all.

Again, there are no nested embeddings. The closest analogy you could come up with would be that the layers of multi-head attention are similar to a hierarchy.

However, that structure of understanding is only verified for convolutional neural networks, which extract more and more abstract features the higher up they go. Generalizing it to attention may be highly misleading to anyone giving this topic a cursory glance.
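
For what it’s worth, this is roughly what “layers of multi-head attention” amounts to in code - a stripped-down sketch (the learned W_q/W_k/W_v projections, residuals and MLPs are deliberately omitted, and the head count and sizes are made up). The same attention operation runs over several feature subspaces (“heads”), and the block is stacked layer on layer, which is the only hierarchy-like structure there.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    A = softmax(X @ X.T / np.sqrt(X.shape[-1]))
    return A @ X

def multi_head(X, n_heads=4):
    # split features into heads, attend within each subspace, re-concatenate
    heads = np.split(X, n_heads, axis=-1)
    return np.concatenate([self_attention(h) for h in heads], axis=-1)

X = np.random.default_rng(0).standard_normal((4, 16))  # 4 tokens, 16 features
for _ in range(6):   # "layers" = stacking the same block repeatedly
    X = multi_head(X)
```
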

Well, how do you get the embeddings then?

Embeddings are more stable than use-case-specific “soft weights”, thus more general.

That’s not saying anything.

I did not say there is, only that there could be. Same for nested embeddings.

Pre-training and embeddings are entirely separate concepts.

There is no measure of stability; they’re simply a representation of your sequence. There’s nothing inherently general about embeddings; as I said, you’re misinterpreting them.

There’s an entire field of XAI devoted to exactly that. You’re welcome to add your own findings and breakthroughs to it.

There is no such thing as nested embeddings. The embeddings simply go through a series of non-linear transformations, at the end of which we project them into our target space and hope the model has learnt the high-dimensional distribution.
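
Schematically, “a series of non-linear transformations ending in a projection to the target space” looks like this (a deliberately minimal sketch with made-up sizes, no attention, residuals or normalisation):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, n_layers = 100, 32, 4

E = rng.standard_normal((vocab, d_model))                    # embedding table
layers = [rng.standard_normal((d_model, d_model)) for _ in range(n_layers)]
W_out = rng.standard_normal((d_model, vocab))                # projection to target space

tokens = np.array([5, 17, 42])
h = E[tokens]              # look up embeddings
for W in layers:
    h = np.tanh(h @ W)     # one non-linear transformation per layer
logits = h @ W_out         # back to the target (vocabulary) space
```
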

How do you get these embeddings / attention heads if not by previous more general training?

As I said before, you’re mixing concepts.

You don’t ‘get’ attention heads by any prior training. They’re a deceptively simple mechanism to allow for dynamic weights in a language model.

You don’t need to train anything to obtain embeddings - a randomly initialized matrix works just as well. Embeddings aren’t required to have any relational dependencies on their neighbours - the term itself allows for any representation. You can tune those embeddings the way word2vec does, but that’s a different matter.
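
Something like this (toy sizes): a freshly initialised random matrix already counts as an embedding table - any training, whether word2vec-style or end-to-end, only tunes it afterwards.

```python
import numpy as np

vocab_size, dim = 10_000, 64
E = np.random.default_rng(0).standard_normal((vocab_size, dim))  # no training at all

token_ids = [12, 7, 981]
embeddings = E[token_ids]   # perfectly valid embeddings, just not "meaningful" yet
```
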

How is it different from training a regular MLP? You have some kind of squared randomness: randomized embeddings and randomized regular weights, trained together?

I may well be mixing things up, because it’s all so perverse: random shit piled on top of some other random shit.

…that is exactly what an MLP is - there is no difference in how the two are initialized and trained. The difference is in the computation itself, which is why interpreting self-attention as a plain linear projection is so incorrect.

You may think so, but that’s because you clearly have little idea of what you’re talking about.

I have little idea about neuroscience, and to me TBT, too, looks like random mechanisms strung together with little chance of working. Some other people here might have a better idea, thanks to their knowledge of how it might all operate together, but that doesn’t mean I can claim HTM/TBT is all nonsense. I think it has merit, otherwise I wouldn’t be here in the first place.

Ok, I did assume that, but I thought the embeddings were trained before the regular weights. Thanks for clarifying - so both are trained together, via the same backprop? What makes them more dynamic, then? Would it be accurate to distinguish them as lateral vs. vertical weights? You mentioned layers of attention - is that lateral over a greater distance? Why is that mutually exclusive with pretraining?

I do have some idea about neuroscience, but it doesn’t make me like the brain either. It’s also loaded with random shit and evolutionary constraints, on all levels. But that’s all we got at the moment, plus those deep NNs.

My own approach is totally different: it’s a strictly functional design. But I have one steady collaborator, vs. maybe a million neuroscientists and ML researchers around the world. So, that’s a plan B, but even if they get there first, their AI will then switch to my scheme on its own :). Because it makes sense.

1 Like

Usually. In self-attention, both are directly modified, because the embeddings are mixed with the weights (the W_q, W_k, W_v matrices).
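
A quick PyTorch sketch (arbitrary names and sizes) of what “both directly modified” means in practice: the embedding table and the W_q/W_k/W_v projections sit in the same computation graph and receive gradients from the same backward pass.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(100, 16)            # token embedding table
W_q = nn.Linear(16, 16, bias=False)    # the W_q, W_k, W_v projections
W_k = nn.Linear(16, 16, bias=False)
W_v = nn.Linear(16, 16, bias=False)

x = emb(torch.tensor([[3, 14, 15]]))   # (1, 3, 16)
q, k, v = W_q(x), W_k(x), W_v(x)
attn = torch.softmax(q @ k.transpose(-2, -1) / 16 ** 0.5, dim=-1)
out = attn @ v

out.sum().backward()                   # one backward pass
print(emb.weight.grad is not None)     # True - the embeddings get gradients
print(W_q.weight.grad is not None)     # True - so do the projections
```
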

I don’t follow; all weights are processed vertically. This is not a gated Mixture-of-Experts type architecture with lateral weights (and even those are effectively vertical too).

Attention has no bearing on pre-training; self-supervised learning, weakly supervised learning, etc. are part of that paradigm. You can pre-train LSTMs too if you want.

They are all trained vertically, but attention weights are assigned to lateral connections: between tokens within an input vector. Which I think makes that input vector a graph, and then a transformer is a type of graph neural network: Transformers are Graph Neural Networks | by Chaitanya K. Joshi | Towards Data Science. Regular weights, by contrast, are assigned to vertical connections, between nodes of different layers.
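
Here’s the picture I have in mind (my own toy sketch, no learned projections): the softmaxed score matrix can be read as a dense weighted adjacency matrix over the token “nodes”, recomputed for every new input.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

tokens = np.random.default_rng(0).standard_normal((5, 8))   # 5 token nodes
adjacency = softmax(tokens @ tokens.T / np.sqrt(8))          # edge weight i -> j
new_tokens = adjacency @ tokens   # each node aggregates its neighbours' features
```
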

So, are these lateral key/value connections within the input vector only, or can they also be between nodes within higher layers, @neel_g?

Just because you can re-write something as something else doesn’t really mean it neatly falls into that category. Treating self-attention as a graph, though, can be very illuminating when you compare it against the multitude of attention variants.

You can have lateral computation with any vectors you supply, whether they come from a prior layer or from a different stack entirely - the latter usually meant for conditioning on another distribution (cross-attention).
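
A minimal sketch of that cross-attention case (arbitrary shapes): the queries come from one stack, the keys/values from another.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
decoder_states = rng.standard_normal((3, 8))   # queries come from this stack
encoder_states = rng.standard_normal((7, 8))   # keys/values come from a different one

scores = decoder_states @ encoder_states.T / np.sqrt(8)
out = softmax(scores) @ encoder_states         # condition on the other distribution
```
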

1 Like