A Deeper Look at Transformers: Famous Quote: "Attention is all you need"

Think of it this way: when you see a maze (or are in one), your grid cells fire and you have a spatial understanding. You know things that are extremely basic but that the network does not. For instance, a single look at the maze and you can deduce the correct path. But why should you go towards the vertical stem of the I-maze? Why not just keep going horizontally? Well, the answer is obvious - you would hit the wall.

Now why would I not progress on hitting the wall? If I press the right arrow, I move. But near that wall, the rule breaks down. Why? Because walls are immovable and there’s no way you can pass through them; i.e., at the edge of the maze you simply cannot traverse, say, the blacked-out area.

These are assumptions. You make millions of them all the time, every day. There are certain priors in this world - for instance, that things repeat. If I drew a pattern here:
1, 2, 1, 0, -1, -2, _, _

Most would answer that the blanks are -1, 0. But that’s a prior. This is like a sine wave, and in this world we find a lot of periodic motion. The sun rises and sets; day and night follow each other; and so on.

But imagine an alien who lived in a mirror world - a world covered in mirrors, in which they survived. In that world things aren’t periodic but symmetrical: mirror images of each other. So they might look at the pattern and say the solution is obviously:

1, 2, 1, 0, -1, -2, ||| -2, -1, where ||| is a mirror.

From their POV, the solution is obvious too. Why should the pattern repeat, when it should clearly continue to form a symmetry?
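To make the two priors concrete, here is a toy sketch (entirely illustrative - the sine-like generator is just one guess that happens to reproduce the six observations):

```python
import math

observed = [1, 2, 1, 0, -1, -2]

# Periodic prior: assume the numbers are samples of a wave - here a guessed
# generator round(2*sin(pi*t/4)), which reproduces the six observations - and
# simply keep sampling it.
periodic = [round(2 * math.sin(math.pi * t / 4)) for t in (7, 8)]   # [-1, 0]

# Mirror prior: assume the world is symmetric and reflect the sequence about
# its end instead of letting it repeat.
mirrored = list(reversed(observed))[:2]                             # [-2, -1]

print("periodic prior:", periodic)
print("mirror prior:  ", mirrored)
```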

The exact same happens here. You, as a human, can tell instantly what the correct way to solve the maze is, because of those same priors. Here’s an idea - what if I give the agent a +1000 reward if it hits the wall 3 times in a row? A human would never figure that out - why would you intentionally make the same mistake thrice? But the agent would, thanks to the simple ε-greedy strategy with which it explores the environment. The agent doesn’t think it’s stupid, because it doesn’t know how things work. That’s what it’s finding out. For it, hitting the wall makes perfect sense if it gets the +1000 reward.
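For reference, the ε-greedy rule being referred to is just this (the action names and value estimates are hypothetical):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore), otherwise the
    action with the highest current value estimate (exploit)."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

# Hypothetical maze actions. Every so often the agent will deliberately pick
# "right" into a wall - which is exactly how it could stumble onto a
# "+1000 for hitting the wall 3 times" rule that no human would try.
action = epsilon_greedy({"up": 0.2, "down": 0.1, "left": 0.0, "right": 0.5})
print(action)
```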

So every time you let the agent loose, it spends most of its iterations learning exactly that. “Ok, pressing right moves me in this vague direction. Black areas are some weird things I can’t escape from, but which waste time when I collide with them - avoid them. I travel across this area of pixels to reach the green one. Oh! I just got a reward. Maybe reaching those pixels gets me the reward?”

Which is why transfer learning helps so much. The agent already has some spatial and visuo-temporal priors, maybe even a map-like representation (even simple DQNs learn that, according to the literature). That’s why sample efficiency depends so directly on priors. Assumptions help - a lot more than you think.

Oh, but that’s the whole point of these DTs, my friend. Priors do transfer to other tasks: pre-trained networks (a) learn faster, (b) are more accurate, and (c) often outperform the very demonstrators they were trained on. See the MGDT work I cited above.

This behavior doesn’t even require huge scale to study!

2 Likes

Should look more carefully at the I (or H) labyrinth problem. Having spatial priors can’t increase an agent’s likelihood of winning above 50%.

And I don’t think a decision transformer pre-trained on 1B (!!!) Atari gameplay experiences would need many fewer samples to play decent chess than the millions of games needed to reach the same level of performance training on chess from scratch.

PS: Chess is hard - just let a DT (pre-trained or not) play the H labyrinth.

1 Like

A little busy these last two days so haven’t been able to follow this thread fully.

cezar_t (March 6):

robf:

This is something which the language problem constrains you to nicely. It constrains you to think about a cognitive problem as cause and effect.

Yes, there’s something to causality but it might be deeper. I don’t think it is our language skills that shape our minds to conceive causality.
I think the assumption of causality is one of those important priors which shapes everything else about our minds, language included.

Agreed.

I don’t think it is that our language skills shape our minds to think causally. I just think language might be the simplest place to observe this more fundamental causal bias in our cognition. I believe it is most transparently visible in language, because language is the data set most directly generated by the brain itself. It makes sense that the brain would pare down unnecessary information as much as possible in the signal which it produces itself, to be consumed by another perceptual system of the same kind.

So I think a more fundamental bias is just more easily seen in language.

But yes, I don’t think language is fundamental at all. (I’ve actually spent a lifetime learning different languages to explore that idea for myself, subjectively!)

The big question is how one could engineer a speculative causality prior in a machine learning framework. I call it speculative because that’s what it is: we make assumptions, and when one is confirmed we take it as “correct” for as long as it fits the data.

I think it might be easy to “engineer a speculative causality prior in a machine learning framework”. We can do it along the path shown by the simple example of language: just build a network of perceptual stimuli. Observations. Observations in sequence. In the first instance, build a network of language sequences - because it is a very simple example, not because it is anything fundamental.

Then the “speculative causality” might just be groupings in that network of perceptual observations which (at energy maxima/minima) tend to share predictions - that is, appear to have shared causality.
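A minimal sketch of what such a network of language sequences, and the “shared prediction” groupings, could look like (the corpus and the scoring rule are purely illustrative placeholders, not a claim about the actual mechanism):

```python
from collections import defaultdict
from itertools import combinations

text = "the cat sat on the mat and the dog sat on the rug"
words = text.split()

successors = defaultdict(set)            # word -> words observed right after it
for a, b in zip(words, words[1:]):
    successors[a].add(b)

# Crude stand-in for the grouping idea: words whose successor sets overlap
# "share predictions", i.e. appear to have shared causality.
def shared_prediction(w1, w2):
    return len(successors[w1] & successors[w2])

ranked = sorted(((shared_prediction(a, b), a, b)
                 for a, b in combinations(successors, 2)), reverse=True)
print(ranked[:2])   # e.g. "cat" and "dog" group together: both predict "sat"
```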

Regarding the chaos discussion - that could be a fertile one… we’re better off moving it to another topic. Where I encountered it was in reservoir computing. The closer to the chaotic threshold a reservoir is, the better it gets at modelling its input signals, but not if it passes that threshold. They need to be almost chaotic to be useful, at least in that particular case.

Great! Yes. Reservoir computing. Total agreement. That’s my current guess at an origin story for “cognition”, along the energy minima/maxima prediction-enhancing grouping lines I’m describing. I imagine cognition started as a kind of “echo state machine” over even very simple early nervous systems, imprinting events and consequences, causes and effects, as a kind of “echo mechanism”. And then evolution simply enhanced that cause and effect echo state network, by evolving to enhance the “echo” mechanism. Some kind of simple amplification of the “echo”. Amplification by “stacking” events which shared causes and effects. And that “enhancement” is what we now call “concepts”. Proto-“concepts” might just be groupings of things which tend to share causes and effects. And found by seeking energy minima in networks of sequences of observations.

For language, those “stacks” would be hierarchies of groupings of words which share causes and effects within a text. That’s what linguists found useful for building “grammars” for unknown languages way back in the ’20s/’30s. And it will be what is implicitly found in transformers. But the general case might apply to any kind of perceptual sequence. Tactile, whatever. (I recall it was one of Jeff Hawkins’ insights for HTM that touch also requires the skin to move back and forth over a rough surface. Perceptual sequence, but perceptual sequence for touch, this time. You don’t perceive texture unless you move your finger… So HTM was on to this importance of sequence thing, early… Also, we don’t have any general explanation yet for why our visual system resolves images using saccades…) It would just be more easily identified for language. Only olfaction might be an exception - I don’t think olfaction uses sequence. (Though my namesake Walter Freeman did a lot of interesting work on chaotic behaviour in olfaction networks… Just not sequential??)

These groupings (starting with language) will be what are “learned” by transformers. But transformers assume the energy surfaces described by such groupings are static and can (must!) be found using gradient descent. In the context of reservoir computing/echo state/liquid state machines, the energy surfaces described by these prediction-enhancing “stacks” of shared cause and effect observations would not be static. They might correspond more to dynamic clusters such as those described in the Brazilian paper.

Anyway, good, yes. Some interesting similarities of thinking. Would be happy to explore it further.

Happy to contribute to a separate thread on any of these themes if you want to create one.

3 Likes

I’m also very interested to hear more of your thoughts regarding language, causes & effects, etc. - I’ve already felt a great deal of inspiration from your posts in this thread!

Maybe the semantics part can be fundamental? Since the surface syntax and even rule-based grammar are apparently not fundamental.

I have “rational thinking” versus “daily language usage” in my mind: e.g. one usually says “there’s a straight line rising sharply”, but in formal mathematical language one can give y = 5x + 3 to state it more precisely. I suppose math, as a language, is one (or even more) orders/levels higher than the usual “natural languages”? But semantically one can always answer that y is 48 when asked “what if x is 9?” - maybe less accurately and with vaguer phrasing, but as well as the in-situ pragmatics demand.

What if “the language skill” is “causality thinking” itself? Language is but the vehicle for communication, and the very goal of communication is to change your counterpart’s mind somehow. Then what else can it be, other than your communication being the “cause”, and your counterpart’s change of mind being the “effect”?

Further, for successful communication - i.e. to convince your counterpart of any idea - can causal relations be avoided entirely? I suspect not. Obviously not when stating rules/laws; and even when stating some fact, there is an implication that “since you live in this world, you must be interested in the fact that the world has …”. Even for a fictional fact, the implication can be: “in case you enjoy thinking about subject X, it appears …”.

“Causality” makes people “believe”; if not through causal semantics, I doubt one’s mind can be changed any other way.

Along this line of thought, I would suggest that our (human) modeling of the world is not a compact/precise description of the underlying (possibly chaotic) function at all, but a (good-enough) approximation. We apply Newtonian physics perfectly well in daily life; even though we know it’s incorrect (compared to general relativity, and quantum mechanics in turn, and …), we can still believe it’s universally true where it applies.

We humans are capable of “learning” where the chaotic boundaries are, and of enjoying the smooth (trivially approximable) continuums between them. I suggest there are “deep knowledges”, about where/when/how (including at what scale) some “shallow knowledge” applies. Contemporary AIs seem to have learnt the “shallow knowledges” well, but not the deeper ones at all.

1 Like

The 1B refers to the parameter count, not the dataset size :slight_smile: For comparison, 1B models are so small that you can train them yourself on gaming hardware. A free Colab can get you to ~2-3B easily.

And yes, if you do that pre-training and then fine-tune on some other game, you do require fewer demonstrations. That is “positive transfer”, and there are overwhelming results for it.

Coincidentally, a paper was published today which happens to demonstrate all those claims for embodied multi-modal LMs: https://twitter.com/DannyDriess/status/1632904675124035585

  1. PaLM-E exhibits positive transfer, outperforming the SOTA on a VQA (visual question answering) task by ~10%.
  2. Scaling leads to a drastic drop in catastrophic forgetting; despite being fine-tuned on a totally different domain, it retains NLU/NLG performance proportional to the scale of the model.
  3. LLMs can handle a variety of domains and cross-transfer between them, again with scale. PaLM-E, for instance, handles language, images, an embodied robot, and combinations of them very well.

And the kicker is that PaLM isn’t even the best model - we’ve yet to Chinchilla-scale it, or apply MoD training, which further transcends scaling laws with a mere fraction of the compute (often ~0.1% of what’s employed in the base LLM).

2 Likes

I am agreeing with @complyue on this:
I’m also very interested to hear more of your thoughts regarding language, causes & effects, etc. - I’ve already felt a great deal of inspiration from your posts in this thread!

I find your proposal that HTM’s predictive properties may be a good base for constructing a transformer-like architecture very interesting. I see the higher-level connection structures of the brain as being a little like what I understand of multi-head attention, so this fires my imagination.

1 Like

complyue (March 7):

robf:

Happy to contribute to a separate thread on any of these themes if you want to create one.

I’m also very interested to hear more of your thoughts regarding language, causes & effects, etc. - I’ve already felt a great deal of inspiration from your posts in this thread!

Glad you’re finding something promising in it.

I have some concrete experiments I’d like to try. There is a lot of commonality with aspects of HTM. As I say, I corresponded a lot in (the predecessor to?) this forum between 2014 and 2016. At that time the whole focus on sequentiality was even rarer. That was before the whole transformer explosion, and the only work on any kind of sequence in AI was HTM, plus some work with LSTMs for language translation.

The experiments I would like to try can be very simple. If someone wants to explore a simple implementation with me, that would be great.

Basically, what I want to try is to build a network of observed language sequences, and then give it a good “shake”, and see how the energy surfaces cluster. Rather as was described for community detection in the Brazilian paper. Only doing it for language sequences.

I already managed to implement such a language sequence network in an open source neurosimulator. And I was able to get it to oscillate. I was surprised how easy that was. I thought I would have to try some clever feedback mechanisms to get oscillations. But it turned out networks of language naturally feedback. Common words like “the” close the loop naturally, and any propagated excitation loops back around. The only thing I needed to do to get oscillation was to implement inhibition. And even that was simple. I just needed to inhibit randomly. Then I turned the inhibition up and down until I got a balance point, and the network oscillated nicely.
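Not the BrainSim II implementation, but a minimal toy sketch of the same loop (the corpus, the update rule, and the 0.5 inhibition level are all placeholder choices): excitation spreading over a word-sequence network, with random inhibition dialled up or down by hand.

```python
import random
from collections import defaultdict

text = "the cat sat on the mat the dog sat on the rug and the cat saw the dog"
words = text.split()

edges = defaultdict(set)                 # network of observed language sequences
for a, b in zip(words, words[1:]):
    edges[a].add(b)                      # note: "the" closes loops naturally

active = {"the"}                         # seed excitation
inhibition = 0.5                         # fraction of candidates randomly suppressed

for step in range(20):
    # excitation propagates along sequence edges
    candidates = set().union(*(edges[w] for w in active)) if active else set()
    # random inhibition: suppress a fraction of the candidate words
    active = {w for w in candidates if random.random() > inhibition}
    print(step, sorted(active))
```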

The next stage would be to try and find a hierarchy in the raster plot of those oscillations.

So far it’s really simple. Just a network of observed language sequences. Apply inhibition over it fairly indiscriminately, at random. And turn the inhibition up and down until it oscillates.

The next step is to look at a raster plot of spike firings in the neurosimulator, and try to see how we might tease some kind of hierarchical “community” structure out of it.

robf:

So I think a more fundamental bias is just more easily seen in language.

But yes, I don’t think language is fundamental at all. (I’ve actually spent a lifetime learning different languages to explore that idea for myself, subjectively!)

Maybe the semantics part can be fundamental? Since the surface syntax and even rule-based grammar are apparently not fundamental.

The conclusion I’ve come to is that “thinking” is an assembly of perceptual impressions. Language is a dance which is part of that, and prompts thinking by its connections to it. But it is not fundamental to it.

Fundamental are observations, examples. What Kuhn called a “paradigm”:

Kuhn, The Structure of Scientific Revolutions, p. 192 (Postscript):
“When I speak of knowledge embedded in shared exemplars, I am not referring to a mode of knowing that is less systematic or less analyzable than knowledge embedded in rules, laws, or criteria of identification. Instead I have in mind a manner of knowing which is misconstrued if reconstructed in terms of rules that are first abstracted from exemplars and thereafter function in their stead.”

If you say something in two languages, the impression I’ve been able to form (from a non-native level) is that the sensation of thought beneath the two languages is much the same, and more a feeling of “being there”, than it is something connected to any one of the languages. Actual “thought” is below it, and separate. The feeling of expressing an idea is much the same.

This is quite different from the hypothesis associated with Sapir-Whorf, that language completely governs our conceptual system (which was the idea that first seduced me into learning another language).

This feels to me like something very similar to some comments I read of Einstein’s opinion about the same thing:

“The … elements in thought are … images which can be … reproduced and combined”

How Einstein Thought: Why “Combinatory Play” Is the Secret of Genius
https://www.themarginalian.org/2013/08/14/how-einstein-thought-combinatorial-creativity/

So, “images” and not words.

I have “rational thinking” versus “daily language usage” in my mind: e.g. one usually says “there’s a straight line rising sharply”, but in formal mathematical language one can give y = 5x + 3 to state it more precisely. I suppose math, as a language, is one (or even more) orders/levels higher than the usual “natural languages”? But semantically one can always answer that y is 48 when asked “what if x is 9?” - maybe less accurately and with vaguer phrasing, but as well as the in-situ pragmatics demand.

The thing about maths is that it was shown to be incomplete. Formalizations are incomplete. I think the basic level is combinations of examples/perceptions. And those combinations can contradict. Maths excludes contradictions. But by doing so it excludes entire areas of “truth”.

My signature example of a formal contradiction giving rise to more mental descriptive power, not less, is non-Euclidean geometry. Mathematicians sought for millennia to prove Euclid’s fifth postulate, that parallel lines never meet. Until Gauss decided to simply allow parallel lines to meet, and discovered that it created for him an entirely new branch of mathematical truth, the geometry of spheres (instead of planes). Parallel lines meeting or not meeting is a contradiction. But it is not that one or the other is not “true”, just that the two “truths” apply to different aspects of the universe.

robf:

Great! Yes. Reservoir computing. Total agreement. That’s my current guess at an origin story for “cognition”, along the energy minima/maxima prediction-enhancing grouping lines I’m describing. I imagine cognition started as a kind of “echo state machine” over even very simple early nervous systems, imprinting events and consequences, causes and effects, as a kind of “echo mechanism”. And then evolution simply enhanced that cause and effect echo state network, by evolving to enhance the “echo” mechanism. Some kind of simple amplification of the “echo”. Amplification by “stacking” events which shared causes and effects. And that “enhancement” is what we now call “concepts”. Proto-“concepts” might just be groupings of things which tend to share causes and effects. And found by seeking energy minima in networks of sequences of observations.

Along this line of thought, I would suggest that our (human) modeling of the world is not a compact/precise description of the underlying (possibly chaotic) function at all, but a (good-enough) approximation. We apply Newtonian physics perfectly well in daily life; even though we know it’s incorrect (compared to general relativity, and quantum mechanics in turn, and …), we can still believe it’s universally true where it applies.

Our formal models, yes, I agree. They are “good enough” approximations. But actual thought is itself a chaotic recombination of observations. And actually more powerful for that. We have been missing that power, because we’ve been assuming the mechanism for generating formal models is the same as the formal models themselves.

The archetypal formal description would be maths. And, as I said above, maths was shown to be incomplete: formalizations exclude contradictions, and by doing so they exclude entire areas of “truth” - non-Euclidean geometry being my signature example of a contradiction (parallel lines meeting or not meeting) that opened up an entirely new branch of mathematical truth.

We are capable of “learning” where the chaotic boundaries are, and of enjoying the smooth (simply approximable) continuums between them. I suggest there are “deep knowledges”, about where/when/how some “shallow knowledge” applies. Contemporary AIs seem to have learnt the “shallow knowledges” well, but not the deeper ones at all.

Fair enough with modeling “deep-knowledges”. But I think it is modeling “new-knowledges” where the current tech falls down.

The “deep-knowledges” you are talking about might be where the broader sensory data that neel_g talked about would start to become relevant. I don’t disagree with him that broader data sources will be important. Those will be the “images which can be … reproduced and combined” which Einstein talked about. I just think that there is still an insight to be gained from the simpler language-based system first, about how a system might find new structure all the time. Once we try that with the simple sequential system of language, we might apply it broadly across all perception.

1 Like

Bitking (Moderator, March 7):

I am agreeing with @complyue on this:
I’m also very interested to hear more of your thoughts regarding language, causes & effects, etc. - I’ve already felt a great deal of inspiration from your posts in this thread!

I find your proposal that HTM’s predictive properties may be a good base for constructing a transformer-like architecture very interesting. I see the higher-level connection structures of the brain as being a little like what I understand of multi-head attention, so this fires my imagination.

Excellent Bitking. Very cool. I certainly welcome constructive feedback on this.

I always thought HTM was a good initial fit, with its focus on sequence. I felt it was a good fit for the work I had been doing on language at a functional level, and the way it seemed to me to form dynamic sequential structure. Corresponding on the HTM forum back then helped me to see how an actual network implementation might be carried out.

Take a look at the concrete experimental proposal I sketched above. Any suggestions welcome.

1 Like

So true! Carl Hewitt has been working out Inconsistency Robust Direct Logic for programming computers (as opposed to machine learning), though I’m not sure it adheres to traditional mathematical standards, or how inconsistency/contradiction can make peace with math there.

1 Like

In (computer) programming jargon, serialization is about how to encode, at the sending end, a structured piece of data (e.g. a person’s profile: name, birthday, favorite things by category) into a sequence of information units (bytes or bits), so that the sequence can be decoded (deserialization) at the receiving end to reconstruct the original structured data.

If you “shake” two or more such serialized sequences together, you almost always get corrupted data, since the “syntax” of a serialization language is rather strict, and a checksum is usually used to actively detect bit rot and reject it without tolerance.
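A small sketch of that point (JSON plus a CRC32 checksum standing in for any strict serialization format; the profiles are invented):

```python
import json
import zlib

def serialize(profile: dict) -> bytes:
    """Encode a structured record as bytes, with a CRC32 checksum appended."""
    payload = json.dumps(profile).encode("utf-8")
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def deserialize(blob: bytes) -> dict:
    """Reject anything whose checksum doesn't match - no tolerance."""
    payload, checksum = blob[:-4], int.from_bytes(blob[-4:], "big")
    if zlib.crc32(payload) != checksum:
        raise ValueError("corrupted: checksum mismatch")
    return json.loads(payload)

a = serialize({"name": "Ada", "birthday": "1815-12-10"})
b = serialize({"name": "Bob", "favorites": {"fruit": ["apple"]}})

# "Shaking" two serialized sequences together: splice half of one into the other.
shaken = a[: len(a) // 2] + b[len(b) // 2:]
print(deserialize(a))        # reconstructs Ada's profile
print(deserialize(shaken))   # raises ValueError: corrupted: checksum mismatch
```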

I would imagine the sequential nature of language (syntax/grammar-wise) works similarly, though with a much greater degree of tolerance. In any case, the semantics conveyed should be structured, i.e. not sequential. Numenta’s “location + feature pairs” encoding of object identity seems a reasonable design, if quite primitive, yet I believe there are many more schemas waiting to be deciphered for human comprehension.

Even granting some “energy surface clustering”, I still feel it would be very hard to comprehend even if presented to me: in what way would you suggest the underlying structure of language (structural semantics?) becomes understandable/reasonable? Maybe as some cause-effect pairs?

I think SDRs can be the “physical” representation within one individual brain, but obviously that schema can’t be replicated across brains (or even across CCs, I guess). Memes are, intuitively, the things that get replicated, but it stops at that brilliant concept - still not much practical utility.

1 Like

Thank you for sharing such interesting thoughts and concepts. I read the Brazilian paper and understood some of the gist around it. Would love to hear more about your experiment based on that paper if you could. I’m curious how you create the structure of the graph based on language sequences. I assume the vertices are individual words and if that’s the case then the edges should not be of one type, like the examples in the Brazilian paper, but should have multiple types. For example, one edge type could represent location relative to a specific word/vertex (e.g. some word X is separated by some number of words to the left (or right) of the word Y). And if you’ve tried to vary the inhibition signal, did you see hierarchical groupings form in a single sentence similar to how a linguist would draw a hierarchical compositionality diagram of a sentence?

1 Like

So, a “feeling of being-there” - not even “images”, let alone “words”, I’d say.

English is my second language, and I’m fluent in a few programming languages too (including ones I designed for my work), and I can say I feel great consonance with that “being-there”.

And yet I’m not satisfied with any Cartesian Theater style theory about the “feeling-there” thing, which is why I’m motivated to reverse-engineer the brain for inspiration in the solution design of computer-based systems.

I’m not a researcher myself, but my job is to design the system architecture of my company’s quant trading business. Python’s metaprogramming facilities, and later a home-brewed programming language built on Haskell, have brought us quite far. The next bottleneck I see is the impedance between business-level concerns and any “computer programming” language, so I’m finding my way toward a language workbench, with which the bridge between business/design and computer/implementation can be worked on professionally - i.e. the vocabulary/grammar of a “business programming” language executable by computers can be designed and iterated on as development proceeds, version-controlled where necessary, quite like an internal open-source project.

The “design” approach comes even after “learning”: if some mechanism can be learnt, it can then be leveraged for designs with purposeful goals, but before that, the learnt “thing” has to be well understood. State-of-the-art machine-learning approaches seem to leave even the “understanding” job to the machines, but so far my scenarios seem to require human intelligence for that part, for practical feasibility to date.

2 Likes

I think the issue here might be how to interpret what I mean by “shaking” a network of observed language sequences, complyue. I haven’t been clear.

To attempt to clarify, I’m not thinking of “shaking” two encodings together. I’m thinking of “shaking” more in the sense expressed in the Brazilian paper I cited. They “shake” a network to reveal which “communities” of nodes within the network are more closely connected to each other. Density of connections defining a “community”. And also determining which groupings will tend to synchronize their oscillations.

My proposal for finding a dynamically changing language structure would be much on those lines. Only implemented on a network of observed language sequences. So the “community” which would be revealed would be sub-sequences of language which tend to share context/prediction in the sequence.
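If it helps to see a static analogue: here is a toy sketch that substitutes ordinary modularity-based community detection for the dynamical “shake” (networkx assumed available; the corpus and graph construction are invented placeholders - the actual proposal relies on oscillation and synchronization instead).

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

text = ("the cat sat on the mat . the dog sat on the rug . "
        "a cat chased a mouse . a dog chased a ball .")
words = text.split()

G = nx.Graph()
G.add_edges_from(zip(words, words[1:]))   # network of observed language sequences

# Densely interconnected word groups ~ the "communities" that would tend to
# synchronize their oscillations when the network is "shaken".
for community in greedy_modularity_communities(G):
    print(sorted(community))
```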

I’ve started a new topic, so we can continue there or here, as you like.

Does what I suggest by “shaking” make more sense in the context of the Brazilian paper?

2 Likes

There’s a short video clip of the oscillations I get, and a basic early raster plot, in a presentation of an earlier paper I made at the AGI-21 conference. I might post that in the new thread I’ve created, to try and move the bulk of discussion there.

Currently the graph is defined with extreme simplicity: just one kind of edge.

But you are right. There should be some kind of distinction so that relationships at a distance can be properly represented.

I think the way to do this will be in the spirit of HTM by representing the nodes themselves as SDRs. I haven’t implemented this in my current experiment. I need to look back at the notes I was making when I was looking at this in the HTM context in 2016. As I recall, using an SDR means that the connectivity is more dispersed, and sequence at a distance is also gracefully represented.
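A minimal sketch of the SDR idea (the bit counts, sparsity, and word-to-bits mapping are arbitrary placeholders), just to show how overlap gives a dispersed representation that can tolerate sequence at a distance:

```python
import random

N, ACTIVE = 2048, 40                          # SDR size and number of active bits

def sdr(word):
    rng = random.Random(word)                 # deterministic bits per word
    return frozenset(rng.sample(range(N), ACTIVE))

def overlap(a, b):
    return len(a & b)

context = sdr("the") | sdr("cat") | sdr("sat")    # union over a recent sub-sequence
print(overlap(context, sdr("cat")))               # high: "cat" is in the context
print(overlap(context, sdr("rug")))               # near zero: chance overlap only
```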

This is a good point, and should be important.

The raster plot I have available in the neurosimulator I used (Charles Simon’s Brain Simulator II) is too coarse to tell. To move forward I would need to modify Charles’ BrainSim II code so it can display the relationships which interest me more clearly. Currently BrainSim II raster plots are not built to display thousands of words, and it all smudges together. So you can’t see what the words are, let alone how the display might be rearranged to hierarchically cluster them.

The exact way to extract any hierarchy would likely reveal itself when modifying the raster plot display code.

I believe such a hierarchical structure must be there, because it makes sense different sub-sequences of words would synchronize more closely, depending on the commonality of contexts/predictions they occur in.

In terms of a raster plot, I think we would find that common context/prediction “pushes” together the spike times of words connected in sequence. So revealing that as a hierarchical structure may be as simple as breaking out a separate hierarchy “tree” for each word, based on how close their spikes are on the raster plot.
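For what it’s worth, a rough sketch of that last step using standard hierarchical clustering over invented spike times (scipy assumed available; the words and numbers are placeholders):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

words = ["the", "cat", "sat", "dog", "ran"]
spike_times = np.array([[10.0], [12.0], [13.0], [41.0], [43.0]])  # hypothetical ms

Z = linkage(spike_times, method="average")     # distance = spike-time difference
tree = dendrogram(Z, labels=words, no_plot=True)
print(tree["ivl"])   # leaf order groups co-firing words, e.g. cat/sat vs dog/ran
```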

I plan to post my AGI-21 paper presentation in the new thread. In the AGI-21 paper I talked about earlier experiments (2000!!) which clustered based on… basically overlaps between vectors. So at that time it was a vector representation, not a full network. And that did generate nice hierarchies, which corresponded well to human intuitive clusterings in treebanks.

3 Likes

A “feeling of being-there”. Yes, I agree. Even bodily sensations. Images might be part of that “being there”, depending on the exact problem you are thinking about, and the experiences you are “playing” with, to use Einstein’s words.

Einstein, I imagine, was focusing particularly on vision, as he imagined ideas around what things would look like if he were “riding on a beam of light”, which I believe were what led to his famous insights. But I believe he says the experience of walking home in his Swiss village, and hearing clock towers ringing the hour (possibly at slightly different times? Or the comment of a friend about that?) was another inspiration.

Generally I would say you are right, yes, thinking is below language, and on a level of a “feeling of being there”, or combinations of same. Language is just a separate “dance” on top of that, which connects to it, and can prompt it. For your native language it is perhaps so closely connected to it that you can feel they are inseparable. And our language facility can become so good that it is hard to stop language appropriate to a given “feeling of being there” from appearing spontaneously. But it is a dance on top, and separate.

For instance, in Japanese, people typically say “itadakimasu” before eating. There is no equivalent in English. Bon appetit is not the same thing - that’s a wish directed at another person; the Japanese is more an expression of gratitude for yourself. But having learned this, and lived it, depending on how recently I’ve been speaking Japanese, I may spontaneously feel the need to say something! The urge to say something can be great! And there is nothing to say in English! But the feeling of starting to eat is the basic meaning. “Itadakimasu” just appears spontaneously on top of that.

If I understand what you are saying here, then, yes, I agree there is something about “understanding” which is more than just having something happen, such as an appropriate prediction.

I think part of that “understanding” may be the structure which generates a prediction. Transformers do not reveal this structure. I believe that is because transformers are essentially “hiding” an underlying chaos in their billions of “parameters”. If the underlying structure of “understanding” is chaotic, then that will be why we have not been able to find it up to now. It will be why GOFAI failed. And the current success of transformers will be because they effectively stuff all structure into a “black box” and parameterize the problem purely in terms of inputs and outputs, “transformations”.

So, transformers succeed in modeling chaos in practice (to a finite limit) by hiding it in a black box.

The full solution will be to allow the system to express the chaos. This should allow us to display explicit structures (just that they change all the time.) And those explicit structures may be closer to “understanding”, as we feel it intuitively.

3 Likes

This was precisely the reason I posted this thread to begin with. I have not yet had sufficient time to study transformers in such depth that I can assess their full implications and understand the paradigm fully. But after my first deep dive into the matter, I had this intuitive feeling as well.

2 Likes

GPT-4

2 Likes

The power of “attention control” and “attention management” is proving to be an incredibly valuable component of any level of cognition. I am aware that with transformers like ChatGPT, GPT-3 and GPT-4 we are very far from the biological examples that evolution has brought about. But in the search for better theoretical computational models that replicate our natural brains and cortical organization, we need to keep approaching this target model from both sides. Neuroscience provides the elements and hardware, the properties of that hardware, and its networking, which we need to understand. And the artificial models like GPT-4 and other transformers provide us with a set of computational components and principles that have certain proven performance, and we need to search for some of these within our neuroscience.

Of course, in our neuroscience approach we will find and develop a very different architecture and organizational principles, like HTM, SDRs, and TBT. These are very powerful and correct, while incomplete in some areas. Our search for completeness is where these principles, taken with care to distinguish their specific value, should provide help. This of course applies in both directions. But for this community, which (including myself) strives for a real breakthrough in the computational modelling of real, biological brains, we need to observe the takeaways that arise in what I call the artificial research areas. The bridge between artificial cognitive modelling and natural cognitive modelling helps both sides advance. And we have evolution on our side as a key companion.

One of these new takeaways from transformers is the concept of recursive attentional mechanisms - in other words, attention focused on the results provided by the same attentional filtering and selection process. Perhaps we can search for this in our neuroscience and incorporate this principle. It seems very likely to be a principle also exploited by nature. We need to expand our “hypothesis” and see whether it leads to an expanded theory, also supported by neuroscientific evidence.

2 Likes

I have adapted my HTM implementation to incorporate some of these ideas from our friends in the machine learning community, specifically certain transformer logic. So far it is looking very promising.

I call my current version “Degenerate HTM”; basically it means creating degenerate SDRs. It will take some months until results are in on this endeavour.

3 Likes

Good to see that the main question posed in this thread has been answered with actions on Numenta’s part. The adoption of transformers and LLMs within the highly advanced neuroscientific framework created by Numenta has taken place. The following link answers this point very explicitly. I do not yet know the details that differentiate Numenta’s LLMs and transformers from others based on Geoffrey Hinton’s framework, but I see plenty of potential in the HTM, TBT and SDR context of Numenta’s heritage to bring higher-paradigm performance, as opposed to the usual higher brute-force performance.

1 Like