"How Your Brain Organizes Information" video

roboto · April 4, 2023, 8:37am

Highly recommended and accessible short video on “How Your Brain Organizes Information”. It touches on things such as grid/place/splitter/etc. cells in the hippocampus and entorhinal cortex, as well as cognitive maps, latent space, graphs, hippocampal remapping, etc.

robf · April 4, 2023, 1:09pm

Nice presentation.

The issue which interests me is how you factorize perception (and actually whether you always can, because for language you can’t).

This bit:

Factorization into structural and sensory components:

This video claims there is evidence you can. But the evidence for this is not extensively provided. I’m guessing “grandmother cells”?

They mentioned a second part. Is that forthcoming?

roboto · April 4, 2023, 3:40pm

There should be a second part to this in the future. Judging from his posting frequency, it should be out in about a month or so.

Yeah, it’s not easy to imagine what factorizing a language or other abstractions would look like. I’m of the opinion that languages can be learned and factorized. I’ve seen attempts to model language as a graph, and coincidentally factorization in the brain leads to graphs according to the video. I think trying to represent language as a graph of any sort should involve learning a meaningful representation of the edges, for example the connections among words. Generic/unlearned edges should lead to a low-quality, low-information graph.

robf · April 4, 2023, 6:35pm

Exactly the problem I’ve been talking about with all my posts. The belief there will be factors is hard to get past. We see factors. Objects. That’s all we see.

It is only when you try to nail them down computationally that things start to come apart.

I believe the inconsistency of the classes/“edges” you find is what leads to parameter blow out with large language models. That’s the reason they’re “large”. And, together with the fact that they hide grammar, why they have done so much better than previous attempts to “learn” grammar.

If you look for it you can trace a history of failure to find consistent grammar/“edges” for language, e.g. in this earlier thread:

Or in this later thread:

In that last post I discussed an extension to a failure to find objective categories more broadly in philosophy:

Chaos/reservoir computing and sequential cognitive models like HTM

In philosophy, closest to grounding in the physical, might be Thomas Kuhn:

Structure of Scientific Revolutions, p. 192 (Postscript)
“When I speak of knowledge embedded in shared exemplars, I am not referring to a mode of knowing that is less systematic or less analyzable than knowledge embedded in rules, laws, or criteria of identification. Instead I have in mind a manner of knowing which is misconstrued if reconstructed in terms of rules that are first abstracted from exemplars and thereafter function in their stead.”

Though Wittgenstein comes close, shifting to a basis for meaning in “games” later in his life. Quoted by Kuhn here:

Thomas Kuhn, The Structure of Scientific Revolutions, pp. 44-45:
(Quoting Ludwig Wittgenstein, Philosophical Investigations, trans. G. E. M. Anscombe, pp 31-36.)

'“What need we know, Wittgenstein asked, in order that we apply terms like ‘chair’, or ‘leaf’, or ‘game’ unequivocally and without provoking argument?”

‘That question is very old and has generally been answered by saying that we must know, consciously or intuitively, what a chair, or a leaf, or game is. We must, that is, grasp some set of attributes that all games and only games have in common. Wittgenstein, however, concluded that, given the way we use language and the sort of world to which we apply it, there need be no such set of characteristics. Though a discussion of some of the attributes shared by a number of games or chairs or leaves often helps us learn how to employ the corresponding term, there is no set of characteristics that is simultaneously applicable to all members of the class and to them alone. …’

In philosophy you can find it all over the place. Even H. G. Wells!

“…My opening scepticism is essentially a doubt of the objective reality of classification.”

A Modern Utopia by H. G. Wells 1905

I can go on and on along the philosophy thread of this! As I say, after I noticed this was what was happening when I tried to learn grammar, it started popping up all over the place.

Just recently I found another nice discussion in a pure graph context:

The Many Truths of Community Detection
http://netplexity.org/?p=1261

‘It all comes down to the fact that we have mathematical ways to quantify the difference between community assignments but defining what we mean by “the best” clustering is impossible.’

It’s still what’s holding us up in AI. We believe the objects we divide the world into, are reality. They are all we can see. When in reality it seems the only way to build them computationally is as ever so many subjective constructions.

But seen from another perspective, this is actually a good thing:

roboto · April 5, 2023, 6:02am

I read some of your previous posts in other threads but many of them went way over my head. My IQ’s not high enough to allow me to agree with one philosophical school over another especially if I have no initial strong beliefs that I can relate to.

I think besides making it more accurate, the other main reason for LLMs’ parameter blow out is what Ben Goertzel mentioned is lacking, which is memory. So it needs to take in more and more tokens with every upgrade to try and cover up that weakness. I think GPT4 can take in 8k tokens (or words?) at a time. That can’t even take in a whole novel of longer length. After reading partway through a novel it’s not gonna remember anything near the beginning. But much research on transformers is trying to augment them with memory, so I won’t bet against transformers at this juncture.

That totally makes sense but I think there’s a good chance that the structure of language has been honed over time for effective communication such that its structure won’t allow for too many of these ambiguous clusterings in the probabilistic sense. In my opinion there’s a way to somehow describe the structure of language in a statistical manner and it’s just that maybe we haven’t found it yet.

DanML · April 11, 2023, 9:21am

That idea is supported by Christiansen/Chater (below) who discuss the shaping of language over time to remove such interferences:
https://www.penguin.co.uk/books/441689/the-language-game-by-morten-h-christiansen-and--nick-chater/9781787633483
I’d recommend this book anyway, since it highlights many other significant dynamic language features which are often overlooked.

robf · April 11, 2023, 10:43am

Exactly what idea are you saying is supported by Christiansen/Chater, @DanML?

In the Penguin summary of The Language Game you link to they say:

“Upending centuries of scholarship (including, most recently, Chomsky and Pinker) The Language Game shows how people learn to talk not by acquiring fixed meanings and rules, but by picking up, reusing, and recombining countless linguistic fragments in novel ways.”

Compare that with my statement that “…the only way to build them computationally is as ever so many subjective constructions.”

I don’t know how you’re understanding @roboto’s statement. To me he was just confusing novelty with ambiguity. He doesn’t like the idea that language structure will be forever novel. He’s uncomfortable with it. He can see how contradictory structure leading to novelty might make sense in the network context. But it feels wrong to him. He falls back on the conviction there will be some statistically stable structure. He says it totally makes sense there are multiple ways to organize a network, but he doesn’t like it. He is “sure” there is some statistically stable structure. Not for any reason. He’s just sure.

And indeed that certainty of expectation has been the history. E.g. Chomsky and Pinker. Just what The Language Game contrasts itself with. Chomsky was sure too. He was so sure that when it became evident that rules derived from observation contradict, he insisted there must be innate rules. Why innate? Because rules derived from observation contradict!

(That’s what broke linguistics as a discipline of structure learned from observation in the 1950s. Before Chomsky, American Structuralism was all about language structure learned from observation. After Chomsky, linguistics split into Chomsky’s school insisting there was statistically stable structure, but that it must be innate, and other schools which shifted more to the humanities.

That’s why theoretical linguistics makes no contribution to machine learning.)

@roboto’s statement seems to me to be the very contradiction of the summary of The Language Game. And the summary of The Language Game seems to me to support my assertion that language can only be modeled as “ever so many subjective constructions”.

The Language Game: “people learn to talk not by acquiring fixed meanings and rules, but by picking up, reusing, and recombining countless linguistic fragments in novel ways.”

Me: “the only way to build them computationally is as ever so many subjective constructions”.

Without having read The Language Game, I can’t say how far they travel with me in these “novel ways” of “recombining countless linguistic fragments”. I can’t say if the “novel ways” extend to chaotic attractors.

Perhaps they go into ways the subjectivity is resolved in context.

But the core takeaway must surely be agreement that language structures itself by “recombining countless linguistic fragments in novel ways”. Note “novel ways”. Novel, implying they are unable to be learned.

And hence current transformers will be forever listing novelty, and suffer parameter blow-out.

DanML · April 11, 2023, 2:40pm

Simply that they show evidence that language frequently hits interference points, both with the sounds generated and with symbol/word usage. Over time this is resolved by using more distinct sounds and symbols, i.e. increasing the difference/contrast.

More strongly they argue that all language is constantly changing, and is an extension of non-verbal communication (rather than a serial computer protocol).
Written language is not what is spoken in practice. Phrases effectively become symbols and shorten, words join together, sounds mutate to be easier to say. Treating words as objects can miss the continuous drift process that is occurring in thousands of regions every day. You could call these dialects, but that implies a true base to work from, which they argue does not really exist.

I think that is a fair summary.

You’d need to model a layer below (interaction? ToM modelling of other?) to get a grasp on the core model of language.
ChatGPT has crystalized something we recognize as an ‘output’ from a similar process.

robf · April 11, 2023, 3:45pm

I don’t know what you mean by a “similar” process.

I think ChatGPT and friends are dealing with the novelty by getting LARGE. Which might work to a point, but it is hardly the same thing. LARGE might look like novelty for a while. But it’s not.

ToM modelling (theory of meaning?) sounds like what one of the fractured branches of linguistics chased after Chomsky. One branch fractured off searching for abstracted (statistical) structure in “meaning”.

But why are you accepting novelty, and then going straight off and saying there must be another “layer below” to provide a “core model of language”? It’s like you’re saying, “Oh, yeah, good idea, language structure is always novel, but I still want to find some abstraction of it that’s not novel”! Why does there have to be some “layer below” “core model”? Why can’t it just be “recombining countless linguistic fragments in novel ways”? Bound by the body of examples, so not completely unconstrained. But not bound by any abstraction beyond a body of examples.

DanML · April 11, 2023, 4:10pm

It generates language mostly indistinguishable from another human - that means the process output is similar.

Sorry - too many acronyms. Theory of Mind. Agent/agent modelling.

Fair point.
Reductionist thinking based on years of training? Probably just because it’s hard to follow without an analogy.

robf · April 11, 2023, 7:16pm

Well you wouldn’t be the first. As I say, linguistics was shattered. Chomsky went off searching for a “layer below” in our biology. Four or five generations of theory never found it. A school split from him (acrimoniously) that centered around George Lakoff and some others who thought a “layer below” could be found in “meaning” primitives (cognitivism). And they joined another, earlier school that had split off, which thought a “layer below” could be found in “function”. They all failed to find any kind of stable “layer below”, and the cognitive and functional schools drifted to the humanities and became preoccupied with social abuses of subjective meaning and function (broadly speaking, e.g. Lakoff: Women, Fire, and Dangerous Things).

You do need some kind of anchor. But I think an anchor in predictive value is enough. Group things according to the way they predict the next element in the string, and the groupings don’t need to be knowable. That “unknowable” groupings are effective at predicting the next element in the string is enough (actually you can know them, it’s just that those knowable groupings can contradict…).

roboto · April 12, 2023, 10:41am

@robf

I did put a disclaimer that it was an opinion

Anyways, the idea that language is unlearnable is a bit confusing. If I’m not mistaken, you were exploring the idea of creating a graph of words from a corpus and using that to invoke some sort of oscillatory property. Unless there is learning of some sort involved, the edges connecting the nodes must be generic and stay unchanged over time (time here means the learning period, not intergenerational time where word meanings can drift); otherwise that’s a form of learning. I think you meant that language has some learnable properties and some unlearnable ones, like how you said there must be some anchors of some sort?

Could you elaborate more or share some simple examples to illustrate the above point?

Also, by novelty did you mean it like how one creates new metaphors or jokes? I’m not sure I have the same idea about “novelty” as you but in my mind what I think you meant by ‘novelty’ could be argued against with the following idea: “There is a set of made up words and a simple set of combination rules such that an infinite number of unique sentences can be created.”?

And by contradictory, did you mean something like how the word “forest” “refers to” / “should be grouped with” / “attends to” the man or the dragon in the following sentence: “Joe is hunting the dragon in the forest.” In my mind the forest can refer to Joe, the dragon or both. It implies both of them are in the forest, but this is less certain for the dragon, as it may be kidnapping princesses in the capital city. Also, “the dragon in the forest” may be a nickname given to the dragon, in which case the sentence carries no information about where the hunt takes place. Is this what you meant by contradictory groupings, or is it something different? IMO transformers are really great at these kinds of things, as in taking into account multiple possibilities at once with their trillions of parameters.

robf · April 12, 2023, 12:48pm

Ha. Sorry if I seemed to come down hard on you with that. I mostly wanted to draw a sharp contrast with what @DanML was saying about the book he was recommending. Your feeling that there must be some system is not unusual. Dan said he had it too. As I said, the entire history of linguistics has been a search for that system. The same search, only distinguished since Chomsky, by different reactions to Chomsky’s strong assertion that it can’t be learned.

So you’re by no means alone.

All of ML is looking for it too. The ML community ignores formal linguistics. With good reason. Because if they asked, formal linguistics would say it can’t be learned. So the ML community too ignores the history and works on pure conviction that something learnable must be there.

Since linguistics had its big contradictory learned structure shock though, we’ve become more familiar with complex systems, chaos, and the idea that some things are only completely described by themselves. So we should be more open to that explanation for the linguistics history. But chaos is not a big thing in science even now. I guess because science generally has to limit itself to systems which can be abstracted. Otherwise what’s the point. It’s no use saying we can’t know! So even now that we appreciate systems of this sort exist, that understanding has not proven especially useful in practice.

Stephen Wolfram was promoting a “New Kind of Science” on these lines for a bit. With the idea that some things are unknowable, so we must get used to the idea of unknowability. In practice I don’t think it’s gone far.

That’s actually where AI has an advantage. Because with cognition the idea that a system might be unknowable doesn’t matter so much. With the weather it’s a problem. With the weather you want to know the actual weather. Chaos puts a hard limit on the accuracy of weather forecasts. But with AI we’re a bit better off. With AI it’s enough to make another system of the same type.

Yes, the word “learn” is a problem here. That’s one reason I usually put it in quotes.

Obviously we all “learn” a language. We’re using one now. It’s more a question of degrees of abstraction.

I’m proposing that language can’t be abstracted more than any set of examples of its use. The examples can be “learned”. In the sense they can be recorded, and even coded into the connections of a graph. But you can’t abstract it further than that.

It’s the same problem as that other example I pulled up:

It doesn’t mean you can’t have the graph. It just means you can only have the graph. Nothing smaller will do.

The “anchors” become the kind of clustering you decide is “the best” for your problem.

I think that network example might illustrate the point. Let me know if it doesn’t.

Beyond the idea that different clusters might be relevant to different situations, you are only left with the problem of how exactly you want to cluster them.
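To make the “many truths” point concrete, here is a minimal sketch, not taken from the linked article. It assumes a made-up toy network in which people are linked if they share a team or a city (both attributes are hypothetical), and it scores two different partitions with the standard Newman modularity measure. By construction, the two partitions come out equally good, so the score alone cannot say which one is “the best”.

```python
from itertools import combinations

# Hypothetical toy network: each node is (team, city, person_index).
# Two people are linked if they share a team OR share a city.
nodes = [(team, city, k) for team in (0, 1) for city in (0, 1) for k in (0, 1)]
edges = {
    frozenset((a, b))
    for a, b in combinations(nodes, 2)
    if a[0] == b[0] or a[1] == b[1]
}

def modularity(nodes, edges, community_of):
    """Newman modularity: sum over communities of L_c/m - (d_c/(2m))^2."""
    m = len(edges)
    degree = {n: 0 for n in nodes}
    for e in edges:
        for n in e:
            degree[n] += 1
    degree_sum = {}   # total degree per community
    internal = {}     # number of within-community edges
    for n in nodes:
        c = community_of(n)
        degree_sum[c] = degree_sum.get(c, 0) + degree[n]
    for e in edges:
        a, b = tuple(e)
        if community_of(a) == community_of(b):
            c = community_of(a)
            internal[c] = internal.get(c, 0) + 1
    return sum(
        internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

q_team = modularity(nodes, edges, lambda n: n[0])  # cluster by team
q_city = modularity(nodes, edges, lambda n: n[1])  # cluster by city
print("modularity by team:", q_team)
print("modularity by city:", q_city)
```

Which partition is the useful one depends on the question being asked of the network, not on the graph itself.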

For language I just think the relevant way to cluster networks of language sequences, is according to clusters which best predict the next element in a sequence.

And to do that I conjecture, it may be enough to set the network oscillating by driving it on that sequence, and see how it clusters about it.

It would be similar to the way transformers work. They also work by clustering sequences according to how well they predict the next element in a sequence. With particular sequences selected by “prompts”.

The main difference would be that transformers assume those clusters will be stable in some way, and that it will therefore make sense to learn them once, and use those learned clusters repeatedly.

I just suggest that the fact doing so generates… trillions (is it now?) of parameters, might actually be suggesting the clusters are better seen as subjective, generatively novel, and contradictory, similar to the clusters in the quoted article.

By novelty I mean new clusters all the time. It can be new metaphors and jokes. Or it could just be a new sentence. Or more specifically, clusters about those new sentences, “across” them in a sense, which predict them.

The refutation of that would indeed be, as you say, “a simple set of combination rules such that an infinite number of unique sentences can be created”. But no such set has ever been found. And I say the history of contradiction found in theoretical linguistics when they attempted to find such rules, indicates that it cannot be found.

If it could be found, you’d be right. I say the evidence indicates it can’t be found. The history of linguistics, and trillions of transformer “parameters”, back me up.

I’m not sure. There is relevance in the multiple groupings which are possible about the different interpretations of such a sentence.

Classic examples I struggled with when I was working on machine translation systems addressing these problems were:

“He saw the girl with the telescope”
or “He hit the nail with the hammer”.

A favourite was:

“I buy the paper to read the news”
“I teach the boy to read the news”

That one wasn’t just an ambiguous association, the entire interpretation of the sentence changes from “reading the news” being the goal of an action, to being the action taught.

Or examples like “garden path” sentences, where the grouping changes in mid-phrase:

“The horse raced past the barn fell”

But such ambiguities are as you say, resolved in context.

I’m more interested in the groupings which form among the network of examples to resolve these ambiguities.

I guess it might be the same. These examples illustrate the ambiguities. And they are resolved in context. But they don’t illustrate how they are resolved. So I’m interested in the same examples. But I’m interested in different groupings within a network of examples, which flip between one interpretation or another with the different interpretation of the elements in the sentence, but which are not visible in different groupings of the surface forms of the examples themselves.

So, if you imagine sets/clusters of similar words projecting up and down around the words in these ambiguous examples. Not along the strings of the examples themselves, but out, up and down, on top of and below the strings of the examples.

Something like:

I teach the boy to read the paper
                   eat his supper
                   make his bed
                   do his homework
                   ...

I’m interested in how these clusters might form and reform, as different ambiguous groupings of the surface sentence are resolved.

It is these clusters which resolve the surface ambiguity.
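A rough sketch of what such a “vertical” cluster could look like in code. Everything here is an assumption for illustration: the tiny example corpus is made up from the sentences in this post, and `continuations` is a hypothetical helper, not part of any proposed system. It simply collects what follows a given frame across a body of examples, so the cluster is read off the examples rather than learned in advance.

```python
# Made-up body of examples, built from the sentences discussed above.
examples = [
    "i teach the boy to read the paper",
    "i teach the boy to eat his supper",
    "i teach the boy to make his bed",
    "i teach the boy to do his homework",
    "i buy the paper to read the news",
]

def continuations(examples, frame):
    """Return everything that follows `frame` in any example sentence."""
    frame_toks = frame.split()
    width = len(frame_toks)
    found = []
    for sentence in examples:
        toks = sentence.split()
        for i in range(len(toks) - width + 1):
            if toks[i:i + width] == frame_toks:
                found.append(" ".join(toks[i + width:]))
    return found

# The cluster projecting "up and down" from the slot after each frame.
taught = continuations(examples, "teach the boy to")
bought = continuations(examples, "buy the paper to")
print(taught)
print(bought)
```

The same surface string “to read the …” lands in a different cluster depending on the frame it is matched against, which is the flip between “action taught” and “goal of an action”.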

I agree. They do it well. All this particularity is necessary. And transformers capture more particularity than anything else up to now.

But it raises the question: what are trillions of parameters telling us about our conception of the problem?

To treat all this particularity as an abstraction seems back-to-front to me. It’s using the techniques of finding abstraction, when actually it seems there is no abstraction. Everything is particularity. They are just enumerating detail. The idea of “learning”, comes from the time when we thought there would be a small number of abstractions which would suffice. But there’s not. These things are just learning trillions of particular examples.

I’m arguing with the idea, held over from the time we were “sure” there were complete abstractions, that it is both more efficient, and can actually be complete, to try to “learn” (in the sense of abstract) all these particular clusterings, before we meet them.

So, I’m arguing for a solution much like transformers. But dynamic. Finding the predictive groupings from the dynamics of the network. So it is possible to find all of an infinite number of often contradictory groupings. Instead of trying to “learn” them all, and inevitably falling short.

roboto · April 12, 2023, 4:20pm

Thanks for taking the time to clarify at length!

I agree very much with the above but I don’t get what you mean by “transformers assume those clusters will be stable”. Transformers take thousands of words into context when “forming best clusters”. I even think that transformers are replicating deep/slow thinking limited only by its context window size.

There was research suggesting that Transformers are Hopfield networks after some slight modifications, if the network is allowed to oscillate at each layer of the transformer. Not sure if this is fundamentally different from your approach, but there’s an oscillatory component to it. Youtube link: https://www.youtube.com/watch?v=nv6oFDp6rNQ

Darn, I was so sure there could be one, even if it meant an abstract language of some sort, like a programming language. I mean, in a cheaty kind of way I can make an infinite number of unique sentences just by incrementing the number by one in this sentence: “I want to have 1 piece of apple.” But that’s beside the point and irrelevant now that I know more about what you meant.

Are you basing the above opinion more on the cluster formation side of things based on the article you linked above titled “The Many Truths of Community Detection”? Or is it more about how it’s impossible to learn how to choose best clusters based on context?

robf · April 12, 2023, 8:35pm

roboto:

robf:

For language I just think the relevant way to cluster networks of language sequences, is according to clusters which best predict the next element in a sequence.

And to do that I conjecture, it may be enough to set the network oscillating by driving it on that sequence, and see how it clusters about it.

It would be similar to the way transformers work. They also work by clustering sequences according to how well they predict the next element in a sequence. With particular sequences selected by “prompts”.

The main difference would be that transformers assume those clusters will be stable in some way, and that it will therefore make sense to learn them once, and use those learned clusters repeatedly.

I agree very much with the above but I don’t get what you mean by “transformers assume those clusters will be stable”. Transformers take thousands of words into context when “forming best clusters”.

Can you restate exactly which bits you agree with. Because to me that sounds like you saying, “I agree clusters won’t be stable, but don’t get what you mean by saying they won’t be stable”!!

Was it just the prediction of sequences using clusters bit that you agreed with?

It seems clear to me that because they start from a paradigm of “learning”, transformers, and likely Hopfield nets, will try to find clusters which predict better by following prediction “energy” gradients to try and find minima. That’s great, but once you find such a minimum, that is your cluster. It won’t change. You will have trained your network with weights which embody it.

That to me is a “stable” cluster. And seeking a maximally predicting cluster in that way, seems to me to assume the cluster will be stable. Otherwise you wouldn’t try to embody it in network weights.

You might say the network learns many different clusters dependent on context (context = “attention”.) And those many different clusters might be thought of as a form of “instability”. Is that what you meant by “take thousands of words into context”? But however many such clusters you learn, even if they are contradictory (distinguished only by “attention” context?), their number will be fixed after the learning process.

Having a fixed number after learning is just an artifact of a paradigm of learning which seeks to gradually adjust weights, and follow energy gradients to prediction minima.

That “learning” paradigm has the advantage that you don’t have to assume as much about the system to be learned. You can try all sorts of things, and then just follow your prediction energy gradient down to a minimum.

So it has the advantage of being dependent on fewer assumptions about what is to be learned.

But it has the disadvantage of being dependent on the idea that weights will gradually converge, and then remain static. You start from the slope, and trace it back. So the slope must be fixed (and “smooth”/“linear”?)

By contrast, I think the prediction energy “slope” flips back and forth discontinuously. That means you can’t trace back along it. (Is the ability to “trace back along it” an assumption of “linearity” in back-prop?)

So how do you find an energy-minimum predictive state in a network which flips from one maximally predictive cluster to another, discontinuously, as context changes?

Fortunately, the tracing of slopes smoothly down to minima need not be the only way to find clusters which maximally predict in a sequence network. If maximally predicting means having a lot of shared connections, such clusters will also tend to synchronize any oscillations. In any case, synchronized oscillations are another kind of minimized energy state of a network. And in a sequence network, a synchronized cluster will surely be a minimized energy state of sequence.
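As a rough illustration of the conjecture that densely connected clusters tend to synchronize, here is a minimal sketch using a Kuramoto-style phase-oscillator model. The Kuramoto model, the toy graph, the coupling weights, the frequencies and the starting phases are all assumptions made for illustration. None of them come from this thread, and this is not a claim about how a real sequence network would behave.

```python
import math

def simulate(adj, omega, theta0, dt=0.01, steps=2000):
    """Euler-integrate d(theta_i)/dt = omega_i + sum_j adj[i][j]*sin(theta_j - theta_i)."""
    n = len(theta0)
    theta = list(theta0)
    history = []
    for _ in range(steps):
        rate = [
            omega[i] + sum(adj[i][j] * math.sin(theta[j] - theta[i]) for j in range(n))
            for i in range(n)
        ]
        theta = [theta[i] + dt * rate[i] for i in range(n)]
        history.append(theta)
    return history

def coherence(theta, members):
    """Kuramoto order parameter: 1.0 means the members are fully in phase."""
    members = list(members)
    re = sum(math.cos(theta[i]) for i in members) / len(members)
    im = sum(math.sin(theta[i]) for i in members) / len(members)
    return math.hypot(re, im)

# Two densely connected clusters joined by a single weak link (assumed toy graph).
cluster_a, cluster_b = [0, 1, 2, 3], [4, 5, 6, 7]
n = 8
adj = [[0.0] * n for _ in range(n)]
for group in (cluster_a, cluster_b):
    for i in group:
        for j in group:
            if i != j:
                adj[i][j] = 1.0
adj[3][4] = adj[4][3] = 0.05

omega = [1.0] * 4 + [2.0] * 4                       # each cluster has its own rhythm
theta0 = [0.0, 0.4, 0.8, 1.2, 3.0, 3.4, 3.8, 4.2]   # fixed start, no randomness

history = simulate(adj, omega, theta0)
final = history[-1]
coh_a = coherence(final, cluster_a)
coh_b = coherence(final, cluster_b)
late = history[len(history) // 2:]
coh_all = sum(coherence(t, range(n)) for t in late) / len(late)
print("cluster A coherence:", coh_a)
print("cluster B coherence:", coh_b)
print("mean whole-network coherence (late half):", coh_all)
```

In this toy setting each cluster locks into phase internally while the network as a whole does not, so the clusters can be read off from the synchrony rather than from trained weights. Whether that carries over to a network built from real language sequences is exactly the open conjecture.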

I looked at the paper, but couldn’t find any mention of oscillations. Does Kilcher mention oscillations? Can you cue the spot in the video?

I’m not familiar with Hopfield nets. But the paper was just saying they can achieve equivalent forms of learning. Which I don’t doubt.

Do they not also gradually adjust weights, to find minima along smoothly varying, “linear”, energy surfaces?

I’m basing it firstly on what I observed when trying to learn grammars when I first started working on the grammar description problem. It’s fairly simple really. You can learn grammars fine.

You just classify words based on a vector of their contexts. QED… or so it seems…

The only trick is that the vectors contradict! It’s just an element-ordering problem. You order all the elements (contexts) one way, and it gives you a nice (grammatical) “class”. The trouble is, you order them another way, and it gives you another class. Both classes are correct. The word participates in both of them. But not both at the same time. And while other words will belong to those same classes… mostly, none of them will have all the same elements. Some might have some elements, others might have other elements. It all becomes a big hodge-podge, and the only way to capture it all is to order them the way you want, when you want to.
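A minimal sketch of that observation, under loud assumptions: the five-sentence corpus is invented, a “context” is taken to be just the (previous word, next word) pair, and the helper names are hypothetical. It is only meant to show one word landing in two classes that do not otherwise overlap, depending on which context you group by.

```python
from collections import defaultdict

# Invented toy corpus, chosen so that "run" appears in two kinds of context.
corpus = [
    "the run was long",
    "the walk was long",
    "the meal was long",
    "they run every day",
    "they eat every day",
]

def context_sets(sentences):
    """Map each word to the set of (previous word, next word) pairs it occurs in."""
    contexts = defaultdict(set)
    for sentence in sentences:
        toks = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(toks) - 1):
            contexts[toks[i]].add((toks[i - 1], toks[i + 1]))
    return contexts

def words_sharing(contexts, context):
    """The 'class' you get if you group words by one particular context."""
    return sorted(w for w, seen in contexts.items() if context in seen)

contexts = context_sets(corpus)
class_one = words_sharing(contexts, ("the", "was"))      # looks noun-like
class_two = words_sharing(contexts, ("they", "every"))   # looks verb-like
print(class_one)
print(class_two)
print(set(class_one) & set(class_two))
```

Both groupings are valid readings of the same data. “run” belongs to each, but no single class assignment for “run” reproduces both.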

So this is something I observed directly.

It’s only after, that I found other people were noticing similar “problems”. Having a physics background I felt there was an intriguing parallel to uncertainty principles, and even gauge theories of particle formation. And then I found other people noticing the same thing. The first one was Robert Laughlin, “Reinventing Physics from the Bottom Down”. Emphasizing the irreducibility of structure at each level even for physics. I mentioned in another thread the parallel Bob Coecke has drawn between language structure and quantum mechanics… etc. etc. Then I found there seemed to be a parallel in the diagonalization proof of Goedel’s incompleteness theorem (though that’s still conjecture, I haven’t nailed an equivalence down for that.) But the same parallel exists in the same applicability of category theory I found being applied to language structure. Category theory being itself a response to Goedel incompleteness in the maths context.

Oh, and then I found the observation of contradictions in the history of language learning…

The network clustering paper I linked is just the latest in a list of many examples.

But initially it was something I noticed myself, when trying to learn grammar.

Bitking · April 12, 2023, 11:13pm

Have you considered that, in the search for “underlying rules”, part of the problem is that much of the human network training arises from embodiment? Since an AI program does not have this grounding, it is flailing around trying to recreate the portion that is part of a human’s learned experience.

Look to this paper to see that much of semantics are grounded in the parts of the brain associated with how we sense and control the lived experience.

The functions are distributed over many processing centers with significant portions split between the grammar store (motor portion) and object store (sensory portion). The filling in process of speech generation seems to me to be somewhat related to the Hopfield model mentioned above, with various processing areas providing cues and constraints as the utterance is being assembled. I see the cues as a combination of external perception, internal stored precepts, and the current contents of consciousness. The constraints being stored precepts.

If you consider the frontal lobe as a motor program generator driven by subcortical commands, speech is just another goal-oriented motor command sequence, guided by the connectionist feedback from the sensory regions.

As is “thinking.”

Don’t forget that since this is a motor program, the cerebellum will be providing coordination and sequencing between the various components of the motor program.

If you look at this the right way, you may notice some similarity to the major components of the transformer networks and how the brain is organized. The large number of processing areas (about 100 in the human brain) corresponds to multiple attention heads. The “loop of consciousness” roughly corresponds to the transformer deep in/out buffer.

roboto · April 13, 2023, 5:43am

Oops yes that sounds like how you said. But yes everything feels familiar other than the parts related to the clusters being stable.

Absolutely spot on! That’s what I was trying to say about taking thousands of words in context. My opinion on that is it shouldn’t be taking thousands of words into context. A bit of exaggeration. The amount of different contexts made of thousands of words could be a combinatorial nightmare. It could be trying to find general rules of combination consisting of much smaller parts.
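To put a rough number on that combinatorial nightmare, here is a minimal sketch. The vocabulary size and context length are made-up illustrative figures, not taken from any particular model, and `decimal_digits` is just a hypothetical helper name.

```python
import math

def decimal_digits(vocab_size, context_length):
    """Number of decimal digits in vocab_size ** context_length.

    Uses logarithms so the huge integer never has to be printed.
    """
    return math.floor(context_length * math.log10(vocab_size)) + 1

# Assumed, illustrative figures: 50,000 word types, 1,000-word contexts.
VOCAB_SIZE = 50_000
CONTEXT_LENGTH = 1_000

digits = decimal_digits(VOCAB_SIZE, CONTEXT_LENGTH)
print(f"Distinct {CONTEXT_LENGTH}-word sequences: a number with {digits} digits")
```

Most of those sequences are not valid language, of course, but it shows why memorising whole contexts can't be the strategy, and why rules over much smaller parts look more plausible.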

I don’t know enough to answer or understand what you’re claiming. It would be helpful if other people who know more about it could discuss this together. The parts I quoted above could be a point that you can elaborate on and emphasize more when discussing your views with a deep learning proponent.

At 20:01 is one of the parts where he starts to explain the Hopfield network’s multiple update mechanism of iteratively trying to find energy minima. I took this as being somewhat similar to the process of achieving a state of stable oscillation. Stable meaning finding the correct clusters and those clusters not changing. And yes, you’re right, this is similar to a standard transformer. I was trying to point out some similarities but it seems it’s fundamentally different from your idea.

My thinking is that there’s only so many different clusters that can be formed in a sentence. Learning them all shouldn’t be a problem and that could be separate from learning to resolve the many different potential clusterings based on context. But I don’t think this observation is enough to refute your idea.

robf · April 13, 2023, 7:56am

I think it is. That’s what it is trying to do. And they find many such “general rules”. Many, many… And as generalizations they are useful. The system generalizes. And it generalizes in specific ways according to the prompts. Everything becomes about supplying the right prompts to isolate the generalization you want.

But to say they are successful because they have found generalizations “consisting of much smaller parts” I think we can agree is questionable. If the generalizations are in terms of “much smaller parts”, why trillions of parameters?

No-one asks these questions, because just making them bigger is working wonderfully well for now. A bit like climbing the tallest tree to get to the moon, you seem to be making marvelous progress, until you’re not.

Well, the first step to understanding is to identify a question. At least you are thinking about the problem. I think most people going into “AI” just now don’t think about what’s fundamentally happening at all. They are just wowed by the trillions of generalizations these things suddenly make accessible automatically, and their superior memory. They dismiss the already evident limits of those trillions of ossified generalizations in “hallucination”, perhaps propose some kind of past-tech bolt-on to filter the “hallucination” out, and just use all these wonderful new APIs to make pretty pictures, or try to get funding for engineering prompts etc.

roboto:

roboto:

There was research suggesting that Transformers are Hopfield networks after some slight modifications, if they are allowed to oscillate at each layer of the transformer. Not sure if this is fundamentally different from your approach, but there’s the oscillatory component to it. Youtube link: Hopfield Networks is All You Need (Paper Explained) - YouTube

At 20:01 is one of the parts where he starts to explain the Hopfield network’s multiple update mechanism of iteratively trying to find energy minima. I took this as being somewhat similar to the process of achieving a state of stable oscillation. Stable meaning finding the correct clusters and those clusters not changing. And yes, you’re right, this is similar to a standard transformer. I was trying to point out some similarities but it seems it’s fundamentally different from your idea.

I see what you mean by a kind of oscillation. It’s in an iterative procedure of magnifying a particular key, to retrieve a particular value.
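For anyone who wants to see that iteration concretely, here is a minimal sketch of the retrieval update used for modern (continuous) Hopfield networks, where the new state is the stored patterns weighted by a softmax over their similarity to the current state. The stored patterns, the noisy query, the `beta` value and the function names are all made up for illustration. It shows the lookup being sharpened over repeated updates, nothing more.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hopfield_update(patterns, state, beta):
    """One update: new_state = sum_i softmax(beta * <pattern_i, state>)_i * pattern_i."""
    scores = [beta * sum(p * s for p, s in zip(pattern, state)) for pattern in patterns]
    weights = softmax(scores)
    return [sum(w * pattern[d] for w, pattern in zip(weights, patterns))
            for d in range(len(state))]

def retrieve(patterns, query, beta=4.0, steps=5):
    """Repeat the update; the state settles towards the best-matching stored pattern."""
    state = list(query)
    for _ in range(steps):
        state = hopfield_update(patterns, state, beta)
    return state

# Hypothetical stored patterns (here keys and values are the same vectors).
patterns = [
    [1.0, 1.0, -1.0, -1.0],
    [-1.0, 1.0, 1.0, -1.0],
    [-1.0, -1.0, 1.0, 1.0],
]
noisy_query = [0.9, 0.6, -0.2, -1.0]  # a corrupted version of the first pattern

result = retrieve(patterns, noisy_query)
print([round(x, 3) for x in result])
```

Each pass magnifies whichever stored pattern already matches best, so the noisy query is pulled onto the first pattern.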

Also, what he says about not aggregating the values, but aggregating the keys, ~minute 20:40, sounds a little like the distinction I was trying to draw between “similarity” of a pixel greyscale value, and “similarity” of context defining a diverse set of pixel greyscale values, in my other thread with @bkaz.

But even if a Hopfield net is iteratively magnifying its lookup, it is still performing a lookup…

I don’t know. Maybe if the network were completely untrained, it might equate to what I am suggesting. If you equated “keys” to sequential network connections… And then iteratively fed back the “lookup” of that key, to be a “lookup” of some structure latent in the network. It might start to come to the same thing.

Then the insight I’m promoting would be that you would not want to train the network! Any “training” would just be throwing away structure which is latent in it. In that case a Hopfield network might be an overly complex way of simply setting the network oscillating and seeking a maximum in the oscillation.

Having thrown away the training, I’m not sure how much of what was left would make sense as a Hopfield network.

But yeah, there might be a parallel there, you’re right.

I haven’t been deeply into the maths of this. And Wolfram is a self-absorbed PITA in many ways (In particular I think Wolfram has disappeared up his own immaculate conception in his ideas about language, which he seems to reduce to a list of labels, so he can equate it to his “Wolfram Language” computation library, which has made his millions, completely ignoring his own ideas about system!) But Wolfram has looked at the power of simple combination rules to generate complexity. If you google him, as an example of someone who has looked at the power of combination unconstrained by ideas of limitation by abstraction, I think you’ll find sufficient evidence that simple combination rules can grow without bound.
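As a small illustration of the kind of thing Wolfram looked at, here is a sketch of an elementary cellular automaton (Rule 30, one of his standard examples). The grid width, number of steps and wrap-around edges are arbitrary choices for the demo. It only illustrates the general point that a tiny combination rule can produce intricate structure; it is not a model of language.

```python
def step(cells, rule=30):
    """Apply an elementary cellular automaton rule once (edges wrap around)."""
    n = len(cells)
    new_cells = []
    for i in range(n):
        left, centre, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        index = (left << 2) | (centre << 1) | right  # neighbourhood as a 3-bit number
        new_cells.append((rule >> index) & 1)        # look up that bit of the rule
    return new_cells

def run(width=31, steps=15, rule=30):
    """Start from a single live cell in the middle and collect every generation."""
    row = [0] * width
    row[width // 2] = 1
    rows = [row]
    for _ in range(steps):
        row = step(row, rule)
        rows.append(row)
    return rows

for row in run():
    print("".join("#" if cell else "." for cell in row))
```

The whole rule is an eight-entry lookup table, yet the printed triangle quickly stops looking regular.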

robf · April 13, 2023, 8:00am

The relationship to embodiment is tricky. There’s two ways to interpret embodiment in the context of intelligence, or language. One is that embodiment is the solution to the puzzle that we can’t find an abstract system. This theory says that intelligence is attached to a particular body, our human body, and therefore we need to recreate that body to recreate intelligence.

This was the direction George Lakoff took the search for a “layer below” after Cognitivism split from Chomsky in the '70s. Here’s a nice presentation of that (I think this is the one I saw):

George Lakoff: How Brains Think: The Embodiment Hypothesis

(Oh wow, look at this. Just discovered. This is more recent. Dipping into the beginning, he appears to be further reifying the drift to focus on “social justice” abuses of language that I said has become the direction of the Cognitivist and Functional schools of theoretical linguistics in recent years, and has aligned his idea of embodiment with current identity politics! Body = identity. He’s using it to add fire to the rejection of Enlightenment reason and say abstract truth does not exist, only the truth of identity, a particular body, exists. Wow. That would be terribly dangerous. It would further justify the current rejection of objective truth in the humanities, and the fracturing of politics into identitarian conflict. I want to do the opposite. I want to emphasize the idea that there is a sense of objectivity to truth, and that it is to be found in the (chaotic) process of generating truth from example. If we can accept a unifying process by which different “truths” are formed, we can escape conflict based on the identity of the person forming it. He’s going the opposite way. Wow, the push to equate everything with identity in the humanities seems strong!

I don’t see how you can have AI with this, of course. I don’t think Lakoff’s ideas have been useful for technology anywhere, only for creating more social division! Instead it is abstractions on language which have proven useful, the other direction, going up above language, instead of trying to go down below it…

The Neuroscience of Language and Thought, Dr. George Lakoff Professor of Linguistics
https://www.youtube.com/watch?v=JJP-rkilz40)

So that’s the direction Lakoff has taken the search for an objective basis for language, and thought, after he split from Chomsky’s search. Like the postmodernists who have dominated the humanities, Lakoff is equating meaning with identity.

I hold with a weaker form of that (weaker for identity dependence, stronger for an independent system of meaning.) I hold that intelligence is embodied. But that many bodies can be intelligent. So you can’t copy one or other example of intelligence without copying the exact body. But you can have intelligence.

To me it is exactly the case of chaos. Like the weather. You can’t copy the weather without copying the exact planet it is embodied in. Actually impossible. But you can create another weather system. It will of course be different to the weather of our planet, and so imperfect for weather prediction. But it will still be weather.

This is chaos. Chaos is embodied. Down to the butterfly. That’s its canonical property. But the idea of chaos is not limited to any particular body.
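A toy sketch of the butterfly point, using the logistic map as a stand-in for “weather”. The map, the parameter r = 4, the starting values and the step count are all assumptions chosen for illustration, not a weather model. Two runs that start almost identically end up completely different, yet both are still recognisably the same kind of chaotic system.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x), a standard toy chaotic system at r = 4."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

STEPS = 60
planet_a = logistic_trajectory(0.2, STEPS)          # "our" weather
planet_b = logistic_trajectory(0.2 + 1e-10, STEPS)  # the same, plus one butterfly

gaps = [abs(a - b) for a, b in zip(planet_a, planet_b)]
print(f"gap after 1 step:   {gaps[1]:.2e}")
print(f"largest gap so far: {max(gaps):.2f}")
```

You can't reproduce trajectory A without its exact starting state, but trajectory B stays inside the same bounds and is just as chaotic.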

Identifying its embodied property with chaos means you can lift intelligence off the exact embodied substrate. You can actually model it from any level. Because the property, chaos, can be the same. The chaos it generates won’t be the same. But it will still be chaos. So it will still be a kind of intelligence.

This explains why transformers might appear to demonstrate a kind of intelligence, even without grounding in data more embodied than the “body” (corpus) they are trained on. (It is just that being trained, they abstract away that body of text, and so abstract away the ability of that text to be interpreted in novel ways. That’s where “intelligence” escapes them, not the fact of being based on text as such.)

roboto · April 13, 2023, 9:11am

The brain has many more parameters (synapses), not including the glial cells and other things, such as the chemicals that govern its functions, like how synapses move and find their way around in physical space. There are also many reasons why increasing parameter size is useful. For one thing, increasing the context window size requires more parameters, and a longer window could equate to more working memory if I were to draw an analogy to the brain. Regarding the hallucination part, yea, I think there’s still some fundamental limitation in the design, like its lack of working memory and of an ability to determine the confidence level of its output based on past memories (like maybe how certain combinations are possible but not registered in memory).

I can’t judge if a trillion parameters is big or not. What should that number be compared to? On the question of why trillions, or huge numbers, of parameters are needed if it is learning smaller parts, I think there are huge numbers of them, but finite, and the AI must also encode how they interact with each other.

robf:

roboto:

My thinking is that there’s only so many different clusters that can be formed in a sentence. Learning them all shouldn’t be a problem

I haven’t been deeply into the maths of this. And Wolfram is a self-absorbed PITA in many ways (In particular I think Wolfram has disappeared up his own immaculate conception in his ideas about language, which he seems to reduce to a list of labels, so he can equate it to his “Wolfram Language” computation library, which has made his millions, completely ignoring his own ideas about system!) But Wolfram has looked at the power of simple combination rules to generate complexity. If you google him, as an example of someone who has looked at the power of combination unconstrained by ideas of limitation by abstraction, I think you’ll find sufficient evidence that simple combination rules can grow without bound.

I saw one of his videos posted on this forum recently. Funny that I somehow had similar dark thoughts about the guy and I thought I was being judgmental. But yea, he had awesome things to say and opened my eyes to the greater universe, but I’m not sure if it’s too helpful towards AGI.
