Which Edelman book are you recommending? I looked back in this thread and didn’t see a specific book.
I don’t know who you asked, but basically he has three books on the topic:
- Neural Darwinism: The Theory of Neuronal Group Selection
- Topobiology: An Introduction to Molecular Embryology
- The Remembered Present: A Biological Theory of Consciousness
All his subsequent books are just variations on the same subject, trying to appeal to the non-technical layman.
He also has a great intro to the theory in a book with Mountcastle:
- The Mindful Brain: Cortical Organization and the Group-Selective Theory of Higher Brain Function
This book is the basis for my hex-grid post.
Edelman’s Neural Darwinism came up in this thread because @DanML responded to my arguments - that a failure to recognize contradictory, even chaotic, structure is what has been holding us back in AI - by mentioning similar themes in György Buzsáki’s work:
I looked up György Buzsáki and said his work reminded me of Gerald Edelman’s Neural Darwinism.
@Bitking then recommended Calvin’s book on another variant of Neural Darwinism, with many of the same themes of expanding, growing, changing, structure, even chaos, that had been the basis of things he had been working on for HTM.
So basically Neural Darwinism came up in this thread because I am arguing that contradictory structure may be what we are missing in our attempts to code “meaning” in cognition. And that led to some links to similar ideas, notably growing or contradictory, even chaotic, representation, in Neural Darwinism.
I agree with the themes of growing structure, and especially chaos, in Neural Darwinism. But I disagree with the specific mechanism it proposes for finding new structure.
I’m arguing in this thread that we don’t need to have the random variation followed by evolutionary selection of Neural Darwinism. Rather I think existing “learning” methods, specifically the cause and effect generalization of transformers and Large Language Models, is already telling us the structuring mechanism we need. The only thing which is holding us, and transformers, back, is that we still assume the structure to be found is static and can be learned.
I’m arguing that the blow-out in the number of parameters “learned” by transformers is actually indicating that they are generating growing, contradictory, and even chaotic, structure already. So the only mistake is that we are trying to “learn” all of it at once. Rather we should be focusing on its chaotic generator character, and generating the structure which is appropriate to each new prompt/context, at run time.
And I’m suggesting we can do that, by seeking cause-effect prediction maximizing structure in “prompt” specific resonances in a network of sequences, initially language sequences. This in contrast to transformers, which look for these cause-effect prediction maximizing structures, by trying to follow prediction entropy gradients to static minima, and then only selecting between potentially contradictory structures learned, using a prompt, at run time.
Maybe it is ironic (pity me at least): my spoken English (both listening and speaking) is terribly poor, so it seems we’ll have to test the limits of text-based communication. Anyway, since we are finding ways to tell a machine to do such things effectively, doing it ourselves (i.e. dogfooding) may help somehow.
Well, we could try in my Chinese. Though I would be very surprised if that provided a clearer channel!
To attempt in text… Your first question:
I’m hypothesizing “words” will be collections of letter spikes which tend to synchronize in a raster plot. As sketched in this post:
You can imagine the "x"s in the sketched raster plot above being pushed together, synchronized, in the same way.
So, to answer your question, I’m saying that might be how we could ‘identify such “elements” for “words”, out of simulated data and phenomena.’
Why would they synchronize? You made the good point that with a single node representation the letter network would immediately become completely saturated with links. It’s brought me back to the idea that we will need a distributed representation, an SDR, even for letters. I think an SDR representation for letters can mean that an entire path leading to a letter can be coded as a subset of the whole letter SDR. The same path can even be encoded in multiple such subsets, effectively coding for repetition. And the same variety of subsets can differentiate letter connections generally, so our network is not immediately saturated, and we can differentially find resonances among more tightly connected clusters associated with words.
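To make the subset idea concrete, here is a toy sketch. All the sizes (100-bit letter SDRs, 40 active bits, 10-bit path subsets) are purely illustrative assumptions, not part of the proposal:

```python
import random

random.seed(42)

SDR_SIZE = 100   # bits allocated per letter (illustrative)
ACTIVE = 40      # active bits per letter SDR (illustrative)
SUBSET = 10      # bits coding one path through a letter (illustrative)

# Assign each letter a fixed random SDR: a set of active bit indices.
letters = "abcdefghijklmnopqrstuvwxyz"
letter_sdr = {c: frozenset(random.sample(range(SDR_SIZE), ACTIVE))
              for c in letters}

def path_code(c):
    """Code one traversal of letter c as a random subset of its SDR.
    Different subsets can code different paths (contexts) through c,
    so the network is not saturated by a single node per letter."""
    return frozenset(random.sample(sorted(letter_sdr[c]), SUBSET))

# Two traversals of 'a' in different contexts get (almost surely)
# distinct subsets, but both lie inside the full 'a' SDR, so the
# letter itself is still recoverable from either path code.
p1, p2 = path_code("a"), path_code("a")
assert p1 <= letter_sdr["a"] and p2 <= letter_sdr["a"]
```

The same path recurring could be coded by yet another subset, which is the “coding for repetition” point above.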
In this I’m reminded of earlier work coding longer sequences in an SDR explored together with @floybix back in 2014. I’ve posted that in a separate thread. Perhaps Felix’s code can be a basis for our letter SDR representation now:
Your second question:
To find which sub-sequences synchronize their oscillations we don’t need to enumerate them all. We just need to set the network oscillating, perhaps with a specific driving “prompt” activation, and see which elements synchronize. The network performs the search, in parallel, for us, as it is seeking its own minimum energy configuration.
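As a crude stand-in for “the network performs the search for us”, reading off which elements synchronized could start as simply as binning spike times from the raster plot; the 2.0 time-unit window is an arbitrary assumption:

```python
from collections import defaultdict

def synchronized_clusters(spikes, window=2.0):
    """Group element ids whose spikes fall in the same time window.
    `spikes` is a list of (time, element_id) pairs, i.e. raster data.
    A real analysis would use proper synchrony measures; binning is
    just the simplest possible placeholder."""
    bins = defaultdict(set)
    for t, elem in spikes:
        bins[int(t // window)].add(elem)
    # A "cluster" is any window where more than one element fired.
    return [sorted(b) for b in bins.values() if len(b) > 1]

# Toy raster: 'c', 'a', 't' synchronize; 'x' fires alone later.
spikes = [(0.1, "c"), (0.4, "a"), (0.9, "t"), (5.0, "x")]
print(synchronized_clusters(spikes))  # [['a', 'c', 't']]
```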
I think this is still the bit that I find the most confusing.
Let me read back a high-level outline of what I think you are proposing:
- Build an SSN (LIF-like) network for some input data (phrase/segment)
- Tune the net so it is just ‘sub-critical’ (from a feedback perspective?)
- Do some WTA/thresholding to decide the most active nodes
- Goto 1 with the next segment
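Read back as code, my understanding of the loop above is something like this toy sketch. The network size, 10% connectivity, 0.95 spectral-radius scaling, and k=5 are all my assumptions, not part of the proposal:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50  # number of nodes (assumed)

# 1. Build a sparse random recurrent net for the input segment.
W = rng.random((N, N)) * (rng.random((N, N)) < 0.1)

# 2. Tune toward 'sub-critical': scale weights so the spectral radius
#    sits just below 1, i.e. activity neither dies out instantly nor
#    runs away (the feedback perspective).
W *= 0.95 / max(abs(np.linalg.eigvals(W)))

def step(x, k=5):
    """3. One update: recurrent drive, then k-winners-take-all."""
    drive = W @ x
    out = np.zeros(N)
    out[np.argsort(drive)[-k:]] = 1.0  # keep only the k most active
    return out

# 4. Iterate (here on one segment; a new segment would re-seed x).
x = np.zeros(N)
x[rng.choice(N, 5, replace=False)] = 1.0  # the driving "prompt"
for _ in range(20):
    x = step(x)
assert x.sum() == 5  # exactly k winners remain active each pass
```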
Does the input to 1 require enough nodes for the whole vocab (all classes) or is it adaptive somehow?
What happens on the 2nd pass? Are we using a new or modified net for 1? Same weights for 2? What gets fed back from output 3?
Please let me know if this is close to your outline. Or change it to fit.
SSN = Spiking Sequence Network??
A network of sequences of some kind, yes. Maybe good to start with SDRs for letters in text, without any spaces, and try to find “words”.
So that it oscillates, yes. Sub-critical in the sense of not quite running away activation or inhibition.
I don’t think we need to examine the activity of nodes. I think we can just look at clusters of spike times in a raster plot.
Not sure what you mean by “segment”. Segment of text?
The network would be for a large sample of text. A corpus. As big as possible. Ideally the whole language. Then processing would submit phrases or sentences, rather like “prompts” in transformer terms, to be structured. Or to generate a prediction like transformers. So if by “segment” you mean go on to the next “prompt”, then yes.
Thanks for responding. Let me elaborate slightly so we have the same understanding.
My typo - SNN (Spiking Neural Network).
Yes, you are spot on.
Winner Takes All (top 1) or k Winner Takes All (top k).
Who/what is looking at ‘activity’? We need an algorithm here, yes?
Correct. All LLMs use tokens which sometimes equate to words or partial words (but never letters IMHO).
Sorry for the translation errors but perhaps we can look at the main questions again.
Does this represent the totality of the ‘process’? (this is still very high level)
And some of the interplay between steps:
Perhaps we need to distinguish between training and application?
You start training on some elements - this cannot be the whole language. It could be a sentence/paragraph (and in theory a page/book/library, but not in practice). What is the training ‘unit’?
The “algorithm” is to find prediction energy… minima. In that sense it is the same as transformers/LLMs. The prediction energy minima LLMs seek are those on energy surfaces formed by collections of elements, notably back along the sequence, with “attention”. They may also seek energy minima on energy surfaces formed by collections of elements representing the current state. (I need to catch up with the balance there. RNN/LSTMs would be ALL the current state, and just a memory state for the prior sequence. I don’t know how much “attention is all you need” threw out collecting elements across the current state. Maybe you know.) These energy surfaces will represent different collections of elements in the current state (and/or “attention” state), and the minimum energy will be for the collection which best predicts the next state. That’s how they learn to predict. They calculate an energy surface for each collection and follow the slope of that energy surface, adjusting the collection of elements, until they find the minimum prediction energy collection, and that minimum energy collection of elements is their network.
My “algorithm” is the same. It’s seeking minimum energy prediction collections across a network. The main difference is that I don’t think the minimum energy predictive state is static across the whole data set. I think there are different best minimum energy predictive states depending on what you are most interested in. In particular I believe that minimum energy predictive state will vary from prompt to prompt. (I think this is happening in transformers too. But we don’t see it. They hide it from us in their enormous context sensitive “parameter” sets.)
So I want to find the minimum energy predictive state dynamically. And a way to do that should be to set the network oscillating and allow it to find its own minimum energy states, as resonances on the network.
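A minimal illustration of “different minimum energy states depending on the prompt”, using a Hopfield-style relaxation as a stand-in for the proposed spiking/resonance dynamics (the patterns and weights are just a toy, not the actual mechanism):

```python
import numpy as np

# Two contradictory "structures" stored in one set of weights.
patterns = np.array([[ 1,  1,  1, -1, -1, -1],
                     [-1, -1, -1,  1,  1,  1]])
W = patterns.T @ patterns
np.fill_diagonal(W, 0)  # no self-connections

def relax(x, steps=10):
    """Let the network settle into its own minimum energy state."""
    for _ in range(steps):
        x = np.sign(W @ x)
    return x

# Which minimum the dynamics find depends entirely on the "prompt".
prompt_a = np.array([ 1,  1, -1, -1, -1, -1])  # noisy cue for pattern 0
prompt_b = -prompt_a                           # noisy cue for pattern 1
assert (relax(prompt_a) == patterns[0]).all()
assert (relax(prompt_b) == patterns[1]).all()
```

The static-minimum approach would pick one global answer in advance; here the “answer” is only resolved at run time, per prompt.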
The state-of-the-art would do this too. But they assume the surfaces to be found are static. So it probably never occurs to them to seek them dynamically. A lot of extra work, right! Best to do it once and be done with it!
(Also, I guess this would not have been done in neural network history, firstly because they didn’t know what to train for! For supervised “learning”, first you “supervise” to a set of ideals. Just letting the network find its own energy minima doesn’t make sense because you don’t know what “meaning” is. You only have a bunch of examples you consider “meaningful”. So the only thing you can think of is to tie your system to those. Allowing the network to find its own minima only makes sense if the energy minima goal is more general. This is “unsupervised learning”, and it’s never been obvious what energy to minimize for this. What “objective function” to seek. It just so happens that for language the goal is obvious. It’s just prediction along a sequence. I would argue that this simplicity of language is revealing a deeper truth about the relevant parameter of “meaning” for cognition, that the deeper truth about “meaning” in cognition is also cause-effect prediction. And that deeper truth is the reason LLMs seem to be capturing so much of general cognition. So language helped transformers stumble onto a deeper truth of cognition. But even when looking at language, everybody assumes the “truth” will be static. Nobody imagines the “truth” of minimum energy prediction will be multiple, and flip from one structure to another, chaotically. Hence the current LLM state of the art also only seeks static energy prediction surfaces.)
Perhaps the distinction between training and application can become confusing with this change. I guess LLMs don’t have any intermediate stage where the network only represents the language as observed. From the get go their networks will be formed into collections representing the prediction energy surface of the data. “Training” incorporates both the addition of data AND the minimum prediction energy search across that data.
In a system which seeks minimum prediction energy collections dynamically, the minimum prediction energy collection search is postponed to run time. “Training” is only the addition of data.
So the sense of “training” and “application” will be different between the two.
Exactly what “training” will require for the dynamic option, I’m not sure. Naively you just add a synapse for each time step of the sequence in the data. But as has been pointed out in this thread, to capture longer distance relationships, we will need to represent each time step with an SDR. The way that’s done now in HTM, now my memory is refreshed, involves a “training” just to encode the sequence. Each element of the SDR at each time step needs to be given a “map” of all the elements of the SDR at the preceding time step.
Such a time step to time step “map” between states may be necessary when defining “states” as dynamic minimum energy prediction collections too. I’m not sure. It’s possible the “map” will also be found dynamically as part of the whole minimum energy prediction collection thing. Even what should be considered a “state”, or even a “time step”, becomes part of the same search. A “time step” in the cognitive sense becomes just one or other partition of spikes in the raster plot. It may be sufficient to just “train” by adding synapses across random subsamples of SDRs for raw sequential sensory observations (which sequences of sensory observations can actually be asynchronous and just as the data arrives from the sense organs. No central clock…)
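Naively, “training is only the addition of data” might look like this sketch, where each time step just adds synapses between random subsamples of consecutive SDRs; the subsample size of 5 is an arbitrary assumption:

```python
import random

random.seed(1)

def train(sdr_sequence, sample=5):
    """Add synapses linking a random subsample of each SDR to a random
    subsample of the next SDR. No energy search happens here; that is
    postponed to run time in the dynamic scheme."""
    synapses = []
    for prev, cur in zip(sdr_sequence, sdr_sequence[1:]):
        pre = random.sample(sorted(prev), min(sample, len(prev)))
        post = random.sample(sorted(cur), min(sample, len(cur)))
        synapses.extend((a, b) for a in pre for b in post)
    return synapses

seq = [set(range(i, i + 20)) for i in (0, 20, 40)]  # 3 toy SDRs
links = train(seq)
assert len(links) == 2 * 5 * 5  # 2 transitions, 5x5 synapses each
```

Nothing here needs a central clock: each transition is added as the data arrives.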
Thanks. I think you’ve outlined the concept well but, without the ability to break it into any smaller chunks, there is no effective way even to do EDA (Exploratory Data Analysis), which precludes writing any algorithms/code.
I think there is a dichotomy between using HTM (synchronous queues) and a resonating network capable of ‘relaxation’ into the key states you describe (on multiple time and size scales).
If you want to look at using HTM with SNNs (Spiking Neural Nets) then there are a few references, and I’m sure other people here could point you at other/better ones.
I don’t think it is hard to break it into smaller chunks:
- Code as much of a language as possible into sequences in a network.
(1a. Likely subtlety, allocate an SDR to each observed sequential element, and code network as sequences of sub-sets of that SDR, to distinguish longer context.)
- Apply inhibition to the network at varying levels until you achieve oscillation on perturbation with a “prompt” (a network of language sequences is naturally recursive, so to achieve oscillation you only need inhibition.)
- Observe how the raster plot of recurrent spikes for the prompt sequence clusters into “pooled” sub-sequences.
(3a. Bonus for keen players. Observe how a raster plot for the broader network predicts continuations/completions of the prompt.)
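Step 1 of this breakdown, minus the SDR subtlety of 1a, might be sketched as a simple transition network; collapsing each letter to a single node is exactly the simplification 1a is meant to fix:

```python
from collections import Counter

def build_sequence_network(corpus):
    """Code a text, spaces stripped, as a network of letter-to-letter
    transitions with counts as connection strengths. A fuller version
    would code each step as a subset of the letter's SDR (step 1a)."""
    text = corpus.replace(" ", "").lower()
    return Counter(zip(text, text[1:]))

net = build_sequence_network("the cat sat on the mat")
# Tightly interconnected transitions (candidate "word" interiors like
# 't'->'h', seen twice) outweigh transitions that only happen to occur
# once across word joins, and absent ones score zero.
assert net[("t", "h")] == 2
assert net[("t", "c")] == 0  # 't'->'c' never occurs in this sample
```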
Is HTM really ONLY synchronous queues? Was that the original goal? Synchronous queues? The solution for AI is synchronous queues? “Meaning” is a queue?
Do you think there is a dichotomy between what I’m suggesting and HTM in the original broad sense of ANY hierarchical temporal memory? The original sense of hierarchical temporal memory to my understanding was a general direction to explore cognitive processing in the brain. A general research focus. One which hypothesizes a hierarchical organization of temporal information as a promising initial direction to explore. Temporal, sequences. And forming a hierarchy… Nothing more. An inquiry.
Or do you only mean a dichotomy with HTM as instantiated in current theory?
A dichotomy with current theory, I don’t doubt. That’s how you move forward.
What is the goal here? Is it to explore a problem or enshrine a doctrine? The latter was why I left before. I was told that “Jeff’s ideas must dominate here”. That’s why I left. If it is still the case that Jeff’s ideas are more important than a solution for AI, and the canon of HTM doctrine as it happens to be at this moment must remain forever sacrosanct, then this may not be an appropriate place to seek solutions, true.
I don’t know what the culture is here now. I only got drawn back in to posting because a thread asking why transformers are so successful somehow resurfaced in my email.
Clearly not. But there is a core framework which is the basis for other experimentation.
In essence (only my opinion) HTM is a modular unit still waiting for an architecture. It is the chunk, or unit. But it’s quite a complex chunk at that, with many implicit hyperparameters.
No, I think many here would support this expansion into a dynamic systems approach. It may be closer to a redefinition from first principles, which implies a lot of work rather than tweaking.
However I would also suggest that Spiking NNs are a different skill set - and probably held by fewer people. This suggests people with Control Theory and Physics backgrounds rather than just Comp Sci. Or, if you are lucky, polymaths… there are certainly some around here.
I would clearly vote for the former, but since this is on equipment paid for by Numenta, I’m sure there will also be a bit of the latter.
No, but it’s quite far from Numenta’s implementation, so it is rather more efficient to implement your ideas from scratch than to convince someone to make a major overhaul (both conceptual and functional) of the existing HTM code base.
And I don’t think anyone will oppose you here as long as you keep it clear it is a different thing than Numenta’s HTM. Lots of (more or less) related stuff has been discussed and showcased here without problems. Especially when it was posted in the right sub-forum.
I’m not so sure they don’t imagine it as being dynamically, chaotically situated; perhaps it is just “a lot of extra work” for computers? That would be excessive computation power, beyond affordability even for big tech.
I feel you mean you have something better to offer, for startups possibly to engage in this venture, but you haven’t figured out the practical approach yet.
I think this lacks details: “allocate” randomly and then keep fixed, and so on? Could the SDRs for letters dynamically adjust/adapt? If so, how?
E.g. the SDR schema for letters: 2000 mini-columns of 100 neurons each, that’s 200K dots along the y-axis of the raster plot. If each neuron renders a dot on the y-axis, I don’t think you are going to observe meaningful clustering/patterns there, only randomness.
We would need a micro-algorithm to translate the neuronal topology back into the letters it represents, if the y-axis of the raster plot is to render a dot for each letter instead of for each neuron.
Well, it will seem “a lot of extra work” to find structure anew each time if you think it need be done only once. But if you imagine the set of solutions grows infinitely, then it will be much less work to find only the one you need, when you need it.
It doesn’t seem plausible that the current compute cost is the true minimum: in excess of tens of millions of dollars in compute time ($12M for GPT-3?) and months of training (Reddit says “It would take 355 years to train GPT-3 on a single NVIDIA Tesla V100 GPU”). And that’s just the compute time. It doesn’t mention the data. I read that Whisper, the new speech recognition engine, required the equivalent of 77 years of speech data. Somehow children get by with about 2 years.
If the limiting factor on your compute time turns out to be actually the size of the solution set (infinite?) and not the cost of computing each individual solution (which if it can be completely parallelized might have compute cost only in number of neurons, so actually next to zero time), then only computing the small sub-set of infinite potential solutions which any individual is capable of encountering in their lifetime, will be by far the cheaper option.
You might add to that the cheapness of spiking networks, where activity is very sparse. Unlike traditional weighted networks, where the whole network is always active. Spiking networks are generally seen as much cheaper, and possibly the reason the brain gets by on 20W or so of power. (By contrast towardsdatascience.com says: “It has been estimated that training GPT-3 consumed 1,287 MWh”. So that’s what? ~1000 x 1e+6/20 = ~50 million human hours? About 70 human lifetimes of power consumption at 20W.)
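Redoing that back-of-envelope arithmetic with the unrounded 1,287 MWh (the ~70 figure came from rounding down to 1,000 MWh; the 80-year lifetime is my assumption):

```python
# Numbers from the towardsdatascience.com quote above.
gpt3_wh = 1_287 * 1e6            # 1,287 MWh in watt-hours
brain_w = 20                     # human brain power budget, watts

hours = gpt3_wh / brain_w        # brain-hours of energy: ~64 million
years = hours / (24 * 365)       # ~7,300 years of continuous 20W
lifetimes = years / 80           # ~90 lifetimes, assuming 80 years each
print(f"{years:,.0f} years, {lifetimes:.0f} lifetimes")
```

Either way, the order of magnitude is tens of human lifetimes of brain energy for one training run.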
Does it seem like the current compute cost is actually the cheapest solution?
Or does it seem possible the current compute cost is actually telling us, screaming at us, that current solutions are chasing an infinite solution set?
And our compute cost for each of an infinite number of solutions, might be the number of parallel compute elements only, plus the time to synchronize their spikes each time.
Yes, this lacks details. Short answer, I don’t know.
What I think, is that the ultimate answer will be to sample neural firings in the cochlea and form a network from the time series of those.
Such cochlea nerve firing representations might randomly vary from individual to individual. Maybe even changing over time for a single individual (I think Walter Freeman notes that the rabbit representation for the smell of carrots changes.)
But for initial experiments I think the allocation of SDR to letters could be whatever we like. And we experiment to see what sample size is needed to code what length of prior sequence. It might require some fiddling to get the number and sample size to achieve resolution of different paths over sufficiently long sequences.
Why 2000 mini-columns? For letters wouldn’t that be 26 or so? Maybe 100 neurons/column to represent sequences of letters though, yes.
I don’t envisage us clustering on the y-axis, but the time/x axis.
If we’re allocating letter representations, we won’t need to plot those in clusters.
For the deeper cochlea neural activation encoding, then yes, we would need to plot structure below letters, so we could find the letters too.
But in a first approximation, if we allocate SDRs for letters, those can easily be translated back to letter labels (though we might allow the subsets of each letter SDR used to trace sequence to vary. Note that allowing different subsets to code paths would be different to HTM. Using different paths of subsets for the same sequence could provide some sense of “strength” to a sequence. This would be in contrast to HTM where the “strength” of a sequence is represented by the “training” over multiple repetitions, but only one path through the column cells represents each sequence.)
Anyway, in the case where we allocate letter SDRs, you may be right (if I understand you) that we may be able to get away with just plotting “letter” spike times. The clustering underneath would still be using the information of sequence at a distance coded by the path through different subsets of each letter SDR. But we might not need to plot that. Just a plot for each letter might be enough to see structure appear at a level greater than the letter level.
I don’t think that’s even close to “cheap”. But neither is evolution in forming our brains. Our brain is the result, not the process of obtaining it. Earth has afforded weathering for billions of years to develop the program (in the form of DNA) that conducts the growth from a single cell (a fertilized egg) into a full-fledged human being.
You are comparing the cost of running a final product against the cost of manufacturing it; that’s not fair. We are not talking about having a machine use its head, but about making a “head” for the machine. That is, finding appropriate “dynamic structures” (resonating connections among the neurons, as well as the topology) in plausible physical forms.
This describes our existing brain at work, but now we’d like some practical routine to make one out of silicon.
That adheres to “sparse”, but fails “distributed” as far as an SDR is concerned; why/how is that acceptable?
Only “sparse” in the sense that the cell representation for a state participating in different sequences would be sparse, that’s true.
Maybe you’re right. Maybe we’d need multiple columns even for assigned letter representations.
And that makes more sense. We do expect to have a large number of compute elements. Just 26 x 100 is not going to cut it!
We can choose however many columns for our letter representations we wish. It comes down to having the coding depth to represent all paths through those letters, both internally to those closely interconnected paths which will come to define “words”, and “externally”, to those less closely interconnected, but still relatively tightly interconnected paths, which will come to define meaningful (meaningful = sharing predictions/contexts) groups or words, which we will call a (hierarchy of) phrase structure.
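On the question of coding depth: even modest illustrative numbers give an enormous number of distinct path codes per letter. Assuming, purely for illustration, 40 active bits per letter SDR and 10-bit path subsets:

```python
from math import comb

active_bits = 40   # active bits in one letter's SDR (illustrative)
subset_size = 10   # bits used to code a single path (illustrative)

# Each path through a letter can claim a different 10-bit subset of
# the 40 active bits, so the raw capacity is a binomial coefficient.
codes = comb(active_bits, subset_size)
print(f"{codes:,} distinct path codes per letter")  # 847,660,528
```

In practice the usable capacity is lower, since overlapping subsets must remain distinguishable, but the combinatorics suggest the representation need not saturate quickly.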
I’ve sort of (maybe not) related these points here in the past, albeit from a simpler context.
TL;DR: If “the existence of a variable” is a metric, then some variable’s existence causes other variables’ existence metric to decrease. At some point, the diminished variable will be forgotten, and in some architectures they compensate for this issue by minimizing decrement values and assigning parameter values to more, or an increasing number of, variables. The smaller these values are, the longer it takes a variable’s existence to disappear, at the cost of more variables (e.g. billions of parameters).
I do think that the contradictory parameters increase in number as the context becomes “high-level”, or up the hierarchy. Hence at the low level, pattern recognition is relatively easily solved (e.g. sequences), but as it goes up the hierarchy, these patterns become contradictory depending on context.