Chaos/reservoir computing and sequential cognitive models like HTM

I don’t think it is hard to break it into smaller chunks (a toy code sketch of these steps follows the list):

Chunks:

  1. Code as much of a language as possible into sequences in a network.
    (1a. A likely subtlety: allocate an SDR to each observed sequential element, and code the network as sequences of sub-sets of that SDR, to distinguish longer contexts.)
  2. Apply inhibition to the network at varying levels until you achieve oscillation on perturbation with a “prompt” (a network of language sequences is naturally recursive, so to achieve oscillation you only need inhibition).
  3. Observe how the raster plot of recurrent spikes for the prompt sequence clusters into “pooled” sub-sequences.
    (3a. Bonus for keen players: observe how a raster plot for the broader network predicts continuations/completions of the prompt.)
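
To make these steps concrete, here is a minimal toy sketch in Python/NumPy. It is only an illustration of the shape of the experiment: the parameter values, the tiny corpus, and the k-winners-take-all stand-in for inhibition are assumptions of mine, not a claim about the right implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000          # total neurons in the toy network
SDR_ON = 40       # active bits allocated per letter SDR
K_WINNERS = 40    # inhibition level: how many neurons may fire per step

letters = "abcdefghijklmnopqrstuvwxyz "
sdr = {c: rng.choice(N, SDR_ON, replace=False) for c in letters}

# Chunk 1: code the language into sequences - Hebbian-style binary links
# from the SDR active at time t to the SDR active at time t+1.
W = np.zeros((N, N), dtype=np.float32)
corpus = "the cat sat on the mat the cat ran "
for a, b in zip(corpus, corpus[1:]):
    W[np.ix_(sdr[a], sdr[b])] = 1.0

def step(active):
    """Chunk 2: one recurrent step with global inhibition (k-winners-take-all)."""
    if active.size == 0:
        return active
    drive = W[active].sum(axis=0)             # drive from currently active cells
    winners = np.argsort(drive)[-K_WINNERS:]  # inhibition keeps only the k most-driven
    return winners[drive[winners] > 0]

# Chunk 3: perturb with a "prompt", then let the recurrent network run freely,
# recording (time, neuron) spike events for a raster plot.
prompt = "the ca"
raster = []
active = np.array([], dtype=int)
for t, c in enumerate(prompt):
    active = np.union1d(sdr[c], step(active))  # external input + recurrent drive
    raster += [(t, int(n)) for n in active]
for t in range(len(prompt), len(prompt) + 30):
    active = step(active)
    raster += [(t, int(n)) for n in active]

print(len(raster), "spike events; plot time vs neuron index to see the raster")
```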

Is HTM really ONLY synchronous queues? Was that the original goal? Synchronous queues? The solution for AI is synchronous queues? “Meaning” is a queue?

Do you think there is a dichotomy between what I’m suggesting and HTM in the original broad sense of ANY hierarchical temporal memory? The original sense of hierarchical temporal memory to my understanding was a general direction to explore cognitive processing in the brain. A general research focus. One which hypothesizes a hierarchical organization of temporal information as a promising initial direction to explore. Temporal, sequences. And forming a hierarchy… Nothing more. An inquiry.

Or do you only mean a dichotomy with HTM as instantiated in current theory?

A dichotomy with current theory, I don’t doubt. That’s how you move forward.

What is the goal here? Is it to explore a problem or enshrine a doctrine? The latter was why I left before. I was told that “Jeff’s ideas must dominate here”. That’s why I left. If it is still the case that Jeff’s ideas are more important than a solution for AI, and the canon of HTM doctrine as it happens to be at this moment must remain forever sacrosanct, then this may not be an appropriate place to seek solutions, true.

I don’t know what the culture is here now. I only got drawn back into posting because a thread asking why transformers are so successful somehow resurfaced in my email.

2 Likes

Clearly not. But there is a core framework which is the basis for other experimentation.
In essence (only my opinion) HTM is a modular unit still waiting for an architecture. It is the chunk, or unit. But it’s quite a complex chunk at that, with many implicit hyperparameters.

No, I think many here would support this expansion into a dynamic systems approach. It may be closer to a redefinition from first principles, which implies a lot of work rather than tweaking.

However, I would also suggest that spiking NNs are a different skill set, and probably one held by fewer people. This suggests people with Control Theory and Physics backgrounds rather than just Comp Sci. Or, if you are lucky, polymaths… there are certainly some around here.

I would clearly vote for the former, but since this is on equipment paid for by Numenta, I’m sure there will also be a bit of the latter :wink: :grinning:

2 Likes

No, but it’s quite far from Numenta’s implementation, such that it is probably more efficient to implement your ideas from scratch than to convince someone to make a major overhaul (both conceptual and functional) of the existing HTM code base.

And I don’t think anyone will oppose you here as long as you keep it clear that it is a different thing from Numenta’s HTM. Lots of (more or less) related stuff has been discussed and showcased here without problems, especially when it was posted in the right sub-forum.

3 Likes

I’m not quite sure; wouldn’t imagining it as dynamically, chaotically situated mean “a lot of extra work” by computers? That’s excessive computation power, beyond affordability even for the big tech companies.

I feel you meant you have something better to offer, possibly for startups to engage in exploring, but that you haven’t figured out the practical approach yet.

I think this lacks details: “allocate” randomly and then keep it fixed, or something else? Could the SDRs for letters dynamically adjust/adapt? If so, how?

E.g. take an SDR schema for letters of 2000 mini-columns with 100 neurons each; that’s 200K dots along the y-axis of the raster plot. If each neuron renders a dot on the y-axis, I don’t think you are going to observe meaningful clustering/patterns there, only apparent randomness.

A micro-algorithm would be needed to translate the neuronal topology back into the letters it represents, so that the y-axis of the raster plot could render a dot for each letter instead of for each neuron.
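
Something like the following is the kind of micro-algorithm I have in mind, purely as a sketch: it assumes the letter → SDR allocation is known and that the letter SDRs barely overlap, and it just collapses (time, neuron) spike events into (time, letter) dots.

```python
from collections import defaultdict

def letter_raster(spikes, sdr):
    """spikes: iterable of (t, neuron); sdr: dict letter -> iterable of neuron indices.
    Returns sorted (t, letter) dots, one per letter whose SDR had a majority of
    its cells firing at time t. Assumes the letter SDRs barely overlap."""
    neuron_to_letter = {int(n): letter for letter, cells in sdr.items() for n in cells}
    counts = defaultdict(int)
    for t, n in spikes:
        letter = neuron_to_letter.get(int(n))
        if letter is not None:
            counts[(t, letter)] += 1
    return sorted((t, letter) for (t, letter), c in counts.items()
                  if c >= len(sdr[letter]) / 2)
```

With the allocation fixed like that, the y-axis of the raster plot could then be 26-ish letters rather than 200K neurons.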

1 Like

Well, it will seem like “a lot of extra work” to find structure anew each time if you think it needs to be done only once. But if you imagine the set of solutions grows infinitely, then it will be much less work to find only the one you need, when you need it.

It doesn’t seem plausible that the current compute cost is the true minimum: in excess of tens of millions of dollars in compute time ($12M for GPT-3?) and months of training (Reddit says “It would take 355 years to train GPT-3 on a single NVIDIA Tesla V100 GPU”). And that’s just the compute time; it doesn’t count the data. I read that the new Whisper speech recognition engine required the equivalent of 77 years of speech data. Somehow children get by with about 2 years.

If the limiting factor on your compute time turns out to actually be the size of the solution set (infinite?) and not the cost of computing each individual solution (which, if it can be completely parallelized, might have compute cost only in the number of neurons, so actually next to zero time), then only computing the small sub-set of infinite potential solutions which any individual is capable of encountering in their lifetime will be by far the cheaper option.

You might add to that the cheapness of spiking networks, where activity is very sparse, unlike traditional weighted networks where the whole network is always active. Spiking networks are generally seen as much cheaper, and possibly the reason the brain gets by on 20 W or so of power. (By contrast, towardsdatascience.com says: “It has been estimated that training GPT-3 consumed 1,287 MWh”. At 20 W that works out to roughly 64 million hours, or something like 7,000 years: on the order of 100 human lifetimes of power consumption at 20 W.)
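
Just to check that arithmetic (a few lines of Python; the 75-year lifetime is only my rounding assumption):

```python
gpt3_training_wh = 1_287e6            # 1,287 MWh, the figure quoted above, in Wh
brain_watts = 20
hours = gpt3_training_wh / brain_watts          # ~64 million hours
years = hours / (24 * 365.25)                   # ~7,300 years
lifetimes = years / 75                          # ~100 human lifetimes
print(f"{hours/1e6:.0f} million hours, {years:,.0f} years, {lifetimes:.0f} lifetimes")
```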

Does it seem like the current compute cost is actually the cheapest solution?

Or does it seem possible the current compute cost is actually telling us, screaming at us, that current solutions are chasing an infinite solution set?

And our compute cost for each of an infinite number of solutions might be only the number of parallel compute elements, plus the time to synchronize their spikes each time.

Yes, this lacks details. Short answer, I don’t know.

What I think is that the ultimate answer will be to sample neural firings in the cochlea and form a network from the time series of those.

Such cochlear nerve firing representations might randomly vary from individual to individual, maybe even changing over time for a single individual (I think Walter Freeman notes that the rabbit representation for the smell of carrots changes).

But for initial experiments I think the allocation of SDRs to letters could be whatever we like. Then we experiment to see what sample size is needed to code what length of prior sequence. It might require some fiddling to get the number and sample size to achieve resolution of different paths over sufficiently long sequences.

Why 2000 mini-columns? For letters wouldn’t that be 26 or so? Maybe 100 neurons/column to represent sequences of letters though, yes.

I don’t envisage us clustering on the y-axis, but on the time/x-axis.

If we’re allocating letter representations, we won’t need to plot those in clusters.

For the deeper cochlear neural activation encoding, then yes, we would need to plot structure below letters, so we could find the letters too.

But to a first approximation, if we allocate SDRs for letters, those can easily be translated back to letter labels (though we might allow the subsets of each letter SDR used to trace a sequence to vary. Note that allowing different subsets to code paths would be different from HTM. Using different paths of subsets for the same sequence could provide some sense of “strength” for a sequence. This would be in contrast to HTM, where the “strength” of a sequence is represented by “training” over multiple repetitions, but only one path through the column cells represents each sequence.)

Anyway, in the case where we allocate letter SDRs, you may be right (if I understand you) that we may be able to get away with just plotting “letter” spike times. The clustering underneath would still be using the information of sequence at a distance, coded by the path through different subsets of each letter SDR. But we might not need to plot that; just a plot for each letter might be enough to see structure appear at a level greater than the letter level.
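
To make the “different subsets of the same letter SDR” idea concrete, here is a toy illustration. The hash-based choice of subset is purely an assumption for the demo (any deterministic function of the prior path would do); the point is only that the same letter, arriving by different paths, fires different but overlapping cells.

```python
import hashlib
import numpy as np

rng = np.random.default_rng(1)
N, SDR_ON, SUBSET = 2000, 40, 12
sdr = {c: rng.choice(N, SDR_ON, replace=False) for c in "abcdefghijklmnopqrstuvwxyz "}

def subset_for(letter, context):
    """Pick which SUBSET cells of `letter`'s SDR fire, as a deterministic
    function of the preceding context, so the same letter on different
    paths uses different (but overlapping) cells."""
    seed = int.from_bytes(hashlib.sha256((context + letter).encode()).digest()[:8], "big")
    idx = np.random.default_rng(seed).choice(SDR_ON, SUBSET, replace=False)
    return set(int(n) for n in sdr[letter][idx])

# The same letter 'a' after different prefixes activates mostly different cells:
a_after_c = subset_for("a", "c")   # as in "ca..."
a_after_m = subset_for("a", "m")   # as in "ma..."
print(len(a_after_c & a_after_m), "cells shared out of", SUBSET)  # typically only a few
```

In an HTM-like network the subset would presumably be chosen by which cells were predicted by the prior path, rather than by a hash, but the distinguishing effect is the same.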

1 Like

I don’t think that’s even close to “cheap”. But neither was the evolution that formed our brains. Our brain is the result, not the process of obtaining it. Earth has afforded billions of years of weathering to develop the program (in the form of DNA) that conducts the growth from a single cell (a fertilized egg) into a full-fledged human being.

You are comparing the cost of running a final product against the cost of manufacturing it; that’s not fair. We are not talking about having a machine use its “head”, but about making a “head” for the machine. That means finding appropriate “dynamic structures” (resonating connections among the neurons, as well as the topology) in plausible physical forms.

This describes our existing brain at work, but now we’d like some practical routine to make one out of silicon.

That adheres to “sparse”, but fails “distributed” as far as an SDR is concerned; why/how is that acceptable?

1 Like

Only “sparse” in the sense that the cell representation for a state participating in different sequences would be sparse, that’s true.

Maybe you’re right. Maybe we’d need multiple columns even for assigned letter representations.

And that makes more sense. We do expect to have a large number of compute elements. Just 26 x 100 is not going to cut it!

We can choose however many columns for our letter representations we wish. It comes down to having the coding depth to represent all paths through those letters: both “internally”, for those closely interconnected paths which will come to define “words”, and “externally”, for those less closely interconnected, but still relatively tightly interconnected, paths which will come to define meaningful (meaningful = sharing predictions/contexts) groups of words, which we will call a (hierarchy of) phrase structure.
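
As a crude, word-level illustration of what I mean by “meaningful = sharing predictions/contexts” (the toy corpus and the Jaccard overlap are assumptions for illustration only; the real grouping would be over paths of SDR subsets, not whole-word counts):

```python
from collections import defaultdict

corpus = "the cat sat on the mat the dog sat on the rug a cat ran a dog ran".split()
followers = defaultdict(set)                 # word -> set of words that follow it
for w, nxt in zip(corpus, corpus[1:]):
    followers[w].add(nxt)

def shared_context(w1, w2):
    """Jaccard overlap of the continuations two words predict."""
    a, b = followers[w1], followers[w2]
    return len(a & b) / len(a | b) if (a | b) else 0.0

print(shared_context("cat", "dog"))   # high: both predict "sat" and "ran"
print(shared_context("cat", "on"))    # low: almost no shared continuations
```

Words that score high on something like this would be candidates to “pool” into the same group of the phrase hierarchy.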

1 Like

I’ve sort of (maybe not) related these points here in the past, albeit from a simpler context.

TL;DR: if “the existence of a variable” is a metric, then one variable’s existence causes another variable’s (or variables’) existence metric to decrease. At some point the diminished variable will be forgotten, and in some architectures they compensate for this issue by minimizing the decrement values and assigning parameter values to more, or an increasing number of, variables. The smaller these values are, the longer it takes for a variable’s existence to disappear, at the cost of more variables (e.g. billions of parameters).

I do think that the contradictory parameters increase in number as the context becomes “high-level”, i.e. further up the hierarchy. Hence at a low level, pattern recognition is relatively easily solved (e.g. sequences), but as it goes up the hierarchy these patterns become contradictory depending on context.

Yes, I think some of the themes are the same.

Thanks for the comment.

Someone else replied to an essay I published on Substack recently, and pointed out similar ideas by John Boyd, a somewhat original-thinking Air Force fighter pilot! How’s that for an off-the-wall application in another field. And Boyd’s interpretation is maybe closer to yours, in a slightly destructive sense of a new interpretation deconstructing the old.

I see the tension as less deconstruction, and as both more instantaneous and reversible. I think it has been presented elsewhere as a “phase change”
(especially Walter Freeman talking of phase changes?).

Here’s my substack essay anyway, and the John Boyd essay linked in the comments:

Incidentally, looking at Boyd’s essay I saw he listed Edward deBono among his influences: the originator (I believe) of the term “Lateral Thinking”, and someone I have also come across talking about instantaneous changes in perspective. DeBono, I believe, compared it to the varying water channels of a braided river.

For the proof about curve fitting you mentioned in your original post, perhaps Goedel’s proof that this is not possible for mathematics might be the one you were looking for, with appropriate reinterpretations and definitions. I sometimes think a parallel proof in the machine learning, or at least the natural language grammar, domain might make a nice paper for someone with a mathematical bent.

1 Like

Do you know or can you suggest any algorithm to implement and test this idea?
I have problems understanding how it works.

1 Like

Me too, and I just drafted a simulation according to (my understanding of) @robf’s ideas (about a first stage to go for), described in another thread: The coding of longer sequences in HTM SDRs - #27 by complyue

I believe runnable code / simulation should clarify things faster, and running that with Gitpod is just a few clicks away with a browser. Please have a look, and let us continue the discussion and drive our ideas / understandings with simulations.

4 Likes