Chaos/reservoir computing and sequential cognitive models like HTM

Well, it will seem “a lot of extra work” to find structure anew each time if you think it need be done only once. But if you imagine the set of solutions grows infinitely, then it will be much less work to find only the one you need, when you need it.

It doesn’t seem plausible that the current compute cost is the true minimum. In excess of tens of millions of dollars in compute time ($12M for GPT-3?) and months of training (Reddit says “It would take 355 years to train GPT-3 on a single NVIDIA Tesla V100 GPU”) hardly looks like a true minimum. And that’s just the compute time. It doesn’t count the data. I read that the new Whisper speech recognition engine required the equivalent of 77 years of speech data. Somehow children get by with about 2 years.
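
As an aside on that “77 years” figure: if the number behind it is the ~680,000 hours of audio I recall being quoted for Whisper (my recollection, not something from this thread), the arithmetic checks out:

```python
# Rough check of the "77 years of speech data" figure for Whisper.
# Assumption (mine, not from the thread): the training set was ~680,000 hours of audio.
whisper_hours = 680_000
years = whisper_hours / (24 * 365)
print(f"{years:.1f} years of audio")   # ~77.6 years

# Compare with the ~2 years of experience a child gets by on.
print(f"ratio vs. a 2-year-old's experience: ~{years / 2:.0f}x")
```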

If the limiting factor on your compute time turns out to actually be the size of the solution set (infinite?) and not the cost of computing each individual solution (which, if it can be completely parallelized, might have compute cost only in the number of neurons, so actually next to zero time), then computing only the small subset of the infinite potential solutions which any individual is capable of encountering in their lifetime will be by far the cheaper option.

You might add to that the cheapness of spiking networks, where activity is very sparse, unlike traditional weighted networks where the whole network is always active. Spiking networks are generally seen as much cheaper, and that is possibly the reason the brain gets by on 20W or so of power. (By contrast towardsdatascience.com says: “It has been estimated that training GPT-3 consumed 1,287 MWh”. So that’s what? Roughly 1e9 Wh / 20 W ≈ 50 million hours of brain power? About 70 human lifetimes of power consumption at 20W.)
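
Spelling that arithmetic out (the 80-year lifetime is my assumption; using the exact 1,287 MWh figure pushes it a bit higher than the rounded estimate):

```python
# GPT-3 training energy vs. a brain's 20 W power budget.
# Assumption (mine): a human lifetime of ~80 years for the "lifetimes" conversion.
gpt3_wh = 1_287 * 1e6       # 1,287 MWh, as quoted, in watt-hours
brain_watts = 20            # rough power budget of a human brain
lifetime_years = 80

hours = gpt3_wh / brain_watts          # Wh / W = hours of brain operation
years = hours / (24 * 365)
lifetimes = years / lifetime_years

print(f"~{hours/1e6:.0f} million hours of brain power")  # ~64 million hours
print(f"~{years:,.0f} years")                            # ~7,300 years
print(f"~{lifetimes:.0f} human lifetimes")               # ~90 (the rounded 1,000 MWh gives ~70)
```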

Does it seem like the current compute cost is actually the cheapest solution?

Or does it seem possible the current compute cost is actually telling us, screaming at us, that current solutions are chasing an infinite solution set?

And our compute cost for each of an infinite number of solutions might be only the number of parallel compute elements, plus the time to synchronize their spikes each time.
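
To make that cost intuition concrete, here is a toy event-driven update step (entirely my own strawman, not anyone’s actual model): per synchronized tick, the work scales with the number of spikes and their fan-out, not with the size of the whole network, so with full parallelism the wall-clock cost is roughly one synchronization per tick.

```python
from collections import defaultdict

# Toy event-driven spiking step (illustrative only).
# synapses: presynaptic neuron -> list of (postsynaptic neuron, weight)
synapses = defaultdict(list)
synapses[0] = [(2, 0.6), (3, 0.5)]
synapses[1] = [(3, 0.7)]

def step(active_spikes, potentials, threshold=1.0):
    """Propagate one synchronized 'tick' of spikes.

    Work done is proportional to len(active_spikes) and their fan-out,
    not to the size of the whole network; silent neurons cost nothing.
    """
    for pre in active_spikes:                     # only spiking neurons are touched
        for post, w in synapses[pre]:
            potentials[post] = potentials.get(post, 0.0) + w
    # Neurons crossing threshold spike on the next tick, then reset.
    next_spikes = {n for n, v in potentials.items() if v >= threshold}
    for n in next_spikes:
        potentials[n] = 0.0
    return next_spikes

spikes = {0, 1}
potentials = {}
for t in range(3):
    spikes = step(spikes, potentials)
    print(f"tick {t + 1}: spiking neurons = {sorted(spikes)}")
```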

Yes, this lacks details. Short answer: I don’t know.

What I think is that the ultimate answer will be to sample neural firings in the cochlea and form a network from the time series of those firings.

Such cochlear nerve firing representations might vary randomly from individual to individual. Maybe even changing over time for a single individual (I think Walter Freeman notes that the rabbit’s representation for the smell of carrots changes).
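
I don’t know what the right construction from cochlear firings would be, but as a strawman of “form a network from the time series of those firings”: treat each cochlear channel as a node and strengthen an edge whenever two channels fire within a short window of each other. Every name, number, and the window parameter below are mine, purely for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Strawman: build a co-firing network from spike time series.
# spike_times: channel id -> sorted list of spike times (seconds). Fake data below.
spike_times = {
    "ch1": [0.010, 0.052, 0.090],
    "ch2": [0.012, 0.055],
    "ch3": [0.300, 0.340],
}

def cofiring_network(spike_times, window=0.005):
    """Edge weight = number of spike pairs from the two channels within `window`."""
    edges = defaultdict(int)
    for a, b in combinations(spike_times, 2):
        for ta in spike_times[a]:
            for tb in spike_times[b]:
                if abs(ta - tb) <= window:
                    edges[(a, b)] += 1
    return dict(edges)

print(cofiring_network(spike_times))
# {('ch1', 'ch2'): 2}  : ch1 and ch2 co-fire twice; ch3 stays disconnected.
```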

But for initial experiments I think the allocation of SDRs to letters could be whatever we like. And we experiment to see what sample size is needed to code what length of prior sequence. It might require some fiddling to get the number and sample size right to achieve resolution of different paths over sufficiently long sequences.

Why 2000 mini-columns? For letters wouldn’t that be 26 or so? Maybe 100 neurons/column to represent sequences of letters though, yes.
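
As a starting point for “allocate whatever we like”, something like the sketch below would do. The 26 columns and 100 cells per column are the numbers from the paragraph above; the 40-cell sample size is a placeholder, and finding the right value is exactly the fiddling mentioned earlier.

```python
import random

# Allocate an SDR per letter: one column per letter, 100 cells per column.
# The per-occurrence "sample" is a random subset of that letter's cells;
# SAMPLE_SIZE is the knob to fiddle with to resolve longer prior sequences.
CELLS_PER_COLUMN = 100
SAMPLE_SIZE = 40          # placeholder; the experiment is to vary this
LETTERS = "abcdefghijklmnopqrstuvwxyz"

# Letter -> the full set of cell ids in its column.
letter_cells = {
    letter: set(range(i * CELLS_PER_COLUMN, (i + 1) * CELLS_PER_COLUMN))
    for i, letter in enumerate(LETTERS)
}

def sample_sdr(letter, rng=random):
    """A random subset of the letter's cells, used for one occurrence."""
    return set(rng.sample(sorted(letter_cells[letter]), SAMPLE_SIZE))

s1, s2 = sample_sdr("a"), sample_sdr("a")
print(len(s1 & s2), "cells overlap between two samples of 'a'")  # ~16 expected (40*40/100)
```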

I don’t envisage us clustering on the y-axis, but on the time/x-axis.

If we’re allocating letter representations, we won’t need to plot those in clusters.

For the deeper cochlear neural activation encoding, then yes, we would need to plot structure below letters, so we could find the letters too.

But to a first approximation, if we allocate SDRs for letters, those can easily be translated back to letter labels (though we might allow the subsets of each letter SDR used to trace a sequence to vary). Note that allowing different subsets to code paths would be different to HTM. Using different paths of subsets for the same sequence could provide some sense of “strength” to a sequence. This would be in contrast to HTM, where the “strength” of a sequence is represented by “training” over multiple repetitions, but only one path through the column cells represents each sequence.
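
One way I can imagine cashing out “different subsets coding paths” (my sketch only, not HTM and not necessarily what you have in mind): let the prior context pick which subset of a letter’s cells fires, and count the distinct subset-paths accumulated for a sequence as its “strength”. All names and sizes below are invented for illustration.

```python
import hashlib
import random

CELLS_PER_COLUMN = 100
SUBSET_SIZE = 8

def context_subset(letter, context, salt=""):
    """Pick a subset of the letter's cells, determined by the prior context.

    Same (letter, context) -> same subset; different contexts -> (mostly)
    different subsets. `salt` lets the same sequence take a different path.
    """
    seed = hashlib.sha256(f"{letter}|{context}|{salt}".encode()).hexdigest()
    rng = random.Random(seed)
    return frozenset(rng.sample(range(CELLS_PER_COLUMN), SUBSET_SIZE))

def sequence_path(word, salt=""):
    """The path of (letter, subset) pairs a word traces through the columns."""
    return tuple((ch, context_subset(ch, word[:i], salt)) for i, ch in enumerate(word))

# "Strength" of a sequence = how many distinct subset-paths we have seen for it,
# in contrast to HTM's single path strengthened by repeated training.
paths_seen = {sequence_path("dog", salt=str(k)) for k in range(5)}
print(f"'dog' has {len(paths_seen)} distinct subset-paths after 5 presentations")
```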

Anyway, in the case where we allocate letter SDRs, you may be right (if I understand you) that we may be able to get away with just plotting “letter” spike times. The clustering underneath would still be using the information about sequence at a distance coded by the path through different subsets of each letter SDR. But we might not need to plot that. Just a plot for each letter might be enough to see structure appear at a level above the letter level.
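
And for “just plotting letter spike times” I imagine something like a raster, one row per letter, with any clustering done along the time axis. With toy data the gap at a word boundary is already visible without plotting anything below the letter level (purely illustrative):

```python
import matplotlib.pyplot as plt

# Fake letter spike times for "the cat" at 10 letters/second, with a pause at the space.
text = "the cat"
letter_times = [(ch, 0.1 * i + (0.3 if i > 3 else 0.0))
                for i, ch in enumerate(text) if ch != " "]

letters = sorted({ch for ch, _ in letter_times})
rows = {ch: i for i, ch in enumerate(letters)}

# One row per letter, one dot per occurrence: cluster/eyeball along the time axis.
plt.scatter([t for _, t in letter_times],
            [rows[ch] for ch, _ in letter_times])
plt.yticks(range(len(letters)), letters)
plt.xlabel("time (s)")
plt.ylabel("letter")
plt.title("Letter spike raster (toy data)")
plt.show()
```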
