I suspect where this would break down in HTM is that we’d need an accurate mapping from predicted columns back to the bits they correspond to in the input space, and that mapping isn’t necessarily available.
So at time step 1, the input space would encode some token, the SP would look at that and choose winning columns, and the TM would look at those winning columns and predict which columns are going to be active at the next time step… the result is that we’d have a prediction for winning columns. But then we’d need to take those predicted winners and translate them into the next most likely input-space encoding, which would have quite a bit of noise potential depending on how you choose the winning representation bits for a given column, and that assumes you’d then be able to take those winning bits and transform them back out into something human-understandable for introspection.
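Roughly, the forward half of that loop (a minimal sketch; `Encoder`, `sp`, and `tm` here are hypothetical stand-ins with made-up method names, not a real HTM library API):

```python
# Hypothetical sketch of the per-timestep loop described above; none of
# these classes or method names come from an actual HTM implementation.
def predict_next_columns(encoder, sp, tm, token):
    input_sdr = encoder.encode(token)         # token -> input-space bits
    active_columns = sp.compute(input_sdr)    # SP picks winning columns
    tm.compute(active_columns)                # TM updates its cell states
    return tm.get_predictive_columns()        # columns expected active at t+1
```

The missing piece is the reverse direction: going from those predicted columns back through the SP and the encoder, which neither component is designed to invert.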
Or maybe the alternative is that we have a trained and frozen SP/TM system; then, for every token in our corpus, we’d check its resulting SDR from the SP and create a giant dictionary mapping that SDR to the given input. At runtime, we’d simply look up a given token’s SP SDR, read off the TM’s predicted SP SDR, and feed THAT (the next time step’s predicted token-representation SDR) into the generator. That’d take memory, but by acting as an encoding cache we’d save on column-competition calculations.
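A sketch of that cache idea (same made-up interfaces as above; `sp`, `tm`, `encoder`, and `vocabulary` are assumed to already exist, trained and frozen):

```python
# Build phase: map every vocabulary token to its SP output SDR, once.
token_to_sdr = {}
sdr_to_token = {}
for token in vocabulary:
    sdr = frozenset(sp.compute(encoder.encode(token), learn=False))
    token_to_sdr[token] = sdr
    sdr_to_token[sdr] = token

# Runtime: skip the SP entirely and decode the TM's prediction by lookup.
def predict_next_token(token):
    tm.compute(token_to_sdr[token], learn=False)
    predicted = frozenset(tm.get_predictive_columns())
    # An exact hit will be rare, so fall back to the best-overlap entry.
    best = max(sdr_to_token, key=lambda s: len(s & predicted))
    return sdr_to_token[best]
```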
Potentially we could have multiple independently trained SP/TM pairs working like this together, so that their differing predictions give you a probability distribution, and you’d basically have an HTM Forest.
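As a sketch, the forest could just vote, assuming each model wraps its own SP/TM pair behind something like the hypothetical `predict_next_token` above:

```python
from collections import Counter

def forest_predict(models, token):
    """Aggregate the forest's votes into a crude probability distribution."""
    votes = Counter(model.predict_next_token(token) for model in models)
    total = sum(votes.values())
    return {tok: count / total for tok, count in votes.items()}
```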
Feel free to poke holes, as I just threw these thoughts out here.
Even with a single SDR prediction, I think the TM will output an SDR matching multiple possible “futures” when it considers them similarly likely.
I don’t know much about transformers either, but my guess is a similar process happens there: they do not produce several “next word” embeddings but one “fuzzy” output vector, which in vector space can be “close” (aka “similar”) to several “pure” word embeddings. There are API options to print out either the closest matching dictionary word or a few words in its neighborhood, each with its own “probability”, which is just a measure of how well the respective word vector matches the model’s output.
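I don’t know the actual OpenAI internals, but my rough mental model of where those “probabilities” come from is scoring the single output vector against every word embedding and normalizing:

```python
import numpy as np

def top_k_words(output_vec, embedding_matrix, vocab, k=5):
    # One similarity score per dictionary word, then softmax-normalize
    # so the scores read as "probabilities".
    logits = embedding_matrix @ output_vec
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    best = np.argsort(probs)[::-1][:k]
    return [(vocab[i], float(probs[i])) for i in best]
```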
What is unfortunate about transformers is the unreasonably huge (from a biology perspective) amount of training data they need in order to get a convincing language model.
The OpenAI API hides the embedding vectors, which are the transformer’s actual inner representation of words at the input, output, and intermediate layers.
e.g. in GPT-2 every dictionary word is a vector of 1600 floats. Think of every word as a point in a 1600-dimensional space.
In GPT-3 it is almost an order of magnitude larger.
The output is not a perfect match of any dictionary word, but another point which is “somewhere” within the same space. Given the vastness of the representation space, the actual words spit out by the API are chosen by how close their respective positions are to the output point.
What I’m trying to say is that GPT/transformers do not make multiple predictions; they make only one, and what the API presents as multiple choices are the several words closest (in the high-dimensional space) to the actual predicted vector, by K-NN or some radius metric.
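In K-NN terms, that’s the same picture with a distance metric instead of probabilities (illustrative numpy, not any real API):

```python
import numpy as np

def knn_words(output_vec, embedding_matrix, vocab, k=5):
    # Cosine similarity between the predicted point and every word's point.
    sims = (embedding_matrix @ output_vec) / (
        np.linalg.norm(embedding_matrix, axis=1) * np.linalg.norm(output_vec))
    return [vocab[i] for i in np.argsort(sims)[::-1][:k]]
```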
I feel like this should be surmountable if you check the degree to which different columns are predictive. The issue then would be how to break ties for equally predictive columns at the bottom of the list… maybe k-means would still help? It’d require experimentation regardless.
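One cheap tie-breaking scheme (a sketch, assuming we can count predictive cells per column): shuffle first, then stable-sort by score, so columns that are equally predictive at the cutoff get picked at random:

```python
import random

def top_predicted_columns(predictive_cell_counts, n):
    cols = list(predictive_cell_counts.items())
    random.shuffle(cols)  # randomize order among equal scores
    # Python's sort is stable, so ties keep their shuffled order.
    cols.sort(key=lambda kv: kv[1], reverse=True)
    return [col for col, _ in cols[:n]]
```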
To be more specific, let’s assume your “temporal memory” (it could be any series predictor, e.g. Markovian) receives the series:
["It", "is", "time", "to", "ride", "my"]
And if the predicted outputs are both “car” and “bike” with equal probability, then I’d expect its output SDR to have roughly half of its bits overlapping the SDR encoding “car” and the other half overlapping the one for “bike”.
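A toy version of that “half and half” output, with made-up 40-bit-sparse encodings in a 2048-bit space:

```python
import random

car = set(random.sample(range(2048), 40))
bike = set(random.sample(range(2048), 40))
# The prediction keeps its normal sparsity by taking ~half the bits of each.
predicted = set(random.sample(sorted(car), 20)) | set(random.sample(sorted(bike), 20))
print(len(predicted & car), len(predicted & bike))  # ~20 each; a random SDR scores ~0
```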
Such an output won’t be an exact match for anything on record, since the space of possible exact bit combinations is huge.
Querying the TM’s prediction for nearest neighbors, however, IF we assume a default orthogonality between the SDRs of different concepts, should return both “car” and “bike” with high likelihood. Yes, it would spend 0.1-1 ms, but we haven’t accounted for scale/parallelism, and 1 ms is still quite a bit faster than a single neuron can spike.
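The lookup itself is just an overlap count against every stored concept SDR, e.g.:

```python
def nearest_concepts(predicted_sdr, concept_sdrs, k=2):
    # Overlap (bit intersection) as the similarity metric; near-orthogonal
    # random SDRs mean unrelated concepts score close to zero.
    return sorted(concept_sdrs,
                  key=lambda name: len(concept_sdrs[name] & predicted_sdr),
                  reverse=True)[:k]

# nearest_concepts(predicted, {"car": car, "bike": bike, "horse": horse})
# -> ["car", "bike"] for the half-and-half prediction above
```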