Has anyone tested a large TM? … The question is how much data you can throw at a TM before you need to create more of them.
What inspired me to ask this is GPT-3… I've been reading about it and it seems to me it is a giant TM.
What it does is (roughly sketched below):
- predict the next tokens (with probabilities)
- pick one of them (based on config)
- use the predicted token OR a user-provided token to predict the next one
- repeat
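A minimal sketch of that loop in Python (the `model.predict_next` call is a hypothetical placeholder, not the actual GPT-3 API):

```python
import random

def generate(model, prompt_tokens, n_steps):
    """Autoregressive loop: predict, pick, append, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        # hypothetical call returning {token: probability} for the next position
        probs = model.predict_next(tokens)
        # pick one token according to the predicted distribution (the "config" step)
        choices, weights = zip(*probs.items())
        next_token = random.choices(choices, weights=weights, k=1)[0]
        # a user-provided token could be substituted here instead
        tokens.append(next_token)
    return tokens
```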
Haven't experimented with "multiple-choice", user-assisted VOMC, i.e. a TM with multiple SDR predictions!! So I'm not sure if the current architecture can handle it?
I suspect where this would break down in HTM is that we'd require an accurate mapping of which predicted columns would activate which bits in the input space, and that mapping isn't necessarily available.
So at time step 1 the input space would encode some token, the SP would look at that and choose winning columns, and the TM would look at the winning columns and predict which columns are going to be active in the next time step… the result is that we'd have a prediction for winning columns. But then we'd need to take these predicted winners and translate them into the next most-likely input space encoding, which has quite a bit of noise potential depending on how you choose the winning representation bits for a given column, and that assumes you could then take those winning bits and transform them back into something human-understandable for introspection.
Or maybe the alternative is that we have a trained and frozen SP/TM system; then for every token in our corpus, we'd check its resulting SDR from the SP, and create a giant dictionary mapping that SDR to the given input. Then at runtime, we simply look up a given token's SP SDR, check the resulting TM calculation's predicted SP SDR, and feed THAT (the next timestep's predicted token representation SDR) into the generator. That'd take memory, but by acting as an encoding cache, we'd save on column competition calculations.
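A rough sketch of that cache idea, assuming SDRs are represented as frozensets of active bit indices and `sp.compute` / `tm.predict` are placeholders for whatever the frozen SP/TM expose:

```python
def build_sdr_cache(sp, vocabulary):
    """Map each token to its SP SDR (and back) once, with learning switched off."""
    token_to_sdr = {t: frozenset(sp.compute(t)) for t in vocabulary}  # placeholder SP call
    sdr_to_token = {sdr: t for t, sdr in token_to_sdr.items()}
    return token_to_sdr, sdr_to_token

def predict_next_token(tm, token, token_to_sdr, sdr_to_token):
    """Run the TM on the cached SDR and map the predicted SDR back to a token."""
    predicted = frozenset(tm.predict(token_to_sdr[token]))            # placeholder TM call
    if predicted in sdr_to_token:                 # exact, noise-free match
        return sdr_to_token[predicted]
    # otherwise fall back to the cached SDR with the largest bit overlap
    best = max(sdr_to_token, key=lambda sdr: len(sdr & predicted))
    return sdr_to_token[best]
```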
Potentially have multiple independently trained SP/TMs working like this together so that you have some probability distributions based on differing predictions, and you’d basically have an HTM Forest.
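Something like the following could turn such a forest of independently trained SP/TM pairs into a crude probability distribution, assuming each pipeline object wraps the lookup from the previous sketch behind a hypothetical `predict_next_token(token)` method:

```python
from collections import Counter

def forest_predict(pipelines, token, k=3):
    """Poll each independently trained SP/TM pipeline and count the votes."""
    votes = Counter(p.predict_next_token(token) for p in pipelines)
    total = sum(votes.values())
    # top-k predicted tokens with their share of the vote as a rough probability
    return [(tok, count / total) for tok, count in votes.most_common(k)]
```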
Feel free to poke holes, as I just threw these thoughts out here.
Even with single-SDR prediction, I think the TM will output an SDR matching multiple possible "futures" if it thinks they have similar chances of occurring.
EDIT:
I don't know much about transformers either, but my guess is that a similar process happens there: they do not produce several "next word" embeddings but one "fuzzy" output vector which, in vector space, can be "close" (aka "similar") to several "pure" word embeddings. There are API options to either print out the closest matching dictionary word or a few words in its neighborhood, each with its own "probability", which is just a measure of how well the respective word vector matches the model's output.
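A toy illustration of that "closest words to one fuzzy vector" idea (cosine similarity plus a softmax over the top-k; the real models score words through their output layer, so treat this only as an analogy):

```python
import numpy as np

def nearest_words(output_vec, embeddings, vocab, k=5):
    """Rank dictionary words by cosine similarity to the single output vector
    and turn the top-k scores into pseudo-probabilities."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    out = output_vec / np.linalg.norm(output_vec)
    scores = emb @ out                        # cosine similarity to every word vector
    top = np.argsort(scores)[::-1][:k]
    probs = np.exp(scores[top]) / np.exp(scores[top]).sum()
    return [(vocab[i], float(p)) for i, p in zip(top, probs)]
```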
What is unfortunate about transformers is the unreasonably huge (from a biology perspective) amount of training data they need in order to get a convincing language model.
In the TM I think on prediction you can select multiple cells from the same column, but how do you form multiple SDRs? Maybe all permutations! Still, which one is most probable?
But I'm not yet convinced by Keyvi's capabilities; while a 500 ns exact-match search sounds impressive, I suspect that after the first retrieval %timeit is mostly measuring how fast CPU cache access is.
Also, how fast is fuzzy pattern matching on large strings or bit arrays? That could be orders of magnitude slower.
The OpenAI API hides the embedding vectors which are the actual inner representation of words the transformer uses for input, output and intermediate layers.
E.g. in GPT-2 every dictionary word is a vector of 1600 floats. Think of every word as a point in a 1600-dimensional space.
In GPT-3 it is almost an order of magnitude larger.
The output is not a perfect match of any dictionary word, but another point that lands "somewhere" within the same space. Given the vastness of the representation space, the actual words spit out by the API are chosen by how close their respective positions are to the output point.
What I'm trying to say is that GPT/transformers do not make multiple predictions; they make only one, and what the API presents as multiple choices are the several words closest (in the high-dimensional space) to the actual predicted vector, by k-NN or some radius metric.
I feel like this should be surmountable if you check the degree to which different columns are predictive. The issue then would be how to break ties for equally predictive columns at the bottom of the list… Maybe k-means would still help? It'd require experimentation regardless.
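One way to experiment with that, assuming you can get a per-column "predictiveness" score out of the TM (e.g. the number of cells in a predictive state); the scoring itself is a hypothetical placeholder:

```python
def pick_predicted_columns(column_scores, k):
    """Keep the k most predictive columns and report the ones tied at the
    cutoff score, since those are the cases that need a tie-break rule."""
    ranked = sorted(column_scores.items(), key=lambda cs: -cs[1])
    winners = ranked[:k]
    cutoff = winners[-1][1] if winners else 0
    tied_out = [col for col, score in ranked[k:] if score == cutoff]
    return [col for col, _ in winners], tied_out
```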
Self-promotion: visit my site, https://myriad.websites . Still working on the basics, no HTM stuff yet.
not 500ns, but 50ns
And keyvi does not support the fuzzy matching we need ;( … only Levenshtein distance… so effectively only exact matches, like a lexicon (symbol => SDR) or an exact classifier (indexed-SDR => indexed-SDR).
The good news is it should handle billions of entries without degradation in speed.
%timeit ro.Get('brum').GetValue()
591 ns ± 3.12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
A realistic test should store e.g. 1000000 key/value pairs, shuffle all keys, query the whole key list in order to get a hint of how long it takes to retrieve data which is not in cache.
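A sketch of such a test, reusing the `ro.Get(...).GetValue()` call from the snippet above and assuming `ro` has already been compiled with the same 1,000,000 keys:

```python
import random
import time

keys = ["key%07d" % i for i in range(1_000_000)]
random.shuffle(keys)                      # defeat any cache-friendly access order

start = time.perf_counter()
for k in keys:
    ro.Get(k).GetValue()                  # same lookup as the %timeit line above
elapsed = time.perf_counter() - start
print("%.0f ns per lookup" % (elapsed / len(keys) * 1e9))
```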
Regardless, even 5000 ns (200k queries/sec) would still be amazing if it were able to search for matching patterns, e.g. by SDR overlap.
What you/we need would be more of a nearest-neighbor search using an overlap metric, e.g. the Jaccard index (Wikipedia).
A library fitting this profile could be pynndescent (see the sketch below).
Yep, orders of magnitude slower, but even thousands of queries/sec instead of millions might be very useful, and faster than what you could accomplish by tweaking keyvi into doing similarity search.
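A hedged sketch with pynndescent, assuming its NNDescent index accepts dense binary vectors with the "jaccard" metric (sizes below are just illustrative):

```python
import numpy as np
from pynndescent import NNDescent

n, width, active = 10_000, 2048, 40        # toy corpus of binary SDR vectors
rng = np.random.default_rng(0)
sdrs = np.zeros((n, width), dtype=np.uint8)
for row in sdrs:
    row[rng.choice(width, size=active, replace=False)] = 1

index = NNDescent(sdrs, metric="jaccard")           # approximate overlap-based index
neighbors, distances = index.query(sdrs[:1], k=10)  # nearest SDRs to a (fuzzy) query
```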
To be more specific, let's assume your "temporal memory" (could be any series predictor, e.g. Markovian) receives the series:
["It", "is", "time", "to", "ride", "my"]
If the predicted outputs are both "car" and "bike" with equal probability, then I expect its output SDR to have roughly half of its bits overlapping the SDR encoding "car" and half overlapping the one for "bike".
Such a blended prediction won't exactly match anything on record, since the number of possible exact combinations is huge.
Querying the TM's prediction for its nearest neighbors, however, IF we assume default (near-)orthogonality between the SDRs of different concepts, should return both "car" and "bike" with high likelihood. Yeah, it would spend 0.1-1 ms per query, but we haven't accounted for scaling/parallelism, and 1 ms is still faster than a single neuron can spike.
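A self-contained toy version of the "car"/"bike" case, with random SDRs standing in for real encodings (sizes are arbitrary):

```python
import random

def random_sdr(rng, n_bits=2048, on_bits=40):
    """Toy SDR: a set of active bit indices."""
    return set(rng.sample(range(n_bits), on_bits))

def jaccard(a, b):
    return len(a & b) / len(a | b)

rng = random.Random(42)
vocab = {w: random_sdr(rng) for w in ["car", "bike", "house", "tree"]}

# Fake TM output: roughly half of "car"'s bits plus half of "bike"'s bits.
prediction = set(list(vocab["car"])[:20]) | set(list(vocab["bike"])[:20])

for word, sdr in sorted(vocab.items(), key=lambda kv: -jaccard(prediction, kv[1])):
    print(word, round(jaccard(prediction, sdr), 3))
# "car" and "bike" score far above the unrelated words, assuming
# the SDRs of different concepts are (nearly) orthogonal.
```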