Large Temporal Memory

Has anyone tested a large TM? The question is how much data you can throw at a TM before you need to create more of them.

What inspired me to ask this is GPT-3… I’ve been reading about it, and it seems to me it is a giant TM.
What it does is (a rough sketch of this loop in code is below):

   - predict the next tokens (with probabilities)
      - then pick one of them (based on the config)
   - use the predicted token OR a user-provided token to predict the next one…
      - …and repeat
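
For concreteness, here is a minimal sketch of that loop in Python; `predict_next` is a hypothetical stand-in for whatever produces the next-token probabilities (a GPT-style model or a TM-based one), not a real API:

```python
import random

def predict_next(context_tokens):
    """Hypothetical model call: returns {token: probability} for the
    next token given the context (GPT-style or TM-based)."""
    raise NotImplementedError

def generate(prompt_tokens, n_steps, temperature=1.0):
    """Autoregressive loop: predict a distribution over next tokens,
    pick one (sampling here; greedy or top-k are the usual config knobs),
    append it to the context, and repeat."""
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        probs = predict_next(tokens)
        candidates = list(probs)
        weights = [p ** (1.0 / temperature) for p in probs.values()]
        tokens.append(random.choices(candidates, weights=weights, k=1)[0])
    return tokens
```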

I haven’t experimented with a “multiple-choice”, user-assisted VOMC, i.e. a TM with multiple SDR predictions, so I’m not sure if the current architecture can handle it?


I suspect where this would break down in HTM is that we’d require an accurate mapping of which predicted columns would activate which bits in the input space, and that mapping isn’t necessarily available.

So at time step 1, the input space would encode some token, the SP would look at that and choose winning columns, and the TM would look at those winning columns and predict which columns are going to be active in the next time step; the result is a prediction for the next set of winning columns. But then we’d need to take these predicted winners and translate them into the next most-likely input-space encoding, which has quite a bit of noise potential depending on how you choose the winning representation bits for a given column, and that assumes you’d then be able to take those winning bits and transform them back into something human-understandable for introspection.

Or maybe the alternative is that we have a trained and frozen SP/TM system; then for every token in our corpus we’d check its resulting SDR from the SP and build a giant dictionary mapping that SDR to the given input. At runtime we’d simply look up a given token’s SP SDR, check the predicted SP SDR from the TM calculation, and feed THAT (the next time step’s predicted token-representation SDR) into the generator. That’d take memory, but by acting as an encoding cache we’d save on column-competition calculations.
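
A rough sketch of that dictionary idea, assuming a frozen, trained SP/TM wrapped in two hypothetical callables (`sp_encode(token)` returning the active column indices and `tm_predict(sdr)` returning the predicted columns); these names don’t come from any particular HTM library:

```python
def build_sdr_cache(corpus_tokens, sp_encode):
    """Run every distinct token through the frozen SP once and memoize
    both directions: token -> SDR and SDR -> token."""
    token_to_sdr, sdr_to_token = {}, {}
    for tok in set(corpus_tokens):
        sdr = frozenset(sp_encode(tok))       # active column indices
        token_to_sdr[tok] = sdr
        sdr_to_token[sdr] = tok
    return token_to_sdr, sdr_to_token

def predict_next_token(prev_token, token_to_sdr, sdr_to_token, tm_predict):
    """Look up the cached SP SDR for the token, ask the TM for the
    predicted columns, then map that predicted SDR back to a token.
    In practice this last step needs a nearest-match search, since the
    predicted SDR will rarely equal a cached one exactly."""
    predicted_sdr = frozenset(tm_predict(token_to_sdr[prev_token]))
    return sdr_to_token.get(predicted_sdr)    # None unless it matches exactly
```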

Potentially you could have multiple independently trained SP/TM pairs working together like this, so that you get probability distributions based on their differing predictions, and you’d basically have an HTM forest.

Feel free to poke holes, as I just threw these thoughts out here :grin:.


Even with single-SDR prediction, I think the TM will output an SDR matching multiple possible “futures” when it thinks they have similar chances of occurring.

EDIT:
I don’t know much about transformers either, but my guess is that a similar process happens there: they do not produce several “next word” embeddings, but one “fuzzy” output vector which, in vector space, can be “close” (aka “similar”) to several “pure” word embeddings. There are API options to print out either the closest matching dictionary word or a few words in its neighborhood, each with its own “probability”, which is just a measure of how well the respective word vector matches the model’s output.

What is unfortunate about transformers is the unreasonably huge (from a biology perspective) amount of training data they need in order to produce a convincing language model.


A giant lookup table is doable with the Keyvi project:

https://vsraptor.github.io/book/docs/misc/keyvi-index.html

If I remember correctly, I tested ~1 million key/value pairs and the access time was 50 ns!


I didn’t know that about transformers…

In TM, I think that on prediction you can select multiple cells from the same column, but how do you form multiple SDRs? Maybe all permutations? Still, which one is most probable?

Maybe a temporal pooler (playing the role of the config, with different algorithms!) could have feedback to the TM and somehow play the role of the selector. Hmmm…

You share interesting ideas there on vsraptor.github.io

But I’m not yet convinced by Keyvi’s capabilities. While a 500 ns exact-match search sounds impressive, I would expect that after the first retrieval %timeit is mostly measuring how fast the CPU cache can be accessed.

Also, how fast is fuzzy pattern matching on large strings or bit arrays? That could be orders of magnitude slower.

And most importantly, how feasible is complex pattern-similarity search in large vectors? I think the technologies for that are a bit different, e.g. GitHub - erikbern/ann-benchmarks: Benchmarks of approximate nearest neighbor libraries in Python

The OpenAI API hides the embedding vectors, which are the actual inner representation of words that the transformer uses for its input, output and intermediate layers.
E.g. in GPT-2 every dictionary word is a vector of 1600 floats; think of every word as a point in a 1600-dimensional space.
In GPT-3 it is almost an order of magnitude larger.

The output is not a perfect match of any dictionary word, but another point which lands “somewhere” within the same space. Given the vastness of the representation space, the actual words spit out by the API are chosen by how close their respective positions are to the output point.
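
A toy illustration of that lookup (all numbers and names are made up; the real vocabulary matrix isn’t exposed by the API): score every dictionary vector against the output point by cosine similarity and softmax the best matches into “probabilities”.

```python
import numpy as np

def nearest_words(output_vec, vocab_vecs, vocab_words, k=5):
    """Rank dictionary words by cosine similarity to the model's output
    point and turn the top-k scores into pseudo-probabilities."""
    v = output_vec / np.linalg.norm(output_vec)
    m = vocab_vecs / np.linalg.norm(vocab_vecs, axis=1, keepdims=True)
    sims = m @ v                               # cosine similarity per word
    top = np.argsort(sims)[::-1][:k]
    probs = np.exp(sims[top])
    probs /= probs.sum()
    return [(vocab_words[i], float(p)) for i, p in zip(top, probs)]

# toy usage: a 10,000-word vocabulary of 1600-float vectors (GPT-2-sized)
vocab = np.random.randn(10_000, 1600)
words = [f"word{i}" for i in range(10_000)]
print(nearest_words(np.random.randn(1600), vocab, words, k=3))
```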


What I’m trying to say is that GPT/transformers do not make multiple predictions; they make only one, and what the API presents as multiple choices are the several words closest (in the high-dimensional space) to the actual predicted vector, by k-NN or some radius metric.


An elegant explanation of Transformers and Attention, including a basic python implementation:

I feel like this should be surmountable if you check the degree to which different columns are predictive. The issue then would be how to break ties between equally predictive columns at the bottom of the list :frowning:… Maybe k-means would still help? It’d require experimentation regardless.

The URL is: What is iHTM ? — My book

Self-promotion: visit my site https://myriad.websites , still working on the basics, no HTM stuff yet :wink:

not 500ns, but 50ns

And Keyvi does not support the fuzzy matching we need ;( … only Levenshtein distance… so only exact matches, like a lexicon (symbol => SDR) or an exact classifier (indexed-SDR => indexed-SDR).

The good news is it should handle billions of entries without degradation in speed.


Hmm… but as @MaxLee mentioned, how do you map symbol-small-vec <==> symbol-large-vec?

How do you think we can emulate that with SDRs?

I think I got it … let’s say we have a TM of 10x1000 (10 cells per column, 1000 columns),

so the input is 1000 bits at 2% sparsity.

On prediction, multiple predicted cells per column are allowed.

The cells can be selected by some criterion:

 - WTA, taking more than the 2% of the 10,000 cells, say 5%

with options:
 - by permanence threshold
 - by feedback from the temporal pooler, if available

Using the predicted 5% of TM cells, generate the possible 1000-bit SDRs (where only one predicted cell per column is allowed to contribute to the same SDR).

Having all possible such 1000-bit SDRs, rank them by the formula:

  Score = Sum( bit * cell.synapse.permanence )

Now that they are ranked, use your selection algorithm to pick the PREDICTION.

  • The OUTPUT is 1000 bits at 2%, so no lookup table is needed

?? Did I miss something ??
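
One possible reading of this in code (a sketch only: it assumes each predicted cell already carries its Sum(bit * permanence) credit, and it exploits the fact that the Score is a sum of independent per-column terms, so the top-ranked SDR can be built greedily instead of enumerating every permutation):

```python
import numpy as np

def top_prediction_sdr(cell_scores, n_active=20):
    """cell_scores: (1000 columns, 10 cells) array holding each predicted
    cell's Sum(bit * synapse.permanence) credit, 0 where the cell is not
    predicted. Because the Score of a candidate SDR is just the sum of
    its per-column terms, the best 1000-bit / 2% SDR is given by the
    n_active columns whose best predicted cell scores highest."""
    col_best = cell_scores.max(axis=1)            # best cell per column
    winners = np.argsort(col_best)[::-1][:n_active]
    sdr = np.zeros(cell_scores.shape[0], dtype=bool)
    sdr[winners] = True
    return sdr

# toy usage: random credits for a 1000-column x 10-cell TM
scores = np.random.rand(1000, 10) * (np.random.rand(1000, 10) < 0.05)
print(np.flatnonzero(top_prediction_sdr(scores)))
```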

What does “WTA” mean here?

Still trying to read your thoughts a few dozen more times to make sure I understand them before I give a more lengthy, probably wrong reply :smiley: .

Quoting your notebook:

%timeit ro.Get('brum').GetValue()

591 ns ± 3.12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

A realistic test should store e.g. 1,000,000 key/value pairs, shuffle all the keys, then query the whole key list, in order to get a hint of how long it takes to retrieve data which is not in cache.
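
Something along these lines, for example. A plain dict stands in for the store; to benchmark keyvi the same way, swap the lookup for the `ro.Get(k).GetValue()` call from the notebook:

```python
import random
import time

N = 1_000_000
store = {f"key{i}": f"value{i}" for i in range(N)}   # placeholder store

keys = list(store.keys())
random.shuffle(keys)              # defeat the hot-cache effect %timeit enjoys

t0 = time.perf_counter()
for k in keys:
    _ = store[k]                  # for keyvi: ro.Get(k).GetValue()
elapsed = time.perf_counter() - t0
print(f"{elapsed / N * 1e9:.0f} ns per lookup over {N:,} shuffled keys")
```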

Regardless, even 5000 ns (200k queries/sec) would still be amazing if it were able to search for matching patterns, e.g. by SDR overlap.

What you/we need would be more of a nearest-neighbor search using an overlap metric, e.g. Jaccard index - Wikipedia

A library to fit this profile could be pynndescent

Yep, orders of magnitude slower, but even thousands of queries/sec instead of millions might be very useful, and faster than what you could accomplish by tweaking keyvi for similarity search.
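
A sketch of what that could look like, assuming pynndescent’s `NNDescent` index with its Jaccard metric (small toy sizes so it runs quickly):

```python
import numpy as np
from pynndescent import NNDescent   # pip install pynndescent

# toy data: 20,000 binary SDRs of 1000 bits at 2% sparsity
rng = np.random.default_rng(0)
sdrs = np.zeros((20_000, 1000), dtype=np.uint8)
for row in sdrs:
    row[rng.choice(1000, size=20, replace=False)] = 1

# approximate nearest-neighbor index under a Jaccard (overlap-style) metric
index = NNDescent(sdrs, metric="jaccard")

# query: the 10 stored SDRs most similar to some (possibly noisy) SDR
neighbor_ids, distances = index.query(sdrs[:1], k=10)
print(neighbor_ids[0], 1.0 - distances[0])   # similarity = 1 - Jaccard distance
```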


To be more specific, let’s assume your “temporal memory” (which could be any series predictor, e.g. Markovian) receives the series:

["It", "is", "time", "to", "ride", "my"] 

And the predicted outputs are both “car” and “bike” with equal probability; then I expect its output SDR to have roughly half of its bits overlapping the SDR encoding “car” and half overlapping the one for “bike”.
Since that SDR won’t exactly match anything on record, and the number of possible exact combinations is huge, an exact lookup won’t help.

Querying the TM’s prediction for nearest neighbors, however, IF we assume a default orthogonality between the SDRs of different concepts, should return both “car” and “bike” with high likelihood. Yeah, it would spend 0.1-1 ms, but we haven’t accounted for scale/parallelism, and 1 ms is still faster than a single neuron can spike.
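
A tiny numpy illustration of that point, with made-up random SDRs standing in for the encodings: build a prediction from half of “car”’s bits and half of “bike”’s, then score every vocabulary SDR by raw overlap.

```python
import numpy as np

rng = np.random.default_rng(1)
n_bits, n_active = 1000, 20

def random_sdr():
    s = np.zeros(n_bits, dtype=bool)
    s[rng.choice(n_bits, size=n_active, replace=False)] = True
    return s

vocab = {w: random_sdr() for w in ["car", "bike", "horse", "boat"]}

# an ambiguous TM-style prediction: half of "car"'s bits plus half of "bike"'s
prediction = np.zeros(n_bits, dtype=bool)
prediction[np.flatnonzero(vocab["car"])[: n_active // 2]] = True
prediction[np.flatnonzero(vocab["bike"])[: n_active // 2]] = True

# overlap scoring pulls both intended words far above the (near-orthogonal) rest
for word, sdr in vocab.items():
    print(word, int((prediction & sdr).sum()))
```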


WTA == Winner takes all

I.e., let’s say the ACTIVE cells that are ON are (93, 15, 456, …) out of the range 1…10_000.
Match this iSDR against every cell:

sum[c] = Sum( cell[c].syn[i].perm for i in [93, 15, 456, …] )
predicted_winners = WTA(sum, 5%)

Something like this.
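
A small numpy/scipy version of that sketch, with a random sparse permanence matrix standing in for the real TM’s segments:

```python
import numpy as np
from scipy.sparse import random as sparse_random

n_cells = 10_000                      # 1000 columns x 10 cells per column

# perm[c, i]: permanence of cell c's synapse onto presynaptic cell i
# (random and sparse here, purely for illustration)
perm = sparse_random(n_cells, n_cells, density=0.001, format="csr", random_state=2)

def predicted_winners(active_cells, perm, winner_frac=0.05):
    """sum[c] = sum of cell c's permanences over the active cells,
    then WTA (winner takes all): keep the top winner_frac of cells."""
    scores = np.asarray(perm[:, active_cells].sum(axis=1)).ravel()
    k = int(perm.shape[0] * winner_frac)
    return np.argsort(scores)[::-1][:k]

active = [15, 93, 456]                # the example iSDR indices
print(predicted_winners(active, perm)[:10])
```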

Ouch ;0 … I remembered wrongly.

I asked the author about matching by overlap … it’s not available, but it may be possible.

I was excited about Keyvi because on large sets it seems to be faster than numpy, bitarray or redis.

It seems that on some datasets FSTs are better than hashes.