Large Temporal Memory

mraptor · January 16, 2022, 9:39pm

Has anyone tested large TM … The question is what amount of data you can throw at TM before you need to create more of them.

What inspired me to ask this is GPT-3… I’m reading about it and it seems to me it is a giant TM.
What it does is:

   - predict next tokens (with probabilities) 
      - then pick one of them (based on cfg)
   - use the predicted token OR user provided token to predict next ........
      - ............repeat
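The loop above can be sketched with a toy first-order predictor standing in for the TM or GPT-3. Everything here (the bigram table, `predict_next`, `generate`, and the `pick` callback that plays the role of the "cfg") is a hypothetical illustration, not code from either system.

```python
from collections import Counter, defaultdict

def train_bigrams(tokens):
    """Count token -> next-token transitions (toy stand-in for a trained model)."""
    table = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        table[cur][nxt] += 1
    return table

def predict_next(table, token):
    """Return candidate next tokens with probabilities, most likely first."""
    counts = table.get(token, Counter())
    total = sum(counts.values())
    return [(tok, n / total) for tok, n in counts.most_common()]

def generate(table, start, steps, pick=lambda cands: cands[0][0], user_tokens=None):
    """Repeat: predict, pick one candidate (or take the user's token), feed it back."""
    user_tokens = user_tokens or {}
    out = [start]
    for step in range(steps):
        candidates = predict_next(table, out[-1])
        if step in user_tokens:            # user-provided token overrides the prediction
            out.append(user_tokens[step])
        elif candidates:
            out.append(pick(candidates))   # selection policy; greedy by default
        else:
            break                          # nothing was learned after this token
    return out

corpus = "it is time to ride my bike it is time to go".split()
table = train_bigrams(corpus)
generated = generate(table, "it", 5)
steered = generate(table, "it", 5, user_tokens={3: "go"})
```

Swapping `pick` for a sampling function would give the probabilistic selection; the greedy default just keeps the sketch deterministic.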

I haven’t experimented with “multiple-choice”, user-assisted VOMC, i.e. a TM with multiple SDR predictions, so I’m not sure whether the current architecture can handle it.

MaxLee · January 17, 2022, 1:03am

mraptor:

- predict next tokens (with probabilities) 
      - then pick one of them (based on cfg)
   - use the predicted token OR user provided token to predict next ........
      - ............repeat

I suspect where this would break down in HTM is that we’d require an accurate mapping of which predicted columns would activate which bits in the input space, which isn’t necessarily the case.

So at time step 1, the input space would encode some token, the SP would look at that and choose winning columns, and the TM would look at the winning columns and predict which columns are going to be active in the next time step. The result is that we’d have a prediction for winning columns. But then we’d need to translate these predicted winners into the next most-likely input-space encoding, which would have quite a bit of noise potential depending on how you choose the winning representation bits for a given column. That also assumes you’d then be able to transform those winning bits back into something human-understandable for introspection.

Or maybe the alternative is that we have a trained and frozen SP/TM system, then for every token in our corpus, we’d check its resulting SDR from the SP, and create a giant dictionary and map that SDR to the given input. Then at runtime, we simply lookup a given token’s SP SDR, check the resulting TM calculation’s predicted SP SDR, and feed THAT (next timestep’s predicted token representation SDR) into the generator. That’d take memory, but by acting as an encoding cache, we’d save on column competition calculations.
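A minimal sketch of that encoding cache, assuming a frozen SP. The function `fake_sp_encode` is a hypothetical stand-in (a hash-based pseudo-random SDR per token), and the sizes 1000 bits / 2% are assumptions, not values from a real trained system.

```python
import hashlib

N_BITS = 1000      # SDR width (assumed)
N_ACTIVE = 20      # 2% sparsity (assumed)

def fake_sp_encode(token):
    """Hypothetical stand-in for a frozen SP: a deterministic pseudo-random SDR."""
    bits = set()
    counter = 0
    while len(bits) < N_ACTIVE:
        digest = hashlib.sha256(f"{token}:{counter}".encode()).digest()
        bits.add(int.from_bytes(digest[:4], "big") % N_BITS)
        counter += 1
    return frozenset(bits)

def build_cache(vocabulary):
    """Map token -> SDR and SDR -> token once, so the SP need not run again."""
    token_to_sdr = {tok: fake_sp_encode(tok) for tok in vocabulary}
    sdr_to_token = {sdr: tok for tok, sdr in token_to_sdr.items()}
    return token_to_sdr, sdr_to_token

vocab = ["it", "is", "time", "to", "ride", "my", "bike", "car"]
token_to_sdr, sdr_to_token = build_cache(vocab)

# Exact-match decoding only works if the predicted SDR equals a cached SDR bit for bit.
decoded = sdr_to_token.get(token_to_sdr["bike"])
missing = sdr_to_token.get(frozenset({1, 2, 3}))
```

The reverse dictionary only decodes exact matches; a prediction that is even one bit off returns nothing, which is the noise problem raised above.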

Potentially have multiple independently trained SP/TMs working like this together so that you have some probability distributions based on differing predictions, and you’d basically have an HTM Forest.

Feel free to poke holes, as I just threw these thoughts out here.

cezar_t · January 17, 2022, 3:05am

Even with a single SDR prediction, I think the TM will output an SDR matching multiple possible “futures” when it thinks they have similar chances of occurring.

EDIT:
I don’t know much about transformers either, but my guess is that a similar process happens there: they do not produce several “next word” embeddings but one “fuzzy” output vector, which in vector space can be “close” (i.e. similar) to several “pure” word embeddings. There are API options to print either the closest matching dictionary word or a few words in its neighborhood, each with its own “probability”, which is just a measure of how well the respective word vector matches the model’s output.

What is unfortunate about transformers is the unreasonably huge (from a biological perspective) amount of training data they need in order to get a convincing language model.

mraptor · January 17, 2022, 5:19am

A giant lookup table is doable with the Keyvi project:

https://vsraptor.github.io/book/docs/misc/keyvi-index.html

If I remember correctly, I tested ~1 million key-value pairs and the access time was 50ns!

mraptor · January 17, 2022, 5:28am

Didn’t know that about transformers…

In the TM, I think that on prediction you can select multiple cells from the same column, but how do you form multiple SDRs? Maybe all permutations? Still, which one is the most probable?

mraptor · January 17, 2022, 5:36am

Maybe a Temporal Pooler (playing the role of the CFG, with different algorithms) could feed back to the TM and somehow play the role of selector. Hmm…

cezar_t · January 17, 2022, 6:51am

You share interesting ideas there on vsraptor.github.io

But I’m not yet convinced of Keyvi’s capabilities. While a 500ns exact-match search sounds impressive, I would expect that after the first retrieval %timeit mostly measures how fast the CPU cache is.

Also, how fast is fuzzy pattern matching on large strings or bit arrays? That could be orders of magnitude slower.

And most importantly, how feasible are complex similarity searches over large vectors? I think the technologies for that are a bit different, e.g. GitHub - erikbern/ann-benchmarks: Benchmarks of approximate nearest neighbor libraries in Python

cezar_t · January 17, 2022, 7:08am

The OpenAI API hides the embedding vectors which are the actual inner representation of words the transformer uses for input, output and intermediate layers.
E.g. in the largest GPT-2 model every dictionary word is a vector of 1600 floats. Think of every word as a point in a 1600-dimensional space.
In GPT-3 it is almost an order of magnitude larger.

The output is not a perfect match for any dictionary word, but another point “somewhere” within the same space. Given the “vastness” of the representation space, the actual words spat out by the API are chosen by how close their respective positions are to the output point.
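A toy sketch of that picture, assuming random word vectors in place of learned embeddings and a made-up five-word vocabulary. As far as I know, GPT-style models score each word by the dot product of the output vector with that word's embedding and pass the scores through a softmax, which is what `top_words` does here.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["car", "bike", "time", "ride", "my"]     # hypothetical vocabulary
dim = 64                                          # toy size; the post mentions 1600 for GPT-2
embeddings = rng.normal(size=(len(vocab), dim))   # random stand-ins for learned vectors

def top_words(output_vec, k=3):
    """Score every vocabulary word against the output vector, return the top k."""
    logits = embeddings @ output_vec              # dot-product similarity per word
    probs = np.exp(logits - logits.max())         # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(-probs)[:k]
    return [(vocab[i], float(probs[i])) for i in order]

# A "fuzzy" output halfway between two word vectors should score both of them highly.
output = 0.5 * (embeddings[0] + embeddings[1])
ranked = top_words(output)
```

The "probabilities" are relative match scores over the whole vocabulary, so several words can share the mass when the output vector sits between them.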

cezar_t · January 17, 2022, 7:40am

What I’m trying to say is that GPT/transformers do not make multiple predictions. They make only one, and what the API presents as multiple choices are the several words closest (in the high-dimensional space) to the actual predicted vector, by k-NN or some radius metric.

MaxLee · January 17, 2022, 5:19pm

An elegant explanation of Transformers and Attention, including a basic python implementation:

MaxLee · January 17, 2022, 5:26pm

I feel like this should be surmountable if you check the degree to which different columns are predictive. Issue then would be how to break ties, for equally predictive columns at the bottom of the list … Maybe k-means would still help? It’d require experimentation regardless.

mraptor · January 17, 2022, 7:36pm

url is : What is iHTM ? — My book

Self-promotion: visit my site https://myriad.websites , still working on the basics, no HTM stuff yet

not 500ns, but 50ns

And Keyvi does not support the fuzzy matching we need ;( … only Levenshtein distance… so only exact matches, like a lexicon (symbol => SDR) or an exact Classifier (indexed-SDR => indexed-SDR)

The good news is that it should handle a billion entries w/o degradation in speed.

mraptor · January 17, 2022, 7:50pm

hmm… but as @MaxLee mentioned, how do you map symbol-small-vec <==> sym-large-vec

mraptor · January 17, 2022, 7:52pm

How do you think we can emulate that with SDRs?

mraptor · January 18, 2022, 12:26am

I think I got it… let’s say we have a TM with 10 cells x 1000 columns (10_000 cells),

so the input is 1000 bits at 2% sparsity.

On prediction, multiple predicted cells per column are allowed.

The cells can be selected by a criterion:

 - WTA over more than 2% of the 10_000 cells, say 5%

with options
- by permanence threshold
- by feedback from the TPooler, if available

Using the predicted 5% of TM cells, generate the possible 1000-bit SDRs (where only 1 bit per column is allowed in the same SDR).

Having all possible permutations of the 1000-bit SDRs, rank them by the formula:

  Score = Sum( bit * cell.synapse.permanence)

Now that they are ranked, use your selection algorithm to pick the PREDICTION.

The OUTPUT is 1000 bits at 2%, so no lookup table is needed.

?? Did I miss something ??
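One observation on the ranking step, as a sketch under my reading of the proposal (each output bit marks a column, and Score is a plain sum over the chosen cells): because the score is additive, the top-ranked SDR can be found without enumerating permutations, by taking the best predicted cell in each column and then the best 2% of columns. The random scores and the name `best_sdr` are illustrative assumptions.

```python
import numpy as np

N_COLS, CELLS_PER_COL = 1000, 10
N_OUT = 20                                    # 2% of 1000 columns

def best_sdr(cell_scores, predicted_mask):
    """Pick the highest-scoring SDR with at most one predicted cell per column.

    cell_scores:    (N_COLS, CELLS_PER_COL) summed permanences per cell
    predicted_mask: same shape, True where the cell survived the 5% WTA
    """
    masked = np.where(predicted_mask, cell_scores, -np.inf)
    best_cell = masked.argmax(axis=1)         # best predicted cell in each column
    col_score = masked.max(axis=1)            # -inf for columns with no predicted cell
    candidates = np.flatnonzero(np.isfinite(col_score))
    top = candidates[np.argsort(-col_score[candidates])[:N_OUT]]
    columns = np.sort(top)
    return columns, best_cell[columns], float(col_score[columns].sum())

rng = np.random.default_rng(1)
scores = rng.random((N_COLS, CELLS_PER_COL))  # random stand-in for summed permanences
flat = scores.ravel()
threshold = np.sort(flat)[-int(0.05 * flat.size)]   # keep the top 5% of 10_000 cells
mask = scores >= threshold
columns, cells, total = best_sdr(scores, mask)
```

Lower-ranked alternatives could then be generated by swapping in the next-best column or cell, which is far cheaper than ranking every permutation.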

MaxLee · January 18, 2022, 5:32am

What does “WTA” mean here?

Still trying to read your thoughts a few dozen more times to make sure I understand them before I give a more lengthy, probably wrong reply.

cezar_t · January 18, 2022, 9:24am

Quoting your notebook:

%timeit ro.Get('brum').GetValue()

591 ns ± 3.12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

A realistic test should store e.g. 1000000 key/value pairs, shuffle all the keys, and then query the whole key list, to get a hint of how long it takes to retrieve data that is not in cache.
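A sketch of such a test. A plain Python `dict` is used as a stand-in for the keyvi index (swap in the real lookup call to measure keyvi itself); the function name and the key format are made up for illustration.

```python
import random
import time

def benchmark_lookups(n_pairs=1_000_000, seed=0):
    """Time one pass of lookups over all keys, visited in shuffled order."""
    store = {f"key{i}": i for i in range(n_pairs)}   # dict as stand-in for the index
    keys = list(store)
    random.Random(seed).shuffle(keys)                # avoid querying the same hot key

    start = time.perf_counter()
    hits = 0
    for key in keys:
        if store[key] is not None:
            hits += 1
    elapsed = time.perf_counter() - start
    return hits, elapsed / n_pairs * 1e9             # nanoseconds per lookup

hits, ns_per_lookup = benchmark_lookups(100_000)
print(f"{hits} lookups, {ns_per_lookup:.0f} ns each")
```

The figure includes Python loop overhead, so it is an upper bound on the lookup cost rather than a precise measurement.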

Regardless, even 5000ns (200k queries/sec) would still be amazing if it were able to search for matching patterns, e.g. by SDR overlap.

What you/we need is more of a nearest-neighbor search using an overlap metric, e.g. Jaccard index - Wikipedia

A library to fit this profile could be pynndescent

Yep, orders of magnitude slower, but even thousands of queries/sec instead of millions might be very useful, and faster than what you can accomplish by tweaking keyvi for similarity search.
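For reference, a brute-force version of that search in plain numpy, as a baseline sketch rather than the approximate index a library like pynndescent would build. The store of 500 random SDRs, the sizes, and the helper names are assumptions for illustration.

```python
import numpy as np

def jaccard_top_k(query, stored, k=2):
    """Return (index, similarity) of the k stored SDRs most similar to query.

    query:  boolean array of shape (n_bits,)
    stored: boolean array of shape (n_items, n_bits)
    """
    intersection = (stored & query).sum(axis=1)
    union = (stored | query).sum(axis=1)
    similarity = intersection / np.maximum(union, 1)   # guard against empty SDRs
    order = np.argsort(-similarity, kind="stable")[:k]
    return [(int(i), float(similarity[i])) for i in order]

def random_sdr(rng, n_bits=1000, n_active=20):
    """Random boolean SDR with exactly n_active bits set."""
    sdr = np.zeros(n_bits, dtype=bool)
    sdr[rng.choice(n_bits, size=n_active, replace=False)] = True
    return sdr

rng = np.random.default_rng(42)
stored = np.stack([random_sdr(rng) for _ in range(500)])
noisy = stored[7].copy()
noisy[np.flatnonzero(noisy)[:3]] = False               # drop 3 of the 20 active bits
result = jaccard_top_k(noisy, stored)
```

Here the noisy query still finds item 7 with similarity 17/20, which an exact-match index would miss entirely.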

cezar_t · January 18, 2022, 9:49am

To be more specific, let’s assume your “temporal memory” (could be any series predictor, e.g. Markovian) receives the series:

["It", "is", "time", "to", "ride", "my"]

And the predicted outputs are both “car” and “bike” with equal probability. Then I expect its output SDR to have roughly half of its bits overlapping the SDR encoding “car” and half overlapping the one for “bike”.
Since it won’t be an exact match for anything on record, and the number of possible exact combinations is huge, an exact-match lookup will not find it.

Querying the TM’s prediction for nearest neighbors, however, should return both “car” and “bike” with high likelihood, IF we assume a default orthogonality between (the SDRs of) different concepts. Yeah, it would spend 0.1-1ms, but we didn’t account for scale/parallelism, and 1ms is still faster than a single neuron can spike.
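The car/bike case can be sketched as follows, assuming random 1000-bit SDRs at 2% sparsity for each word (which are nearly orthogonal) and an ambiguous prediction built from half the bits of each. The lexicon and sizes are made up for illustration.

```python
import numpy as np

N_BITS, N_ACTIVE = 1000, 20
rng = np.random.default_rng(3)

def random_sdr():
    """Random SDR as a sorted array of active bit indices."""
    return np.sort(rng.choice(N_BITS, size=N_ACTIVE, replace=False))

lexicon = {word: random_sdr() for word in ["car", "bike", "time", "ride", "boat"]}

# Ambiguous prediction: half of the bits from "car", half from "bike".
half = N_ACTIVE // 2
prediction = np.union1d(lexicon["car"][:half], lexicon["bike"][:half])

# Overlap (shared active bits) between the prediction and every known word.
overlaps = {word: np.intersect1d(prediction, sdr).size for word, sdr in lexicon.items()}
ranked = sorted(overlaps.items(), key=lambda item: -item[1])
```

Both candidates overlap the prediction in at least 10 bits, while unrelated words are expected to share close to none, so an overlap threshold separates them cleanly.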

mraptor · January 18, 2022, 9:17pm

WTA == Winner takes all

i.e. let’s say the ACTIVE cells that are ON are (93,15,456,…) in the range 1 … 10_000;
match this iSDR against every cell:

sum[c] = Sum( cell[c].syn[i].perm for i in [93,15,456,…] )
predicted_winners = WTA(sum, 5%)

something like this
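A runnable numpy sketch of the two lines above. The dense permanence matrix, its 2% connectivity, and the reduced size of 2_000 cells (instead of 10_000, to keep memory small) are assumptions for illustration, not how any particular HTM implementation stores synapses.

```python
import numpy as np

N_CELLS = 2_000                     # the post uses 10_000; reduced to keep memory small
rng = np.random.default_rng(0)

# Hypothetical permanence matrix: perm[c, i] is the permanence of the synapse
# from presynaptic cell i onto cell c, and 0 where no synapse exists.
connected = rng.random((N_CELLS, N_CELLS)) < 0.02
perm = np.where(connected, rng.random((N_CELLS, N_CELLS), dtype=np.float32), 0)

def wta(scores, fraction):
    """Winner-takes-all: indices of the top `fraction` of cells by score."""
    k = max(1, int(round(fraction * scores.size)))
    winners = np.argpartition(-scores, k - 1)[:k]
    return np.sort(winners)

def predict_cells(perm, active, fraction=0.05):
    """sum[c] = sum of perm[c, i] over active cells i, then keep the top fraction."""
    sums = perm[:, active].sum(axis=1)
    return wta(sums, fraction), sums

active = rng.choice(N_CELLS, size=40, replace=False)   # 2% of cells active
predicted_winners, sums = predict_cells(perm, active)
```

`np.argpartition` avoids a full sort, which matters once the score vector has many thousands of cells.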

mraptor · January 18, 2022, 9:36pm

Ouch ;0 … remembered wrongly.

I asked the author about matching by overlap… not available, but it may be possible.

I was excited about Keyvi because on large sets it seems to be faster than numpy, bitarray or redis.

It seems that on some datasets FSTs are better than hashes.
