Triadic Memory — A Fundamental Algorithm for Cognitive Computing

@POv - I think there's a very simple way to overcome memory size limitations.
“Canonically”, for e.g. DiadicMemory, one needs to allocate a 2D byte array of N*(N-1)/2 rows (one per pair of bit positions) and N columns.

During mem.store(X, Y), the bits of X are expanded into P*(P-1)/2 row addresses (one per pair of ON bits), and Y is added as a pattern of bits (e.g. 00010011 …) into each of those rows.

One problem with this is that for large-ish SDRs, e.g. a 10 kbit SDR, one would have to allocate 0.5 trillion bytes of memory.

There is a simple trick to allocate an array with as many rows as you like: simply reduce each row address modulo the number of available rows (see the sketch after the list below).

Yes, errors will happen sooner, but:

  1. the user has the option of trading off memory space against store() capacity.
  2. with a flat, 1D array (as in the C code) there is virtually no restriction on SDR length. SDRs can be e.g. P=10, N=100,000 long, and the semantics become very rich, because each bit can represent a certain unique property.
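
To make the folding concrete, here is a minimal Python/NumPy sketch (class and parameter names are mine, purely illustrative, not the repository code): a DiadicMemory-style store whose number of rows is fixed up front, with every pair address reduced modulo that row count.

import numpy as np

class HashedDiadicMemory:
    # X and Y are sorted arrays of ON-bit indices (P of them each, out of N).
    def __init__(self, N=1000, P=10, rows=100_000):
        self.N, self.P = N, P
        # ~100 MB here vs ~0.5 GB for the canonical N*(N-1)/2 rows
        self.mem = np.zeros((rows, N), dtype=np.uint8)

    def _rows(self, x):
        # all pair addresses (i < j) of the ON bits of x, folded into the
        # available rows by a modulo reduction
        n_rows = self.mem.shape[0]
        return [(i * self.N + j) % n_rows
                for a, i in enumerate(x) for j in x[a + 1:]]

    def store(self, x, y):
        for r in self._rows(x):
            self.mem[r, y] += 1          # add Y's bit pattern into the row

    def query(self, x):
        sums = self.mem[self._rows(x)].sum(axis=0, dtype=np.int64)
        return np.sort(np.argsort(sums)[-self.P:])   # P most active columns

Fewer rows means pair addresses collide earlier, which is exactly the memory-vs-capacity trade-off of point 1 above.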

@DanML

I had Numba core-dumping on me, without any other messages, while rewriting an older, similar module.

I tracked it down to the fact that I had inadvertently accessed a NumPy array out of its range (an addressing mistake).


That’s a brilliant idea, Cezar.

Well, it might be. :stuck_out_tongue:

One can imagine a “global” space of 1 Mbit SDRs where one:

  • can create an arbitrary number of smaller (1-100 kbit) sub-domains, allocated randomly, to represent each … whatever: speech, Chinese, math, feeding, home town, Ukraine, chess, driving and so on.
  • not only will they seldom collide (an occasional one-pair collision between two different 1%-narrow sub-spaces is quite unlikely and of low relevance), but it also allows cross-domain correlation SDRs (contexts/concepts/ideas) to be formed whenever and wherever they are needed or appropriate.

Maybe this is how the brain works…
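
A toy Python illustration of the sub-domain idea (the global size, helper names and numbers here are made up for the example): a sub-domain is just a random subset of the global bit positions, so SDRs from different sub-domains live in the same space and can be combined directly.

import numpy as np

GLOBAL_N = 1_000_000                      # the "global" bit space
rng = np.random.default_rng(0)

def allocate_subdomain(size):
    # a sub-domain is a random selection of global bit positions
    return np.sort(rng.choice(GLOBAL_N, size=size, replace=False))

speech = allocate_subdomain(10_000)       # two 1% sub-domains
chess  = allocate_subdomain(10_000)

def to_global(subdomain, local_on_bits):
    # map a sub-domain-local SDR (indices of ON bits) into global positions
    return subdomain[np.asarray(local_on_bits)]

# the sub-domains themselves share only ~100 of their 10,000 positions,
# so a sparse SDR from one sub-domain almost never collides with one from
# another on more than a stray bit; a cross-domain concept can simply
# union bits drawn from several sub-domains
print(len(np.intersect1d(speech, chess)))
concept = np.union1d(to_global(speech, range(10)), to_global(chess, range(10)))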

well… some sort of equivalent mechanics for “wide spreading” and retrieval of “knowledge” across a huge space makes sense.

I would wager that for every bit out of the very large N there's also a small processing unit (could be a cortical minicolumn?) which makes predictions by looking at the rest of the active bits within its context SDR.

I got a bit carried away - I thought it could discard the need for a fixed N. But N is still needed: all retrieved “rows” have to be sum()-ed, so N has to be the row length.


I tried this - Diadic is fine, since it uses a 2D memory array.
Triadic, however, with flattened 1D storage (as in the C code), gets excruciatingly slow.
Even the cubic 3D TriadicMemory gets very slow when searching for X instead of Z.


PS: the currently pushed sdrsdm.py can't retrieve X and Y at all, due to a misspelled function call.


While thinking about scaling TM, I figured out that an even simpler trick can be applied to the small family of Di-/TriadicMemory and my SDR Indexer.


Here's how it works:

  • assume a 10 kbit TriadicMemory. That's 1000× bigger and slower than a puny 1 kbit one - a whopping 1 TByte - and at 1% sparsity each search/update accesses 1 million locations (bytes actually, so that's 1M/8 memory transfers).
  • we get ambitious and want to spread it across 10 machines.
  • For conceptual simplicity, assume there is a Dispatcher node which is the API front-end receiving storage and retrieve requests for full 10 kbit SDRs.
  • recall that a TriadicMemory storing (X,Y) → Z is simply a cube matrix of dimensions N×N×N, where N is the encoding size of the SDRs it operates with.
  • Now, every node out of the 10, instead of storing the whole cube, stores only a slice of it. Let's say we slice it by Z.
  • Each node, instead of storing all 10 kbits of Z, stores only 1 kbit: node 1 stores the first kbit, node 10 the tenth. As easy as that.
  • When querying the “cube” for patterns X, Y, each node restores its own 1 kbit of the response and sends it back to the Dispatcher node, which simply concatenates the 10 responses to restore the original Z.
  • the slicing can be done on any axis, not only Z.
  • the difference when slicing by e.g. Y is that every node, instead of resolving 1 kbit of the response, computes a 10 kbyte partial response,
  • which means the Dispatcher node has to receive and add up all 10 × 10 kbyte partial responses in order to restore the original Z. That sucks up ~100× more bandwidth than receiving 10 × 1 kbit responses to concatenate.
  • That doesn't mean slicing by Z is the best thing to do, because each node still has to do its own 10k × 10k address decoding, and at 1% sparsity it has to retrieve from memory 10k × 1 kbyte (≈ 10 Mbyte) of data per query, which still might be prohibitive; slicing by X, Y or both diminishes the compute/memory bandwidth demand per node proportionally with the number of slices in the XY plane.
  • so probably the best way to optimize the whole behemoth is to slice in two stages: first by Z, with every Z slice then cut into 4×4 “cubes” in the XY plane, each with its own node.
    Every Z slice has its own sub-dispatcher that acts as a front-end to its 16 node cubes.

It sounds massive (and it is), but it is conceptually/architecturally very simple, and what is most beautiful about it is that there's full flexibility in scaling both by node capacity (compute, memory size & bandwidth) and by network bandwidth. A sketch of the slice-by-Z variant is given below.

Imagine a 512-bit AVX CPU adding 64 bytes at a time per clock.
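
Here is a toy, single-process Python sketch of the slice-by-Z arrangement (the Dispatcher/node class names and APIs are invented for illustration; for simplicity each node returns its local counter sums and the dispatcher picks the global top P winners, rather than each node binarizing its own kilobit as described above):

import numpy as np

class ZSliceNode:
    # holds only the N x N x band slab of the (X, Y) -> Z counter cube
    def __init__(self, N, z_lo, z_hi):
        self.z_lo, self.z_hi = z_lo, z_hi
        self.cube = np.zeros((N, N, z_hi - z_lo), dtype=np.uint8)

    def store(self, x, y, z):
        # x, y, z are sorted arrays of ON-bit indices; keep only the Z bits
        # that fall into this node's band
        z_local = z[(z >= self.z_lo) & (z < self.z_hi)] - self.z_lo
        if z_local.size:
            self.cube[np.ix_(x, y, z_local)] += 1

    def partial_sums(self, x, y):
        # sum the addressed P x P sub-plane for every local Z column
        return self.cube[np.ix_(x, y)].sum(axis=(0, 1), dtype=np.int64)

class Dispatcher:
    def __init__(self, N=200, P=4, slices=10):       # tiny demo sizes
        band = N // slices
        self.P = P
        self.nodes = [ZSliceNode(N, k * band, (k + 1) * band)
                      for k in range(slices)]

    def store(self, x, y, z):
        for node in self.nodes:                      # remote calls in a real deployment
            node.store(x, y, z)

    def query_z(self, x, y):
        # reassemble the full-length response from the per-node partial
        # responses, then keep the P most active Z bits
        sums = np.concatenate([n.partial_sums(x, y) for n in self.nodes])
        return np.sort(np.argsort(sums)[-self.P:])

Slicing by X or Y instead only changes which axis each node's slab covers and what form of partial response it sends back, as discussed in the list above.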

This seems like a good approach to scale up Triadic Memory.

Incidentally, I just pushed a memory-optimized C implementation of the Dyadic Memory algorithm (effectively a Kanerva SDM) which works for dimensions up to 20,000.


Interesting, it looks like it is dynamically allocated?

Right, here the hard storage locations are dynamically allocated sparse data structures.
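
The actual C data structures aren't shown here, but the general idea of lazily allocated hard locations can be sketched in a few lines of Python (dictionary-based, purely illustrative): a row only consumes memory once something has been written to it.

from collections import defaultdict

class SparseDyadicMemory:
    def __init__(self, N=20_000, P=10):
        self.N, self.P = N, P
        # row address -> {column: counter}; untouched rows occupy no memory
        self.rows = defaultdict(lambda: defaultdict(int))

    def _addresses(self, x):
        # one address per pair (i < j) of ON bits of x
        return [i * self.N + j for k, i in enumerate(x) for j in x[k + 1:]]

    def store(self, x, y):
        for a in self._addresses(x):
            row = self.rows[a]
            for j in y:
                row[j] += 1

    def query(self, x):
        sums = defaultdict(int)
        for a in self._addresses(x):
            for j, c in self.rows.get(a, {}).items():
                sums[j] += c
        # the P most active columns form the recalled SDR
        return sorted(sorted(sums, key=sums.get, reverse=True)[: self.P])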

If you pull two C functions out of main() - one for writing and one for querying - probably more people would get interested in playing with it as a library instead of via the command line, which for large tests gets tedious and cripples its performance.


Scheme translations of triadicmemory.c and triadic_test.py: triadicmemory.ss, triadic-test.ss

$ triadic-test.ss x       
Testing TriadicMemory with 100000 SDRs of size 1000 and solidity of 10 ON bits
100000 vectors generated in 145ms
100000 triples stored in 1237ms
100000 queries in 8826ms

Query speed is similar for {_,y,z} and {x,y,_} - about 12K/sec; {x,_,z} is about 4K/sec


That's cool! I can't speak Scheme, but the short explanatory text of the algorithm is quite well written too.

Great to have a Scheme version of the algorithm! The timings look really impressive.

I made some modifications so that the same code can be used as a library or command line program.


A quick implementation in Julia … results are not that great but I'm still working on it: GitHub - alior101/TriadicMemory: Triadic Memory Algorithm (pull request sent to POv).
If there are Julia masters here, please comment… I'm seeing an unexpectedly high number of memory allocations on querying.

julia> include("triadicmemory.jl")
inserting 10000 vectors
preparing 10000 vectors took 0.467523515 seconds: 21389.3 per sec
inserting 10000 vectors took 0.130208133 seconds: 76800.12 per sec
  5.879034 seconds (3.47 M allocations: 2.144 GiB, 3.52% gc time, 4.33% compilation time)
querying z 10000 vectors took 5.879033064 seconds: 1700.96 per sec, 0 errors
  5.884058 seconds (2.86 M allocations: 2.111 GiB, 1.55% gc time, 2.31% compilation time)
querying y 10000 vectors took 5.884057576 seconds: 1699.51 per sec, 0 errors
  0.657309 seconds (2.86 M allocations: 2.111 GiB, 11.48% gc time, 19.37% compilation time)
querying x 10000 vectors took 0.657307676 seconds: 15213.58 per sec, 0 errors

Happy to see a Julia implementation … thanks Lior!

Can you please give more details on the “In the default nybble mode both xyz counts and zyx counts are stored in each byte” trick in your Scheme implementation? I'm curious to understand how your code achieves better speed than C …

When storing {x,y,z}, the standard nested loops (do x… (do y… (do z…))) are followed by (do z… (do y… (do x…))) loops incrementing the top 4 bits (“nybble”) of each byte: store code (lines 177-199).

Then, to recall x from {_,y,z}, the corresponding (do z… (do y… (do x…))) loops are used (lines 257-265). The indexing calculations use Scheme R6RS fixnum operations, for which Chez Scheme (with appropriate compiler options, lines 81-91) produces good code.
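
In rough NumPy terms (illustrative only, not the Scheme code; counter saturation at 15 and other details are glossed over), the packing amounts to this: each (x,y,z) count goes into the low nybble at flat address x·N²+y·N+z and also into the high nybble at z·N²+y·N+x, so both recall directions scan contiguous memory.

import numpy as np

N, P = 100, 4                               # tiny demo sizes (N=1000 would need ~1 GB)
mem = np.zeros(N * N * N, dtype=np.uint8)   # one flattened byte cube

def store(x, y, z):
    # x, y, z are arrays of ON-bit indices
    for i in x:
        for j in y:
            mem[i * N * N + j * N + z] += 1      # xyz count in the low nybble
    for k in z:
        for j in y:
            mem[k * N * N + j * N + x] += 16     # zyx count in the high nybble

def recall_z(x, y):
    sums = np.zeros(N, dtype=np.int64)
    for i in x:
        for j in y:
            base = i * N * N + j * N
            sums += mem[base:base + N] & 0x0F    # contiguous scan of low nybbles
    return np.sort(np.argsort(sums)[-P:])

def recall_x(y, z):
    sums = np.zeros(N, dtype=np.int64)
    for k in z:
        for j in y:
            base = k * N * N + j * N
            sums += mem[base:base + N] >> 4      # contiguous scan of high nybbles
    return np.sort(np.argsort(sums)[-P:])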