Temporal Memory Performance/Scaling Tricks

I’ve seen performance and memory-size questions popping up a few times.

One main cause is that the TM footprint grows as the product of column count x cells per column x segments per cell x synapses per segment (and x 8 bytes per synapse, which is discussed elsewhere).

Since every synapse needs to be checked for upstream activations, all of that memory has to be streamed through the CPU every time step. That alone puts a very serious constraint on speed, regardless of what computing optimisations are made. A 4 GByte space cannot be processed (read in to check for updates elsewhere) much faster than ~10 times/second.
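
To make the scale concrete, here is a back-of-envelope sketch. All parameter values are illustrative (the column/cell/segment/synapse counts happen to match a model size mentioned later in this thread), and the bandwidth figure is an assumption for a desktop-class machine, not a measurement:

```python
# Rough footprint and step-rate estimate; every number below is an assumption.
columns              = 4_000
cells_per_column     = 32
segments_per_cell    = 32
synapses_per_segment = 32
bytes_per_synapse    = 8

footprint_bytes = (columns * cells_per_column * segments_per_cell
                   * synapses_per_segment * bytes_per_synapse)
print(f"{footprint_bytes / 2**30:.2f} GiB of synapse storage")   # ~1 GiB

usable_bandwidth = 40e9   # assumed bytes/s a desktop CPU can stream from DRAM
model_bytes = 4e9         # the 4 GByte example above
print(f"~{usable_bandwidth / model_bytes:.0f} full passes over the model per second")
```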

Meanwhile, in biological neurons we can assume a time step at least an order of magnitude faster.


One trivial “solution” for scaling on e.g. 10 computers is to simply instantiate one TM on each of them, all receiving the same input SDRs, and sum their column predictions/activations to compute a “global” output state.

While it could allow different machines to spot different pattern correlations (which is useful), it would also make them all converge on the same … most obvious patterns that are easy to recognize. Which is bad, because they become redundant to a large degree, and that sabotages our scaling idea.


The trick I want to propose is supposed to avoid that redundancy by forcing each column, in any instance of the temporal memory, to make connections only with (cells from) a small subset of the other columns.

Here’s an example:
2000 columns
each column sees only input from a random subset of 400 other columns.

On every other machine the same column will “read” a different subset of presynaptic columns (sketched in code after the list), so:

  • synaptic overlaps between any two machines are 5 times less likely (400/2000)
  • overall, across the 10 machines any column has a “full perception field” since 400 x 10 is significantly larger than 2000.
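
A minimal sketch of that restriction, assuming the numbers above (2000 columns, 400-column subsets, 10 machines); the NumPy layout and function names are mine, not from any existing HTM codebase:

```python
import numpy as np

N_COLUMNS = 2000
SUBSET = 400
N_MACHINES = 10

def presynaptic_subsets(machine_seed):
    """For one machine: the 400 columns each column is allowed to listen to."""
    rng = np.random.default_rng(machine_seed)
    return np.array([rng.choice(N_COLUMNS, SUBSET, replace=False)
                     for _ in range(N_COLUMNS)])

subsets = [presynaptic_subsets(m) for m in range(N_MACHINES)]

# Check the two bullet points above for one column:
col = 0
a, b = set(subsets[0][col]), set(subsets[1][col])
print(len(a & b) / SUBSET)      # ~0.2: two machines share ~400/2000 of the field
union = set().union(*(set(s[col]) for s in subsets))
print(len(union) / N_COLUMNS)   # ~0.9: across 10 machines the column sees
                                # nearly the full 2000-column space
```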

PS: since machines with a few cores and 4 GB of memory can be cheap, it might not even be that expensive to scale to dozens of machines and >100 cores.
Remember: IF memory bandwidth is the bottleneck, then 8 x (4-core, dual-channel memory) machines might be cheaper AND faster than a single high-end 8-channel memory server.

2 Likes

With memory (taking DDR4 PC4-23466 as the example) the transfer capacity is 2933 million transfers per second, and the resulting 64-bit bandwidth is 23.5 GB/sec “per channel”.

One performance issue arises when memory is not read in sequence and a separate memory address is needed: this requires the DIMM to access a different address, which can take 21 clock cycles or around 14 ns. The next sequential address in the DIMM is read pre-emptively (by most BIOS defaults), so when the CPU requests that data it’s already there.

If you’re reading random addresses, that 14 ns throttles the transfer rate right back to ~71.5 million requests per second, or at a 64-bit access size only ~571 MB/sec, per channel / process thread.
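
For reference, the arithmetic behind those figures (the 14 ns latency and the DDR4-2933 numbers are taken from the paragraphs above):

```python
# Streaming vs random-access bandwidth per channel, using the figures above.
transfers_per_sec = 2933e6            # DDR4-2933: million transfers per second
bytes_per_transfer = 8                # 64-bit bus
streaming_bw = transfers_per_sec * bytes_per_transfer      # ~23.5 GB/s per channel

random_latency = 14e-9                # ~21 clock cycles, as quoted above
random_requests = 1 / random_latency                        # ~71.4 million/s
random_bw = random_requests * bytes_per_transfer            # ~571 MB/s per channel

print(f"streaming: {streaming_bw / 1e9:.1f} GB/s, random: {random_bw / 1e6:.0f} MB/s")
```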

The drawback with multi-dimensional arrays is that only one dimension is actually contiguous in memory and can make the most of the DIMM’s pre-emptive read-ahead. Every other dimension hits the 14 ns change-of-address lookup.
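
A small, HTM-agnostic demonstration of that effect (array shape and dtype are arbitrary; only the relative timings matter):

```python
import time
import numpy as np

a = np.ones((8192, 8192), dtype=np.float64)   # ~512 MB, C order: rows are contiguous

t0 = time.perf_counter()
row = a[100, :].copy()    # 8192 contiguous values: one streaming read (~64 KB)
t1 = time.perf_counter()
col = a[:, 100].copy()    # 8192 values spaced 64 KB apart: a new cache line each
t2 = time.perf_counter()

print(f"row copy: {1e6 * (t1 - t0):.0f} us, column copy: {1e6 * (t2 - t1):.0f} us")
```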

The 14 ns lookup time can be worked around with multi-threading, which then allows multiple memory channels to be used (2 in a typical laptop, 4 in a typical desktop; AMD server types have 8 channels). The other option is to scale out by adding computers into the mix, and then network latency gets interesting.

Channels per CPU win out only if multi-threaded properly. GPUs win on bandwidth because they read, say, 512 bits at a time, but that means all your data has to be in the 512-bit chunk (and also spread across many threads). They still take that CAS latency hit on address changes that are not pre-emptively read-ahead cached.

With the way HTM is written at the moment, if each full layer were separated out onto a separate computer (to keep each layer at say 2 GB), then each iteration (layer pass) would require a network hop of say 1 MB of data, or around 1-2 ms per layer if using 1 Gbit.

If you are then separating the layers, why not create branching layer-type splits (1 in to say 5 out) or merges from separate smaller layers with different inputs (keeping each layer at say 1 GB)? Just a thought, as I have not thought through the implications of this yet.

Updating would then become a staggered, pipeline-type operation across all the machines… which could result in performance per input of around 2 ms, or 500 Hz. Pessimistically taking a 10x hit, we still end up at 50 Hz.
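
Rough pipeline arithmetic behind that figure; the stage count and the 2 ms per stage are the assumed numbers from the posts above, not measurements:

```python
# Staggered pipeline: throughput is set by one stage, latency by all of them.
n_layers = 10          # hypothetical number of machines, one layer each
stage_time = 2e-3      # seconds per layer pass (compute + network hop), assumed

throughput = 1 / stage_time          # inputs/second once the pipeline is full
latency = n_layers * stage_time      # time for one input to traverse every layer

print(f"~{throughput:.0f} Hz throughput, ~{latency * 1e3:.0f} ms end-to-end latency")
print(f"pessimistic 10x hit -> ~{throughput / 10:.0f} Hz")
```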

2 Likes

Wait, what layers are we talking about?

If I look at Matt’s lesson I see there is no assumption of connections across layers; there are only lateral connections, and the so-called columns are simply sets of cells, probably named that way after their neurological … ancestors.

…Or did I get it wrong?

Apologies, I’m looking at the process from the angle of making each full HTM model smaller and then linking these smaller HTM nodes together. Does one model have to recognise all the parts of the pattern together in one huuuuge model, or can smaller models signal their outputs into an aggregator?

I’m thinking along the lines of adding the H (hierarchy) by layering full models.

Within Matt’s lesson 9, for example, that is what I would class as a model. Keep it smaller to reduce the permutation explosion, and then pass only the winners into the next layers / hierarchy, i.e. all the green cells stop at that model’s layer and do not pass through.

The picture in episode 11 at 3:13 shows that, for that type of distal arrangement, breaking up such high-density cross-connectivity across machines (huge network latency impact) would not really work well. The point I’m mainly looking at is that deep learning went down the route of a biologically implausible, very deep architecture to gain skill… Why not go the biological route and keep the models small (as per a column in the brain) and then build those together? A finger sense learns the pattern of a surface, but that output sense of knowing the pattern is then used in the hierarchy to match / LTP / correlate with other patterns. One column does not learn everything; having one thing learn everything is the deep learning workaround.

2 Likes

OK, the idea is to keep each local TM local, in the sense that they do not make (hence do not broadcast) connections across machines.
Only activation results are broadcast after each step.
E.g. if each instance has 16 cells/column, the aggregated columns would have 160 cells (16 cells x 10 machines).

So in a certain sense yes, it is a pool of smaller temporal memories spread across machines, but instead of dividing the input space (e.g. the 2-kbit SDR) they divide the lateral connections, while still inter-meshing them very tightly.

Maybe a shuffling example makes it more comprehensible:
Let’s say each column has a limited “radius” within which it can spread synapses.

In the 2000-column example the radius is 200, which means column 600 can connect to columns 400 to 800, etc.
Column 0 connects to columns 1800 through 200: columns 0 and 1999 are adjacent, closing the column space into a sort of … “ring”.

Now this is indeed limiting, but each TM instance generates its own hidden shuffler. When it receives a new SDR, the bits are shuffled with a random permutation that is unique and local to that instance of temporal memory.

Because the “radius” above is only 200 (= a 400-column-wide lateral “window”), any column on one machine will not have the same neighbors as on another machine, except for ~20% of them (400/2000).

After each TM instance computes its activations, it shuffles the result back to the correct order by applying the inverse permutation. This way all of them report a consistent view of the input space.
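
Here is a minimal sketch of that shuffle / ring-radius scheme. Class and method names are hypothetical, and the actual TM step is left as a stub:

```python
import numpy as np

N_COLUMNS = 2000
RADIUS = 200                      # lateral reach of +/-200 columns in shuffled space

class ShuffledTM:
    """One machine's TM instance, seeing the columns through its own permutation."""
    def __init__(self, seed):
        rng = np.random.default_rng(seed)
        self.perm = rng.permutation(N_COLUMNS)   # shuffled position -> original column
        self.inv = np.argsort(self.perm)         # original column -> shuffled position

    def neighbors(self, column):
        """Original-order columns this column may form lateral synapses with."""
        pos = self.inv[column]
        window = np.arange(pos - RADIUS, pos + RADIUS) % N_COLUMNS   # ring wrap-around
        return self.perm[window]

    def compute(self, active_columns):
        """Shuffle the input SDR, run the (stubbed) TM step, un-shuffle the output."""
        shuffled_in = self.inv[np.asarray(active_columns)]
        shuffled_out = self._tm_step(shuffled_in)      # stand-in for the real TM step
        return self.perm[shuffled_out]                 # back to the shared column order

    def _tm_step(self, shuffled_active):
        return shuffled_active                         # stub: echo the active columns
```

With a different seed per machine, any two instances share only ~20% of a column’s lateral window on average, while together they still cover the whole column space.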

I hope this makes more sense.

I think I see where you’re going, although would that approach still end up bottlenecked before you scale to 10 machines?

In a 10-machine group the network would have to deal with 10 x 10 lateral message packets, which for say 1000 synapses (say 64 bits each) is ~80 Kb, or ~80 MB/sec of cross-lateral messaging bandwidth per 1 Hz iteration.

A 10 Gbit network would then scale to around 100 lateral connections/second?

Still gives a scalability boost of around 10x, if my thoughts are in line with what you’re intending?

OK, first: there are no synapses transferred, only the input/global SDR and the local output SDRs.
Let’s look at a single node:

  1. Packets in and out:
  • it receives 1 SDR for input
  • sends out 9 SDRs to the other nodes
  • receives 9 SDRs from the other nodes

  2. What is passed as a message is the list of columns (out of 2000) becoming active (1-5%).
    On average that’s <200 bytes/packet in this case,
    which in a sense is 20 packets/time step.

  3. Alternatively, one can have a dispatcher node that:

  • broadcasts the input SDR
  • receives 1 output SDR from each node
  • broadcasts back the result of XOR-ing the 10 SDRs, so each node knows which columns need a boost.
    In which case only the dispatcher needs special bandwidth.

With wider SDRs, e.g. 20000 columns, and more machines, e.g. 100, that’s the dispatcher handling 201 messages x ~2 kbyte per time step.
My math says that’s under 5 ms on a 1 Gbit link; I doubt a “normal machine” (which is what I’m targeting as nodes) could even process SDRs that big within 50 ms.
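
The traffic math behind those numbers, assuming active-column indices are packed as 16-bit integers (2 bytes each):

```python
# Message sizes and dispatcher traffic for the two cases discussed above.
def sdr_message_bytes(n_columns, sparsity, bytes_per_index=2):
    """Bytes needed to send the list of active column indices."""
    return int(n_columns * sparsity) * bytes_per_index

# 2000 columns, 1-5% active: well under the ~200 bytes/packet estimate
print(sdr_message_bytes(2000, 0.02))       # ~80 bytes

# 20000 columns, 100 nodes: the dispatcher handles ~201 messages per time step
msg_bytes = sdr_message_bytes(20000, 0.05)     # ~2000 bytes (~2 kbyte) per SDR
traffic = 201 * msg_bytes                      # ~400 kbyte per time step
gbit_payload = 125e6                           # 1 Gbit/s is ~125 MB/s of payload
print(f"~{traffic / gbit_payload * 1e3:.1f} ms of link time per step")   # under 5 ms
```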

PS: just to be clear, the neural network isn’t “stretched” across 10 nodes, it is “cut” into 10 smaller networks.

Of course there is no way to be sure this works until it is tested.
What I think could make it work is that each node’s network has a different and specific … connection pattern.

E.g. the synapses of column 8 will listen to cells of column 1200 only on machines 1 and 5.
Only these two machines can make “associations” between these two columns, so overall the machines will form quite different links.

1 Like

Got it.

Broadcast traffic is great for some applications, but every machine on the switch then ends up having to filter the packets (decode and process them to determine whether it is an intended recipient).

This is the scatter-gather pattern, where scale is then throttled by the gather process. Separately, you have to work out how the system “fails” gracefully if one of the nodes fails to respond or has a longer delay than expected (due to, say, the underlying OS deciding to do something at realtime priority and using the CPU for a few ms more than expected). Recovery / synchronised saves also start to enter into the process.

Network switches are great bits of kit for being a direct 1-1 path for directed traffic, and most (even 10+ year old, e.g. a 3Com 4800G from circa 2009) 1 Gbit 48-port switches can handle around 100 million small packets per second (fewer at jumbo frame sizes) with only ~10 µs latency. This makes the switch a neat option for scaling initially to at least 48 “nodes”, vs >100 soft nodes attached to a cluster of threads on the machines at 2+ per processor (memory-channel balanced).

Get an old InfiniBand switch (20/40 Gbit) and they are limited to 36 ports, but the latency and throughput can change the perspective quite a bit. That is still OLD kit, as current InfiniBand is in the 400 Gbit ballpark, or 400x the typical Gbit link we have played with in the past, and that is ignoring shared PCI-x type options of the near future that end up at the Tbit level. The network at a 1 Gbit level looks like it may be an issue, but if the code works someone would very happily run it on a 400 Gbit network with 100s of nodes… hardware is not the problem, the right code is.

Scatter gather with 100+ nodes becomes an increasingly interesting problem.

China’s latest 37-million-processor, exascale, multi-petabyte-memory machine cluster is the type of multi-magnitude hardware step that is possible for the right code. Try scatter-gather on 37 million nodes…

Have a look at what Meta Platforms is doing with CXL memory for a different perspective. Gaming network approaches for virtual environments are quite interesting as well.

You could set up, say, Proxmox or Hyper-V on a single machine and implement a virtual cluster of say 2-8 nodes locally to experiment with (to sort out the network logic, as it can get very difficult). This is the route I went initially: no cloud overheads (internet speeds, time, etc.).

Cloud hosting is OK, but with a local-host-type setup you can throttle the network speed and hardware for testing far more easily, and you also get consistent performance (monitoring is easier). Initially 1-2 GB per host can work well. Performance as a whole may be appalling, but it’s great for resolving unforeseen issues with the code and logic.

Well, they are. Every node needs to know if an active column was predicted by at least one other node in order to decide whether to boost(*) or not.

(*) boost or burst? I don’t know, at this stage I feel HTM terminology is a mess. Boosting is a certain machine learning algorithm, bursting is a certain neuron (or group of neurons) behavior, and Numenta uses them interchangeably as if they were the same.

Returning to the question of Temporal Memory performance (speed and permanence space), the HTM-scheme data structures may be of interest.

The HTM-scheme Temporal Memory algorithm uses 4-byte synapses, and implements:

  • multiple basal and apical segments per cell
  • multiple cortical columns (with parallel compute, thread/cc)
  • “hexagonal” minicolumn/cc topology, enabling use of distance between pre/post synaptic cells
  • multiple connectivity (presynaptic cell of any synapse can be anywhere, any cc or layer)

Memory requirements are 4B/synapse (24-bit pre-synaptic cell, 8-bit permanence),
48B/segment, plus 30% (guesstimate) for “AxonTrees” (connectivity Hashtables).
So for a TM layer modelling L4 of a cortical column, with 3K cells, 20 segments/cell, 1K synapses/cell, memory use is ~20MB.

1000 cortical columns (TBT :slight_smile: ) => 20GB
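
A quick arithmetic check of those figures, using only the numbers quoted above:

```python
# Per-cortical-column memory estimate for the HTM-scheme TM layer described above.
cells             = 3_000
segments_per_cell = 20
synapses_per_cell = 1_000

synapse_bytes = cells * synapses_per_cell * 4      # 4 B/synapse -> 12 MB
segment_bytes = cells * segments_per_cell * 48     # 48 B/segment -> ~2.9 MB
total = (synapse_bytes + segment_bytes) * 1.30     # +30% for AxonTree hashtables

print(f"~{total / 1e6:.0f} MB per cortical column")          # ~19 MB, i.e. ~20 MB
print(f"~{1000 * total / 1e9:.0f} GB for 1000 cortical columns")
```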

4 Likes

Is a huge TM biologically plausible? The probability distribution of axon lengths is logarithmic, not constant. Many loosely connected TMs with tens of mini-columns each seems more plausible.

That said, I think you can “annotate” in each origin cell the remote dendrites it connects to. Then you don’t need to do a massive snoop to compute the overlaps. It’s much faster (at the expense of the extra memory to keep the allegedly small number of remote connected dendrites). When a cell is active, you increase a counter in those remote dendrites; if a counter goes above the threshold, the owner cell of that dendrite is set to predicted.
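
A minimal sketch of that reverse-index idea (in Python rather than Scheme; the structure and the threshold value are placeholders):

```python
from collections import defaultdict

# For every presynaptic cell, the "annotation": which remote segments (dendrites)
# hold a connected synapse from it, and which cell owns each of those segments.
presyn_to_segments = defaultdict(list)   # cell id -> list of (segment_id, owner_cell)
SEGMENT_THRESHOLD = 13                   # placeholder activation threshold

def compute_predicted_cells(active_cells):
    """Touch only segments reachable from active cells, instead of scanning all synapses."""
    overlap = defaultdict(int)           # counters only for the segments actually hit
    owner = {}
    for cell in active_cells:
        for segment_id, owner_cell in presyn_to_segments[cell]:
            overlap[segment_id] += 1
            owner[segment_id] = owner_cell
    return {owner[s] for s, count in overlap.items() if count >= SEGMENT_THRESHOLD}
```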

Maybe not, but biology might have a very different scale for what huge means. HTM needs 1 GByte to encode a puny model of 4000 columns x 32 cells x 32 dendrites x 32 synapses.
And that doesn’t even touch the performance issue.


But what you suggest is nevertheless interesting, for this exact reason.

2048 minicolumns/TM (Numenta default) seems huge to me. For biological cortex, ~100 minicolumns/cortical column is often suggested (Numenta used 150 in some experiments reported in the “Columns” paper). My guess is that (a) 2048 works for ML, where several input sources are encoded into a combined SDR, and (b) 2048 compensates for the absence of inter-cc connections in TM algorithms. A TM instance is proposed as a model of L4 of one cortical column: in the Numenta experiments inter-cc connections were only in the “output” (L2/3) layer, modelled by the Column Pooler algorithm.

Yes, this is essentially the “AxonTree” data structure in HTM-scheme: for each TM instance it maps all “Sources” (origin cells, intra- and inter-cc) to the “Segments” (dendrites) containing the synapses. And indeed the overlap counts are accumulated in the Segments, not in a separate array.

2 Likes

I’m not sure the mini-column term in HTM matches its biological counterpart.

My understanding is that, in HTM terms, a cell is more or less a “branch”, an HTM mini-column is a pyramidal neuron, and a biological mini-column is just an HTM TM. A cortical column is 100-150 of these working together (using the same input). Then you have 100s of small, densely connected “things” that are loosely interconnected. This fits better with the axon-length probability distribution.

I know this view is completely different from Numenta’s … I just like it more :slight_smile:

2 Likes

Well, what I recall is 100-150 neurons per minicolumn and ~2000 minicolumns to build a column.

Beware that the 2000 value is a bit arbitrary; it was probably chosen as an estimate of how large an area surrounding a minicolumn is covered by its distal dendrite segments.

PS: if my math is right, 2000 minicolumns would cover a tiny patch of ~2 mm diameter on the cortex surface, or on the order of 1-10 square mm.
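
For reference, the arithmetic, assuming a minicolumn diameter of roughly 40 µm (my assumption; the exact figure varies by cortical area and species):

```python
import math

minicolumn_diameter = 40e-6      # metres, assumed
n_minicolumns = 2000

area = n_minicolumns * math.pi * (minicolumn_diameter / 2) ** 2   # ~2.5e-6 m^2
patch_diameter = 2 * math.sqrt(area / math.pi)                    # ~1.8e-3 m

print(f"~{area * 1e6:.1f} mm^2, ~{patch_diameter * 1e3:.1f} mm across")
```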

Some cortical columns are only about 0.15 mm across (so maybe 1 to 25 minicolumns, I guess). I think just do whatever works. Biology probably needs fewer minicolumns anyway, because it can use other forms of coding, plus specializations unique to each sense.