Memory layout strategies for a Cell

The biggest problem that I’ve had so far with the HTM theory, from an implementation perspective, is the large amount of RAM required (or, alternatively, the heavy reliance on disk swapping, which drastically impacts performance).

A couple of initial optimization ideas that came to mind (which I’m sure everyone starting out has had):

  • SDRs can be greatly compressed by representing them as arrays of the indexes of their 1 bits
  • If I can define a fixed memory size for each cell, I can eliminate the need to index columns. A fixed memory stride could be used to sequentially access each cell in a column (such as when checking for cells in the predictive state or when triggering bursting). A small sketch of both ideas follows this list.
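
To make those two ideas concrete, here is a minimal C++ sketch. The names SparseSDR, kBytesPerCell, and cellOffset are my own, purely illustrative placeholders:

#include <cstddef>
#include <cstdint>
#include <vector>

// An SDR stored as the sorted indexes of its 1 bits rather than a full bit vector.
// At 2% sparsity, a 2048-bit SDR compresses to ~41 two-byte indexes.
using SparseSDR = std::vector<uint16_t>;

// With a fixed per-cell size, the cells of a column are reachable by a simple
// stride into one flat memory block, with no separate column index structure.
constexpr std::size_t kCellsPerColumn = 32;
constexpr std::size_t kBytesPerCell   = 4917;   // placeholder (~39,332 bits, from the layout worked out below)

inline std::size_t cellOffset(std::size_t column, std::size_t cellInColumn) {
    return (column * kCellsPerColumn + cellInColumn) * kBytesPerCell;
}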

With these initial ideas in mind, I’ve been putting some thought into efficient, fixed memory layout strategies for a single Cell in the system. The memory layout will need to include the cell’s state, some historic and predictive information, its dendrites, synapses, and permanence values. I thought I would run these by everyone here to get some feedback on things I might be missing, or other optimizations I haven’t considered.

I am using the values from the HTM Cheat Sheet to help in determining the number of bits necessary for representing each component of a cell:

  • Num Columns (N): 2048
  • Num Cells per Column (M): 32
  • Sparsity : 2%
  • Dendritic Segment Activation Threshold (θ): 15
  • Initial Synaptic Permanence: 0.21
  • Connection Threshold for Synaptic Permanence: 0.5
  • Synaptic Permanence Increment and Decrement: +/- 0.1
  • Synaptic Permanence Decrement for Predicted Inactive Segments: 0.01
  • Maximum Number of Segments per Cell: 128
  • Maximum Number of Synapses per Segment: 128
  • Maximum Number of New Synapses Added at each Step: 32

From this list I can make a few conclusions:

  • Total number of cells in the system will be 65,536
  • If each cell can connect with up to 2% of the other cells, that makes 1,310 potential connections (synapses) per cell.
  • Indexing 65,536 cells requires at least 16 bits per index.
  • Permanence changes at a granularity of 1%, so there are about 100 possible permanence values
  • Encoding 100 distinct values requires at least 7 bits
  • To group synapses by segment, each synapse needs room to store a segment number up to 128
  • Encoding 128 distinct values also requires 7 bits (a tiny helper confirming these bit widths follows this list)
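
Here is the tiny helper mentioned above (my own, just to sanity-check the bit widths):

#include <cstdint>

// Minimum number of bits needed to distinguish `values` distinct values (0 .. values-1).
constexpr unsigned bitsNeeded(uint64_t values) {
    unsigned bits = 0;
    for (uint64_t reach = 1; reach < values; reach <<= 1) ++bits;
    return bits;
}

static_assert(bitsNeeded(65536) == 16, "cell index");
static_assert(bitsNeeded(100)   == 7,  "permanence values at 1% granularity");
static_assert(bitsNeeded(128)   == 7,  "segment id (0..127)");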

With this in mind, there are a variety of memory layout strategies that could be used. After a few iterations, I think I am at a pretty good layout (listed in order from first bit onward)

16 bits for active history (will explain below)
1 bit for active state
1 bit for predictive state
14 bits for predictions (will explain below)
20,960 bits (16 bits X 1,310) for synapse indexes
9,170 bits (7 bits X 1,310) for segment grouping of synapses
9,170 bits (7 bits X 1,310) for permanence values

Total: 39,332 bits per cell

This means the total layer (65,536 cells) would come to 2,577,661,952 bits (307 MB)
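
For concreteness, here is a hedged C++ sketch of that per-cell layout. The names are mine; sizeof(Cell) will come out somewhat larger than 39,332 bits because the 7-bit fields below are padded to whole bytes, and a real implementation would probably pack everything into one flat byte block instead:

#include <cstdint>

constexpr unsigned kSynapsesPerCell = 1310;   // ~2% of 65,536 cells

struct Cell {
    uint16_t activeHistory;                        // 16 bits: active states at T-1 .. T-16
    uint16_t activeState     : 1;                  // 1 bit
    uint16_t predictiveState : 1;                  // 1 bit
    uint16_t predictions     : 14;                 // 14 bits
    uint16_t synapseTarget[kSynapsesPerCell];      // 16 bits each: index of the other cell
    uint8_t  synapseSegment[kSynapsesPerCell];     // 7 bits needed (segment number up to 128)
    uint8_t  synapsePermanence[kSynapsesPerCell];  // 7 bits needed (0..100, 1% steps)
};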

As you can see, the main candidates for optimization are the bits needed to encode the synapses. One obvious abstraction (it would deviate from the biology, but I’m not sure what impact that would have) is to eliminate the segment grouping of synapses and instead treat each synapse as a direct, independent connection between cells. This reduces memory in two ways: first, it eliminates the 9,170 bits per cell needed to store the grouping information; second, if the granularity of permanence changes is also reduced to 10%, the 9,170 bits for permanence values drop to 5,240 bits (4 bits per synapse). This optimization would reduce the total layer memory usage to about 205 MB.

Another possible optimization would be to reduce the number of bits for the synapse indexes. For example, you could use 11 bits per index, treated as an offset relative to the cell’s own position, with the middle value (1024) representing the cell itself. This would mean a cell could only connect to the 2,048 potential other cells within a range of 1,024 of its own position. Used in conjunction with the previous optimization, this would reduce the total layer memory usage to about 154 MB.
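
A small sketch of that relative-indexing scheme (my own function; it assumes indexes wrap around at the edges of the layer, which is only one way of handling boundary cells):

#include <cstdint>

constexpr uint32_t kNumCells = 65536;

// Convert an 11-bit relative offset (0..2047, with 1024 meaning "this cell")
// back into an absolute cell index, wrapping at the layer boundaries.
inline uint32_t absoluteIndex(uint32_t ownIndex, uint16_t relative11) {
    const int32_t offset = static_cast<int32_t>(relative11) - 1024;   // -1024 .. +1023
    return (ownIndex + static_cast<uint32_t>(offset) + kNumCells) % kNumCells;
}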

Obviously, the more abstractions that deviate from the biology, the less likely the system is to behave as expected, so it would take some testing to determine the cost.

To explain my thoughts on the first 32 bits of the memory layout, I was thinking from the perspective of a 32-bit x86 register. For reference, EAX is a 32-bit register whose low 16 bits form AX, which in turn splits into the high byte AH and the low byte AL.

With the bit for active state in the most significant bit of AH and the bit for predictive state in the next most significant bit, this arrangement allows for some fast operations for storing the history of the cell’s active and predictive states.

For example, with the above arrangement, a couple of shift operations will push previous history back one slot, move the current state into the first history slot, and clear the bits for the next cycle:

shl eax, 1   ; whole register shifts left: history moves back one slot and the active bit lands in the T-1 slot
shr ax, 2    ; only the low word shifts right, clearing the state bits for the next cycle
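
The same update in C++, for anyone who doesn’t think in x86 (assuming the 32-bit arrangement just described: bits 31–16 history, bit 15 active, bit 14 predictive, bits 13–0 predictions):

#include <cstdint>

// Equivalent of `shl eax, 1` followed by `shr ax, 2` on the cell's first 32 bits.
inline void advanceCycle(uint32_t& cellWord) {
    cellWord <<= 1;                                 // history back one slot; the active
                                                    // bit lands in the T-1 history slot
    const uint32_t low = (cellWord & 0xFFFFu) >> 2; // shift only the low word, which
                                                    // clears the two state bits
    cellWord = (cellWord & 0xFFFF0000u) | low;
}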

Of course, a history of predictions is probably not very useful. Instead, shifting AX right one slot and then walking through the cell’s connections, AND’ing against the active bits from T-1, T-2, T-3, etc., would be much more useful (for predicting the cell’s state 1, 2, 3, etc. steps into the future).

Anyway, I’m curious to get some ideas from other folks on memory layout strategies.

5 Likes

Some feedback:

  • Cells probably need to be able to connect with well over 2% of all other cells. That isn’t an issue for memory if you only store synapses to sources the cell has actually associated with. For example, you could use a dendritic segment class (a rough sketch follows the footnote below). When a column bursts, add a segment to a random cell* with, say, 40 synapses, where the source cell for each synapse was active last cycle and thus predicts the current column.
  • It needs to use dendritic segments. Each segment roughly corresponds to one transition (I say roughly because segments have a lot of noise tolerance and a single segment can deal with multiple similar transitions). Segments are needed for learning, for multiple reasons. If you don’t differentiate synapses by segment, the cell will train all of its synapses for every transition, and that will interfere with the synapses for the other transitions.
  • I’m pretty sure HTM only makes predictions one step into the future (in the temporal memory, not the classifier). Predicting the next input based on states from multiple steps in the past could cause problems. For example, in the sequence ABCD, it would try to predict B based on whatever came before A, which would be random and cause instability in the connectivity, perhaps leading to instability in the connectivity for predicting C and D as well.

*The cell doesn’t need to be randomly selected, but e.g. selecting the cell which turns on least frequently would help just a little.
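
A rough sketch of that segment-growing step (class and function names are my own, not from any particular implementation):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

struct Synapse {
    uint32_t sourceCell;    // index of the presynaptic cell
    float    permanence;    // e.g. starts at 0.21
};

struct Segment {
    std::vector<Synapse> synapses;
};

struct Cell {
    std::vector<Segment> segments;   // grown on demand, not preallocated
};

// When a column bursts, give the chosen cell a new segment whose synapses sample
// from the cells that were active on the previous time step.
void growSegmentOnBurst(Cell& cell,
                        std::vector<uint32_t> prevActiveCells,
                        std::size_t maxNewSynapses = 40,
                        float initialPermanence = 0.21f) {
    static std::mt19937 rng{std::random_device{}()};
    std::shuffle(prevActiveCells.begin(), prevActiveCells.end(), rng);
    if (prevActiveCells.size() > maxNewSynapses)
        prevActiveCells.resize(maxNewSynapses);

    Segment segment;
    for (uint32_t source : prevActiveCells)
        segment.synapses.push_back({source, initialPermanence});
    cell.segments.push_back(std::move(segment));
}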

Here are some optimizations I use.

  • Start without any segments and add them over time. This allows it to use as many segments as needed and no more, but it requires keeping track of which segment belongs to which cell.
  • Use a number of cells per column based on the number of transitions and length of sequences. If it works well enough with 10 cells per column, just use 10. It doesn’t need to be perfect if you are trying to learn rather than use HTM to do something.
  • Use fewer than 2048 columns if possible. For a program which predicts the next letter in a random word out of 25 different words, 972 columns with 16 cells each worked fine.
  • Wait until the next input arrives before predicting the input. Only predict cells in active columns. That way, you don’t need to save predictions until the next input arrives and you don’t need to compute as many predictions. (A small sketch of this follows the list.)
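
That last point might look something like this (hypothetical names; the point is that segment excitations are only ever computed for cells whose column just turned on, using last cycle’s active cells):

#include <cstdint>
#include <unordered_set>
#include <vector>

struct Segment {
    std::vector<uint32_t> sourceCells;   // presynaptic cells with connected synapses
};

// Called only for cells in columns that actually became active this cycle.
bool wasPredicted(const std::vector<Segment>& cellSegments,
                  const std::unordered_set<uint32_t>& prevActiveCells,
                  uint32_t activationThreshold = 15) {
    for (const Segment& segment : cellSegments) {
        uint32_t excitation = 0;
        for (uint32_t source : segment.sourceCells)
            if (prevActiveCells.count(source) && ++excitation >= activationThreshold)
                return true;
    }
    return false;
}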
3 Likes

@Paul_Lamb Doesn’t there also need to be 1 bit for the learning state of a cell?

@Casey > Wait until the next input arrives before predicting the input. Only predict cells in active columns. That way, you don’t need to save predictions until the next input arrives and you don’t need to compute as many predictions.

Hmm, I hadn’t considered this optimization before but that makes a lot of sense. Thanks!

@Casey, thanks for the feedback. I’m not sure I follow your explanation why each cell connecting to well over 2% of the other cells is not an issue for memory. If each connection requires storing a value for strength, the more connections, the higher the memory usage per cell.

So in the HTM cheat sheet, it indicates “sparsity 2%”. Sounds like I have misinterpreted that element (sorry if that was obvious – I am still new to the theory). I probably should do more studying and less cheating :smiley: If this does not refer to sparsity of connections to other cells, is 2% then just a reference to sparsity of the sensor input, or something else?

Since the 2% sparsity doesn’t refer to sparsity of connections with other cells, I suppose the max connections is then 16,384 (128 max segments per cell times 128 max synapses per segment). That has quite a significant impact on the total potential memory usage (more than ten times the per-synapse figures I posted above). I may need to re-think the concept of a fixed memory layout for cells as a way to avoid column indexing (cheaper to do as you say and not consume the memory for segments until they are formed).

@ddigiorg, I do not think any space is needed for the learning state, as long as the sequence of steps in the process is followed in a particular order (for example, not changing the active states until after the process of strengthening/weakening the permanence values has been completed). Not that one bit per cell makes a big difference in the grand scheme :slight_smile:

1 Like

Sorry, that reply was a bit of a ramble.

It’s okay for each cell to be able to connect to any other cell. This doesn’t mean it has synapses for every other cell, so long as you allow for the addition of new synapses (or, alternatively, new segments which are each initialized with fixed sets of synapses.)

2% refers to the sparsity of the spatial pooler. So 2% of the columns are on.

I have been re-thinking the problem from the perspective of a fixed memory layout for certain elements, while other elements grow dynamically as new segments are created. My thought is to implement a kind of “linked list” structure for the segments, where each cell holds a pointer to its first segment, that segment holds a pointer to the next segment, and so on. More segments would be added by reserving memory and then adding a pointer to it from the last segment in the list. (Alternatively, this could be a reverse-linked list where newer segments point back to older segments, and ultimately back to the cell – memory consumption would be the same in both cases.)

Running some of the numbers for how this strategy could be implemented:

Cell

  • 16 bits for active history
  • 1 bit for active state
  • 1 bit for predictive state
  • 14 bits for predictions
  • 32 bits for pointer to first segment

Segment

  • 2048 bits for indexes of other cells – 128 synapses, 16 bits each (65,536 possible values)
  • 896 bits for permanence values – 128 synapses, 7 bits each (100 possible values)
  • 32 bits for pointer to next segment

System will start with a memory footprint of 512 KB (assuming every cell started with 0 segments), and initially grow at a rate of 14.5 KB per cycle (new segments added to a cell from each bursting column). System could grow to a maximum potential memory footprint of 2.9 GB (the total cost of 128 segments for every cell). That 14.5 KB per cycle would be lower when the system encounters sequences it has learned (0 when no new segments need to be created).
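
A minimal C++ sketch of that Cell/Segment layout (names are mine; note that a raw Segment* is 8 bytes on a 64-bit build, so matching the 32-bit pointer in the lists above would mean using a 32-bit offset into a pre-reserved arena instead):

#include <cstdint>

constexpr unsigned kSynapsesPerSegment = 128;

struct Segment {
    uint16_t sourceCell[kSynapsesPerSegment];   // 2048 bits: indexes of other cells
    uint8_t  permanence[kSynapsesPerSegment];   // 7 bits needed (0..100), stored in 8
    Segment* next;                              // next segment in the list, or nullptr
};

struct Cell {
    uint16_t activeHistory;          // 16 bits
    uint16_t activeState     : 1;    // 1 bit
    uint16_t predictiveState : 1;    // 1 bit
    uint16_t predictions     : 14;   // 14 bits
    Segment* firstSegment;           // null until the first segment is grown
};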

I’ll need to run the system through some realistic scenarios to get a feel for the actual memory cost of this strategy. One possible optimization, if the system grows too large over time, might be rules for “de-growth”: if all of a segment’s synapse permanences have decayed to zero for a certain number of cycles, the memory used by that segment could be recycled.

3 Likes

Here are some rules for removing segments that I think I’ve seen. I’m not sure how well they work because I haven’t tried them.

  • Reduce all permanences a tiny amount each cycle, and remove segments based on permanences.
  • Punish segments which predicted the cell if the cell’s column doesn’t turn on.

I think both of these would have issues.

If you want to reduce the number of segments, you can also use this algorithm (a rough sketch in code follows the list):

  1. Find the excitation value for each segment in the column
  2. If the highest value is above threshold, that segment learns (i.e. active synapses get +permanence, inactive synapses -perm) and all cells corresponding to segments above threshold turn on. (Multiple cells can turn on in case there are multiple sequences which share a column occurring at the same time.)
  3. If the highest value isn’t above the threshold:
    a. Find the overlap value for each segment in a column (i.e. excitation values as if all synapses had high enough permanence)
    b. If the highest overlap is above a slightly smaller threshold than the last one, the corresponding segment learns. All the cells in the column still turn on.
    c. If the highest overlap isn’t above that threshold, add a new segment with inputs from cells which were active in the last cycle, using low permanence values (so it can wait and see if the new segment got added because of noise, for instance, when another cell should be used for the transition). All cells in the column turn on.
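
Here is that rough, untested sketch of the per-column decision logic (all names and threshold values are my own guesses; the learning updates and the bursting path are only hinted at in comments):

#include <cstdint>
#include <unordered_set>
#include <vector>

struct Synapse { uint32_t sourceCell; float permanence; };
struct Segment { std::vector<Synapse> synapses; uint32_t ownerCell; };

constexpr float    kConnected        = 0.5f;   // connection threshold
constexpr uint32_t kActiveThreshold  = 15;     // excitation threshold (step 2)
constexpr uint32_t kOverlapThreshold = 12;     // slightly smaller threshold (step 3b)

// Count active sources; with requireConnected == false this is the "overlap" of
// step 3a, i.e. the excitation as if every synapse had high enough permanence.
uint32_t matchCount(const Segment& segment,
                    const std::unordered_set<uint32_t>& prevActiveCells,
                    bool requireConnected) {
    uint32_t count = 0;
    for (const Synapse& s : segment.synapses)
        if ((!requireConnected || s.permanence >= kConnected) &&
            prevActiveCells.count(s.sourceCell))
            ++count;
    return count;
}

// Returns the segment that should learn for this column, or nullptr when a new
// segment should be grown instead (step 3c), in which case the column bursts.
Segment* chooseLearningSegment(std::vector<Segment>& columnSegments,
                               const std::unordered_set<uint32_t>& prevActiveCells) {
    Segment* best = nullptr;
    uint32_t bestExcitation = 0;
    for (Segment& segment : columnSegments) {                // step 1
        uint32_t e = matchCount(segment, prevActiveCells, /*requireConnected=*/true);
        if (e > bestExcitation) { bestExcitation = e; best = &segment; }
    }
    if (best && bestExcitation >= kActiveThreshold)          // step 2: this segment learns,
        return best;                                         // cells above threshold turn on

    best = nullptr;
    uint32_t bestOverlap = 0;
    for (Segment& segment : columnSegments) {                // step 3a
        uint32_t o = matchCount(segment, prevActiveCells, /*requireConnected=*/false);
        if (o > bestOverlap) { bestOverlap = o; best = &segment; }
    }
    if (best && bestOverlap >= kOverlapThreshold)            // step 3b: learns, column bursts
        return best;
    return nullptr;                                          // step 3c: grow a new segment
}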

This algorithm probably has pros and cons. I haven’t tested other algorithms. I suspect the overlap step would help reduce the number of segments though.

2 Likes

Hello,

In my experience, a fixed memory layout is not a way to reduce the memory footprint but a way to increase running performance, as it allows direct memory access to any cell or column. For memory concerns, it would be better to have a dynamically allocated system so that resources/memory are allocated only when needed. As Casey pointed out, the system converges as time passes and you simply cannot precompute the memory requirement of its stabilized state. So if the memory footprint is the main concern, try to design around dynamic allocation.

I currently have a fixed memory layout implementation in C++, optimized to run in real time with 65535 cells. I think you shouldn’t design the layout around some preset sparsity or connection percentages, because you will need to try different parameters at some point, and I would rather not handicap myself for future experimentation. It is almost guaranteed that you will add new data per cell or column if you want to accomplish some tasks with it.

The best optimization that I found (again, my concern is running in real time) is to propagate inputs to the higher layer directly from the active cells of the lower layer. The NuPIC implementation checks the activity of the input field/lower layer from the proximal dendrites of the higher layer. Let’s say we have 1024 columns and the input field is 1024 bits. If the columns have high potential coverage of the input field/lower layer, there is a huge amount of redundant computation: for every column, it checks the activity of its connected synapses across all 1024 input bits and calculates the overlap, so the same bits are checked for activity over and over again. Going the other way around and updating the overlaps of the connected columns from the active input bits seems better. For a sparse system, this means processing only a fraction of the input field. In short, do bottom-up propagation to compute the spatial pooling overlaps, rather than top-down checking. This means having some sort of mirror feed-forward synapses originating at the individual input bits/lower-layer cells, whose targets are the receiving columns.
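
A sketch of that bottom-up pass (my own names): each active input bit pushes its contribution to the columns it connects to, so the cost scales with the sparsity of the input rather than with columns × input size:

#include <cstddef>
#include <cstdint>
#include <vector>

// "Mirror" feed-forward connectivity: for each input bit, the columns that have
// a connected proximal synapse to it.
using ForwardTargets = std::vector<std::vector<uint32_t>>;   // input bit -> column ids

// Bottom-up overlap computation: only the active input bits are visited.
std::vector<uint32_t> computeOverlaps(const std::vector<uint32_t>& activeInputBits,
                                      const ForwardTargets& columnsOfInput,
                                      std::size_t numColumns) {
    std::vector<uint32_t> overlap(numColumns, 0);
    for (uint32_t bit : activeInputBits)
        for (uint32_t column : columnsOfInput[bit])
            ++overlap[column];
    return overlap;
}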

The memory footprint is rather large because of using a fixed memory allocation but the performance gain is worth it for me, around 20 times faster.

Hope it helps.

2 Likes

Thanks, sunguralikaan. I was actually concerned with both memory footprint and running performance (idea was to think of an efficient layout that still allows direct memory access and walking the structures). You make a good point though that trading for an increased memory footprint in the right areas can provide significant performance gains.

Ultimately, I think a lot of memory gains can be made by abstracting elements of the theory (of course requires testing to ensure the impacts on behavior of the system are acceptable). I have several ideas for using HTM concepts (in some cases using more than one cortical column) in robotics projects, so finding the right balance of performance and memory will be important for running on lighter hardware. The obvious area in the system for memory gains is of course the synapses and segments. I have several ideas for abstractions in this area which I am testing.

1 Like

I was playing the same games with memory and libraries.

I used numpy, pandas, and bitarray, but you always hit a roadblock doing it this way.
I thought dask or vaex could do the work; they can handle numpy arrays with billions of rows and are fast. One problem though: their rows are read-only. So they can be good for long-term memory solutions, but not for dynamic neurons.

C or Rust are bad for prototyping. They may be OK for v2.

So I decided my next implementation will go in a totally opposite direction, once I understand TBT better and have a speculative idea of how the next step will work (that’s why I’m fishing for more ideas :wink: before I start again).

So the idea is to go with a SQL or graph database. 350 MB is nothing; remember, the activations are sparse, i.e. you don’t touch/update all the data on every cycle.

The benefit of using a database is that you can launch multiple processes even with a scripting language. Helloooo multiple CCs :wink:

Best candidates so far are:

RedisGraph: this is the fastest graph DB because it uses matrix ops (BLAS) to handle everything. Neo4j is more fully featured but slower. BTW, I also looked at other Python graph modules, and RedisGraph seems to be the best. You have to compile it, but it’s easy.

SQL: I thought I’d start with SQLite because it’s easy, but it lacks types. So now I’m on PostgreSQL, because it supports ARRAYs (indexed SDRs) and has a companion module that lets you use graph Cypher, SQL, or both. In addition, the graph module may even be faster than RedisGraph.

Here is the overlap function:

create or replace function overlap_count(arr_a integer[], arr_b integer[])
returns integer language sql immutable as
$$
 -- count the elements the two index arrays have in common (the SDR overlap score)
 select count(*)::integer from unnest(arr_a) a inner join unnest(arr_b) b on a = b;
$$;
postgres=# select overlap_count('{9,2,3,5}', '{2,3,9,7}');
 overlap_count 
---------------
             3

So I spilled the beans :wink: that’s half of my plan.

Hope this helps other people trying to implement TBT/HTM, and saves you some research time.

2 Likes

Do you know the sparsity of connectivity from the biology within a macrocolumn? Thanks.

1 Like

It depends on the layers and cell types. A 10% connectivity ratio is reasonable, meaning a 10% chance that a given cell firing produces an EPSP in another cell’s soma. I don’t know how often it reaches the soma rather than staying fully localized to a distal dendrite. It also varies with age, maybe a lot or a little.

Supralinear increase of recurrent inhibition during sparse activity in the somatosensory cortex (Christoph Kapfer, Lindsey L Glickfeld, Bassam V Atallah, and Massimo Scanziani, 2007)
-L2/3 cells had 10% connectivity

Infrabarrels Are Layer 6 Circuit Modules in the Barrel Cortex that Link Long-Range Inputs and Outputs (Shane R. Crandall, Saundra L. Patrick, Scott J. Cruikshank, and Barry W. Connors, 2017)
-In L6a, CC → CT connectivity was 9%, and 0% in the other direction.

I couldn’t find other articles in my notes which recorded neighboring cell pairs. Other articles which measure connectivity ratio differently give values mostly under 10%, up to 20%.

1 Like