Size in memory of the temporal memory permanence values

I was exploring the temporal memory algorithm and I decided to compute the size of storing every permanence value.

Using the parameters from this section of the temporal memory paper, I got the following:

Number of Cells: 65536

Number of permanence weights: 549755813888
Size in memory (Float64): 4398.046511104 GB
Size in memory (Float32): 2199.023255552 GB

Number of permanence weights (sparse): 4.194304e8
Size in memory (Float64): 3.3554432 GB
Size in memory (Float32): 1.6777216 GB

This suggests that storing all permanence weights densely is prohibitive. I was considering computing segment activations via a dot product on a GPU, but these numbers suggest that would be hard: I would need a huge amount of RAM and would have to work in batches. Does this sound right, or did I make a mistake in my computations?

Julia code I used to compute this:

num_cols = 2048
col_size = 32
segment_per_cell = 128
synapses_per_segment = 50

num_cells = num_cols * col_size
segment_sparsity = synapses_per_segment / num_cells

# Each cell has segment_per_cell segments, and each segment stores one permanence (including zeros) per cell
total_floats = num_cells * segment_per_cell * num_cells
non_zero_floats = segment_sparsity * total_floats
bytes_to_gb = (1 / 1_000_000) * (1 / 1000)
GB_per_float64 = Base.summarysize(1.0) * bytes_to_gb
GB_per_float32 = Base.summarysize(1f0) * bytes_to_gb

println("Number of Cells: $(num_cells)")
println("\nNumber of permanence weights: $(total_floats)")
println("\t Size in memory (Float64): $(total_floats * GB_per_float64) GB")
println("\t Size in memory (Float32): $(total_floats * GB_per_float32) GB")
println("\nNumber of permanence weights (sparse): $(non_zero_floats)")
println("\t Size in memory (Float64): $(non_zero_floats * GB_per_float64) GB")
println("\t Size in memory (Float32): $(non_zero_floats * GB_per_float32) GB")

@marty1885 @hsgo I saw in past posts that you have done GPU implementations. Thoughts?

Yes, your sparse approach is close. There’s a limit to how many synapses are allowed per segment in all practical implementations (including NuPIC/HTM.core and my Etaler). For example:

num_cols = 2048
cells_per_col = 32
segment_per_cell = 16
max_synapses_per_segment = 128

# ~134M permanences (2048 * 32 * 16 * 128 = 134,217,728)
perm_buffer_size = num_cols * cells_per_col * segment_per_cell * max_synapses_per_segment

This, however, also means you have to store both the permanence and the index of the cell each synapse connects to. New connections are added into empty slots in the array, and you need a strategy for handling a segment whose synapse slots are all full. In the above example you actually need 512 MB to store the permanences in float32 and another 512 MB to store the cell indices in int32, about 1 GB in total.
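To make that concrete, here is a rough Julia sketch of such a layout (illustrative only, not Etaler’s or NuPIC’s actual data structures; the EMPTY sentinel is my choice). Note that creating the two arrays really does allocate about 1 GB:

num_cols = 2048
cells_per_col = 32
segments_per_cell = 16
max_synapses_per_segment = 128

num_slots = num_cols * cells_per_col * segments_per_cell * max_synapses_per_segment

const EMPTY = typemax(Int32)  # sentinel marking an unused synapse slot

permanences = zeros(Float32, num_slots)  # ~512 MB
presyn_cells = fill(EMPTY, num_slots)    # Int32 cell indices, another ~512 MB

# Grow a segment by writing into its first empty slot, if any.
function add_synapse!(perms, cells, seg, target::Int32, init_perm::Float32)
    base = (seg - 1) * max_synapses_per_segment
    for i in 1:max_synapses_per_segment
        if cells[base + i] == EMPTY
            cells[base + i] = target
            perms[base + i] = init_perm
            return true
        end
    end
    return false  # segment is full; apply your replacement strategy here
end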

Also, Etaler makes a simplification of forcing one segment per cell. In the past I argued that multiple segments are really just a deeper column in disguise, so my implementation forgoes that complexity. Just letting you know in case it confuses you.

Another trick is to store the permanences as float16 for an even smaller size. Etaler supports this. It turns out FP16 is not well supported by OpenCL on Nvidia GPUs (it does work on AMD and Intel!). You can see the definitions in the GPU kernel.
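For scale, a quick back-of-the-envelope in the style of the first post’s script, using its sparse count:

non_zero_floats = 4.194304e8  # sparse count from the first post
gb_float16 = non_zero_floats * Base.summarysize(Float16(1.0)) / 1e9  # 2 bytes each
println("Size in memory (Float16): $(gb_float16) GB")  # ~0.84 GB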

Let me know if I can help. Also feel free to DM or email me if you prefer a private conversation.


Wow, thank you for this fantastic response! This is just the kind of thing I needed to know.

I am still exploring GPU options and trying to decide what to do. I’ll reach out if I have more questions!


Since the permanences are essentially a slider with a binary output (connected or disconnected), the floating-point value is not actually used for any computation. So you could use single-byte integer storage for the permanences, which would give you a range of 256 values for each slider. This integer can then be incremented or decremented and compared to an integer threshold value.
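A minimal Julia sketch of that scheme (the threshold and step sizes are arbitrary placeholders):

const PERM_THRESHOLD = 0x80  # "connected" at or above this value
const PERM_INC = 0x05
const PERM_DEC = 0x03

perms = rand(UInt8, 50)  # one segment's permanence sliders, 1 byte each

reward(p) = p > 0xff - PERM_INC ? 0xff : p + PERM_INC  # saturating increment
punish(p) = p < PERM_DEC ? 0x00 : p - PERM_DEC         # saturating decrement
connected(p) = p >= PERM_THRESHOLD

perms .= reward.(perms)                  # strengthen the whole segment
num_connected = count(connected, perms)  # integer compare, no floats involved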

The addressing requirements for synapses can also be reduced if the dendrites are restricted to only connecting to neurons in a local neighborhood. For example, for a 16x16 local neighborhood, a 2D index could be stored in a single byte. A 256x256 neighborhood can be indexed with two bytes. Etc.
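For example, a relative offset within a 16x16 neighborhood (offsets -8 to 7 on each axis) fits in a single byte as two 4-bit fields:

pack(dx, dy) = UInt8(((dx + 8) << 4) | (dy + 8))  # two 4-bit fields in one byte
unpack(b) = (Int(b >> 4) - 8, Int(b & 0x0f) - 8)

b = pack(-3, 5)     # one byte per synapse address
dx, dy = unpack(b)  # (-3, 5)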

Another possible solution would be to forgo the addressing altogether and simply generate a family of permanence templates with various distributions of potential synapses. Then the column need only store one or more integer index(es) referring to the template(s) it uses to sample the surrounding columns (distal) and/or input layer (proximal). Each column would share one or more permanence weight vectors for the proximal inputs (e.g. one per minicolumn). However, each time a minicolumn bursts, a new distal segment can be added by finding the template that has the greatest overlap with the set of nearby columns/neurons that were previously active. If the best match does not contain a sufficient amount of overlap to exceed the desired threshold then a new template can be generated with better alignment.
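A toy Julia sketch of the segment-growth step described above (pool size, segment size, and threshold are made-up numbers):

templates = [rand(1:65536, 50) for _ in 1:128]  # pool of potential-synapse templates
prev_active = Set(rand(1:65536, 40))            # previously active cells
min_overlap = 20

overlap(t) = count(in(prev_active), t)
best_score, best_idx = findmax(overlap.(templates))

if best_score >= min_overlap
    new_segment_template = best_idx         # the column stores just this index
else
    push!(templates, collect(prev_active))  # generate a better-aligned template
    new_segment_template = length(templates)
end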


The template trick carries the risk that several cells will sync their predictive state, and it is not obvious that this is harmless. If there are as many templates as cells, then the templates are rendered useless.

You also have to check that two cells in a column do not share a template, since they would become redundant.

Maybe use many shorter, quarter-size sub-templates? Each bursting cell/segment searches for a combination of them, e.g. 4 templates of 32 addresses each (out of a pool of e.g. 64k) to generate a segment of 128 cell addresses.
That would cut the space spent on addressing by 32x.

But it could make bursting expensive, since each burst has to search the templates for matches against the current activation.

Not necessarily. If the templates are addressing only the neighborhood of a column (i.e. the template addresses are relative offsets with respect to the position of the column), then each application of a template will be addressing a different set of distal or proximal nodes. The application of the template along with the permanence array form a sort of convolution over the neighborhood of the column / neuron.
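A small Julia illustration of applying one relative-offset template at different column positions (dimensions, offsets, and permanences are arbitrary):

width, height = 64, 32
active = falses(width, height)                 # current active cells
template = [(-3, 1), (0, -2), (2, 2), (5, 0)]  # relative (dx, dy) offsets
perms = Float32[0.6, 0.2, 0.7, 0.5]            # this column's private permanences
thresh = 0.5f0

# Overlap of one column's segment with the current activity; the same
# template addresses a different neighborhood at each (cx, cy).
function overlap(active, template, perms, cx, cy, thresh)
    s = 0
    for (i, (dx, dy)) in enumerate(template)
        x = mod1(cx + dx, size(active, 1))  # wrap around at the edges
        y = mod1(cy + dy, size(active, 2))
        s += (perms[i] >= thresh && active[x, y]) ? 1 : 0
    end
    return s
end

overlap(active, template, perms, 10, 10, thresh)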

This does not automatically make them redundant. The template describes a set of potential synapses. There would still be an array of permanence values associated with those potential synapses which would determine which ones are currently participating in the dendrite’s activation.

I wouldn’t be so sure. My intuition is that the total number of templates would likely be fairly small. The flexibility provided by the permanence weights should allow for a significant amount of reuse of templates.

As the network gets trained, then it might be possible to merge or split certain templates in order to create specialized templates that use fewer addresses. This would ultimately reduce the storage requirements for the permanence arrays of all columns/neurons utilizing these templates.


Sure, shifting template addressing makes sense. But that isn’t direct address pointing; it has to be preceded by a simple operation, e.g. addition of some constant (like a combination of the column/segment’s id, or even a random seed).

Which is fine, and should not be too expensive, but then one asks whether a simple-ish operation could be randomized enough to deterministically generate the addresses of upstream cells on the fly.

The earlier concerns about long template-search times would apply here too.


Regarding redundant segments per column: even if they start with different permanences, they will evolve towards the same permanences, since the rewards/punishments will be the same.


Another idea is to encode the permanence in the position of each synapse address. Computers have this nice thing called lists.
Reward would move a synapse up the (list of synapses of the) segment; punishment would move it down.
The segment then only needs to hold a one-byte counter to know where the permanent synapses start, e.g. 42. Promoting a candidate high-scoring synapse to permanence means moving it to position 41 (more likely it is already there) and decrementing the segment’s permanence counter.

Unfortunate synapses draining to the bottom of the list are culled and replaced with new, more promising connections.
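Here is one consistent reading of that idea as a Julia sketch (the strongest-first ordering is my choice of convention; note that no permanence floats are stored at all):

mutable struct OrderedSegment
    synapses::Vector{Int32}  # presynaptic cell ids, strongest first
    n_connected::UInt8       # synapses[1:n_connected] count as permanent
end

swap!(v, i, j) = ((v[i], v[j]) = (v[j], v[i]))

# Reward: bubble a synapse one slot toward the front of the list.
reward!(seg, i) = i > 1 && swap!(seg.synapses, i, i - 1)

# Punishment: bubble it one slot toward the back.
punish!(seg, i) = i < length(seg.synapses) && swap!(seg.synapses, i, i + 1)

# Promote a candidate by widening the connected zone by one slot.
promote!(seg) = seg.n_connected < length(seg.synapses) && (seg.n_connected += 0x01)

# Cull the weakest synapse and point its slot at a more promising target.
replace_weakest!(seg, target::Int32) = (seg.synapses[end] = target)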


It should be possible to directly address a specific set of addresses algorithmically using some form of modulo arithmetic.
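A toy example of that, generating a segment’s upstream addresses on the fly from a seed (the constants are arbitrary; this only illustrates the idea, not a vetted scheme):

num_cells = 65536

# i-th presynaptic address of the segment identified by `seed`;
# nothing is stored, everything is recomputed on demand.
synapse_address(seed, i) = mod(seed * 2654435761 + i * 40503, num_cells) + 1

addrs = [synapse_address(42, i) for i in 1:50]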

Again, not necessarily. The winner cell would be the one with the greatest predictive overlap on its distal dendrites. The permanences associated with the active presynaptic neurons would be strengthened on its dendrite segment. Meanwhile, the corresponding ones on the other cells would remain fixed if they were not in a predictive state, or be decremented if their segments had enough overlap to exceed the predictive threshold but not enough to become the winning neuron in the column.


Yup, that’s pretty much what I was thinking of.

I’m still not happy with forcing a synapse to connect within whatever the seeded algorithm picks for it, instead of the much more opportunistic approach of directly targeting previously active cells.

A compromise would be to have the segment memorize only the last 1-2 bytes of each synapse’s upstream cell address, and have the seed/algorithm pick which pool (or “random template”) of 256 or 64k cells to draw from for renewal.

That, together with 1-byte permanences, would reduce the per-synapse memory cost from 8 bytes (float32 + int32) to 2-3 bytes, which is significant. It might even improve performance in implementations where the bottleneck is memory bandwidth rather than CPU speed.

You could still increase the number of possible permanence values by using your list idea. 256 values might not allow the right amount of plasticity, especially if some things change permanence more than others. Also, keeping the synapses sorted by permanence might improve performance in other ways.

Templates shouldn’t increase the number of unused synapses too much, because that would increase the memory used. I don’t know whether they would. Maybe it would be better to simply reuse redundant pieces of segments.

You could break segments into shared quarter-segments. There’s a single shared list of addresses, and each segment has its own list of corresponding permanence values. When making a new segment, you can reuse quarter segments whose presynaptic cells are all active.

That can also make it faster to compute sums for predictive states. For example, if a quarter-segment’s presynaptic cells are all active, you can just add the number of connected synapses. Similar situation if a segment is fully connected to the quarter-segment’s presynaptic cells. If not all are active or connected, it might be possible to subtract from the sum rather than summing from scratch.

You could maybe improve performance further by organizing that in a hierarchy of sums. So like break the quarter-segments into 1/8th segments etc. (tree structure). Smaller ones are more likely to be fully active/connected and more reusable.
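A rough Julia sketch of the shared quarter-segment layout, with the fast path for fully active blocks (all names and sizes are illustrative, and the tree of partial sums is left out for brevity):

const QUARTER = 32  # synapses per shared quarter-segment

address_pool = Vector{Vector{Int32}}()  # shared blocks of presynaptic cell ids

struct Segment
    blocks::NTuple{4, Int}  # indices into address_pool (shared)
    perms::Matrix{Float32}  # QUARTER x 4 permanence values (private)
end

# Overlap of one segment, with a fast path when an entire block's
# presynaptic cells are active: just count that block's connected synapses.
function overlap(seg::Segment, active::BitVector, thresh::Float32)
    total = 0
    for (b, block_id) in enumerate(seg.blocks)
        addrs = address_pool[block_id]
        if all(active[a] for a in addrs)                       # fast path
            total += count(>=(thresh), view(seg.perms, :, b))
        else                                                   # general case
            for (i, a) in enumerate(addrs)
                total += (active[a] && seg.perms[i, b] >= thresh) ? 1 : 0
            end
        end
    end
    return total
end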