HTM process optimizations

The way I see it, it’s about what a cell or column essentially represents. If a cell represents 1 input that can occur in N contexts, then intuitively you’d want 1 proximal segment and N distal segments (union tolerance notwithstanding).
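
As a rough data-structure sketch of that intuition (names are illustrative placeholders only, not HTM.js types):

// Hypothetical sketch of the "1 proximal, N distal" intuition; names are illustrative only.
interface Synapse {
  presynapticCellIndex: number;
  permanence: number;
}

interface Segment {
  synapses: Synapse[];
}

interface Cell {
  proximalSegment: Segment;   // one segment for the feed-forward input the cell represents
  distalSegments: Segment[];  // one segment per temporal context that input can occur in
}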


My feelings so far:

#1 should be functionally identical to my current implementation, and there is no reason not to use it.

#2 is overall a bad idea, given its impact on capacity and on stability/ability to adapt to change. I’ll ditch this one.

#3 is different enough that it may be difficult to theorize what negative impacts it might have. I’ll need to run some comparative experiments to see how it impacts the behavior of the system.


Also, #3 as written above kind of implies that #2 is also implemented. If we keep the concept of “segments” around, we lose some of the simplification that came from the original idea behind the abstraction. I suppose the simplest way to do this would be to have the cell’s charge mirror whichever of its segments has the highest charge. Thresholds could be placed on the cell only, eliminating the need for an “activation threshold” at the segment level.

I’ll write up a quick test implementation and describe it in more detail before I commit this to HTM.js.
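
As a rough starting point, the shape I have in mind is something like this (a minimal sketch; all names are placeholders, not the eventual HTM.js API):

// Sketch of idea #3: the cell's charge mirrors its most highly charged segment,
// and the activation threshold is checked once at the cell level rather than per segment.
interface DistalSegment {
  charge: number;
}

interface Cell {
  segments: DistalSegment[];
  charge: number;
}

function updateCellCharge(cell: Cell): void {
  // The cell's charge simply mirrors the maximum charge across its segments.
  cell.charge = cell.segments.reduce((max, s) => Math.max(max, s.charge), 0);
}

function isPredictive(cell: Cell, activationThreshold: number): boolean {
  // The threshold is applied on the cell only, so no per-segment "activation threshold" is needed.
  return cell.charge >= activationThreshold;
}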

Just a note, #3 is similar in principle to Fergal Byrne’s proposal a few years back of a “prediction-assisted cortical learning algorithm” (paCLA). In his proposal, distal activations would add to a cell’s proximal activity, and therefore prediction can bias the column choice in addition to the cell choice.

I wouldn’t consider this much of an optimization though, because it’s much cheaper to compute the K-winners over columns than over the whole cell population, and K-winners at the column level also avoids computing the distal activations of cells in inactive columns.

In practice, I would probably only use this strategy after computing winning columns through the normal SP process (which already does something like this to pick winning columns – it just calls it “score” instead of “charge”). I might also use it for cell-to-cell proximal connections in certain pooling strategies (the only place I could see using this would be in a forward-looking pooler used in my RL process).

To clarify how SP and this concept might work together, a charge equal to the activation threshold would be transmitted to all cells in the winning columns selected via the normal SP process. This step would be synchronized for all 32 cells in the column, so that any cells with a greater charge than the activation threshold (equivalent to “predicted active”) would activate and inhibit all other cells in the column. Otherwise, all cells in the column would activate.

Alternatively, upon selecting winning columns during the SP process, any cells with an existing charge would have their charge increased to the activation threshold; otherwise, all cells in the column would receive an activating charge.
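
A minimal sketch of the first variant, assuming a simple column/cell structure (hypothetical types, not the actual HTM.js code):

// Every cell in a winning column receives a feed-forward charge equal to the
// activation threshold; any cell whose total charge now exceeds the threshold was
// "predicted active" and inhibits the rest, otherwise the whole column activates.
interface Cell {
  charge: number;   // distal charge accumulated during the depolarization phase
  active: boolean;
}

interface Column {
  cells: Cell[];
}

function activateWinningColumn(column: Column, activationThreshold: number): void {
  for (const cell of column.cells) {
    cell.charge += activationThreshold;
  }
  const predicted = column.cells.filter(c => c.charge > activationThreshold);
  const winners = predicted.length > 0 ? predicted : column.cells;
  for (const cell of winners) {
    cell.active = true;
  }
}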

Hi @Paul_Lamb

I know you kind of eliminated suggestion #2, but I just wanted to say that this case may be a lot more frequent depending on the layer configuration. As @alavin also explained, when the number of segments grows quite high and you start using the layer’s capacity to its fullest, you see that segments start to overlap in their input. At 4:03 you can see the 34 distal segments of a cell which I picked randomly. The activation of a segment is the first number starting from the left of the corresponding line (watch in 1080p). There are 3 segments with activations over 3. None are over the threshold, but together they are, and I did not cherry-pick this.

Oh and if you are interested in another optimization, check out section 5.4.3 Mirror Synapses Optimization in the thesis. It has a figure which compares the number of synapses iterated when computing overlaps with the optimization I present and with the default algorithm.
Default: 13933
Mirror Synapses: 582

Hi Paul!
Regarding #2:
As I currently understand the HTM model, the essential concept is to turn each segment, which initially represents an arbitrarily selected fragment of an input with “no meaning”, into an “educated” segment that now represents a “meaningful” pattern.
Beyond the first layer of processing (the first SDR as a function of the sensory input), where the input has no established pattern, I couldn’t figure out how this is going to happen under the optimization that you propose.
Since you reduce the cell to having only “one segment”, I suppose you’ll then have to increase the number of cells in a column in order to achieve learning. Since I am not a programmer, I can’t tell whether the latter has significant computational advantages over the former.

I optimized this in HTM.js during the cell activate process a bit differently. In pseudocode:

FOREACH cell.axon.synapses AS synapse
    synapse.segment.column.score++

Then a simple “winner takes all” function selects the columns to activate:

FOREACH columns AS column
    FOR c = 0 TO config.activeColumnCount
        IF !( c IN bestColumns ) OR bestColumns[c].score < column.score
            bestColumns.splice( c, 0, column )
            BREAK
bestColumns.length = config.activeColumnCount

Thus I do not need to iterate over inactive cells in the input space or over any synapses on the column dendrites which are not connected with the active input space.
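
For comparison, here is roughly the same logic written out as a runnable sketch (the structures are simplified stand-ins, not the actual HTM.js objects):

// Score columns by walking only the synapses fed by active input cells,
// then keep the top-scoring columns.
interface Column {
  score: number;
}

interface ProximalSynapse {
  column: Column;                    // the column whose proximal segment owns this synapse
}

interface InputCell {
  axonSynapses: ProximalSynapse[];   // synapses this cell's axon feeds
}

function scoreColumns(activeInputCells: InputCell[]): void {
  // Only active input cells are visited, so inactive inputs and their synapses cost nothing.
  for (const cell of activeInputCells) {
    for (const synapse of cell.axonSynapses) {
      synapse.column.score++;
    }
  }
}

function selectWinnerColumns(columns: Column[], activeColumnCount: number): Column[] {
  // Simple k-winners-take-all over the scored columns.
  return [...columns].sort((a, b) => b.score - a.score).slice(0, activeColumnCount);
}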

Yes, that is quite similar. Did you do that for distal segments as well (the depolarization phase), or did you do something different?

I disagree with this observation. In order for HTM to work, the bits in the input pattern must have semantic meaning. They don’t necessarily require sparsity (the Spatial Pooling process results in a sparse representation of active columns). The job of Temporal Memory is not to create meaning, but rather to create temporal context (input in one context is represented by different cells in the column than the same input in another context).

The idea behind the proposed optimization (which doesn’t hold up once the layer’s capacity starts to be utilized) was that the likelihood of a cell with multiple segments experiencing significant overlap between segments is low. Thus the segments could be ignored entirely.

Yes, similar for distal segments. Pseudocode:

FOREACH cell.axon.synapses AS synapse
    synapse.segment.score++
    IF synapse.segment.score >= config.activationThreshold THEN
        IF synapse.segment.active == false THEN
            synapse.segment.active = true
            activeSegments[t].add( synapse.segment )

I see, we are doing functionally the same thing. Have you done any kind of CPU time sampling on your code to identify the bits that take up the most cycles? If so, what did you find as the bottleneck? My bottlenecks are heavily in temporal memory (Visual Studio analysis), specifically the depolarization stage and the segment adaptation. Is it the same for you?

The main problem with depolarizing this way is that the synapses you are accessing are not stored contiguously in memory. If you walk the postsynaptic targets of a cell, you are essentially iterating over individual synapses belonging to cells scattered across RAM. As a result, each iteration is potentially a cache miss. Without the optimization, you iterate over distal synapses in a contiguous way, but you check inactive ones too, and that is even slower according to my timings. So the optimization is faster, but I could not find a way to perform depolarization in a contiguous way, which would immensely speed things up for me. Do you have any input on this? Some tips maybe?
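
To illustrate the two access patterns I mean (purely a sketch of the layouts, not anyone's actual implementation):

// Pattern A: each active cell holds references to the postsynaptic segments it feeds.
// Only relevant synapses are touched, but every reference can land anywhere in RAM,
// so each step is a potential cache miss.
interface Segment { score: number }
interface ActiveCell { outgoingSegments: Segment[] }

function depolarizeScattered(activeCells: ActiveCell[]): void {
  for (const cell of activeCells) {
    for (const segment of cell.outgoingSegments) {
      segment.score++;   // pointer-chasing across segments owned by cells all over memory
    }
  }
}

// Pattern B: each segment keeps its presynaptic cell indices in one contiguous array.
// Iteration is cache-friendly, but every synapse is checked, active or not.
function depolarizeContiguous(
  segmentPresynaptic: Uint32Array[],   // per segment: presynaptic cell indices
  cellActive: Uint8Array,              // 1 if that presynaptic cell is currently active
  segmentScores: Uint32Array           // per segment: accumulated score
): void {
  for (let s = 0; s < segmentPresynaptic.length; s++) {
    const presyn = segmentPresynaptic[s];
    for (let i = 0; i < presyn.length; i++) {
      if (cellActive[presyn[i]] === 1) segmentScores[s]++;   // inactive synapses are visited too
    }
  }
}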

I haven’t done any detailed analysis like this (I probably should…). My initial implementation would frequently crash the browser, so I started optimizing how I did various actions. I refactored the code a few times to get to the current implementation, but it is still not quite where I want it (it is difficult to do multi-layer interactions in the current design, related to how things are reset during each subsequent timestep).

You have me thinking on point #3.

The “here and now” is a window of time that seems to consist of short automatic recognition events that build to a theater of the present.

Consider vision - as the eye scans the scene, different areas are presented to the fovea; say it is a face. Over the course of several scans, you look at many parts of a face. The section of cortex that processes these parts gets snapshots of mini-images in turn, along with feedback from higher-level visual cortex and motor signals from the frontal eye field as to the location of the gaze. Somehow this builds to the identification of one person and the facial cues of expression.

The idea of charge in dendrites feeds into the idea of a build-up of parts of SDR activation over a time course.

In a setting like a meeting, you may repeat this process many times. That suggests that at a higher level the same overlaying process is going on for parts of the scene, like the bundle of information from each person.

General questions:
At some level is there a “reset” as you move from face to face?
Same question on orienting to the room?
Is the longer time course of the “here and now” a result of these maps interacting?

This same general process seems to happen in hearing. In the spirit of the Gedankenexperiment, listen to that same meeting: there is the same general process in recognizing connected speech in a conversation. There are short-term events that build into longer-term events.

I often wonder about the dynamics of brain waves - how much is done with each cycle and how much bleeds over from cycle to cycle. A partial activation (charge) adds an interesting theoretical avenue in considering these questions.

Do you have any thoughts on this?


Regarding Optimization #3.

I see some correlation to Numenta’s paper: Porting HTM Models to the Heidelberg Neuromorphic Computing Platform

I think HTM theory is fundamentally well suited to running in parallel, and it is also intended to do so, but on classical computer architectures this would add too much complexity, and the “timing signal to synchronize” that you describe seems to me a major issue. In the paper they also translate to “a membrane voltage” of each cell that depends on the inputs, which in some sense relates to your “charge”.

However, it is highly time-based and plastic, and as said, attempts to do this on traditional architectures might more easily end up in inaccurate/modified versions of the theory, highly complex systems, and not necessarily better performance.

This is also one of the reasons Subutai Ahmad mentioned for why the Network API did not include concurrency.

I’m interested in Optimization #1 as it was not discussed much.

It occurred to me that connecting a column with 50% of the input cells is equivalent to connecting each input cell with 50% of the columns.

Is it? We assume a random distribution, so it might be the same on average, but the variance can be important.
E.g. before, we had each column connect to 50% of the input cells, resulting in e.g. 512 potential connections for an input of size 32x32. However, as we connect each input bit, it can happen that some of the input bits have more connections and others fewer (variance). Not every input bit will be connected to exactly the same number of columns, though on average they would be connected to 50% of the columns (1024 if we think of 2048 columns).

If we now reverse it as you suggested, each input would be connected to 50% of the columns, but the columns would have a variance in their number of connections. This would lead to an “unfair competition” for winning columns if we do not, e.g., use a ratio of connections to potential pool size.
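
A quick way to see the variance point is a small simulation (the numbers follow the example above; everything else is just a throwaway sketch):

// Throwaway simulation: 32x32 = 1024 input bits, 2048 columns,
// each input bit connected to exactly 50% of the columns.
const inputBits = 1024;
const columnCount = 2048;
const perInputConnections = columnCount / 2;

// Partial Fisher-Yates shuffle: pick k distinct column indices uniformly at random.
function sampleColumns(count: number, k: number): number[] {
  const indices = Array.from({ length: count }, (_, i) => i);
  for (let i = 0; i < k; i++) {
    const j = i + Math.floor(Math.random() * (count - i));
    [indices[i], indices[j]] = [indices[j], indices[i]];
  }
  return indices.slice(0, k);
}

const columnConnectionCounts = new Array<number>(columnCount).fill(0);
for (let bit = 0; bit < inputBits; bit++) {
  for (const col of sampleColumns(columnCount, perInputConnections)) {
    columnConnectionCounts[col]++;
  }
}

// Each column averages 512 connections, but the counts spread around that mean,
// which is the "unfair competition" concern: better-connected columns see more input.
const mean = columnConnectionCounts.reduce((a, b) => a + b, 0) / columnCount;
console.log(`min=${Math.min(...columnConnectionCounts)}, max=${Math.max(...columnConnectionCounts)}, mean=${mean}`);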

However, I do think that it is an important topic, because if you want to move to a network with multiple layers with different scales of receptive fields, it would be too costly to do it the other way around.
In my thoughts I did not include topology. It would still have the same effects, but mitigated to a smaller area, and could then maybe be balanced better.


Good catch. It would be interesting to do some comparisons to see what impact this would have on behavior. My intuition tells me that it would reduce the overall capacity of the system to some degree (since some columns would be reused more often), but it could be that with thousands of input cells and columns the effects would be negligible.

That said, this could be addressed by adding logic where the columns with the fewest connections are chosen first, with a random tie breaker and a maximum connection count of 50% of the number of input cells.
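
Something like this is what I have in mind (a rough sketch with hypothetical names, not anything in HTM.js):

// When wiring a new input cell, prefer the columns with the fewest existing connections,
// break ties randomly, and never let a column exceed the connection cap.
function pickColumnsForInput(
  columnConnectionCounts: number[],   // current connection count per column
  columnsPerInput: number,            // how many columns this input cell should connect to
  maxConnectionsPerColumn: number     // cap, e.g. 50% of the number of input cells
): number[] {
  const chosen = columnConnectionCounts
    .map((count, index) => ({ index, count, tieBreaker: Math.random() }))
    .filter(c => c.count < maxConnectionsPerColumn)                       // respect the cap
    .sort((a, b) => a.count - b.count || a.tieBreaker - b.tieBreaker)     // least-connected first
    .slice(0, columnsPerInput)
    .map(c => c.index);

  for (const col of chosen) columnConnectionCounts[col]++;
  return chosen;
}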

