HTM process optimizations

When implementing the multi-layer interactions needed for the new sensory-motor functions, I found that I had coded myself into a corner with HTM.js. The way I implemented the process makes it very difficult for multiple layers to interact with each other at different steps of the process. As a result, I am refactoring.

Since HTM.js needs to be as lightweight as possible (it is designed to run in a browser), I thought I would take the opportunity to implement some optimizations that I have been thinking about for a while, and to discuss them here first to get some input.

A couple of these optimizations modify core parts of the theory, so I thought Tangential Theories might be the right forum category for this, versus HTM Hackers, but please move it if it fits better over there.

Optimization #1
SP connects to the input space on the fly

In the current implementation, upon creating the spatial pooler, a segment is created for each column, which in turn creates potential synapses with a configurable percentage of the input cells (50% by default), a configurable percentage of which (50% by default) start above the connection threshold.

It occurred to me that connecting each column with 50% of the input cells is equivalent to connecting each input cell with 50% of the columns. Thus it is possible not to create the segments and synapses upon instantiating the spatial pooler. Instead, each time an input cell is activated for the first time, its connections to the columns can be established on the fly.
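Here is a minimal sketch of what this could look like. InputCell is a hypothetical object, and potentialPct / connectedPermanence are made-up parameter names, not the actual HTM.js API:

class InputCell {
    constructor(index) {
        this.index = index;
        this.connections = null; // not created until first activation
    }

    activate(columns, config) {
        if (this.connections === null) {
            // First activation: connect this input cell to ~50% of the
            // columns, mirroring each column connecting to ~50% of inputs.
            this.connections = [];
            for (const column of columns) {
                if (Math.random() < config.potentialPct) {
                    this.connections.push({
                        column: column,
                        permanence: Math.random() // ~50% start above threshold
                    });
                }
            }
        }
        // Contribute overlap to the columns this cell connects to.
        for (const connection of this.connections) {
            if (connection.permanence >= config.connectedPermanence) {
                connection.column.overlap++;
            }
        }
    }
}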

This wouldn’t have much of an impact on memory usage for a typical single-layer sequence memory system (since in most cases all of the input cells will be used). However in multi-layer systems, this can have a huge impact. A lot of the cells in a layer are never utilized, so there could be significant memory savings by not creating unused potential synapses.

Optimization #2
Abstract the concepts of “segment” and “synapse” into just “connection”

Implementing the segment and synapse features adds complexity to the logic. The idea for this optimization is to abstract these into cell-to-cell connections. These connections would still have a permanence and connection threshold, so would not have a very significant impact on memory usage, but would simplify various logic steps.

The basic idea is to move away from an implementation like this:
[Diagram: cells connected through dendrite segments, each segment carrying multiple synapses]
To an implementation like this:
[Diagram: cells connected directly to one another, cell to cell]
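To sketch the difference in data structures (the object shapes here are illustrative, not the actual HTM.js objects):

const cellA = {}, cellB = {}, cellC = {};

// Current structure: segments group synapses, and each segment is
// checked against the activation threshold independently.
const segmented = {
    segments: [
        { synapses: [ { presynapticCell: cellA, permanence: 0.6 },
                      { presynapticCell: cellB, permanence: 0.4 } ] },
        { synapses: [ { presynapticCell: cellC, permanence: 0.7 } ] }
    ]
};

// Proposed structure: a flat list of cell-to-cell connections, each
// still carrying a permanence checked against a connection threshold.
const direct = {
    connections: [
        { presynapticCell: cellA, permanence: 0.6 },
        { presynapticCell: cellB, permanence: 0.4 },
        { presynapticCell: cellC, permanence: 0.7 }
    ]
};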

This would be functionally equivalent to each cell having a single segment with more synapses on it. I proposed this some time ago when I was first learning HTM, and ultimately decided against it, since single-segment and multi-segment cells are not functionally equivalent. There are two main functional differences that I am aware of (if there are more, please point them out):

Firstly, partial activations spread across multiple segments, each below the activation threshold, could combine to produce an activation. For example, say the activation threshold is three. The following would not put the receiving cell into a predictive state:
[Diagram: active synapses split across two segments, neither segment reaching the threshold]
But it would in the optimized version:
[Diagram: the same active synapses pooled onto a single set of connections, crossing the threshold]
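To make the difference concrete, here is a tiny sketch with made-up numbers: two segments with two active synapses each, and an activation threshold of three:

const activationThreshold = 3;
const activeSynapsesPerSegment = [2, 2]; // two segments, two active each

// Segmented version: no single segment reaches the threshold,
// so the cell does not enter the predictive state.
const anySegmentActive =
    activeSynapsesPerSegment.some(n => n >= activationThreshold); // false

// Merged version: all connections are pooled, so the combined count
// crosses the threshold and the cell becomes predictive.
const totalActive = activeSynapsesPerSegment.reduce((a, b) => a + b, 0);
const cellPredictive = totalActive >= activationThreshold; // true (4 >= 3)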

This was my original reason for not using this optimization. However, as my understanding of HTM theory has improved since then, it doesn’t seem like such a huge deal. What this scenario basically depicts is a cell connected to two semantically distinct features which now senses a new feature sharing semantics with both of the originals.

This would be something like a cell which has a segment connecting it to “dog” and another connecting it to “cat”. Then it encounters “fox”, which shares semantics with the two. It doesn’t seem like a huge problem that the cell is sensitive to this type of scenario, given the semantic similarities between the inputs across multiple segments. The likelihood of this scenario happening by random chance due to noise is very small; it could only happen when a feature shares semantics with multiple other features that the cell has connected with.

The other functional difference is in the learning step. In the current implementation, training happens in the scope of a single segment, so segments that are not activated are unaffected when a different segment on the same cell is trained.

In my above example, if the cell learns “fox” in the original implementation, it would grow a third segment and train it to recognize “fox”. This would have no impact on its current connections to “dog” and “cat”. In the new implementation, however, the cell would train itself to better connect to “fox”, and its connections to “dog” and “cat” would be degraded.

Ultimately, this means things could be forgotten more quickly. On the flip side, the system could be more memory-efficient: in the example above, I didn’t need to grow a whole new segment for “fox”, but I did degrade my memory of “cat” and “dog”.

I’m curious if anyone has explored the “one segment per cell” setup and has some comparative data they could share.

Optimization #3
Abstract the concepts of “proximal”, “distal”, “apical”, “active”, and “predictive” into just “charge”

One of the issues with the current implementation of HTM is that it is hard to leverage concurrency, due to the rather mechanical three-phase TM process of Activate -> Predict -> Learn, and due to the classification of inputs as “proximal”, “distal”, and “apical”, each used differently throughout the process. It also makes it difficult to ever move away from discrete time.

The idea for this optimization would be to have dendrites which transmit a charge to a receiving cell. The receiving cell would accumulate charges from all transmitting cells connected to it. The higher the charge, the better connected a cell is with the current input (equivalent to “predictive state”). A cell’s accumulated charge would degrade over time. When the cell’s charge reaches a certain threshold, it transmits a charge to other receiving cells and its own charge is depleted.

A transmitted charge would attenuate with distance from the transmitting cell. So by giving proximal, distal, and apical dendrites different lengths, we can get the desired behaviors.

A further optimization on top of this would be to assume that all proximal dendrites have one length, all distal dendrites another, and all apical dendrites a third. This would eliminate the need to calculate attenuation during the process; we could simply use constants for the transmitted charges.

Cells would no longer have a state. Instead they would have a charge. Based on the charge, we can determine how well they connect with the current input, whether they are predictive, whether they should activate, and so on. With proper settings for charges and thresholds, distal dendrites could be made to only ever put a cell into a predictive state but never activate it, apical and distal inputs could be combined to generate activations, and so on. Concurrency would also become possible: with the removal of discrete time, every cell could in theory perform its functions concurrently with every other cell (though some type of timing signal would be needed to keep things synchronized).
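Here is a rough sketch of what a charge-based cell might look like. All constants and names are illustrative guesses, and whether distal input alone could ever push a cell past the activation threshold would depend on tuning the charge constants, decay, and thresholds:

// Attenuation reduced to constants, per the fixed-length assumption above.
const CHARGE = { proximal: 1.0, distal: 0.4, apical: 0.3 };
const PREDICTIVE_THRESHOLD = 0.5;
const ACTIVATION_THRESHOLD = 1.0;
const DECAY_FACTOR = 0.9; // accumulated charge degrades each step

class Cell {
    constructor() {
        this.charge = 0;
        this.targets = []; // { cell, type } pairs this cell transmits to
    }

    receive(type) {
        this.charge += CHARGE[type];
        if (this.charge >= ACTIVATION_THRESHOLD) {
            this.fire();
        }
    }

    fire() {
        this.charge = 0; // charge is depleted upon transmission
        for (const target of this.targets) {
            target.cell.receive(target.type);
        }
    }

    decay() {
        this.charge *= DECAY_FACTOR; // called periodically
    }

    isPredictive() {
        return this.charge >= PREDICTIVE_THRESHOLD;
    }
}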

This is obviously a big deviation from HTM theory, so I am curious whether anyone has thought of this before, and what potential issues you can see cropping up with this strategy.


I have been working on a similar, but still quite different, implementation of HTM.


I’m not sure about #3, but #1 and #2 make sense and I think that’s how it works in NuPIC.

I don’t think NuPIC uses the second optimization, though. I see the following default parameters:

"maxSegmentsPerCell": 128,
"maxSynapsesPerSegment": 128,

Optimization #2 is an abstraction of setting the above to this:

"maxSegmentsPerCell": 1,
"maxSynapsesPerSegment": 16384,

BTW, my example of “cat”, “dog”, and “fox” is meant to explain the basic concept behind Optimization #2, not to depict a realistic scenario. Besides the fact that “cat” and “dog” in reality share semantics with each other, a third segment probably would not have been created for “fox” either. Even if this same cell had been chosen for learning, the “dog” segment probably would have been modified instead (since it is the “best matching segment” in this example).

Interesting. I wonder what @mrcslws or @scott would think about this.

If this were the case, every time a cell decided whether to become predictive, it would need to query every cell in the layer? What if the distal segments are not coming from the same layer? Would it need access to an entirely different layer? Or are you only talking about proximal input here?

I would just transmit this information during the Activate step. I have something similar already in HTM.js: when I perform an activation, I crawl all synapses connected to that cell’s axon and increment a connection score on the segments they are attached to. When a segment’s score rises above the activation threshold, I activate the segment and add it to a cache. The “Predict” phase then consists merely of cycling through the cache of active segments and setting the cells they are connected to as predictive. This optimization would simplify that process even further by removing the concept of “segment” entirely.

Right now, I’m not aware of a use case for distal input coming from multiple layers. The only multi-layer use cases I am aware of are distal from one layer and apical from another. I would model distal and apical as separate connection types (unless I also implement this in conjunction with Optimization #3, in which case the problem is simplified through abstraction).

No, this would be specifically used for distal and apical input. A single-segment optimization is already built into the normal SP process.

maxSynapsesPerSegment is just the max; only the synapses that have actually formed will be queried. That said, if cells learned many connections, then it could get slow (and cell predictions could have many false positives).

Getting rid of separate segments (#2) is a reasonable idea and the algorithm should still work, although I wouldn’t set the max synapses that high. There is some ability to handle unions without confusion: if there are three sets of presynaptic cells that put a postsynaptic cell into a predictive state, it is unlikely that the cell will have a false positive (going into the predictive state when none of the three patterns is active but some small subset of each is). Once you get up to 10 or 15 patterns unioned together, you will start to see false positives. You can manage this somewhat with the right learning rates and by limiting maxSynapsesPerSegment, but it is much easier when you have multiple segments.


Ah yes, this is the difference between an async platform and Python. How fast does this run for you per cycle?

Think about the SMI case, where distal input is coming from “elsewhere in the cortex”. If you assume the input comes from the same layer, you already know the layer’s dimensions. But if you don’t know where it is coming from, you cannot assume anything about the dimensionality.

By assuming the dimensions of the distal connections are the same as the current layer, you’re limiting yourself.


Excellent point. I had misinterpreted what you were saying – you are talking about distal input from another layer (not distal input from multiple layers).

I’ll have to think on this one a bit. My initial reaction is to have the config parameters specific to each layer, and use the parameters for the transmitting layer you are connecting from (not the layer receiving input).


This doesn’t necessarily require an asynchronous process (in fact, most browsers do not yet have truly asynchronous JavaScript). It only requires pointers or objects. Every Cell object has a reference to a single Axon object. Every Axon object has a collection of references to Synapse objects. Each Synapse object has a reference to a Segment object. Each Segment object has a reference to a Cell object. Upon activation of a Cell, I can do something like the following:

for (const synapse of cell.axon.synapses) {
    synapse.segment.score++;
    if (synapse.segment.score >= config.activationThreshold
            && !synapse.segment.active) {
        synapse.segment.active = true;
        // cache the newly active segment for the Predict phase
        activeSegments[t].add(synapse.segment);
    }
}

Regarding Optimization #2, getting rid of segments would cause a pretty large drop in capacity.

Consider if every cell in a minicolumn is part of 10 SDRs. This could definitely happen with a common feature. If that feature appears in a lot of sequences, or at a lot of locations on different objects, the feature’s minicolumns would quickly learn tens or hundreds of contexts. And, on top of this, the minicolumn is part of multiple feature SDRs. If a cell connects to 20 cells of each SDR, then each cell is now connected to 200 cells, with a threshold of ~13. Depending on the parameters, the odds of a random 40-cell SDR matching 13 of these cells are non-negligible, and it will become increasingly likely as the cells learn more contexts. And, as Scott mentioned, it’s worth considering unions. Combining unions with one-segment-per-cell would cause a lot of false positives. Having multiple segments totally avoids this problem, and it mimics biology better.
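To put rough numbers on this, here is a sketch that computes the chance of a random SDR crossing the threshold (a hypergeometric tail). The population size is an illustrative assumption, and the model assumes truly random SDRs; unions and correlated activity would make false positives considerably more likely than this suggests:

function logFactorial(n) {
    let sum = 0;
    for (let i = 2; i <= n; i++) sum += Math.log(i);
    return sum;
}

function logChoose(n, k) {
    return logFactorial(n) - logFactorial(k) - logFactorial(n - k);
}

// P(overlap >= threshold) for a random sdrSize-cell SDR against a cell
// connected to `connected` presynaptic cells out of `population` total.
function falsePositiveProbability(population, connected, sdrSize, threshold) {
    let p = 0;
    for (let k = threshold; k <= Math.min(sdrSize, connected); k++) {
        p += Math.exp(
            logChoose(connected, k) +
            logChoose(population - connected, sdrSize - k) -
            logChoose(population, sdrSize)
        );
    }
    return p;
}

// e.g. falsePositiveProbability(65536, 200, 40, 13)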

And yes, without segments, the learning is now less capable. You lose the ability to use a large “permanence decrement” – i.e. the punishment of inactive synapses on a correctly active segment. You have to keep this value very small. If it’s too large, any cells appearing in multiple SDRs will be in an unstable state, trying to get back to their “happy place” of representing one thing. Having multiple segments allows cells to stably represent multiple things and be capable of quickly forgetting bad synapses.


Really good points – I hadn’t considered the impact on capacity. This optimization is starting to look pretty bad at this point… glad I brought it up again for discussion.

Regarding Optimization #3:

Would this add any functionality, or is it purely an optimization? Would this improve learning? Would it introduce any new data structures or is this just a logical change? Is this the classic argument against the binary state of HTM?

This is purely a logical change. Basically, a single concept (charge) can be used to abstract multiple concepts, simplifying the logic and enabling concurrency.

Not in my opinion. I would probably use one threshold for knowing when something is predictive and another for knowing when something is active. I don’t see a need for knowing “how predictive” or “how active” something is, other than that it could be useful when doing a winner-takes-all process to perform inhibition.


Again, interesting. :) Thanks for bringing these topics up; it could help anyone building their own HTM system in the future.

Thinking about this some more, couldn’t a similar argument be made against the current SP implementation? Would it be worth exploring a multi-segment implementation of the SP process? Something similar to how TM works, only with proximal connections instead of distal.

I’m sure you’d get something interesting out of it, but it would change the functionality. Each column would learn to respond to multiple distinct patterns and be unable to exploit their overlap for noise tolerance and so on, whereas the single-proximal-segment version can respond to a union of similar patterns, increasing its robustness to variations on each. It could still work; as with all of these changes, it would be worth doing a thorough empirical evaluation.

To play devil’s advocate, I would think the same argument could be made for using a single segment in the TM process (responding to unions of similar patterns). It is interesting that one strategy was used for the SP and the other for the TM.