Extrapolating and interpolating quantities with HTM

The human brain doesn’t need to build individual interpretations of all the numbers we encounter in daily life. We can apply our understanding of numbers across domains (1 million dollars, 5 hours, 500 m, etc.) and also learn a common-sense understanding of each individual quantity in context. To the best of my understanding, the HTM model learns a mapping between encoded representations of numbers, with some level of semantic overlap between bits (as shown in the hot gym example), and their predictive patterns. When running an HTM on a cartpole environment, it fails hopelessly to interpolate between values and actually make predictions for values it hasn’t seen. There are also four separate quantities encoded individually, but the HTM should be able to exploit what the quantities share to learn better mappings. Why do we need to pass quantities in through separate streams? This makes it difficult to extrapolate, or even interpolate, between previously unseen values, not only within individual quantities but across quantities. Is there some form of representation that can make predictions about new quantities based on relationships learned from existing ones? For example, I can imagine that my room can’t fit 100 people in it, just because I can imagine that quantity, even though I’ve never seen that specific example. I can apply that logic to any number of objects, regardless of whether I’ve seen that many before.

Furthermore, my representation of quantities is (roughly) infinitely divisible and (roughly) infinitely scalable, but the HTM needs a quantised representation of those numbers: it requires semantic overlap between bits, yet you can’t have 10 million bits representing all the numbers between 0.000001 and 0.000002.

So my question is: how can we reuse learned patterns for quantities across domains, and how do we allow the HTM to interpolate and extrapolate to unseen quantities?

Notes: This paper appears to indicate that the brain uses the frequency of spikes to encode numeric information (Single Neurons in the Human Brain Encode Numbers - ScienceDirect). Maybe the brain reuses some kind of columnar structure to pass along object information together, such that it can pass objects and quantities to the same models and learn more robust representations of quantities for all types of input, instead of learning a new mapping for each input as the HTM appears to do. I’m only vaguely familiar with how this columnar view applies to and is used by HTM theory, but it’s similar to something like a capsule network, which encodes object property information as a vector instead of as individual distributed values.

Another key thought I had is that the brain never actually receives encoded representations of numbers. It always encodes them indirectly - for example, as the visual lines of text that are transformed into a higher-level representation, or as the individual objects that can be passed in as SDRs. Maybe this is actually a non-problem that only occurs because we can directly pass in quantities, instead of symbols of quantities.


What? Who/where was trying to use HTM for cartpole? My tests with SDR encodings (which are HTM inspired) using a bit pair value map (which isn’t) to record state-action values are quite successful, even SOTA I would claim.

Well, this indeed is a thing. In the CartPoleChallenge repository there are two solvers - one with separate channels, the other with a single channel that overlaps all four state values. Beware: it is not the same thing as SDR overlap. It adds four dense embeddings and then extracts an SDR from the result.

The results in these two cases are:

  • the one with separate channels can learn with very small SDRs (even 12/48 bits works) but needs more trials: 25-30 failed trials on average, with occasional runs that learn in 6-7 trials.
  • the overlapping single channel needs roughly double the SDR size but occasionally - in 5-10% of sessions - is able to solve the cartpole with only two failed trials.

Computing-wise, however, the separate-channel encoder is far less expensive - overall it is ~3 times faster, even if it fails much more often before solving the environment.


Me, I was just testing it by encoding the 4 values as 10 bits each, for a 40-bit SDR, with the standard number encoder and 2 bits of overlap between values. This seems to fail for me (just in prediction accuracy, not control) because it needs to see many combinations of values before it can learn their corresponding next value. If it hasn’t seen a certain bit active, it has no idea what it’s supposed to mean.

I tried looking through your code, but I’m a bit confused about how you’re encoding the 4 values (especially as two of them go from -infinity to +infinity). Could you go into a bit more detail about how the four dense embeddings work?


cp_vmap_ovenc.py (sample efficient) uses the “fancy” additive dense vector encoder, while cp_vmaps_fast.py (compute efficient) uses a self-written, almost “standard” encoder - N bits in size, P bits of 1, for each state value - resulting in an Nx4-long SDR with Px4 active bits. N=12 and P=4 achieve a decent result.
Both are adjustable (contrary to Numenta’s advice) in the sense that they dynamically adjust their specific interval between min/max limits: any current value is mapped to a [0,1] interval first, and the actual encoding is made for this sub-unitary converted value. If a new state value falls outside the current limits, the limits are expanded to include it.

This is visible in min_max_adjust() in cp_vmap_ovenc.py

Now if you look in the following function - CycleEncoder - you’ll notice on line 74 that I just lied to you above: it initializes max values to some already “learned” values. But you can comment out line 74 and uncomment line 72 (which assumes -inf initial “maxima”) and the agent will learn almost as fast. I think it needs an extra 2-3 failed episodes on average, but I’m not sure.

So in the “cp_vmap_fast” version the maximum values are initialized in lines 77 or 78.
Its encoders are simple sliding blocks of P 1-bits, moving left to right within the N-bit space, so the number of possible discrete values is N-P+1.
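A minimal sketch of such a sliding-block scalar encoder (pure Python; the function name and normalisation details are my assumptions, not the repository’s actual code):

```python
def slider_encode(value, vmin, vmax, n=12, p=4):
    # Clamp and normalise the value to [0, 1].
    x = min(max((value - vmin) / (vmax - vmin), 0.0), 1.0)
    # Pick one of the n - p + 1 possible block positions.
    start = round(x * (n - p))
    # The SDR is a contiguous block of p on-bits.
    return list(range(start, start + p))
```

Adjacent positions share up to p-1 bits, which is where the semantic overlap between nearby values comes from.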

Now the additive (overlapping) encoder.
Let’s say we have a target N of 100 bits with P = 25 on bits.

  • Initialisation: each state measurement gets assigned a static, random dense vector within the [0,1] interval. So we have a 4x100 array - angle, v_angle, pos, v_pos
  • all state values are mapped to the [0,1] interval, each with (optional) automatic [min/max] adjustment
    (well, it will be downsized to something smaller, e.g. [0, 0.7], but let’s ignore this)
  • Add the obtained state value to its corresponding static vector and take the result modulo 1 → that outputs the static vector “rotated” by its corresponding observation value.
  • Add all 4 rotated vectors, resulting in a single 100-size dense vector (summation or dense overlapping vector)
  • Pick the P maximum (or minimum, it does not matter as long as you are consistent) points in this vector as the SDR encoding.

Now why 0 to 0.7 instead of 1? 0 to 1 covers a whole cycle (like a circle), which means points 0 and 1 are very close to each other. To avoid overlap between minimum and maximum values (that would be confusing) I pick a maximum value of (N-1.25*P)/N, which is 0.6875 for N,P = 100,25.
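As I read the steps above, the additive (overlapping) encoder could be sketched roughly like this (numpy; the seed, names, and top-P tie-breaking are my assumptions, not the actual implementation):

```python
import numpy as np

N, P = 100, 25
V_MAX = (N - 1.25 * P) / N           # 0.6875 cap: keeps min and max apart on the cycle
rng = np.random.default_rng(42)
base = rng.random((4, N))            # one static random dense vector per state value

def encode(state01):
    # state01: the 4 state values, each already mapped to [0, 1].
    shift = np.asarray(state01)[:, None] * V_MAX
    rotated = (base + shift) % 1.0   # each static vector "rotated" by its value
    summed = rotated.sum(axis=0)     # single 100-wide dense overlap vector
    # The P largest entries become the active bits of the SDR.
    return set(np.argpartition(summed, -P)[-P:].tolist())
```

Nearby states rotate the base vectors by nearly the same amount, so their top-P sets overlap heavily, while distant states share far fewer bits.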

There’s a whole can of speculation on why such a… mashup would work better than separate channels, but I won’t open it (unless provoked :slight_smile:)


Regarding your initial failure:
in order to predict the future state, a predictor needs both the current state and the action encoded as input to have a chance of being accurate.

What kind of predictor did you use? I would first try a vanilla sklearn regression, since that should work decently. Don’t expect amazing precision - a sliding 2/10-bit scalar encoder has only 9 possible values.

Also, I noticed there is a “sweet spot” in solidity (P/N): a generous overlap between nearby values actually helps. For cartpole, 25-30% solidity works well, at least in these implementations.


I don’t know if you have experimented with input encodings that have a meaningful spatial distribution and local inhibition, but in the case of processing multiple distinct inputs, I think they might matter quite a lot.
Otherwise, in the case of global inhibition, the SP might encode two input vectors that differ in a single entry very distinctly, leading to poor interpolation. E.g. (a, x, y, z) and (b, x, y, z) might end up almost completely distinct, likely causing the curse of dimensionality.
Granted, some entanglement might be desirable, since you would want to model relationships between dimensions, but I’d assume similarity between close values of a common dimension should be more pronounced than similarity between a correlated pattern across dimensions and a marginal variation of it - though I don’t know if that’s the right way of putting it.

Also, there’s another way the brain encodes number-like concepts: grid cells. :slight_smile:


One caveat with “numbers” in general (regarding the original message) is that, since we acquired the abstract concept of numbers/quantities, we tend to force it back into a universal representation. Which might not be the best idea.

I would test assumptions applicable to animal cognition and avoid conceptual abstractions.
One reason would be grounding - when we learn abstract concepts we anchor them on preexisting, simpler perceptions and experiences.

First, if there is a sense (as opposed to a concept) of quantity in animals, I don’t think it is universal.

E.g. a louder sound (sound, quantity) does not share the same encoding of quantity as (apples, quantity).
I think we only later learn that both are incarnations of the same abstract idea of quantity.

But even having equivalent representations of similar things, with the ability to extrapolate into e.g. “as many eggs as the apples I see”, would be a big leap toward covering the gap between perception and abstract thinking.

Probably even the mighty transformers have not really developed a proper sense of quantity, even though it might appear so from sophisticated word prediction.


Yep, even with the action it doesn’t do amazingly. My predictor is just… an HTM. I thought that was clear - hence why I wanted to know if quantities are inherently terrible for HTMs. By prediction, I don’t mean mapping back to the original quantities; it just needs to use some temporal-memory-like algorithm to predict abstractions of those quantities (like those that spatial pooling identifies). When you say it solved the environment in only a few episodes, do you mean that it can make good enough moves not to lose, rather than actually predicting the motion of the cart accurately? What would it take for your model to predict how the input representation changes over time? Is that a much harder problem than “just” learning how to correct the motion, which is basically left if the angle is negative and right if the angle is positive (plus keeping the cart in frame, obviously)?


If we are talking cartpole, by prediction I understand that:

  • given a state (4 floats) at t0 and an action (0 or 1)
  • you predict the state at t1 (another 4 float vector)

All an HTM predicts is SDRs - 100- or 1000-bit-long vectors of 0s and 1s. What do you expect from an SDR for it to be “good”?

In order to predict actual values for speeds, position and angle, you need to train a regression model using the HTM’s output SDR at t as “X” and the actual value (as computed by the environment) at t+1 as “Y”.

I tested a regressor built upon a ValueMap (the same structure the agent uses to choose the best action); it works decently, but you can find a few in sklearn, e.g. PassiveAggressiveRegressor, which can be trained continuously (on-line) at every time step.

PS: you can train the regressor using SDR-encoded values as input (X) and actual float values as output (Y).
Then indeed you might check how well the values predicted by the regressor match the actual values at the following time step.
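A minimal sketch of such an on-line regressor over SDR inputs (pure Python; this is a stand-in for something like sklearn’s PassiveAggressiveRegressor, and the class name and LMS-style learning rule are my assumptions):

```python
class OnlineSDRRegressor:
    """Tiny on-line regressor: X is a set of active bit indices, Y a float."""

    def __init__(self, n_bits, lr=0.1):
        self.w = [0.0] * n_bits  # one weight per SDR bit
        self.b = 0.0
        self.lr = lr

    def predict(self, active_bits):
        return self.b + sum(self.w[i] for i in active_bits)

    def update(self, active_bits, target):
        # One gradient step per time step: continuous learning, no batches.
        err = target - self.predict(active_bits)
        step = self.lr * err / (len(active_bits) + 1)
        for i in active_bits:
            self.w[i] += step
        self.b += step
```

At every time step you would call update(sdr_at_t, value_at_t_plus_1), then compare predict() against the environment’s actual next value.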


I mentioned specifically that it doesn’t need to be in the input format. It just needs to predict the next spatial pooler output, as if that were the SDR generated by that particular pattern (which is what temporal memory does). The SDR doesn’t need to be perfect - just vaguely correct, or at least not all zeros (and especially not needing 1000s of iterations to learn all combinations of values).

What’s the point of ever using an HTM, then? You might as well train a normal model from the inputs, forget the whole SDR idea, and train it on the actual values like a regular cartpole AI. Is a ValueMap actually any better than just the values themselves? I struggled to understand how exactly it works from your first explanation, or what the benefit is.

I’m really looking for a highly sample-efficient predictor that can learn simple relations, like: angular velocity is predictive of the new angle, and the action and position are predictive of the next position.

Could an HTM ever be adapted to model relationships like “next angle = old angle + angular velocity”? Maybe with a grid-cells/displacement-cells type system? I’m not really sure how to approach the problem, or if I’m even making sense, as I’m fairly new to all of this :sweat_smile:


A good way to test whether HTM is making reliable and meaningful predictions is to check whether a classifier or regressor can “understand” its output.

By “HTM” I understand you pair a TM with an SP, the pipeline being:
values → SDR Encoder → SP → TM → next step predictions

If you’re keen on sample efficiency you can skip the SP above, provided the SDR encoder outputs decent SDRs (constant sparsity). That avoids the SP’s “learning” stage, which implies inconsistent encodings until it stabilizes to a consistent representation.
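To illustrate why a good encoder can stand in for the SP here: concatenating one fixed-P scalar channel per state value yields an SDR whose sparsity is constant by construction, with no learning stage at all (a sketch under assumed names and parameters, using the N=12, P=4 per-channel figures discussed earlier):

```python
def slider_bits(value, vmin, vmax, n=12, p=4):
    # One channel: a block of p on-bits sliding across n positions.
    x = min(max((value - vmin) / (vmax - vmin), 0.0), 1.0)
    start = round(x * (n - p))
    return range(start, start + p)

def encode_state(state, limits, n=12, p=4):
    # Concatenate one n-bit channel per state value.
    bits = []
    for k, (v, (lo, hi)) in enumerate(zip(state, limits)):
        bits += [k * n + b for b in slider_bits(v, lo, hi, n, p)]
    return bits  # always 4*p active bits out of 4*n: sparsity never varies
```

Because every encoding has exactly 4*p active bits, a TM fed these SDRs directly sees consistent representations from the very first sample.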


I can’t help but feel you are trying to fit a square peg in a round hole.

The “prediction” part of HTM is just recalling the T+1 potential match(es) to the T pattern. The anomaly part is just that when T+1 becomes the new T, the sensed pattern does not match the recalled pattern(s)/prediction(s).

Note the (s) part. HTM might try to predict many possible outcomes; not a desirable property in a control algorithm.

For your “next angle = old angle + angular velocity” example, that is essentially a lookup table, which is not really what HTM does.

HTM might tell you that the cart is not following a pattern it has seen before. There are many good applications where signalling that something is “not right” is a very desirable behavior, but a very simple conventional neural network would be a better fit for driving a “cart pole” kind of problem.

You should have many tools in your toolbox, and know how and when to use each one.


I appreciate the feedback, but I thought this was foundational to Numenta’s theory of the brain. The neocortex clearly models the world using learned abstractions which help to predict upcoming inputs. As I understand it, HTM is (one small section of) this overall algorithm, derived from a (rough) understanding of neuroscience and neocortical principles, so isn’t that what we’re eventually aiming for? Getting better at predicting and modelling arbitrary input?

The HTM has some amazing properties: robustness to noise, local learning rules, not forgetting old values, modelling sequences of patterns after only a few repetitions, not relying on ground-truth labels for anything, simultaneous predictions like you said, and many more. There seems to be immense potential in using (and extending) the HTM model for prediction. I don’t see simultaneous predictions as an error - if it can identify multiple possible interpretations of the data, then fantastic! It’s less about the absolute prediction error for the next timestep and more about building up a sample-efficient method of finding useful abstractions from arbitrary data: an internal model of the input that can capture relations at a sufficiently high level.

As that’s not currently possible with the existing HTM model, I’m attempting to probe the knowledge of the experienced members here and see if there are other insights or areas that could lead to more prediction-like systems, instead of just anomaly-detecting ones. I’ve seen (extensions of) the HTM algorithm used for reinforcement learning and controlling an agent with reward, so my impression so far is that it’s far less limited than that one domain. If there are any other resources you think I should know about, I’d love to hear your recommendations.


The cortex is doing a lot of things.

Classic HTM focuses on both the temporal aspect and the sparsifying behavior of the spatial pooler.

TBT starts to move on (baby steps) to a system view. At this level we start to see the pattern-completion behavior of the cortex. The cortex takes in the noisy, buzzing/blooming world and pattern-completes it to a region/map-wide sparse representation of a previously learned pattern.

At some point the H of HTM will come into focus, and then we will start to see the more interesting things that the cortex does - associating a pattern in one map/region with patterns in other maps/regions. This chaining of a perception in one map to a remembered solution (guided by needs from the subcortex) is where the magic happens. If this interests you, see my numerous posts on the “loop of consciousness” for the kind of processing done at this level. See the example below.

This is the level I have been looking at for many years - the system level behavior that does all the interesting things that the brain does.

Knowing what a single transistor does, even in great detail, does not really explain how a collection of transistors in a computer draws a cat picture from the internet. Even stepping up a level to gates and flip-flops is still a long way from kitten pictures. You need higher-level concepts and thinking to understand the system-level behaviors.