Is temporal memory prone to catastrophic forgetting?


Online machine learning algorithms are prone to catastrophic forgetting. Is TM also prone to CF? Has anyone seriously investigated this issue?



Yes, in fact, this is a subject that has been visited in several of the Numenta papers.

They are usually more concerned with noise immunity and damage tolerance of the network, but now and again they talk of capacity and uniqueness.

This is covered in a “drive by” way in sections D and G here:

Also here, the part that is the “drive by” bit is unintentional unions. (Part of section 3.3)

They don’t specifically call it catastrophic forgetting but that is what is under consideration.
Numenta is calling it false positives.



I may be using the wrong definition of the term, but I normally think of catastrophic forgetting as the behavior you sometimes see when you take an ANN that is well-trained on one data set and start training it on some new type of data, and it completely forgets everything that it learned on the original data. For many classical algorithms, you have to combat this phenomenon by intermixing the different types of training data together rather than training on them separately.

I’ve not experienced that particular behavior with the TM algorithm myself. That said, there is a capacity for any given network, and the closer you get to it, the worse it gets at making accurate predictions. Overcoming the capacity limits usually involves adding a global forgetting rate to clear up old unused connections. If this is done too aggressively (such as if you are overworking a system that is really too small for your use case), then the system will have catastrophic forgetting.



Taking this definition of catastrophic forgetting it seems that TM should be generally robust against it, given the high capacity of the memory afforded by sparse activation and connectivity. In order to forget patterns they would need to be overwritten right? For instance if TM learned ‘A,B,C’ early on, it would only forget it by seeing enough overlapping patterns which would decrement away certain connections (like seeing ‘A,B,D’ a bunch of times more recently). Does this intuition sound valid?
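As a rough illustration of that intuition, the sketch below counts how many contradicting presentations it takes to disconnect a single synapse. It is a toy model, not the TM algorithm: the names and values (`CONNECTED`, `PUNISH`, integer permanences in hundredths) are assumptions chosen for illustration, and a real segment has many synapses plus an activation threshold.

```python
# Toy model of one synapse that supports the 'B -> C' prediction.
# Permanences are integers in hundredths (0..100) to avoid float rounding.
# All constants are hypothetical, not taken from any HTM implementation.
CONNECTED = 50   # synapse counts as connected at or above this value
PUNISH = 2       # decrement applied each time 'C' was predicted but 'D' arrived


def exposures_until_forgotten(start_perm, punish=PUNISH, connected=CONNECTED):
    """Count the 'A,B,D' presentations needed to disconnect the synapse."""
    if punish <= 0:
        raise ValueError("punish must be positive, or the synapse never decays")
    perm, steps = start_perm, 0
    while perm >= connected:
        perm = max(0, perm - punish)
        steps += 1
    return steps


if __name__ == "__main__":
    print(exposures_until_forgotten(100))  # synapse trained to saturation
    print(exposures_until_forgotten(60))   # synapse only just connected
```

With these made-up numbers, a saturated synapse survives many more contradicting presentations than a barely connected one, which fits the idea that 'A,B,C' is only lost after 'A,B,D' has been seen many times.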



It would generally be unlikely to get fully overwritten, unless the capacity was getting maxed out (which as you mentioned is typically pretty high). It depends on the parameters too, though (for example, you can set the activation threshold low, which could result in segments being retrained in cases where a new segment should probably have been formed instead).

Where I usually run into problems with forgetting is with the global permanence decrement setting (it can be exacerbated if the connection threshold and permanence increment/decrement are also set too high). You typically want to balance the parameters so that things which are experienced very frequently, and so reach the highest levels of permanence, take a lot of decrements to forget, while at the same time maintaining the ability to learn new things quickly and maintain accuracy by not maxing out your capacity. It is something you get a feel for with experience.



Right ok, the global perm dec makes it so that old patterns are eventually forgotten for their age alone without needing to be overwritten.
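To make the age-based forgetting concrete, here is a minimal sketch of one synapse under a global decrement that is applied every step, with reinforcement only when its pattern recurs. The function name `simulate` and all parameter values are hypothetical, and permanences are integers in hundredths; real TM implementations differ in where and how the decay is applied.

```python
# Toy simulation: global decay every step, reinforcement every `period` steps.
# Values are in hundredths (0..100) and are illustrative assumptions only.
CONNECTED = 50  # hypothetical connection threshold


def simulate(period, steps, increment=10, decay=1, start=0):
    """Return the permanence after `steps` time steps."""
    perm = start
    for t in range(steps):
        perm = max(0, perm - decay)              # global decrement, every step
        if t % period == 0:
            perm = min(100, perm + increment)    # pattern seen again
    return perm


if __name__ == "__main__":
    for period in (5, 20):
        perm = simulate(period, steps=100)
        state = "connected" if perm >= CONNECTED else "forgotten"
        print(f"seen every {period} steps: permanence {perm} ({state})")
```

In this sketch a pattern seen every 5 steps gains more from reinforcement than it loses to decay and ends up connected, while one seen every 20 steps decays back to zero between presentations. That is the balance between increment, decrement and frequency described above.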



The brain grows and shrinks connections dynamically every day.
There is more than simple linear forgetting going on.

I imagine that there is a floor where you don’t completely drop a connection so it trains back up again very fast.



I agree with Paul: TM is robust against catastrophic forgetting. TM still forgets things once it has grown to its full capacity, but it never forgets everything after a few epochs of training on another dataset.



This would take a really huge amount of data with a huge amount of overlapping patterns right?



In my experience, TM’s capacity works on a per-bit basis. A 64 column x 24 cell TM can take 64 input bits, and each column has roughly 24 ways to be on (possibly more, depending on the activation threshold). Beyond that, TM can only try its best to learn.
It’s virtually impossible to use up the entire capacity with proper Spatial Pooling/Grid Cells, but you might run into it if you have a lot of overlapping patterns in your sequences.
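The back-of-the-envelope numbers for that 64 x 24 example can be written out as below. `ACTIVE_COLUMNS` is a hypothetical value that is not given in the post, and the count of distinct codes assumes exactly one active cell per active column, which is a simplification.

```python
# Capacity arithmetic for the 64 column x 24 cell example.
COLUMNS = 64
CELLS_PER_COLUMN = 24
ACTIVE_COLUMNS = 4  # hypothetical number of columns active for one input


def context_codes(cells_per_column, active_columns):
    """Ways to pick one cell in each active column (one code per context)."""
    return cells_per_column ** active_columns


total_cells = COLUMNS * CELLS_PER_COLUMN
codes_per_input = context_codes(CELLS_PER_COLUMN, ACTIVE_COLUMNS)

if __name__ == "__main__":
    print("total cells:", total_cells)
    print("contexts per single column:", CELLS_PER_COLUMN)
    print("codes for one input pattern:", codes_per_input)
```

The per-column figure is the "~24 ways" mentioned above. The combined count grows quickly with the number of active columns, which is one reason the capacity is hard to exhaust unless sequences overlap heavily.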



In fact, learning (LTP) and forgetting (LTD) are tightly coupled (via neuromodulators) [1]. The biochemistry behind it is amazingly complex (especially LTD). It looks like when you are not learning, you are somehow forgetting.

[1] S. Y. Huang, M. Treviño, K. He, A. Ardiles, R. dePasquale, Y. Guo, A. Palacios, R. Huganir, and A. Kirkwood, “Pull-Push neuromodulation of LTP and LTD enables bidirectional experience-induced synaptic scaling in visual cortex,” Neuron, vol. 73, no. 3, pp. 497–510, 2012.




This is one of the gray areas in HTM implementations where practitioners will reach different conclusions, or use different methods of measurement, to answer such a question. Mostly the only definitive answers, for which I’m thankful, come from experience, but we cannot simply assume that all inputs/sequences have been tested by experience alone.

I hope there is some formula or standard way of calculating capacity and related quantities.

My 2 cents is that a TM can catastrophically forget what it learned in the past, but the question of “proneness” is hard to quantify. In theory a cell can have disjoint connections and these connections will compete with each other. Their fate is at the mercy of what’s seen so far, therefore it’s possible to decay one connection because of some other preferred connection.



This should be easy to test: train a TM-based model on one task, then on another, then test on the first task. Do the same with a standard NN and compare the results.
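A minimal sketch of that protocol is below. The two learners are hypothetical stand-ins, neither a TM nor an NN: `OverwriteModel` keeps only the latest successor of each symbol, and `UnionModel` keeps every successor it has seen. Any model exposing `train(seq)` and `predict(symbol)` could be dropped into `retention` instead.

```python
# Train on task A, then task B, then re-test on task A.
# Both model classes are toy stand-ins invented for this example.

class OverwriteModel:
    """Remembers only the most recent successor of each symbol."""
    def __init__(self):
        self.next = {}

    def train(self, seq):
        for a, b in zip(seq, seq[1:]):
            self.next[a] = b

    def predict(self, a):
        return {self.next[a]} if a in self.next else set()


class UnionModel:
    """Keeps every successor ever seen for each symbol."""
    def __init__(self):
        self.next = {}

    def train(self, seq):
        for a, b in zip(seq, seq[1:]):
            self.next.setdefault(a, set()).add(b)

    def predict(self, a):
        return self.next.get(a, set())


def accuracy(model, seq):
    """Fraction of transitions in `seq` whose successor is predicted."""
    pairs = list(zip(seq, seq[1:]))
    return sum(b in model.predict(a) for a, b in pairs) / len(pairs)


def retention(model, task_a, task_b, epochs=5):
    """Return task-A accuracy before and after training on task B."""
    for _ in range(epochs):
        model.train(task_a)
    before = accuracy(model, task_a)
    for _ in range(epochs):
        model.train(task_b)
    return before, accuracy(model, task_a)


if __name__ == "__main__":
    for cls in (OverwriteModel, UnionModel):
        print(cls.__name__, retention(cls(), "ABC", "ABD"))
```

A drop between the two numbers is the forgetting. Note that the union learner retains task A only by predicting both 'C' and 'D' after 'B', so a fuller test would also count false positives.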

There are lots of methods to reduce CF in NNs. Here’s a good overview.



Are there any good tests that I can run in Python to check if a network has overcome catastrophic forgetting? There are lots of papers out there, but I don’t see a database to check a network on.

I might’ve set something similar up by trying to get a spatial pooler to learn all of Unicode though. I set up a regular PyTorch autoencoder with one hidden layer, and then added boosting and kwinners to the middle to make it sparse. This ended up selecting different neurons for representing different inputs, so Chinese-like inputs would always select the Chinese neurons, and emoji-like inputs would always select the emoji neurons, and there would be some variations for other neurons to cover.
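For readers unfamiliar with the term, the k-winners step with boosting can be sketched roughly as follows. This is plain Python rather than the PyTorch code from the experiment, and the function name `k_winners` and the multiplicative form of the boost are assumptions.

```python
# Keep the k units with the highest boosted score and zero out the rest.
# Boost factors only influence which units win; winners keep their raw value.

def k_winners(activations, k, boost=None):
    """Return a sparse copy of `activations` with at most k non-zero winners."""
    if boost is None:
        boost = [1.0] * len(activations)
    scores = [a * b for a, b in zip(activations, boost)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    winners = set(ranked[:k])
    return [a if i in winners else 0.0 for i, a in enumerate(activations)]


if __name__ == "__main__":
    hidden = [0.1, 0.9, 0.5, 0.3]
    print(k_winners(hidden, k=2))
    # Boosting a rarely used unit lets it win despite a low raw activation.
    print(k_winners(hidden, k=2, boost=[10.0, 1.0, 1.0, 1.0]))
```

Because only a few units are active for each input, different kinds of input tend to recruit different units, which is the effect described above.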

However, that doesn’t mean it completely overcomes catastrophic forgetting. It could never quite recognize all the letters. I could see it retraining even after hours and hours of training, smoothly moving from one character to the actual input, but I could see many different classes of characters ‘guessed’ at first, so it may be more robust to catastrophic forgetting than other networks.

Temporal memory similarly uses different subsets of the total neurons, so I’d think it would similarly be more robust to catastrophic forgetting.



I know the original question was about temporal memory, but it is probably worth mentioning that there is a difference between the TM algorithm and the SP algorithm which is probably pretty relevant here. In SP, each minicolumn trains a single segment. This increases the chances of things being forgotten when the same segments are retrained on some new input. This is purely intuition, but I suspect the TM algorithm is going to be more robust against this problem than the SP algorithm for this reason.