So, I don’t know how we forget things. But the memories can be restored (patterns activated) through the right context.
I don’t know how the brain works, but some issues become obvious just from the nature of the problem.
My interest is in real-world learning problems such as a robot learning to interact with its environment – trying to duplicate human and animal learning skills, basically – general AI stuff.
In this type of environment, we have sensors that produce massive amounts of very noisy data. We can see 1000 images of a cat through the robot “eyes” and no two images will ever be the same – not even “close” to the same.
So in your network, you seem to be looking at actual binary patterns as activation for your nodes. So if the inputs are “100110101”, that’s the pattern a node is activating on. In smaller toy (low-dimension) environments, an approach like that can work well. In a high-bandwidth, high-noise environment, we will see small patterns all the time (a pixel value of 40 will happen a lot, but so will all the other 255 possible pixel values for a single color channel). If you try to build nodes to recognize precise pixel patterns (say 3 RGB values, 8 bits each) you will have a 24-bit pattern that shows up in something close to 2^24 different combinations of 1’s and 0’s. You would need something near 2^24 neurons (16 million nodes) just to recognize all the patterns that come from a single pixel of a video image (robot eye) – and the number of combinations of these patterns would be massive and grow out of control very quickly if you tried to create nodes dynamically to recognize new patterns as they showed up.
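Just to put rough numbers on that, here’s a quick back-of-the-envelope sketch (the frame size is an assumed example, nothing more):

```python
# Back-of-the-envelope numbers for the "remember exact bit patterns" approach.
bits_per_pixel = 24                        # 3 RGB channels x 8 bits each
patterns_per_pixel = 2 ** bits_per_pixel
print(patterns_per_pixel)                  # 16,777,216 exact patterns from ONE pixel

# An assumed, modest 640x480 "robot eye" frame:
pixels_per_frame = 640 * 480
bits_per_frame = bits_per_pixel * pixels_per_frame
print(bits_per_frame)                      # 7,372,800 bits -> 2**7,372,800 possible exact frames
```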
This is a classic scaling problem – we don’t have enough hardware in the universe to make it work. Not even “filling up your disk” with virtual nodes would begin to touch the nature of the problem of recognizing a real-world cat walking in front of a real-world robot using video data as “eyes”.
So we must generalize. We must compress in some way. We must take a massive fire-hose stream of data from the external sensors and compress it down to internal representations in N nodes, where N is some number that represents how much hardware we are willing to throw at the problem. Our hardware is always limited.
Real-world learning is an inherently lossy compression problem. We must throw most of the data away. Guaranteed. The brain has to be doing this as well. What we can store in the brain is vastly smaller than what the sensors send to the brain over our lives.
All this relates to your “forgetting” issue. We must “forget” almost ALL of what we receive in our sensory stream, and these learning systems must do that as well. Forgetting is not the hard problem. We can’t remember everything in real-world problems, so we must answer the question of what we choose to remember – how do we decide which one bit out of a billion we “remember”?
Our learning system must implement some form of forgetting simply because it’s impossible to remember everything.
The way I like to look at the general problem is that we must use these learning networks to represent the state of the environment as accurately as possible, using the limited N nodes we have to work with (N could be 100 nodes or a trillion nodes – the problem is the same either way). If you have a high-bandwidth sensory feed like a video stream, how would you compress it down to only 10 internal signals in 10 nodes of a learning network so that those 10 bits of data (node activations) at each time step best represent the 100 million bits that come in at each time step?
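Here is a minimal sketch of what I mean by a fixed-size internal representation. The random projection is only a stand-in for whatever learned mapping the network would actually use, and all the sizes are made-up examples:

```python
import numpy as np

rng = np.random.default_rng(0)

N_INPUT = 100_000   # size of the raw sensory frame (assumed for illustration)
N_NODES = 10        # the fixed internal "hardware" budget

# Placeholder for learned weights: any fixed mapping from big input to N scores.
projection = rng.standard_normal((N_NODES, N_INPUT))

def encode(frame):
    """Map a raw frame to N binary node activations (1 = node fires)."""
    scores = projection @ frame
    return (scores > np.median(scores)).astype(np.uint8)  # roughly half the nodes fire

frame = rng.random(N_INPUT)      # stand-in for one noisy sensor frame
print(encode(frame))             # 10 bits of internal state for 100,000 inputs
```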
If you have a general algorithm to compress any large raw sensory input stream to some small internal representation, then you just pick the size of your internal network to give you the resolution needed to solve a given learning problem. This is much like how, as engineers, we can pick how many pixels to use in a camera to give us the resolution needed to solve some problem at the lowest cost. The light entering the lens carries massive amounts of data, but the camera hardware reduces that massive data down to X pixels of information, where we can choose any number X we want to build a camera for.
The learning network needs to work the same way. We pick the number of nodes we want to use to set the resolution of “understanding” the system can have – and feed it any massive stream of data we want to, and it compresses that data down to N bits by throwing MOST of the data away, but ending up with the best possible N-bit representation of the data we can create.
The general approach to this problem that seems to lead to some useful results is to compress using both spatial and temporal predictions. If input bit X at time T predicts input bit Y at time T+1, then we can represent this temporal pattern by one bit internally, so we have created compression.
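A toy sketch of measuring that kind of temporal predictiveness (the stream and bit numbers are made up purely for illustration):

```python
def temporal_cooccurrence(stream, x, y):
    """How often is bit y on at time t+1, given bit x was on at time t?
    stream is a list of frames, each frame a set of the input bits that are on."""
    hits = total = 0
    for t in range(len(stream) - 1):
        if x in stream[t]:
            total += 1
            if y in stream[t + 1]:
                hits += 1
    return hits / total if total else 0.0

stream = [{0, 3}, {1, 3}, {0, 2}, {1}]          # toy input stream
print(temporal_cooccurrence(stream, x=0, y=1))  # 1.0 -- bit 0 perfectly predicts bit 1
```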
But in real life, there are no (or very few) 100% predictions. Bits seldom correlate at a 100% rate, so we can’t compress them without loss of data.
But what we can do is pick the compression mappings that lose the least amount of data possible.
If input bit A correlates at a 10% rate with input bit B, and an 80% rate with C, then the AC pattern is more useful to “remember” than the AB pattern, because having a node that “recognizes” the more common AC pattern allows the system to “remember” more data than wasting an entire node on the rare “AB” pattern.
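In code form, the “spend nodes on the most common patterns” idea might look something like this rough sketch (pair patterns only, all names made up):

```python
from collections import Counter
from itertools import combinations

def top_pair_patterns(frames, n_nodes):
    """Count how often each pair of input bits fires together and keep only the
    top n_nodes pairs as node patterns; everything else is forgotten by design."""
    counts = Counter()
    for frame in frames:                       # frame = set of active input bits
        counts.update(combinations(sorted(frame), 2))
    return [pair for pair, _ in counts.most_common(n_nodes)]

frames = [{"A", "C"}, {"A", "C"}, {"A", "B"}, {"A", "B", "C"}]
print(top_pair_patterns(frames, n_nodes=1))    # [('A', 'C')] -- AC beats AB
```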
Your approach of creating one new node per cycle (remembering one more new bit of correlation data) is a way of allocating your hardware to the stuff that is most common, which produces a good internal representation.
But in a very high-information environment of real-world sensors, you can’t get away with remembering “actual” bit patterns (HTM suffers from this as well). There are just too many to remember.
So we must move to systems that use a probabilistic approach to pattern recognition. So bit inputs that activate a node need something like a weight that represents some measure of how likely that bit is to be on when the node it feeds is on. Then the system learns to adjust these weights to make the nodes activate in ways that best represent all the inputs.
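Something along these lines is what I have in mind – a purely illustrative sketch where each node keeps a per-input weight approximating how likely that input bit is to be on when the node fires (the threshold and learning rate are arbitrary assumptions):

```python
import numpy as np

class ProbabilisticNode:
    """A node that matches patterns statistically rather than as exact bit strings."""
    def __init__(self, n_inputs, lr=0.05, threshold=0.5):
        self.weights = np.full(n_inputs, 0.5)  # start agnostic about every input bit
        self.lr = lr
        self.threshold = threshold

    def activation(self, inputs):
        # How well the current inputs match the node's learned statistics (0..1).
        match = np.where(inputs > 0, self.weights, 1.0 - self.weights)
        return match.mean()

    def update(self, inputs):
        # If the node fires, nudge each weight toward the bit value it just saw.
        if self.activation(inputs) >= self.threshold:
            self.weights += self.lr * (inputs - self.weights)

node = ProbabilisticNode(n_inputs=4)
noisy_pattern = np.array([1, 1, 0, 1])
for _ in range(50):
    node.update(noisy_pattern)
print(node.weights.round(2))   # weights drift toward [1, 1, 0, 1]
```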
Your “only N nodes fire at once” logic (I think that’s what you implied by your “one” parameter) I assume must rely on some measure of how “well” the nodes match their patterns or something, to pick which N nodes will be active? Or to adjust weights to keep more than N from firing at once or something?
The end result is that the internal nodes need to learn to represent the input data patterns that happen the most but which don’t overlap with the patterns other nodes are representing.
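If the “only N fire at once” rule is something like k-winners-take-all, a minimal sketch would be the following – just my guess at a mechanism, not a claim about what your network actually does:

```python
import numpy as np

def k_winners(scores, k):
    """Let only the k best-matching nodes fire; competition like this pushes
    different nodes toward representing different, non-overlapping patterns."""
    active = np.zeros_like(scores, dtype=np.uint8)
    active[np.argsort(scores)[-k:]] = 1
    return active

scores = np.array([0.2, 0.9, 0.4, 0.7, 0.1])   # how well each node matches the input
print(k_winners(scores, k=2))                  # [0 1 0 1 0]
```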
It all boils down to a big lossy compression problem where the number of bits we compress down to is fixed (N nodes to represent the state of the environment with). We can’t escape the need to solve this problem by dynamic allocation, because we would then just have to figure out how to throw out nodes as fast as we are creating them – we would run out of hardware very quickly. But it’s possible that some nice, efficient shortcuts to solving this class of problem can be found by dynamically building a network based on what patterns are seen, and dynamically pruning the least-used nodes.
Computer compression algorithms like Lempel–Ziv–Welch use this technique to build an expanding pattern-recognizing tree based on what is seen, but since the goal is lossless compression, they never prune the tree or throw any data out. For these AI problems, we have to limit our trees to N nodes and throw MOST of the data away (to reach human-like learning).
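To illustrate, here is a toy sketch of that “grow the dictionary but cap it at N entries and prune the least-used” idea – it is not real LZW, just the grow-and-prune flavor of it:

```python
def lossy_pattern_dictionary(data, n_entries):
    """Grow longer patterns LZW-style, but evict the least-used entry whenever
    the fixed budget of n_entries is full -- deliberately lossy by design."""
    counts = {ch: 1 for ch in set(data)}   # start from single-symbol patterns
    current = ""
    for ch in data:
        candidate = current + ch
        if candidate in counts:
            counts[candidate] += 1
            current = candidate
        else:
            if len(counts) >= n_entries:            # budget full:
                evict = min(counts, key=counts.get) # forget the least-used pattern
                del counts[evict]
            counts[candidate] = 1
            current = ch
    return counts

print(lossy_pattern_dictionary("catcatcatdogcat", n_entries=8))
```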
So your answer to how we forget is just the inverse of what we are able to remember – which is a small tip of the iceberg of the most common patterns we are exposed to. What a system like this can remember is only what is most common in its environment.
So, when what is common shifts over time, what we can remember shifts with it, and what we forget is what is no longer common. Learning systems like this can only remember the very small tip of the iceberg of what is common in the environment. The less we are exposed, the more we “forget”. If we live in a world full of cats, our brain fills up with lots of details about cats. But if we then move to a world that has no animals, and live there for years, our brain slowly erases the details of the cats because the nodes used to recognize cat features are slowly being re-tuned for new features in the new environment, making the resolution of our memory of cats fade over time.
Our entire ability to recognize a pattern like “cat” can be understood with this “remember what is most likely” approach because a real-world cat creates lots of redundant data in our sensory streams – the real cat makes our sensory stream predictable, so the learning network forms an abstract pattern of “cat” as a way to label the predictability that exists in the sensory data. But to be effective in a real-world, high-bandwidth, high-noise environment, it must all be done with very probabilistic learning systems – not systems based on absolute bit patterns.