Problem with Generalization and inversely correlated variables (or constraint variables)

Question
Why is ML fixated on the concept of generalization when constraint variables (e.g. weights in an ANN) can be inversely correlated? Generalization then becomes a big question: the best performance will only be a local minimum, and otherwise the model will underfit or overfit.

Thoughts
During training, the state of a model evolves from S1 to SN, where N is the final state index. SN is the final model state, so it may have reached an optimal configuration where the objective function is maximized/minimized.

In fact, reaching SN will cause forgetting of variables that were relevant in the previous states. Hence, the final model may not be familiar with new inputs/features.

I think one of the reasons why deep learning, or ANNs in general, works is that it can employ a vast number of these constraint variables, and its updates or reconfigurations are relatively smooth (tiny values), so inverse correlations might not directly cause a problem or a big side effect. In HTM, inverse correlations are a problem because it updates itself relatively faster (it can have bigger increment values), but they are mitigated by boosting and inhibition.

The problem arises when there are so many inversely correlated variables that a single model (specifically, the final model of an ANN) models them impractically. I’m thinking of the models built by the likes of Google; they take up so much space and compute power.

Ensembles of States
Why not instead ensemble models that are trained on different input sequences, where each model represents a previous “healthy configuration” reached at some training iteration? I would view such a configuration as, for example, reaching a local minimum in gradient descent. The ensemble can also be inspired by quantum superposition, where each model corresponds to a state. The final decision is then a consensus of these states, or perhaps some equation that predicts which state is most probable for a certain input. Additionally, with this ensemble, the prediction task can also focus more on “which state is most probable to effectively predict an input” given an unknown input.
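A minimal sketch of what I have in mind, assuming hypothetical snapshot models with a generic predict() method (all names here are placeholders, not any particular library):

```python
import numpy as np

class SnapshotEnsemble:
    """Keep model snapshots taken at different 'healthy' training iterations
    and weight each snapshot by how familiar a new input looks to it.
    Everything here is a hypothetical placeholder, not a real API."""

    def __init__(self):
        self.snapshots = []  # list of (model, input_summary) pairs

    def add_snapshot(self, model, seen_inputs):
        # Summarize the inputs this snapshot was trained on (here just their
        # mean) so we can later judge how familiar a new input is to it.
        self.snapshots.append((model, np.mean(seen_inputs, axis=0)))

    def predict(self, x):
        # Closer summary -> higher weight; the final answer is a weighted
        # consensus of the snapshots' predictions.
        weights = np.array([1.0 / (1e-6 + np.linalg.norm(x - summary))
                            for _, summary in self.snapshots])
        weights /= weights.sum()
        preds = np.array([model.predict(x) for model, _ in self.snapshots])
        return weights @ preds
```

The “which state is most probable for this input” question is what the familiarity weights stand in for here; a learned gating model could replace them.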

What are your thoughts?

Perhaps the answer is error correction. Even with a single weighted sum you can get a type of error correction. The variance equation for linear combinations of random variables will tell you the type of preprocessing you need to do. The central limit theorem also gives some clues. It is indeed a funny sort of error correction, being a pull toward a central vector in hyperspace. Nevertheless it is real; it’s just that you can’t keep reapplying it sensibly.
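To make that concrete (a standard identity, stated here just for reference): for independent inputs,

$$\operatorname{Var}\Big(\sum_i a_i X_i\Big) = \sum_i a_i^2 \operatorname{Var}(X_i),$$

so a plain average of N equally noisy, independent estimates (a_i = 1/N, Var(X_i) = σ²) has variance σ²/N, and the central limit theorem says that average is roughly Gaussian around the true mean. That shrinking, Gaussian-shaped error is the “pull toward a central vector” kind of error correction, and the identity also shows why preprocessing matters: correlated or wildly differently scaled inputs break the σ²/N reduction.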
In the deep neural networks I evolved, strong filtering/error correction was apparent.
If you train on faces, is a pull/filtering/error correction toward a general face shape not a generalization?

I’m guessing you are suggesting building a mean of the model states S1~SN? That does not work, because an ANN’s internal representation evolves over time and is non-linear. You get garbage out doing so.

If not, you’re just pushing the problem to a further stage. You still need an algorithm (likely an ML model) to determine which model in the ensemble to use and how much weight to assign to it. That model still experiences the problem you described, while also being nearly impossible to train.

You could ensemble the outputs by using a large-vector-to-small-vector random projection. The point is that if there is a large error in the output of one neuron in one ensemble network, that error is spread out (diluted) over all the actual outputs.
It will appear as a tiny amount of additive Gaussian noise across the entire output vector, rather than as one glaring fault. Think of an inverse FFT, except that instead of producing a particular frequency pattern on the output from a single input, the transform produces a particular random pattern on the output from a single input. It is a transform from a point-like coherent domain to a higher-dimensional incoherent domain. There is a slight quantum aspect to that, maybe.
Anyway, the random projection should be invertible, unless you are training by evolution, in which case you don’t need that restriction.
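A minimal sketch of the sort of projection I mean, assuming numpy and using orthonormal rows so the projection has a simple (pseudo)inverse; the dimensions are arbitrary and just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Concatenated ensemble output (e.g. several networks' outputs stacked)
# projected down to a smaller shared output vector.
big_dim, small_dim = 1024, 64

# QR gives orthonormal columns; using the transpose as the projection means
# each big-vector coordinate is smeared thinly over all small-vector outputs,
# and mapping back through Q is the least-squares (pseudo)inverse.
Q, _ = np.linalg.qr(rng.standard_normal((big_dim, small_dim)))
project, unproject = Q.T, Q

ensemble_outputs = rng.standard_normal(big_dim)  # stand-in for real outputs
ensemble_outputs[17] += 10.0                     # one neuron with a glaring error

y = project @ ensemble_outputs   # the error appears as small noise spread over all 64 outputs
recovered = unproject @ y        # approximate reconstruction of the big vector
```

With a genuine size reduction the projection can only be approximately invertible, which is why dropping the invertibility requirement (as when training by evolution) gives more freedom.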

Filtering/error correction in the middle of a deep network means the network is pulling its internal representation toward certain key concepts and categories. It is learning the exact boundaries of these concepts and categories, and adapting current ones or creating new ones where the current ones don’t agree with the data. Filtering/error correction is a gentle move toward symbolic representation in such systems.

I’m not sure if that is even possible for an ANN. However, I’m talking about generalization in a general sense.

This would be model-architecture-specific, I believe. For ANNs it may not even be possible, because ANNs are inherently offline. I’m talking about online learning, by the way.

So, firstly, I was talking about online learning, and also taking the liberty to relate it to the human mind. Sorry if it wasn’t clear enough.

I would like to start with this question. Do you believe that biological learning causes forgetting? In my case I believe this is true because I feel that I’ve experienced it.

The central limit theorem, I believe, is useful only if we assume that the existence of some variables does not result in the disappearance/irrelevance of other variables, and vice versa. A normal distribution is just a snapshot of a static distribution, but in the real world a distribution can shift in shape, making some points irrelevant or relevant at a particular iteration or state.

But we all know by intuition that data samples come with hidden causal relationships. Some relationships will cause one variable to become of lesser value, and vice versa. These variables, even if they are inversely correlated, can be modeled together in ML, but only up to some level. Beyond that it would need a huge model and a vast amount of data to model even a part of the real world, which is very impractical, let alone the computing power it requires.

My thought about ensembling is this: find distributions that can generalize variables with less forgetting (with some parameter that controls the scope of these distributions), and build models from those distributions. The models can then be ensembled.

In the real world, an analogy would be this: it is a fact that many countries’ cultures disagree with many other countries’ cultures. But we are not currently in a world war. One reason is that groups for ideologies, beliefs, cultures, politics, etc. that agree with each other can exist, sympathize with each other, and neutralize each other at a level that is accepted in global society. These groups, when combined, do not cause a world war, and most importantly they do not generally cause the extinction of the other groups (forgetting). Without these groups, everyone would try to enforce their interests with little consideration of the effects on other countries, and when they succeeded they would get greedier and pursue colonization, for example. This is like generalization in modern, mainstream ML: everyone wants to model something real from static data and hope that the resulting model can handle the unseen real-time data stream in the real world, which of course cannot be modeled by a static distribution.

I think I can summarize my thoughts in a simple question: is there a sound theory, with proven tests, that curve-fitting the real-world real-time data stream is a practical endeavor? Maybe I missed something, as I’m not an ML guru. The self-driving vehicle would be a good test, but we all know that it is still not totally reliable and it is not purely ML-driven.

I am not sure that I would call it forgetting - I prefer generalization in perception and recall. Old memories form connections that lead to useful generalizations when perceiving new experiences.

I would add a data point for your consideration: I have kept journals on various trips I have taken around the world. I have opened journals from 20 years ago, on trips I had forgotten I ever took (much less any of the details of the trip), and when I start reading, it all comes back in great detail.
Those memories are still there somewhere - my mind just does not have any key to recall the experience. When that key is provided, it all floods back in surprising detail.

BTW: I do second recommendations to keep a journal. There are almost always dead times where you can update it, and it is amazing to read your old personal journals. Include notable details of what you saw, heard, smelled, felt. How did it make you feel? Were you hot? Cold? Annoyed? Happy? Excited? Irritated or upset? Carsick or seasick? Was it sunny or overcast? The smell of the sea or a bakery? What odd suit was that guy wearing? These multi-sensory keys are part of what makes the old memory come alive.

I agree, but there is always that experience or memory that you have already forgotten or cannot recall. Common sense tells me that this is because some limit has been reached. For memory this could be a capacity limit, so something has to go away. That something is probably highly likely to have an inverse correlation with the recent/resident stored memory or memories currently learned. So capacity might be the size of the generalizable inputs; beyond this, something has to go away, and if the system continues to receive inversely correlated variables it can go into a chaotic state.

Seems about time to throw this old chestnut in the mix:

@marty1885 @Bitking @SeanOConnor

I think the relationship of my question to HTM is that, roughly speaking, generalization in HTM, at least for an SP, is equal to its memory capacity. The memory capacity is the ability of its instantaneous state/configuration/linear combination (or whatever you would like to call it) to recall inputs. Recall here means a process of mapping inputs to outputs without forgetting or kicking out memory items (input representations). Even though an HTM ML engineer’s goal is most probably to “generalize” inputs as much as possible, as any ML engineer would (e.g. in an ANN’s case), the HTM models themselves seemingly are not designed for this ambitious task. For example, based on people’s testimonials about their HTM work, HTM can only do simple ML tasks.

Now, the beauty of HTM models, I believe, is that they are very simple and are potentially good candidates for intelligent, distributed computational units in a massively distributed environment. In my experience, these models don’t quite encourage engineers to intentionally overparameterize them so as to increase generalization, because they fail quickly but can also recover fast (see the TM: it catches up quickly with new patterns). Below is an oversimplified application that is possible due to HTM’s simplicity, and that would probably be impractical for a deep learning model.

Imagine a parent SP doing online learning, and when it “reaches its capacity” it stops learning (this can be a variable) but spawns a child SP; this child SP learns and spawns another child, and so on (let’s call them nodes). This simple application would ensemble healthy or primed SPs (each specialized for a set of correlated inputs) while minimizing the forgetting of old memories as new ones are created. If one allows these nodes to participate in a dynamic election algorithm, in theory they may produce good results. If one thinks about this in terms of set theory, these nodes are actually subsets, while their combination is the intersection of solution sets in a solution space. I could write more variations of this simple application, actually.
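To make the control flow concrete, here is a rough sketch of the spawning rule. The SP interface (sp.compute(input, learn=...)) and the capacity test are assumptions, not the actual htm.core API, and the step-count capacity is just a crude stand-in for “reached a healthy/primed state”:

```python
class SPNode:
    """One node in the chain: a Spatial Pooler plus a learning flag.
    `sp_factory` builds a fresh SP; its interface is assumed here."""

    def __init__(self, sp_factory, capacity=5000):
        self.sp = sp_factory()
        self.learning = True
        self.steps = 0
        self.capacity = capacity  # crude proxy for "reached capacity"

    def compute(self, encoded_input):
        if self.learning:
            self.steps += 1
        return self.sp.compute(encoded_input, learn=self.learning)


def process(nodes, sp_factory, encoded_input):
    """Feed the input to every node. Only the newest node learns; once it
    hits its capacity it is frozen and a child node is spawned, so older
    nodes keep their learned patterns (minimal forgetting)."""
    outputs = [node.compute(encoded_input) for node in nodes]
    tail = nodes[-1]
    if tail.learning and tail.steps >= tail.capacity:
        tail.learning = False
        nodes.append(SPNode(sp_factory))
    return outputs
```

A better capacity test would look at the SP’s own statistics (e.g. how stable its active columns have become), but the spawning logic stays the same.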

Good point! Could you clarify a few points so I can run experiments?

Are you proposing a cascade style of SPs? Assuming we have SP1…SPn: if the result from SP1 is inconclusive, or SP1 thinks it should ask SP2 for an answer, we go to SP2, and so on…

Do you have an algorithm in mind?

Why does only HTM benefit from it? I believe the same cascade/primed approach could also work in ANNs.

I’m roughly suggesting that an SP can replicate itself (child/sibling) so that it can maintain its memory (learned patterns) and minimize forgetting, while allowing a copy of itself to learn new patterns. These copies of SPs, if you can imagine, will then be allowed to participate in an election algorithm, or in general an algorithm that takes advantage of their retained memories.

Note that the main point here is to allow an SP to reach a “healthy” state (e.g. stabilized on some subset of the input set) and then regenerate/replicate itself to cater to more input subsets with minimal forgetting. Generally, this treats an SP as a simple learning agent; when combined, these SPs can form a multi-agent approach to an ML task.

There are many election algorithms out there; one is simply taking the option with the largest number of votes.
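For example, the simplest election over the replicated SPs could look like this (how each node turns its SDR into a label is left open, e.g. a small classifier per node; the names are just for illustration):

```python
from collections import Counter

def elect(node_predictions):
    """Each node casts one vote (its predicted label); the label with the
    most votes wins. Returns the winner and its vote share."""
    votes = Counter(node_predictions)
    winner, count = votes.most_common(1)[0]
    return winner, count / len(node_predictions)

# Five hypothetical nodes voting on a category:
print(elect(["cat", "cat", "dog", "cat", "bird"]))  # ('cat', 0.6)
```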

For the case I’ve elaborated above, where SPs are replicated, ANNs are also capable of this, but it would be an impractical design. An ANN (at least in deep learning) is designed to model hundreds, thousands, or millions of parameters, which in computing terms also means it is good at scaling up; and, at least for now, no one would want to run an ANN as an online intelligent agent - it would not fit the ML engineer’s/data scientist’s thought process. Models like the SP, on the other hand, are probably unintentionally designed to scale out: they are very simple models that are still capable of learning, but you can only push them up to some limit, which is still at a manageable level - I noticed this in my experiments before, and also in this forum, where people say along the lines that SPs can only do simple tasks. Another thing to consider is that SPs don’t perform error-driven learning, so they don’t converge as fast as ANNs, but they can focus on, or get attracted to, a subset of inputs.

@SeanOConnor

A relevant topic: a change in distribution. See 4:40 - 5:25.

The Thousand Brains Theory reinvigorated this idea in me. The idea of voting between small units of columns is very interesting, and it kind of suggests that generalization is done at a smaller scale, thus minimizing the variables with inverse relationships. Still reading more.
