Problem with Generalization and inversely correlated variables (or constraint variables)

Question
Why is ML fixated on the concept of generalization when constraint variables (e.g. weights in an ANN) can be inversely correlated? Generalization then becomes a big question: the model’s best performance will only be a local minimum, otherwise it will underfit/overfit.

Thoughts
During training, the state of a model evolves from S1 to SN, where N is the final state index. SN is the final model’s state, so it may have reached an optimal configuration where the objective function is maximized/minimized.

In fact, reaching SN will cause forgetting of variables that were relevant in previous states. Hence, new inputs/features may not be familiar to the final model.

I think one of the reasons why deep learning (or ANNs in general) works is that it can employ a vast number of these constraint variables, and its updates or reconfigurations are relatively smooth (tiny values), so inverse correlations might not directly cause a problem or a big side effect. In HTM, inverse correlations are a problem because it updates itself relatively faster (it can have bigger increment values), but they are mitigated by boosting and inhibition.

The problem arises when there are many inversely correlated variables and a single model - specifically the final model of an ANN - is left to model all of them, which is impractical. I’m thinking of the models built by the likes of Google; they take up so much space and compute power.

Ensembles of States
Why not instead ensemble models trained on different input sequences, where each model represents a previous “healthy configuration” reached at some training iteration? I would view these configurations as, for example, local minima reached during gradient descent. The ensemble could also be inspired by quantum superposition, where each model corresponds to a state. The final decision is then a consensus of these states, or perhaps some equation that predicts which state is most probable for a given input. Additionally, with this ensemble, the prediction task can also focus more on “which state is most likely to effectively predict this input” when the input is unknown.
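Here is a minimal sketch of how such an ensemble of states could look. The linear `Snapshot` toy model and the confidence-based `gate_scores` heuristic are placeholders I made up for illustration, not a worked-out proposal:

```python
import numpy as np

# Toy sketch: freeze snapshots of a model at different training iterations,
# then let a simple gate decide, per input, how much to trust each snapshot.

class Snapshot:
    """A frozen 'state' of the model from some training iteration."""
    def __init__(self, weights):
        self.weights = weights  # parameters captured at a "healthy configuration"

    def predict_proba(self, x):
        # Placeholder linear-softmax model; a real snapshot would be a full network.
        logits = x @ self.weights
        e = np.exp(logits - logits.max())
        return e / e.sum()

def gate_scores(snapshots, x):
    """Heuristic score of how 'probable' each state is for this input:
    here simply each snapshot's own confidence (max class probability)."""
    return np.array([s.predict_proba(x).max() for s in snapshots])

def ensemble_predict(snapshots, x):
    scores = gate_scores(snapshots, x)
    weights = scores / scores.sum()                 # soft consensus over states
    probs = sum(w * s.predict_proba(x) for w, s in zip(weights, snapshots))
    return int(np.argmax(probs)), probs

# Usage: three snapshots taken at different points during training.
rng = np.random.default_rng(0)
snapshots = [Snapshot(rng.standard_normal((8, 3))) for _ in range(3)]
label, probs = ensemble_predict(snapshots, rng.standard_normal(8))
print(label, probs)
```

The gate here is just each snapshot’s own confidence; in practice, “which state is most probable for this input” could itself be a small learned model.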

What are your thoughts?

Perhaps the answer is error correction. Even with a single weighted sum you can get a type of error correction. The variance equation for linear combinations of random variables will tell you the kind of preprocessing you need to do. The central limit theorem also gives some clues. It is indeed a funny sort of error correction, being a pull toward a central vector in hyperspace. Nevertheless it is real; it’s just that you can’t keep reapplying it sensibly.
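To make that concrete (this is just the textbook result for independent variables, not anything specific to this thread): if the errors $e_i$ are independent with variance $\sigma^2$, then for a weighted sum

$$\operatorname{Var}\!\left(\sum_{i=1}^{n} w_i e_i\right) = \sigma^2 \sum_{i=1}^{n} w_i^2,$$

so with equal weights $w_i = 1/n$ the combined error variance shrinks to $\sigma^2/n$. That $1/n$ shrinkage toward the mean is the pull toward a central vector mentioned above, and it hints at the preprocessing needed: roughly independent inputs with comparable variances.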
In the deep neural networks I evolved, strong filtering/error correction was apparent.
If you train on faces, is a pull/filtering/error correction toward a general face shape not a generalization?

I’m guessing you are suggesting building a mean of the models S1~SN? That does not work because an ANN’s internal representation evolves over time and is non-linear. You get garbage out doing so.

If not, you’re just pushing the problem to a later stage. You still need an algorithm (likely an ML model) to determine which model in the ensemble to use and how much weight to assign to it. That model still experiences the problem you described, while also being nearly impossible to train.

You could ensemble the outputs by using a large-vector-to-small-vector random projection. The point being: if there is a large error in the output of one neuron in one ensemble network, that error is spread out (diluted) over all the actual outputs.
It will appear as a tiny amount of additive Gaussian noise across the entire output vector, rather than as one glaring fault. Think of an inverse FFT, except that instead of a transform producing a particular frequency pattern on the output from a single input, you have a transform producing a particular random pattern on the output from a single input: a transform from a point-like coherent domain to a higher-dimensional incoherent domain. There is a slight quantum aspect to that, maybe.
Anyway, the random projection should be invertible, unless you are training by evolution, in which case you don’t need that restriction.
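A rough sketch of that output-ensembling trick (the dimensions and the plain Gaussian projection matrix are my assumptions; any fixed random projection with roughly orthogonal rows would do):

```python
import numpy as np

# Toy sketch: each ensemble member emits an output vector; a fixed random
# projection per member spreads any single-neuron fault across the whole
# combined output instead of leaving one glaring error.

rng = np.random.default_rng(0)

n_members = 4      # networks in the ensemble (assumed)
d_in = 256         # per-network output dimensionality (assumed)
d_out = 64         # combined output dimensionality (assumed)

# One fixed random projection per member; 1/sqrt(d_in) keeps the scale roughly constant.
projections = [rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
               for _ in range(n_members)]

def combine(member_outputs):
    """Project each member's output and sum the projections."""
    return sum(P @ y for P, y in zip(projections, member_outputs))

# Demo: a large error in one neuron of one member shows up only as small,
# noise-like perturbations spread over the entire combined output.
clean = [rng.standard_normal(d_in) for _ in range(n_members)]
faulty = [y.copy() for y in clean]
faulty[0][10] += 50.0              # glaring fault in a single neuron

delta = combine(faulty) - combine(clean)
print(delta.std())                 # ~ 50 / sqrt(d_in): diluted, roughly Gaussian
```

As written this projection is compressive and not invertible, so it fits the training-by-evolution case mentioned above; for gradient training you would want d_out ≥ d_in (or an orthogonal square matrix) so it can be inverted.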

Filtering/error correction in the middle of a deep network means the network is pulling its internal representation toward certain key concepts and categories. It is learning the exact boundaries of these concepts and categories, and adapting current ones or creating new ones where the current ones don’t agree with the data. Filtering/error correction is a gentle move toward symbolic representation in such systems.


I’m not sure if that is even possible for an ANN. However, I’m talking about generalization in a general sense.

This would be model-architecture-specific, I believe. For ANNs it might not even be possible, because an ANN is inherently offline. I’m talking about online learning, by the way.

So, firstly, I was talking about online learning, and I also took the liberty of relating it to the human mind. Sorry if that wasn’t clear enough.

I would like to start with this question. Do you believe that biological learning causes forgetting? In my case I believe this is true because I feel that I’ve experienced it.

The central limit theorem, I believe, is useful only if we assume that the existence of some variables does not result in the disappearance/irrelevance of other variables, and vice versa. A normal distribution is just a snapshot of a static distribution, but in the real world a distribution can shift in shape, making points irrelevant or relevant at a particular iteration or state.

But we all know by intuition that data samples come with hidden causal relationships. Some relationships cause one variable to become less relevant, and vice versa. These variables, even if they are inversely correlated, can be modeled together in ML, but only up to a point. Beyond that you need a huge model and a vast amount of data to model even a part of the real world, which is very impractical, let alone the computing power it requires.

My thought about ensembling is this: find distributions that can generalize variables with less forgetting (with some parameter that controls the scope of these distributions), build models from these distributions, and then ensemble those models.
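A rough sketch of what I have in mind (pure numpy; a crude k-means partition stands in for “finding the distributions”, and k plays the role of the scope parameter; all of it is just illustrative):

```python
import numpy as np

# Toy sketch of "one model per distribution": partition the data into k groups,
# fit a small linear model per group, and route each new input to its group's model.

rng = np.random.default_rng(0)
k = 3                              # "scope": fewer groups = broader distributions

X = rng.standard_normal((300, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(300)

# Crude k-means partitioning of the inputs.
centers = X[rng.choice(len(X), k, replace=False)]
for _ in range(20):
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])

# One least-squares model per partition; each only ever sees "its" distribution.
models = [np.linalg.lstsq(X[labels == c], y[labels == c], rcond=None)[0]
          for c in range(k)]

def ensemble_predict(x):
    c = np.argmin(((x - centers) ** 2).sum(-1))   # which distribution is x from?
    return x @ models[c]

print(ensemble_predict(rng.standard_normal(5)))
```

Each per-partition model only sees samples from its own distribution, so fitting one does not overwrite (forget) what another has learned.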

In the real world, an analogy would be this: it is a fact that many countries’ cultures disagree with many other countries’ cultures, yet we are not currently at world war. One reason is that groups formed around ideologies, beliefs, cultures, politics, etc. that agree with each other can exist, sympathize with each other, and neutralize one another at a level accepted by global society. These groups, taken together, do not cause a world war and, most importantly, do not generally cause the extinction of the other groups (forgetting). Without such groups, everyone would try to enforce their own interests with little consideration for the effects on other countries, and on succeeding would get greedier and pursue colonization, for example. This is like generalization in modern, mainstream ML: everyone wants to model something real from static data and hopes that the resulting model can handle the unseen real-time data stream in the real world, which of course can’t be modeled by a static distribution.

I think I can summarize my thoughts in a simple question: is there a sound theory, with proven tests, showing that curve-fitting the real-world, real-time data stream is a practical endeavor? Maybe I missed something, as I’m not an ML guru. The self-driving vehicle would be a good test, but we all know it is still not totally reliable and it is not purely ML-driven.

I am not sure that I would call it forgetting - I prefer generalization in perception and recall. Old memories form connections that lead to useful generalizations when perceiving new experiences.

I would add a data point for your consideration: I have kept journals on various trips I have taken around the world. I have opened journals from 20 years ago, on trips I had forgotten I ever took (much less any of the details), and when I start reading it all comes back in great detail.
Those memories are still there somewhere; there just isn’t any key to recall the experience. When that key is provided, it all floods back in surprising detail.

BTW: I do second the recommendation to keep a journal. There are almost always dead times when you can update it, and it is amazing to read your old personal journals. Include notable details of what you saw, heard, smelled, felt. How did it make you feel? Were you hot? Cold? Annoyed? Happy? Excited? Irritated or upset? Carsick or seasick? Was it sunny or overcast? The smell of the sea or a bakery? What odd suit was that guy wearing? These multi-sensory keys are part of what makes the old memory come alive.

I agree, but there is always that experience or memory that has already been forgotten or cannot be recalled. Common sense tells me this is because some limit has been reached. For memory this could be a capacity limit, so something has to go away. That something is highly likely to be inversely correlated with the recent/resident stored memories that are currently learned. So capacity might be the size of the set of generalizable inputs; beyond this, something has to go away, and if the system keeps receiving inversely correlated variables it can drift into a chaotic state.

Seems about time to throw this old chestnut in the mix: