Question
Why is ML fixated on the concept of generalization when constraint variables (e.g., weights in an ANN) can be inversely correlated? If they are, generalization becomes a big question: the model's best performance will only be a local minimum, and otherwise it will underfit or overfit.
Thoughts
During training, the state of a model evolves from S1 to SN, where N is the final state index. SN is the final model state, so it has presumably reached an optimal configuration in which the objective function is maximized or minimized.
In fact, reaching SN can cause forgetting of variables that were relevant in previous states, so inputs or features that those earlier states handled well may no longer be familiar to the final model. The toy example below illustrates this.
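A minimal sketch of that effect, using a hypothetical two-task setup (a tiny linear model trained first on task A, then on task B) rather than anything from an actual system:

```python
# Minimal sketch (hypothetical setup): a tiny linear model trained first on
# task A, then on task B, illustrating how the final state S_N can "forget"
# what earlier states S_1..S_k represented well.
import numpy as np

rng = np.random.default_rng(0)

def make_task(true_w):
    X = rng.normal(size=(200, 2))
    y = X @ true_w
    return X, y

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def sgd(w, X, y, lr=0.05, steps=300):
    for _ in range(steps):
        i = rng.integers(len(X))
        grad = 2 * (X[i] @ w - y[i]) * X[i]  # gradient of squared error
        w = w - lr * grad                    # smooth, tiny update
    return w

X_a, y_a = make_task(np.array([1.0, -2.0]))   # task A (earlier states)
X_b, y_b = make_task(np.array([-3.0, 0.5]))   # task B (later states)

w = np.zeros(2)            # S_1: initial state
w = sgd(w, X_a, y_a)       # intermediate state: fits task A
print("after A -> loss on A:", mse(w, X_a, y_a))
w = sgd(w, X_b, y_b)       # final state S_N: fits task B
print("after B -> loss on A:", mse(w, X_a, y_a), "(forgotten)")
print("after B -> loss on B:", mse(w, X_b, y_b))
```

After the second phase, the weights that fit task A have been overwritten, so the loss on A climbs back up even though an earlier state handled it well.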
I think one of the reasons deep learning, or ANNs in general, works is that it can employ a vast number of these constraint variables, and its updates or reconfigurations are relatively smooth (tiny values), so inverse correlations might not directly cause a problem or a big side effect. In HTM, inverse correlations are a problem because it updates itself relatively fast (it can use bigger increment values), but they are mitigated by boosting and inhibition. A toy contrast of update magnitudes is sketched below.
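Purely as an illustration of that contrast (the numbers and update rule are hypothetical, and none of HTM's actual boosting or inhibition machinery is modeled):

```python
# Minimal sketch (hypothetical): one weight pulled in opposite directions by
# two "inversely correlated" targets. Tiny updates settle on a compromise;
# large increments (absent a mitigating mechanism) keep oscillating.
def train(lr, steps=1000):
    w = 0.0
    for t in range(steps):
        target = 1.0 if t % 2 == 0 else -1.0  # opposing pulls
        w += lr * (target - w)                # simple delta-rule update
    return w

print("tiny updates (lr=0.01):", train(0.01))  # hovers near 0, a compromise
print("large updates (lr=0.9):", train(0.9))   # swings toward the last target
```

With tiny steps the weight settles near a compromise between the two opposing pulls; with large steps it keeps swinging toward whichever pull it saw last, which is the kind of instability that boosting and inhibition would have to absorb.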
The problem arises when there are so many inversely correlated variables that a single model, specifically the final state of an ANN, models them impractically. I'm thinking of the models built by the likes of Google: they take up so much space and compute power.
Ensembles of States
Why not instead ensemble models trained on different input sequences, where each model represents a previous "healthy configuration" from some training iteration? I would view these configurations as, for example, points where gradient descent reached a local minimum. The ensemble can also be inspired by quantum superposition, where each model corresponds to a state. The final decision is then a consensus of these states, or perhaps some function that predicts which state is most probable for a given input. Additionally, with this ensemble, the prediction task can also focus on "which state is most likely to predict this input effectively" given an unknown input. A minimal sketch follows below.
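Here is a minimal sketch of that idea. Everything in it is hypothetical: the gating rule, a softmax over distances to each phase's mean input, is just one stand-in for "which state is most probable at a certain input".

```python
# Minimal sketch (all names hypothetical) of an "ensemble of states": keep
# snapshots of the weights from different training phases, and at prediction
# time weight each snapshot by how "probable" the input looks under the data
# that snapshot was trained on.
import numpy as np

rng = np.random.default_rng(1)

def sgd(w, X, y, lr=0.01, steps=500):
    for _ in range(steps):
        i = rng.integers(len(X))
        w = w - lr * 2 * (X[i] @ w - y[i]) * X[i]
    return w

# Two training phases with different input distributions (toy stand-ins for
# "different input sequences").
X_a = rng.normal(loc=+2.0, size=(200, 2)); y_a = X_a @ np.array([1.0, -2.0])
X_b = rng.normal(loc=-2.0, size=(200, 2)); y_b = X_b @ np.array([-3.0, 0.5])

snapshots = []                  # (weights, mean input) per "healthy" state
w = np.zeros(2)
for X, y in [(X_a, y_a), (X_b, y_b)]:
    w = sgd(w, X, y)
    snapshots.append((w.copy(), X.mean(axis=0)))

def ensemble_predict(x):
    # Gate: softmax over negative distance to each state's mean input,
    # i.e. "which state is most probable for this input".
    dists = np.array([np.linalg.norm(x - mu) for _, mu in snapshots])
    gate = np.exp(-dists); gate /= gate.sum()
    preds = np.array([x @ w_s for w_s, _ in snapshots])
    return float(gate @ preds)  # consensus = gate-weighted vote

x = rng.normal(loc=+2.0, size=2)  # looks like phase-A data
print("ensemble:", ensemble_predict(x),
      "true:", float(x @ np.array([1.0, -2.0])))
```

The design choice here is that each snapshot is kept frozen at its "healthy configuration", so the phase-A state is never overwritten by phase-B training; the gate then routes a phase-A-like input mostly to the phase-A snapshot instead of asking one final model to cover both.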
What are your thoughts?