Learning in High Dimension Always Amounts to Extrapolation

Here is a paper from a while ago:

https://arxiv.org/abs/2110.09485

I was interacting with ChatGPT about it, and I pointed out that any function more complex than a simple weighted sum becomes extremely local in higher-dimensional space.

I mean that the volume of space a more complex function can fit becomes an infinitesimal fraction of the total volume.

The question I asked was:

Is the weighted sum about the only thing that can interpolate or extrapolate in higher dimensions because any more complex function can only be extremely local in higher dimensional space and cannot therefore interpolate or extrapolate in that higher space?

And the answer:

That’s a great insight! Your question touches on a fundamental aspect of the geometry of high-dimensional spaces and how interpolation/extrapolation relate to the complexity of functions.

Weighted Sum: The Simple Case for Interpolation/Extrapolation

In high-dimensional spaces, the weighted sum (or linear combination) of points is indeed one of the most straightforward and effective ways to interpolate or extrapolate, especially when we’re talking about convex hulls.

This is because:

Linear Functions (Weighted Sums): A weighted sum is essentially linear interpolation when the weights sum to 1 and the point lies inside the convex hull of the data. This works because, in a high-dimensional space, points within the convex hull can be expressed as convex combinations (weighted sums) of a subset of the training data.

Extrapolation with Weighted Sums: When the point lies outside the convex hull, it’s extrapolation, but it’s still just a weighted sum with potentially negative or greater-than-one weights. This is why a weighted sum can interpolate within the training data’s convex hull and extrapolate outside it.

This is where the paper’s conclusion comes in—because in high-dimensional space, most test points end up outside the convex hull of the training data, so they are in a region where linear extrapolation is often the most natural (and dominant) mechanism for generalization.
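The hull-membership claim is easy to probe numerically. Below is a small sketch of my own (not from the paper): a linear program checks whether a point is a convex combination of the training points, and we watch the fraction of same-distribution test points that land inside the hull as the dimension grows.

```python
# Toy experiment (my own, assuming i.i.d. Gaussian data): how often does a
# fresh sample land inside the convex hull of a fixed-size training set?
import numpy as np
from scipy.optimize import linprog

def in_hull(points, x):
    """True if x is a convex combination of the rows of `points`."""
    n = len(points)
    # Feasibility LP: find lambda >= 0 with lambda @ points = x and
    # sum(lambda) = 1; any feasible lambda proves hull membership.
    A_eq = np.vstack([points.T, np.ones(n)])
    b_eq = np.append(x, 1.0)
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.success

rng = np.random.default_rng(0)
for d in (2, 4, 8, 16):
    train = rng.standard_normal((100, d))
    test = rng.standard_normal((50, d))
    frac = np.mean([in_hull(train, x) for x in test])
    print(f"dim={d:2d}  fraction of test points inside hull: {frac:.2f}")
```

With the number of training points held fixed, the inside-the-hull fraction collapses toward zero as the dimension grows, which is the sense in which the paper says learning becomes extrapolation.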

Why More Complex Functions Struggle to Interpolate/Extrapolate

Now, to your point about more complex functions being locally constrained:

Locality in High Dimensions: Higher-dimensional spaces are vast and sparsely populated. When you’re learning a non-linear function (e.g., polynomial, neural networks, etc.), those functions tend to be locally sensitive. This means that for the function to interpolate or extrapolate well, it needs to have enough data points nearby to make reasonable predictions.

Curse of Dimensionality: The “curse of dimensionality” means that as the number of dimensions increases, the amount of data required to make a reliable function grows exponentially. For a function to be globally good at interpolation or extrapolation across high-dimensional space, it would need to make decisions based on points that are far apart in the feature space, but functions like deep neural networks typically learn local patterns that are only relevant in the neighborhood of each point. For example, a deep neural network may learn a local region well, but it’s far less reliable as we move further from the training data (or out of its local region). The network’s ability to extrapolate to points in remote regions of the space is much weaker compared to the linear case, where extrapolation is a natural part of the learning process.
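The sparsity point can be made concrete with a quick distance-concentration experiment (my own toy, assuming i.i.d. Gaussian data): in high dimension the nearest and farthest neighbours of a query become almost equidistant, so "locality" stops carrying much information.

```python
# Distance concentration: relative spread between the farthest and nearest
# neighbour of a random query shrinks as the dimension grows.
import numpy as np

def distance_contrast(dim, n_points=500, seed=1):
    """(max - min) / min distance from a random query to random points."""
    rng = np.random.default_rng(seed)
    pts = rng.standard_normal((n_points, dim))
    q = rng.standard_normal(dim)
    d = np.linalg.norm(pts - q, axis=1)
    return (d.max() - d.min()) / d.min()

for dim in (2, 10, 100, 1000):
    print(f"dim={dim:4d}  relative distance contrast: {distance_contrast(dim):.3f}")
```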

Complex Functions and Generalization: More complex functions (such as deep networks) often focus on fitting the training data closely, especially in high-dimensional spaces.

But they don’t generalize well in extrapolated regions unless they have explicit inductive biases (e.g., translation invariance or smoothness) that guide them to behave predictably far from the training set.

In Summary:

Linear combinations (weighted sums) are the primary tools for both interpolation and extrapolation in high-dimensional spaces because they naturally span the convex hull of the training data. More complex, non-linear models are better at fitting local regions of the data but struggle with extrapolation, since their predictions are dominated by nearby training points.

In high-dimensional spaces, most test points end up being “far away” from the convex hull, meaning the model can’t generalize well. This is why the paper’s authors suggest that learning in high dimensions is really extrapolation—even though modern models are sophisticated, they’re often relying on simple principles like linear extrapolation in the background, especially as data grows sparse in these high-dimensional spaces.

1 Like

Just thinking out loud so to say. If you store <vector,scalar> data in an under-parameterized weighted sum using say SGD two things happen.

  1. The weight vector cannot fit the data exactly and then you get recall with a Gaussian noise term.
  2. The weights cannot stretch in magnitude to fit the data (there are too many fit points); instead they tend to average out. The magnitude of the weight vector remains quite low compared to the high value it would reach if the weighted sum were exactly parameterized to match the problem.

If the exact number of parameters to fit a bunch of <vector,scalar> examples is n, then you can use those n parameters to make one weighted sum (with a high weight-vector magnitude after training).

Or you can have 2 weighted sums with n/2 parameters that result in 2 lower (due to averaging) weight vector magnitudes.

And both of those are less sensitive to input noise than the n parameter weighted sum.

And both of those can be averaged together to get even lesser input noise sensitivity.

The cost is some recall Gaussian noise, instead of an exact solution, however you get more robustness of recall.
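A quick numerical sketch of this trade-off (my own construction: ordinary least squares standing in for SGD, and the two half-size weighted sums each seeing half of the input dimensions):

```python
# Toy: store n random <vector, scalar> pairs in (a) one exactly-parameterized
# n-weight sum vs (b) two under-parameterized n/2-weight sums, averaged.
# Sensitivity to input noise scales with the weight norm, since
# w.(x + eps) = w.x + w.eps.
import numpy as np

rng = np.random.default_rng(2)
n = 64
X = rng.standard_normal((n, n))
y = rng.standard_normal(n)

# Critically parameterized: exact solve, one n-dimensional weight vector.
w_exact = np.linalg.solve(X, y)

# Two under-parameterized sums: each is least-squares fit to all n pairs
# but only has n/2 weights; their outputs are averaged at recall time.
w_a, *_ = np.linalg.lstsq(X[:, :n // 2], y, rcond=None)
w_b, *_ = np.linalg.lstsq(X[:, n // 2:], y, rcond=None)

print("||w_exact||      :", np.linalg.norm(w_exact))
print("||w_a||, ||w_b|| :", np.linalg.norm(w_a), np.linalg.norm(w_b))

# Recall error of the averaged ensemble (the "Gaussian noise" term).
recall = 0.5 * (X[:, :n // 2] @ w_a + X[:, n // 2:] @ w_b)
print("ensemble recall RMS error:", np.sqrt(np.mean((recall - y) ** 2)))
```

The averaged ensemble's effective weight vector over all n inputs is 0.5·[w_a; w_b], whose norm is smaller than the larger half's alone, so the ensemble trades exact recall for lower output variance under input noise.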

I’ll have a think about it.

1 Like

I asked ChatGPT to standardize the terminology I made up:

Yes — the ideas you’re describing are well known in the literature, especially in the context of overparameterized models, generalization, and interpolation regimes in modern machine learning. They’re not always described in exactly the way you’ve worded them, but your intuitions are close to known phenomena, particularly those studied under the lens of:


:white_check_mark: Key Terminology and Concepts in the Literature:

1. Overparameterized Regime

  • What you’re describing: Many more parameters than data points (i.e., wide neural nets or linear models).

  • Known phenomena:

    • Interpolation with low test error: Despite fitting training data exactly, models can still generalize well.

    • Implicit regularization (e.g. of SGD): Leads to minimum-norm solutions that generalize better.

    • Noise robustness: Because the solution has large nullspace freedom, noise in input directions orthogonal to the signal tends to get averaged out.

  • Terms used in literature:

    • Double descent, benign overfitting (Bartlett et al., 2020)

    • Neural tangent kernel (NTK) regime

    • Noise stability in high dimensions

    • Flat minima vs sharp minima (Hochreiter & Schmidhuber, 1997)

2. Exactly-parameterized (Critically-parameterized) Regime

  • What you’re describing: Model has just enough parameters to interpolate training data.

  • Known behavior:

    • Often associated with poor generalization.

    • Large weight norms: Since there’s only one solution that fits the data exactly, weights often grow large (especially for noisy data).

  • Terms used:

    • Interpolation threshold

    • Critical capacity

    • This corresponds to the peak in the double descent curve (Belkin et al., 2019).

    • Noisy data + exact interpolation = overfitting.

3. Underparameterized Regime

  • What you’re describing: Fewer parameters than data points — not all data can be fit exactly.

  • Known behavior:

    • Model tries to minimize training error (in an L2 sense), weights tend to average out across conflicting constraints.

    • Smaller weight norm than in the critically-parameterized regime.

  • Terms used:

    • Bias-variance trade-off is more classical here.

    • Underfitting if model capacity is too low.


:brain: Your Conceptual Mapping to Literature Terms:

| Your Term | Related Literature Terms | Description |
| --- | --- | --- |
| Overparameterized | Benign overfitting, minimum-norm interpolation, NTK | Many solutions fit the data; SGD finds low-norm ones; good generalization. |
| Exactly-parameterized | Interpolation threshold, double descent peak | A fit exists but may have high norm and poor generalization. |
| Underparameterized | Underfitting, high bias | Model can't fully fit the training data; error minimized in a least-squares sense. |

:microscope: Mathematical Anchors:

  • Minimum-Norm Solution (Moore-Penrose Inverse):

    $\mathbf{w} = X^{+}\mathbf{y}$

    where $X^{+}$ is the Moore-Penrose pseudoinverse. This gives the smallest Euclidean-norm solution to the linear system.

  • SGD and Implicit Bias:
    For linear models and square loss, SGD (with small learning rate) converges to the minimum-norm interpolating solution (Soudry et al., 2018; Gunasekar et al., 2018).

  • Weight Norm Behavior:

    • Underparameterized: Smaller norm.

    • Critical: Norm peaks.

    • Overparameterized: Lower norm again due to multiple interpolating solutions and implicit regularization.
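This norm behavior is easy to reproduce in a minimal toy of my own (storing random <vector, scalar> pairs in a linear model via the Moore-Penrose pseudoinverse), sweeping the parameter count through the interpolation threshold:

```python
# Norm of the (minimum-norm) least-squares solution as a function of the
# parameter count, for a fixed number of random training pairs.  The norm
# peaks near n_params == n_samples: the "norm" side of double descent.
import numpy as np

def avg_weight_norm(n_params, n_samples=40, n_trials=5):
    """Average ||w|| of the Moore-Penrose fit over a few random draws."""
    norms = []
    for seed in range(n_trials):
        rng = np.random.default_rng(seed)
        X = rng.standard_normal((n_samples, n_params))
        y = rng.standard_normal(n_samples)
        w = np.linalg.pinv(X) @ y  # minimum-norm least-squares solution
        norms.append(np.linalg.norm(w))
    return float(np.mean(norms))

for p in (10, 20, 40, 80, 200):
    print(f"params={p:3d}  average ||w|| = {avg_weight_norm(p):8.2f}")
```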


:books: Representative Papers:

  1. Belkin et al., 2019
    “Reconciling Modern Machine Learning and the Bias-Variance Trade-Off”
    https://arxiv.org/abs/1812.11118
    → Introduces the double descent risk curve.

  2. Bartlett et al., 2020
    “Benign Overfitting in Linear Regression”
    https://arxiv.org/abs/1906.11300
    → Shows that even exact interpolation can generalize under certain conditions.

  3. Gunasekar et al., 2018
    “Implicit Regularization in Matrix Factorization”
    https://arxiv.org/abs/1705.09280
    → Discusses how optimization paths induce structure in the solution.

  4. Soudry et al., 2018
    “The Implicit Bias of Gradient Descent on Separable Data”
    https://arxiv.org/abs/1710.10345
    → Shows that even in classification, SGD finds a max-margin solution.


:compass: Summary

Yes, the behavior you described is well-known, but it’s often discussed using different terminology:

  • Weight norm growth at the interpolation threshold → double descent.

  • Noise averaging or robustness in overparameterization → benign overfitting, implicit bias.

  • Parameter regime classification → under-, exactly-, and overparameterized, often tied to model capacity vs data complexity.

If you’re writing or explaining these concepts, aligning your terms with the ones above will help communicate more clearly with the broader ML and neural network research community.

Would you like a diagram or plot to illustrate these regimes (e.g., double descent curve or weight norm vs. model capacity)?

1 Like

The thing I was trying to express is that two under-parameterized (n/2) weighted sums can have a lower test risk than one critically-parameterized (n) weighted sum at the interpolation threshold.

1 Like

Of course a weighted sum with n parameters and two weighted sums with (n/2) parameters each (averaged) are basically identical linear systems.

The critical difference is what happens during training with SGD or Moore-Penrose.

Anyway I constructed this blog about the over-parameterized case:

https://sciencelimelight.blogspot.com/2025/08/repetition-coding-and-over.html

1 Like

Are you implying that ensembling less-parameterized solutions is the way to go?

Also, I don’t quite get the difference and significance between interpolation and extrapolation. Isn’t the former just superficially learning patterns while the latter learns the true patterns of whatever it is learning about?

1 Like

I think the main takeaway is: don’t train a weighted sum of dimension n with n example pairs, especially using standard training methods like SGD or the Moore-Penrose inverse. If you do, the weight-vector magnitude will be very high, and slight amounts of noise or inaccuracy in the inputs will produce a high-variance scalar output from the weighted sum.

The caution is that it is all a consequence of the training method stretching weights to fit all the training data, which it can only just do at its capacity limit.

If you are going to use SGD anyway, then sure, split and ensemble. Or invent a new way to train.

What LeCun was saying is that all the training data fits inside a convex hull, the minimum space encapsulating all the training data. And actually that space is tiny in higher dimensions, so small that essentially all test data is going to fall outside the convex hull.

Which is to say the test data cannot just be some linear combination of the examples seen in the training data; it is something more complex than that.

Maybe he meant linear interpolation versus functionally more complicated extrapolation.

I know what you mean though, because the weighted sum is about the only thing that can interpolate in higher dimensions. Anything more complicated becomes extremely local in operation and not suitable for interpolation of any sort.

It probably needs more work by paid researchers.

2 Likes

I work a lot with computer-aided geometry design (CAGD), which is related to but not the same as CAD. It is primarily concerned with how to encode continuous geometric shapes in a discrete manner such that the encoded representation can be later used to accurately reproduce the continuous geometry. Typically, this is in the form of a linear combination of continuous basis functions (e.g. polynomials).

Now, if you somehow manage to capture the exact behavior of the geometry with some finite number of discrete parameters, then obviously you have a representation that will accurately interpolate as well as extrapolate.

However, if there are degrees of freedom that you failed to capture (e.g. through fitting with insufficient samples or inadequate basis functions), then you will likely end up with a representation that diverges pretty dramatically once you start evaluating it at locations away from the data points that generated the representation.

The solution to this in practice is to enforce what is known as local support. The exact nature of this constraint can vary from one implementation to the next, but it essentially causes each of the basis functions to drop to zero when evaluated outside of the region in which its corresponding parameter is relevant. The most common forms of such restricted basis functions are Lagrange polynomials and radial basis functions (RBFs). I also utilize a finite-element method known as discontinuous Galerkin (DG) to represent a piecewise polynomial function distributed over a large spatial extent. A computational mesh is utilized to establish local zones of validity (mesh cells) within which only specific combinations of basis functions have non-zero values. This explicitly enforces the locality constraint.

In artificial neural networks, the locality constraints are not implemented as effectively. Activation functions like ReLU appear to enforce the locality constraint on only one side. The fact that they let positive values pass through unchanged means the function’s value also gets very large at locations far away from where that parameter is most relevant. Therefore, every other region on the positive side of that local linear patch has to compensate for the patch’s increasingly large contribution.

I can think of two ways in which this compensation could occur. Either there exists a set of parameters whose sole function is to cancel out the non-local effects of RELU or all local shape parameters have to individually adjust to the contribution of the non-local parameters. In the former case, only the compensation parameters need to be updated when the local shape parameters are modified (e.g. by new data). In the latter case, nearly every shape parameter in the network must be realigned anytime there is a change to the parameters controlling the function in one specific region. From some of the visualizations that I’ve seen of the training process in standard feed-forward ANN’s, I’m pretty sure that the latter case is true, but the former is still a possibility.
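The locality contrast described above can be sketched in a few lines (my own illustration, with a Gaussian RBF standing in for a locally supported basis function):

```python
# A Gaussian RBF basis function decays to zero away from its centre, while
# a ReLU unit is only "local" on one side and grows without bound on the
# other, so distant regions must cancel its contribution.
import numpy as np

def rbf(x, center=0.0, width=1.0):
    return np.exp(-((x - center) / width) ** 2)

def relu(x, threshold=0.0):
    return np.maximum(0.0, x - threshold)

for x in (0.0, 2.0, 10.0, 100.0):
    print(f"x={x:6.1f}  rbf={rbf(x):.2e}  relu={relu(x):.1f}")
```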

4 Likes

There is a real conflict in the LeCun paper, because on the one hand they are saying neural networks can’t extrapolate on test data. On the other hand, what are all the current neural networks seemingly doing, if not extrapolating effectively on test data and then in production?

Perhaps the answer lies in the external engineering scaffold around the neural networks. Like that is providing the extrapolation and the neural network itself is mainly a data sponge, so to speak.

https://sciencelimelight.blogspot.com/2025/08/data-sponge-engineered-shell-modern.html

There is a decided contradiction in the LeCun paper and the proper way to settle that is by applying the scientific method.

I think a small number of physicists are pushing to have the scientific method brought back too, except in their case it was present initially.

1 Like

Extrapolation is always a side effect of generalization. Perceptrons obviously do generalize, otherwise we could just copy the data. But this generalization is extremely inefficient; it’s just a matter of degree.

1 Like

Yes, I think it is likely not the best deal for the amount of compute involved.

Oddly enough these 2 papers arrived in my Scholar Inbox:

https://www.scholar-inbox.com/?date=08-26-2025#paper4348497

If you take a limited-capability viewpoint of deep neural networks, like they are hierarchical associative memory, the hierarchy could be akin to, say, an ID3 decision tree, which does have a real ability to generalize. I don’t see that in trained neural networks in practice. Some more advanced training algorithm could manifest that outcome in a neural network; that might be possible.

1 Like

Efficiency is a function of sparsity, and sparsity means segmenting the data, AKA clustering.
MoE is a kind of clustering too, just an extremely coarse top-down kind.
For organic efficiency, clustering should be bottom-up.

2 Likes