For decades, the conventional wisdom in statistical learning theory was clear: a model with more parameters than training samples will overfit. Complexity must be penalized. Regularization is mandatory. Then deep learning arrived and broke every rule in the textbook — networks with millions of parameters trained to zero training loss somehow generalize beautifully to unseen data. This article explores why.
The Classical View and Why It Broke Down
Traditional learning theory frames generalization through the lens of VC dimension and Rademacher complexity. The core idea is that a hypothesis class with high capacity can shatter arbitrary labels, so its generalization gap — the difference between training and test error — grows with model complexity. The prescription is simple: keep your model small, or explicitly regularize.
This framework worked well for SVMs, decision trees, and shallow neural networks. Then practitioners started training ResNets, Transformers, and GPT-class models — all wildly overparameterized — and observed that these models, despite fitting training data perfectly, generalized to held-out data with remarkable accuracy. Classical theory predicted catastrophe. Reality delivered state-of-the-art performance.
The phenomenon was formalized by Belkin et al. (2019) through the concept of the double descent curve. As model complexity increases, test error first follows the classical U-shaped bias-variance curve, peaks at the interpolation threshold (where the model just barely fits training data), and then — counterintuitively — decreases as the model becomes increasingly overparameterized. This is not a quirk. It is a fundamental property of modern ML systems.
Implicit Regularization: The Inductive Bias of Gradient Descent
If overparameterized models aren’t explicitly regularized, something else must be constraining their solutions. The answer lies in the optimization algorithm itself.
Consider a linear regression problem where the number of parameters p far exceeds the number of samples n. There are infinitely many parameter vectors θ that achieve zero training loss. Gradient descent initialized at the origin doesn’t just find any solution — it finds the minimum L2-norm solution. This is provable: gradient descent on squared loss with zero initialization traces a path through parameter space that stays in the row space of the data matrix, which corresponds exactly to the minimum-norm interpolant.
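This claim is easy to verify numerically. A minimal numpy sketch (the dimensions and learning rate are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                          # far more parameters than samples
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

theta = np.zeros(p)                     # gradient descent initialized at the origin
lr = 1e-2
for _ in range(20000):
    theta -= lr * X.T @ (X @ theta - y) / n

theta_min_norm = np.linalg.pinv(X) @ y  # minimum-L2-norm interpolant
print(np.allclose(theta, theta_min_norm, atol=1e-4))   # prints True
```

The iterates never leave the row space of X (every gradient is a linear combination of the rows), and the unique interpolant inside that row space is exactly the pseudoinverse solution.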
This is implicit regularization. The algorithm’s geometry imposes structure on the solution even without an explicit penalty term.
For neural networks, the story is richer and less fully understood, but the evidence is compelling:
- Stochastic Gradient Descent (SGD) with small batch sizes finds flatter minima than full-batch gradient descent. Flatter minima — measured by the sharpness of the loss landscape around the solution — correlate empirically with better generalization. This is the sharpness-aware minimization intuition, formalized by Foret et al. (2021).
- Early stopping acts as implicit L2 regularization in function space. The number of optimization steps effectively controls the complexity of the learned function.
- The Edge of Stability phenomenon (Cohen et al., 2021) shows that during training, the largest eigenvalue of the Hessian (the sharpness) hovers just above the stability threshold of gradient descent. The optimizer is perpetually on the edge of instability, which may itself contribute to finding generalization-friendly solutions.
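The early-stopping point can be seen directly in a toy overparameterized regression (an illustrative numpy sketch; the problem sizes are arbitrary): starting from zero, the norm of the gradient-descent iterate grows monotonically toward that of the minimum-norm interpolant, so stopping earlier selects a smaller-norm, more constrained function.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 300
X = rng.standard_normal((n, p))
w_true = np.zeros(p)
w_true[:5] = 2.0                                  # sparse ground-truth signal
y = X @ w_true + rng.standard_normal(n)           # noisy labels

theta = np.zeros(p)
lr = 1e-3
norms = []
for _ in range(5000):
    theta -= lr * X.T @ (X @ theta - y) / n       # full-batch gradient descent
    norms.append(np.linalg.norm(theta))

min_norm = np.linalg.norm(np.linalg.pinv(X) @ y)  # norm of the full interpolant
print(norms[99], norms[999], norms[-1], min_norm)
```

The step count plays the role of an inverse regularization strength: fewer steps, smaller norm, simpler function.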
The Neural Tangent Kernel: Linearizing the Deep Net
One of the most elegant theoretical tools to emerge in recent years is the Neural Tangent Kernel (NTK), introduced by Jacot, Gabriel, and Hongler (2018).
Consider a neural network f(x; θ) with parameters θ ∈ ℝᵖ. The NTK is defined as:
K(x, x') = ⟨∇_θ f(x; θ), ∇_θ f(x'; θ)⟩
In the infinite-width limit, a remarkable thing happens: under the standard NTK parameterization, the kernel converges to a deterministic limit at initialization and remains constant throughout training. This means training an infinitely wide neural network is equivalent to kernel regression with the NTK, a convex problem with a closed-form solution.
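The finite-width kernel can be computed by hand for a one-hidden-layer ReLU network, and the concentration can be checked empirically: at small width the kernel at initialization varies noticeably between random draws, while at large width two independent initializations produce nearly the same kernel. (An illustrative numpy sketch; the architecture and widths are arbitrary choices.)

```python
import numpy as np

def empirical_ntk(X, width, seed):
    """Empirical NTK of f(x) = a . relu(W x) / sqrt(width) at a random init,
    with gradients taken with respect to both W and a."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((width, X.shape[1]))
    a = rng.standard_normal(width)
    pre = X @ W.T                          # pre-activations, shape (n, width)
    act = np.maximum(pre, 0)               # relu
    ind = (pre > 0).astype(float)          # relu derivative
    # Gradient wrt a contributes act @ act.T / width; gradient wrt row j of W
    # is a_j * relu'(pre_j) * x / sqrt(width), giving the second term.
    K_a = act @ act.T / width
    K_W = ((ind * a) @ (ind * a).T) * (X @ X.T) / width
    return K_a + K_W

X = np.random.default_rng(0).standard_normal((8, 3))
narrow = [empirical_ntk(X, 100, s) for s in (1, 2)]
wide = [empirical_ntk(X, 20000, s) for s in (1, 2)]
rel = lambda A, B: np.linalg.norm(A - B) / np.linalg.norm(A)
print(rel(*narrow), rel(*wide))   # wide-network kernels agree far more closely
```

The relative disagreement shrinks roughly like one over the square root of the width, consistent with the kernel becoming deterministic in the limit.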
This gives us a tractable theoretical lens: the generalization of infinitely wide networks is governed by the smoothness of the target function in the RKHS (Reproducing Kernel Hilbert Space) induced by the NTK. Functions that are smooth with respect to this kernel are learned efficiently; irregular functions are not.
The practical implication is subtle but important. The NTK regime tells us that networks learn a specific kind of interpolant — one that is smooth in the geometry defined by the network architecture and initialization. This is the “implicit prior” that drives generalization.
Of course, finite-width networks and those trained with large learning rates move out of the NTK regime, entering what is sometimes called the feature learning regime, where the network’s internal representations evolve meaningfully during training. Much of the ongoing theoretical work in deep learning is about understanding this transition.
Benign Overfitting: When Interpolation Doesn’t Hurt
A particularly striking theoretical result comes from Bartlett, Long, Lugosi, and Tsigler (2020): benign overfitting. They prove, for linear regression in high dimensions, that a model can interpolate noisy training data — fitting the noise perfectly — and still achieve near-optimal test error.
The intuition is geometric. In high dimensions, the data covariance effectively splits parameter space into a low-dimensional “signal” subspace (the top eigendirections) and a high-dimensional “noise” subspace. The minimum-norm interpolant fits the signal with low bias, while the noise is absorbed into the high-dimensional complement, where it is diluted across many directions and contributes negligible variance to predictions on new points.
For this to work, two conditions must hold:
- High effective rank of the data covariance — the input distribution must spread mass across many directions.
- Sufficient overparameterization — the model needs enough capacity to separate signal from noise geometrically.
These conditions are frequently met in practice: natural-language tokens live in high-dimensional embedding spaces, and the covariance of natural images has a long-tailed spectrum that spreads variance across many directions. This is part of why deep learning works so well on real-world data distributions despite the apparent paradox.
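The effect can be reproduced in a small simulation (an illustrative numpy sketch using a spiked covariance, not the paper's exact construction): the minimum-norm interpolant drives training error to zero on noisy labels, yet its test error stays far below that of the trivial always-predict-zero baseline.

```python
import numpy as np

def benign_demo(seed, n=100, p=2000, k=5, noise=0.5):
    """Min-norm interpolation of noisy labels under a spiked covariance."""
    rng = np.random.default_rng(seed)
    lam = np.full(p, 0.01)
    lam[:k] = 1.0                                   # few strong directions, long high-rank tail
    theta = np.zeros(p)
    theta[:k] = 1.0                                 # signal lives in the strong directions
    X = rng.standard_normal((n, p)) * np.sqrt(lam)
    y = X @ theta + noise * rng.standard_normal(n)  # noisy training labels
    theta_hat = np.linalg.pinv(X) @ y               # min-norm interpolant
    X_te = rng.standard_normal((2000, p)) * np.sqrt(lam)
    y_te = X_te @ theta                             # clean test targets
    train_mse = np.mean((X @ theta_hat - y) ** 2)
    test_mse = np.mean((X_te @ theta_hat - y_te) ** 2)
    null_mse = np.mean(y_te ** 2)                   # error of always predicting zero
    return train_mse, test_mse, null_mse

train, test, null = benign_demo(0)
print(train, test, null)   # training noise is fit exactly, yet test error stays small
```

The noise-fitting coefficients land on the many weak tail directions, where test inputs have tiny magnitude, so fitting the noise costs almost nothing at prediction time.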
Practical Takeaways for Practitioners
Theory and practice don’t always travel together, but these insights offer concrete guidance:
1. Don’t underparameterize. If your model sits near the interpolation threshold, you’re in the worst of both worlds — high variance without the benign-overfitting benefits of true overparameterization. When in doubt, scale up before adding explicit regularization.
2. Learning rate is a regularizer. Larger learning rates in SGD tend to find flatter minima with better generalization. Don’t rush to lower it at the first sign of instability.
3. Architecture encodes a prior. The NTK perspective means your choice of architecture isn’t just about expressivity — it defines what kinds of functions are “natural” to your model. Convolutional architectures impose translation equivariance; Transformers impose a permutation-equivariant attention prior. Choose architecture to match your problem’s symmetries.
4. Batch size interacts with generalization. Large-batch training tends to find sharper minima (Keskar et al., 2017). If you’re scaling batch size for throughput, compensate with learning rate warmup and potentially explicit sharpness penalties.
5. The loss landscape matters more than the loss value. A model at training loss = 0.001 in a sharp basin may generalize worse than one at 0.01 in a flat basin. Tools like SAM (Sharpness-Aware Minimization) explicitly optimize for this.
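Point 5 can be made concrete. Below is a minimal sketch of the SAM update rule (a simplified illustration, not the reference implementation; `sam_step` and `grad_fn` are hypothetical names, and gradients are supplied by the caller rather than by autodiff):

```python
import numpy as np

def sam_step(theta, grad_fn, lr=0.1, rho=0.05):
    """One step of Sharpness-Aware Minimization (Foret et al., 2021).
    grad_fn(theta) must return the loss gradient at theta; rho is the
    perturbation radius."""
    g = grad_fn(theta)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascend to the local worst case
    g_worst = grad_fn(theta + eps)               # gradient at the perturbed point
    return theta - lr * g_worst                  # descend using the worst-case gradient

# Toy usage on f(theta) = 0.5 * ||theta||^2, whose gradient is simply theta:
theta = np.array([2.0, -1.0])
for _ in range(300):
    theta = sam_step(theta, lambda t: t)
print(theta)   # driven close to the minimum at the origin
```

The key design choice is that the descent direction is taken at the perturbed point theta + eps, so the optimizer minimizes the loss in a neighborhood rather than at a single point, penalizing sharp basins.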
Where the Field Is Heading
The theoretical frontier is moving fast. Some active directions worth watching:
- Mechanistic interpretability: Rather than studying generalization at the macro level, researchers are reverse-engineering the specific algorithms neural networks implement internally — induction heads in Transformers, modular arithmetic circuits in MLP layers, etc.
- Grokking: Power et al. (2022) and others have documented cases where networks first memorize training data and then, after further training, “grok” the underlying structure, suddenly jumping from near-zero to near-perfect test accuracy. This suggests generalization can happen through phase transitions rather than gradual improvement.
- Information-theoretic bounds: PAC-Bayes theory, tightened with data-dependent priors, is currently producing the tightest non-vacuous generalization bounds for practical deep networks — a significant improvement over classical VC-based bounds.
Conclusion
The question “why do neural networks generalize?” remains one of the deepest open problems in machine learning. But the last five years have delivered genuine progress: we understand that gradient descent implicitly regularizes, that overparameterization can be beneficial in high dimensions, that the geometry of the loss landscape matters as much as its value, and that network architecture encodes strong inductive priors.
Classical statistical learning theory gave us the right questions. Modern deep learning theory is slowly, rigorously providing the answers — and those answers are turning out to be far more interesting than anyone expected.
Further reading: Belkin et al., “Reconciling Modern Machine Learning Practice and the Classical Bias-Variance Trade-off” (PNAS 2019); Jacot et al., “Neural Tangent Kernel: Convergence and Generalization in Neural Networks” (NeurIPS 2018); Bartlett et al., “Benign Overfitting in Linear Regression” (PNAS 2020); Cohen et al., “Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability” (ICLR 2021); Power et al., “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets” (arXiv 2022).
