Generalization Gap in Over‑Parameterized Models

Textbook bias–variance intuition says that worst-case test error should eventually rise once model capacity overtakes the sample size: the variance term in the bound keeps growing after the bias term has already bottomed out (see Vapnik and Chervonenkis, 1971; Bartlett and Mendelson, 2002). Those worst-case guarantees scale like $\sqrt{\text{capacity}/n}$ and become numerically vacuous when a network’s parameter count is much larger than $n$.

Yet networks still generalize. Zhang et al. (2017) showed that a deep CNN can memorize completely random labels when trained to zero loss, so its capacity far exceeds the sample complexity of the task; the same architecture, trained the same way on real labels, generalizes well. In ridgeless linear regression, test error falls again beyond the interpolation point (Hastie et al. 2019), and the same rise-then-fall pattern, dubbed double descent, appears across kernels and deep networks (Belkin et al. 2019). Together these results show that generalization hinges on the specific solution the optimizer selects, not on capacity alone. Over-parameterization is only a facilitating condition: it makes many zero-loss solutions available, and whether the one found generalizes depends on the training algorithm and the data.

Decomposing the realized generalization gap

For the weights $\hat{\theta}$ produced by one training run, define

$$\Delta = L_{\text{pop}}(\hat{\theta}) - L_{\text{train}}(\hat{\theta})$$

Introduce the best achievable training and population losses in the same function class $\mathcal{F}$:

$$ L_{\text{train}}^{*} = \min_{\theta \in \mathcal{F}} L_{\text{train}}(\theta) \quad \text{and} \quad L_{\text{pop}}^{*} = \min_{\theta \in \mathcal{F}} L_{\text{pop}}(\theta) $$

Then,

$$ \Delta = \underbrace{L_{\text{pop}}^{*} - L_{\text{train}}^{*}}_{\text{sampling error}} \;-\; \underbrace{L_{\text{train}}(\hat{\theta}) - L_{\text{train}}^{*}}_{\text{under-optimization}} \;+\; \underbrace{L_{\text{pop}}(\hat{\theta}) - L_{\text{pop}}^{*}}_{\text{implicit bias}} $$

  • Sampling error decays like $n^{-1/2}$ under i.i.d. sampling and coincides with the classical "variance" only for squared-error losses.
  • Under-optimization captures leftover training loss. It enters with a minus sign: a partially optimized model has a training loss closer to its test loss, so residual training loss narrows the measured gap even though it raises test error itself. The term vanishes once training reaches (near) zero error.
  • Implicit bias captures the optimizer's preference among zero-loss solutions.
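
A quick sanity check with made-up numbers: if $L_{\text{train}}^{*} = 0$, $L_{\text{train}}(\hat{\theta}) = 0.02$, $L_{\text{pop}}^{*} = 0.05$, and $L_{\text{pop}}(\hat{\theta}) = 0.12$, then sampling error contributes $0.05$, under-optimization subtracts $0.02$, and implicit bias contributes $0.07$, recovering $\Delta = 0.12 - 0.02 = 0.10$.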

Classical uniform-convergence bounds must hold simultaneously for every function in $\mathcal{F}$, so they can only upper-bound the total gap $\Delta$, and once the parameter count dwarfs the sample size that worst-case guarantee becomes vacuous. To obtain meaningful guarantees, we must build the learning algorithm into the analysis. Modern algorithm-aware frameworks such as algorithmic stability (Bousquet and Elisseeff 2002) and PAC-Bayes/compression bounds (Dziugaite and Roy 2017) remain non-vacuous even when $|\theta| \gg n$ because they track how the optimizer and the data interact.

Note: Approximation error (the distance from $L_{\text{pop}}^{*}$ to the Bayes risk) affects the level of test error but cancels out of the difference, so it cannot create the gap.

Measuring each part

Sampling error. Estimate with:

  1. Bootstrap: Resample training data, run short retrains, convert logit variance on a probe set into excess risk.
  2. Learning-curve fit: Train on 25%, 50%, and 100% of the data; fit $L_{\text{pop}}(m) = \alpha m^{-\beta} + c$, then evaluate $\alpha n^{-\beta}$ (see the sketch after this list). Jackknife-plus or CV-plus offer cheaper alternatives at large scale.
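
Below is a minimal sketch of the learning-curve fit in step 2, assuming you have already trained on a few nested subsets and recorded a validation-loss estimate for each; the subset sizes and losses are illustrative placeholders, not real measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(m, alpha, beta, c):
    # L_pop(m) ≈ alpha * m^(-beta) + c
    return alpha * m ** (-beta) + c

# Hypothetical measurements: validation loss after training on 25%, 50%, 100% of n = 50,000.
subset_sizes = np.array([12_500.0, 25_000.0, 50_000.0])   # m
val_losses   = np.array([0.92, 0.71, 0.58])               # estimates of L_pop(m)

params, _ = curve_fit(power_law, subset_sizes, val_losses,
                      p0=(100.0, 0.7, 0.3), maxfev=10_000)
alpha, beta, c = params

n = subset_sizes[-1]
# alpha * n**(-beta) estimates the sampling-error contribution at the full sample size;
# c estimates the loss floor you would approach with unlimited data.
print(f"sampling-error estimate at n={int(n)}: {alpha * n ** (-beta):.3f}")
print(f"asymptotic loss floor c: {c:.3f}")
```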

If this term dominates, consider acquiring more data, improving labels, or incorporating semi-supervised techniques.

Under-optimization. From the final checkpoint, run a handful of L-BFGS or Sharpness-Aware Minimization (SAM) steps; any further drop in training loss lower-bounds the term (a sketch follows below). On typical supervised vision benchmarks it is effectively zero; on language, speech, or reinforcement-learning tasks it can be substantial. Where it is large, early stopping shrinks the measured train/test gap at the cost of leaving some training loss on the table.
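
The probe below is a minimal PyTorch sketch of that check, assuming `model` is the final checkpoint, `train_loader` yields training batches, and `loss_fn` is the training criterion; all three names are placeholders for your own objects, and the drop is measured on a single large batch rather than the full training set.

```python
import copy
import torch

def remaining_train_loss_drop(model, train_loader, loss_fn, n_steps=20):
    """Lower-bound the under-optimization term by trying to push training loss further down."""
    probe = copy.deepcopy(model)                     # never touch the real checkpoint
    xb, yb = next(iter(train_loader))                # one large, fixed training batch
    optimizer = torch.optim.LBFGS(probe.parameters(), max_iter=n_steps,
                                  line_search_fn="strong_wolfe")

    def closure():
        optimizer.zero_grad()
        loss = loss_fn(probe(xb), yb)
        loss.backward()
        return loss

    with torch.no_grad():
        before = loss_fn(probe(xb), yb).item()
    optimizer.step(closure)                          # up to n_steps L-BFGS iterations
    with torch.no_grad():
        after = loss_fn(probe(xb), yb).item()

    # Any positive drop certifies at least this much leftover training loss.
    return max(before - after, 0.0)
```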

Implicit bias. Driving training loss to (nearly) zero does not mean a model can settle on just any interpolating solution. Two nested forces steer the final destination, and together they constitute the implicit-bias term in the decomposition above.

First comes representation bias—the notion of simplicity that is hard-wired into the model’s parameterization. In linearly separable logistic regression, expressing the classifier as a free weight vector makes the minimum-norm solution coincide with the familiar maximum-margin separator (Soudry et al., 2018). In linear convolutional networks, the parameters align with Fourier modes, so gradient descent automatically favors low-frequency filters (Gunasekar et al., 2018). In deep matrix-factorization problems, the usual “skinny factor” coordinates turn low nuclear rank into low Euclidean norm, leading gradient descent to the minimum-rank interpolator (Razin and Cohen 2021). Change the coordinates—whiten the inputs, rescale the factors—and the set of “simple” solutions shifts with them.

Layered on top is optimizer bias—extra preferences introduced by the training algorithm once the loss is already near zero. On a linearly separable dataset, stochastic gradient descent with a decaying learning rate converges to the max-$\ell_{2}$-margin classifier (Soudry et al., 2018), whereas AdaGrad (Nacson and Srebro, 2019) and Adam with decoupled weight decay under the same diminishing-step schedule converge to the max-$\ell_{\infty}$-margin classifier (Zhang, Zou, and Cao, 2024). Batch size, momentum, noise injection, and learning-rate schedules all tilt the search toward sharper or flatter regions of the loss surface.

Because these two layers interact, it is most practical to diagnose their sum. Sharp Hessian spectra, large weight norms, long PAC-Bayes code lengths, or high prediction variance across random seeds all signal a large implicit-bias term; one such diagnostic is sketched below. (Because sharpness is not invariant under re-parameterization (Dinh et al. 2017), prefer scale-invariant alternatives when possible.)
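
The sketch below estimates one of those diagnostics, the top Hessian eigenvalue of the training loss, via power iteration on Hessian-vector products; `model`, `loss_fn`, and the probe batch `(xb, yb)` are assumed to already exist, and because of the re-parameterization caveat above the number is best compared only across runs of the same architecture.

```python
import torch

def top_hessian_eigenvalue(model, loss_fn, xb, yb, iters=20):
    """Estimate the largest Hessian eigenvalue of the loss at the current weights."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(xb), yb)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Start power iteration from a random unit vector in parameter space.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x ** 2).sum() for x in v))
    v = [x / norm for x in v]

    eig = 0.0
    for _ in range(iters):
        # Hessian-vector product: differentiate the gradient in the direction v.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        eig = sum((h * x).sum() for h, x in zip(hv, v)).item()   # Rayleigh quotient
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]
    return eig
```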

If those metrics are high, apply flatness-promoting changes. Smaller mini-batches or injected gradient noise coax SGD toward flatter basins. Weight decay penalizes large norms directly. Sharpness-Aware Minimization (SAM), stochastic weight averaging (SWA) and its variants, and Entropy-SGD enlarge the neighborhood over which the solution must remain low-loss, thereby selecting broader minima. Techniques such as ghost-batch normalization stabilize activations against perturbations. All of these interventions leave training loss essentially unchanged yet systematically pull the network toward solutions with a smaller implicit-bias term, shrinking the realized generalization gap $\Delta$. A sketch of a single SAM update follows below.
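
The sketch below shows one such intervention, a single Sharpness-Aware Minimization update wrapped around an ordinary optimizer, following the two-step ascent/descent scheme of Foret et al. (2021); `model`, `loss_fn`, `base_optimizer`, and the batch `(xb, yb)` are placeholders for your own training objects.

```python
import torch

def sam_step(model, loss_fn, xb, yb, base_optimizer, rho=0.05):
    """One SAM update: ascend to the worst nearby point, then descend from there."""
    params = [p for p in model.parameters() if p.requires_grad]

    # 1) Ascent: perturb the weights toward the worst case within an L2 ball of radius rho.
    loss = loss_fn(model(xb), yb)
    grads = torch.autograd.grad(loss, params)
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    eps = [rho * g / grad_norm for g in grads]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)

    # 2) Descent: take the gradient at the perturbed point, undo the perturbation,
    #    and let the base optimizer apply that gradient to the original weights.
    base_optimizer.zero_grad()
    loss_fn(model(xb), yb).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    base_optimizer.step()
    return loss.item()
```

In practice SAM wraps the existing optimizer and learning-rate schedule; the extra forward/backward pass roughly doubles the cost of each step.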
