More is More: Double Descent and HTE
Since the last essay, I have been percolating on whether double descent also means smaller standard errors for HTE.
Estimating heterogeneous treatment effects involves two goals: accurate out-of-sample prediction of $\hat{\tau}(x)$ and valid inference for functionals like average treatment effects and policy values. Standard practice constrains model complexity through regularization, honest splitting, or careful cross-validation to avoid overfitting (Künzel et al., 2019; Wager & Athey, 2018; Athey & Imbens, 2016). Recent work documents the double descent phenomenon, where test error decreases with model capacity, spikes near the interpolation threshold, then decreases again in the overparameterized regime (Belkin et al., 2019; Nakkiran et al., 2021; Hastie et al., 2022). This second descent occurs because minimum-norm interpolators, which arise naturally from ridgeless regression, generalize well despite fitting training data perfectly (Bartlett et al., 2020).
For estimating CATE, this matters when we have unbiased targets for $\tau(X)$, such as the transformed outcome $Y^* = Y\left(W/p - (1-W)/(1-p)\right)$ in RCTs or AIPW pseudo-outcomes in observational settings. With these unbiased labels, estimating $\tau(x)$ becomes a supervised learning problem where double descent applies: very wide models can reduce approximation error, while the implicit regularization of minimum-norm solutions controls variance. This suggests we may be unnecessarily limiting model capacity in econometric applications.
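To make the labels concrete, here is a minimal NumPy sketch of both constructions. The function names and the nuisance arguments (`mu0_hat`, `mu1_hat`, `e_hat`) are illustrative, not taken from the simulation code.

```python
import numpy as np

def transformed_outcome(Y, W, p):
    """RCT transformed outcome: E[Y* | X] = tau(X) when p is the true treatment probability."""
    return Y * (W / p - (1 - W) / (1 - p))

def aipw_pseudo_outcome(Y, W, mu0_hat, mu1_hat, e_hat):
    """AIPW pseudo-outcome: conditionally unbiased for tau(X) when the nuisances are consistent,
    and robust to misspecifying either the outcome model or the propensity (not both)."""
    return (mu1_hat - mu0_hat
            + W * (Y - mu1_hat) / e_hat
            - (1 - W) * (Y - mu0_hat) / (1 - e_hat))
```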
Influence-function inference remains valid regardless of predictor complexity (Chernozhukov et al., 2018; Kennedy, 2020). When we compute doubly-robust scores and their empirical variance on held-out data, the resulting standard errors maintain proper coverage whether the underlying nuisance functions come from small or overparameterized models, provided the nuisances are estimated consistently. This separation between representation and inference is key: we can use width to approximate $\tau(x)$ more accurately while maintaining econometric rigor through orthogonal scores.
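The inference step itself is just the empirical mean and variance of the held-out scores. A minimal sketch, assuming the scores `psi` were built from nuisances fit on a separate training fold:

```python
import numpy as np

def ate_and_se(psi):
    """ATE estimate and standard error from doubly-robust scores psi evaluated on held-out data.
    The formula does not depend on how wide the nuisance models were, only that they were fit
    on a separate fold and estimated consistently."""
    n = len(psi)
    return psi.mean(), psi.std(ddof=1) / np.sqrt(n)

# Usage (names hypothetical): scores built with nuisances fit on the training fold only.
# psi_test = aipw_pseudo_outcome(Y_test, W_test, mu0_hat_test, mu1_hat_test, e_hat_test)
# ate_hat, se_hat = ate_and_se(psi_test)
```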
We demonstrate this through a simulation with strict data separation: $n=400$ training, $n=2500$ validation, and $n=4000$ test samples. The true treatment effect function is constructed from cosine features in the tail of a large Random Fourier Feature (RFF) dictionary (indices 1500–1539), designed so that small models cannot capture these components while sufficiently wide models can. Using ridgeless regression on the RFF dictionary (minimum-norm least squares, no ridge penalty), we select capacity $M^{*}$ by minimizing validation MSE on the transformed outcome. The validation curve exhibits the expected pattern: performance improves with capacity, deteriorates approaching $M \approx 400$ (the training sample size), then improves again beyond this interpolation threshold. The selected capacity $M^{*} = 1800$ lies in the overparameterized regime.
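A stripped-down sketch of the ridgeless RFF fit and the validation-based capacity search, assuming a fixed frequency matrix `omega` and offsets `b` drawn once for the full dictionary; the helper names are mine, not the simulation's.

```python
import numpy as np

def rff_features(X, M, omega, b):
    """First M columns of a fixed Random Fourier Feature dictionary, phi_m(x) = cos(x' omega_m + b_m)."""
    return np.sqrt(2.0 / M) * np.cos(X @ omega[:, :M] + b[:M])

def min_norm_fit(Phi, y):
    """Ridgeless least squares: the minimum-norm solution, which interpolates once M exceeds n."""
    return np.linalg.pinv(Phi) @ y

def select_capacity(X_tr, ystar_tr, X_val, ystar_val, widths, omega, b):
    """Pick the width M* minimizing validation MSE against the transformed outcome Y*."""
    val_mse = []
    for M in widths:
        beta = min_norm_fit(rff_features(X_tr, M, omega, b), ystar_tr)
        pred = rff_features(X_val, M, omega, b) @ beta
        val_mse.append(np.mean((ystar_val - pred) ** 2))
    return widths[int(np.argmin(val_mse))], val_mse
```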
On the held-out test set, the wide model achieves MSE of 0.02198 versus 0.02440 for a pre-specified small model ($M=32$), a 10% improvement in predictive accuracy. For inference, we compute doubly-robust scores for functionals using nuisance functions fit only on training data. The resulting standard errors are 0.00312 for the ATE (versus 0.00342 for the small model) and 0.00449 for a subgroup ATE where $X_1 > 0$ (versus 0.00474). Both models produce valid confidence intervals; the SE reductions reflect lower score variance from more accurate nuisance estimates.
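The subgroup standard error comes from the same held-out scores, restricted to units with $X_1 > 0$; a short sketch, with `in_group` standing in for that indicator:

```python
import numpy as np

def subgroup_ate_and_se(psi, in_group):
    """Subgroup ATE and standard error from the same doubly-robust scores,
    restricted to units in the subgroup (here, in_group = X[:, 0] > 0)."""
    psi_g = psi[np.asarray(in_group)]
    return psi_g.mean(), psi_g.std(ddof=1) / np.sqrt(len(psi_g))
```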
When unbiased targets for $\tau(X)$ are available, wide models selected on validation data can improve prediction without sacrificing inference. The orthogonal score framework with influence functions computed on held-out data (or through cross-fitting) provides valid uncertainty quantification regardless of model capacity, assuming consistent nuisance estimation (Chernozhukov et al., 2018). For pointwise CATE uncertainty, smoothed local functionals with influence functions or honest last-layer approaches remain appropriate.
By stopping at traditional complexity bounds, we leave predictive performance on the table. For the class of models exhibiting benign overfitting (like the minimum-norm RFF interpolators demonstrated here), the second-descent regime offers improvements in treatment effect estimation, while our existing inference machinery continues to deliver valid standard errors through orthogonal scores, influence functions, and cross-fitting. Overparameterization and rigorous inference are complements, not substitutes.