Bandwidth Selection Without Grid Search
Bandwidth selection is often framed as "scan a grid and pick the CV minimum." For common univariate problems, you can do better. The leave‑one‑out (LOO) cross‑validation objectives admit closed‑form derivatives in the bandwidth $h$. With analytic gradients (and Hessians when useful), a one‑dimensional Newton method finds the minimizer in a few steps.
The idea is straightforward. Kernel weights change smoothly with $h$, so the LOO risk $R(h)$ is (piecewise) differentiable in $h$. Once you can evaluate $R'(h)$ (and optionally $R''(h)$), bandwidth selection becomes a stable scalar optimization.
Nadaraya–Watson (NW) regression
With kernel $K$ and bandwidth $h$:
$\hat{m}(x) = \sum_{j=1}^n w(j,x;h) y(j)$
$w(j,x;h) = \frac{K_h(x-x(j))}{\sum_{l=1}^n K_h(x-x(l))}$
$K_h(u) = \frac{1}{h}K\left(\frac{u}{h}\right)$
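As a concrete reference, here is a minimal NumPy sketch of the estimator at arbitrary query points with a Gaussian kernel (function and variable names are illustrative only):

```python
import numpy as np

def nw_predict(x_query, x, y, h):
    """Nadaraya-Watson fit at the points in x_query with a Gaussian kernel."""
    u = (np.asarray(x_query)[:, None] - np.asarray(x)[None, :]) / h
    # Unnormalized kernel values; the 1/(h*sqrt(2*pi)) constant cancels
    # when the weights are normalized row by row.
    k = np.exp(-0.5 * u**2)
    w = k / k.sum(axis=1, keepdims=True)   # w[q, j] = weight of y[j] at x_query[q]
    return w @ np.asarray(y)
```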
At the training inputs, $\hat{y}(h) = S(h) y$ (a linear smoother) with diagonal entries $S(i,i) = w(i,x(i);h)$. The LOO fitted value and residual are:
$\hat{m}_{-i}(x(i)) = \frac{\hat{y}(i) - S(i,i)\, y(i)}{1 - S(i,i)}$
$e(i,h) = y(i) - \hat{m}_{-i}(x(i)) = \frac{y(i) - \hat{y}(i)}{1 - S(i,i)}$
Define the LOO risk $R(h) = \frac{1}{n}\sum_i e(i,h)^2$. Differentiating yields:
$e'(i,h) = \frac{-(1-S(i,i)) \cdot \hat{y}'(i,h) + (y(i) - \hat{y}(i)) \cdot S'(i,i,h)}{(1-S(i,i))^2}$
$R'(h) = \frac{2}{n}\sum_i e(i,h) \cdot e'(i,h)$
So it is enough to compute $\hat{y}'(h) = S'(h)y$ and $S'(i,i)$. Writing $w(i,j) = S(i,j)$ for the normalized weight of point $j$ in the fit at $x(i)$, differentiation gives:
$\frac{\partial w(i,j)}{\partial h} = w(i,j)\left(\frac{\partial\log K_h(x(i)-x(j))}{\partial h} - \sum_{l} w(i,l) \frac{\partial\log K_h(x(i)-x(l))}{\partial h}\right)$
For common kernels the log‑derivative is closed‑form. Example (Gaussian):
$$\frac{\partial K_h(u)}{\partial h} = \frac{(u/h)^2-1}{h} K_h(u)$$
Therefore:
$$\frac{\partial\log K_h(u)}{\partial h} = \frac{(u/h)^2-1}{h}$$
This gives $S'(h)$ and hence $R'(h)$ (and $R''(h)$ similarly, if you want true Newton).
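Putting these pieces together, a sketch of $R(h)$ and $R'(h)$ for the Gaussian kernel might look like the following (illustrative names; the clipping of $1 - S(i,i)$ anticipates the safeguards listed below):

```python
import numpy as np

def nw_loo_risk_and_grad(h, x, y, eps=1e-10):
    """LOO risk R(h) and derivative R'(h) for NW regression with a Gaussian kernel."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    u = (x[:, None] - x[None, :]) / h              # u[i, j] = (x(i) - x(j)) / h
    k = np.exp(-0.5 * u**2)                        # kernel values (constants cancel)
    s = k / k.sum(axis=1, keepdims=True)           # smoother matrix S(h)
    dlogk = (u**2 - 1.0) / h                       # d log K_h / dh, Gaussian case
    # dS/dh from the normalized-weight identity above
    ds = s * (dlogk - (s * dlogk).sum(axis=1, keepdims=True))

    y_hat, dy_hat = s @ y, ds @ y
    s_ii, ds_ii = np.diag(s), np.diag(ds)
    denom = np.clip(1.0 - s_ii, eps, None)         # guard against S(i,i) -> 1

    e = (y - y_hat) / denom                        # LOO residuals e(i, h)
    de = (-denom * dy_hat + (y - y_hat) * ds_ii) / denom**2
    return np.mean(e**2), 2.0 * np.mean(e * de)    # R(h), R'(h)
```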
For numerical stability, it makes sense to optimize in log‑space: let $g = \log h$. Then:
$$\frac{dR}{dg} = h R'(h)$$
$$\frac{d^2R}{dg^2} = h^2 R''(h) + h R'(h)$$
Take Newton steps in $g$ with a backtracking line search; this preserves $h > 0$ and stabilizes extreme steps.
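A minimal safeguarded driver, assuming the `nw_loo_risk_and_grad` helper sketched above and approximating the curvature in $g$ by a finite difference of the analytic gradient (an analytic $R''(h)$ slots in the same way):

```python
import numpy as np

def newton_log_bandwidth(x, y, h0=None, h_min=1e-4, h_max=10.0,
                         tol=1e-6, max_iter=50):
    """Minimize the LOO risk R(h) over [h_min, h_max] with Newton steps in g = log h."""
    # Normal-reference warm start (an assumption; any coarse guess works).
    h = h0 if h0 is not None else 1.06 * np.std(x) * len(x) ** (-0.2)
    g = np.log(np.clip(h, h_min, h_max))
    for _ in range(max_iter):
        h = np.exp(g)
        r, dr_dh = nw_loo_risk_and_grad(h, x, y)     # sketched above
        dr_dg = h * dr_dh                            # chain rule: dR/dg = h R'(h)
        # Curvature in g via a finite difference of the analytic gradient.
        dg = 1e-4
        h_eps = np.exp(g + dg)
        d2r_dg2 = (h_eps * nw_loo_risk_and_grad(h_eps, x, y)[1] - dr_dg) / dg
        step = -dr_dg / d2r_dg2 if d2r_dg2 > 0 else -dr_dg  # fall back to gradient step
        # Backtracking line search keeps the step safeguarded and h in bounds.
        t, g_new = 1.0, g
        while t > 1e-6:
            cand = np.clip(g + t * step, np.log(h_min), np.log(h_max))
            if nw_loo_risk_and_grad(np.exp(cand), x, y)[0] < r:
                g_new = cand
                break
            t *= 0.5
        if abs(g_new - g) < tol:
            break
        g = g_new
    return float(np.exp(g))
```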
KDE with least‑squares CV (LSCV)
With the kernel density estimate:
$\hat{f}(x) = \frac{1}{nh}\sum_{j=1}^n K\left(\frac{x-x(j)}{h}\right)$
The LSCV objective is:
$\text{LSCV}(h) = \int \hat{f}(x)^2 dx - \frac{2}{n}\sum_{i=1}^n \hat{f}_{-i}(x(i))$, where $\hat{f}_{-i}$ denotes the estimate computed without $x(i)$.
Both terms are analytic kernel sums. The $L^2$ term reduces to pairwise convolutions:
$\int \hat{f}^2 = \frac{1}{n^2}\sum_{i,j}(K_h * K_h)(x(i)-x(j))$
For Gaussian $K$ this equals:
$\frac{1}{n^2}\sum_{i,j} K_{\sqrt{2}h}(x(i)-x(j))$
The LOO term is:
$\frac{2}{n(n-1)}\sum_{i \neq j} K_h(x(i)-x(j))$
For Gaussian or Epanechnikov kernels, derivatives of these sums follow the same log‑derivative pattern as above, yielding $\text{LSCV}'(h)$ (and $\text{LSCV}''(h)$).
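For the Gaussian case, a sketch of $\text{LSCV}(h)$ and $\text{LSCV}'(h)$ built from the two sums above (illustrative names, not library code):

```python
import numpy as np

def _gauss(u, h):
    """Gaussian kernel K_h(u) and its derivative dK_h/dh."""
    k = np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))
    dk = ((u / h) ** 2 - 1.0) / h * k
    return k, dk

def lscv_and_grad(h, x):
    """LSCV(h) and LSCV'(h) for a Gaussian KDE."""
    x = np.asarray(x, float)
    n = len(x)
    d = x[:, None] - x[None, :]                 # pairwise differences
    # L2 term: (1/n^2) sum_{i,j} K_{sqrt(2) h}(x_i - x_j); chain rule adds sqrt(2)
    k2, dk2 = _gauss(d, np.sqrt(2.0) * h)
    term1, dterm1 = k2.sum() / n**2, np.sqrt(2.0) * dk2.sum() / n**2
    # LOO term: (2 / (n(n-1))) sum_{i != j} K_h(x_i - x_j)
    k1, dk1 = _gauss(d, h)
    off = ~np.eye(n, dtype=bool)
    term2 = 2.0 * k1[off].sum() / (n * (n - 1))
    dterm2 = 2.0 * dk1[off].sum() / (n * (n - 1))
    return term1 - term2, dterm1 - dterm2
```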
Caveats and safeguards
- With compact‑support kernels (e.g., Epanechnikov), the CV objective is piecewise smooth in $h$ (kinks when $|x(i)-x(j)|/h = 1$); use safeguarded steps (line search or bracketing).
- LSCV objectives are often bumpy and prone to undersmoothing; expect multiple local minima. Warm‑start from a plug‑in rule or a coarse grid and allow restarts.
- Numerics degrade when $S(i,i) \to 1$ (near duplicates or tiny $h$); clip denominators $1-S(i,i)$ and bound $h$ to sensible $[h_{min}, h_{max}]$.
Python Package
The hessband package provides `select_nw_bandwidth` and `select_kde_bandwidth` with several search strategies: `analytic` (analytic‑derivative Newton), `grid`, `golden` (golden‑section), `plug-in`, `fdnewton` (finite‑difference Newton), and `bayes` (Bayesian optimisation). It supports Gaussian and Epanechnikov kernels (e.g., `kernel='gaussian'`/`'gauss'` and `kernel='epan'`).
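A hypothetical usage sketch: the function names and kernel strings come from the description above, but the argument for choosing a strategy (written here as `method=`) is an assumption and may differ in the actual API:

```python
import numpy as np
from hessband import select_nw_bandwidth, select_kde_bandwidth

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + 0.3 * rng.standard_normal(200)

# Argument names below are assumptions; consult the package documentation.
h_reg = select_nw_bandwidth(x, y, method='analytic', kernel='gaussian')
h_kde = select_kde_bandwidth(x, method='analytic', kernel='epan')
```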