Bandwidth Selection Without Grid Search
Bandwidth selection is often framed as "scan a grid and pick the CV minimum." For common univariate problems, you can do better. The leave‑one‑out (LOO) cross‑validation objectives admit closed‑form derivatives in the bandwidth $h$. With analytic gradients (and Hessians when useful), a one‑dimensional Newton method finds the minimizer in a few steps.
The idea is straightforward. Kernel weights change smoothly with $h$, so the LOO risk $R(h)$ is (piecewise) differentiable in $h$. Once you can evaluate $R'(h)$ (and optionally $R''(h)$), bandwidth selection becomes a stable scalar optimization.
Nadaraya–Watson (NW) regression
With kernel $K$ and bandwidth $h$:
$\hat{m}(x) = \sum_{j=1}^n w(j,x;h) y(j)$
$w(j,x;h) = \frac{K_h(x-x(j))}{\sum_{l=1}^n K_h(x-x(l))}$
$K_h(u) = \frac{1}{h}K\left(\frac{u}{h}\right)$
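As a concrete reference, here is a minimal NumPy sketch of the estimator at arbitrary query points with a Gaussian kernel (function and variable names are illustrative only):

```python
import numpy as np

def nw_predict(x_query, x, y, h):
    """Nadaraya-Watson fit at the points in x_query with a Gaussian kernel."""
    u = (np.asarray(x_query)[:, None] - np.asarray(x)[None, :]) / h
    # Unnormalized kernel values; the 1/(h*sqrt(2*pi)) constant cancels
    # when the weights are normalized row by row.
    k = np.exp(-0.5 * u**2)
    w = k / k.sum(axis=1, keepdims=True)   # w[q, j] = weight of y[j] at x_query[q]
    return w @ np.asarray(y)
```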
At the training inputs, $\hat{y}(h) = S(h) y$ (a linear smoother) with diagonal entries $S(i,i) = w(i,x(i);h)$. The LOO fitted value and residual are:
$\hat{m}_{-i}(x(i)) = \frac{\hat{y}(i) - S(i,i)\, y(i)}{1 - S(i,i)}$
$e(i,h) = y(i) - \hat{m}_{-i}(x(i)) = \frac{y(i) - \hat{y}(i)}{1 - S(i,i)}$
Define the LOO risk $R(h) = \frac{1}{n}\sum_i e(i,h)^2$. Differentiating yields:
$e'(i,h) = \frac{-(1-S(i,i)) \cdot \hat{y}'(i,h) + (y(i) - \hat{y}(i)) \cdot S'(i,i,h)}{(1-S(i,i))^2}$
$R'(h) = \frac{2}{n}\sum_i e(i,h) \cdot e'(i,h)$
So it is enough to compute $\hat{y}'(h) = S'(h)y$ and $S'(i,i)$. Writing $w(i,j) = S(i,j)$ for the normalized weight of point $j$ in the fit at $x(i)$, differentiation gives:
$\frac{\partial w(i,j)}{\partial h} = w(i,j)\left(\frac{\partial\log K_h(x(i)-x(j))}{\partial h} - \sum_{l} w(i,l) \frac{\partial\log K_h(x(i)-x(l))}{\partial h}\right)$
For common kernels the log‑derivative is closed‑form. Example (Gaussian):
$$\frac{\partial K_h(u)}{\partial h} = \frac{(u/h)^2-1}{h} K_h(u)$$
Therefore:
$$\frac{\partial\log K_h(u)}{\partial h} = \frac{(u/h)^2-1}{h}$$
This gives $S'(h)$ and hence $R'(h)$ (and $R''(h)$ similarly, if you want true Newton).
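Putting these pieces together, a sketch of $R(h)$ and $R'(h)$ for the Gaussian kernel might look like the following (illustrative names; the clipping of $1 - S(i,i)$ anticipates the safeguards listed below):

```python
import numpy as np

def nw_loo_risk_and_grad(h, x, y, eps=1e-10):
    """LOO risk R(h) and derivative R'(h) for NW regression with a Gaussian kernel."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    u = (x[:, None] - x[None, :]) / h              # u[i, j] = (x(i) - x(j)) / h
    k = np.exp(-0.5 * u**2)                        # kernel values (constants cancel)
    s = k / k.sum(axis=1, keepdims=True)           # smoother matrix S(h)
    dlogk = (u**2 - 1.0) / h                       # d log K_h / dh, Gaussian case
    # dS/dh from the normalized-weight identity above
    ds = s * (dlogk - (s * dlogk).sum(axis=1, keepdims=True))

    y_hat, dy_hat = s @ y, ds @ y
    s_ii, ds_ii = np.diag(s), np.diag(ds)
    denom = np.clip(1.0 - s_ii, eps, None)         # guard against S(i,i) -> 1

    e = (y - y_hat) / denom                        # LOO residuals e(i, h)
    de = (-denom * dy_hat + (y - y_hat) * ds_ii) / denom**2
    return np.mean(e**2), 2.0 * np.mean(e * de)    # R(h), R'(h)
```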
For numerical stability, it makes sense to optimize in log‑space: let $g = \log h$. Then:
$$\frac{dR}{dg} = h R'(h)$$
$$\frac{d^2R}{dg^2} = h^2 R''(h) + h R'(h)$$
Take Newton steps in $g$ with a backtracking line search; this preserves $h > 0$ and stabilizes extreme steps.
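A minimal safeguarded driver, assuming the `nw_loo_risk_and_grad` helper sketched above and approximating the curvature in $g$ by a finite difference of the analytic gradient (an analytic $R''(h)$ slots in the same way):

```python
import numpy as np

def newton_log_bandwidth(x, y, h0=None, h_min=1e-4, h_max=10.0,
                         tol=1e-6, max_iter=50):
    """Minimize the LOO risk R(h) over [h_min, h_max] with Newton steps in g = log h."""
    # Normal-reference warm start (an assumption; any coarse guess works).
    h = h0 if h0 is not None else 1.06 * np.std(x) * len(x) ** (-0.2)
    g = np.log(np.clip(h, h_min, h_max))
    for _ in range(max_iter):
        h = np.exp(g)
        r, dr_dh = nw_loo_risk_and_grad(h, x, y)     # sketched above
        dr_dg = h * dr_dh                            # chain rule: dR/dg = h R'(h)
        # Curvature in g via a finite difference of the analytic gradient.
        dg = 1e-4
        h_eps = np.exp(g + dg)
        d2r_dg2 = (h_eps * nw_loo_risk_and_grad(h_eps, x, y)[1] - dr_dg) / dg
        step = -dr_dg / d2r_dg2 if d2r_dg2 > 0 else -dr_dg  # fall back to gradient step
        # Backtracking line search keeps the step safeguarded and h in bounds.
        t, g_new = 1.0, g
        while t > 1e-6:
            cand = np.clip(g + t * step, np.log(h_min), np.log(h_max))
            if nw_loo_risk_and_grad(np.exp(cand), x, y)[0] < r:
                g_new = cand
                break
            t *= 0.5
        if abs(g_new - g) < tol:
            break
        g = g_new
    return float(np.exp(g))
```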
KDE with least‑squares CV (LSCV)
With the kernel density estimate:
$\hat{f}(x) = \frac{1}{nh}\sum_{j=1}^n K\left(\frac{x-x(j)}{h}\right)$
The LSCV objective is:
$\text{LSCV}(h) = \int \hat{f}(x)^2 dx - \frac{2}{n}\sum_{i=1}^n \hat{f}_{-i}(x(i))$, where $\hat{f}_{-i}$ denotes the estimate computed without $x(i)$.
Both terms are analytic kernel sums. The $L^2$ term reduces to pairwise convolutions:
$\int \hat{f}^2 = \frac{1}{n^2}\sum_{i,j}(K_h * K_h)(x(i)-x(j))$
For Gaussian $K$ this equals:
$\frac{1}{n^2}\sum_{i,j} K_{\sqrt{2}h}(x(i)-x(j))$
The LOO term is:
$\frac{2}{n(n-1)}\sum_{i \neq j} K_h(x(i)-x(j))$
For Gaussian or Epanechnikov kernels, derivatives of these sums follow the same log‑derivative pattern as above, yielding $\text{LSCV}'(h)$ (and $\text{LSCV}''(h)$).
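For the Gaussian case, a sketch of $\text{LSCV}(h)$ and $\text{LSCV}'(h)$ built from the two sums above (illustrative names, not library code):

```python
import numpy as np

def _gauss(u, h):
    """Gaussian kernel K_h(u) and its derivative dK_h/dh."""
    k = np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))
    dk = ((u / h) ** 2 - 1.0) / h * k
    return k, dk

def lscv_and_grad(h, x):
    """LSCV(h) and LSCV'(h) for a Gaussian KDE."""
    x = np.asarray(x, float)
    n = len(x)
    d = x[:, None] - x[None, :]                 # pairwise differences
    # L2 term: (1/n^2) sum_{i,j} K_{sqrt(2) h}(x_i - x_j); chain rule adds sqrt(2)
    k2, dk2 = _gauss(d, np.sqrt(2.0) * h)
    term1, dterm1 = k2.sum() / n**2, np.sqrt(2.0) * dk2.sum() / n**2
    # LOO term: (2 / (n(n-1))) sum_{i != j} K_h(x_i - x_j)
    k1, dk1 = _gauss(d, h)
    off = ~np.eye(n, dtype=bool)
    term2 = 2.0 * k1[off].sum() / (n * (n - 1))
    dterm2 = 2.0 * dk1[off].sum() / (n * (n - 1))
    return term1 - term2, dterm1 - dterm2
```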
Caveats and safeguards
- With compact‑support kernels (e.g., Epanechnikov), the CV objective is piecewise smooth in $h$ (kinks when $|x(i)-x(j)|/h = 1$); use safeguarded steps (line search or bracketing).
- LSCV objectives are often bumpy and prone to undersmoothing; expect multiple local minima. Warm‑start from a plug‑in rule or a coarse grid and allow restarts.
- Numerics degrade when $S(i,i) \to 1$ (near duplicates or tiny $h$); clip denominators $1-S(i,i)$ and bound $h$ to sensible $[h_{min}, h_{max}]$.
Python Package
The hessband package provides `select_nw_bandwidth` and `select_kde_bandwidth` with several search strategies: `analytic` (analytic‑derivative Newton), `grid`, `golden` (golden‑section), `plug-in`, `fdnewton` (finite‑difference Newton), and `bayes` (Bayesian optimisation). It supports Gaussian and Epanechnikov kernels (e.g., `kernel='gaussian'`/`'gauss'` and `kernel='epan'`).
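A hypothetical usage sketch: the function names and kernel strings come from the description above, but the argument for choosing a strategy (written here as `method=`) is an assumption and may differ in the actual API:

```python
import numpy as np
from hessband import select_nw_bandwidth, select_kde_bandwidth

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + 0.3 * rng.standard_normal(200)

# Argument names below are assumptions; consult the package documentation.
h_reg = select_nw_bandwidth(x, y, method='analytic', kernel='gaussian')
h_kde = select_kde_bandwidth(x, method='analytic', kernel='epan')
```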