Unbiased regression with costly item labels


Consider a common scenario: you observe how units (users, devices, firms) interact with items (websites, products, apps), and each item has an unknown binary trait (is it harmful? is it premium content?). You want to understand how these traits relate to unit characteristics through regression, but labeling items is expensive. How do you choose which items to label to get accurate regression coefficients without breaking the budget?

Let $C=(c_{ij})$ be exposure counts, $T_i=\sum_{j} c_{ij}$ be the total exposure for unit $i$, and define the row-level outcome as the exposure-weighted share:

$$y_i=\sum_{j=1}^m \frac{c_{ij}}{T_i}a_j$$

where $a_j\in\lbrace 0,1\rbrace$ is the unknown item label.

With covariates $X\in\mathbb{R}^{n\times p}$, the population target is:

$$\beta^*=(X^{\top} X)^{-1}X^{\top} y$$

The challenge: labels $a_j$ are expensive to obtain. We need a principled way to label only a subset of items while still getting unbiased regression estimates with controlled variance.
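To make the target concrete, here is a minimal Python sketch with made-up data (every variable name and number below is illustrative, not taken from any real system); it computes the full-information $y$ and $\beta^*$ we would obtain if every label were known.

```python
import numpy as np

# Toy setup: n units, m items, p covariates (all sizes and distributions made up).
rng = np.random.default_rng(0)
n, m, p = 500, 200, 3
C = rng.poisson(2.0, size=(n, m))            # exposure counts c_ij
C[C.sum(axis=1) == 0, 0] = 1                 # avoid rows with zero total exposure
T = C.sum(axis=1, keepdims=True)             # total exposure T_i per unit
a = rng.binomial(1, 0.1, size=m)             # item labels a_j (known only to the oracle)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# Full-information target: exposure-weighted shares and the OLS coefficients.
y = (C / T) @ a                              # y_i = sum_j (c_ij / T_i) a_j
beta_star, *_ = np.linalg.lstsq(X, y, rcond=None)
```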

One reasonable solution is to apply Horvitz-Thompson (HT) at the item level rather than the traditional unit level. Sample items with known inclusion probabilities $\pi_j\in(0,1]$, independently of their unknown labels, and construct row shares using inverse probability weighting.
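In the toy setup above, that construction is only a few lines (a sketch; the uniform $\pi_j=0.3$ is just a placeholder):

```python
# Item-level Horvitz-Thompson: sample items independently with probabilities pi_j,
# observe a_j only for sampled items, and reweight by 1/pi_j. Unbiased for each y_i
# because E[I_j / pi_j] = 1 and pi_j does not depend on a_j.
pi = np.full(m, 0.3)                         # any probabilities in (0, 1] work
I = rng.binomial(1, pi)                      # which items we actually send for labeling
y_ht = (C / T) @ (a * I / pi)                # hat{y}_i = sum_j (c_ij / T_i) a_j I_j / pi_j
beta_ht, *_ = np.linalg.lstsq(X, y_ht, rcond=None)
```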

Applying HT, however, has two limitations:

  1. When we decide not to label item $j$, every unit that interacts with that item is affected simultaneously. This creates correlated errors across units—the estimation error $u=\widehat{y}-y$ has off-diagonal covariance terms. Units that share many items will have highly correlated errors.
  2. While HT guarantees unbiased means, it severely distorts distributions. The empirical distribution of $\widehat{y}$ will have inflated variance and distorted quantiles. If you care about inequality measures, percentiles, or individual-level predictions rather than just regression coefficients, this approach has serious limitations. The realized distribution is essentially the true distribution convolved with design noise.

Off-the-shelf HT with generic inclusion probabilities is a naive solution, though. A better answer is to target the items that matter most for regression accuracy. To identify them, we need to trace how item-level noise propagates through to the coefficient estimates.

For each item $j$, define its row-normalized exposure vector $v_j\in\mathbb{R}^n$ where $(v_j)_i=c_{ij}/T_i$. This tells us what fraction of each unit's exposure comes from item $j$. The regression projection $g_j=X^{\top} v_j$ shows how item $j$ projects onto the covariate space, and the OLS influence weight:

$$w_j=g_j^{\top} (X^{\top} X)^{-1}g_j=v_j^{\top} X(X^{\top} X)^{-1}X^{\top} v_j$$

captures how much leverage item $j$ has on the regression coefficients.

The influence weight $w_j$ embodies a crucial geometric fact: only estimation errors that align with the column space of $X$ affect OLS estimates. Items with large $w_j$ are those whose noise gets amplified through the regression projection. An item heavily used by units with extreme covariate values will have high influence.
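In the toy setup from earlier, the influence weights are a few lines of linear algebra (a sketch; with large, sparse exposure matrices you would want sparse operations rather than a dense inverse):

```python
# Influence weight of each item: w_j = v_j^T X (X^T X)^{-1} X^T v_j.
V = C / T                                    # column j is v_j (row-normalized exposures)
G = X.T @ V                                  # column j is g_j = X^T v_j
XtX_inv = np.linalg.inv(X.T @ X)
w = np.sum(G * (XtX_inv @ G), axis=0)        # w_j = g_j^T (X^T X)^{-1} g_j, one per item
```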

Under independent item sampling, the bound on the coefficient variance is:

$$\operatorname{tr}\left(\mathrm{Cov}(\widehat{\beta})\right) \le \sum_{j=1}^m \left(\frac{1}{\pi_j}-1\right) w_j$$

This bound is tight when all items are positive ($a_j=1$ for all $j$), making it a worst-case guarantee. The message is clear: leaving high-$w_j$ items unlabeled is expensive for regression precision.

Optimal Sampling Designs

Given this geometric understanding, we can formulate optimal sampling strategies as convex optimization problems.

Regression-Focused Design (A-Optimal)

To minimize the trace of coefficient covariance with a fixed labeling budget $K$, solve:

$$\min_{\pi} \sum_j \frac{w_j}{\pi_j} \quad \text{subject to} \quad \sum_j \pi_j=K, \quad \pi_{\min}\le\pi_j\le 1$$

The solution follows a square-root law: $\pi_j \propto \sqrt{w_j}$ (then clamp to bounds). This is the classical Neyman allocation applied to regression—sample more where the variance contribution is highest, but the square root ensures we don't overconcentrate on just the highest-influence items.

If items have heterogeneous labeling costs $c_j$, the allocation becomes $\pi_j\propto \sqrt{w_j/c_j}$. A minimum inclusion probability $\pi_{\min}$ (e.g., 0.01) prevents any weight from exploding to infinity.
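Here is a sketch of one way to compute the clamped allocation, reusing the influence weights `w` from the snippet above (the budget `K=40` and the bisection solver are my choices, not prescribed anywhere):

```python
def sqrt_allocation(w, K, pi_min=0.01, costs=None, tol=1e-9):
    """Neyman-style allocation: pi_j proportional to sqrt(w_j / c_j), clamped to
    [pi_min, 1], with the scale chosen by bisection so that the expected number
    of labels sum_j pi_j equals the budget K. Assumes pi_min * len(w) <= K <= len(w)
    and that at least some w_j are positive."""
    c = np.ones_like(w, dtype=float) if costs is None else np.asarray(costs, dtype=float)
    score = np.sqrt(np.clip(w, 0.0, None) / c)

    def expected_labels(lam):
        return np.clip(lam * score, pi_min, 1.0).sum()

    lo, hi = 0.0, 1.0
    while expected_labels(hi) < K:           # grow the bracket until it covers the budget
        hi *= 2.0
    while hi - lo > tol:                     # bisect on the proportionality constant
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if expected_labels(mid) < K else (lo, mid)
    return np.clip(hi * score, pi_min, 1.0)

pi_star = sqrt_allocation(w, K=40)           # expected 40 labeled items
```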

Row-Level Fairness Design

Sometimes you need guarantees for individual units (for auditing, fairness, or regulatory reasons, say). Define $q_{ij}=(c_{ij}/T_i)^2$ as the squared exposure share. To ensure each unit's estimate has standard error at most $\varepsilon_i$, again guarding against the worst case in which every item is positive, require:

$$\sum_{j} \frac{q_{ij}}{\pi_j} \le \varepsilon_i^2+\sum_{j} q_{ij}$$

Minimizing total labels subject to these per-row constraints is still convex, yielding $\pi_j \propto \sqrt{\sum_{i} \mu_i q_{ij}}$ where the dual variables $\mu_i$ emphasize items important to the most constrained units.
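A sketch of this program with the cvxpy package (assumed here, as are the placeholder error targets `eps`), reusing the toy `C`, `T`, `n`, and `m` from earlier:

```python
import cvxpy as cp

Q = (C / T) ** 2                             # q_ij = (c_ij / T_i)^2
eps = np.full(n, 0.05)                       # per-unit standard-error targets (placeholders)
pi_min = 0.01

pi_var = cp.Variable(m)
row_var = Q @ cp.inv_pos(pi_var)             # sum_j q_ij / pi_j, convex since Q >= 0
constraints = [
    row_var <= eps**2 + Q.sum(axis=1),       # worst-case SE of row i at most eps_i
    pi_var >= pi_min,
    pi_var <= 1,
]
cp.Problem(cp.Minimize(cp.sum(pi_var)), constraints).solve()
pi_fair = pi_var.value                       # minimal-budget inclusion probabilities
```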

Prevalence-Aware Refinement

If you know that at most an $\alpha$ fraction of items are positive, you can tighten the variance bounds by protecting only against the worst-case $M=\lceil \alpha m\rceil$ positive items. This uses the convex epigraph representation of the sum-of-largest operator.
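A rough cvxpy sketch of that refinement, reusing `w`, `m`, and `pi_min` from the snippets above (the prevalence `alpha` and the budget `K` are placeholders):

```python
# If at most M = ceil(alpha * m) items can be positive, minimize the sum of the
# M largest per-item variance contributions (1/pi_j - 1) * w_j under the budget.
alpha, K = 0.1, 40
M = int(np.ceil(alpha * m))

pi_var = cp.Variable(m)
per_item = cp.multiply(w, cp.inv_pos(pi_var)) - w    # (1/pi_j - 1) w_j, convex in pi
objective = cp.Minimize(cp.sum_largest(per_item, M)) # sum-of-largest epigraph trick
constraints = [cp.sum(pi_var) <= K, pi_var >= pi_min, pi_var <= 1]
cp.Problem(objective, constraints).solve()
```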

From Probabilities to Practice

The optimization gives you probabilities $\pi^*$, but annotators need a concrete list of items. Here's the implementation path:

Draw exactly $K=\text{round}(\sum_j \pi_j^*)$ items using a sampling scheme that respects the inclusion probabilities. The gold standard is balanced sampling (e.g., the cube method), which approximately achieves:

$$\sum_{j} \left(\frac{I_j}{\pi_j^*}-1\right) g_j \approx 0$$

This constraint ensures the realized noise is nearly orthogonal to the regression space, reducing variance in finite samples. You get the same unbiasedness guarantee with better realized performance than independent Poisson sampling.
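Production-grade cube-method implementations live in survey-sampling software (for example, the R `sampling` package); as a crude stand-in, the sketch below draws many independent Poisson samples and keeps the best-balanced one, reusing `pi_star` and `G` from earlier. Selecting on the realized draw makes this a heuristic approximation rather than an exact balanced design.

```python
def best_balanced_draw(pi, G, n_draws=200, seed=0):
    """Heuristic stand-in for balanced sampling: among many independent Poisson
    draws, keep the one whose design noise is most nearly orthogonal to the
    regression space. Balancing on pi itself keeps the sample size near K."""
    rng = np.random.default_rng(seed)
    B = np.vstack([pi, G])                   # balance on sample size and on the g_j
    best_I, best_norm = None, np.inf
    for _ in range(n_draws):
        I = rng.binomial(1, pi)
        resid = B @ (I / pi - 1.0)           # sum_j (I_j / pi_j - 1) [pi_j; g_j]
        norm = float(np.linalg.norm(resid))
        if norm < best_norm:
            best_I, best_norm = I, norm
    return best_I

items_to_label = np.flatnonzero(best_balanced_draw(pi_star, G))
```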


For cruder solutions and the problem that motivated this work, check out https://www.gojiberries.io/gathering-domain-knowledge/
