The Elephant in the Sampling Frame: Lessons from Basu's Parable
In Basu's circus story, the owner wants to estimate the total weight of fifty elephants and suggests weighing Sambo, who looked “average” three years ago, and reporting $50y$, where $y$ is Sambo's weight. A statistician counters with a contrived sampling plan: choose Sambo with probability $99/100$ and each of the other forty-nine elephants (including the huge “Jumbo”) with probability $1/4900$, then estimate the total with the Horvitz–Thompson (HT) estimator (the sampled weight divided by its own inclusion probability). The estimator is unbiased but unstable; on the extremely rare $1/4900$ draw of Jumbo, the estimate is comically large.
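The instability is easy to make concrete with a quick simulation. The herd weights below, including the 12,000 kg figure for Jumbo, are illustrative assumptions, not Basu's numbers:

```python
import random

random.seed(0)

# Illustrative herd: 50 elephants, most near the average; Jumbo is far heavier.
weights = [5000.0] * 50          # assumed typical weight (kg)
SAMBO, JUMBO = 0, 1
weights[JUMBO] = 12000.0         # assumed weight for Jumbo (kg)
true_total = sum(weights)

# Basu's plan: Sambo with probability 99/100, each other elephant 1/4900.
pi = [1.0 / 4900.0] * 50
pi[SAMBO] = 99.0 / 100.0

def ht_draw():
    """One draw under the plan; the HT estimate is y_i / pi_i for the unit drawn."""
    i = SAMBO if random.random() < 0.99 else random.randrange(1, 50)
    return weights[i] / pi[i]

estimates = [ht_draw() for _ in range(200_000)]
mean_est = sum(estimates) / len(estimates)
print(mean_est / true_total)     # close to 1: the estimator is unbiased
print(max(estimates))            # explosive on the rare Jumbo draw
```

Averaged over many replications the estimate centers on the true total, but the one-in-4900 Jumbo draw returns roughly 4900 times his weight.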
What makes the estimator wobble is not HT’s algebra but the sampling plan. In an enumerated, observable herd, the plan assigns inclusion probabilities as small as $1/4900$ to elephants that visibly dominate the total. Designs with tiny $\pi_i$ make large estimates rare but explosive when they occur. No competent statistician would adopt such a plan when the frame itself reveals structure.
Because the frame here is enumerated and observable, the plan for repair is straightforward: use the structure the frame makes available. The analyst can see the elephants and use visible size as a proxy for weight. More generally, a competent plan encodes such structure before sampling. If a proxy $x$ can be measured or at least ranked—girth, shoulder height, or a coarse small/medium/large tag—let selection probabilities track it: use probability‑proportional‑to‑size (PPS), so $\pi_i\propto x_i$. If only ranks are available, stratify by size and sample within strata so that heavy units are not excluded by chance. If a quick proxy is available for all units and precise measurement for a subset, use two‑phase sampling and then calibrate or apply ratio/GREG so that the weighted sum of $x_i$ matches a benchmark.
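A PPS design can be sketched in a few lines. Here the size scores, the linear weight model, and the choice of Poisson sampling as the concrete design are all illustrative assumptions:

```python
import random

random.seed(1)

# Assumed size proxy x_i (small / medium / one huge unit) and weights driven by it.
x = [1.0] * 40 + [2.0] * 9 + [5.0]
y = [4000.0 + 800.0 * xi + random.gauss(0.0, 200.0) for xi in x]
true_total = sum(y)

# PPS inclusion probabilities: pi_i proportional to x_i, expected sample size n.
n = 10
pi = [min(1.0, n * xi / sum(x)) for xi in x]

def poisson_pps_ht():
    """Poisson sampling: include unit i independently with probability pi_i;
    the HT total sums y_i / pi_i over the realized sample."""
    return sum(yi / p for yi, p in zip(y, pi) if random.random() < p)

reps = [poisson_pps_ht() for _ in range(20_000)]
mean_est = sum(reps) / len(reps)
print(mean_est / true_total)     # near 1: HT stays unbiased, now with tame weights
```

Because $\pi_i$ tracks $x_i$, the largest unit carries the largest inclusion probability, so no single draw can dominate the way Jumbo's did.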
When stability is the priority, two levers work in tandem. Acting at the design stage preserves unbiasedness: set a minimum $\pi_i$, treat a few dominant units (obvious outliers) as certainty selections, and use balanced/spread sampling to control the pairwise probabilities $\pi_{ij}$. Under such designs, HT keeps its design‑based guarantees while variance falls. Acting after sampling trades a little bias for a larger variance cut: trim or cap extreme weights $1/\pi_i$, or smooth via calibration that shrinks weights toward targets. If the governing criterion is mean‑squared error and the tail of $1/\pi_i$ is extreme, a modest cap can lower $\mathbb{E}[(\hat Y - Y)^2]$. Where unbiasedness is non‑negotiable (e.g., official totals), one option is to report untrimmed HT and show trimmed estimates as a sensitivity analysis.
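A Monte Carlo sketch shows the bias-for-variance trade from capping. The herd (49 elephants near 5000 kg plus a 12,000 kg Jumbo) and the cap of 500 on the design weight are illustrative assumptions; under Basu's plan the uncapped weight on any non-Sambo draw is 4900:

```python
import random

random.seed(2)

# Illustrative herd: 49 elephants at 5000 kg and a 12,000 kg Jumbo (assumed).
y = [5000.0] * 50
y[1] = 12000.0
true_total = sum(y)

pi = [1.0 / 4900.0] * 50         # Basu's plan: each non-Sambo elephant 1/4900
pi[0] = 99.0 / 100.0             # Sambo with probability 99/100
CAP = 500.0                      # assumed cap on the design weight 1/pi_i

def one_draw():
    i = 0 if random.random() < 0.99 else random.randrange(1, 50)
    w = 1.0 / pi[i]
    return y[i] * w, y[i] * min(w, CAP)   # (untrimmed HT, capped HT)

R = 200_000
ht, capped = zip(*(one_draw() for _ in range(R)))
mse_ht = sum((e - true_total) ** 2 for e in ht) / R
mse_cap = sum((e - true_total) ** 2 for e in capped) / R
print(mse_ht, mse_cap)
```

The capped estimator is badly biased here, yet its mean-squared error is far smaller than the unbiased HT estimator's, which is exactly the point: when the weight tail is this extreme, MSE and unbiasedness pull in opposite directions.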
A complementary route, appropriate when the goal is to learn a generating process or when aspects of the design are only partly known, is to model $y\mid x$ (and, if needed, selection). Outcome regression, doubly-robust AIPW, and hierarchical/Bayesian partial pooling are standard patterns. In that mode, assumptions—not design-unbiasedness—guard against error, so diagnostics and sensitivity analyses are essential.
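A minimal outcome-regression sketch illustrates the model-based mode. The population, the linear generating process, and simple random sampling are all assumptions the model must earn through diagnostics; a real application would check them and, ideally, add a doubly-robust correction:

```python
import random

random.seed(3)

# Assumed setting: proxy x known for every unit, y observed only on a sample,
# and y generated by a linear process in x (the assumption doing the work here).
N = 500
x = [random.uniform(1.0, 10.0) for _ in range(N)]
y = [200.0 + 50.0 * xi + random.gauss(0.0, 20.0) for xi in x]
true_total = sum(y)

# Simple random sample; fit y ~ a + b*x by least squares, then predict all units.
n = 50
s = random.sample(range(N), n)
xs, ys = [x[i] for i in s], [y[i] for i in s]
xbar, ybar = sum(xs) / n, sum(ys) / n
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys))
     / sum((xi - xbar) ** 2 for xi in xs))
a = ybar - b * xbar
model_total = sum(a + b * xi for xi in x)   # outcome-regression estimate of Y
print(model_total / true_total)
```

If the linear model is wrong, nothing in the design rescues the estimate, which is why sensitivity analysis replaces design-unbiasedness as the safeguard in this mode.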
There are also settings in which unit‑level proxies are thin or absent at frame time—for example, broad general‑population surveys with only weak frame metadata. In those cases, the honest baseline is equal‑probability sampling. With $\pi_i=n/N$ for all $i$, weights are bounded by $1/\pi_i=N/n$; the HT total reduces to $\widehat Y_{\mathrm{HT}}=N\,\bar y_s$, and for means HT/Hájek equals the plain sample mean. The familiar pathologies arising from extreme weights cannot occur under this baseline; the remaining error is ordinary sampling noise commensurate with the information at hand.
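The identity between the HT total and $N$ times the sample mean under equal-probability sampling is easy to verify numerically (the skewed population below is illustrative):

```python
import random

random.seed(4)

# Equal-probability design: pi_i = n/N for every unit, so 1/pi_i = N/n for all.
N, n = 1000, 100
y = [random.lognormvariate(0.0, 0.5) for _ in range(N)]  # assumed skewed y
s = random.sample(range(N), n)

pi = n / N
ht_total = sum(y[i] / pi for i in s)          # Horvitz-Thompson total
plain = N * (sum(y[i] for i in s) / n)        # N times the sample mean
print(ht_total - plain)                       # zero up to floating-point rounding
```

Every weight equals $N/n$, so no draw can blow up the estimate; what remains is ordinary sampling variability.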
Two principles follow. First, unbiasedness is a property, not a decision rule: an estimator can be design-unbiased and still high-risk for the problem at hand, so choose methods for the loss you care about (mean-squared error, tail risk, subdomain precision). Second, use information when it is available and acknowledge when it is not: when structure is visible, encode it in the design (certainty units, PPS, stratification, two-phase sampling plus calibration) and, if modeling, in the estimator.