The Elephant in the Sampling Frame: Lessons from Basu's Parable

Basu's circus story is well known: the owner wants the total weight of fifty elephants but will weigh only one. The statistician chooses a lopsided design—pick "Sambo" with probability 99/100 and each of the other forty-nine elephants (including the huge "Jumbo") with probability 1/4900 apiece—and uses Horvitz–Thompson (HT), reporting the sampled weight divided by its own inclusion probability. The estimator is design‑unbiased yet typically far from the truth; in any of the ultra‑rare 1/4900 draws it is astronomically large (most extreme when it picks "Jumbo"). So far, so theatrical.
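
To see both halves of that claim at once, write $T = \sum_{i=1}^{50} y_i$ for the true total; the HT rule applied to the single sampled weight gives

$$
\hat{T}_{\mathrm{HT}} = \frac{y_{\text{sampled}}}{\pi_{\text{sampled}}} =
\begin{cases}
\tfrac{100}{99}\, y_{\text{Sambo}} & \text{with probability } \tfrac{99}{100},\\[4pt]
4900\, y_i & \text{with probability } \tfrac{1}{4900} \text{ for each other elephant } i,
\end{cases}
$$

so that

$$
\mathbb{E}\!\left[\hat{T}_{\mathrm{HT}}\right]
= \frac{99}{100}\cdot\frac{100}{99}\, y_{\text{Sambo}}
+ \sum_{i \neq \text{Sambo}} \frac{1}{4900}\cdot 4900\, y_i
= \sum_{i=1}^{50} y_i = T.
$$

The expectation recovers $T$ exactly, yet with probability 99/100 the estimate is roughly one elephant's worth of weight (in Basu's telling Sambo is about average, so about a fiftieth of the true total), and the rare Jumbo draw reports 4900 times Jumbo's weight.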

Here is the crux of the critique: the fable quietly assumes the statistician can see size (Sambo/Jumbo), an excellent proxy for weight, and still chooses a design that places near-zero inclusion probabilities on many elephants. The setup is an enumerated, observable frame—a finite list of fifty elephants you can inspect before sampling. In that setting, ignoring visible size as a proxy is the vice. By contrast, many finite‑population problems (e.g., U.S. adults) lack an observable frame; equal‑probability designs (or designs that use only frame‑available proxies) are then the sober baseline. And if the aim were superpopulation learning, you’d model and judge by predictive risk. The parable bites by sliding between these frames while blaming HT.

Blaming HT for a perverse design misplaces the fault. What HT actually promises is narrow and honest. It is a design‑based estimator that uses one fact—the inclusion probabilities $\pi_i$—and guarantees design‑unbiasedness for finite‑population totals when two preconditions hold: positivity (every unit has $\pi_i > 0$) and known $\pi_i$. If you truly have no auxiliary information—you cannot distinguish elephants in any meaningful way—the sober answer is equal-probability sampling and the plain sample mean (HT/Hájek under equal $\pi$). In that information-bare world, the explosive-weight pathology disappears; only small-$n$ noise remains. Under equal $\pi$, the HT total is $N$ times the plain sample mean, and for the mean, HT and Hájek coincide with the sample mean.
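
A minimal simulation sketch makes the contrast concrete; the elephant weights below are made up, and only the two designs come from the parable. Both designs are design-unbiased for the herd total, but their risk differs by orders of magnitude.

```python
# Sketch: the parable's lopsided design vs. equal-probability sampling, one draw each.
# The elephant weights are hypothetical; only the two designs come from the parable.
import numpy as np

rng = np.random.default_rng(0)

N = 50
weights = rng.normal(5000.0, 300.0, size=N)  # hypothetical herd weights (kg)
weights[0] = 5000.0                          # "Sambo": about average
weights[1] = 10000.0                         # "Jumbo": the huge one
true_total = weights.sum()

# Perverse design: pi_Sambo = 99/100, pi_i = 1/4900 for each other elephant.
pi_perverse = np.full(N, 1 / 4900)
pi_perverse[0] = 99 / 100

# Equal-probability design: pi_i = 1/50 for every elephant.
pi_equal = np.full(N, 1 / N)

def ht_totals(pi, n_sims):
    """Simulate n_sims single-elephant draws; return the HT totals y_i / pi_i."""
    idx = rng.choice(N, size=n_sims, p=pi / pi.sum())
    return weights[idx] / pi[idx]

for name, pi in [("perverse", pi_perverse), ("equal-probability", pi_equal)]:
    est = ht_totals(pi, 200_000)
    rmse = np.sqrt(((est - true_total) ** 2).mean())
    print(f"{name:>18}: mean {est.mean():>12,.0f}   rmse {rmse:>12,.0f}")

print(f"{'true total':>18}: {true_total:>12,.0f}")
```

With these hypothetical weights, both simulated means land near the true total, while the perverse design's RMSE is orders of magnitude larger than the equal-probability design's.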

Stability is another matter: variance depends on the spread of the weighted contributions $y_i/\pi_i$ and on pairwise inclusion probabilities $\pi_{ij}$ that govern co‑selection. Designs that avoid tiny $\pi_i$ on units that might be large, and that avoid clumpy co‑selection, curb risk; designs that do the opposite magnify it.
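
For the HT total this is explicit in the standard variance formulas (with $\pi_{ii} = \pi_i$; the second, Sen–Yates–Grundy, form holds for fixed-size designs):

$$
\operatorname{Var}\!\left(\hat{T}_{\mathrm{HT}}\right)
= \sum_{i}\sum_{j}\left(\pi_{ij} - \pi_i \pi_j\right)\frac{y_i}{\pi_i}\,\frac{y_j}{\pi_j}
= \frac{1}{2}\sum_{i}\sum_{j \neq i}\left(\pi_i \pi_j - \pi_{ij}\right)\left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^{2}.
$$

The second form shows the levers directly: variance falls when the weighted contributions $y_i/\pi_i$ are roughly even, and the pairwise $\pi_{ij}$ determine how heavily each squared difference is weighted by co-selection.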

Once some structure is observable—even a rough proxy—there are two natural ways to use it.

  1. Amend the Sampling Design. Ask where the total comes from. If a proxy $x$ tends to rise with $y$, give higher selection probability to units with larger $x$ (probability‑proportional‑to‑size, PPS), or sort units into coarse $x$‑bands, draw from each, and reweight. When $x \approx c \cdot y$, the contributions $\frac{y_i}{\pi_i}$ are roughly even and variance falls; with only monotone/noisy $x$, the same logic still helps, and residual imbalance can be corrected by post‑stratification or calibration to known totals. In practice, this shows up as PPS with replacement (Hansen–Hurwitz) or without replacement (e.g., Sampford/Brewer). The without‑replacement variants usually reduce variance by avoiding duplicates but require handling $\pi_{ij}$ for variance estimation. (A PPS simulation sketch follows this list.)
  2. Model the Outcome. When the goal is to learn a generating process or the design is only partly known, write down a model for $y \mid x$ (and, if needed, selection), estimate with partial pooling, and aggregate or predict. Standard patterns include outcome regression, doubly‑robust AIPW (outcome + propensity), and hierarchical/Bayesian models. Here assumptions—not design‑unbiasedness—do the guarding, so diagnostics and sensitivity checks matter; design information can still enter as covariates, offsets, or priors. (A prediction-estimator sketch also follows this list.)
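
A minimal simulation sketch of option 1, on synthetic data (the population, the proxy $x$, and the sample size are all made up for illustration): draw with probability proportional to $x$, with replacement, and use the Hansen–Hurwitz estimator; compare against simple random sampling at the same $n$.

```python
# Sketch: PPS with replacement (Hansen-Hurwitz) vs. simple random sampling, synthetic data.
# x is a frame proxy known for all units; y is observed only on sampled units.
import numpy as np

rng = np.random.default_rng(1)

N, n = 5_000, 100
x = rng.lognormal(mean=3.0, sigma=1.0, size=N)             # size proxy on the frame
y = 2.0 * x * rng.lognormal(mean=0.0, sigma=0.2, size=N)   # outcome, roughly proportional to x
true_total = y.sum()

p = x / x.sum()   # single-draw selection probabilities, proportional to size

def hansen_hurwitz(n_sims):
    """PPS with replacement: average y_i / p_i over the n draws in each simulated sample."""
    ests = np.empty(n_sims)
    for s in range(n_sims):
        idx = rng.choice(N, size=n, replace=True, p=p)
        ests[s] = np.mean(y[idx] / p[idx])
    return ests

def srs_expansion(n_sims):
    """Equal-probability SRS without replacement: N times the sample mean."""
    ests = np.empty(n_sims)
    for s in range(n_sims):
        idx = rng.choice(N, size=n, replace=False)
        ests[s] = N * y[idx].mean()
    return ests

for name, ests in [("PPS (Hansen-Hurwitz)", hansen_hurwitz(2_000)),
                   ("SRS expansion", srs_expansion(2_000))]:
    rmse = np.sqrt(((ests - true_total) ** 2).mean())
    print(f"{name:>22}: mean {ests.mean():>14,.0f}   rmse {rmse:>12,.0f}")

print(f"{'true total':>22}: {true_total:>14,.0f}")
```

Because $y$ tracks $x$, the PPS contributions $y_i/p_i$ are nearly constant and the Hansen–Hurwitz totals are far less variable than the SRS expansion totals for the same sample size.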
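
A minimal sketch of option 2, assuming (for illustration only) that $y \mid x$ is linear: fit the model on the sampled units, predict the unsampled ones from the frame's $x$, and sum. A real analysis would add the diagnostics, pooling, and sensitivity checks described above.

```python
# Sketch: model the outcome and predict the non-sampled units (prediction estimator).
# Synthetic data; a real analysis would add diagnostics and sensitivity checks.
import numpy as np

rng = np.random.default_rng(2)

N, n = 5_000, 100
x = rng.lognormal(mean=3.0, sigma=1.0, size=N)             # covariate known for the whole frame
y = 10.0 + 2.0 * x + rng.normal(0.0, 5.0, size=N)          # outcome, observed only on the sample
true_total = y.sum()

sample = rng.choice(N, size=n, replace=False)              # here: a simple random sample
mask = np.zeros(N, dtype=bool)
mask[sample] = True

# Fit y ~ 1 + x on the sampled units by ordinary least squares.
X_s = np.column_stack([np.ones(n), x[sample]])
beta, *_ = np.linalg.lstsq(X_s, y[sample], rcond=None)

# Prediction estimator: observed y for sampled units + model predictions elsewhere.
X_out = np.column_stack([np.ones((~mask).sum()), x[~mask]])
total_hat = y[sample].sum() + (X_out @ beta).sum()

print(f"prediction estimate: {total_hat:,.0f}")
print(f"true total:          {true_total:,.0f}")
```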

Points to keep

  • Unbiasedness is a property, not a decision rule. You can be design-unbiased and still high-risk for the population at hand. Choose methods for the loss you actually care about (MSE, tail risk, subdomain precision), not because a single property looks virtuous.
  • Use information when you have it; admit when you don't. No information → equal-probability sampling + sample mean (HT/Hájek under equal $\pi$). Visible structure or proxies → encode it in the design (certainty units, PPS, stratification) and/or in the estimator (post-stratification, calibration, ratio/GREG). When you trim or stabilize weights, or use model-assisted estimators, estimate variance with methods that reflect those choices (linearization or replication), or standard errors will mislead. (A small post-stratification sketch follows this list.)
  • Control leverage, not just expectations. Even with no outcome information, you can cap leverage by design constraints: enforce a lower bound on inclusion probabilities ($\pi_i \geq \varepsilon$), cap maximum design weights ($1/\pi_i \leq W_{\text{max}}$), and use spread/balanced sampling to control co-selection through $\pi_{ij}$. (A small sketch of the floor-and-cap step follows this list.)
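
As a sketch of the estimator-side adjustment mentioned above, here is post-stratification on synthetic data: design weights are rescaled within each category so the weighted counts match known population totals (the categories and counts are hypothetical).

```python
# Sketch: post-stratification of design weights to known category totals.
# Synthetic example; the category counts would come from a census or frame.
import numpy as np

rng = np.random.default_rng(3)

# Sampled units: design weights and a category observed for each respondent.
n = 200
categories = np.array(["A", "B", "C"])
cat = rng.choice(categories, size=n, p=[0.55, 0.30, 0.15])   # sample composition
w = np.full(n, 50.0)                                         # design weights (equal here)

# Known population counts per category (from the frame or a census).
pop_counts = {"A": 4_000, "B": 4_000, "C": 2_000}

# Rescale weights within each category so weighted counts hit the known totals.
w_ps = w.copy()
for c, target in pop_counts.items():
    in_c = cat == c
    w_ps[in_c] *= target / w[in_c].sum()

for c in categories:
    in_c = cat == c
    print(f"{c}: design-weighted {w[in_c].sum():>8,.0f}  post-stratified {w_ps[in_c].sum():>8,.0f}")
```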
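
And a sketch of the first two leverage constraints, using a hypothetical helper floor_and_cap_pi and made-up sizes: start from size-proportional targets, pin any unit that violates the floor $\varepsilon$ or the natural cap of 1, and give the remaining probability budget to the free units so the total still equals the sample size $n$.

```python
# Sketch: enforce a floor and the natural cap of 1 on size-proportional inclusion
# probabilities, keeping the total equal to the sample size n. Hypothetical helper.
import numpy as np

def floor_and_cap_pi(x, n, eps, max_iter=100):
    """Inclusion probabilities proportional to x, with eps <= pi_i <= 1 and sum(pi) = n.

    Assumes the problem is feasible (eps * len(x) <= n <= len(x)).
    """
    x = np.asarray(x, dtype=float)
    pi = n * x / x.sum()                       # raw PPS target; may violate the bounds
    pinned_lo = np.zeros(len(x), dtype=bool)   # units held at the floor eps
    pinned_hi = np.zeros(len(x), dtype=bool)   # certainty units held at 1
    for _ in range(max_iter):
        pinned_lo |= pi < eps
        pinned_hi |= pi > 1.0
        pi[pinned_lo] = eps
        pi[pinned_hi] = 1.0
        free = ~pinned_lo & ~pinned_hi
        if not free.any():
            break
        # Give the remaining budget to the un-pinned units, still proportional to x.
        remaining = n - eps * pinned_lo.sum() - pinned_hi.sum()
        pi[free] = remaining * x[free] / x[free].sum()
        if (pi[free] >= eps).all() and (pi[free] <= 1.0).all():
            break                              # all constraints met; otherwise pin and repeat
    return pi

# Example: several tiny units and one huge one; n = 3 draws, floor of 0.05.
x = np.array([0.1, 0.2, 0.3, 5.0, 8.0, 40.0, 200.0])
pi = floor_and_cap_pi(x, n=3, eps=0.05)
print(np.round(pi, 3), "sum =", round(float(pi.sum()), 3))
# Drawing a fixed-size sample with these pi_i (e.g., systematic or conditional
# Poisson sampling) is a separate step, as is tracking pi_ij for variance estimation.
```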

In short, the vice in Basu's setup is a perverse design masquerading as virtue. HT is a reasonable answer when your target is a finite total and you decline to model outcomes beyond the design; it is not a shield for bad designs. When your aim is process learning, or when useful structure is visible, use designs and estimators that reflect that information and the loss you actually care about.
