The Ties That Bound
The previous post described a conservative record-linkage pipeline: accept only high-confidence matches, flag ambiguous cases, and leave the rest unmatched. That is the right approach when false matches are more costly than missed ones, but it means the matched subset is selected on ease of matching, and estimates from it may not recover population quantities.
Consider what makes records hard to match in a fuzzy join. Common names generate more candidate collisions — a "Ram Kumar" produces dozens of plausible matches where a "Jagmohan Trivikramji" produces one — and name frequency tracks caste and socioeconomic status. Missing fields leave less to score on, and field completeness reflects data collection infrastructure, which varies with remoteness and institutional capacity. Inconsistent transliteration across tables lowers similarity scores, and transliteration conventions vary by language and region. In each case, the records the pipeline cannot resolve are systematically different from the ones it can.
A common response to this selection concern is to vary the threshold a little and note that the point estimate does not move much. But that check leaves useful information on the table. For each unresolved record, the pipeline leaves behind a set of candidates that survived scoring. Those candidate sets, combined with the constraint that the true matching is one-to-one, are enough to bound the population estimand without committing to a probability model for the linkage.
Bounding without a probability model complements the model-based approaches surveyed by Kamat and Gutman (Statistical Science, 2026): likelihood and Bayesian methods, multiple imputation, and weighting. All of these require a probability model for the linkage process, and all propagate uncertainty through that model. When such a model is credible, it is the right tool. The question here is what can be learned from the candidate graph alone, without taking a stand on the right probabilistic model. That is partial identification in Manski's sense, and it requires only the candidate sets and the structural constraint that the true matching is injective.
Start with the simplest estimand, the population mean. For each record $i$ in table A, let $C_i$ denote the set of candidates in table B that survive scoring and filtering. If $C_i$ contains the true match, then the true outcome for record $i$ is one of $\{Y_j : j \in C_i\}$. If $C_i$ is empty, all that is known is that $Y_i$ lies in the support $[\underline{y}, \bar{y}]$.
The loosest bounds treat each record independently. Push each uncertain record to its local best or worst case:
$$\mu^{UB} = \frac{1}{N}\left[\sum_{i: C_i \neq \emptyset} \max_{j \in C_i} Y_j + n_U \cdot \bar{y}\right]$$
$$\mu^{LB} = \frac{1}{N}\left[\sum_{i: C_i \neq \emptyset} \min_{j \in C_i} Y_j + n_U \cdot \underline{y}\right]$$
where $n_U$ is the number of records with no surviving candidate.
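As a sketch, the pointwise bounds are a single pass over the candidate sets. The data and function name here are illustrative, not from any real pipeline: each record is pushed to its local best or worst case, with empty candidate sets falling back to the support.

```python
# Pointwise (non-sharp) bounds on the mean from candidate sets.
# candidate_outcomes: one list of candidate outcomes per record in A;
# an empty list means no candidate survived scoring.

def pointwise_bounds(candidate_outcomes, y_lo, y_hi):
    """Per-record best/worst case, ignoring the one-to-one constraint."""
    lo = hi = 0.0
    for outcomes in candidate_outcomes:
        if outcomes:            # C_i nonempty: local best/worst candidate
            lo += min(outcomes)
            hi += max(outcomes)
        else:                   # C_i empty: full support [y_lo, y_hi]
            lo += y_lo
            hi += y_hi
    n = len(candidate_outcomes)
    return lo / n, hi / n

# The two-record example from below: shared candidates with outcomes 10 and 0.
cands = [[10.0, 0.0], [10.0, 0.0]]
print(pointwise_bounds(cands, y_lo=0.0, y_hi=10.0))  # (0.0, 10.0)
```

Note that the upper bound of 10 lets both records claim the same candidate, which is exactly the infeasibility the next example makes concrete.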
A small example shows both what these bounds get right and what they miss. Suppose there are two records in A, $i_1$ and $i_2$, each with the same two surviving candidates in B, $j_1$ and $j_2$, with outcomes $Y_{j_1} = 10$ and $Y_{j_2} = 0$. The pointwise upper bound lets both records claim $j_1$, so it sets the mean to $(10 + 10)/2 = 10$. But that is not a feasible linkage. If the match is one-to-one, only one of the two records can be matched to $j_1$. The other must take $j_2$ or remain unmatched. Once the matching structure is respected, the sharp upper bound is $(10 + 0)/2 = 5$, not 10. The looseness in the pointwise bounds comes exactly from ignoring the coupling across records.
The natural correction is to optimize over feasible linkages rather than over records in isolation. Let $\mathcal{M}$ be the set of feasible injective partial matchings on the candidate graph. Then the sharp upper bound solves:
$$\mu^{UB} = \frac{1}{N} \max_{\pi \in \mathcal{M}} \sum_i Y_{\pi(i)}$$
with the lower bound defined analogously, and unmatched records contributing $\bar{y}$ or $\underline{y}$ as appropriate. Computationally, this is an assignment or min-cost flow problem with an outside option for unmatched records.
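A minimal sketch of the assignment formulation, using `scipy.optimize.linear_sum_assignment`. It assumes candidate-graph recall holds (every record with a nonempty candidate set is truly matched within it), that a feasible one-to-one matching exists, and that there are at least as many candidates as records; the data are illustrative.

```python
# Sharp bounds on the mean via linear assignment over feasible
# one-to-one matchings. Records with empty candidate sets are handled
# separately through the support bound (n_empty of them).
import numpy as np
from scipy.optimize import linear_sum_assignment

def sharp_bounds(candidate_ids, outcomes, y_lo, y_hi, n_empty=0):
    """candidate_ids: one nonempty list of candidate indices per record.
    outcomes: array of outcomes Y_j, indexed by candidate id."""
    n, m = len(candidate_ids), len(outcomes)
    BIG = 1e9                               # penalty for infeasible pairs
    cost = np.full((n, m), BIG)
    for i, cands in enumerate(candidate_ids):
        for j in cands:
            cost[i, j] = -outcomes[j]       # minimize -Y  <=>  maximize Y
    _, cols = linear_sum_assignment(cost)   # upper-bound matching
    ub_sum = outcomes[cols].sum()
    cost_lo = np.where(cost == BIG, BIG, -cost)
    _, cols = linear_sum_assignment(cost_lo)  # lower-bound matching
    lb_sum = outcomes[cols].sum()
    N = n + n_empty
    return (lb_sum + n_empty * y_lo) / N, (ub_sum + n_empty * y_hi) / N

# The two-record example: both records share candidates with outcomes 10 and 0.
Y = np.array([10.0, 0.0])
print(sharp_bounds([[0, 1], [0, 1]], Y, y_lo=0.0, y_hi=10.0))  # (5.0, 5.0)
```

The one-to-one constraint forces one record onto each candidate in every feasible matching, so both bounds collapse to 5, as in the worked example above.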
The pipeline's three tiers determine how much each record contributes to the width of these bounds. Records with one clear surviving candidate — the high-confidence tier — contribute little or no width. Records with multiple plausible candidates contribute the range of outcomes across their candidates, tightened by the one-to-one constraint. Records with no surviving candidate contribute the full support. All of this is available before any downstream model is fit: plot the margin distribution, identify records with small margins, and look at the outcome range across their candidates. If those ranges are small, the bounds will be tight. If they are large, the bounds will be wide, and you know exactly which records are responsible.
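The per-record width diagnostic can be sketched in a few lines (illustrative data and naming, not from the pipeline): singletons contribute zero, multi-candidate records contribute their outcome range, and empty sets contribute the full support.

```python
# Per-record contribution to bound width, before the one-to-one
# constraint tightens anything: a quick pre-modeling diagnostic.

def width_contributions(candidate_outcomes, y_lo, y_hi):
    widths = []
    for outcomes in candidate_outcomes:
        if not outcomes:
            widths.append(y_hi - y_lo)                 # no candidate: full support
        else:
            widths.append(max(outcomes) - min(outcomes))  # zero for singletons
    return widths

# One high-confidence record, one ambiguous record, one unresolved record.
print(width_contributions([[7.0], [10.0, 0.0], []], y_lo=0.0, y_hi=10.0))
# [0.0, 10.0, 10.0]
```

Sorting records by this contribution points directly at where clerical review or better features would tighten the bounds most.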
Treating the high-confidence tier as known (zero width contribution) is a stronger commitment than the general candidate-set bounds require. It assumes those matches are correct, not just plausible. The commitment pays off with tighter bounds when accuracy is high, and costs coverage when it is not.
Seen this way, threshold-setting becomes an inferential problem rather than a purely linkage problem. A conservative pipeline is usually described as managing a precision-recall tradeoff over committed matches. For the bounds, the relevant condition is different: the true match does not need to be identified, but it does need to appear somewhere in the candidate set. Tightening thresholds narrows the bounds by removing candidates, but risks pruning the true edge and losing coverage. Loosening thresholds does the opposite. The tradeoff is width versus candidate-graph recall.
Treatment effects
The previous post distinguished two designs: one where linkage determines who counts as treated, and one where treatment status is already known and linkage only recovers outcomes. The bounds interact with that distinction differently.
When linkage determines treatment, different feasible matchings produce different treated and control groups. The sharp ATE bounds optimize over matchings that jointly determine group membership and outcomes:
$$\tau^{UB} = \max_{\pi \in \mathcal{M}} \left[\frac{\sum_{i: \pi(i) \in T} Y_{\pi(i)}}{|\{i: \pi(i) \in T\}|} - \frac{\sum_{i: \pi(i) \in C} Y_{\pi(i)}}{|\{i: \pi(i) \in C\}|}\right]$$
Because the denominators depend on the matching, this is not a linear assignment problem, though the ambiguous subgraph is typically small enough for enumeration or mixed-integer methods.
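When the ambiguous subgraph is small, brute-force enumeration is enough. A sketch under hypothetical data, where each candidate carries an outcome and an arm label that the matching determines:

```python
# Sharp ATE bounds by enumerating feasible injective matchings on a
# small ambiguous subgraph. Y[j] is candidate j's outcome; arm[j] is
# the arm ('T' or 'C') a record lands in if matched to j.
from itertools import product

def ate_bounds(candidate_ids, Y, arm):
    best_lo, best_hi = float("inf"), float("-inf")
    for assign in product(*candidate_ids):      # one candidate per record
        if len(set(assign)) != len(assign):     # enforce one-to-one
            continue
        t = [Y[j] for j in assign if arm[j] == "T"]
        c = [Y[j] for j in assign if arm[j] == "C"]
        if not t or not c:                      # both arms must be nonempty
            continue
        tau = sum(t) / len(t) - sum(c) / len(c)
        best_lo, best_hi = min(best_lo, tau), max(best_hi, tau)
    return best_lo, best_hi

# Three records; record 0's match decides which arm a high outcome joins.
Y   = {0: 9.0, 1: 2.0, 2: 4.0, 3: 1.0}
arm = {0: "T", 1: "C", 2: "T", 3: "C"}
print(ate_bounds([[0, 1], [2], [3]], Y, arm))  # (2.5, 5.5)
```

The nonlinearity shows up in the denominators: the two feasible matchings here give different group sizes as well as different group sums.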
When treatment status is already known, within-arm ambiguity seems harmless: permuting outcomes within an arm leaves the arm mean unchanged. But this holds only when every feasible within-arm reassignment produces the same multiset of outcomes. If treated units share candidates, different feasible matchings can move the treated mean even without cross-arm confusion.
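A tiny illustration of this within-arm effect, with hypothetical data: two treated records share a candidate, so different feasible matchings pick up different outcome multisets and the treated mean moves.

```python
# Within-arm ambiguity with shared candidates: enumerate the treated
# means attainable across feasible injective matchings.
from itertools import product

def treated_means(candidate_ids, Y):
    means = set()
    for assign in product(*candidate_ids):
        if len(set(assign)) == len(assign):     # injective matchings only
            means.add(sum(Y[j] for j in assign) / len(assign))
    return sorted(means)

# Both records are treated; all candidates lie in the treated arm.
Y = {0: 10.0, 1: 0.0, 2: 6.0}
print(treated_means([[0, 1], [0, 2]], Y))  # [3.0, 5.0, 8.0]
```

If the two records had disjoint candidate sets with identical outcome multisets, the set above would be a singleton, which is the case where within-arm ambiguity really is harmless.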
For both population means and treatment effects, the bounds are computable before fitting any downstream model, and the candidate graph shows which records drive the width. That makes this a design tool as much as an inferential one: it indicates where better blocking, better linkage features, or clerical review would yield the greatest return.