Strings Attached: Shipping Fuzzy‑Joined Data


A growing share of empirical work relies on "data products" that integrate multiple administrative and survey sources via fuzzy joins. Researchers typically receive only the final merged panel and treat each cell as if it were directly observed. But upstream, the data producer has made substantive decisions---how aggressively to deduplicate within a table, how to match noisy identifiers across tables, what to do with ambiguous matches---that shape the final dataset in ways users cannot see.

This note asks: for data products designed for statistical inference, what information must be exposed so that analysts can reason about uncertainty and bias?

The Pipeline

Fuzzy matching enters typical data pipelines in two places: within-table deduplication and cross-table linkage. These are related problems but not identical ones.

Within-table deduplication identifies groups of records that may refer to the same entity---firm, school, village. For each group, the pipeline either selects the most complete or consistent record and drops the rest, or discards the entire group if no record is clearly best. Deduplication is an upstream step: unresolved duplicates that survive into cross-table scoring can create false matches that downstream filters may not reliably screen out. Two records for the same firm with slightly different names will each find a plausible target in the other table, producing one correct link and one false link that looks clean.

Cross-table linkage is a separate stage. Given a source table and a target table, the goal is to identify which record in the target corresponds to each record in the source, using noisy identifiers such as place names, addresses, or transliterated personal names. Whenever the mapping has structural constraints---one-to-one in entity matching, capacity limits in many-to-one joins---linkage decisions are coupled: two source records sharing a candidate cannot be resolved independently, because assigning one affects what is available to the other. The one-to-one case is the tightest such constraint and the focus here.

A typical conservative pipeline blocks on trusted fields, scores pairs within each block using string similarities or other features, filters out ambiguous cases (where the best and second-best candidates score similarly), resolves remaining conflicts via greedy or optimal assignment, and leaves unresolved cases unmatched. The resulting analytic file contains only those source records that received a committed match. In common practice, the data product ships only the output of this single policy, with no way for downstream users to see or vary the decisions that shaped the final file.
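The stages above can be sketched in a few dozen lines. This is a minimal illustration, not any particular production pipeline: the record fields (`name` plus a trusted blocking field), the similarity function, and the thresholds are all assumptions for the example.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Simple string similarity in [0, 1]; real pipelines use
    # Jaro-Winkler, token-set measures, or learned scores.
    return SequenceMatcher(None, a, b).ratio()

def link(source, target, block_key, eps=0.80, margin=0.05):
    """Block on a trusted field, score within blocks, drop ambiguous
    cases via a margin filter, then resolve conflicts greedily
    (one-to-one). Records with no surviving candidate stay unmatched."""
    # Block: only compare records that agree on the trusted field.
    blocks = {}
    for j, t in enumerate(target):
        blocks.setdefault(t[block_key], []).append(j)

    survivors = []  # (score, i, j) pairs passing screening and margin
    for i, s in enumerate(source):
        scored = sorted(
            ((similarity(s["name"], target[j]["name"]), i, j)
             for j in blocks.get(s[block_key], ())),
            reverse=True)
        scored = [p for p in scored if p[0] >= eps]
        if not scored:
            continue  # empty candidate set: unmatched
        if len(scored) > 1 and scored[0][0] - scored[1][0] < margin:
            continue  # best and runner-up too close: margin filter drops it
        survivors.append(scored[0])

    # Greedy one-to-one assignment by descending score.
    used_i, used_j, links = set(), set(), {}
    for score, i, j in sorted(survivors, reverse=True):
        if i not in used_i and j not in used_j:
            links[i] = j
            used_i.add(i)
            used_j.add(j)
    return links
```

The analytic file then contains only the keys of `links`; everything the margin filter or the assignment step dropped silently disappears, which is exactly the visibility problem discussed below.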

Consequences for Inference

Records dropped by the matching pipeline are missing for a specific reason: the algorithm could not resolve their candidate sets under the chosen policy. The pattern of missingness tracks the data. Places with common or highly variable names generate more candidate collisions, producing ambiguous sets that the margin filter removes. Units with worse data quality---smaller firms, marginal villages---have noisier identifiers, hence more ambiguous or empty candidate sets. Regions with inconsistent transliteration conventions or boundary changes are systematically harder to match.

Whether this selection causes bias depends on the estimand. For a population mean, bias arises when the expected outcome among dropped records differs from the expected outcome among retained ones. For a treatment effect, dropped records reduce the effective treated and control groups non-randomly: if hard-to-match records concentrate in one arm, the comparison is over a selected subset of the intended population.

False matches are a separate problem. A conservative pipeline reduces their incidence by dropping ambiguous cases, but the false matches that survive are systematic---they concentrate among common names, dense blocks, and records with missing fields. When linkage determines treatment status, a single false match contaminates both groups: an untreated person enters the treatment group while the actual treated person falls into control. When treatment status is already known and linkage only recovers outcomes, within-group false matches can be benign in a narrow special case: if the errors amount to a one-to-one permutation of outcomes within the group, with no cross-group contamination, no selective unmatching, and no weighting, the unweighted group mean is unchanged. Outside that special case, within-group false matches distort estimates.

The data product should make both problems---dropped records and fragile links---visible so that analysts can evaluate their severity.

What to Ship

1. Crosswalks and Weights

The mapping from raw identifiers to analysis units must be explicit and auditable: crosswalk tables listing each raw record's corresponding analysis unit(s), and weights for aggregation (population, land area, fractional assignment weights when a record spans units). These make the structure of the mapping visible---splits, merges, one-to-many, many-to-one---and allow analysts to reconstruct aggregates and inspect which raw records feed into each unit.

2. The Candidate Graph

For each pair $(i, j)$ of source and target records within a block, the pipeline computes a pairwise evidence score $s_{ij} = f(x_{ij}) \in [0, 1]$ based on features of the pair---string similarities, geographic agreement, numeric distances. This score is row-separable: it depends only on the features of $(i, j)$, not on what other candidates exist.

Row-separable scores are not, in general, calibrated match probabilities. A score of 0.85 between two "Ram Kumar" records does not carry the same match probability as 0.85 between two "Jagmohan Trivikramji" records, because common names have a higher base rate of false agreement. The classical Fellegi-Sunter framework weights fields by the likelihood ratio $m/u$, where $m$ is the probability of agreement given a true match and $u$ is the probability of agreement by chance, but in its standard form it does not automatically handle the common-versus-rare-name problem; that requires value-specific or frequency-based adjustments. Many practical pipelines use heuristic similarities that lack even this basic calibration. When scores are calibrated probabilities, probabilistic operations on them---weighting, sampling, expected-loss calculations---are interpretable. When they are heuristic similarities, they define candidate sets and ordering but do not license probabilistic manipulation.
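The Fellegi-Sunter weighting and the frequency adjustment it needs can be made concrete. In this sketch the $m$ and $u$ values are assumed illustrative numbers, not estimates from real data; the point is that agreement on a rare value, which has a much smaller chance-agreement rate $u$, carries far more weight than agreement on a common one.

```python
import math

def field_weight(agrees, m, u):
    """Fellegi-Sunter log-likelihood weight for one field:
    log2(m/u) on agreement, log2((1-m)/(1-u)) on disagreement."""
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))

# Frequency-based adjustment: assign common name values a higher
# chance-agreement rate u, so agreement on them is weaker evidence.
w_common = field_weight(True, m=0.95, u=0.05)   # common name: high u
w_rare = field_weight(True, m=0.95, u=0.001)    # rare name: low u
```

Summing such weights across fields gives a log-likelihood-ratio score; heuristic string similarities have no analogous interpretation, which is why they should not be treated as probabilities downstream.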

Given a screening threshold $\varepsilon$, the candidate set for source record $i$ is $\mathcal{C}_i = \{ j : s_{ij} \ge \varepsilon \}$. The candidate graph $G = (A \cup B, E)$ has an edge for every surviving pair, with weight $s_{ij}$. This graph is conditional on upstream blocking and screening: pairs excluded at those stages are treated as nonmatches and do not appear as edges, so any uncertainty analysis over the graph inherits that conditioning.

Ship this graph as a candidate-match table with one row per surviving pair: source identifier, target identifier, $s_{ij}$, rank within the candidate set, and whether this pair is used in the default linkage.
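A sketch of constructing that table from per-pair scores. The field names and input shapes here are illustrative, not a fixed schema: `scores` maps `(source_id, target_id)` pairs to $s_{ij}$, and `default_links` records the shipped linkage.

```python
def candidate_table(scores, eps, default_links):
    """One row per pair surviving screening: source id, target id,
    score, rank within the source record's candidate set, and whether
    the default policy committed this pair."""
    by_source = {}
    for (i, j), s in scores.items():
        if s >= eps:  # pairs below the screening threshold never appear
            by_source.setdefault(i, []).append((s, j))

    rows = []
    for i, cands in by_source.items():
        for rank, (s, j) in enumerate(sorted(cands, reverse=True), start=1):
            rows.append({"source_id": i, "target_id": j, "score": s,
                         "rank": rank,
                         "in_default": default_links.get(i) == j})
    return rows
```
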

The shipped linkage is one feasible matching among many. In a one-to-one setting, the space of alternatives is the set of injective partial matchings on $G$---subsets of edges where each node appears at most once. The standard rowwise framing, which defines ambiguity for each record independently, misses the coupling that structural constraints create. Two records may each have an unambiguous best candidate, but if that best candidate is the same target, one of them must take a worse option or go unmatched. A data product that reports only per-record ambiguity flags hides this; the candidate graph preserves it.

This coupling has direct consequences for how analysts explore the space of alternatives. Rowwise operations that treat each record independently---assigning each fractionally across its candidates, or sampling independently from each candidate set---can use the same target more than once, producing infeasible linkages. Consider two source records that each have the same two candidates, with outcomes 10 and 0, where the true matching is a bijection. Independent rowwise optimization lets both claim the high-outcome candidate, producing a mean of 10. Every feasible bijection assigns one source to each candidate, so the sharp upper bound on the population mean is $(10 + 0)/2 = 5$. The looseness comes entirely from ignoring the coupling. (Under partial matching, where sources can go unmatched and unmatched records contribute support bounds, the gap between pointwise and sharp bounds depends on the support and the estimand.)
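The two-record example can be checked by brute force: enumerate assignments, keep only feasible bijections, and compare the rowwise bound against the sharp one. Identifiers and outcome values below are the toy numbers from the text.

```python
from itertools import permutations

# Both sources have candidates {A, B}; outcomes are attached to targets.
outcomes = {"A": 10, "B": 0}
candidates = {"s1": {"A", "B"}, "s2": {"A", "B"}}
sources = list(candidates)

# Rowwise (infeasible) upper bound: each source claims its best candidate,
# so both claim A and the same target is used twice.
rowwise = sum(max(outcomes[c] for c in cs)
              for cs in candidates.values()) / len(candidates)

# Sharp upper bound: optimize over feasible bijections only.
feasible_means = []
for perm in permutations(outcomes):  # orderings of targets over sources
    if all(t in candidates[s] for s, t in zip(sources, perm)):
        feasible_means.append(sum(outcomes[t] for t in perm) / len(perm))
sharp_upper = max(feasible_means)
```

Brute-force enumeration only works for tiny graphs; at scale the same sharp bounds come from assignment solvers over the candidate graph.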

There are several ways to explore the space of feasible matchings. Re-threshold scores and re-assign on the resulting subgraph. Compute sharp bounds on estimands by optimizing over feasible matchings. Resample feasible matchings to propagate linkage uncertainty into standard errors---though resampling requires a justified probability law over matchings; the candidate graph provides the support, and the analyst must supply the measure. Fit a probability model for the linkage process, using the candidate graph as support, in the style surveyed by Kamat and Gutman (2026). All of these require the same shipped object: the candidate graph with per-pair scores and structural constraints.
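The first of these, re-thresholding and re-assigning, needs nothing beyond the shipped candidate-match table. A minimal sketch, with a hypothetical three-row table and a greedy re-assignment (a stand-in for whatever assignment rule the analyst prefers):

```python
def rethreshold(table, eps):
    """Keep pairs with score >= eps, then re-resolve one-to-one
    greedily by descending score. `table` rows carry source_id,
    target_id, and score, as in the shipped candidate-match table."""
    used_s, used_t, links = set(), set(), {}
    for row in sorted(table, key=lambda r: -r["score"]):
        if row["score"] < eps:
            break  # sorted descending, so nothing further survives
        if row["source_id"] not in used_s and row["target_id"] not in used_t:
            links[row["source_id"]] = row["target_id"]
            used_s.add(row["source_id"])
            used_t.add(row["target_id"])
    return links

# Hypothetical candidate-match table: b's best candidate collides with a's.
table = [{"source_id": "a", "target_id": "x", "score": 0.95},
         {"source_id": "b", "target_id": "x", "score": 0.90},
         {"source_id": "b", "target_id": "y", "score": 0.82}]

# Sensitivity curve: match rate as the screening threshold varies.
rates = {eps: len(rethreshold(table, eps)) / 2 for eps in (0.80, 0.85, 0.92)}
```

If the estimate moves materially as `eps` sweeps a plausible range, the default linkage policy is doing real inferential work and the analyst knows to look closer.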

3. The Full Source Roster

The candidate-match table contains only records with at least one surviving candidate. Records where nothing passed screening have no rows in that table and no rows in the analytic file. Ship the full source roster with a `match_status` column on every record: matched under the default policy, unmatched with ambiguous candidates, or unmatched with no candidates. With this column, analysts can tabulate match rates by geography, size, or other characteristics and evaluate whether the loss is plausibly ignorable for their estimand.
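The tabulation is a one-liner once the roster exists. Field names below (`region`, `match_status`, the status labels) are illustrative assumptions, not a prescribed schema.

```python
from collections import Counter

def match_rates(roster, by):
    """Match rate by a grouping variable, computed from the full source
    roster rather than from the analytic file alone."""
    groups = {}
    for rec in roster:
        groups.setdefault(rec[by], Counter())[rec["match_status"]] += 1
    return {k: c["matched"] / sum(c.values()) for k, c in groups.items()}
```

A steep gradient in these rates across regions or size classes is the first signal that the missingness is not ignorable.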
