Shipping Fuzzy‑Joined Data for Inference
A growing share of empirical work relies on "data products" that integrate multiple administrative and survey sources via fuzzy joins. Researchers typically receive only the final merged panel and treat each cell as if it were directly observed. But upstream, the data producer has made substantive decisions—how aggressively to deduplicate within a table, how to match noisy identifiers across tables, what to do with ambiguous matches—that shape the final dataset in ways users cannot see.
This paper asks: for data products designed for statistical inference, what information must be exposed so that analysts can reason about uncertainty and bias?
Where Ambiguity Arises
Fuzzy matching enters typical data pipelines in two places:
Within-table deduplication. Collapsing multiple noisy records that may refer to the same firm, school, or village.
Cross-table linkage. Joining two datasets—e.g., a road registry to census villages, or Economic Census records to a geographic frame—via noisy identifiers such as place names.
In both settings, the basic problem is identical. For each record $i$ in a "source" table, the algorithm identifies zero, one, or many plausible "target" records. Often there is no single clearly correct match:
- A village name may fuzzily match several towns in the same district.
- Multiple firm records with nearly identical names and addresses may or may not correspond to the same underlying firm.
The common engineering default treats only "clean" one-to-one matches as usable and drops ambiguous cases. Upstream code flags them as ambiguous or low-confidence; downstream, they simply never appear in the joined table.
This raises two concerns. First, the pattern of dropped records is induced by fuzzy-matching rules, not by randomness, and may be systematically related to observables (region, size, etc.). Second, once ambiguous cases are deleted, users cannot reconstruct or vary the matching decisions; the space of "reasonable" joins is collapsed to a single choice.
Formalizing the Problem
Candidate Sets and Evidence Scores
Let $i$ index records in a source table (or within a table, for deduplication), and $j$ index records in a target table. For each pair $(i, j)$ we construct features $x_{ij}$: string similarities between names, agreement on higher-level geography, numeric distance in coordinates, and so on.
A pairwise evidence score is a function
$$s_{ij} = f(x_{ij}) \in [0, 1],$$
monotone in "how much this pair looks like a match." This is where one would plug in a modified Levenshtein similarity or a model-based estimate of $P(\text{match} = 1 \mid x_{ij})$.
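As a concrete sketch, one can stand in for $f$ with an off-the-shelf string similarity; here Python's `difflib` ratio plays the role of a modified Levenshtein similarity, and the district-agreement bonus is an illustrative assumption, not a recommendation:

```python
from difflib import SequenceMatcher

def evidence_score(name_i: str, name_j: str, same_district: bool) -> float:
    """Toy pairwise score f(x_ij) in [0, 1]: string similarity between
    names, nudged upward when higher-level geography agrees."""
    sim = SequenceMatcher(None, name_i.lower(), name_j.lower()).ratio()
    bonus = 0.1 if same_district else 0.0   # illustrative feature weight
    return min(1.0, sim + bonus)

evidence_score("laxmi nagar", "lakshmi nagar", same_district=True)  # ~0.93
```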
Given a screening threshold $\varepsilon$, the candidate set for $i$ is
$$\mathcal{C}_i = \{\, (j, s_{ij}) : s_{ij} \ge \varepsilon \,\}.$$
Three regimes arise:
- No candidates ($\mathcal{C}_i = \emptyset$): likely unmatched.
- A unique, clearly dominant candidate.
- Multiple plausible candidates: several $j$ with similar scores.
We can formalize "ambiguous" candidate sets in various ways. One natural criterion requires that the gap between the best and second-best scores be small:
$$\text{ambiguous}(i) \iff s_{i(1)} - s_{i(2)} < \delta,$$
where $s_{i(1)} \ge s_{i(2)} \ge \dots$ are the ordered scores in $\mathcal{C}_i$.
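In code, the screening step and the gap criterion amount to a few lines; the threshold values below are placeholders:

```python
def classify_candidate_set(scores, eps=0.5, delta=0.05):
    """Screen scores at eps, sort, and let the gap between the top two
    ordered scores s_i(1) - s_i(2) decide ambiguity."""
    c = sorted((s for s in scores if s >= eps), reverse=True)
    if not c:
        return "no_candidates"   # C_i is empty: likely unmatched
    if len(c) == 1 or c[0] - c[1] >= delta:
        return "dominant"        # unique, clearly dominant candidate
    return "ambiguous"           # several j with similar scores
```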
This structure appears identically in cross-table linkage (source and target tables differ) and deduplication (source and target are the same table). In both cases, we construct a bipartite (or self-) graph of candidate edges $(i, j)$ with weights $s_{ij}$.
The Default Policy as Inner Join Plus Listwise Deletion
A matching or deduplication algorithm is a policy $\pi$ that maps each candidate set $\mathcal{C}_i$ to either a set of chosen matches or "no match." The common policy can be written:
```python
def link(candidates, tau, delta):
    """Default policy: candidates is C_i as a list of (j, s_ij) pairs."""
    if not candidates:
        return None                               # no candidates -> unmatched
    ranked = sorted(candidates, key=lambda p: p[1], reverse=True)
    best_j, best_s = ranked[0]
    second_s = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_s >= tau and best_s - second_s >= delta:
        return best_j                             # argmax_j s_ij
    return None                                   # ambiguous -> dropped
```
In a left join of a source table to a target table, this induces a joined dataset containing only those $i$ for which $\text{link}(i) \neq \text{NULL}$. Structurally, this is equivalent to creating a deterministic mapping $i \mapsto j^*(i)$ for "good" cases, performing an inner join on this mapping, and dropping all rows where $j^*(i)$ is undefined.
The ambiguous records are not marked as uncertain; they are absent. The same pattern holds in deduplication: a policy that selects a canonical representative and discards ambiguous records is applying the same logic in a self-join.
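In pandas terms, the equivalence is direct. This sketch assumes `source` and `target` DataFrames sharing a join key `j`, a dict `cand_sets` from source keys to candidate lists, and the `link` policy above:

```python
import pandas as pd

# Assumed inputs: source/target DataFrames; cand_sets maps each source
# key i to its candidate list [(j, s_ij), ...]; link() is the policy above.
source["j"] = source["i"].map(lambda i: link(cand_sets.get(i, []), tau=0.8, delta=0.05))
joined = (source.dropna(subset=["j"])       # listwise deletion of ambiguous/unmatched
                .merge(target, on="j", how="inner"))
```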
Matching-Induced Missingness
The substantive point is that this deletion is not random. The probability that $\mathcal{C}_i$ is empty or ambiguous depends on the underlying data:
- Places with common or highly variable names will have more ambiguous matches.
- Units with worse data quality (smaller firms, marginal villages) may have noisier names and addresses, hence more ambiguous candidate sets.
- Certain regions may be systematically harder to match because of differences in spelling conventions or boundary changes.
Thus, the set
$$I^* = \{\, i : \pi(\mathcal{C}_i) \neq \text{NULL} \,\}$$
is not a random subsample of the original records. It is selected by properties of the candidate sets. Aggregates computed on $I^*$—number of firms, total employment, averages of outcomes—will generally be biased if ambiguous or unmatched units systematically differ from unambiguous ones.
This is exactly the analogue of listwise deletion in regression: mathematically convenient, but only innocuous under strong assumptions about the missingness mechanism. The key observation is that both pairwise evidence uncertainty (how likely $(i, j)$ is to be a match) and assignment uncertainty within $\mathcal{C}_i$ are present, yet most data products reveal only the final assignment or deletion.
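A toy simulation (all numbers synthetic) makes the bias concrete: if smaller units are more likely to have empty or ambiguous candidate sets, means computed on $I^*$ are shifted upward.

```python
import numpy as np

rng = np.random.default_rng(0)
size = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # latent firm size
# Illustrative assumption: smaller firms have noisier identifiers,
# so their candidate sets are more often empty or ambiguous.
p_match = 1.0 / (1.0 + np.exp(-np.log(size)))            # logistic in log size
matched = rng.random(size.shape[0]) < p_match
print(f"full mean: {size.mean():.2f}, matched mean: {size[matched].mean():.2f}")
```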
Design Principles
The solution pattern is to represent ambiguity rather than resolve it silently via deletion. Below we articulate five design principles that address the core problems.
Transparent Mapping and Weights
Problem. When the mapping from raw records to analysis units is implicit, users cannot see where many-to-one aggregation, boundary crossings, or splits occur.
Requirement. The mapping from raw identifiers (census village codes, firm IDs) to analysis units must be explicit and auditable:
- Crosswalk tables listing, for each raw record, the corresponding analysis unit(s).
- Weights for aggregation (population, land area, fractional assignment weights when a record is split across units).
These make the structure of many-to-one relationships visible and allow analysts to reconstruct aggregates and inspect which raw records feed into each unit.
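A minimal illustration of the crosswalk-plus-weights pattern, with hypothetical column names:

```python
import pandas as pd

# Hypothetical crosswalk: raw codes map to analysis units; a split record
# carries fractional weights that sum to one across its units.
xwalk = pd.DataFrame({
    "village_code": ["V001", "V002", "V002", "V003"],
    "unit_id":      ["U1",   "U1",   "U2",   "U2"],
    "weight":       [1.0,    0.4,    0.6,    1.0],
})

def aggregate(raw: pd.DataFrame, col: str) -> pd.Series:
    """Weighted aggregation of a raw-level column to analysis units."""
    m = raw.merge(xwalk, on="village_code")
    return (m[col] * m["weight"]).groupby(m["unit_id"]).sum()
```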
Row-Separable Evidence Scores
Problem. Without a quantitative measure of match quality at the pair level, all matches appear equal. Analysts cannot distinguish robust links from fragile ones.
Requirement. For any fuzzy linkage or deduplication step, ship a row-separable evidence score $s_{ij}$ for each candidate pair:
- $s_{ij}$ is a function only of the features of the pair $(i, j)$.
- $s_{ij}$ is not normalized within the candidate set; any normalization (e.g., probabilities summing to one over $j \in \mathcal{C}_i$) is treated as a derived quantity.
Row-separability ensures that a match_score column is well-defined: its value for a given pair does not depend on what other candidates are present.
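The distinction is easy to respect in code: ship $s_{ij}$ itself, and let any within-set normalization be computed downstream, for example:

```python
def candidate_weights(candidates):
    """Derived quantity: probabilities over C_i from row-separable scores.
    Adding or removing a candidate changes these weights, never s_ij itself."""
    total = sum(s for _, s in candidates)
    return [(j, s / total) for j, s in candidates]
```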
Explicit Representation of Ambiguity
Problem. Candidate sets that are not "clean" are resolved by deletion; ambiguous cases disappear from the joined table.
Requirement. Ambiguity should be represented explicitly:
- Identify ambiguous candidate sets (e.g., small gap between top two scores).
- Encode this in the shipped data—either by exposing the full candidate set (a table with $i$, $j$, $s_{ij}$, and ranks), or at minimum by flagging units whose links rely on ambiguous candidate sets.
This turns "drop ambiguous matches" from an unobserved internal choice into a documented feature of the data.
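One shippable encoding, with a hypothetical schema, is a long table of candidate pairs carrying scores, ranks, and a per-set ambiguity flag:

```python
import pandas as pd

cand = pd.DataFrame({
    "i":           [7,     7,     7,     8],
    "j":           ["T14", "T15", "T22", "T03"],
    "match_score": [0.91,  0.89,  0.55,  0.97],
})
cand["match_rank"] = (cand.groupby("i")["match_score"]
                          .rank(method="first", ascending=False).astype(int))
best   = cand.loc[cand.match_rank == 1].set_index("i")["match_score"]
second = cand.loc[cand.match_rank == 2].set_index("i")["match_score"]
# Small gap between the top two scores marks the whole set as ambiguous.
cand["ambiguous_set"] = cand["i"].map((best - second.reindex(best.index)) < 0.05)
```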
Matching-Induced Missingness as a First-Class Object
Problem. Rows omitted due to matching decisions are not generic "missing data"; they are missing because the matching algorithm could not resolve their candidate sets under a particular policy.
Requirement. The dataset should make matching-induced missingness measurable:
- Label records according to why they are absent or present: matched under the default policy, unmatched because no candidates passed screening, or unmatched because candidates were ambiguous.
- Provide summary diagnostics: the share of units (or of total weight) that are matched, unmatched, or ambiguous, broken down by geography, size, or other characteristics.
This is analogous to reporting imputation shares or nonresponse rates. It allows analysts to evaluate how much of their effective sample is lost to matching and whether that loss is plausibly ignorable.
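Given a `match_status` label on every source record, such diagnostics are a single grouped aggregation; the column names below are assumptions:

```python
def missingness_diagnostics(records, by="region", w="employment"):
    """Share of total weight by match_status within each group."""
    tot = records.groupby([by, "match_status"])[w].sum()
    return (tot / tot.groupby(level=0).transform("sum")).unstack(fill_value=0.0)
```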
Interfaces for Alternative Policies
Problem. If only a single hard-coded linkage policy is reflected in the shipped data, users cannot assess the sensitivity of their results to alternative, equally defensible choices.
Requirement. The data product should support alternative policies without requiring users to re-run the fuzzy matcher:
- Allow users to impose stricter or looser thresholds on `match_score`.
- Provide enough information (candidate sets or flags) to implement alternative ambiguity rules: treating ambiguous cases as missing, assigning them fractionally across candidates, or sampling among top-$k$ candidates.
- Support resampling or multiple-imputation-style procedures over candidate sets, so that matching uncertainty can propagate into standard errors.
The shipped dataset should encode both the default policy and the space of neighboring policies that users may wish to explore.
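For example, with the candidate table sketched above, applying a stricter policy is a filter over shipped columns rather than a matcher re-run:

```python
def relink(cand, tau, delta):
    """Reapply the default policy with new thresholds, using only the
    shipped columns i, j, match_score, match_rank."""
    top    = cand[cand.match_rank == 1].set_index("i")
    second = cand[cand.match_rank == 2].set_index("i")["match_score"]
    gap    = top["match_score"] - second.reindex(top.index).fillna(0.0)
    keep   = (top["match_score"] >= tau) & (gap >= delta)
    return top.loc[keep, ["j", "match_score"]].reset_index()
```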
SHRUG and masala-merge: A Case Study
SHRUG Architecture
SHRUG is structured around shrids—stable village/town identifiers—and a set of location keys that map census and Economic Census locations to shrids and then to higher-level units. Keys include population and land area, allowing weighted aggregation and transparent handling of splits and merges.
When aggregating data to districts or constituencies, SHRUG uses a weight-based imputation scheme: impute only when a sufficient share of the unit's weight has non-missing data, and report imputation shares at the aggregate level. Constituency-level datasets drop units where more than a specified share of employment is imputed or where geographic assignments are ambiguous.
This addresses the geographic and imputation dimensions of the design problem.
masala-merge: Pairwise Evidence and Ambiguity
The matching of locations across censuses and datasets often uses masala-merge, a fuzzy string matching algorithm for Indian names. It computes a modified Levenshtein distance with lower penalties for character substitutions common in transliteration (treating "laxmi" and "lakshmi" as close), tries several methods (including Stata's reclink), and exposes a fuzziness parameter controlling how permissive matches are.
For each matched pair, it yields a distance or evidence variable (e.g., masala_dist) and indicators of the method used and whether the match is ambiguous. Crucially, masala_dist is a row-separable evidence measure: it depends only on the pair of strings involved.
Dropping Ambiguous Matches as a Design Choice
The public masala-merge implementation is deliberately conservative with ambiguity: by default, it drops ambiguous matches—cases where multiple targets are similarly good candidates. Options exist to retain them, but the out-of-the-box behavior is equivalent to retaining only the top candidate and discarding cases where top candidates are not sufficiently separated.
Two observations follow. First, this is structurally the "listwise deletion" solution: ambiguous records vanish from downstream datasets. Second, the effect on coverage is non-trivial. Published applications report match rates of 65–90% for various administrative datasets linked to census villages, implying 10–35% of units are unmatched or dropped. Whether those dropped units are ignorable depends on their characteristics—and match quality is often worse in certain regions or for particular kinds of localities.
What SHRUG Exposes—and What It Does Not
SHRUG exposes the stable unit of analysis (shrids), the keys mapping locations to shrids with weights, and at aggregate levels, the share of imputed values and criteria used to drop problematic units.
It does not, in its public tables, expose the underlying masala_dist or any match_score variables, expose candidate sets for ambiguous matches, or distinguish exact from fuzzy matches at the row level. From the user's perspective, the matching step is a black box, even though the matching code is open source. The choice to drop ambiguous matches is made upstream and is not easily revisited.
An Extended Design
SHRUG already addresses the unit and imputation dimensions. Here we sketch how to extend its architecture to make matching decisions explicit and analyzable.
Exposing Match Scores and Candidate Sets
For selected linkages (e.g., Economic Census to shrids), the data producer could publish:
A canonical linkage table with one row per realized link:
- original identifier (e.g., EC location ID)
- shrid
- `match_score` (transformed from `masala_dist`)
- `match_method` (exact vs. fuzzy)
- a flag indicating whether this match was unambiguous
A candidate-match table with one row per plausible candidate pair:
- original identifier, shrid
- `match_score`
- `match_rank` within the candidate set
- `is_default`, indicating whether this pair is used in the canonical linkage
Making "Dropped Ambiguous" Explicit
Rather than silently dropping ambiguous matches, the canonical linkage could contain a flag
$$\texttt{match\_status} \in \{\texttt{exact},\ \texttt{unambiguous\_fuzzy},\ \texttt{ambiguous\_dropped}\},$$
and, for `ambiguous_dropped`, point to the corresponding rows in the candidate-match table.
For example, a firm $i$ with three near-equivalent town matches could have no row in the canonical link (under the default policy) but have three rows in the candidate table, all flagged as belonging to an ambiguous set. This makes the analogue of listwise deletion explicit: analysts can see that a given fraction of the population is missing because of matching ambiguity and inspect the characteristics of those units.
Alternative Linkage Policies
With candidate sets and scores exposed, alternative policies become implementable without re-running masala-merge:
- Top-1 strict: Restrict to `match_rank = 1` and `match_score` $\ge \tau$.
- Top-$k$ stochastic: Randomly select one candidate among the top $k$, with weights proportional to `match_score`.
- Fractional: Assign each source record to all candidates with weights proportional to `match_score`, normalized within the candidate set.
For uncertainty quantification, an analyst could construct $M$ synthetic joined datasets by sampling matches within each candidate set, estimate their model on each, and examine the distribution of estimates to quantify the contribution of matching uncertainty.
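A sketch of that resampling step, reusing the hypothetical candidate table from above; the final "estimate" is a stand-in for a real model fit:

```python
import numpy as np

def sample_links(cand, rng, k=3):
    """Draw one candidate per i among its top-k, with prob. ∝ match_score."""
    def draw(g):
        g = g.nsmallest(k, "match_rank")
        return g.sample(1, weights=g["match_score"], random_state=rng)
    return cand.groupby("i", group_keys=False).apply(draw)

rng = np.random.default_rng(42)
estimates = []
for _ in range(50):                                 # M = 50 synthetic joins
    links = sample_links(cand, rng)
    estimates.append(links["match_score"].mean())   # stand-in for a model fit
```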
Deduplication as a Special Case
Within-table deduplication can be treated identically: match the table to itself via masala-merge, store candidate-duplicate pairs and match_score in a candidate table, produce a canonical deduplicated table using a default policy (e.g., retain the earliest record), but allow users to apply alternative policies. This transforms deduplication from a fixed, opaque step into one that can be inspected and varied.
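A compact sketch of the self-join pattern, reusing the toy `evidence_score` from earlier; record IDs and the 0.85 threshold are illustrative:

```python
# Score every unordered pair; high scorers become candidate duplicates.
names = {101: "Lakshmi Traders", 102: "Laxmi Traders", 103: "Ganga Mills"}
pairs = [(i, j, evidence_score(names[i], names[j], same_district=True))
         for i in names for j in names if i < j]
dupes = [(i, j, s) for (i, j, s) in pairs if s >= 0.85]
# Default policy: retain the earliest record of each candidate-duplicate pair.
canonical = set(names) - {max(i, j) for (i, j, s) in dupes}
```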
Practical Trade-offs
Complexity vs. Transparency
The primary cost of exposing candidate sets is complexity. Candidate tables are large; they introduce many-to-many structures that some users may mishandle. Documentation must explain how to reconstruct the canonical dataset and avoid double-counting.
A tiered approach mitigates this: a "basic" release with shrid-level panels and keys (as SHRUG already provides), and an "advanced linkage" module with candidate sets and scores for users who opt into that complexity.
Storage and Versioning
Storing candidate sets multiplies the size of linkage artifacts. Changes to the matching algorithm will change match_score and candidate sets. Released datasets must be tagged with matching-algorithm versions, and users should be informed when changes occur.
A Tiered Release Strategy
A pragmatic recommendation:
- Core panel: Shrid-level data, geographic keys, and imputation diagnostics, suitable for most applied work.
- Linkage metadata: Additional flags in keys indicating match method (exact/ID-based vs. fuzzy) and coarse match-quality categories.
- Optional candidate-match modules: Separate, larger tables with candidate sets and row-separable `match_score` variables, plus reference code for reconstructing the canonical linkage and generating alternatives.
This preserves accessibility while enabling more sophisticated treatment of matching uncertainty for users who need it.
Conclusion
Fuzzy joins and deduplication are routine components of the pipelines that generate widely used empirical datasets. Decisions such as "drop ambiguous matches" are analogous to listwise deletion in regression: easy to implement, but with non-trivial implications for coverage and bias.
The SHRUG platform illustrates a careful approach to geographic reconciliation and imputation: it defines stable units (shrids), publishes crosswalks and weights, and documents imputation-based exclusion rules. Its matching layer, by contrast, is mostly hidden from users: the masala-merge algorithm is public, but the resulting scores and ambiguities are not exposed in the distributed tables.
If one takes seriously the idea that fuzzy linkage is part of the data-generating process, a more complete design would treat matching uncertainty explicitly. This entails row-separable evidence scores for candidate matches, explicit representation of ambiguity and matching-induced missingness, and optional candidate-match modules that allow users to explore alternative linkage policies.
Such a design would not eliminate uncertainty. But it would move uncertainty from an invisible upstream step to a documented, manipulable component of the empirical workflow—on the same footing as sampling error and model specification.