Join, or Die

Record linkage is the problem of identifying the same entity across two or more tables that lack a shared unique identifier. You have a name, maybe a district, maybe an age, and you need to figure out which row in table A corresponds to which row in table B.

When false links are substantially costlier than missed links, the decision of which scored pairs to actually commit to as matches deserves more attention than it usually gets. This post lays out a pipeline for that setting: high-precision linkage without ground truth, where the true match is one-to-one.

Which errors cost you

Which matching errors cost you bias (as opposed to just variance) depends on your estimand and your design. A missed match usually reduces your sample size, which is a power problem but not a bias problem (assuming missingness from matching failure is not correlated with the estimand, which is itself worth checking). A false match introduces measurement error, but whether that error biases your estimate depends on the structure of the error relative to the quantity you are trying to estimate.

This matters because false matches are not random. They concentrate where matching is hardest: common names, dense blocks, records with missing fields. If you are linking Indian electoral rolls, the hardest records to match are the "Ram Kumars" of the world (common name, many plausible candidates). False matches will concentrate among exactly those records. That is systematic error correlated with name frequency, which is correlated with caste, region, and socioeconomic status. If treatment exposure or outcomes also vary along those dimensions, the bias is heterogeneous in ways that simple attenuation corrections will not fix. A few common designs illustrate how the error structure plays out.

When the join assigns treatment status, meaning you have a treatment roster and match it to an outcome dataset to identify who was treated, a single false match causes two contaminations simultaneously. An untreated person enters the treatment group, contributing $Y_j(0)$ to the treated mean. The actual treated person $i$, who should have matched, falls into the residual and becomes "control," contributing $Y_i(1)$ to the control mean. Both group means are pushed toward each other. Errors cannot be contained within a group because group membership is itself defined by the match.

When the join attaches outcomes to pre-assigned groups, where treatment status is already recorded and you link $Y$ to each unit, the picture changes. A false match within the treated group means you observe $Y_j(1)$ instead of $Y_i(1)$, where both $i$ and $j$ are treated. For unweighted group means, this is benign: if the false matches amount to a permutation of outcomes within the group, the group mean is unchanged exactly. If $\pi$ is a permutation of treated units, $\bar{Y}_T = \frac{1}{|T|} \sum_{i \in T} Y_{\pi(i)} = \frac{1}{|T|} \sum_{i \in T} Y_i$. Variance inflation enters with unit-level analyses, subgroup means, weighted estimands, or non-permutation errors. Only cross-boundary errors, where a treated unit is matched to a control unit's outcome, cause attenuation of the treatment effect.

If your estimand is a level rather than a treatment effect (average earnings of program participants, say), every false match is costly regardless of any treatment-control boundary. You are estimating $E[Y]$ for a specific population and any wrong $Y$ is bias. And if you are estimating subgroup-specific treatment effects, the relevant boundary becomes subgroup-specific: a within-treatment false match that crosses subgroups distorts the subgroup mean.

The absence of ground truth compounds all of this. With labeled pairs you could train a classifier, calibrate its probabilities, and set thresholds against a known false-match/missed-match tradeoff. Without them, even principled methods like the EM estimation in splink and fastLink leave you unable to directly validate output. The pipeline described here compensates with conservative design: aggressive filtering, margin-based ambiguity removal, and iterative inspection.

The pipeline

The logic of a fuzzy join starts from a simple assumption: what looks the same is the same. The pipeline is a sequence of steps that reduce the blast radius of that assumption failing. Normalize to remove irrelevant variation. Identify within-table duplicate groups and either resolve them to a single record or discard the group entirely. Block on fields where you can match exactly, reducing the search space and encoding domain knowledge. Score with fuzzy methods only on the fields where exact matching fails. Filter out records where the scoring produced ambiguous results. Match what survives. Inspect everything.

Each step removes a specific source of error that later steps cannot catch. Skipping a step does not just reduce quality; it causes errors that compound through the rest of the pipeline.

Step 1: Preprocess

Normalization means removing variation that is irrelevant to matching. Case, punctuation, whitespace, encoding differences: none of these distinguish one person from another, but they will pollute similarity scores if left in. What counts as "irrelevant" is dataset-specific. Diacritics may be noise in one context and meaningful in another. Transliteration conventions (Sharma vs. Sharman) may reflect real ambiguity or just inconsistent romanization. The goal is to make records that refer to the same entity look as similar as possible before scoring.

Records that cannot be scored at all should be dropped here. If both first name and last name are empty, no downstream step can do anything useful with the record. Set these aside and report how many.
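
A minimal sketch of both steps, assuming a pandas DataFrame `df` with hypothetical columns `first_name`, `last_name`, and `district` (whether to strip diacritics is, as noted, dataset-specific):

```python
import unicodedata

import pandas as pd

def normalize(s: pd.Series) -> pd.Series:
    """Lowercase, collapse whitespace, drop punctuation and (here) diacritics."""
    s = s.fillna("").str.lower()
    # Stripping diacritics is a choice, not a rule: keep them if they are
    # meaningful in your data.
    s = s.map(lambda x: unicodedata.normalize("NFKD", x)
              .encode("ascii", "ignore").decode("ascii"))
    s = s.str.replace(r"[^\w\s]", " ", regex=True)
    return s.str.replace(r"\s+", " ", regex=True).str.strip()

for col in ["first_name", "last_name", "district"]:
    df[col] = normalize(df[col])

# Records with no scoreable name fields can't be matched downstream;
# set them aside and report the count rather than dropping them silently.
unscoreable = (df["first_name"] == "") & (df["last_name"] == "")
print(f"Setting aside {unscoreable.sum()} unscoreable records")
df = df[~unscoreable].copy()
```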

Step 2: Resolve or discard within-table duplicates

Suppose table A contains two records for the same person: A1 = "Rajesh Sharma, age 35, Jaipur" and A2 = "R. Sharma, age 35, Jaipur." These are the same person, but the records look quite different. When you score against table B, A1 matches B1 = "Rajesh Sharma" with a score of 0.95. A2, now a free agent, matches B2 = "Renu Sharma" with a score of 0.72. Both B1 and B2 have exactly one candidate above threshold. Both matches look clean to the downstream pipeline. A1-B1 is correct. A2-B2 is a false match. Nothing in the cross-table scoring, filtering, or assignment will flag this, because the duplicate was dissimilar enough to distribute across two different B records, each producing an unambiguous-looking pair.

Within-table scoring catches what cross-table scoring cannot: A1 and A2 score well against each other (same last name, compatible age, same district), flagging them as a duplicate group. You keep A1 (more complete), drop A2, and the false match to B2 never happens.

Dedup also recovers recall that would otherwise be lost. If A1 and A2 are near-identical duplicates that both score highly against B1, the margin filter in step 5 sees B1 with two close candidates and removes it. Neither A1 nor A2 matches anything. Precision is preserved, but you lost a perfectly good match. Dedup removes A2 before cross-table scoring, so B1 has one clear candidate and the match goes through.

To find duplicates, score all pairs of records within each block of the same table. Records that score above a dedup threshold against each other form a group. For each group of two or more, check whether one record is clearly the best on completeness, recency, or internal consistency. If so, keep that record and drop the rest. If the records are indistinguishable, drop the entire group rather than risk sending an unresolvable duplicate into cross-table matching.
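
A sketch of within-block dedup under these rules. The rapidfuzz library, the field names, the age tolerance, and the threshold are all assumptions to be tuned:

```python
from itertools import combinations

from rapidfuzz.distance import JaroWinkler

DEDUP_THRESHOLD = 0.90  # hypothetical; revisit during step 7

def dedup_block(records):
    """records: list of dicts with 'name', 'age', 'completeness' keys.
    Returns the survivors of within-block duplicate resolution."""
    # Link any pair scoring above the dedup threshold (tiny union-find).
    parent = list(range(len(records)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(records)), 2):
        name_sim = JaroWinkler.similarity(records[i]["name"], records[j]["name"])
        age_ok = abs(records[i]["age"] - records[j]["age"]) <= 2
        if name_sim >= DEDUP_THRESHOLD and age_ok:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(records[i])
    survivors = []
    for group in groups.values():
        if len(group) == 1:
            survivors.append(group[0])
        else:
            # Keep a clear winner on completeness; if the top records are
            # indistinguishable, drop the whole group.
            group.sort(key=lambda r: r["completeness"], reverse=True)
            if group[0]["completeness"] > group[1]["completeness"]:
                survivors.append(group[0])
    return survivors
```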

Step 3: Block

Blocking partitions both tables on fields where you have high trust and where disagreement genuinely implies non-match. Only records that share the same block are compared.

Blocking encodes domain knowledge that the scoring function cannot learn. You match on state, city, and name precisely because you know a "Rajesh Sharma" in Jaipur and a "Rajesh Sharma" in Lucknow are different people. The scoring function has no way to know this. It sees similar strings and rates the pair highly. The hard constraint does what soft scoring cannot.

Blocking also isolates the scoring function from noise it shouldn't have to deal with. When a field has low cardinality (10 districts, 5 states, a handful of program codes), you can manually verify the mapping between tables and resolve discrepancies once. "Jaipure" in one table maps to "Jaipur" in the other. You handle that in a crosswalk and never ask the scoring function to deal with it. If you instead included district as a scored field, you would be mixing coarse string errors on categorical fields (misspelled district names) with the genuinely hard fuzzy matching on names. A Levenshtein distance of 1 between "Jaipure" and "Jaipur" is a clerical error with a known answer. A Levenshtein distance of 1 between "Rajesh" and "Brajesh" is the actual matching problem. Resolving the first through the crosswalk keeps the scoring function focused on the second.

To implement blocking, build crosswalks for each blocking field: map variant labels to canonical values (if one table uses "ST" and the other uses "South Tyneside," the crosswalk resolves that), reconcile boundary changes (three districts in the 2020 table correspond to one district in the 2011 table), and harmonize category codes. Run a completeness check before proceeding: every distinct value in both tables' blocking fields must appear in the crosswalk. Unmapped values are records that will be silently dropped at the blocking stage.
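
A sketch of a crosswalk plus the completeness check, with hypothetical tables `df_a` and `df_b` and a `district` blocking field; failing loudly on unmapped values is the point:

```python
# Hypothetical crosswalk for a low-cardinality blocking field: every variant
# label observed in either table maps to one canonical value.
district_crosswalk = {
    "jaipur": "jaipur",
    "jaipure": "jaipur",   # known misspelling, resolved once here
    "lucknow": "lucknow",
}

def check_crosswalk(df_a, df_b, field, crosswalk):
    """Completeness check: unmapped values would be silently dropped at
    the blocking stage, so fail loudly instead."""
    seen = set(df_a[field].unique()) | set(df_b[field].unique())
    unmapped = seen - set(crosswalk)
    if unmapped:
        raise ValueError(f"Unmapped {field} values: {sorted(unmapped)}")

check_crosswalk(df_a, df_b, "district", district_crosswalk)
df_a["block"] = df_a["district"].map(district_crosswalk)
df_b["block"] = df_b["district"].map(district_crosswalk)
```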

Block at the coarsest level where you trust both sources. Fields you trust in one table but not the other (one has clean machine-generated constituency codes, the other has hand-entered constituency names with typos) belong in scoring, not blocking.

Step 4: Score

Within each block, compute the full pairwise score matrix: every record in A scored against every record in B, using all informative fields not already used in blocking. Storing only each row's best match discards information the downstream steps need; you want the full matrix.

Use field-appropriate similarity metrics: Jaro-Winkler or normalized Levenshtein for names, numeric distance for ages or years, exact match for categorical fields not used in blocking. Not all agreement is equally informative. Agreement on "Jagmohan Trivikramji" is strong evidence of a true match because few people have that name. Agreement on "Ram Kumar" is weak evidence because many people share it. The Fellegi-Sunter framework formalizes this through the likelihood ratio: the match weight for agreement on a field is $\log(m/u)$, where $m$ is the agreement rate among true matches and $u$ is the agreement rate among random pairs. This variation in informativeness has a deeper consequence: heuristic similarity scores are not calibrated match probabilities. A score of 0.85 between two "Ram Kumar" records does not carry the same match probability as a score of 0.85 between two "Jagmohan Trivikramji" records, because common names have a higher base rate of false matches. Stratifying the scoring by name frequency or block density, so that the threshold adapts to the local false-match rate, is a partial fix but requires enough data within each stratum. The margin filter in step 5 helps indirectly by removing records where scores are least informative, since multiple close candidates concentrate among common names.
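
A back-of-the-envelope illustration of the weight difference, with invented $m$ and $u$ rates rather than estimated ones:

```python
import math

# Fellegi-Sunter agreement weight log(m/u): m is the agreement rate among
# true matches, u the agreement rate among random pairs. Rates below are
# made up for illustration, not estimated from data.
m = 0.95                 # names agree on ~95% of true matches
u_common = 1 / 500       # "ram kumar": random pairs agree often
u_rare = 1 / 200_000     # "jagmohan trivikramji": random agreement is rare

print(math.log(m / u_common))  # ~6.2  -> weaker evidence of a match
print(math.log(m / u_rare))    # ~12.2 -> much stronger evidence
```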

Several situations produce scores that are actively misleading. Numeric identifiers like "Dimapur 1" vs "Dimapur 2," or "Dimapur IX" vs "Dimapur X," are completely different entities that score very high on string similarity. Decompose these fields: match the alphabetic part with fuzzy methods, require exact match on the numeric suffix. Partial missingness is a related problem. If a field used in scoring is empty for one record and you built a composite string, the empty field creates a truncated string that fuzzy-matches against things it shouldn't. A composite like "rajasthan_jaipur_block1_" with a missing GP name will match any record in that block regardless of GP. Score fields independently rather than concatenating, and treat a missing field as zero similarity on that dimension. Short strings have the same problem in a different form: a Levenshtein distance of 1 between two 4-character strings means something very different from a distance of 1 between two 15-character strings. When values are short, apply tighter thresholds or require exact match.
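
A sketch of independent field scoring with suffix decomposition and missing-field handling, again assuming rapidfuzz. The regex requires a separator before the suffix, so plain lowercase names ending in Roman-numeral letters are left intact:

```python
import re

from rapidfuzz.distance import JaroWinkler

def split_numeric_suffix(value):
    """'dimapur 2' -> ('dimapur', '2'); 'dimapur ix' -> ('dimapur', 'ix').
    Assumes values were lowercased in step 1."""
    m = re.match(r"^(.+?)[\s\-]+([0-9]+|[ivxlcdm]+)$", value.strip())
    return (m.group(1).strip(), m.group(2)) if m else (value, "")

def field_score(a, b):
    if not a or not b:
        return 0.0  # missing field = zero similarity, never a free pass
    a_base, a_num = split_numeric_suffix(a)
    b_base, b_num = split_numeric_suffix(b)
    if a_num != b_num:
        return 0.0  # 'dimapur 1' vs 'dimapur 2': exact match on the suffix
    return JaroWinkler.similarity(a_base, b_base)

def record_score(rec_a, rec_b, weights):
    """Score fields independently; never concatenate into one composite string."""
    return sum(w * field_score(rec_a.get(f, ""), rec_b.get(f, ""))
               for f, w in weights.items()) / sum(weights.values())
```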

Embeddings capture semantic similarity that edit distance cannot. "CA Forestry Assoc PAC" and "California Forestry Association" are close in embedding space but far in Levenshtein space. But embeddings are opaque: when Jaro-Winkler gives you 0.92, you can inspect the two strings and see why. When cosine similarity gives you 0.92, you cannot. For person-name matching with typos and transliteration noise, string distance is usually the better fit.
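
A quick way to see the contrast, assuming the rapidfuzz and sentence-transformers packages; the model choice is arbitrary and the exact scores will vary:

```python
from rapidfuzz.distance import Levenshtein
from sentence_transformers import SentenceTransformer, util

a, b = "CA Forestry Assoc PAC", "California Forestry Association"

# Edit distance sees two fairly different strings.
print(Levenshtein.normalized_similarity(a, b))

# An embedding model sees two descriptions of the same organization.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([a, b])
print(util.cos_sim(emb[0], emb[1]).item())
```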

Step 5: Filter

The score matrix needs to be filtered before matching. Start by applying a minimum score threshold and zeroing out everything below it. This threshold should be conservative: a pair scoring 0.6 that happens to be the best available match for some record is still a bad match.

Next, compute margins for each record (both rows and columns), meaning the gap between its best and second-best surviving candidate:

$$\Delta_i = s_{i(1)} - s_{i(2)}, \quad \Gamma_j = s_{(1)j} - s_{(2)j}$$

A small margin means two candidates are nearly tied and any assignment between them is close to a coin flip. Remove these records from the candidate set or flag them for manual review.
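
A sketch of the filter over a block's score matrix `S` (rows from A, columns from B), with hypothetical threshold values:

```python
import numpy as np

SCORE_THRESHOLD = 0.85   # hypothetical; conservative by design
MARGIN_THRESHOLD = 0.05  # hypothetical minimum gap between top two candidates

def filter_scores(S):
    """S: (n_a, n_b) score matrix for one block. Zero out low scores, then
    drop rows/columns whose top two surviving candidates are nearly tied."""
    S = np.where(S >= SCORE_THRESHOLD, S, 0.0)
    if S.shape[0] < 2 or S.shape[1] < 2:
        return S  # margins need at least two candidates on each axis

    def ambiguous(scores, axis):
        top2 = -np.sort(-scores, axis=axis)   # sort descending along axis
        best = np.take(top2, 0, axis=axis)
        second = np.take(top2, 1, axis=axis)
        # Ambiguous only if a second real candidate survives close behind.
        return (second > 0) & (best - second < MARGIN_THRESHOLD)

    S[ambiguous(S, axis=1), :] = 0.0  # rows of A with near-tied candidates
    S[:, ambiguous(S, axis=0)] = 0.0  # columns of B with near-tied candidates
    return S
```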

Step 6: Match

After filtering, some records have exactly one surviving candidate. These matches are unambiguous and can be read off directly.

The remaining records may still have conflicts: two records from B might both want the same record from A, even after ambiguous cases have been removed by the margin filter. These conflicts require resolution. In hand-written scripts, the common approach is to iterate over one table row by row, assigning each to its best available candidate. This is order-dependent: shuffling the rows changes the output.

Greedy-by-best-global-pair avoids order dependence: find the highest-scoring surviving pair across the entire matrix, commit to it, remove both records, repeat. With calibrated match probabilities, maximum-weight bipartite matching (the Jonker-Volgenant algorithm in scipy.optimize.linear_sum_assignment) is provably better: it maximizes expected correct matches under an explicit loss function. With heuristic similarity scores, both methods are heuristics solving different objectives. Maximum-weight matching maximizes total score. Greedy-by-best-global preserves the highest-scoring pairs first and produces a nested sequence of feasible match sets.
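
Sketches of both strategies over a filtered score matrix `S`, using scipy's linear_sum_assignment for the maximum-weight version:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def greedy_best_global(S):
    """Commit the highest-scoring surviving pair, remove both records, repeat.
    Order-independent, and preserves the best pairs first."""
    S = S.copy()
    pairs = []
    while S.max() > 0:
        i, j = np.unravel_index(np.argmax(S), S.shape)
        pairs.append((i, j, S[i, j]))
        S[i, :] = 0.0
        S[:, j] = 0.0
    return pairs

def max_weight_matching(S):
    """One-to-one assignment maximizing total score (Jonker-Volgenant)."""
    rows, cols = linear_sum_assignment(S, maximize=True)
    return [(i, j, S[i, j]) for i, j in zip(rows, cols) if S[i, j] > 0]
```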

In a precision-first setting, there is also a case for leaving conflicting records unresolved rather than forcing them into a hard one-to-one decision. If two records from A both want the same record from B and neither the margin filter nor the scores can clearly distinguish them, recording the ambiguity and flagging it for manual review may be more honest than letting an algorithm choose.

Step 7: Inspect and iterate

Spot-check matched pairs across the score distribution (not just the best matches, but also pairs near the threshold). Examine unmatched records for systematic patterns. If the unmatched set concentrates in one district, one demographic, or one time period, your pipeline has a blind spot, likely in blocking or the crosswalk.

A single linear pass from step 1 through step 6 producing a final output is the exception. More commonly, inspection sends you back upstream: adjusting normalization (step 1), re-examining dedup (step 2), revising crosswalks or blocking (step 3), revising scoring weights (step 4), adjusting thresholds (step 5). The linear presentation is a useful skeleton, but the actual workflow is a loop.

Multi-pass structure

Run the pipeline with strict criteria first: tight thresholds, exact or near-exact match on a strong composite key. Pull out confirmed matches. Remove them from both tables. Run again on the remainder with relaxed scoring. This prevents easy matches from creating unnecessary competition in the filtering and assignment steps, and gives you a clean high-confidence tier in your output.
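
In outline, assuming a hypothetical `run_pipeline` helper that executes steps 3 through 6 with a given scoring function and threshold and returns (a_id, b_id) pairs:

```python
def multipass_link(df_a, df_b, passes):
    """passes: list of (score_fn, threshold), ordered strict to relaxed.
    run_pipeline is a hypothetical stand-in for steps 3-6 run with those
    settings, returning a list of (a_id, b_id) pairs."""
    confirmed = []
    for score_fn, threshold in passes:
        matches = run_pipeline(df_a, df_b, score_fn, threshold)
        confirmed.extend(matches)
        # Remove confirmed matches so they can't compete in looser passes.
        matched_a = {a for a, b in matches}
        matched_b = {b for a, b in matches}
        df_a = df_a[~df_a["id"].isin(matched_a)]
        df_b = df_b[~df_b["id"].isin(matched_b)]
    return confirmed
```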

Setting thresholds

Every threshold in this pipeline is a free parameter: the score threshold, the margin threshold, the dedup threshold, the weights in the scoring function. Without labels, there is no fully trustworthy generic rule for setting them. There are more principled options than pure hand-tuning (posterior estimators in the Fellegi-Sunter framework, cost-based decision rules), but none fully substitutes for labeled data or manual review. Start conservative, inspect outputs, adjust.
