Join or Die


Record linkage is the problem of identifying the same entity across two or more tables that lack a shared unique identifier. You have a name, maybe a district, maybe an age, and you need to figure out which row in table A corresponds to which row in table B. This comes up constantly in empirical research: linking historical census records across decades, connecting electoral rolls to administrative data, merging survey responses with program records.

There are good tools for parts of this problem (splink, dedupe, fastLink among them), and the pipeline described here is not a replacement for them. Where existing packages tend to be strongest is candidate generation and pairwise scoring. Where they tend to leave the most on the table is the final decision: given a set of scored candidate pairs, which ones do you actually commit to as matches? When false links are substantially costlier than missed links, that decision rule deserves more attention than it usually gets. This post lays out a pipeline for that setting: high-precision linkage without ground truth, where the true match is one-to-one.

Which errors cost you

Which matching errors cost you bias (as opposed to just variance) depends on your estimand and your design. A missed match usually reduces your sample size, which is a power problem but not a bias problem (assuming missingness from matching failure is not correlated with the estimand, which is itself worth checking). A false match introduces measurement error, but whether that error biases your estimate depends on the structure of the error relative to the quantity you are trying to estimate.

This matters because false matches are not random. They concentrate where matching is hardest: common names, dense blocks, records with missing fields. If you are linking Indian electoral rolls, the hardest records to match are the "Ram Kumars" of the world (common name, many plausible candidates). False matches will concentrate among exactly those records. That is systematic error correlated with name frequency, which is correlated with caste, region, and socioeconomic status. If treatment exposure or outcomes also vary along those dimensions, the bias is heterogeneous in ways that simple attenuation corrections will not fix. The pipeline described below is precision-first in general, but the intensity of filtering at each step, and the definition of a tolerable error, should be set with the downstream analysis in mind. A few common designs illustrate why.

When the join assigns treatment status, meaning you have a treatment roster and match it to an outcome dataset to identify who was treated, a single false match causes two contaminations simultaneously. An untreated person enters the treatment group, contributing $Y_j(0)$ to the treated mean. The actual treated person $i$, who should have matched, falls into the residual and becomes "control," contributing $Y_i(1)$ to the control mean. Both group means are pushed toward each other. Errors cannot be contained within a group because group membership is itself defined by the match.

When the join attaches outcomes to pre-assigned groups, where treatment status is already recorded and you link $Y$ to each unit, the picture changes. A false match within the treated group means you observe $Y_j(1)$ instead of $Y_i(1)$, where both $i$ and $j$ are treated. For unweighted group means, this is benign: if the false matches amount to a permutation of outcomes within the group, the group mean is unchanged exactly. If $\pi$ is a permutation of treated units, $\bar{Y}_T = \frac{1}{|T|} \sum_{i \in T} Y_{\pi(i)} = \frac{1}{|T|} \sum_{i \in T} Y_i$. Variance inflation enters with unit-level analyses, subgroup means, weighted estimands, or non-permutation errors. Only cross-boundary errors, where a treated unit is matched to a control unit's outcome, cause attenuation of the treatment effect.

If your estimand is a level rather than a treatment effect (average earnings of program participants, say), every false match is costly regardless of any treatment-control boundary. You are estimating $E[Y]$ for a specific population and any wrong $Y$ is bias. And if you are estimating subgroup-specific treatment effects, the relevant boundary becomes subgroup-specific: a within-treatment false match that crosses subgroups distorts the subgroup mean.

The absence of ground truth compounds all of this. With labeled pairs you could train a classifier, calibrate its probabilities, and set thresholds against a known precision-recall tradeoff. Without them, even principled methods like the EM estimation in splink and fastLink leave you unable to directly validate output. The pipeline described here compensates with conservative design: aggressive filtering, margin-based ambiguity removal, and iterative inspection.

The problem structure

The pipeline assumes the true match is one-to-one. Each entity appears at most once in each table, and you want to pair them up. This is a bipartite matching problem: two sets of nodes, weighted edges between them, pick a subset of edges such that no node is used twice.

This assumption needs to be correct for the pipeline to work. If table A contains person-years and table B contains persons, the relationship is many-to-one. If both tables have unresolved duplicates, you may have many-to-many. The pipeline addresses this by deduplicating each table before cross-table linking, and by requiring you to confirm that the unit of observation in both tables is actually the entity you're trying to match. If it isn't, reshape first, match second.

The pipeline

Step 1: Preprocess

Preprocessing has two goals. The first is normalization: removing variation that is irrelevant to matching. Case, punctuation, whitespace, encoding differences: none of these distinguish one person from another, but they will pollute similarity scores if left in. What counts as "irrelevant" is dataset-specific. Diacritics may be noise in one context and meaningful in another. Transliteration conventions (Sharma vs. Sharman) may reflect real ambiguity or just inconsistent romanization. The goal is to make records that refer to the same entity look as similar as possible before scoring.
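As a concrete baseline, a normalization pass along these lines can be sketched in Python. The specific choices here (stripping diacritics, mapping punctuation to spaces) are assumptions to revisit per dataset, not a universal recipe:

```python
import re
import unicodedata

def normalize(name: str) -> str:
    """Collapse case, punctuation, whitespace, and diacritics.
    Whether diacritics are noise is dataset-specific; here we drop them."""
    s = unicodedata.normalize("NFKD", name)           # decompose accented chars
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    s = s.lower()
    s = re.sub(r"[^\w\s]", " ", s)                    # punctuation -> space
    return re.sub(r"\s+", " ", s).strip()             # collapse whitespace

print(normalize("  RAMÉSH  Kumar, Jr. "))  # -> "ramesh kumar jr"
```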

The second is defining a missingness policy. Decide which fields are essential (record is dropped if missing or incomplete) and which are helpful but not required (record is kept, field similarity treated as zero or neutral). A record missing both first name and last name cannot be matched. A record missing only father's name can still be matched on the remaining fields, just with less confidence.
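A missingness policy is easiest to audit when it is explicit in code. The field names and the essential/optional split below are illustrative, not prescribed:

```python
ESSENTIAL = {"first_name", "last_name"}   # record dropped if missing
OPTIONAL = {"father_name", "age"}         # kept; scored as zero/neutral

def apply_missingness_policy(record: dict):
    """Return None for unmatchable records, the record itself otherwise."""
    if any(not record.get(f) for f in ESSENTIAL):
        return None
    return record

print(apply_missingness_policy({"first_name": "ram", "last_name": ""}))  # -> None
```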

Step 2: Deduplicate each table separately

If table A has two records for the same person, any downstream one-to-one assignment is wrong from the start. One duplicate matches correctly; the other grabs an unrelated record from table B. That is a false match caused by failed dedup, not by the matching step.

The goal here is not to synthesize a canonical record from multiple inputs. It is to ensure every record that enters cross-table matching is unique. Block and score within the table using the same mechanics as cross-table (symmetric matrix, upper triangle only). Find groups of near-duplicate records using connected components. Overmerging at this stage, grouping records that were actually two different people, costs you recall on both. Undermerging, leaving a duplicate pair in the table, creates a false match downstream. For precision, overmerging is the safer error.

For each group of size one, keep it. For each group of size two or more, ask whether one record is clearly the best: most complete (fewest missing fields), most recent, most internally consistent. If one record dominates, keep it and drop the rest. If the records are indistinguishable, drop the entire group. The margin logic from step 5 applies here too: if you can't tell which record is better, keeping either one is a coin flip, and a coin flip is not precision.
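The grouping step is a small union-find over within-table pairs that score above the duplicate threshold. A self-contained sketch:

```python
def connected_components(n, edges):
    """Group n records into near-duplicate clusters, given index pairs
    (edges) that scored above the within-table duplicate threshold."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Records 0-1 and 1-2 are near-duplicate pairs; record 3 stands alone.
print(connected_components(4, [(0, 1), (1, 2)]))  # -> [[0, 1, 2], [3]]
```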

Report what you dropped and how much. If you are dropping 15% of the table, the data has a duplicate problem that dedup-by-deletion will not solve. If you are dropping 1%, you have cleaned up the edges and the cross-table step will be better for it.

Step 3: Block

The naive approach to record linkage is to score every record in A against every record in B. Blocking reduces this search space by partitioning both tables on fields where you have high trust and where disagreement genuinely implies non-match. Only records that share the same block are compared.

Before blocking, build crosswalks for each blocking field. This means mapping variant labels to canonical values (if one table uses "ST" and the other uses "South Tyneside," the crosswalk resolves that), reconciling boundary changes (three districts in the 2020 table correspond to one district in the 2011 table), and harmonizing category codes across tables. Run a completeness check: every distinct value in both tables' blocking fields must appear in the crosswalk. Unmapped values are records that will be silently dropped. Report them and resolve them before proceeding.
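The completeness check itself is just a set difference; a sketch with a toy crosswalk:

```python
def unmapped_values(table_values, crosswalk):
    """Blocking-field values with no canonical mapping. Records carrying
    these values would otherwise be silently dropped at the join."""
    return sorted(set(table_values) - set(crosswalk))

crosswalk = {"ST": "South Tyneside", "South Tyneside": "South Tyneside"}
print(unmapped_values(["ST", "South Tyneside", "S. Tyneside"], crosswalk))
# -> ['S. Tyneside']
```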

In this pipeline, blocking is not only a computational shortcut (though it is that too). It also encodes domain knowledge as a hard constraint. You match on state, city, and name precisely because you know a "Rajesh Sharma" in Jaipur and a "Rajesh Sharma" in Lucknow are different people. The scoring function has no way to know this. It sees similar strings and rates the pair highly. The hard constraint does what soft scoring cannot.

Blocking also removes a layer of matching noise from the scoring step. When a field has low cardinality (10 districts, 5 states, a handful of program codes), you can manually verify the crosswalk is correct and complete. "Jaipure" in one table maps to "Jaipur" in the other. You resolve that once in the crosswalk and never ask the scoring function to deal with it. This matters because if you instead include district as a scored field, you are mixing coarse string errors on categorical fields (misspelled district names) with the genuinely hard fuzzy matching on names. A Levenshtein distance of 1 between "Jaipure" and "Jaipur" means something completely different from a Levenshtein distance of 1 between "Rajesh" and "Rajssh." The first is a clerical error with a known answer; the second is the actual matching problem. By resolving the first through the crosswalk and blocking on it, you keep the scoring function focused on the fields where fuzzy matching is genuinely needed.

Block at the coarsest level where you trust both sources. Fields you trust in one table but not the other (one has clean machine-generated constituency codes, the other has hand-entered constituency names with typos) belong in scoring, not blocking. Using them as blocking keys will lose true matches wherever the untrusted field has errors.
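Mechanically, blocking restricts the comparison space to records sharing a key (after crosswalk mapping). A minimal sketch using a dict-of-lists index over table B:

```python
from collections import defaultdict

def candidate_pairs(table_a, table_b, key):
    """Yield (i, j) index pairs only for records that share a block key."""
    blocks_b = defaultdict(list)
    for j, rec in enumerate(table_b):
        blocks_b[rec[key]].append(j)
    for i, rec in enumerate(table_a):
        for j in blocks_b.get(rec[key], []):
            yield i, j

a = [{"name": "rajesh sharma", "district": "Jaipur"},
     {"name": "rajesh sharma", "district": "Lucknow"}]
b = [{"name": "rajesh sharma", "district": "Jaipur"}]
# The Lucknow record is never compared to the Jaipur record, by construction.
print(list(candidate_pairs(a, b, "district")))  # -> [(0, 0)]
```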

Step 4: Score

Within each block, compute the full pairwise score matrix: every record in A scored against every record in B, using all informative fields not already used in blocking. Not best-match-per-row: the full matrix. Storing only each row's best match is a greedy error that discards information the downstream steps need.

The choice of similarity metric matters. For names, Jaro-Winkler or normalized Levenshtein work well. For ages or years, numeric distance. For categorical fields not used in blocking, exact match. Weight by informativeness: agreement on a rare name ("Jagmohan Trivikramji") is much stronger evidence than agreement on a common one ("Ram Kumar"). The Fellegi-Sunter framework formalizes this through the likelihood ratio, where the match weight for field agreement is $\log(m/u)$, with $m$ the agreement rate among true matches and $u$ the agreement rate among random pairs. Even without full Fellegi-Sunter estimation, being aware that common values carry less signal will improve your scoring.

Embeddings capture semantic similarity that edit distance cannot. "CA Forestry Assoc PAC" and "California Forestry Association" are close in embedding space but far in Levenshtein space. But embeddings are opaque in a way that string distance is not. When Jaro-Winkler gives you 0.92, you can inspect the two strings and see why. When cosine similarity gives you 0.92, you cannot. For person-name matching with typos and transliteration noise, string distance is usually the better fit.
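A self-contained sketch of a weighted field-level score, using plain Levenshtein so it runs without third-party libraries (in practice a package like jellyfish or rapidfuzz supplies Jaro-Winkler). The weights and age scale are illustrative, not estimated:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def name_sim(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def pair_score(ra, rb, w_name=0.7, w_age=0.3, age_scale=10):
    """Heuristic weighted score over two fields; weights are illustrative."""
    age_sim = max(0.0, 1 - abs(ra["age"] - rb["age"]) / age_scale)
    return w_name * name_sim(ra["name"], rb["name"]) + w_age * age_sim

print(pair_score({"name": "rajesh", "age": 34},
                 {"name": "rajssh", "age": 35}))  # -> ~0.853
```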

One hazard deserves special attention. Fuzzy matching on values with numeric identifiers like "Dimapur 1" vs "Dimapur 2," or worse, "Dimapur IX" vs "Dimapur X," produces high similarity scores for completely different entities. The same applies to any short string where a single edit represents a large fraction of the total length. The fix is to decompose: match the alphabetic part with fuzzy methods, require exact match on the numeric suffix, and apply tighter thresholds when strings are short.
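One way to implement the decomposition, assuming suffixes are trailing arabic or roman numerals:

```python
import re

def split_numeric_suffix(s: str):
    """Split 'Dimapur IX'-style values into (stem, suffix).
    Returns (stem, None) when there is no trailing numeral."""
    m = re.match(r"^(.*?)[\s-]*([0-9]+|[IVXLCDM]+)$", s.strip())
    if m:
        return m.group(1).strip(), m.group(2)
    return s.strip(), None

def safe_match(a: str, b: str, fuzzy) -> float:
    """Fuzzy-match the stems; require exact agreement on the suffix."""
    sa, na = split_numeric_suffix(a)
    sb, nb = split_numeric_suffix(b)
    return 0.0 if na != nb else fuzzy(sa, sb)

# 'Dimapur IX' vs 'Dimapur X' now scores 0, however similar the strings look.
print(safe_match("Dimapur IX", "Dimapur X", fuzzy=lambda a, b: 1.0))  # -> 0.0
```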

Finally, be explicit about what the "score" is. The matrix might contain heuristic similarity values (a weighted sum of Jaro-Winkler and age distance) or calibrated match probabilities (from Fellegi-Sunter EM or a trained classifier). These are different objects. A calibrated probability of 0.7 means something precise; a heuristic similarity of 0.7 does not. The distinction matters for step 6.

Step 5: Filter

The score matrix needs to be filtered before the decision step. Start by applying a minimum score threshold and zeroing out everything below it. This threshold should be conservative: a pair scoring 0.6 that happens to be the best available match for some record is still a bad match if you're optimizing for precision.

Next, compute margins for each record (both rows and columns), meaning the gap between its best and second-best surviving candidate:

$$\Delta_i = s_{i(1)} - s_{i(2)}, \quad \Gamma_j = s_{(1)j} - s_{(2)j}$$

where $s_{i(1)}$ and $s_{i(2)}$ are the first and second highest scores for record $i$. A small margin means the record is ambiguous: two candidates are nearly tied, and any assignment between them is close to a coin flip. Remove ambiguous records from the candidate set or flag them for manual review. The margin threshold is a free parameter. Start conservative, inspect the flagged records, adjust.
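With the score matrix as a NumPy array, the margins are a short computation (this sketch assumes at least two candidates per row and per column):

```python
import numpy as np

def margins(S):
    """Gap between best and second-best score, per row and per column."""
    def gap(M):
        top2 = np.sort(M, axis=1)[:, -2:]   # two largest per row
        return top2[:, 1] - top2[:, 0]
    return gap(S), gap(S.T)

S = np.array([[0.9, 0.2, 0.1],
              [0.7, 0.68, 0.0]])
row_margin, col_margin = margins(S)
print(row_margin)  # row 0 margin 0.7 (clear-cut); row 1 margin 0.02 (coin flip)
```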

Finally, examine the sparsified matrix for any record that participates in many surviving pairs. This is diagnostic, not filtering. If one record in B has high scores against five records in A, something is wrong upstream: either deduplication in step 2 missed a cluster, or blocking is too coarse, or the scoring function isn't discriminating well. Diagnose and fix before proceeding.

Step 6: Decision rule

Once candidate pairs are scored and filtered, the final task is to choose a partial bipartite matching under an explicit loss function. This is a decision problem, not just an optimization, and framing it correctly requires distinguishing the score (what you computed in step 4) from the objective (what you want to maximize or minimize).

If scores are calibrated match probabilities $p_{ij}$, the decision step can be formalized as a partial matching problem with abstention:

$$\max_{L \subseteq E} \sum_{(i,j) \in L} \left[ u_{TP} \cdot p_{ij} - u_{FP} \cdot (1 - p_{ij}) \right]$$

subject to each $i$ and $j$ being used at most once, where $u_{TP}$ and $u_{FP}$ are the utilities (or costs) of true positives and false positives. Making the pipeline "precision-first" is concrete in this formulation: set $u_{FP}$ large relative to $u_{TP}$, and the optimizer will leave records unmatched rather than risk a false link. This formulation also naturally allows abstention: a pair is included only if its expected utility is positive, i.e. $p_{ij} > u_{FP} / (u_{TP} + u_{FP})$. With calibrated probabilities and an explicit loss, maximum-weight partial matching maximizes expected precision at any given number of returned links.
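The abstention cutoff follows directly from the utilities:

```python
def abstention_threshold(u_tp: float, u_fp: float) -> float:
    """Include a pair only if p_ij exceeds this value: expected utility
    u_tp * p - u_fp * (1 - p) is positive iff p > u_fp / (u_tp + u_fp)."""
    return u_fp / (u_tp + u_fp)

# Precision-first: a false link costs 10x what a true link is worth,
# so only pairs with calibrated probability above ~0.909 are committed.
print(abstention_threshold(u_tp=1, u_fp=10))
```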

If scores are heuristic similarities rather than calibrated probabilities, the formal guarantee breaks down. A heuristic similarity of 0.7 does not map to a known false-positive risk, and the choice of assignment method matters more.

The most common approach, row-sequential greedy, iterates over rows in table A, assigning each to its best available candidate. This is order-dependent (shuffling the rows changes the result) and does not maximize any coherent objective.

The Hungarian algorithm finds the one-to-one assignment maximizing total score. If scores were calibrated, this would also maximize expected correct matches. With uncalibrated scores, maximizing total score can break a near-perfect 0.99 pair to create two mediocre 0.70 pairs when the total goes up. Whether this helps or hurts precision depends on how well the scoring function is calibrated.

A more conservative alternative is greedy-by-best-global-pair: look at the entire score matrix, find the single highest-scoring pair, assign it, remove both records, repeat. This always commits to the most confident match first and never sacrifices a strong match to prop up weaker ones. It is order-independent (unlike row-sequential greedy) and biases toward precision (unlike Hungarian). It is not optimal in any formal sense; it is a heuristic that works well when scores are uncalibrated and the analyst wants a conservative commit-first rule.
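A sketch of greedy-by-best-global-pair over a NumPy score matrix:

```python
import numpy as np

def greedy_global(S, threshold=0.0):
    """Repeatedly commit the single highest-scoring surviving pair,
    then retire both records. Conservative, commit-first heuristic."""
    S = S.astype(float).copy()
    pairs = []
    while True:
        i, j = np.unravel_index(np.argmax(S), S.shape)
        if S[i, j] <= threshold:
            break
        pairs.append((int(i), int(j), float(S[i, j])))
        S[i, :] = -np.inf   # retire row i
        S[:, j] = -np.inf   # retire column j
    return pairs

S = np.array([[0.99, 0.70],
              [0.95, 0.10]])
print(greedy_global(S, threshold=0.5))  # -> [(0, 0, 0.99)]
```

On this matrix, maximizing total score would instead break the 0.99 pair to take 0.70 + 0.95 = 1.65; greedy-by-best-global-pair commits (0, 0) first and leaves row 1 unmatched, since its only remaining candidate scores below the threshold.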

For Hungarian specifically, apply the score threshold after assignment. The standard implementation (scipy.optimize.linear_sum_assignment) produces a complete assignment and may pair a row with a zeroed-out column if nothing better remains. Drop those.
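With scipy, the post-assignment threshold looks like this:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_with_threshold(S, threshold):
    """Maximize total score, then drop assigned pairs at or below the
    threshold: linear_sum_assignment always returns a complete assignment,
    including pairs forced onto zeroed-out cells."""
    rows, cols = linear_sum_assignment(S, maximize=True)
    return [(int(i), int(j)) for i, j in zip(rows, cols) if S[i, j] > threshold]

S = np.array([[0.99, 0.70],
              [0.95, 0.10]])
# Total-score-maximal assignment is (0,1) + (1,0) = 1.65, beating 0.99 + 0.10.
print(hungarian_with_threshold(S, threshold=0.5))  # -> [(0, 1), (1, 0)]
```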

Step 7: Inspect and iterate

Spot-check matched pairs across the score distribution (not just the best matches, but also pairs near the threshold). Examine unmatched records for systematic patterns. If the unmatched set concentrates in one district, one demographic, or one time period, your pipeline has a blind spot, likely in blocking or the crosswalk.

This is where the pipeline's iterative nature is honest. A single linear pass from step 1 through step 6 producing a final output is the exception. More commonly, inspection at step 7 sends you back upstream: adjusting crosswalks (step 1), re-examining dedup (step 2), tightening or loosening blocking (step 3), revising scoring weights (step 4), adjusting thresholds (step 5). The linear presentation of the pipeline is a useful skeleton, but the actual workflow is a loop.

Multi-pass structure

Run the pipeline with strict criteria first: tight thresholds, exact or near-exact match on a strong composite key. Pull out confirmed matches. Remove them from both tables. Run again on the remainder with relaxed scoring.

This is better than a single pass because easy matches create unnecessary competition in the decision step. If two records are obviously the same person, the optimizer should not be "choosing" between that pair and some other pair. Removing confirmed matches simplifies the optimization over genuinely hard cases and gives you a clean high-confidence tier in your output.
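The multi-pass loop is a simple shrinking iteration. The toy match functions below (exact string match, then shared first token) stand in for full pipeline passes and are not one-to-one-safe in general; they are for illustration only:

```python
def multi_pass(table_a, table_b, passes):
    """Run match functions strictest-first, removing confirmed matches
    from both tables before each relaxed pass."""
    matched = []
    remaining_a, remaining_b = list(table_a), list(table_b)
    for match_fn in passes:
        links = match_fn(remaining_a, remaining_b)
        matched.extend(links)
        used_a = {x for x, _ in links}
        used_b = {y for _, y in links}
        remaining_a = [r for r in remaining_a if r not in used_a]
        remaining_b = [r for r in remaining_b if r not in used_b]
    return matched

a = ["ram kumar", "sita devi", "mohan lal"]
b = ["ram kumar", "sita devee", "prem singh"]
exact = lambda xs, ys: [(x, y) for x in xs for y in ys if x == y]
first_word = lambda xs, ys: [(x, y) for x in xs for y in ys
                             if x.split()[0] == y.split()[0]]

# Pass 1 commits the exact pair; pass 2 links "sita devi" to "sita devee"
# without competing against the already-confirmed match.
print(multi_pass(a, b, [exact, first_word]))
```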

Setting thresholds

Every threshold in this pipeline is a free parameter: the score threshold, the margin threshold, the weights in the scoring function. Without labels, there is no fully trustworthy generic threshold-selection rule, but there are more principled options than pure hand-tuning: explicit cost-based decision rules (as in the formalization of step 6), posterior/loss-based estimators in the Fellegi-Sunter framework, and emerging unsupervised linkage-quality estimation methods. What remains true is that none of these fully substitutes for either labeled data or manual review.

If you have labeled pairs (even a small number), score them, examine the score distribution for true matches and non-matches, and set the threshold where the distributions separate. If you don't, start conservative, inspect outputs, and adjust. Examine pairs near the decision boundary. If borderline matches look correct on inspection, you can relax the threshold. If they look wrong, tighten it.

The margin threshold requires looking at actual ambiguous cases. Examine records that have two candidates within 0.05 of each other. Can you tell which is correct? If not, neither can the algorithm.
