Behavioral Validity Checks for ML‑Based Coding
Social scientists increasingly use supervised ML and LLMs to code text at scale: protest events, frames, sentiment, policy issues. Behavioral counterfactuals, small theory-guided edits to inputs evaluated only on observed behavior, have emerged as the surviving response to the construct-validity problem these coders pose. The CheckList framework (Ribeiro et al. 2020) is the proximate source.
Behavioral counterfactuals are doing two different jobs, and current practice conflates them. The first job is validation: establishing that the coder responds to construct evidence rather than to shortcuts. The second is scaling: quantifying how strongly and how reliably it does so, on a metric that supports comparison across coders. The pipeline that separates them runs an instrumental-variables-style validation first and a measurement-theory-style scaling second, in that order. The IV-construct-validity correspondence is the bridge that connects the two literatures, and the two-table reporting standard that follows gives reviewers and authors a cleaner standard than current practice supports.
Why standard evaluations fail
Held-out accuracy, F1, and intercoder agreement establish that a coder works on average, not why. Two equally accurate coders can rely on entirely different heuristics, the underspecification result documented in D'Amour et al. (2020). Post-hoc explanations (attention weights, saliency maps, model-written rationales) are unstable and gameable. Naive ablations remove signal and shortcut together, leaving the analyst unable to attribute the drop.
Behavioral counterfactuals survive these problems because they probe the coder rather than its internals. A targeted edit that changes construct status (or fails to) tests behavior directly, and observed-behavior evaluation cannot be gamed by representations the analyst never inspects.
Stage one: validity through the IV correspondence
The mapping
A defining edit changes the construct status of an item. Adding "I loved it" to a neutral review changes its sentiment. An invariance edit, a typo or paraphrase or identity-term swap, leaves construct status unchanged. The three IV conditions translate exactly. Relevance is directional sensitivity: the edit shifts the construct, so the share of items where the prediction moves in the pre-declared direction (DSR) is first-stage strength. Exclusion is invariance: the edit affects the prediction only through the construct, so the share of cases where predictions change under nuisance edits (IVR) is the exclusion-restriction violation rate. Ignorability is edit independence from other shortcuts, which is why edits must be small, targeted, and curated rather than mass-generated.
The audit metrics are then IV diagnostics. DSR plays the role of a first-stage F-statistic. Weak directional response is a weak instrument, and conclusions about construct attribution are correspondingly weak. IVR is the exclusion violation rate. High IVR means the edit reaches the prediction through some channel other than the construct, which voids identification. The Causal-Proxy Gap, the performance drop when masking construct spans relative to nuisance spans, is a Wald-style construct-attributable share. CPG overlaps closely with comprehensiveness and sufficiency in the ERASER benchmark (DeYoung et al. 2020) and with rationale-based evaluation more generally; the construct-versus-nuisance contrast is the refinement.
Refining the metrics
Invariance should be reported as equivalence rather than null acceptance. Two-one-sided-tests (TOST) replaces "we failed to reject change" with "the predicted-probability shift is bounded within a pre-registered margin." TOST gives reviewers a sharp pass criterion that null-acceptance reporting cannot.
Directional tests benefit from dose-response. Adding k defining cues should produce monotonically stronger predictions; estimate the slope and test monotonicity with Spearman or Kendall on the (dose, confidence) sequence. Saturation diagnostics catch overconfidence in extreme regions, where the codebook implies a flatter slope than the model exhibits.
Compliance is partial. Items where the cue was already present, or contradicted by stronger context, are non-compliers, and including them attenuates DSR. The Local Average Treatment Effect framing identifies construct sensitivity for the compliant subpopulation and gives Manski-style bounds for the rest, replacing a point estimate that hides selection with honest reporting.
Identification leverage
Sensitivity analysis quantifies shortcut robustness. Rosenbaum-style bounds answer how strong an unmeasured shortcut would have to be to overturn construct-attributed performance. VanderWeele's E-value works on the CPG directly: how strong must an unobserved shortcut be, in joint association with label and construct span, to nullify the construct-versus-nuisance gap. A continuous robustness number is more informative than a binary pass.
A paraphrase IV gives a continuous version of CPG. A paraphrase preserves construct meaning while changing wording, which makes it an instrument for construct, exclusion-restricted only if wording-level shortcuts are absent. The 2SLS-OLS gap on paraphrased pairs estimates shortcut bias.
Discriminant validity
Most ML-coding work establishes convergent validity at best: defining edits move the prediction in the right direction. Discriminant validity, that off-construct edits should not move the prediction, is the underdeveloped piece. The behavioral version of Campbell-Fiske requires multi-construct edits. An edit defining for sentiment should move sentiment predictions but not politeness predictions; an edit defining for politeness should do the reverse. A Discriminant Violation Rate, the share of edits where the off-construct prediction moves more than the on-construct prediction, gives a sharp test for the recurring worry that "sentiment" classifiers are picking up arousal, formality, or topical priors. A multitrait-multimethod table with constructs by coders, with same-construct/different-coder cells (convergent) and different-construct/same-coder cells (discriminant), gives a single audit object diagnosing both constructs and their separation.
Stage two: scaling and reliability through measurement theory
Item Response Theory applied to the validated battery gives each coder an ability score (construct sensitivity) on a common scale, with standard errors. Item-difficulty estimates order edits by discriminating power, which lets reviewers prioritize the most informative items in future audits. Differential Item Functioning tests flag edits that behave anomalously across coder families, localizing failure modes to specific architectures or training regimes.
Generalizability theory decomposes prediction-shift variance across facets: items, edit-types, models, slices. The dominant variance source identifies the bottleneck. Variance concentrated in models means models are the weak link; concentrated in edit-types means the audit is too narrow; concentrated in items means the corpus has heterogeneity the codebook has not captured. The G-coefficient gives a single reliability-analog number, and a decision study answers how many edits per item are needed to reach a target reliability. This replaces the current "build until tired" practice with a principled battery size.
Why the order matters
IRT presupposes that the items measure a coherent latent dimension. Fit Rasch on a battery contaminated by shortcut-driven items and the model scales shortcut sensitivity, not construct sensitivity. Item difficulties learned from a confounded battery are uninterpretable: a "difficult" item is one where shortcut cues are scarce, not where construct evidence is subtle. Run the IV audit first, retain only items passing exclusion-style checks, then fit IRT on the validated subset.
The reporting standard that follows is two tables. The validity table reports IVR, DSR, and CPG with TOST equivalence margins, LATE bounds for partial compliance, sensitivity analysis (Rosenbaum bounds or E-values), and MTMM cells for discriminant validity. The reliability and scaling table reports IRT ability with standard errors, item-difficulty distribution, the G-coefficient, and the D-study sample-size answer. Current practice conflates validity and reliability under one umbrella and reports neither well.
Prompt injection
LLM coders introduce failure modes the audit must address. Hidden or incidental instructions inside documents, label-name leakage, format tricks (HTML, markdown, zero-width characters), order sensitivity, and truncation effects can all sway outputs. A Prompt-Injection Violation Rate extends the audit, with the caveat that prompt injection is a security problem with active attackers, not an invariance failure (Greshake et al. 2023; Perez and Ribeiro 2022). Defenses (schema-bound document fields, constrained output to legal labels, randomized non-substantive order) belong alongside the audit, not as a substitute.
Limits
Behavioral tests are local to chosen edits, and untested shortcuts may remain. Narrow batteries can be overfit, which is why edit diversity and theory-grounding matter. Behavioral checks complement careful human auditing rather than replace it. The ML-coder evaluation literature and the measurement-theory literature have developed in parallel; the IV-construct-validity correspondence is the bridge that lets one inform the other.