By Gaurav in ML/Statistics — 20 Aug 2025

Behavioral Validity Checks for ML‑Based Coding

Social scientists increasingly use supervised ML and LLMs to scale content analysis—classifying texts for protest, policy issues, frames, sentiment, and more. The central question is construct validity: does the coder respond to evidence that defines the concept and ignore evidence that is irrelevant? Standard evaluations (held‑out accuracy, F1, intercoder agreement) cannot answer that—they show that a system works on average, not why.

This worry is not new. Grimmer and Stewart (2013) put it plainly: automated content methods are no substitute for careful reading; they exhort researchers to "validate, validate, validate." We take that charge literally by proposing a way to conduct some validation checks.

Two features make this especially tricky. The model is opaque; two equally accurate coders can rely on different heuristics (underspecification). Post‑hoc explanations (attention, token attributions, model‑written rationales) are unstable and easy to game. For correct predictions, there is often no visible error to inspect, and naive ablations risk removing both suspected shortcuts and true signal.

What survives these pitfalls are behavioral counterfactuals—small, theory‑guided edits to inputs—with evaluation based only on observed behavior. This is the core of CheckList (Ribeiro et al., 2020). It pairs naturally with causal‑inference habits: treat invariance tests as placebos/negative controls (change something that should not matter and verify no label change) and directional tests as manipulation checks/first‑stage tests (add or remove a defining cue and verify the label moves in the pre‑stated direction). We add a simple exclusion‑style refutation: compare performance drops when masking construct spans versus nuisance spans.

A compact audit that a reviewer can follow:

Pre‑declare behaviors in the codebook. List edits that must not change labels (identity‑term swaps; common typos/punctuation noise; removal of non‑substantive metadata; category order in prompts) and edits that should flip labels (short, theory‑backed cues; negation of a defining phrase).
Build counterfactual pairs. (x,g(x)) for invariance; (x,h(x)) for directional edits. Apply to correctly classified items as well as errors—spurious success often hides among the hits. Start with templates/regex; add a small, human‑edited set; LLM‑assisted edits are fine with spot‑checks.
Report three metrics alongside accuracy/agreement.
- Invariance Violation Rate (IVR): share of cases where predictions change under nuisance edits (lower is better).
- Directional Sensitivity Rate (DSR): share where predictions move in the pre‑declared direction under defining edits (higher is better).
- Causal‑Proxy Gap (CPG): performance drop when masking construct spans vs. nuisance spans; a large gap favoring construct spans supports the claim that the coder uses the right evidence.
  Slice by source/time/length/identity‑term presence to reveal brittle regions.

Codebook checks fit alongside this audit. Halterman and Keith (2025) introduce codebook‑compliance guardrails—constrain outputs to legal labels, test definition/example recall, verify order invariance, and sanity‑check swapped/generic labels. These defend against prompt and label‑space pathologies. Our behavioral audit complements them by directly testing nuisance-versus-construct sensitivity.

When tests fail, fix the measure, not the claim: tighten the codebook and prompts (e.g., separate definitions from label names), augment the few‑shot/training pool with the counterfactual pairs you built, and—if you train—use a light penalty to encourage invariance across identity swaps. Re‑run the same battery and report deltas.

There are limits worth stating. Passing IVR/DSR/CPG provides supportive evidence for construct validity, not proof of causality; tests are local to your edits, and untested shortcuts may remain. Narrow batteries can be overfit—keep edits diverse and theory‑grounded, and preserve a small, human‑curated set. Behavioral checks complement, not replace, careful human auditing.

Modern LLM coders warrant a brief endnote. Strong instruction‑following shifts, but does not remove failure modes. Hidden or incidental instructions inside documents, label‑name leakage, format tricks (HTML/markdown/zero‑width characters), order sensitivity, and truncation effects can all sway outputs. Simple defenses slot into the audit: isolate documents as data (schema‑bound fields), constrain outputs to legal labels, randomize non‑substantive order, and add prompt‑injection invariance tests; track a Prompt‑Injection Violation Rate (PIVR) alongside IVR/DSR/CPG.

Small Experiment

To illustrate the point, we evaluate the robustness of three popular sentiment classification models using Ribeiro-style perturbation tests on two benchmark datasets (SST-2 and IMDB). Our analysis reveals vulnerabilities in some of the models, particularly in handling negation, resulting in an accuracy drop of up to 14% and consistency scores falling below 65%.

Table 1: Model Robustness Across Perturbation Types

Values averaged across SST2 and IMDB datasets

Model	Perturbation	Orig Acc	Pert Acc	Δ Acc	Consistency	Correlation	Conf Change	Robustness
DistilBERT-SST2	Intensity+	0.883	0.883	0.000	0.980	0.974	0.016	0.981
	Intensity-	0.883	0.885	-0.002	0.988	0.991	0.009	0.991
	Spurious	0.883	0.858	0.025	0.950	0.937	0.041	0.953
	Negation	0.883	0.743	0.140	0.805	0.620	0.195	0.761
	Swap	0.883	0.828	0.055	0.905	0.713	0.095	0.856
NLPTown-BERT	Intensity+	0.653	0.673	-0.020	0.968	0.960	0.012	0.971
	Intensity-	0.653	0.613	0.040	0.933	0.947	0.020	0.952
	Spurious	0.653	0.635	0.018	0.893	0.867	0.042	0.907
	Negation	0.653	0.568	0.085	0.733	0.709	0.111	0.793
	Swap	0.653	0.598	0.055	0.900	0.763	0.049	0.878
CardiffNLP-RoBERTa	Intensity+	0.660	0.658	0.002	0.958	0.957	0.023	0.969
	Intensity-	0.660	0.658	0.002	0.963	0.985	0.017	0.978
	Spurious	0.660	0.670	-0.010	0.940	0.954	0.030	0.960
	Negation	0.660	0.683	-0.023	0.843	0.689	0.111	0.833
	Swap	0.660	0.688	-0.028	0.938	0.820	0.058	0.911

Column definitions:

Original Accuracy (Orig Acc). Baseline model performance on unperturbed text.
Perturbed Accuracy (Pert Acc) & Accuracy Drop (Δ Acc). Performance after perturbation application. Positive Δ Acc indicates performance degradation, while negative values suggest improved performance—a counterintuitive result possibly indicating overfitting to specific linguistic patterns in the original test set.
Consistency. The fraction of predictions remaining unchanged after perturbation.
Correlation. Pearson correlation between original and perturbed confidence scores, measuring the preservation of relative confidence ordering.
Confidence Change (Conf Change). Average absolute change in prediction confidence, quantifying prediction stability.
Robustness Score. Composite metric weighing consistency (30%), correlation (30%), confidence stability (20%), and accuracy preservation (20%).

Code at: https://gist.github.com/soodoku/0b2d8e84d7c325382381a9c18893c72a

Small Experiment

Subscribe to Gojiberries