Behavioral Validity Checks for ML‑Based Coding

Behavioral Validity Checks for ML‑Based Coding
Photo by Louis Reed / Unsplash

Social scientists increasingly use supervised ML and LLMs to scale content analysis—classifying texts for protest, policy issues, frames, sentiment, and more. The central question is construct validity: does the coder respond to evidence that defines the concept and ignore part of the evidence that is irrelevant? Standard evaluations (held‑out accuracy, F1, intercoder agreement) cannot answer that—they show that a system works on average, not why.

This worry is not new. Grimmer and Stewart (2013) put it plainly: automated content methods are no substitute for careful reading; "validate, validate, validate" with problem‑specific checks. We take that charge literally and make it operational.

Two features make this especially tricky. The model is opaque; two equally accurate coders can rely on different heuristics (underspecification). Post‑hoc explanations (attention, token attributions, model‑written rationales) are unstable and easy to game. For correct predictions, there is often no visible error to inspect, and naive ablations risk removing both suspected shortcuts and true signal.

What survives these pitfalls are behavioral counterfactuals—small, theory‑guided edits to inputs—with evaluation based only on observed behavior. This is the core of CheckList (Ribeiro et al., 2020). It pairs naturally with causal‑inference habits: treat invariance tests as placebos/negative controls (change something that should not matter and verify no label change) and directional tests as manipulation checks/first‑stage tests (add or remove a defining cue and verify the label moves in the pre‑stated direction). We add a simple exclusion‑style refutation: compare performance drops when masking construct spans versus nuisance spans.

A compact audit that a reviewer can follow:

  • Pre‑declare behaviors in the codebook. List edits that must not change labels (identity‑term swaps; common typos/punctuation noise; removal of non‑substantive metadata; category order in prompts) and edits that should flip labels (short, theory‑backed cues; negation of a defining phrase).
  • Build counterfactual pairs. (x,g(x)) for invariance; (x,h(x)) for directional edits. Apply to correctly classified items as well as errors—spurious success often hides among the hits. Start with templates/regex; add a small, human‑edited set; LLM‑assisted edits are fine with spot‑checks.
  • Report three metrics alongside accuracy/agreement.
    • Invariance Violation Rate (IVR): share of cases where predictions change under nuisance edits (lower is better).
    • Directional Sensitivity Rate (DSR): share where predictions move in the pre‑declared direction under defining edits (higher is better).
    • Causal‑Proxy Gap (CPG): performance drop when masking construct spans vs. nuisance spans; a large gap favoring construct spans supports the claim that the coder uses the right evidence.
      Slice by source/time/length/identity‑term presence to reveal brittle regions.

Codebook checks fit alongside this audit. Halterman and Keith (2025) introduce codebook‑compliance guardrails—constrain outputs to legal labels, test definition/example recall, verify order invariance, and sanity‑check swapped/generic labels. These defend against prompt and label‑space pathologies. Our behavioral audit complements them by testing nuisance‑versus‑construct sensitivity directly.

When tests fail, fix the measure, not the claim: tighten the codebook and prompts (e.g., separate definitions from label names), augment the few‑shot/training pool with the counterfactual pairs you built, and—if you train—use a light penalty to encourage invariance across identity swaps. Re‑run the same battery and report deltas.

There are limits worth stating. Passing IVR/DSR/CPG provides supportive evidence for construct validity, not proof of causality; tests are local to your edits, and untested shortcuts may remain. Narrow batteries can be overfit—keep edits diverse and theory‑grounded, and preserve a small human‑curated set. Behavioral checks complement, not replace, careful human auditing.

Modern LLM coders warrant a brief endnote. Strong instruction‑following shifts, but does not remove failure modes. Hidden or incidental instructions inside documents, label‑name leakage, format tricks (HTML/markdown/zero‑width characters), order sensitivity, and truncation effects can all sway outputs. Simple defenses slot into the audit: isolate documents as data (schema‑bound fields), constrain outputs to legal labels, randomize non‑substantive order, and add prompt‑injection invariance tests; track a Prompt‑Injection Violation Rate (PIVR) alongside IVR/DSR/CPG.


Small Experiment

To illustrate the point, we evaluate the robustness of three popular sentiment classification models using Ribeiro-style perturbation tests across two benchmark datasets (SST2 and IMDB). Our analysis reveals critical vulnerabilities in state-of-the-art models, particularly in handling negation, with performance degradations up to 14% accuracy drop and consistency scores falling below 65%.

Table 1: Model Robustness Across Perturbation Types

Values averaged across SST2 and IMDB datasets

Model Perturbation Orig Acc Pert Acc Δ Acc Consistency Correlation Conf Change Robustness
DistilBERT-SST2 Intensity+ 0.883 0.883 0.000 0.980 0.974 0.016 0.981
Intensity- 0.883 0.885 -0.002 0.988 0.991 0.009 0.991
Spurious 0.883 0.858 0.025 0.950 0.937 0.041 0.953
Negation 0.883 0.743 0.140 0.805 0.620 0.195 0.761
Swap 0.883 0.828 0.055 0.905 0.713 0.095 0.856
NLPTown-BERT Intensity+ 0.653 0.673 -0.020 0.968 0.960 0.012 0.971
Intensity- 0.653 0.613 0.040 0.933 0.947 0.020 0.952
Spurious 0.653 0.635 0.018 0.893 0.867 0.042 0.907
Negation 0.653 0.568 0.085 0.733 0.709 0.111 0.793
Swap 0.653 0.598 0.055 0.900 0.763 0.049 0.878
CardiffNLP-RoBERTa Intensity+ 0.660 0.658 0.002 0.958 0.957 0.023 0.969
Intensity- 0.660 0.658 0.002 0.963 0.985 0.017 0.978
Spurious 0.660 0.670 -0.010 0.940 0.954 0.030 0.960
Negation 0.660 0.683 -0.023 0.843 0.689 0.111 0.833
Swap 0.660 0.688 -0.028 0.938 0.820 0.058 0.911

Column definitions:

  1. Original Accuracy (Orig Acc). Baseline model performance on unperturbed text.
  2. Perturbed Accuracy (Pert Acc) & Accuracy Drop (Δ Acc). Performance after perturbation application. Positive Δ Acc indicates performance degradation, while negative values suggest improved performance—a counterintuitive result possibly indicating overfitting to specific linguistic patterns in the original test set.
  3. Consistency. The fraction of predictions remaining unchanged after perturbation.
  4. Correlation. Pearson correlation between original and perturbed confidence scores, measuring the preservation of relative confidence ordering.
  5. Confidence Change (Conf Change). Average absolute change in prediction confidence, quantifying prediction stability.
  6. Robustness Score. Composite metric weighing consistency (30%), correlation (30%), confidence stability (20%), and accuracy preservation (20%).

TL;DR

All models exhibit significant degradation under negation perturbations, with consistency dropping 15-35% and accuracy falling 8-14%. All models demonstrate strong robustness to intensity modifications (>0.95 consistency), indicating that gradient semantic shifts are well-handled.

Code at: https://gist.github.com/soodoku/0b2d8e84d7c325382381a9c18893c72a

Subscribe to Gojiberries

Don’t miss out on the latest issues. Sign up now to get access to the library of members-only issues.
jamie@example.com
Subscribe