What's Wrong? Adversarial LLM Judges With Their Own Evaluation Criteria

How should we evaluate what large language models say? Metrics like BLEU or exact match don't work well when answers are open-ended or involve reasoning. Human judgments are better—but they're costly and inconsistent. So increasingly, researchers have turned to LLMs themselves to act as evaluators.

Recent work uses models like GPT-4 to judge outputs from other models in benchmarks such as MT-Bench, AlpacaEval, and the LMSYS Arena. These frameworks typically ask a strong model to compare two outputs and select the better one along dimensions such as helpfulness or correctness. This works surprisingly well: studies show that LLM-as-a-judge ratings correlate well with human preferences, especially when the judging model is stronger than the generators.

But current setups mostly rely on scalar or comparative judgments. The model is prompted to pick the better answer or assign a score. Rarely is it asked to reason about why one output is better, or how it should be evaluated in the first place.
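
For contrast, a typical pairwise setup looks roughly like the sketch below. `call_llm` is a placeholder for whatever chat-completion client you use, and the prompt wording is illustrative rather than taken from any particular benchmark.

```python
# A rough sketch of the conventional pairwise-judgment setup described above.
# `call_llm` stands in for any chat-completion client; the prompt is illustrative.
from typing import Callable


def pairwise_judge(prompt: str, answer_a: str, answer_b: str,
                   call_llm: Callable[[str], str]) -> str:
    judgment = call_llm(
        f"Question:\n{prompt}\n\n"
        f"Answer A:\n{answer_a}\n\nAnswer B:\n{answer_b}\n\n"
        "Which answer is more helpful and correct? Reply with 'A' or 'B'."
    )
    # The output is a bare preference; no criteria or rationale are surfaced.
    return judgment.strip()
```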

We propose an extension to this paradigm: Critique-First Evaluation.

A Three-Step Approach

  1. Meta-Evaluation: First, ask the model to generate criteria for evaluating a specific kind of task. For example, "What makes a good answer to a multi-hop reasoning question?" or "What should we look for in a high-quality code snippet for this prompt?"
  2. Critique Instead of Score: Next, instead of asking "Is this correct?" or "Give this a 1-10 rating," we ask: "What’s wrong with this output?" or "Using the criteria above, critique this response."
  3. Judgment from Critique: Finally, either the model or an external process can distill the critique into a final judgment. But crucially, the evaluation is now grounded in articulated standards and fault-finding, not opaque scores.

This adversarial twist ("what’s wrong with this?") forces the model to engage more deeply with the task. And because the evaluation logic is surfaced as natural language, we can inspect it, debug it, and improve it.
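
To make the pipeline concrete, here is a minimal sketch in Python. It assumes a generic `call_llm` callable wrapping whatever chat-completion client you use; the function name and prompt wordings are illustrative, not a fixed specification.

```python
# A minimal sketch of Critique-First Evaluation. `call_llm` is a placeholder
# for any chat-completion client; the prompts are illustrative.
from typing import Callable


def critique_first_eval(task: str, output: str,
                        call_llm: Callable[[str], str]) -> dict:
    # Step 1: Meta-evaluation -- ask the judge to articulate task-specific criteria.
    criteria = call_llm(
        "What makes a good answer to the following task? "
        f"List concrete evaluation criteria.\n\nTask:\n{task}"
    )

    # Step 2: Critique instead of score -- adversarial fault-finding against those criteria.
    critique = call_llm(
        f"Criteria:\n{criteria}\n\nTask:\n{task}\n\nResponse:\n{output}\n\n"
        "Using the criteria above, what is wrong with this response?"
    )

    # Step 3: Distill the critique into a final judgment.
    verdict = call_llm(
        f"Critique:\n{critique}\n\n"
        "Based only on this critique, rate the response from 1 to 10 and justify briefly."
    )

    # Every intermediate artifact is returned so the evaluation logic stays inspectable.
    return {"criteria": criteria, "critique": critique, "verdict": verdict}
```

Because the criteria and the critique are surfaced alongside the verdict, each stage can be logged, audited, and revised independently.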

Prior Work

This proposal builds on and extends several active threads in LLM evaluation:

  • LLMs as Judges: Benchmarks such as MT-Bench, AlpacaEval, and LMSYS Arena use models like GPT-4 to compare outputs, usually in a pairwise format. These systems primarily produce scalar judgments or preferences ("which is better?") without surfacing the rationale behind the judgment or inviting deeper critique.
  • Self-Critique and Iterative Refinement: Work on Constitutional AI, self-reflection, and chain-of-thought refinement (e.g., Anthropic, OpenAI) shows that prompting models to critique their own answers can improve generation quality. However, these methods are typically designed for generation improvement, not evaluation, and rarely formalize critique as a standalone evaluation signal.
  • Model-Written Critiques: This line of work trains a judge model (around 13B parameters) that explains its ratings in natural language. Its rubric, however, is baked in during training; it cannot generate task-specific criteria on the fly.
  • Adversarial Prompting and Red-Teaming: Safety-focused evaluations often ask LLMs to find flaws, hallucinations, or harmful content in responses. While adversarial in spirit, these setups are narrowly targeted at failure modes and do not generalize to broader notions of output quality or reasoning correctness.

Caveat Emptor

Of course, this is not without risk:

  • Hallucinated Criticism: Models may invent plausible-sounding flaws that aren't real. Asking for critique invites false positives.
  • Criteria Drift: When models define rubrics themselves, their evaluation priorities may differ from what humans actually care about.
  • Bias Coupling: If the generator and evaluator are from the same family, shared blind spots can persist unchecked.

Closing Thoughts

Any successful evaluation framework can be co-opted for self-refinement: critiques grounded in articulated criteria are exactly the kind of feedback signal that RL-based training loops can consume.
