What's Wrong? Adversarial LLM Judges With Their Own Evaluation Criteria
How should we evaluate what large language models say? Surface metrics like BLEU or exact match don't work well when answers are open-ended or involve reasoning. Human judgment is better, but it is costly and inconsistent. So researchers have increasingly turned to LLMs themselves to act as evaluators, the setup commonly called LLM-as-a-judge.
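To make the setup concrete, here is a minimal sketch of an LLM-as-judge call, assuming the OpenAI Python client; the model name, rubric, and 1-to-5 scale are illustrative choices, not a prescription from this article.

```python
# Minimal LLM-as-judge sketch. Assumes the OpenAI Python client and an
# OPENAI_API_KEY in the environment; the model name and rubric are
# illustrative, not the article's method.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator.
Question: {question}
Candidate answer: {answer}

Rate the answer from 1 (poor) to 5 (excellent) for correctness and clarity.
Reply with the score only."""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    """Ask an LLM to grade a candidate answer on a 1-5 scale."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,  # reduce scoring variance across calls
    )
    return int(response.choices[0].message.content.strip())

# Example:
# judge("What causes tides?", "Mostly the Moon's gravity, plus the Sun's.")
```

Even this toy version hints at the design decisions a judge entails: what rubric to use, what scale, and how much to trust a single sampled score.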