When the Judge Is Wrong: Measuring LLM-as-Judge Reliability Against Graph-Verified Ground Truth in Financial Documents
Abstract
LLM-as-judge is widely adopted for evaluating AI system outputs, yet its reliability on structured tasks remains poorly understood because ground truth is typically established by the same class of system being evaluated. We break this circularity with a graph-verified evaluation framework: we construct provably correct ground truth from financial documents using graph-based verification, then measure how accurately LLM judges assess answer correctness. Across 454 structured retrieval questions spanning ten categories and five financial report types, we evaluate Claude Opus 4.6 as a judge in four configurations. Even a strict judge with access to ground truth disagrees with graph-verified truth in 5.6% of cases and has a false acceptance rate of 7.1%, that is, it approves wrong answers as correct. A lenient judge raises the false acceptance rate to 12.0%, and a blind judge (without ground truth) reaches 40.4%, approving 2 in 5 wrong answers. A grounded judge, given the full source document but no ground truth, achieves 72.2% agreement but still has a 31.7% false acceptance rate, approving nearly 1 in 3 wrong answers even with the source material in context. For the grounded judge, false acceptances concentrate in the threshold (75%), exact recall (59%), and numeric computation (39%) categories, precisely the structured tasks where accuracy matters most in financial regulation. These findings demonstrate that LLM-as-judge is unreliable for verifying structured information extraction and that graph-based verification provides a necessary alternative.
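To make the two headline metrics concrete, the sketch below shows one way agreement and false acceptance rate could be computed from paired labels. It assumes that the false acceptance rate is taken over the wrong answers only (consistent with the "2 in 5 wrong answers" reading above); the `Verdict` class, its field names, and `judge_metrics` are hypothetical names chosen for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Verdict:
    """One evaluated answer (hypothetical structure, for illustration only)."""
    truly_correct: bool   # label from graph-verified ground truth
    judge_accepts: bool   # whether the LLM judge marked the answer correct


def judge_metrics(verdicts: list[Verdict]) -> dict[str, Optional[float]]:
    """Compare judge decisions against verified labels.

    Assumption: false acceptance rate = (wrong answers the judge accepted)
    divided by (all wrong answers).
    """
    total = len(verdicts)
    if total == 0:
        raise ValueError("no verdicts to score")

    # Agreement: the judge's decision matches the verified label.
    agree = sum(v.truly_correct == v.judge_accepts for v in verdicts)

    # False acceptances: wrong answers that the judge approved.
    wrong = [v for v in verdicts if not v.truly_correct]
    false_accepts = sum(v.judge_accepts for v in wrong)

    return {
        "agreement": agree / total,
        "disagreement": (total - agree) / total,
        # Undefined when there are no wrong answers to accept.
        "false_acceptance_rate": false_accepts / len(wrong) if wrong else None,
    }


# Toy data (not from the paper): 7 correct answers, 3 wrong answers.
toy = (
    [Verdict(True, True)] * 6      # correct, accepted
    + [Verdict(True, False)] * 1   # correct, rejected
    + [Verdict(False, False)] * 2  # wrong, rejected
    + [Verdict(False, True)] * 1   # wrong, accepted (a false acceptance)
)
metrics = judge_metrics(toy)
```

Under this definition a judge can show high overall agreement and still have a high false acceptance rate when wrong answers are a minority of the evaluated set, which is why the two numbers are reported separately.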