FinStructBench: A Benchmark for Structured Information Retrieval from Financial Documents Using Graph-Verifiable Questions

Abstract

We introduce FinStructBench, a benchmark for evaluating the ability of large language models (LLMs) to extract, aggregate, and reason over structured information in financial documents. Unlike existing QA benchmarks that rely on human-annotated answers, FinStructBench generates questions automatically from a document’s underlying knowledge graph, making ground truth provably correct by construction. We define ten question categories of increasing structural complexity (exact recall, threshold checking, cross-reference, counting, contradiction detection, multi-hop reasoning, absence detection, ranking, numeric computation, and cross-table aggregation) and provide five benchmark instances spanning major financial report types: model validation (SR 11-7), fair lending (ECOA/HMDA), stress testing (CCAR/DFAST), credit portfolio review (OCC), and Basel III capital adequacy (Pillar 3). The benchmark supports two question modes: deterministic (template-based) and hybrid (LLM-paraphrased for linguistic diversity). Across 454 questions evaluated with two models, the graph-based retrieval baseline achieves 100% accuracy by design, while Claude Sonnet 4 scores 58% and Claude Opus 4.6 scores 65%. Model scaling improves numeric computation (+20 pp) and cross-reference (+18 pp) but leaves the exhaustive-enumeration and cross-table categories unchanged, revealing scale-resistant failure modes. A three-way ingestion comparison (regex-only, LLM-only, and hybrid) using Claude Opus 4 shows that LLM-only extraction recovers only 12.5% of structured data, while hybrid ingestion preserves 100% of regex values and adds 22% more entries from prose, empirically validating the regex-first, LLM-augmented architecture. FinStructBench is released as an open-source toolkit at https://github.com/asudjianto-xml/finstructbench, enabling researchers to generate benchmark instances from any markdown-formatted financial document.
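To make "correct by construction" concrete, the following minimal sketch shows how a template-based (deterministic-mode) question and its answer can be derived from the same graph query. It is an illustration under stated assumptions, not the FinStructBench implementation: the `Triple` type, the `GRAPH` contents, the function names, and the question templates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One edge of a toy knowledge graph: (subject, predicate, value)."""
    subject: str
    predicate: str
    value: float


# Hypothetical toy graph; the values are illustrative, not benchmark data.
GRAPH = [
    Triple("Model A", "auc", 0.81),
    Triple("Model B", "auc", 0.74),
    Triple("Model C", "auc", 0.69),
]


def exact_recall(graph, subject, predicate):
    """Exact-recall question: the answer is read directly from the graph."""
    value = next(
        t.value for t in graph
        if t.subject == subject and t.predicate == predicate
    )
    return f"What is the {predicate} of {subject}?", value


def counting(graph, predicate, threshold):
    """Counting question: the answer is computed over every matching triple."""
    answer = sum(
        1 for t in graph
        if t.predicate == predicate and t.value >= threshold
    )
    return f"How many entities have {predicate} of at least {threshold}?", answer


def ranking(graph, predicate):
    """Ranking question: the answer is the full ordering, highest value first."""
    ordered = sorted(
        (t for t in graph if t.predicate == predicate),
        key=lambda t: t.value,
        reverse=True,
    )
    return f"Rank all entities by {predicate}, highest first.", [t.subject for t in ordered]
```

Because each answer is computed from the graph rather than annotated by hand, a graph-based retrieval baseline that runs the same queries scores 100% by design, whereas a model reading the document must enumerate every matching entry to answer the counting and ranking forms.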
