
The Missing Layer Between RAG and Production

RAG solved retrieval. Guardrails solved safety. What's missing is verification — and regulated industries can't ship without it.

Knowlytix Team

The Gap After Retrieval

RAG solved retrieval — your LLM now has the right documents. But no layer in the current stack checks whether the LLM’s answer is actually consistent with those documents.

Consider a banking customer service bot. RAG retrieves the fee refund policy. The LLM reads it and responds: “We’ll process your full refund immediately.” The retrieval was correct. The response is wrong — the policy requires manager approval above $25. The LLM read the right document and gave the wrong answer, confidently and without uncertainty.

The obvious fix is to use a second LLM to check the first one’s output. LLM-as-judge works — Constitutional AI, RLAIF, and tools like Patronus Lynx demonstrate this. But it has three failure modes that matter in regulated domains:

  1. Inconsistent on edge cases. Run the same compliance check ten times and you get different decisions. Acceptable for development evals, not for production compliance.
  2. No rule-level traceability. An LLM can say “this violates TILA” but can’t show which claim violated which rule section at what score. Auditors need evidence chains, not generated conclusions.
  3. Not reproducible. The explanation is generated text from the same probabilistic process. A regulator needs to inspect the methodology, not a language model’s best guess at what a compliance review should look like.

In regulated industries — financial services, insurance, healthcare — these aren’t minor gaps. They’re the difference between “we have AI monitoring” and “we can prove compliance under audit.”

Benchmarks

We evaluated against three standard NLP verification benchmarks, each testing a different aspect of claim verification:

  • FEVER (Fact Extraction and VERification) — 73 claims labeled SUPPORTS or REFUTES, verified against Wikipedia evidence. Tests whether a system can determine if evidence supports or contradicts a factual claim. The standard benchmark for fact verification.

  • ContractNLI — 62 contract clause-hypothesis pairs (excluding neutral). Tests whether a system can determine if a contract clause entails or contradicts a given hypothesis. Representative of real-world document compliance checking.

  • FactCC — 99 article-summary pairs. Tests whether claims in a generated summary are faithful to the source article. Representative of LLM output verification — exactly the problem enterprises face when deploying generative AI.

Both pipelines (the LLM-as-judge baseline and Knowly verification) use the same Qwen2 7B model running locally via Ollama — no cloud API calls, no cost asymmetry.

Metrics

Every metric answers a specific question a compliance team would ask:

| Metric | Question it answers | Why it matters for compliance |
|---|---|---|
| Precision | When the system says “supported,” how often is it right? | A false “supported” means a wrong answer reaches the customer |
| Recall | Of all truly supported claims, how many does the system catch? | A missed “supported” means unnecessary escalation and delay |
| F1 | Harmonic mean of precision and recall | Overall balance — neither too permissive nor too conservative |
| FPR (False Positive Rate) | How often are contradicted claims incorrectly passed? | The most dangerous failure mode — wrong answers that look right |
| FNR (False Negative Rate) | How often are supported claims incorrectly blocked? | Operational cost — correct answers stuck in review queues |

F1 is the primary comparison metric. FPR is the most important for risk: a high FPR means the system is letting incorrect claims through.
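The five metrics above follow directly from the four confusion-matrix counts, with “supported” as the positive class. A minimal sketch (the function name and label strings are illustrative, not from any published harness):

```python
# Compute precision, recall, F1, FPR, and FNR for binary
# supported/contradicted verification decisions.

def verification_metrics(predicted, actual):
    tp = sum(p == "supported" and a == "supported" for p, a in zip(predicted, actual))
    tn = sum(p == "contradicted" and a == "contradicted" for p, a in zip(predicted, actual))
    fp = sum(p == "supported" and a == "contradicted" for p, a in zip(predicted, actual))  # most dangerous
    fn = sum(p == "contradicted" and a == "supported" for p, a in zip(predicted, actual))  # operational cost
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0  # contradicted claims incorrectly passed
    fnr = fn / (fn + tp) if fn + tp else 0.0  # supported claims incorrectly blocked
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr, "fnr": fnr}
```

Note that FPR is computed over the contradicted (negative) population, which is why a pipeline can post a strong F1 and a poor FPR at the same time.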

Results

FEVER — fact verification (73 claims):

| Metric | Published SOTA | LLM-as-Judge | Knowly Verification |
|---|---|---|---|
| Accuracy | 80.2% (BEVERS) | 77.0% | 89.0% |
| Precision | | 63.0% | 81.3% |
| Recall | | 100.0% | 92.9% |
| F1 | | 77.3% | 86.7% |
| FPR | | 37.8% | 13.3% |
| FNR | | 0.0% | 7.1% |

Knowly beats both the LLM-judge (+9.4pp F1) and published SOTA (+8.8pp accuracy). FPR of 13.3% means fewer false claims slip through than any baseline.

ContractNLI — contract clause verification (62 samples):

| Metric | Published SOTA | LLM-as-Judge | Knowly Verification |
|---|---|---|---|
| Accuracy | ~87.5% (BERT-large) | 87.1% | 88.7% |
| Precision | | 90.0% | 90.2% |
| Recall | | 96.4% | 98.2% |
| F1 | | 93.1% | 94.0% |
| FPR | | 100.0% | 100.0% |
| FNR | | 3.6% | 1.8% |

Knowly edges both LLM-judge (+0.9pp F1) and published SOTA (+1.2pp accuracy). 98.2% recall means almost no valid contract clauses are incorrectly blocked.

FactCC — summarization faithfulness (99 samples):

| Metric | Published SOTA | LLM-as-Judge | Knowly Verification |
|---|---|---|---|
| Accuracy | 72.9% (FactCCX) | 85.0% | 85.9% |
| Precision | | 86.3% | 88.0% |
| Recall | | 96.5% | 96.4% |
| F1 | | 91.7% | 92.1% |
| FPR | | 86.7% | 73.3% |
| FNR | | 3.5% | 3.6% |

Knowly beats LLM-judge (+0.4pp F1) and published SOTA (+13.0pp accuracy). Sentence-level NLI scoring catches contradictions in long articles that full-text NLI misses, reducing the false positive rate from 86.7% to 73.3%.
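The sentence-level idea can be sketched in a few lines: rather than one NLI call over the full article, score the claim against each source sentence and keep the worst contradiction. The scorer below is a toy stand-in (a real system would call an NLI model), and the function names are illustrative assumptions:

```python
import re

# Toy stand-in for an NLI model: returns a contradiction score in [0, 1].
def nli_contradiction(claim: str, sentence: str) -> float:
    # Flags sentences that share the "refund" topic but negate it outright.
    return 0.9 if "no refund" in sentence.lower() and "refund" in claim.lower() else 0.1

# Sentence-level scoring: the max contradiction over sentences surfaces a
# single contradicting sentence that full-text scoring would average away.
def max_contradiction(claim: str, source: str) -> float:
    sentences = re.split(r"(?<=[.!?])\s+", source)
    return max(nli_contradiction(claim, s) for s in sentences)

score = max_contradiction(
    "A full refund will be issued.",
    "Fees apply to all accounts. No refund is given after 30 days.",
)
```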

Summary: Knowly wins all three benchmarks — against both LLM-as-judge and published academic baselines. Using a 7B local model with no cloud API calls.

| Benchmark | Published SOTA | LLM-as-Judge | Knowly | vs Published | vs LLM-Judge |
|---|---|---|---|---|---|
| FEVER | 80.2% acc | 77.3% F1 | 86.7% F1 | +8.8pp acc | +9.4pp F1 |
| ContractNLI | ~87.5% acc | 93.1% F1 | 94.0% F1 | +1.2pp acc | +0.9pp F1 |
| FactCC | 72.9% acc | 91.7% F1 | 92.1% F1 | +13.0pp acc | +0.4pp F1 |

Methodology

Each benchmark provides a dataset of (claim, evidence, ground-truth label) triples. The ground-truth label is either “supported” or “contradicted” — determined by human annotators, not by either pipeline. We feed the same claim and evidence to both pipelines and compare each pipeline’s predicted label against the ground truth.

Evaluation protocol:

  1. Load benchmark samples. Exclude “not enough info” / neutral samples where applicable — these have no clear ground truth for binary verification.
  2. Run each sample through both pipelines independently. Both pipelines see identical inputs: same claim text, same evidence text.
  3. Each pipeline outputs a binary decision: “supported” or “contradicted.”
  4. Compare each prediction against the human-annotated ground truth to compute precision, recall, F1, FPR, and FNR.
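The protocol above is deliberately simple to implement. A minimal harness sketch, where the pipeline callable is a stand-in for either real pipeline (both would call the local Qwen2 7B model in practice):

```python
# Run one pipeline over (claim, evidence, gold-label) triples; neutral
# samples are assumed to be excluded upstream (protocol step 1).
def evaluate(samples, pipeline):
    results = []
    for claim, evidence, gold in samples:
        pred = pipeline(claim, evidence)   # steps 2-3: identical inputs, binary decision
        results.append((pred, gold))       # step 4: scored against human annotation
    correct = sum(p == g for p, g in results)
    return {"accuracy": correct / len(results), "predictions": results}

# Toy run with a trivial stand-in pipeline that always says "supported":
samples = [
    ("The fee is fully refundable.", "Fees over $25 require manager approval.", "contradicted"),
    ("Refunds over $25 need approval.", "Fees over $25 require manager approval.", "supported"),
]
report = evaluate(samples, lambda claim, evidence: "supported")
```

Because both pipelines run through the same loop on the same triples, any metric gap is attributable to the pipeline itself, not the harness.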

What counts as correct:

| Prediction | Ground truth | Classification |
|---|---|---|
| Supported | Supported | True Positive |
| Contradicted | Contradicted | True Negative |
| Supported | Contradicted | False Positive (most dangerous) |
| Contradicted | Supported | False Negative (operational cost) |

Controls: Both pipelines use the same Qwen2 7B model running locally. No cloud API calls, no cost asymmetry, no difference in model capability. The only variable is architecture — the LLM-judge uses end-to-end prompting, while Knowly uses mathematical verification to produce a deterministic, traceable score.

These are preliminary numbers — sample sizes of 62-99, a single local model, no cherry-picking. We’re sharing them because the evaluation methodology matters more than the specific numbers.

What mathematical verification adds

The right framing isn’t LLM vs verification — it’s LLM plus verification. LLMs are excellent at understanding context and extracting meaning. A dedicated mathematical verification layer adds what LLMs can’t provide on their own:

| Capability | LLM-as-Judge | Knowly Verification | Published SOTA |
|---|---|---|---|
| F1 (FEVER) | 77.3% | 86.7% (+9.4pp) | 80.2% acc (BEVERS) |
| F1 (ContractNLI) | 93.1% | 94.0% (+0.9pp) | ~87.5% acc (BERT-L) |
| F1 (FactCC) | 91.7% | 92.1% (+0.4pp) | 72.9% acc (FactCCX) |
| Reproducibility | Varies across runs | Deterministic — same input, same output | |
| Rule traceability | “Violates TILA” | Claim → Rule § → Score → Decision | |
| Audit trail | Generated explanation | Structured decision record | |
| Edge case handling | Inconsistent | Defined thresholds, measurable scores | |
| Regulator-inspectable | No | Yes — methodology, not just output | |

The verification layer doesn’t replace the LLM — it makes the LLM’s output trustworthy enough for production in regulated domains.

What Domain Verification Actually Looks Like

Domain verification treats LLM output the way a compliance officer would: extract the specific claims being made, check each one against the applicable rules, and produce an auditable decision with evidence.

This requires three capabilities that the current stack lacks:

1. Claim-Level Extraction

An LLM response isn’t a single statement — it contains multiple verifiable claims. “We’ll process your full refund of the $35 fee immediately” contains at least three: that a refund will happen, that it covers the full amount, and that it happens immediately. Each claim must be extracted and verified independently.

This is fundamentally different from evaluating the response as a whole. A response can be 90% correct and still contain a single claim that violates a critical policy rule. Whole-response scoring hides this. Claim-level extraction surfaces it.
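To make the distinction concrete, here is a deliberately naive extraction sketch. A production extractor would use an LLM or a dependency parse; this version just splits on sentence boundaries and coordinating “and”, which is enough to show why per-claim checks differ from whole-response scoring:

```python
import re

# Illustrative only: split a response into candidate claims on sentence
# boundaries and coordinating "and". Real extraction is an NLP problem,
# not a regex problem.
def extract_claims(response: str) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    claims = []
    for sentence in sentences:
        for part in re.split(r",?\s+and\s+", sentence):
            part = part.strip(" .")
            if part:
                claims.append(part)
    return claims

claims = extract_claims(
    "We'll process your refund immediately. "
    "It covers the full $35 fee and no approval is needed."
)
# Each claim is then verified independently against the rules.
```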

2. Domain Rules as Structured Configuration

Every domain has rules. Banking has TILA, FCRA, ECOA. Insurance has coverage terms, exclusions, and state-mandated processing timelines. Model risk management has performance thresholds and fairness requirements.

These rules shouldn’t live in prompts (fragile), in code (requires engineering to change), or in the LLM’s training data (unverifiable). They should be structured, versionable, testable configuration — authored by domain experts, not engineers.

Imagine a compliance officer writing verification rules the way a developer writes test cases:

Rule: refund_balance_limit
Type: MUST
Severity: HIGH
Text: Refunds are not allowed when account balance exceeds policy limit

Rule: manager_approval
Type: MUST
Severity: HIGH
Text: Refunds over $25 require manager approval

When the rules change — a new regulation, an updated policy — the domain expert updates the configuration. No code deployment. No engineering sprint. The verification pipeline picks up the new rules and starts enforcing them.
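One way to back that workflow with code is to parse the rule blocks above into typed objects the engine can enforce. The text format and field names mirror the example; everything else here is an illustrative assumption:

```python
from dataclasses import dataclass

# A domain rule as structured, versionable configuration.
@dataclass(frozen=True)
class Rule:
    name: str
    type: str      # e.g. MUST
    severity: str  # e.g. HIGH
    text: str

# Parse blank-line-separated "Key: value" blocks into Rule objects.
def parse_rules(config: str) -> list[Rule]:
    rules, fields = [], {}
    for line in config.strip().splitlines() + [""]:
        if not line.strip():
            if fields:
                rules.append(Rule(fields["Rule"], fields["Type"],
                                  fields["Severity"], fields["Text"]))
                fields = {}
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return rules

rules = parse_rules("""
Rule: manager_approval
Type: MUST
Severity: HIGH
Text: Refunds over $25 require manager approval
""")
```

Because the rules are plain data, they can be diffed, versioned, and unit-tested like any other configuration file.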

3. Auditable Decision Traces

In regulated industries, “the AI said it was fine” is not an acceptable audit response. Regulators want to see:

  • What claims were made in the LLM output
  • Which rules were checked against each claim
  • What the verification score was and why
  • Whether the claim passed, was flagged for review, or was rejected
  • A tamper-evident record of the entire decision chain

This is the difference between monitoring (dashboards, alerts, aggregate metrics) and auditing (per-decision evidence trails that hold up under regulatory scrutiny). Most observability tools give you the former. Regulated industries need the latter.
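As a sketch of what a per-decision record could look like: each record carries the claim, the rule checked, the score, the decision, and a hash linking it to the previous record. The field names and hash-chaining scheme are illustrative assumptions, not Knowly’s actual format:

```python
import hashlib
import json

# Build one tamper-evident audit record for a single claim-vs-rule decision.
def audit_record(claim, rule_id, score, decision, prev_hash="0" * 64):
    record = {
        "claim": claim,
        "rule": rule_id,
        "score": score,
        "decision": decision,    # PASS / FLAG / REJECT
        "prev_hash": prev_hash,  # links records into a chain
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["hash"] = hashlib.sha256(payload).hexdigest()
    return record

rec = audit_record("Refund processed immediately", "manager_approval", 0.31, "REJECT")
# Any later edit to the record changes its hash and breaks the chain.
```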

Why This Matters Now

Three forces are converging:

Regulatory pressure is concrete, not theoretical. The EU AI Act’s general-purpose AI obligations took effect in August 2025. High-risk system requirements hit in 2027. NIST AI RMF adoption is accelerating across US enterprises. Companies deploying LLMs in regulated domains need compliance verification — not eventually, now.

LLM capabilities keep improving, which makes the problem worse. Better models produce more fluent, more confident outputs. This is good for user experience and bad for compliance — a more confident wrong answer is harder to catch with surface-level checks. As models get better at sounding right, the verification layer needs to get better at proving they are right.

RAG is becoming table stakes, not a differentiator. When every enterprise has a vector database and a retrieval pipeline, the competitive advantage shifts to what happens after retrieval. The companies that can verify, audit, and prove compliance will win the regulated enterprise market. The ones that can only retrieve and generate will be stuck in low-stakes use cases.

The Verification Layer

What the enterprise AI stack needs is a verification layer that sits after generation and before delivery:

User Query
    |
    v
+--------------------------+
| RAG: Retrieve context    |  <- Databricks, Pinecone, etc.
+------------+-------------+
             |
             v
+--------------------------+
| LLM: Generate response   |  <- OpenAI, Anthropic, self-hosted
+------------+-------------+
             |
             v
+--------------------------+
| Guardrails: Safety check |  <- Toxicity, PII, jailbreak
+------------+-------------+
             |
             v
+----------------------------------------------+
| Verification: Domain compliance check        |  <- THE GAP
|                                              |
|  1. Extract claims from response             |
|  2. Load domain rules (structured config)    |
|  3. Verify each claim against rules          |
|  4. Produce auditable decision + evidence    |
|  5. PASS -> deliver  |  FLAG -> review       |
|     REJECT -> block with fallback            |
+------------+---------------------------------+
             |
             v
        Verified Response
    (with audit trail attached)

This layer doesn’t replace RAG or guardrails — it completes the stack. RAG ensures the model has the right information. Guardrails ensure the output is safe. Verification ensures the output is correct for this domain and can prove it.

The key properties of this layer:

  • Domain-agnostic engine, domain-specific rules. The verification engine doesn’t know anything about banking or insurance. It knows how to extract claims, match them against rules, and produce decisions. The domain knowledge lives in configuration files that domain experts write.

  • Mathematical, not probabilistic. “92% confidence the output is compliant” is not useful when a regulator asks “was this output compliant?” The verification needs to produce a measurable, reproducible result — a defined distance metric with a threshold, not a probability estimate from another LLM.

  • Per-claim, not per-response. A response with five claims where four are correct and one violates a critical rule should not get a passing score of 80%. It should identify the specific violation, cite the specific rule, and flag or reject based on severity.

  • Audit-native. Every verification produces a complete record: claims extracted, rules matched, scores computed, decision rendered. Not as a logging afterthought, but as a first-class output that’s cryptographically signed and tamper-evident.
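Putting the second and third properties together, the final decision step reduces to a pure function over a measurable score and the matched rule’s severity. The thresholds and severity handling below are illustrative assumptions:

```python
# Map a per-claim verification score in [0, 1] to PASS / FLAG / REJECT.
# Pure function of its inputs: same score, same thresholds, same decision
# on every run -- no sampling, no temperature.
def decide(score: float, severity: str, pass_at: float = 0.8,
           reject_at: float = 0.4) -> str:
    if score >= pass_at:
        return "PASS"
    if score < reject_at or severity == "HIGH":
        return "REJECT"  # high-severity rules never settle for manual review
    return "FLAG"        # ambiguous scores on lower-severity rules go to a human
```

A compliance team can audit this function by reading it; there is no model to interrogate.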

What This Means for the Stack

The enterprise AI stack is evolving in layers, and each layer creates demand for the next:

  1. Foundation models (solved) — GPT, Claude, Llama, Gemini
  2. Orchestration (solved) — LangChain, LlamaIndex, DSPy
  3. Retrieval (solved) — RAG, vector databases, embedding pipelines
  4. Safety (partially solved) — guardrails, content filtering, prompt injection defense
  5. Domain verification (unsolved) — claim-level compliance checking with audit trails

Layer 5 is where the value shifts for regulated industries. Not because the other layers aren’t important — they are — but because they’re becoming commoditized. When every company has RAG and guardrails, the differentiator is: can you prove your AI outputs comply with your domain’s rules?

The companies building this layer today are building for a market that will be mandatory by 2027.

Where We Go From Here

We’re building this verification layer at Knowlytix. Across three benchmarks — FEVER, ContractNLI, and FactCC — Knowly’s structured verification beats LLM-as-judge on F1 while providing the reproducibility and auditability that regulated industries require. Knowly now wins all three benchmarks, including FactCC where sentence-level NLI scoring flipped a previous loss into a win.

There’s still work to do. Our false positive rate on contract and summarization benchmarks needs improvement — FactCC’s FPR dropped from 86.7% to 73.3% but is still too high. We need to test on real compliance data, not just academic benchmarks. And we need to validate that the audit trail we produce actually satisfies what regulators ask for — not what we think they ask for.

We’re sharing this early because the engineering problems are interesting and the regulated AI community is small enough that collaboration matters more than secrecy. If you’re working on similar problems — verification, compliance, auditability — or if you’re deploying LLMs in a regulated domain and have opinions on what “good enough” verification looks like, we’d genuinely like to hear from you.

Interested in AI verification?

We're sharing our work early because collaboration matters. Let's talk.