LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI

Published 16 Jun 2026 in cs.AI, cs.CL, cs.LG, and cs.MA | (2606.18021v1)

Abstract: AI systems deployed in legal workflows hallucinate at rates that aggregate metrics report at ~52%, but this average conceals where errors concentrate and in which direction they run, leaving compliance officers without an actionable signal for trustworthy deployment. We present LegalHalluLens, an auditing framework with three components: typed hallucination profiles across four legally-motivated claim categories (numeric, temporal, obligation/entitlement, factual) over CUAD (Hendrycks et al., 2021); a Risk Direction Index (RDI) that reduces omission-versus-invention bias to a single deployment-comparable scalar; and a typed debate pipeline calibrated to both magnitudes and directions. Across 510 contracts and 249,252 clause-level instances we measure a within-model gap of approximately 38-40 pp between obligation/numeric and temporal claims that aggregate reporting hides, and show that two systems with matched 52% rates can carry opposite RDIs. The debate pipeline reduces fabricated detections by 45% with per-category gains tracking the diagnosis, matching commercial APIs with a substantially smaller backbone (4B active parameters). Typed profiles and RDI surface failure modes that aggregate metrics hide; we further show these diagnostics serve as calibration inputs for multi-agent debate pipelines, where Skeptic challenges and asymmetric gates targeted at measured failure modes outperform generically-tuned debate. The framework supports direction-aware procurement, accountability, and agent design for legal AI deployed in the wild.

Authors (2)

Summary

The paper introduces a framework that dissects hallucinations into legally motivated claim types and employs the Risk Direction Index for directional error analysis.
It utilizes a calibrated multi-agent debate pipeline that targets dominant error modes, achieving a 45% reduction in fabricated clause detections.
The approach enables risk-aware legal AI deployment by revealing significant disparities in hallucination rates and guiding targeted mitigation strategies.

Overview

"LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI" (2606.18021) introduces a comprehensive auditing and mitigation framework tailored for legal AI contract workflows, targeting the critical problem of unreliable clause extraction and uninformative aggregate hallucination metrics. By dissecting hallucination failures into four legally motivated claim types and introducing the Risk Direction Index (RDI), the paper establishes a more actionable foundation for deployment, procurement, and accountability decisions. Furthermore, it demonstrates that per-category calibration enables a small open model combined with a typed multi-agent debate pipeline to rival commercial API systems at a fraction of the inference cost.

Typed Hallucination Profiles and Aggregation Failure

Aggregate hallucination rates (e.g., ~52%) conceal the complexity and practical risks underlying clause extraction for legal contracts. The paper shows empirically that different claim types (numeric, temporal, obligation/entitlement, factual) exhibit consistent and substantial gaps in hallucination rates across diverse LLM architectures:

Numeric and obligation claims display hallucination rates of 65–74%, while temporal claims are markedly lower at 29–35%.
The within-model gap between the worst and best categories is 38–41 percentage points, a disparity not visible under aggregate reporting.

Because this per-type pattern holds across models, aggregate metrics are non-actionable for compliance: models scoring similarly in aggregate can differ sharply in reliability on the most legally consequential clauses.

Figure 1: Typed hallucination rates on the 510-contract benchmark reveal substantial per-category disparities, with numeric and obligation failures significantly higher than temporal.
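
The contrast between a typed profile and a pooled rate can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the function names (`typed_profile`, `aggregate_rate`, `typed_gap_pp`) and the toy records are assumptions made for the example, and the data is synthetic.

```python
from collections import defaultdict


def typed_profile(records):
    """Return the hallucination rate per claim type.

    `records` is an iterable of (claim_type, is_hallucinated) pairs.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for claim_type, is_hallucinated in records:
        totals[claim_type] += 1
        errors[claim_type] += int(is_hallucinated)
    return {t: errors[t] / totals[t] for t in totals}


def aggregate_rate(records):
    """Return the single pooled hallucination rate over all records."""
    records = list(records)
    return sum(int(h) for _, h in records) / len(records)


def typed_gap_pp(profile):
    """Gap between the worst and best category, in percentage points."""
    return 100.0 * (max(profile.values()) - min(profile.values()))


# Synthetic toy data (NOT from the paper): 10 claims per category.
toy = (
    [("numeric", True)] * 7 + [("numeric", False)] * 3
    + [("temporal", True)] * 3 + [("temporal", False)] * 7
)

profile = typed_profile(toy)
print(profile)                          # {'numeric': 0.7, 'temporal': 0.3}
print(aggregate_rate(toy))              # 0.5
print(round(typed_gap_pp(profile), 1))  # 40.0
```

In this toy case the pooled rate of 0.5 says nothing about the 40-point spread between the two categories, which is the aggregation failure the section describes.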

Directional Risk Characterisation with RDI

Beyond the magnitude of hallucinations, the directionality of errors—whether models invent conditions (overstatement) or suppress them (understatement)—is vital for legal deployment. LegalHalluLens operationalises this via the Risk Direction Index (RDI), a signed scalar reflecting net omission vs. invention bias:

qwen3-32b and gpt-5.2 have matched aggregate hallucination rates but opposite RDI values: qwen3-32b predominantly omits conditions, gpt-5.2 predominantly invents them.
This distinction directly affects compliance risk profiles: omission-heavy systems threaten enforceability, while invention-heavy systems create spurious risk ceilings.

Figure 2: Error direction analysis shows scope errors dominate, but models with similar overall hallucination rates differ considerably in omission vs. invention of conditions.
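
This summary does not state the RDI formula, so the sketch below uses one plausible signed form as an assumption: the normalised difference between omission and invention counts. The sign convention (positive means omission-heavy) and the function name `risk_direction_index` are likewise assumptions, and the counts are synthetic.

```python
def risk_direction_index(omissions, inventions):
    """Signed direction score in [-1, 1].

    ASSUMED form, not the paper's definition:
        (omissions - inventions) / (omissions + inventions)
    Positive values mean omission-heavy, negative mean invention-heavy.
    """
    total = omissions + inventions
    if total == 0:
        return 0.0  # no directional errors observed
    return (omissions - inventions) / total


# Two hypothetical systems with the same number of directional errors (40)
# but opposite direction of bias.
system_a = risk_direction_index(omissions=30, inventions=10)
system_b = risk_direction_index(omissions=10, inventions=30)
print(system_a, system_b)  # 0.5 -0.5
```

Under this assumed definition, two systems with identical error totals land on opposite sides of zero, which mirrors the matched-rate, opposite-RDI situation described above.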

Typed Debate Pipeline for Calibrated Mitigation

Building on the diagnostic power of typed profiles and RDI, LegalHalluLens implements a six-role debate pipeline, with Skeptic challenges and gate asymmetries directly calibrated to detected failure modes:

Typed Skeptic questions target the dominant error mode per claim type, driving focused deliberation.
Addition/deletion gates enforce conservative policies, especially for high-failure types, based on empirical FAR/FRR ratios.
Re-extraction is triggered for structural extraction errors, promoting targeted repair over redundant debate.

This approach moves beyond generic debate mitigation (which treats all errors equally), enabling a small open model to match commercial systems' performance on composite score and achieve a 45% reduction in fabricated clause detections.

Figure 3: The debate pipeline is structured to focus deliberation and decision making on typed failure modes, with asymmetric gates tailored for risk mitigation.
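
An asymmetric gate of the kind described above can be sketched as a per-type pair of thresholds. Everything specific here is hypothetical: the `GatePolicy` type, the `gate` function, the idea of a scalar support score, the threshold values, and the choice to make additions stricter than deletions are illustrative assumptions, not the paper's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatePolicy:
    add_threshold: float     # support needed to ADD a claim to the output
    delete_threshold: float  # support needed to DELETE an extracted claim


# Hypothetical thresholds: additions are gated more strictly for the
# claim types with the highest measured failure rates.
POLICIES = {
    "numeric": GatePolicy(add_threshold=0.9, delete_threshold=0.6),
    "obligation": GatePolicy(add_threshold=0.9, delete_threshold=0.6),
    "factual": GatePolicy(add_threshold=0.8, delete_threshold=0.6),
    "temporal": GatePolicy(add_threshold=0.7, delete_threshold=0.6),
}


def gate(claim_type, action, support):
    """Return True if the proposed action passes the gate for this type.

    `support` is an assumed score in [0, 1] summarising how strongly the
    debate backed the proposed action.
    """
    policy = POLICIES[claim_type]
    if action == "add":
        return support >= policy.add_threshold
    if action == "delete":
        return support >= policy.delete_threshold
    raise ValueError(f"unknown action: {action!r}")


print(gate("obligation", "add", 0.85))     # False
print(gate("temporal", "add", 0.85))       # True
print(gate("obligation", "delete", 0.65))  # True
```

The same support score is accepted for a temporal addition and rejected for an obligation addition, which is the sense in which the gates are calibrated per type rather than tuned generically.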

Per-Type Mitigation and Direction Correction

Calibration of the debate pipeline yields the predicted outcome: gains are concentrated on the highest-failure claim types, and error direction is measurably corrected.

False positives among obligation and factual claims drop significantly, aligning with baseline profiles, while temporal claims remain essentially unchanged.
The RDI for obligation clauses is shifted from omission-heavy toward balanced by targeted Skeptic interventions.

Figure 4: Mitigation gains by claim type show strong concentration on obligation and factual categories, validating typed calibration as an effective specification.
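
Per-type gains of this kind are naturally expressed as relative reductions in false positives before and after the debate stage. The sketch below assumes that reading. The helper names and the counts are hypothetical and do not reproduce the paper's figures.

```python
def relative_reduction(before, after):
    """Fraction of the baseline count that was removed."""
    if before == 0:
        return 0.0  # nothing to reduce
    return (before - after) / before


def per_type_reduction(before_counts, after_counts):
    """Relative reduction in false positives for each claim type."""
    return {
        claim_type: relative_reduction(count, after_counts.get(claim_type, 0))
        for claim_type, count in before_counts.items()
    }


# Hypothetical false-positive counts (NOT the paper's numbers).
baseline = {"obligation": 200, "temporal": 50}
with_debate = {"obligation": 100, "temporal": 50}

print(per_type_reduction(baseline, with_debate))
# {'obligation': 0.5, 'temporal': 0.0}
```

Reporting the reduction per type, rather than one pooled figure, makes it visible whether the gains fall on the categories the baseline profile flagged as worst.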

Practical and Theoretical Implications

The diagnostic tools (typed profiles and RDI) equip practitioners to make deployment decisions that reflect real legal exposure rather than aggregate metrics. The methodology is transferable to any oracle-verifiable legal corpus, supporting direction-aware procurement and post-deployment monitoring. The calibrated debate pipeline demonstrates that structured interventions targeting measured failure modes can substantially improve trustworthiness and efficiency, opening avenues for low-cost open models to challenge commercial offerings.

For regulatory and compliance applications, model selection can be guided by specific risk profiles: omission-dominant systems are unsuitable where contractual enforceability is at stake, while invention-dominant systems may be deployed in liability-sensitive workflows with robust human review. The framework also lays the groundwork for more nuanced agent designs in AI auditing, moving away from generic multi-agent pipelines toward per-type calibration.

Future Directions

Key extensions include:

Generalisation studies across jurisdictions and document types to verify the stability of the typed gap and RDI.
Evaluation in retrieval-augmented setups, where additional failure modes interplay with clause extraction quality.
Development of human-validated evaluation judges to refine RDI's cardinal interpretation and reduce judge-dependent noise.
Minimal-prompt and ablation studies comparing generically tuned and typed-calibrated debate pipelines.

Broader investigation into how legal AI frameworks can leverage these diagnostics for risk disclosures, governance, and human-in-the-loop review is warranted, especially as deployment scales into adversarial or high-compliance environments.

Conclusion

LegalHalluLens provides robust evidence that aggregate hallucination metrics are inadequate for legal AI deployment decisions, unveiling a consistent and substantial 38–41 percentage point typed gap and directional risk profiles obscured by aggregate reporting. Typed hallucination profiles and RDI are actionable diagnostics, enabling practitioners to calibrate mitigation strategies and agent design. The calibrated debate pipeline demonstrates practical gains: a 45% reduction in fabricated clause detections and strong concentration of improvements on the claim types where risk is highest. For trustworthy legal AI, evaluation and mitigation must be typologically and directionally specific.
