Deciding When Legal AI Systems Are Good Enough for Use
Determine the task-specific conditions under which an AI system is sufficiently reliable for use in legal settings, based on auditing metrics such as accuracy, consistency, and groundedness, while recognizing that acceptable baselines may differ across tasks (e.g., court reporting versus hallucination detection).
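To make the idea of task-specific baselines concrete, the sketch below shows one way an acceptance check over audit scores might look in Python. The metric set, threshold values, and task names (`court_reporting`, `hallucination_detection`) are illustrative assumptions for this example only, not figures or code from the paper.

```python
from dataclasses import dataclass


@dataclass
class AuditResult:
    """Aggregate scores from an audit run, each assumed to lie in [0, 1]."""
    accuracy: float
    consistency: float
    groundedness: float


# Hypothetical per-task baselines: acceptable minimums differ by task.
# For instance, court reporting might demand near-verbatim accuracy,
# while a hallucination-detection aid might prioritize groundedness.
TASK_THRESHOLDS = {
    "court_reporting": AuditResult(accuracy=0.99, consistency=0.95, groundedness=0.98),
    "hallucination_detection": AuditResult(accuracy=0.90, consistency=0.85, groundedness=0.95),
}


def is_good_enough(task: str, result: AuditResult) -> bool:
    """Return True only if every audited metric meets the task's baseline."""
    baseline = TASK_THRESHOLDS[task]
    return (
        result.accuracy >= baseline.accuracy
        and result.consistency >= baseline.consistency
        and result.groundedness >= baseline.groundedness
    )


if __name__ == "__main__":
    audit = AuditResult(accuracy=0.97, consistency=0.96, groundedness=0.99)
    print(is_good_enough("court_reporting", audit))          # False: accuracy below baseline
    print(is_good_enough("hallucination_detection", audit))  # True: all metrics meet baseline
```

The key design point is that the thresholds are indexed by task rather than fixed globally, reflecting the section's claim that what counts as "good enough" depends on the task being audited.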
References
Conducting audits of AI systems — ideally across a suite of metrics that may address accuracy, consistency, and groundedness as is relevant to the task at hand — allows practitioners to better understand the limitations of the AI system for that task. An open question may remain: when is a system good enough to use?
— Tasks and Roles in Legal AI: Data Curation, Annotation, and Verification
(arXiv:2504.01349, Koenecke et al., 2 Apr 2025) in Challenge 3: Output Verification, paragraph beginning “Conducting audits of AI systems”