ContractNLI: Legal Document Inference
- ContractNLI is a benchmark for document-level legal natural language inference that evaluates entailment/contradiction classification and evidence extraction in automated contract review.
- It introduces span-based multi-label classification with dynamic context segmentation to handle lengthy contracts and complex legal structures.
- Empirical results show that evidence-driven methods notably improve NLI accuracy, especially in detecting contradictions within legal documents.
ContractNLI is a specialized benchmark and methodology for document-level natural language inference (NLI) within the legal domain, specifically aimed at the automated review of contracts. The central challenge is to determine, for a given contract and a set of hypotheses—formulated as legally meaningful statements—whether each hypothesis is entailed by, contradicts, or is not mentioned in the contract, and to identify the specific evidence spans supporting these decisions. As legal documents exhibit complex logical structures, domain-specific vocabulary, and often substantial length, ContractNLI introduces both a unique data resource and a set of methodological innovations to address the intricacies of legal semantic reasoning.
1. Problem Definition and Dataset Construction
The ContractNLI task is formulated as multi-label, document-level NLI: given a contract (the premise) and a fixed set of 17 hypotheses shared across all contracts (e.g., “Some obligations of Agreement may survive termination.”), a system predicts for each pair one of three labels—Entailment, Contradiction, or NotMentioned—and also highlights the supporting text spans as evidence. The corpus comprises 607 annotated non-disclosure agreements (NDAs), each manually labeled at the contract–hypothesis level by legal experts (Koreeda et al., 2021).
Distinctive features of the dataset include:
- Contract length (mean >2,250 tokens; ~86% >512 tokens, exceeding standard model limits)
- Annotation of evidence as full sentences or list items (averaging 77.8 candidate spans per contract)
- Fixed set of hypotheses, enabling cross-document comparison and fine-grained semantic evaluation
This design supports rigorous benchmarking of both the NLI classification and the evidence extraction subsystems under realistic, high-complexity legal scenarios.
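The task setup above can be made concrete with a minimal sketch of one annotated contract–hypothesis pair. The field names here are illustrative, not the dataset's official JSON schema:

```python
# Hypothetical record for one contract-hypothesis pair in a ContractNLI-style
# corpus. Field names and values are illustrative only.
LABELS = ("Entailment", "Contradiction", "NotMentioned")

record = {
    "contract_id": "nda_0042",                       # hypothetical identifier
    "hypothesis": "Some obligations of Agreement may survive termination.",
    "label": "Entailment",                           # one of LABELS
    "evidence_spans": [(1480, 1592)],                # character offsets of
                                                     # supporting sentences
}

assert record["label"] in LABELS
```

Because the same 17 hypotheses recur across every contract, records like this can be grouped by hypothesis to compare how different contracts treat the same legal question.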
2. Methodological Innovations: Span-based Multi-label Evidence Identification
A major contribution in ContractNLI is recasting evidence identification as a multi-label classification problem over pre-identified candidate spans, as opposed to the conventional start–end token prediction used in QA-style reading comprehension. In the baseline Span NLI BERT method, every candidate span (sentence or list item) is marked by introducing [SPAN] tokens into the input. The model architecture uses BERT to encode the entire contract, with a multi-layer perceptron (MLP) and sigmoid activation atop the encoded [SPAN] tokens to assign evidence probabilities for each hypothesis–span pair (Koreeda et al., 2021).
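The span-marking step can be sketched as follows. This is a simplified illustration of the input construction described for Span NLI BERT, not the reference implementation; in practice the marker would be registered as a special token with the tokenizer:

```python
def insert_span_tokens(spans, span_token="[SPAN]"):
    """Prefix each candidate span (sentence or list item) with a marker token.

    `spans` is a list of token lists. Returns the flat token sequence plus the
    index of each [SPAN] marker, where the encoder's output vectors would be
    read off to predict per-span evidence probabilities.
    """
    tokens = []
    span_positions = []  # index of each [SPAN] token in the flat sequence
    for span in spans:
        span_positions.append(len(tokens))
        tokens.append(span_token)
        tokens.extend(span)
    return tokens, span_positions

spans = [["Recipient", "shall", "not", "disclose", "."],
         ["Obligations", "survive", "termination", "."]]
tokens, positions = insert_span_tokens(spans)
```

An MLP with sigmoid activation applied at each marked position then yields one independent evidence probability per span, which is what makes the formulation multi-label rather than single-answer extraction.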
Dynamic context segmentation is incorporated to address input length constraints and preserve evidence integrity: overlapping context windows are selected by a stride algorithm ensuring that each span and its immediate context are wholly included in at least one input segment. At inference time, span-level predictions are aggregated (by averaging) across all covering contexts, while NLI label decisions aggregate per-context predictions weighted by evidence confidence.
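The windowing and averaging logic can be sketched in a few lines. This simplified version uses a fixed stride and does not reproduce the paper's exact span-preserving window placement, but it shows the aggregation idea:

```python
from collections import defaultdict

def make_windows(n_tokens, max_len, stride):
    """Overlapping context windows of at most max_len tokens, advanced by
    stride, covering the full document (simplified fixed-stride sketch)."""
    windows, start = [], 0
    while True:
        windows.append((start, min(start + max_len, n_tokens)))
        if start + max_len >= n_tokens:
            break
        start += stride
    return windows

def aggregate_span_probs(window_preds):
    """Average each span's evidence probability over all windows covering it.

    window_preds: one dict {span_id: probability} per context window; spans
    absent from a window simply contribute nothing for that window.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for preds in window_preds:
        for span_id, p in preds.items():
            sums[span_id] += p
            counts[span_id] += 1
    return {sid: sums[sid] / counts[sid] for sid in sums}
```

For NLI labels, the same per-window predictions would additionally be weighted by each window's evidence confidence before aggregation, as described above.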
The multi-task loss combines cross-entropy for NLI labeling (on [CLS] token representations) and binary cross-entropy for span evidence identification, with a weight hyperparameter balancing the two objectives:
$$\mathcal{L} \;=\; \mathcal{L}_{\mathrm{NLI}} \;+\; \lambda\,\mathcal{L}_{\mathrm{span}},$$
where
$$\mathcal{L}_{\mathrm{NLI}} = -\log p\big(y \mid \mathbf{h}_{[\mathrm{CLS}]}\big), \qquad \mathcal{L}_{\mathrm{span}} = -\sum_{i}\Big[\,y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\Big],$$
and $y_i \in \{0,1\}$ is the binary ground truth for evidence span $i$.
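A pure-Python sketch of the loss described above (cross-entropy on the gold NLI label plus binary cross-entropy over spans, balanced by a weight hyperparameter) makes the two terms explicit:

```python
import math

def multitask_loss(nli_logprob, span_probs, span_labels, lam=1.0):
    """Multi-task loss sketch: L = L_NLI + lam * L_span.

    nli_logprob: log-probability the model assigns to the gold NLI label
                 (computed from the [CLS] representation in the real model).
    span_probs:  per-span evidence probabilities in (0, 1).
    span_labels: {0, 1} ground-truth evidence flags, aligned with span_probs.
    lam:         weight balancing the two objectives.
    """
    l_nli = -nli_logprob
    l_span = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                  for p, y in zip(span_probs, span_labels))
    return l_nli + lam * l_span
```

In a real training loop both terms would of course be computed with a deep-learning framework's numerically stable primitives rather than raw `math.log`.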
3. Linguistic Phenomena and Sources of Complexity
ContractNLI identifies several linguistic phenomena that substantially increase task difficulty, most notably “negation by exception.” In contracts, general prohibitions are often locally or non-locally overridden by exception clauses, flipping their logical force:
- Local exception: “Recipient shall not disclose Confidential Information except to its employees…”
- Non-local exception: General rule and its exception split across different sections
This structure complicates both entailment and evidence selection, as systems must handle discontinuous and noncontiguous spans, dependency across sections, and intricate modality (obligation, prohibition, permission). Additional complexity arises from dense cross-referencing, specialized legal terms, and the prevalence of implicit knowledge and defaults.
4. Empirical Evaluation and Performance Metrics
Span NLI BERT outperforms classical baselines (majority class, TF-IDF + SVM, SQuAD-style QA) on both NLI classification and evidence identification:
- Mean average precision (mAP) for evidence identification: ~0.885 with BERT_base and ~0.922 with BERT_large
- Overall document-level NLI accuracy (BERT_large): ~87.5%
- Controlled oracle experiments show that providing accurate evidence spans notably improves NLI classification, especially for Contradiction cases (Koreeda et al., 2021).
The results underscore the necessity of robust evidence extraction; performance is strongly tied to the system’s ability to localize relevant spans, particularly in the face of discontinuity and cross-section dependencies.
Legal-domain pretraining (e.g., DeBERTa v2_xlarge, models fine-tuned on contract/case law corpora) provides incremental gains but is less impactful than system-level innovations—specifically, the explicit multi-label span modeling and dynamic segmentation. The greatest improvement in contradiction identification arises from perfect (oracle) evidence provision, confirming the intertwined nature of evidence and label prediction.
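The mAP metric used for evidence identification can be computed with the standard ranking-based definition. This is a generic sketch, not the paper's evaluation script:

```python
def average_precision(ranked_relevance):
    """AP for one (contract, hypothesis) evidence ranking.

    ranked_relevance: list of 0/1 flags, best-scored candidate span first;
    1 marks an annotated evidence span.
    """
    hits, ap = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            ap += hits / rank        # precision at each relevant rank
    return ap / hits if hits else 0.0

def mean_average_precision(rankings):
    """mAP over all (contract, hypothesis) pairs with annotated evidence."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

A system that ranks all annotated spans above all non-evidence spans scores an AP of 1.0 for that pair, so mAP directly rewards the span localization that the oracle experiments show to be critical.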
5. Related Developments: Generalization, Zero-Shot Methods, and Retrieval Aggregation
Subsequent work (Schuster et al., 2022) demonstrates that state-of-the-art sentence-pair NLI models (SENTLI) can be applied to ContractNLI using retrieve-and-aggregate methods. These systems split the contract into sentences and score each against the hypothesis, applying aggregation heuristics across sentences (e.g., maximal entailment score). Advanced pipelines, such as retrieve-and-rerank, concatenate top-ranked spans and reapply the NLI model to support nuanced evidence-based judgments. Critically, NLI model scores prove to be superior retrieval signals relative to cosine similarity or dense embeddings, ranking an annotated evidence span at the top in ~61% of ContractNLI cases.
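The simplest aggregation heuristic mentioned above can be sketched as follows. This shows only the max-score rule with an illustrative decision threshold; the actual SENTLI pipeline uses trained aggregation rather than this fixed heuristic:

```python
def aggregate_nli(sentence_scores, threshold=0.5):
    """Max-aggregation over per-sentence NLI scores (illustrative sketch).

    sentence_scores: one dict per contract sentence with 'entailment',
    'contradiction', and 'neutral' probabilities against the hypothesis.
    threshold: hypothetical cutoff below which nothing counts as mentioned.
    """
    best_e = max(s["entailment"] for s in sentence_scores)
    best_c = max(s["contradiction"] for s in sentence_scores)
    if best_e < threshold and best_c < threshold:
        return "NotMentioned"
    return "Entailment" if best_e >= best_c else "Contradiction"
```

The sentence achieving the maximal score doubles as the retrieved evidence, which is why NLI scores serve as a retrieval signal in this framework.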
However, performance declines sharply when document length exceeds NLI model limits and when the evidence is diffuse or heavily cross-referenced, highlighting continued challenges in input-level scaling.
6. Position within the Legal NLP and Law+AI Landscape
ContractNLI occupies a central role in legal AI research by:
- Establishing a realistic, evidence-based NLI benchmark reflective of actual contract review workflows
- Exposing the limitations of both standard pre-trained models and naive retrieval-based methods in the legal context
- Driving innovations in document segmentation, multi-span reasoning, and multi-task learning
- Facilitating downstream advances, such as retrieval-augmented generation (LegalBench-RAG (Pipitone et al., 19 Aug 2024)) and graph-augmented LLMs for contract risk evaluation (Zheng et al., 2023)
The dataset’s architecture, especially the fixed set of hypotheses and span-level evidence, supports direct benchmarking and method comparison, while providing a resource for probing complex legal logic and cross-document language phenomena. Emerging retrieval benchmarks (e.g., ACORD (Wang et al., 11 Jan 2025)) and clause-level extraction tasks (ContractEval (Liu et al., 5 Aug 2025)) build directly on the lessons and methodologies established by ContractNLI.
7. Future Directions and Open Challenges
Persistent deficits remain in several dimensions:
- Contradiction detection lags substantially behind Entailment and NotMentioned, particularly where negation by exception or cross-sectional logic is involved.
- The current approach is largely confined to NDAs; generalization to diverse contract types and regulatory regimes is an ongoing effort.
- Simultaneous modeling of multiple interacting evidence spans, better use of the hypotheses' linguistic structure, and more sophisticated context aggregation remain active research areas.
A plausible implication is that future ContractNLI systems will require dynamic, contextually aware retrieval augmented by structured legal knowledge (e.g., term–definition relations, as in ConReader (Xu et al., 2022)) and adaptable aggregation strategies to handle both the depth and breadth of legal reasoning found in practical contract review.
In sum, ContractNLI defines the foundational problem of evidence-driven, document-level legal NLI. It offers both a high-quality dataset and a methodological baseline that catalyze continued research into explainable, robust, and contextually precise automation of legal contract analysis.