
AugAbEx : Way Forward for Extractive Case Summarization (2511.12290v1)

Published 15 Nov 2025 in cs.CL

Abstract: Summarization of legal judgments poses a heavy cognitive burden on law practitioners due to the complexity of the language, context-sensitive legal jargon, and the length of the document. Therefore, the automatic summarization of legal documents has attracted serious attention from natural language processing researchers. Since the abstractive summaries of legal documents generated by deep neural methods remain prone to the risk of misrepresenting nuanced legal jargon or overlooking key contextual details, we envisage a rising trend toward the use of extractive case summarizers. Given the high cost of human annotation for gold standard extractive summaries, we engineer a light and transparent pipeline that leverages existing abstractive gold standard summaries to create the corresponding extractive gold standard versions. The approach ensures that the experts' opinions ensconced in the original gold standard abstractive summaries are carried over to the transformed extractive summaries. We aim to augment seven existing case summarization datasets, which include abstractive summaries, by incorporating corresponding extractive summaries, creating an enriched data resource for the case summarization research community. To ensure the quality of the augmented extractive summaries, we perform an extensive comparative evaluation with the original abstractive gold standard summaries covering structural, lexical, and semantic dimensions. We also compare the domain-level information of the two summaries. We commit to releasing the augmented datasets in the public domain for use by the research community and believe that the resource will offer opportunities to advance the field of automatic summarization of legal documents.

Summary

  • The paper introduces an abstractive-to-extractive transformation pipeline that converts human-annotated abstractive summaries into high-fidelity extractive case summaries.
  • It utilizes candidate sentence selection with ROUGE metrics and maximal marginal relevance to balance relevance and diversity in legal content.
  • Empirical results across seven datasets show that AugAbEx outperforms unsupervised baselines in legal entity retention, semantic similarity, and structural alignment.

Extractive Case Summarization via Abstractive-to-Extractive Transformation: The AugAbEx Framework

Motivation and Context

Legal judgment summarization poses a significant challenge due to the inherent complexity of legal language, the necessity of domain knowledge, and the requisite precision for downstream use by practitioners. While progress in generic text summarization has been dominated by abstractive methods, legal summarization introduces risk factors such as hallucinations and misinterpretation of legal entities, which are highly problematic in this domain. There is therefore increasing interest in extractive summarization, which, by construction, better preserves legal fidelity and transparency. However, the limited availability of gold-standard extractive summaries has been a critical bottleneck, due to the high annotation cost and expert time required.

The paper introduces AugAbEx, a pipeline for transforming existing human-annotated abstractive summaries present in prominent legal case datasets into aligned extractive gold-standard summaries. The proposed method is both scalable and transparent, ensuring the preservation of expert salience judgments while producing resource-efficient extractive references. The approach is tested across seven datasets spanning multiple jurisdictions and evaluation is carried out along multiple dimensions: legal entity preservation, lexical, semantic, and structural similarity.

Abstractive-to-Extractive Transformation Pipeline

The transformation pipeline comprises two main stages: candidate sentence selection and summary synthesis via maximal marginal relevance (MMR).

First, for each sentence in the original abstractive summary (OAG), the top-k most similar sentences from the source judgment document are identified using averaged ROUGE-1, ROUGE-2, and ROUGE-L metrics. These high-overlap candidates populate the extractive pool while maintaining fidelity to expert-labeled salience inherent in the OAG.
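The candidate-selection stage can be sketched as follows, assuming simplified ROUGE scores computed from raw token overlap; the function names (`avg_rouge`, `top_k_candidates`) are illustrative and not drawn from the paper's released code.

```python
# Sketch of candidate selection: for one OAG sentence, score every source
# sentence by averaged ROUGE-1/ROUGE-2/ROUGE-L F1 and keep the top k.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def f1(overlap, len_hyp, len_ref):
    if overlap == 0:
        return 0.0
    p, r = overlap / len_hyp, overlap / len_ref
    return 2 * p * r / (p + r)

def rouge_n(hyp, ref, n):
    h, r = ngrams(hyp, n), ngrams(ref, n)
    overlap = sum((h & r).values())
    return f1(overlap, max(sum(h.values()), 1), max(sum(r.values()), 1))

def lcs_len(a, b):
    # classic dynamic-programming longest common subsequence
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(hyp, ref):
    return f1(lcs_len(hyp, ref), max(len(hyp), 1), max(len(ref), 1))

def avg_rouge(hyp, ref):
    hyp, ref = hyp.lower().split(), ref.lower().split()
    return (rouge_n(hyp, ref, 1) + rouge_n(hyp, ref, 2) + rouge_l(hyp, ref)) / 3

def top_k_candidates(abs_sentence, source_sentences, k=2):
    """Return the k source sentences most similar to one OAG sentence."""
    ranked = sorted(source_sentences, key=lambda s: avg_rouge(s, abs_sentence),
                    reverse=True)
    return ranked[:k]
```

In a real implementation one would use a standard ROUGE package with stemming and proper tokenization; the greedy per-sentence top-k lookup shown here only illustrates how the extractive pool is populated.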

Second, the candidate pool is pruned using MMR to maximize relevance to the summary intent (approximated by similarity to the OAG and parent document), while simultaneously minimizing redundancy by penalizing overlap with already-selected summary sentences. λ is fixed at 0.5 to balance relevance and diversity. The result is a Transformed Extractive Gold (TEG) summary that maintains length and information density comparable to its abstractive counterpart.

Figure 1: Pipeline to transform original abstractive gold (OAG) summary to transformed extractive gold (TEG) summary.
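The MMR pruning stage can be sketched as a greedy loop; `similarity` here is a placeholder for whatever ROUGE- or embedding-based similarity the pipeline uses, and the function name `mmr_select` is ours.

```python
# Greedy maximal marginal relevance: trade off relevance to the query
# (the OAG summary) against redundancy with sentences picked so far.
def mmr_select(candidates, query, similarity, num_sentences, lam=0.5):
    selected = []
    pool = list(candidates)
    while pool and len(selected) < num_sentences:
        def score(s):
            relevance = similarity(s, query)
            redundancy = max((similarity(s, t) for t in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

With λ = 0.5, a candidate that is highly similar to an already-selected sentence is penalized as strongly as it is rewarded for relevance, which is how near-duplicate source sentences pulled in by the top-k stage get filtered out.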

Automatic Evaluation Framework

A comprehensive evaluation framework, depicted in Figure 2, is put forth to critically assess the quality of the TEG summaries relative to both the OAG and alternative unsupervised extractive summaries. Comparison is multi-faceted:

  • Domain attributes: LegalNER-based entity counts and recall over referenced provisions measure domain informativeness.
  • Semantic attributes: LSA-based topic overlap, LegalBert embedding cosine similarity, and semantic proximity to the source document.
  • Lexical attributes: Vocabulary overlap (ROUGE), Jensen-Shannon distance on term distributions, and Kullback-Leibler divergence with respect to the parent document.
  • Structural attributes: Length-based (word and sentence), and Flesch-Kincaid reading ease.
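The lexical comparisons above can be illustrated with a small sketch of Jensen-Shannon distance and KL divergence over unigram term distributions; the smoothing constant is our own choice, not something the paper specifies.

```python
# Term-distribution comparison: KL divergence and Jensen-Shannon distance
# over unigram counts, with additive smoothing to keep KL finite.
import math
from collections import Counter

def term_dist(text, vocab, eps=1e-9):
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + eps * len(vocab)
    return [(counts[w] + eps) / total for w in vocab]

def kl(p, q):
    # KL divergence in bits; defined here only for strictly positive q
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(text_a, text_b):
    vocab = sorted(set(text_a.lower().split()) | set(text_b.lower().split()))
    p, q = term_dist(text_a, vocab), term_dist(text_b, vocab)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))
```

Identical texts give a distance near 0 and fully disjoint vocabularies approach 1 (in base 2), which is what makes the metric convenient for comparing TEG and OAG term distributions on a common scale.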

Many-to-many and instance-level comparative assessments are carried out via the Bradley-Terry model, addressing limitations of mean/median aggregation that often obscure instance-wise strengths.

Figure 2: Automatic Evaluation Framework for Transformed Extractive Gold Summary.
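A generic Bradley-Terry fit via the classic minorization-maximization (Zermelo) iteration looks as follows; this is a sketch of the general technique, not the paper's exact estimation code.

```python
# Bradley-Terry strength estimation from pairwise win counts.
def bradley_terry(wins, iters=100):
    """wins[i][j] = number of comparisons where system i beat system j.
    Returns strength scores normalised to sum to 1."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins of system i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(w_i / denom if denom > 0 else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]
    return p
```

Given per-instance wins between TEG, OAG, and baseline summaries, the fitted strengths induce a global ranking that respects instance-level outcomes rather than averaged scores.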

Dataset Coverage and Statistical Diversity

The approach is systematically applied to seven datasets, including IN-Jud-Cit, ILC, IN-Abs, CivilSum, UK-Abs, Australian, and BillSum. These datasets vary widely in jurisdiction, document length, summary form (sentential vs. phrasal), and compression ratio, which establishes the generality and robustness of the proposed methodology.

Empirical Analysis

Domain Attribute Retention

TEG summaries are generally found to match or surpass OAG summaries in legal entity density, with notable exceptions in datasets featuring phrasal OAGs (CivilSum, Australian). Provision recall in TEG is maximized at k = 2 candidate sentences, indicating optimal coverage of critical legal content with minimal redundancy.

Figure 3: Comparison of macro-averaged recall score of provisions in the transformed extractive gold summaries for varying numbers of candidate sentences.

Semantic Alignment

TEG summaries achieve consistently high semantic similarity to OAG across both LSA and LegalBert embedding spaces, with LegalBert-derived metrics in excess of 0.95 for the majority of datasets. Semantic alignment to the full case document is also high, demonstrating that TEGs encapsulate core document content as effectively as their abstractive references, even across jurisdictional and format divergences.

Lexical and Structural Comparisons

Lexical overlap via ROUGE is high except where OAG is phrasal (CivilSum, Australian). Jensen-Shannon distances confirm strong alignment of term distributions. Structural analysis indicates that TEG lengths closely track OAG lengths for most datasets, with longer extractive outputs in datasets where the reference is notably compressed or phrasal.

Comparative Analysis to LSA Baseline

Across all evaluation dimensions, AugAbEx TEG summaries outperform unsupervised LSA-based extractive baselines. Improvements are observed in domain-level provision recall, semantic similarity (both latent and embedding-based), lexical overlap, and readability. This supports the claim that leveraging expert-anchored abstractive summaries delivers extractive references of superior utility compared to generic, unsupervised approaches.

Case Study: Dataset Nuances

Analysis of IN-Abs and IN-Jud-Cit (for which TEGs perform especially well) and CivilSum/Australian (where phrasal structure impedes extractive alignment) highlights the dependence of transformation quality on OAG style. For datasets constructed via dense phrasal abstraction, the TEG pipeline’s sentential constraint yields longer summaries and lower lexical overlap, but still retains high semantic fidelity.

Figure 4: IN-Abs

Figure 5: IN-Jud-Cit

Human Evaluation

Expert annotator grading demonstrates high concordance with automatic metrics: the top-scoring TEG summaries are rated as equally informative as their OAG references, with a Pearson correlation of 0.886 between expert grade and ROUGE-L F1. This indicates that the evaluation framework is robust and serves as a proxy for real user utility.
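The correlation check above is a standard Pearson computation; the sketch below uses invented toy numbers purely for illustration, not the paper's data.

```python
# Pearson correlation between two score lists (e.g. expert grades
# vs. ROUGE-L F1 per summary).
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```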

Implications and Future Directions

AugAbEx introduces a scalable paradigm for curating high-quality extractive references informed by domain-expert abstraction. With the public release of the augmented datasets, this is likely to have tangible impact on future supervised and unsupervised extractive summarization methods in the legal domain. The paper empirically substantiates that extractive methods, when expertly anchored, excel in preserving legal fidelity and are preferred by practitioners.

The methodology is extensible to new datasets and jurisdictions, particularly those with limited extractive reference availability but containing high-value human abstraction. As the NLP field progresses toward more reliable, transparent, and domain-sensitive summarization, AugAbEx establishes a practical baseline. Future work should emphasize further legal entity enrichment in extraction strategies, domain adaptation for NER tools, and broader human-in-the-loop evaluations for cross-jurisdictional transfer.

Conclusion

AugAbEx addresses a fundamental bottleneck in legal case summarization by engineering a transparent, low-cost, and statistically robust pipeline to generate aligned extractive gold summaries using existing human-annotated abstractive resources. The transformed summaries exhibit strong numerical alignment with their abstractive counterparts in all salient evaluation dimensions, surpass LSA-extracted baselines, and score favorably in human evaluation. This resource addresses the legal NLP community’s data needs and sets a new benchmark for future modeling and evaluation. Extractive algorithms should further advance context- and entity-sensitive sentence selection to approach the information condensation prowess of human-annotated summaries.
