
BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization (1906.03741v1)

Published 10 Jun 2019 in cs.CL and cs.LG

Abstract: Most existing text summarization datasets are compiled from the news domain, where summaries have a flattened discourse structure. In such datasets, summary-worthy content often appears in the beginning of input articles. Moreover, large segments from input articles are present verbatim in their respective summaries. These issues impede the learning and evaluation of systems that can understand an article's global content structure as well as produce abstractive summaries with high compression ratio. In this work, we present a novel dataset, BIGPATENT, consisting of 1.3 million records of U.S. patent documents along with human written abstractive summaries. Compared to existing summarization datasets, BIGPATENT has the following properties: i) summaries contain a richer discourse structure with more recurring entities, ii) salient content is evenly distributed in the input, and iii) lesser and shorter extractive fragments are present in the summaries. Finally, we train and evaluate baselines and popular learning models on BIGPATENT to shed light on new challenges and motivate future directions for summarization research.

Analyzing BigPatent: Advancements in Data for Abstractive Text Summarization Research

The paper "BigPatent: A Large-Scale Dataset for Abstractive and Coherent Summarization" by Eva Sharma, Chen Li, and Lu Wang presents a dataset aimed at addressing notable shortcomings in existing text summarization datasets. BigPatent stands out for its emphasis on pushing models beyond extractive summarization and fostering the development of abstractive methods that capture an input's global content structure.

Key Characteristics and Contributions

In contrast to most available summarization datasets, especially those derived from news articles, BigPatent offers a collection of 1.3 million U.S. patent documents along with human-written abstractive summaries. News-derived datasets typically suffer from a flattened narrative structure and a high proportion of extractive fragments within summaries. BigPatent addresses these issues by providing summaries with a richer discourse structure and fewer extractive fragments. Because salient content is distributed more uniformly across the input text, the dataset compels models to understand the overarching structure of the documents they summarize.

Salient Features of BigPatent Include:

  • Complex Discourse Structures: The patent summaries display recurring entities through multiple sentences, indicating a need for models to handle dependencies more adeptly than in less structured summaries from news articles.
  • Uniform Distribution of Salient Content: Salient information in the text is spread widely rather than predominantly localized at the beginning, posing a challenge to models that rely heavily on positional bias for extracting meaningful content.
  • Higher Compression and Abstractiveness: With a notable degree of summary abstraction and fewer extractive fragments, this dataset challenges models to move away from simple extraction.
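The abstractiveness properties above are commonly quantified with extractive fragment statistics: greedily align each summary position with the longest span that appears verbatim in the source, then aggregate. The sketch below is an illustrative implementation of that idea (in the spirit of the coverage/density measures used in summarization dataset analyses); the function names and the quadratic matching loop are our own simplification, not the paper's exact procedure.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily find, for each summary position, the longest fragment
    that also appears verbatim in the article."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0
        for j in range(len(article_tokens)):
            if summary_tokens[i] == article_tokens[j]:
                k = 0
                while (i + k < len(summary_tokens)
                       and j + k < len(article_tokens)
                       and summary_tokens[i + k] == article_tokens[j + k]):
                    k += 1
                best = max(best, k)
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
            i += best
        else:
            i += 1
    return fragments

def coverage_and_density(article_tokens, summary_tokens):
    """Coverage: fraction of summary tokens copied from the article.
    Density: average squared fragment length, high for long copied spans."""
    frags = extractive_fragments(article_tokens, summary_tokens)
    n = len(summary_tokens)
    coverage = sum(len(f) for f in frags) / n
    density = sum(len(f) ** 2 for f in frags) / n
    return coverage, density
```

A dataset like BigPatent, with "lesser and shorter extractive fragments," would show low density under such a measure even when coverage is moderate, because copied spans are short.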

Evaluation and Observations

The paper benchmarks several popular abstractive summarization models on BigPatent, contrasting the results with those achieved on existing news datasets such as CNN/Daily Mail and the New York Times corpus. The research finds that state-of-the-art systems often perform worse on BigPatent due to its distinctive challenges:

  • Performance Metrics: Many models demonstrate lower ROUGE scores on BigPatent, suggesting a potential gap in handling the dataset's complexities compared to more conventional datasets.
  • Entity Handling: Although model-generated summaries are more abstractive, the models tend to repeat entities and fabricate content, highlighting limitations in current approaches to abstractive generation and semantic coherence.
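The ROUGE comparisons above reduce to n-gram overlap between a generated summary and its reference. As a minimal sketch, here is a simplified ROUGE-1 F1 on whitespace tokens; real evaluations use the official ROUGE toolkit with stemming and multiple n-gram orders, so this function is illustrative only.

```python
from collections import Counter

def rouge_1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: unigram overlap between a reference
    summary and a candidate, both whitespace-tokenized and lowercased."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Because ROUGE rewards lexical overlap, highly abstractive references such as patent abstracts naturally depress scores for systems tuned on extractive news data, which is consistent with the performance gap the paper reports.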

Implications for Future Research

The introduction of BigPatent underscores critical areas for advancement in text generation systems, notably coherent discourse representation and richer content modeling. The dataset sets a new benchmark for abstractiveness and for handling distributed salient content, motivating novel architectures and content-planning strategies. Its inherent challenges steer research toward models that can understand and generate complex, structured content.

Theoretical and Practical Perspectives

The theoretical implications involve pushing neural summarization models to comprehend long-range dependencies and deeper semantic connections, ultimately fostering innovations that can bridge the gaps observed in current performance metrics. Practically, BigPatent's availability aids the development of applications capable of generating meaningful summaries across diverse domains beyond the news sector, such as in legal and patent analysis.

In conclusion, BigPatent represents a significant advancement in resources available for abstractive summarization research, catalyzing future research towards more robust, coherent, and contextually-aware summarization systems. The insights drawn from model evaluations on this dataset also point to critical developments necessary for deploying summarization solutions in increasingly complex and varied contexts.
