Long Text and Multi-Table Summarization: Dataset and Method
Automatic document summarization is a core natural language processing task that aims to generate concise summaries capturing the most salient information in the input documents. While traditional methods focus primarily on text, documents such as financial and technical reports often carry important information in tables. This paper addresses the limitations of existing datasets and methods by introducing FINDSum, the first large-scale dataset for long text and multi-table summarization, and by presenting a comprehensive approach for summarizing such documents.
FINDSum Dataset
The FINDSum dataset is built from 21,125 annual reports of 3,794 companies and comprises two subsets: FINDSum-ROO (Results of Operations) and FINDSum-Liquidity. Each report contains tens of thousands of words along with multiple tables. The target summaries are unusually rich in numerical values; these values are sparsely distributed in the input text, and many appear only in the tables rather than in the running text. This makes the dataset particularly challenging and valuable for developing and evaluating summarization models that must integrate textual and tabular data.
Summarization Approach
The paper proposes a structured solution comprising three main steps: data preprocessing, content selection, and summarization. The approach uses distinct methods to handle textual and tabular data, aiming to overcome the challenges posed by long documents and heterogeneous content types.
- Data Preprocessing: The text and tables in each report are extracted and processed separately. Noise such as special characters is removed from the text, and each table is converted into tuples of row name, column name, cell value, and position so that contextual information is preserved.
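The table-to-tuple conversion above can be sketched as follows. This is an illustrative helper, not the authors' code, and it assumes a simple table layout in which the first row holds column names and the first column holds row names:

```python
# Hypothetical sketch of the tuple extraction step: each non-empty cell becomes
# a (row name, column name, value, position) tuple, preserving table context.

def table_to_tuples(table):
    """table: list of rows; row 0 holds column names, column 0 holds row names."""
    col_names = table[0][1:]
    tuples = []
    for i, row in enumerate(table[1:], start=1):
        row_name = row[0]
        for j, value in enumerate(row[1:], start=1):
            if value not in ("", None):
                tuples.append((row_name, col_names[j - 1], value, (i, j)))
    return tuples

table = [
    ["",           "2020",  "2019"],
    ["Revenue",    "1,200", "1,050"],
    ["Net income", "310",   "275"],
]
print(table_to_tuples(table))
```

The position component lets later stages recover where in the table a value came from, which the content-selection step can exploit as a feature.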
- Content Selection: A vital intermediate step compresses the input while preserving salient content. The Maximum Marginal Recall Gain (MMRG) method is used for text selection, while salient tuples from tables are selected by an XGBoost classifier trained on positional features and GloVe embeddings. This step keeps the input within the summarizer's length limits and improves model efficiency.
- Summarization Methods:
- Generate-and-Combine (GC): Generates separate summaries for the text and the table content, then concatenates them. Its main limitation is that the two summaries are produced independently, so their content cannot be merged or adapted to each other.
- Combine-and-Generate (CG): Combines selected text segments and tuples into a single input for the summarizer to generate an integrated summary. This method ensures that the summarizer considers both data types during generation.
- Generate-Combine-and-Generate (GCG): Includes an additional tuple-to-text generation step before summarizing the combined textual input. While this can lead to some information loss, it simplifies the summarizer's task.
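As one concrete illustration, the input construction for the CG variant might look like the sketch below. The serialization format (separator tokens, field order) is an assumption for illustration, not the paper's exact scheme:

```python
# Illustrative sketch of Combine-and-Generate input construction: selected text
# segments and linearized table tuples are joined into one sequence that a
# seq2seq summarizer would consume. Formatting choices here are assumptions.

def build_cg_input(text_segments, tuples, sep=" <tuple> "):
    linearized = sep.join(
        f"{row} | {col} | {val}" for row, col, val, _pos in tuples
    )
    return " ".join(text_segments) + " <table> " + linearized

segments = ["Revenue increased due to higher product sales."]
tuples = [("Revenue", "2020", "1,200", (1, 1)), ("Revenue", "2019", "1,050", (1, 2))]
print(build_cg_input(segments, tuples))
```

Feeding one combined sequence lets the summarizer attend across both modalities at generation time, which is what distinguishes CG from the concatenation-only GC approach.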
Evaluation Metrics
To assess the performance of summarization models on FINDSum, the paper introduces several evaluation metrics to gauge the integration of numerical information:
- Number Precision (NP): Measures the accuracy of numerical values in the generated summary compared to the target.
- Number Coverage (NC): Evaluates the extent to which the generated summary covers numerical values present in both the input and the target summary.
- Number Selection (NS): The harmonic mean of NP and NC, providing a balanced assessment of how well the model incorporates numerical data.
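The three metrics above can be sketched in a few lines. This is a simplified set-based version that compares the generated summary against the target only; the paper's exact matching rules (normalization, duplicate handling, and the role of the input numbers in NC) may differ:

```python
import re

# Hedged sketch of the number-focused metrics: number precision (NP), number
# coverage (NC), and their harmonic mean (NS), computed over sets of number
# strings. Simplified relative to the paper's definitions.

def numbers(text):
    # Matches integers, comma-grouped numbers, and decimals, e.g. 1,200 or 3.5
    return set(re.findall(r"\d+(?:,\d{3})*(?:\.\d+)?", text))

def number_metrics(generated, target):
    gen, tgt = numbers(generated), numbers(target)
    np_ = len(gen & tgt) / len(gen) if gen else 0.0
    nc = len(gen & tgt) / len(tgt) if tgt else 0.0
    ns = 2 * np_ * nc / (np_ + nc) if np_ + nc else 0.0
    return np_, nc, ns

gen = "Revenue was 1,200 million in 2020, up from 1,050."
tgt = "Revenue rose to 1,200 million in 2020; net income was 310."
print(number_metrics(gen, tgt))
```

In this toy example both summaries share the numbers 1,200 and 2020, so precision and coverage are each 2/3 and NS equals their harmonic mean.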
Experimental Results
Various advanced extractive and abstractive summarizers were benchmarked on FINDSum. Abstractive models such as BART and PEGASUS, and sparse-attention models such as LED and BigBird-PEGASUS, outperformed traditional extractive methods like LexRank and TextRank. Among the proposed methods, the CG and GCG approaches performed best by effectively integrating text and table data, achieving higher ROUGE, NP, NC, and NS scores. In particular, the CG method outperformed the others on the FINDSum-ROO subset, reflecting that subset's heavy reliance on tables.
Implications and Future Work
The introduction of FINDSum and the proposed methods mark significant progress in the domain of document summarization, particularly for content-rich reports that blend text and data tables. This research highlights the necessity of integrating heterogeneous data sources for effective summarization and paves the way for more sophisticated models capable of handling complex documents. Future work can explore further optimizations in model efficiency, more robust evaluation metrics for factual correctness and fidelity, and the application of these methods across diverse document types beyond financial reports.
Conclusion
The paper presents FINDSum, a pioneering dataset for long text and multi-table summarization, and a comprehensive methodology to address its inherent challenges. The methods introduced demonstrate the importance of joint consideration of textual and tabular data, establishing a foundation for further advancements in summarizing multifaceted documents. This work is a notable contribution to the fields of natural language processing and artificial intelligence, aiming to streamline information extraction and comprehension in complex, data-intensive documents.