Long Text and Multi-Table Summarization: Dataset and Method
Automatic document summarization is a core natural language processing task that aims to generate concise summaries capturing the most salient information in the input documents. While traditional methods focus primarily on text, documents such as financial and technical reports often carry important information in tables. This paper addresses the limitations of existing datasets and methods by introducing FINDSum, the first large-scale dataset for long text and multi-table summarization, and by presenting a comprehensive approach for summarizing such documents.
FINDSum Dataset
The FINDSum dataset is built from 21,125 annual reports of 3,794 companies and comprises two subsets: FINDSum-ROO (Results of Operations) and FINDSum-Liquidity. Each report contains tens of thousands of words along with multiple tables. The target summaries are unusually rich in numerical values; these values are sparsely distributed in the input text, and many appear only in the tables rather than in the running text. This makes the dataset particularly challenging and valuable for developing and evaluating summarization models that must integrate textual and tabular data.
Summarization Approach
The paper proposes a structured solution comprising three main steps: data preprocessing, content selection, and summarization. The approach uses distinct methods to handle textual and tabular data, aiming to overcome the challenges posed by long documents and heterogeneous content types.
- Data Preprocessing: The text and tables in each report are extracted and processed separately. Noise such as special characters is removed from the text, and each table is converted into tuples of row name, column name, cell value, and position so that contextual information is preserved.
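The table-to-tuple conversion above can be sketched as follows. This is an illustrative helper, not the authors' code, and it assumes a simple table layout in which the first row holds column names and the first column holds row names:

```python
# Hypothetical sketch of the tuple extraction step: each non-empty cell becomes
# a (row name, column name, value, position) tuple, preserving table context.

def table_to_tuples(table):
    """table: list of rows; row 0 holds column names, column 0 holds row names."""
    col_names = table[0][1:]
    tuples = []
    for i, row in enumerate(table[1:], start=1):
        row_name = row[0]
        for j, value in enumerate(row[1:], start=1):
            if value not in ("", None):
                tuples.append((row_name, col_names[j - 1], value, (i, j)))
    return tuples

table = [
    ["",           "2020",  "2019"],
    ["Revenue",    "1,200", "1,050"],
    ["Net income", "310",   "275"],
]
print(table_to_tuples(table))
```

The position component lets later stages recover where in the table a value came from, which the content-selection step can exploit as a feature.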
- Content Selection: A vital intermediate step compresses the input while preserving salient content. The Maximum Marginal Recall Gain (MMRG) method is used for text selection, while salient tuples from tables are selected by an XGBoost classifier trained on positional features and GloVe embeddings. This step keeps the input within the summarizer's length limits and improves model efficiency.
- Summarization Methods:
- Generate-and-Combine (GC): Generates separate summaries for the text and the table content, then concatenates them. Its main limitation is that the two summaries are produced independently, so their content cannot be merged or adapted to each other.
- Combine-and-Generate (CG): Combines selected text segments and tuples into a single input for the summarizer to generate an integrated summary. This method ensures that the summarizer considers both data types during generation.
- Generate-Combine-and-Generate (GCG): Includes an additional tuple-to-text generation step before summarizing the combined textual input. While this can lead to some information loss, it simplifies the summarizer's task.
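As one concrete illustration, the input construction for the CG variant might look like the sketch below. The serialization format (separator tokens, field order) is an assumption for illustration, not the paper's exact scheme:

```python
# Illustrative sketch of Combine-and-Generate input construction: selected text
# segments and linearized table tuples are joined into one sequence that a
# seq2seq summarizer would consume. Formatting choices here are assumptions.

def build_cg_input(text_segments, tuples, sep=" <tuple> "):
    linearized = sep.join(
        f"{row} | {col} | {val}" for row, col, val, _pos in tuples
    )
    return " ".join(text_segments) + " <table> " + linearized

segments = ["Revenue increased due to higher product sales."]
tuples = [("Revenue", "2020", "1,200", (1, 1)), ("Revenue", "2019", "1,050", (1, 2))]
print(build_cg_input(segments, tuples))
```

Feeding one combined sequence lets the summarizer attend across both modalities at generation time, which is what distinguishes CG from the concatenation-only GC approach.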
Evaluation Metrics
To assess the performance of summarization models on FINDSum, the paper introduces several evaluation metrics to gauge the integration of numerical information:
- Number Precision (NP): Measures the accuracy of numerical values in the generated summary compared to the target.
- Number Coverage (NC): Evaluates the extent to which the generated summary covers numerical values present in both the input and the target summary.
- Number Selection (NS): The harmonic mean of NP and NC, providing a balanced assessment of how well the model incorporates numerical data.
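The three metrics above can be sketched in a few lines. This is a simplified set-based version that compares the generated summary against the target only; the paper's exact matching rules (normalization, duplicate handling, and the role of the input numbers in NC) may differ:

```python
import re

# Hedged sketch of the number-focused metrics: number precision (NP), number
# coverage (NC), and their harmonic mean (NS), computed over sets of number
# strings. Simplified relative to the paper's definitions.

def numbers(text):
    # Matches integers, comma-grouped numbers, and decimals, e.g. 1,200 or 3.5
    return set(re.findall(r"\d+(?:,\d{3})*(?:\.\d+)?", text))

def number_metrics(generated, target):
    gen, tgt = numbers(generated), numbers(target)
    np_ = len(gen & tgt) / len(gen) if gen else 0.0
    nc = len(gen & tgt) / len(tgt) if tgt else 0.0
    ns = 2 * np_ * nc / (np_ + nc) if np_ + nc else 0.0
    return np_, nc, ns

gen = "Revenue was 1,200 million in 2020, up from 1,050."
tgt = "Revenue rose to 1,200 million in 2020; net income was 310."
print(number_metrics(gen, tgt))
```

In this toy example both summaries share the numbers 1,200 and 2020, so precision and coverage are each 2/3 and NS equals their harmonic mean.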
Experimental Results
Various advanced extractive and abstractive summarizers were benchmarked on FINDSum. Abstractive models such as BART and PEGASUS, and sparse-attention models such as LED and BigBird-PEGASUS, outperformed traditional extractive methods like LexRank and TextRank. Among the proposed methods, the CG and GCG approaches performed best by effectively integrating text and table data, achieving higher ROUGE, NP, NC, and NS scores. In particular, the CG method outperformed the others on the FINDSum-ROO subset, reflecting that subset's heavy reliance on tables.
Implications and Future Work
The introduction of FINDSum and the proposed methods mark significant progress in the domain of document summarization, particularly for content-rich reports that blend text and data tables. This research highlights the necessity of integrating heterogeneous data sources for effective summarization and paves the way for more sophisticated models capable of handling complex documents. Future work can explore further optimizations in model efficiency, more robust evaluation metrics for factual correctness and fidelity, and the application of these methods across diverse document types beyond financial reports.
Conclusion
The paper presents FINDSum, a pioneering dataset for long text and multi-table summarization, and a comprehensive methodology to address its inherent challenges. The methods introduced demonstrate the importance of joint consideration of textual and tabular data, establishing a foundation for further advancements in summarizing multifaceted documents. This work is a notable contribution to the fields of natural language processing and artificial intelligence, aiming to streamline information extraction and comprehension in complex, data-intensive documents.