Essay: Generating Wikipedia by Summarizing Long Sequences
In the paper "Generating Wikipedia by Summarizing Long Sequences", the authors investigate a method for generating English Wikipedia articles through multi-document summarization. This research converts the challenge of Wikipedia article creation into the task of summarizing and distilling information from multiple related documents.
Approach and Methodology
The authors propose a two-stage approach. The first stage is extractive: it identifies and ranks the most relevant passages from the collection of source documents. This coarse selection is essential because the raw input is far longer than what current abstractive models can attend over. The second stage uses a neural abstractive model that conditions on the extracted text to generate a coherent summary, writing new sentences rather than merely copying phrases from the source documents.
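At a high level, the pipeline can be sketched as below. This is an illustrative outline only, not the authors' code: extractor, abstractive_model, and the token budget are placeholder names, though the paper does experiment with budgets of a few hundred up to roughly 11,000 extracted tokens.

```python
def generate_wikipedia_article(title, source_docs, extractor, abstractive_model,
                               token_budget=11000):
    """Illustrative two-stage pipeline: extract the most relevant passages,
    then feed them (with the title) to a neural abstractive model.
    All names here are placeholders, not the authors' API."""
    # Split every source document into candidate paragraphs.
    paragraphs = [p for doc in source_docs for p in doc.split("\n\n") if p.strip()]

    # Stage 1: rank paragraphs by relevance to the title (e.g. tf-idf).
    ranked = extractor(title, paragraphs)

    # Keep the highest-ranked text up to a fixed token budget.
    selected, used = [], 0
    for paragraph in ranked:
        n_tokens = len(paragraph.split())
        if used + n_tokens > token_budget:
            break
        selected.append(paragraph)
        used += n_tokens

    # Stage 2: abstractive generation conditioned on the extracted text.
    return abstractive_model.generate(title, " ".join(selected))
```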
Extractive Summarization
To handle the extensive input data, different extractive methods were explored:
- Identity: Using the first portion of the input.
- tf-idf: Utilizing term frequency-inverse document frequency for relevance ranking.
- TextRank: A graph-based ranking algorithm (similar to PageRank) applied to sentences.
- SumBasic: A method leveraging word frequency for sentence selection.
- Cheating Method: A relevance score based on lexical overlap with the ground-truth article, included as an upper bound on extractive performance.
Each extractive method was evaluated on how well its condensed output served as input to the abstractive summarization model.
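As a concrete illustration, the tf-idf ranking can be approximated with scikit-learn as follows. This is a hedged sketch: the paper ranks paragraphs by tf-idf relevance to the article title, and cosine similarity over tf-idf vectors is one common way to realize that; it is not necessarily the exact scoring the authors used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_paragraphs_by_tfidf(title, paragraphs):
    """Rank candidate paragraphs by tf-idf cosine similarity to the
    article title, most relevant first. An approximation of the paper's
    tf-idf extractor, not its exact scoring function."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform([title] + paragraphs)
    scores = cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()
    return [paragraphs[i] for i in scores.argsort()[::-1]]

# Toy usage with made-up candidate paragraphs.
docs = ["The Transformer relies entirely on attention mechanisms.",
        "Recurrent networks process tokens sequentially.",
        "Attention allows modeling long-range dependencies."]
print(rank_paragraphs_by_tfidf("attention mechanisms", docs)[0])
```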
Abstractive Summarization
The paper introduces multiple model architectures to address the abstractive summarization stage:
- Seq2seq with attention (LSTM) served as a conventional baseline.
- Transformer Encoder-Decoder (T-ED), the original encoder-decoder Transformer, used as a strong non-recurrent baseline.
- Transformer Decoder-only (T-D), which drops the encoder and models the concatenated input and output as a single sequence.
- Transformer Decoder with Memory-Compressed Attention (T-DMCA), which incorporates local and memory-compressed attention for improved handling of long sequences.
The researchers' key modifications to the baseline Transformer are the move to a decoder-only model and the introduction of memory-compressed attention layers, which reduce the cost of self-attention by shortening the keys and values with strided convolutions. Combined with local attention, this allows the model to handle significantly longer input sequences while keeping memory and computation manageable.
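The sketch below illustrates the memory-compressed attention idea in PyTorch: keys and values are shortened along the sequence axis with a strided convolution (kernel size 3, stride 3 in the paper) before standard multi-head attention is applied. It is a simplified illustration under stated assumptions, not the authors' implementation; in particular it omits the causal masking and the local-attention layers that T-DMCA interleaves with compressed layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryCompressedAttention(nn.Module):
    """Sketch of memory-compressed attention: keys and values are
    compressed along the sequence axis with a strided convolution,
    shrinking the attention matrix by roughly the compression factor."""
    def __init__(self, d_model, num_heads=8, compression=3):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Strided 1-D convolution that shortens the key/value sequence.
        self.compress = nn.Conv1d(d_model, d_model,
                                  kernel_size=compression, stride=compression)
        self.num_heads = num_heads
        self.d_head = d_model // num_heads

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        b, n, d = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Compress keys and values along the sequence dimension.
        k = self.compress(k.transpose(1, 2)).transpose(1, 2)  # (b, ~n/3, d)
        v = self.compress(v.transpose(1, 2)).transpose(1, 2)

        def split_heads(t):
            return t.view(b, -1, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = F.softmax(scores, dim=-1)              # (b, heads, n, ~n/3)
        out = (attn @ v).transpose(1, 2).contiguous().view(b, n, d)
        return self.out_proj(out)

# Quick shape check on random data.
layer = MemoryCompressedAttention(d_model=512)
x = torch.randn(2, 96, 512)
print(layer(x).shape)  # torch.Size([2, 96, 512])
```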
Experimental Results
The models were benchmarked using ROUGE scores and perplexity:
- The combined corpus (cited references plus web search results) paired with tf-idf extraction gave the best performance.
- The T-DMCA model with a mixture-of-experts layer performed best among the abstractive models, reaching a log-perplexity of 1.90325 and a ROUGE-L F1 score of 38.8.
Evaluation focused on the quality of generated Wikipedia lead sections, where the Transformer-based models clearly outperformed the traditional seq2seq-with-attention baseline. The local and memory-compressed attention mechanisms in T-DMCA are what made very long input sequences tractable, which is critical when aggregating information from many diverse source documents.
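For reference, ROUGE-L F1 between a generated lead section and the ground truth can be computed with Google's rouge-score package. This is only an illustration of the metric with made-up strings; it is not necessarily the exact scoring script used in the paper.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "the transformer decoder generates the article lead from extracted paragraphs"
generated = "a transformer decoder writes the lead section from the extracted paragraphs"
result = scorer.score(reference, generated)["rougeL"]
# fmeasure is in [0, 1]; the paper reports ROUGE-L F1 scaled to 0-100.
print(f"ROUGE-L F1: {result.fmeasure:.3f}")
```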
Practical Implications
The proposed method offers a promising approach to automated creation of encyclopedic content, with potential applications wherever extensive information must be synthesized, such as academic literature summarization, report generation, and news aggregation.
Future Directions
This research points to several avenues for improving document summarization: learning a supervised model for relevance extraction rather than relying on hand-crafted scoring, and further improving the memory and computational efficiency of Transformer-based architectures so that even longer sequences can be handled.
Conclusion
The paper makes a noteworthy contribution to multi-document summarization and neural text generation. By adapting the Transformer architecture to very long inputs and demonstrating that it can generate fluent, coherent Wikipedia-style articles, it paves the way for future advances in automated text generation and for deploying summarization systems on large-scale datasets.