An Overview of "Z-Code++: A Pre-trained Language Model Optimized for Abstractive Summarization"
The paper "Z-Code++: A Pre-trained Language Model Optimized for Abstractive Summarization" introduces Z-Code++, an encoder-decoder language model pre-trained specifically for abstractive summarization. The model rests on three key techniques: two-phase pre-training, disentangled attention, and fusion-in-encoder for handling long input sequences.
Key Contributions and Techniques
- Two-Phase Pre-Training: Z-Code++ follows a two-phase pre-training regimen. In the first phase, the model is pre-trained on text corpora with two objectives: replaced token detection (RTD), which trains the model to tell original tokens from plausible substitutes, and corrupted span prediction (CSP), which trains it to reconstruct spans that have been masked out of the input. This phase targets general language understanding, and CSP can be viewed as a generalization of the gap-sentence generation objective used in prior models such as PEGASUS (a toy sketch of both objectives follows this list).
In the second phase, the model is continually trained on summarization-specific datasets. This grounded pre-training refines the model's ability to generate summaries that capture the gist of the source document while remaining succinct and coherent (a minimal continual-training loop is sketched after this list). The experiments reported in the paper show that this phase is especially beneficial in low-resource settings.
- Disentangled Attention (DA): Departing from standard self-attention, Z-Code++ uses disentangled attention, in which each word is represented by two vectors, one encoding its content and one encoding its (relative) position, and attention weights are computed from both. The authors argue that this makes the model more sensitive to both semantic content and word order, enabling more nuanced summarization (see the sketch after this list).
- Fusion-in-Encoder (FiE): To cope with long input sequences, Z-Code++ introduces FiE. The input is split into smaller chunks that are first processed with local attention, and the resulting representations are then fused with global attention across chunks. This hierarchical scheme is reported to handle long contexts more effectively than sparse attention mechanisms without sacrificing attention precision (a sketch follows below).
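To make the phase-1 objectives concrete, here is a minimal, self-contained Python sketch of how RTD and CSP training examples might be constructed. The whitespace tokenization, the sentinel format, the random substitutes standing in for an ELECTRA-style generator, and the function names (make_csp_example, make_rtd_example) are illustrative assumptions, not the paper's actual preprocessing code.

```python
import random

# Toy construction of phase-1 training examples on whitespace tokens.
# Function names, the sentinel format, and the random substitutes
# (standing in for an ELECTRA-style generator) are illustrative only.

def make_csp_example(tokens, span_len=3, mask_ratio=0.15, rng=random):
    """Corrupted span prediction: hide contiguous spans behind sentinel
    tokens; the decoder must reconstruct the hidden spans."""
    tokens = list(tokens)
    n_spans = max(1, int(len(tokens) * mask_ratio / span_len))
    targets = []
    for i in range(n_spans):
        start = rng.randrange(0, max(1, len(tokens) - span_len))
        sentinel = f"<extra_id_{i}>"
        targets.append([sentinel] + tokens[start:start + span_len])
        tokens[start:start + span_len] = [sentinel]
    decoder_target = [tok for group in targets for tok in group]
    return tokens, decoder_target        # corrupted input, reconstruction target

def make_rtd_example(tokens, vocab, replace_ratio=0.15, rng=random):
    """Replaced token detection: swap some tokens for substitutes; the
    model predicts, per position, whether the token was replaced."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_ratio:
            corrupted.append(rng.choice(vocab))
            labels.append(1)             # replaced
        else:
            corrupted.append(tok)
            labels.append(0)             # original
    return corrupted, labels

text = "the quick brown fox jumps over the lazy dog near the river bank".split()
print(make_csp_example(text))
print(make_rtd_example(text, vocab=["cat", "ran", "blue", "hill"]))
```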
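The grounded second phase amounts to continuing seq2seq training on (document, summary) pairs. The sketch below shows the shape of such a continual-training step; "t5-small" is only a stand-in encoder-decoder checkpoint (Z-Code++ weights are not assumed to be available here), and the single hand-written pair replaces the paper's large-scale summarization corpora.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "t5-small"   # stand-in for a phase-1 encoder-decoder checkpoint
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

# In the paper this would be a large collection of summarization corpora.
pairs = [
    ("The city council met on Tuesday to debate the new transit budget, "
     "with members split over fare increases and service cuts.",
     "Council members clash over transit budget."),
]

model.train()
for document, summary in pairs:
    enc = tok(document, return_tensors="pt", truncation=True, max_length=512)
    labels = tok(summary, return_tensors="pt", truncation=True).input_ids
    loss = model(**enc, labels=labels).loss   # standard seq2seq cross-entropy
    loss.backward()
    optim.step()
    optim.zero_grad()
```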
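The following NumPy sketch illustrates the disentangled-attention decomposition in the spirit of DeBERTa-style DA: attention logits are the sum of content-to-content, content-to-position, and position-to-content terms computed from separate content and relative-position projections. The single head, the tiny dimensions, the clamped relative-distance bucketing, and the 1/sqrt(3d) scaling are simplifications for illustration, not the paper's exact layer.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d, max_rel = 6, 8, 4                # tokens, hidden size, relative span

H = rng.normal(size=(seq_len, d))            # token content states
P = rng.normal(size=(2 * max_rel, d))        # relative position embeddings

Wq_c, Wk_c, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Wq_r, Wk_r = (rng.normal(size=(d, d)) for _ in range(2))

Qc, Kc, V = H @ Wq_c, H @ Wk_c, H @ Wv       # content projections
Qr, Kr = P @ Wq_r, P @ Wk_r                  # relative position projections

def rel_bucket(i, j):
    """Clamp the relative distance i - j into the range [0, 2 * max_rel)."""
    return int(np.clip(i - j + max_rel, 0, 2 * max_rel - 1))

scores = np.zeros((seq_len, seq_len))
for i in range(seq_len):
    for j in range(seq_len):
        c2c = Qc[i] @ Kc[j]                      # content-to-content
        c2p = Qc[i] @ Kr[rel_bucket(i, j)]       # content-to-position
        p2c = Kc[j] @ Qr[rel_bucket(j, i)]       # position-to-content
        scores[i, j] = (c2c + c2p + p2c) / np.sqrt(3 * d)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
context = weights @ V                            # attention output
```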
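Finally, a compact sketch of the fusion-in-encoder idea: local attention inside fixed-size chunks keeps the cost roughly linear in sequence length, and a global pass over the locally encoded states lets chunks exchange information. The chunk size, single head, and bare softmax-attention helper are illustrative choices rather than the paper's actual layer layout.

```python
import numpy as np

def self_attention(x, d):
    """Plain single-head scaled dot-product self-attention (no projections)."""
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
seq_len, d, chunk = 32, 16, 8
x = rng.normal(size=(seq_len, d))                # a long input sequence

# Local stage: attention is confined to each chunk, so cost grows with
# seq_len * chunk instead of seq_len ** 2.
local = np.concatenate(
    [self_attention(x[i:i + chunk], d) for i in range(0, seq_len, chunk)]
)

# Fusion stage: one global attention pass lets the chunks exchange information.
fused = self_attention(local, d)
print(fused.shape)                               # (32, 16)
```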
Experimental Evaluation
The empirical evaluation spans 13 summarization datasets across five languages, including English, German, and Spanish, and Z-Code++ establishes new state-of-the-art results on nine of these tasks. In zero-shot and few-shot settings the model is notably parameter-efficient: it outperforms far larger models such as PaLM (540B parameters) and GPT-3 (175B parameters).
Implications and Future Directions
Z-Code++ represents significant progress in pre-trained encoder-decoder language models, especially for abstractive summarization. Its ability to achieve high performance with far fewer parameters than its counterparts has practical implications for deploying such models in resource-constrained environments, such as mobile or embedded systems.
Theoretically, the use of disentangled attention and two-phase pre-training could inspire future innovations in encoder-decoder architectures, not limited to summarization but potentially extending to a broader range of natural language tasks.
Going forward, potential areas of exploration include refining grounded pre-training with more diverse data and studying its effect on reducing hallucination in generated summaries, an acknowledged limitation of current abstractive models. Additionally, integrating these methodologies into broader application domains remains an open question.
Overall, Z-Code++ is presented as a robust and efficient model that enriches the domain of abstractive summarization, setting a precedent for future endeavors in the field.