DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders (2106.13736v2)

Published 25 Jun 2021 in cs.CL

Abstract: While pretrained encoders have achieved success in various natural language understanding (NLU) tasks, there is a gap between these pretrained encoders and natural language generation (NLG). NLG tasks are often based on the encoder-decoder framework, where the pretrained encoders can only benefit part of it. To reduce this gap, we introduce DeltaLM, a pretrained multilingual encoder-decoder model that regards the decoder as the task layer of off-the-shelf pretrained encoders. Specifically, we augment the pretrained multilingual encoder with a decoder and pre-train it in a self-supervised way. To take advantage of both the large-scale monolingual data and bilingual data, we adopt the span corruption and translation span corruption as the pre-training tasks. Experiments show that DeltaLM outperforms various strong baselines on both natural language generation and translation tasks, including machine translation, abstractive text summarization, data-to-text, and question generation. The code and pretrained models are available at \url{https://aka.ms/deltalm}.

DeltaLM: Enhancing Language Generation and Translation with Encoder-Decoder Pre-training

The paper presents DeltaLM, a multilingual pre-trained encoder-decoder model designed to bridge the gap between natural language understanding (NLU) and natural language generation (NLG) tasks. The model augments off-the-shelf pre-trained multilingual encoders with a newly devised decoder, extending their use from understanding tasks to generation tasks such as machine translation and abstractive summarization.
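The general pattern of warm-starting a sequence-to-sequence model from a pretrained encoder checkpoint can be illustrated with Hugging Face's generic EncoderDecoderModel. This is not DeltaLM's released implementation (available at https://aka.ms/deltalm); it is a minimal sketch of the idea, and the choice of xlm-roberta-base as the encoder checkpoint is an illustrative assumption.

```python
# Minimal sketch (not DeltaLM's code): warm-start a seq2seq model from a
# pretrained multilingual encoder, then treat the decoder as the new "task layer".
# The checkpoint name is an illustrative assumption.
from transformers import AutoTokenizer, EncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# The encoder is initialized from the pretrained checkpoint; the decoder reuses
# the same checkpoint's weights, with cross-attention layers added on top.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "xlm-roberta-base", "xlm-roberta-base"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# From here, the full encoder-decoder would be pre-trained and then fine-tuned
# on generation tasks (translation, summarization, etc.).
```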

Methodology

DeltaLM treats the decoder as a task-specific layer on top of a pre-existing multilingual encoder, initializing and training the two jointly. Two self-supervised objectives are used: span corruption on large-scale monolingual data and translation span corruption on bilingual data. Together, these objectives improve the model's cross-lingual transfer capabilities and its overall performance on NLG tasks.
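The two objectives can be read as data-construction recipes: span corruption masks contiguous spans of a monolingual sentence and asks the decoder to reconstruct them (in the style of T5/mT5), while translation span corruption applies the same corruption to a concatenated bilingual sentence pair so that reconstruction draws on cross-lingual context. The sketch below is illustrative only; the masking ratio, span lengths, and sentinel scheme are assumptions, not DeltaLM's exact settings.

```python
import random

def span_corrupt(tokens, mask_ratio=0.15, mean_span=3, sentinel="<extra_id_{}>"):
    """Corrupt roughly mask_ratio of the tokens in contiguous spans (illustrative values).

    Returns (source, target): the source keeps sentinel placeholders where spans
    were removed; the target lists each sentinel followed by the removed tokens.
    """
    n_to_mask = max(1, int(len(tokens) * mask_ratio))
    source, target, i, sid = [], [], 0, 0
    while i < len(tokens):
        if n_to_mask > 0 and random.random() < mask_ratio:
            span_len = min(random.randint(1, 2 * mean_span - 1), n_to_mask, len(tokens) - i)
            target += [sentinel.format(sid)] + tokens[i:i + span_len]
            source.append(sentinel.format(sid))
            n_to_mask -= span_len
            i += span_len
            sid += 1
        else:
            source.append(tokens[i])
            i += 1
    return source, target

# Span corruption: a monolingual sentence only.
src, tgt = span_corrupt("the quick brown fox jumps over the lazy dog".split())
print(src, tgt)

# Translation span corruption: corrupt the concatenation of a sentence pair,
# so reconstructing a span can rely on the other language's context.
pair = "the quick brown fox </s> le renard brun rapide".split()
src_bi, tgt_bi = span_corrupt(pair)
print(src_bi, tgt_bi)
```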

A key innovation of this paper is the interleaved Transformer decoder architecture. This design aims to maximize the consistency between the encoder and decoder structures, allowing the decoder to more effectively utilize the full range of weights from the pre-trained encoder. This structural alignment is crucial for enhancing the model’s ability to generate coherent language outputs across multiple languages.
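Read this way, an interleaved decoder block can be pictured as self-attention → FFN → cross-attention → FFN, so that each (attention, FFN) pair mirrors a standard encoder layer and can be initialized from consecutive pretrained encoder layers. The sketch below illustrates that idea in plain PyTorch; the dimensions, layer counts, attribute names on the encoder layers, and the exact weight mapping are assumptions for illustration, not DeltaLM's released configuration.

```python
import torch.nn as nn

class InterleavedDecoderLayer(nn.Module):
    """Sketch of an interleaved decoder block: self-attn -> FFN -> cross-attn -> FFN.

    Each (attention, FFN) pair mirrors the structure of a standard encoder layer,
    which is what makes initialization from pretrained encoder weights possible.
    """
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn1 = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn2 = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, x, memory, self_mask=None):
        h = self.norms[0](x)
        x = x + self.self_attn(h, h, h, attn_mask=self_mask, need_weights=False)[0]
        x = x + self.ffn1(self.norms[1](x))
        h = self.norms[2](x)
        x = x + self.cross_attn(h, memory, memory, need_weights=False)[0]
        x = x + self.ffn2(self.norms[3](x))
        return x

def init_from_encoder(decoder_layers, encoder_layers):
    """Illustrative weight mapping (assumed attribute names on the encoder layers):
    decoder layer i takes its (self-attn, ffn1) from encoder layer 2*i and its
    (cross-attn, ffn2) from encoder layer 2*i + 1."""
    for i, dec in enumerate(decoder_layers):
        dec.self_attn.load_state_dict(encoder_layers[2 * i].self_attn.state_dict())
        dec.ffn1.load_state_dict(encoder_layers[2 * i].ffn.state_dict())
        dec.cross_attn.load_state_dict(encoder_layers[2 * i + 1].self_attn.state_dict())
        dec.ffn2.load_state_dict(encoder_layers[2 * i + 1].ffn.state_dict())
```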

Experimental Results

DeltaLM was evaluated on a range of language generation and translation tasks, where it outperformed several strong baselines. Notably, DeltaLM showed superior performance in machine translation across multiple language pairs, marking an average increase of 1.5 BLEU points over state-of-the-art models such as mBART and M2M-100. It also excelled in tasks such as cross-lingual text summarization and data-to-text generation, indicating robust improvements in both the quality and consistency of generated text.

In zero-shot cross-lingual transfer scenarios, where the model must perform tasks in languages not seen during fine-tuning, DeltaLM exhibited substantial gains over previous models, confirming its improved cross-lingual generalization. These results support the model's design choices, particularly its multilingual pre-training and translation-oriented objectives.

Implications and Future Directions

The advancements reported with DeltaLM have substantial implications for both theoretical and practical applications within natural language processing. Theoretically, the research challenges the traditional separation of NLU and NLG by demonstrating that encoder-centric architectures can be repurposed for generation tasks. Practically, DeltaLM's ability to efficiently leverage large-scale multilingual datasets aligns with current trends towards models that emphasize computational efficiency and scalability.

Future research directions should concentrate on scaling the DeltaLM framework to accommodate a broader range of languages, particularly those from underrepresented linguistic families. Moreover, adaptation and fine-tuning of such models for domain-specific applications could further enhance their practical utility. Finally, investigating the ethical implications and potential biases inherent in these large-scale multilingual models remains an ongoing concern.

DeltaLM stands as a testament to the evolving landscape of large-scale pretrained language models, where traditional boundaries between understanding and generation are increasingly converging to create more unified, versatile systems.

Authors (9)
  1. Shuming Ma (83 papers)
  2. Li Dong (154 papers)
  3. Shaohan Huang (79 papers)
  4. Dongdong Zhang (79 papers)
  5. Alexandre Muzio (8 papers)
  6. Saksham Singhal (14 papers)
  7. Hany Hassan Awadalla (24 papers)
  8. Xia Song (38 papers)
  9. Furu Wei (291 papers)
Citations (78)