Distilling Knowledge Learned in BERT for Text Generation (1911.03829v3)

Published 10 Nov 2019 in cs.CL and cs.LG

Abstract: Large-scale pre-trained language models such as BERT have achieved great success in language understanding tasks. However, it remains an open question how to utilize BERT for language generation. In this paper, we present a novel approach, Conditional Masked Language Modeling (C-MLM), to enable the finetuning of BERT on target generation tasks. The finetuned BERT (teacher) is exploited as extra supervision to improve conventional Seq2Seq models (student) for better text generation performance. By leveraging BERT's idiosyncratic bidirectional nature, distilling knowledge learned in BERT can encourage auto-regressive Seq2Seq models to plan ahead, imposing global sequence-level supervision for coherent text generation. Experiments show that the proposed approach significantly outperforms strong Transformer baselines on multiple language generation tasks such as machine translation and text summarization. Our proposed model also achieves new state of the art on the IWSLT German-English and English-Vietnamese MT datasets. Code is available at https://github.com/ChenRocks/Distill-BERT-Textgen.

Utilizing BERT for Enhanced Text Generation Through Conditional Masked Language Modeling

The paper "Distilling Knowledge Learned in BERT for Text Generation" presents an innovative approach to apply BERT, a bidirectional LLM renowned for its prowess in language understanding, to the nuanced domain of text generation tasks. This endeavor targets the existing lacuna in effectively utilizing models like BERT, traditionally employed for tasks such as natural language inference and question answering, to enhance the quality of generated text.

Key Contributions

The research introduces a methodology termed Conditional Masked Language Modeling (C-MLM). By fine-tuning BERT on a specific text generation task, the resulting model can act as a teacher that augments conventional Seq2Seq (Sequence-to-Sequence) models. In essence, the paper leverages BERT's ability to use context from both the left and the right, endowing Seq2Seq models with better global coherence in the generated text.

The proposed model achieved notable performance improvements, exceeding strong Transformer-based baselines on language generation tasks such as machine translation and text summarization. In particular, it set a new state of the art on the IWSLT German-English and English-Vietnamese translation benchmarks.

Methodology and Results

The researchers began by fine-tuning BERT with the C-MLM task, a variant of Masked Language Modeling (MLM) that conditions on an additional input (the source sequence). This lets BERT observe the entire target sequence and predict masked tokens from both preceding and succeeding context, capturing a more holistic sentence structure during training.
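To make the C-MLM setup concrete, the sketch below packs a source/target pair into a single BERT input and masks only target-side tokens, so the teacher must predict them from the full source plus the surrounding target context. It assumes the HuggingFace `transformers` interface, an illustrative 15% masking ratio, and a multilingual BERT checkpoint; the authors' actual preprocessing and hyperparameters may differ.

```python
# Sketch of C-MLM fine-tuning: mask only target-segment tokens of a packed
# [CLS] source [SEP] target [SEP] input and train BERT to recover them.
# (Illustrative; masking ratio, checkpoint, and preprocessing are assumptions.)
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")

def make_cmlm_batch(src: str, tgt: str, mask_prob: float = 0.15):
    """Encode the pair and mask a fraction of *target* tokens only."""
    enc = tokenizer(src, tgt, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    labels = torch.full_like(input_ids, -100)        # -100 = ignored by the MLM loss

    # Target segment has token_type_id == 1; never mask the [SEP] separators.
    is_target = (enc["token_type_ids"] == 1) & (input_ids != tokenizer.sep_token_id)
    chosen = is_target & (torch.rand_like(input_ids, dtype=torch.float) < mask_prob)

    labels[chosen] = input_ids[chosen]               # predict the original tokens
    input_ids[chosen] = tokenizer.mask_token_id      # replace them with [MASK]
    return input_ids, enc["token_type_ids"], enc["attention_mask"], labels

input_ids, token_type_ids, attention_mask, labels = make_cmlm_batch(
    "Wir essen heute Fisch.", "We are having fish today.")
loss = model(input_ids=input_ids, token_type_ids=token_type_ids,
             attention_mask=attention_mask, labels=labels).loss
loss.backward()  # one C-MLM fine-tuning step (optimizer omitted)
```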

Subsequently, they employed a knowledge distillation process in which the fine-tuned BERT teacher produces per-position probability distributions over target tokens for each training sample. A Seq2Seq student is trained to imitate these distributions alongside its standard likelihood objective, indirectly integrating BERT's bidirectional knowledge into its autoregressive training regime.
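The distillation step can be pictured as a per-token loss that mixes the teacher's soft distributions with the usual cross-entropy against the reference tokens. The sketch below is one plausible formulation under that reading; the tensor shapes, padding convention, and interpolation weight `alpha` are illustrative assumptions rather than the paper's exact loss.

```python
# Sketch of a mixed soft/hard distillation loss for the Seq2Seq student.
# (Illustrative: shapes, padding id, and the weight `alpha` are assumptions.)
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, gold_ids, alpha=0.5, pad_id=0):
    """
    student_logits: (batch, tgt_len, vocab)  decoder outputs
    teacher_probs:  (batch, tgt_len, vocab)  soft targets from the C-MLM teacher
    gold_ids:       (batch, tgt_len)         reference target tokens
    """
    log_p = F.log_softmax(student_logits, dim=-1)

    # Soft term: cross-entropy between the teacher's distribution and the student.
    soft = -(teacher_probs * log_p).sum(-1)                    # (batch, tgt_len)

    # Hard term: standard negative log-likelihood of the reference tokens.
    hard = F.nll_loss(log_p.transpose(1, 2), gold_ids,
                      ignore_index=pad_id, reduction="none")   # (batch, tgt_len)

    not_pad = (gold_ids != pad_id).float()
    per_token = alpha * soft + (1.0 - alpha) * hard
    return (per_token * not_pad).sum() / not_pad.sum()
```

In practice, the teacher's distributions would be obtained by masking target positions and querying the fine-tuned BERT, since it scores positions bidirectionally rather than left to right.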

In their empirical evaluations, the authors demonstrated the efficacy of their approach across several text generation datasets. The experiments revealed substantial gains over baseline models, particularly in tasks demanding long-range coherence, thanks to the strategic guidance from the bidirectionally trained BERT.

Implications and Future Work

The findings from this research expand the potential applications of BERT beyond language understanding into the domain of text generation. This integration paves the way for more coherent and contextually rich text generation systems. The novel method also introduces a model-agnostic framework, allowing its application across varied architectures without necessitating increased model sizes, unlike alternative methods that directly integrate BERT's parameters into Seq2Seq models.

The implications of this work are twofold: practically, it provides a pathway to significantly enhancing translation and summarization tasks; theoretically, it opens avenues to explore further synergistic integrations of generative and bidirectional models in AI. Looking ahead, exploring the combination of C-MLM with multimodal inputs, such as those from image captioning tasks, presents an exciting opportunity to deepen the versatility and applicability of this methodology.

In conclusion, the paper furnishes a robust approach to utilize BERT for text generation, underscoring the utility of fine-tuning pretrained models in novel contexts. As AI continues to evolve, methods like these will become integral to developing sophisticated, context-aware generation systems.

Authors (5)
  1. Yen-Chun Chen
  2. Zhe Gan
  3. Yu Cheng
  4. Jingzhou Liu
  5. Jingjing Liu
Citations (28)