Reducing Transformer Depth on Demand with Structured Dropout
The paper "Reducing Transformer Depth on Demand with Structured Dropout" presents a novel method for improving the computational efficiency of transformer networks, a prevalent architecture in natural language processing tasks. The authors introduce an approach called LayerDrop, which integrates a structured form of dropout aimed at reducing both overfitting and inference time in large models without sacrificing performance.
Core Methodology
Transformers are known for their state-of-the-art performance across various NLP tasks, yet they are also characterized by substantial parameter counts and computational demands. The authors address this challenge by proposing LayerDrop, a technique that allows significant pruning at inference time while maintaining accuracy. LayerDrop operates by randomly dropping entire layers during training, which both regularizes the model against overfitting and makes it robust to missing layers at inference time. This robustness facilitates the extraction of performant sub-networks at different depths from a single trained model without additional fine-tuning.
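A minimal sketch of what layer dropping during training might look like, in PyTorch-style code (class and parameter names such as LayerDropEncoder and drop_rate are illustrative assumptions, not taken from the authors' released implementation):

```python
import torch
import torch.nn as nn


class LayerDropEncoder(nn.Module):
    """Transformer-style layer stack with structured layer dropout (sketch)."""

    def __init__(self, layers: nn.ModuleList, drop_rate: float = 0.2):
        super().__init__()
        self.layers = layers          # stack of transformer layers
        self.drop_rate = drop_rate    # probability of skipping each layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            # During training, skip the entire layer with probability drop_rate;
            # at inference, all layers run (or a fixed subset is kept, see below).
            if self.training and torch.rand(1).item() < self.drop_rate:
                continue
            x = layer(x)
        return x
```

Because whole layers are dropped as units, the surviving computation graph is always a valid shallower transformer, which is what makes post-hoc depth reduction possible.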
Results and Findings
The paper demonstrates the effectiveness of LayerDrop across multiple tasks:
- Machine Translation: Improved BLEU scores on WMT14 English-German benchmarks with deeper encoder architectures.
- Language Modeling: Achieved better perplexities on the WikiText-103 dataset, outperforming existing transformer-based approaches.
- Summarization and QA: Enhanced ROUGE scores for tasks such as CNN/DailyMail summarization and ELI5 question answering, underscoring the method's robustness across varied NLP applications.
An interesting outcome of LayerDrop is its capability to produce small BERT-like models that exhibit superior quality compared to those trained from scratch or via model distillation techniques. Consequently, LayerDrop not only stabilizes the training process for deeper transformers but also supports on-the-fly model depth adjustments to meet specific computational constraints.
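To illustrate the on-the-fly depth adjustment, here is a small sketch of pruning a trained stack to a fixed subset of layers at inference time. It assumes the hypothetical LayerDropEncoder above and uses a simple every-other-layer selection; the paper discusses pruning strategies in more detail, so this is only an illustration:

```python
import torch.nn as nn


def prune_layers(encoder: "LayerDropEncoder", keep_every: int = 2) -> "LayerDropEncoder":
    """Build a shallower sub-network by keeping every `keep_every`-th layer."""
    kept = nn.ModuleList(
        layer for i, layer in enumerate(encoder.layers) if i % keep_every == 0
    )
    # No layer dropping at inference time; the kept layers form the sub-network.
    pruned = LayerDropEncoder(kept, drop_rate=0.0)
    pruned.eval()
    return pruned
```

Because the model was trained with layers stochastically removed, such a pruned sub-network can be used directly, without the additional fine-tuning that conventional pruning typically requires.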
Implications and Future Directions
The implications of this work are manifold. Practically, the ability to dynamically adjust model depth without additional training opens avenues for deploying high-performance language models on resource-constrained devices. Theoretically, the results encourage further exploration into structured regularization techniques, providing insights into balancing model complexity with computational efficiency.
Future research might explore integrating LayerDrop with other efficient transformer variants or extending its application beyond NLP to other domains where transformer architectures are utilized. Additionally, adapting the LayerDrop technique to work with emerging model architectures or incorporating learnable dropping rates could further optimize model performance and flexibility.
In summary, the LayerDrop method innovatively enhances transformer efficiency, particularly during inference, while setting a new standard for managing model depth and complexity in large-scale language processing tasks.