Overview of MobileBERT: A Compact Task-Agnostic BERT for Resource-Limited Devices
The paper introduces MobileBERT, a compressed and accelerated variant of BERT (Bidirectional Encoder Representations from Transformers) designed for resource-limited mobile devices. MobileBERT aims to cut the heavy computation and memory footprint of the original model while remaining task-agnostic, i.e., applicable to a range of downstream NLP tasks through simple fine-tuning, and competitive on standard benchmarks.
Key Contributions
To achieve these objectives, the authors present several key innovations:
- Bottleneck and Inverted-Bottleneck Structures: MobileBERT is as deep as BERT_LARGE but much narrower: each block wraps self-attention and feed-forward layers between linear bottleneck projections that shrink the 512-dimensional inter-block representation down to 128 dimensions. The matching inverted-bottleneck structure in the teacher keeps the two models' block inputs and outputs the same size, and the overall design balances the computational burden between self-attention and feed-forward networks.
- Training with a Teacher Model: Instead of training MobileBERT from scratch, the authors first train a specially designed teacher, IB-BERT (Inverted-Bottleneck BERT), and then transfer its knowledge to MobileBERT. The inverted bottlenecks let IB-BERT retain the capacity and accuracy of BERT_LARGE while exposing feature maps of the same size as MobileBERT's, which makes layer-wise knowledge transfer possible.
- Stacked Feed-Forward Networks: Because the narrow bottleneck shrinks the feed-forward layers relative to multi-head attention, each MobileBERT block stacks several feed-forward networks (four in the proposed configuration) to restore the usual parameter ratio between the attention and feed-forward modules.
- Embedding Factorization: The embedding table, a significant contributor to model size, is shrunk by reducing the token embedding dimension to 128 and then applying a 1D convolution with kernel size 3 to produce the 512-dimensional input expected by the transformer body.
- Operational Optimizations: The paper also introduces latency-oriented optimizations for mobile inference, replacing layer normalization with a simpler element-wise linear transform (NoNorm) and the GELU activation with ReLU (these components are sketched in code after this list).
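The snippet below is a minimal PyTorch sketch, not the authors' reference implementation, of how these pieces fit together: a factorized embedding, and a body block with bottleneck projections, stacked feed-forward networks, NoNorm, and ReLU. The class names, default dimensions, and residual wiring are illustrative assumptions based on the configuration reported in the paper.

```python
import torch
import torch.nn as nn


class NoNorm(nn.Module):
    """Element-wise linear transform used in place of LayerNorm (no mean/variance statistics)."""
    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return x * self.weight + self.bias


class FactorizedEmbedding(nn.Module):
    """128-d token embedding expanded to the 512-d body width by a kernel-size-3 1D convolution."""
    def __init__(self, vocab_size=30522, embed_dim=128, hidden=512):
        super().__init__()
        self.tokens = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, hidden, kernel_size=3, padding=1)

    def forward(self, token_ids):                        # (batch, seq)
        e = self.tokens(token_ids).transpose(1, 2)       # (batch, 128, seq)
        return self.conv(e).transpose(1, 2)              # (batch, seq, 512)


class BottleneckBlock(nn.Module):
    """One MobileBERT-style body block: project down to a narrow bottleneck,
    run self-attention and several stacked FFNs, then project back up."""
    def __init__(self, hidden=512, bottleneck=128, ffn_inner=512,
                 num_heads=4, num_stacked_ffn=4):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)        # input bottleneck
        self.attn = nn.MultiheadAttention(bottleneck, num_heads, batch_first=True)
        self.attn_norm = NoNorm(bottleneck)
        # Stacked FFNs re-balance parameters between attention and feed-forward
        # layers, which the narrow bottleneck would otherwise skew toward attention.
        self.ffns = nn.ModuleList([
            nn.Sequential(nn.Linear(bottleneck, ffn_inner),
                          nn.ReLU(),                     # ReLU instead of GELU
                          nn.Linear(ffn_inner, bottleneck))
            for _ in range(num_stacked_ffn)])
        self.ffn_norms = nn.ModuleList([NoNorm(bottleneck) for _ in range(num_stacked_ffn)])
        self.up = nn.Linear(bottleneck, hidden)          # output bottleneck
        self.out_norm = NoNorm(hidden)

    def forward(self, x):                                # (batch, seq, 512)
        h = self.down(x)                                 # (batch, seq, 128)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        h = self.attn_norm(h + attn_out)
        for ffn, norm in zip(self.ffns, self.ffn_norms):
            h = norm(h + ffn(h))
        return self.out_norm(x + self.up(h))             # residual at full width
```

A stack of 24 such blocks on top of the factorized embedding matches the depth reported for MobileBERT in the paper.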
Training Strategies
The authors compare three strategies for combining layer-wise knowledge transfer from the IB-BERT teacher with pre-training distillation:
- Auxiliary Knowledge Transfer (AKT): The layer-wise transfer losses are added as auxiliary terms to the main pre-training distillation objective in a single combined loss.
- Joint Knowledge Transfer (JKT): Here, all layer-wise knowledge transfer losses are jointly optimized, followed by a separate distillation phase.
- Progressive Knowledge Transfer (PKT): Each layer is trained in turn, from bottom to top, with the already-trained lower layers frozen; the paper finds this the most stable and effective of the three strategies (a training-loop sketch follows this list).
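As a rough illustration of the progressive variant, the sketch below trains one student layer at a time against the teacher's layer-wise feature maps (MSE) and attention distributions (KL divergence), keeping all other layers frozen; the final pre-training distillation stage is omitted. The student/teacher interface (`.layers`, `.hidden_states`, `.attentions`) is a hypothetical PyTorch-style API, not the paper's code.

```python
import torch
import torch.nn.functional as F


def layer_transfer_loss(student_out, teacher_out, layer_idx):
    """Feature-map MSE plus attention-distribution KL for a single layer."""
    feature_loss = F.mse_loss(student_out.hidden_states[layer_idx],
                              teacher_out.hidden_states[layer_idx])
    attention_loss = F.kl_div(
        student_out.attentions[layer_idx].clamp_min(1e-9).log(),  # student log-probs
        teacher_out.attentions[layer_idx],                        # teacher probs
        reduction="batchmean")
    return feature_loss + attention_loss


def progressive_knowledge_transfer(student, teacher, data_loader, steps_per_layer):
    """Train student layers one at a time, bottom-up, with all other layers frozen."""
    teacher.eval()
    for layer_idx in range(len(student.layers)):
        for i, layer in enumerate(student.layers):
            for p in layer.parameters():
                p.requires_grad = (i == layer_idx)       # only the current layer learns
        optimizer = torch.optim.AdamW(
            (p for p in student.parameters() if p.requires_grad), lr=1e-4)
        for _, batch in zip(range(steps_per_layer), data_loader):
            with torch.no_grad():
                teacher_out = teacher(**batch)
            student_out = student(**batch)
            loss = layer_transfer_loss(student_out, teacher_out, layer_idx)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```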
Empirical Evaluations
The performance of MobileBERT is thoroughly evaluated on prominent NLP benchmarks, including the General Language Understanding Evaluation (GLUE) and the Stanford Question Answering Dataset (SQuAD). Noteworthy outcomes from these evaluations include:
- GLUE Benchmark: MobileBERT reaches a GLUE score of 77.7, only 0.6 points below BERT_BASE, while being 4.3 times smaller and 5.5 times faster. A smaller variant, MobileBERT_TINY, scores 75.8, showing that even more compact configurations retain strong performance.
- SQuAD Dev Set: MobileBERT attains an F1 score of 90.0 on SQuAD v1.1, outperforming BERT_BASE by 1.5 points with significantly lower latency and model size.
Implications and Future Directions
The design and empirical success of MobileBERT have several practical and theoretical implications:
- Practical Deployment: MobileBERT enables the deployment of high-performing NLP models on mobile devices, facilitating applications such as mobile-based machine translation and dialogue modeling.
- Model Compression Techniques: The paper emphasizes the efficacy of deep and narrow networks, bottleneck structures, and progressive training, which can be beneficial for other model compression endeavors.
- Transfer Learning: The techniques in MobileBERT can be extended to improve transfer learning approaches, where large models can be distilled into more compact, efficient variants without significant performance drops.
Conclusion
MobileBERT is a pioneering work in developing a compact, task-agnostic BERT model suited for resource-limited devices. By leveraging innovative architectural modifications and knowledge transfer techniques, MobileBERT achieves a remarkable trade-off between model size, speed, and performance. Future research can build on these concepts to further refine and generalize model compression strategies, promoting the broader deployment of sophisticated NLP models in constrained environments.