Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

Published 23 Aug 2019 in cs.CL | (1908.08962v2)

Abstract: Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training. Due to the cost of applying such models to down-stream tasks, several model compression techniques on pre-trained language representations have been proposed (Sun et al., 2019; Sanh, 2019). However, surprisingly, the simple baseline of just pre-training and fine-tuning compact models has been overlooked. In this paper, we first show that pre-training remains important in the context of smaller architectures, and fine-tuning pre-trained compact models can be competitive to more elaborate methods proposed in concurrent work. Starting with pre-trained compact models, we then explore transferring task knowledge from large fine-tuned models through standard knowledge distillation. The resulting simple, yet effective and general algorithm, Pre-trained Distillation, brings further improvements. Through extensive experiments, we more generally explore the interaction between pre-training and distillation under two variables that have been under-studied: model size and properties of unlabeled task data. One surprising observation is that they have a compound effect even when sequentially applied on the same data. To accelerate future research, we will make our 24 pre-trained miniature BERT models publicly available.


Authors (4)

Citations (213)


Summary

The paper shows that compact models which are first pre-trained and then trained with knowledge distillation from a large fine-tuned teacher approach teacher-level performance at a fraction of the computational cost.
It applies full pre-training on unlabeled text to a range of architecture configurations and finds that, for compact models, depth generally matters more than width.
The study provides actionable insights and publicly released models to guide future research on efficient model compression and deployment.

Evaluating the Impact of Pre-training on Compact Models

The paper presents an empirical study of pre-training compact models in NLP and of how pre-training combines with standard techniques such as knowledge distillation. The motivation is that state-of-the-art language models such as BERT and XLNet are computationally expensive because of their large parameter counts, which creates a need for methods that achieve comparable performance under memory and latency constraints.

Research Objective and Hypothesis

The central hypothesis is that compact language models, when both pre-trained and fine-tuned, can reach competitive performance without resorting to more elaborate model compression techniques. This simple baseline had been largely overlooked; the paper fills that gap by training and evaluating compact models across a variety of sizes and training setups.

Methodology

The study combines two techniques:

Pre-training on Compact Architectures: They demonstrate the effectiveness of pre-training small models using unlabeled data.
Knowledge Distillation: They transfer task knowledge from large, fine-tuned teacher models to smaller student models. Distillation starts from students that have already been pre-trained, so the two techniques are applied in sequence rather than treated as alternatives.
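
The distillation step above can be illustrated with a minimal sketch of a soft-label loss: the student is trained to match the teacher's predicted class distribution on each example. This is a generic formulation of knowledge distillation, not necessarily the exact loss used in the paper; the function names and the `temperature` parameter are assumptions introduced for illustration.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-probabilities from raw logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max before exponentiating
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_norm for z in scaled]


def distillation_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    temperature: float = 1.0,  # assumed knob; 1.0 leaves the distributions unchanged
) -> float:
    """Cross-entropy between the teacher's soft labels and the student's predictions."""
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits, temperature)]
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


# Hypothetical logits for a single three-class example.
teacher = [2.0, 0.5, -1.0]
agreeing_student = [2.0, 0.5, -1.0]
disagreeing_student = [-1.0, 0.5, 2.0]

print(distillation_loss(teacher, agreeing_student))
print(distillation_loss(teacher, disagreeing_student))
```

The loss is smallest when the student reproduces the teacher's distribution, so minimizing it over unlabeled examples scored by the teacher pushes the student toward the teacher's behavior without requiring gold labels.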

Experimental Design and Key Findings

In contrast to previous work that obtains compact models by truncating larger pre-trained ones, the authors apply full pre-training to a variety of architecture configurations. They vary both width and depth to identify how compact models can best use their parameter budget, and find that depth is generally more beneficial than width.
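
To make the width/depth trade-off concrete, the sketch below estimates the parameter count of a BERT-style encoder from its number of layers and hidden size. It is a rough estimate under stated assumptions: a feed-forward inner size of four times the hidden size, a 30,522-token vocabulary, 512 positions, and no pooler or task head. The grid of 6 depths and 4 widths is also an assumption, chosen only because it yields 24 configurations; the function name is hypothetical.

```python
from typing import Dict, Tuple


def bert_encoder_params(
    num_layers: int,
    hidden_size: int,
    vocab_size: int = 30522,   # assumed WordPiece vocabulary size
    max_positions: int = 512,  # assumed maximum sequence length
    type_vocab: int = 2,       # assumed number of segment types
) -> int:
    """Approximate parameter count of a BERT-style encoder (no pooler, no task head)."""
    h = hidden_size
    ffn = 4 * h  # assumed feed-forward inner size
    # Token, position, and segment embedding tables plus one LayerNorm (scale and bias).
    embeddings = (vocab_size + max_positions + type_vocab) * h + 2 * h
    # Query, key, value, and output projections, each with a bias vector.
    attention = 4 * (h * h + h)
    # Two dense layers: h -> ffn and ffn -> h, each with a bias vector.
    feed_forward = (h * ffn + ffn) + (ffn * h + h)
    # Two LayerNorms per layer, each with scale and bias.
    layer_norms = 2 * 2 * h
    return embeddings + num_layers * (attention + feed_forward + layer_norms)


# Assumed grid of depths and widths (6 x 4 = 24 configurations).
grid: Dict[Tuple[int, int], int] = {
    (layers, hidden): bert_encoder_params(layers, hidden)
    for layers in (2, 4, 6, 8, 10, 12)
    for hidden in (128, 256, 512, 768)
}

for (layers, hidden), count in sorted(grid.items()):
    print(f"L={layers:2d} H={hidden:3d} params={count / 1e6:.1f}M")
```

Per-layer cost grows roughly with the square of the hidden size but only linearly with the number of layers. Under this estimate, an 8-layer model with hidden size 256 has fewer parameters than a 2-layer model with hidden size 512, which shows why a fixed budget can buy considerably more depth than width.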

In experiments on tasks such as sentiment analysis and natural language inference, they show that compact models such as MINI can recover teacher-level accuracy with substantial computational savings. Pre-trained Distillation also works with modestly sized transfer datasets and remains robust to variations in transfer data size and domain relevance.

Implications of the Findings

The results suggest that full pre-training is beneficial regardless of model size and that its benefit is greater when it is combined with knowledge distillation. By making their trained compact models publicly available, the authors aim to accelerate future research on efficient model deployment, especially on resource-constrained platforms.

Speculation on Future Developments

Future research may explore combining pre-training with other compression techniques such as quantization or pruning, or extending these insights to neural architectures beyond Transformers. Multi-task distillation and shared pre-training paradigms could also benefit training that is currently done separately for each task.

By examining pre-training in the context of compact models and confirming its efficacy alongside distillation, this paper contributes to NLP systems that balance performance and resource efficiency.
