
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (1909.11942v6)

Published 26 Sep 2019 in cs.CL and cs.AI

Abstract: Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at https://github.com/google-research/ALBERT.

Citations (6,042)

Summary

  • The paper introduces factorized embedding parameterization and cross-layer parameter sharing to reduce memory usage and speed up training without sacrificing performance.
  • It replaces the traditional NSP task with Sentence Order Prediction to enhance inter-sentence coherence in self-supervised learning.
  • Empirical results show ALBERT achieves state-of-the-art scores on benchmarks like GLUE and SQuAD using far fewer parameters than BERT.

Overview of ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

The paper "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations" addresses critical limitations in scaling pre-trained LLMs, specifically BERT, due to memory constraints and training time. The authors introduce ALBERT (A Lite BERT), which utilizes two parameter-reduction techniques designed to increase training efficiency and decrease memory consumption without sacrificing performance. Additionally, ALBERT introduces a new self-supervised objective aimed at modeling inter-sentence coherence more effectively than the Next Sentence Prediction (NSP) task used in BERT.

Key Contributions

1. Factorized Embedding Parameterization:

The ALBERT architecture employs a factorized embedding parameterization to decouple the size of the hidden layers from the size of the vocabulary embeddings. This decoupling is achieved by decomposing the large vocabulary embedding matrix into two smaller matrices, significantly reducing the number of parameters from O(V × H) to O(V × E + E × H), where V is the vocabulary size, E is the embedding size, and H is the hidden size. This allows for increased hidden layer sizes without a corresponding increase in vocabulary embedding parameters, greatly enhancing parameter efficiency.
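A minimal PyTorch sketch of the factorized embedding idea (the module and dimension names are illustrative, not taken from the ALBERT codebase): a small V × E lookup table followed by an E × H projection replaces the single V × H embedding matrix.

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Token embedding factored as a V x E lookup plus an E x H projection."""

    def __init__(self, vocab_size: int, embed_size: int, hidden_size: int):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_size)        # V x E
        self.project = nn.Linear(embed_size, hidden_size, bias=False)  # E x H

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # (batch, seq) -> (batch, seq, E) -> (batch, seq, H)
        return self.project(self.token_embed(token_ids))

# Illustrative parameter counts for V=30000, E=128, H=4096:
#   untied     : V*H       = 122,880,000
#   factorized : V*E + E*H =   3,840,000 + 524,288 ≈ 4.4M
```

The comparison in the comment shows why the factorization matters: the embedding cost becomes almost independent of the hidden size, so H can grow without blowing up the vocabulary embedding parameters.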

2. Cross-layer Parameter Sharing:

ALBERT implements cross-layer parameter sharing to prevent the number of parameters from growing with network depth. By sharing parameters across different layers, ALBERT reduces its overall parameter count while maintaining competitive performance. Experiments indicate that although some performance drop is observed with parameter sharing, the trade-off is justified by the significant reduction in model size and training duration.
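A minimal sketch of all-layer parameter sharing, assuming a generic transformer encoder layer (here PyTorch's `nn.TransformerEncoderLayer` stands in for ALBERT's own layer implementation): a single layer instance is applied repeatedly, so the parameter count does not grow with depth.

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Applies one transformer layer num_layers times, sharing weights across depth."""

    def __init__(self, hidden_size: int, num_heads: int, num_layers: int):
        super().__init__()
        self.num_layers = num_layers
        # One set of weights, reused at every depth.
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_layers):
            x = self.layer(x)  # same parameters at every step
        return x
```

Whether attention weights, feed-forward weights, or both are shared is a configuration choice; ALBERT's default shares all parameters across layers.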

3. Sentence Order Prediction (SOP) Loss:

To replace the NSP task in BERT, which has been shown to have limited efficacy, ALBERT introduces Sentence Order Prediction (SOP) as its secondary training objective. SOP focuses on predicting the correct order of two consecutive segments, homing in on inter-sentence coherence rather than topic prediction. Experimental results demonstrate that SOP leads to better downstream task performance because it forces the model to learn finer-grained distinctions about discourse-level coherence; a sketch of how such training pairs are constructed follows below.
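A small sketch of how SOP training pairs can be built from two consecutive segments of the same document (the helper name and data layout are illustrative): a positive example keeps the original order, a negative example simply swaps the two segments.

```python
import random
from typing import List, Tuple

def make_sop_example(
    seg_a: List[str], seg_b: List[str]
) -> Tuple[List[str], List[str], int]:
    """seg_a and seg_b are consecutive segments from the same document.

    Returns (first, second, label), where label 1 means "in order"
    and label 0 means "swapped" -- the SOP target.
    """
    if random.random() < 0.5:
        return seg_a, seg_b, 1  # positive: original order
    return seg_b, seg_a, 0      # negative: segments swapped

# Unlike NSP, the negative is not drawn from a different document, so
# topical cues cannot give the answer away; the model must rely on
# discourse-level ordering to solve the task.
```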

Empirical Results

ALBERT achieves state-of-the-art results on several benchmarks with significantly fewer parameters than BERT-large.

  • GLUE: ALBERT establishes new records with an average score of 89.4.
  • SQuAD: It surpasses previous models with an F1 score of 92.2 on SQuAD 2.0.
  • RACE: ALBERT achieves a test accuracy of 89.4%, marking a substantial improvement over prior models, demonstrating the efficacy of inter-sentence coherence modeling through SOP.

Implications and Future Directions

The introduction of ALBERT represents a significant stride towards more memory-efficient and faster training of large-scale LLMs. The factorized embedding parameterization and cross-layer parameter sharing techniques pave the way for further optimizations in LLM architectures. The innovations in inter-sentence coherence tasks suggest that there are still untapped dimensions of self-supervised learning objectives that can enhance language understanding capabilities.

Future work may include investigating other parameter-sharing configurations and exploring alternative self-supervised objectives. Moreover, additional research could focus on integrating efficient training techniques like sparse attention and block attention to further accelerate training times and improve inference speeds.

ALBERT's architecture and methodologies have set a precedent for designing resource-efficient models while pushing the boundaries of NLP performance. The field should expect to see these techniques refined and expanded upon in subsequent research, potentially unveiling more sophisticated models that maintain an optimal balance between parameter size and model accuracy.
