- The paper demonstrates that optimizing hyperparameters and extended training regimes significantly boosts performance over the original BERT.
- The authors remove the NSP objective, employ dynamic masking, and increase sequence lengths to achieve state-of-the-art results on benchmark datasets.
- They introduce a large-scale CC-News dataset and release open-source code, enhancing reproducibility and practical impact in NLP research.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
The paper "RoBERTa: A Robustly Optimized BERT Pretraining Approach" by Liu et al. contributes to the ongoing improvement of BERT-based models by evaluating and enhancing various design choices during pretraining. The authors undertake a detailed replication paper of BERT pretraining, addressing significant factors that impact the performance of such models, including hyperparameter choices, training data sizes, and the influence of specific training objectives.
Key Contributions
- Replication and Optimization: The authors replicate BERT's pretraining to analyze the impact of hyperparameters and training data size in detail. They find that BERT was significantly undertrained and that, with improved training choices, it can match or exceed the performance of models published after it.
- New Training Strategies: The paper proposes several modifications:
- Training the model for longer periods with larger batch sizes over more extensive datasets.
- Removing the next sentence prediction (NSP) objective.
- Training on longer sequences.
- Introducing dynamic masking patterns during training (see the sketch after this list).
- Large Dataset Collection: The paper includes the construction of a large new dataset, CC-News, comparable in size to many private datasets used in other works, enhancing the reproducibility and comparability of their findings.
- Empirical Results: Through rigorous experimentation, the authors demonstrate that their training improvements lead to state-of-the-art results on benchmarks like GLUE, RACE, and SQuAD.
- Open Source Release: They release their models and code, facilitating further research and verification by the community.
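The dynamic masking change mentioned above is simple to sketch. In BERT's original setup, masks were generated once during preprocessing (with the data duplicated to obtain a handful of static variants); RoBERTa instead regenerates the mask every time a sequence is fed to the model. The snippet below is a minimal PyTorch-style sketch of that idea using BERT's standard 80/10/10 replacement split; the `dynamic_mask` helper, its arguments, and the tensor handling are illustrative assumptions, not code from the released implementation.

```python
import torch

def dynamic_mask(token_ids, mask_token_id, vocab_size, special_ids, p=0.15):
    """Corrupt a batch of token IDs with a freshly sampled MLM mask.

    Called every time a batch is drawn, so each pass over the data sees a
    new mask pattern (dynamic masking), unlike BERT's preprocess-time masks.
    """
    labels = token_ids.clone()

    # Candidate positions: everything except special tokens ([CLS], [SEP], padding).
    candidates = torch.ones_like(token_ids, dtype=torch.bool)
    for sid in special_ids:
        candidates &= token_ids != sid

    # Select ~15% of candidate positions for prediction.
    selected = candidates & (torch.rand(token_ids.shape, device=token_ids.device) < p)
    labels[~selected] = -100  # ignored by PyTorch's cross-entropy loss

    corrupted = token_ids.clone()
    roll = torch.rand(token_ids.shape, device=token_ids.device)
    corrupted[selected & (roll < 0.8)] = mask_token_id        # 80% -> [MASK]
    use_random = selected & (roll >= 0.8) & (roll < 0.9)      # 10% -> random token
    corrupted[use_random] = torch.randint_like(token_ids, vocab_size)[use_random]
    # The remaining 10% of selected tokens are left unchanged.
    return corrupted, labels
```

Calling such a function inside the data-loading loop, rather than once at preprocessing time, is what makes the masking dynamic.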
Experimental Setup and Findings
Analysis of Training Procedure
The paper dissects different aspects of BERT’s pretraining, such as static vs. dynamic masking, model input format, NSP loss, training with large batches, and text encoding strategies. Several observations stand out:
- Dynamic Masking: This approach marginally outperforms static masking. Given its efficiency, dynamic masking is adopted for subsequent experiments.
- NSP Loss and Input Format: Contrary to BERT’s original ablations, removing the NSP loss matched or slightly improved downstream performance. Inputs packed with contiguous full sentences and no segment pairs (the FULL-SENTENCES format) performed well across tasks and were adopted for the remaining experiments (a packing sketch follows this list).
- Batch Size: Larger batches (up to 8K sequences) yielded better perplexity and end-task performance, reaffirming previous findings in neural machine translation.
- Text Encoding: A byte-level BPE vocabulary of 50K units performs slightly worse on some tasks than the original character-level BPE, but it can encode any input text without unknown tokens, so the authors adopt it as a more universal scheme.
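The FULL-SENTENCES format referenced above can be sketched as a simple packing routine: contiguous sentences are concatenated until the 512-token budget is reached, document boundaries are crossed with an extra separator, and no NSP pair or label is produced. The helper below is a rough illustration under those assumptions; its name, signature, and truncation handling are hypothetical.

```python
from typing import Iterable, Iterator, List

def pack_full_sentences(documents: Iterable[List[List[int]]],
                        sep_id: int,
                        max_len: int = 512) -> Iterator[List[int]]:
    """Pack tokenized sentences into FULL-SENTENCES style training inputs.

    `documents` is an iterable of documents, each a list of sentences,
    each sentence already a list of token IDs. Sequences may cross
    document boundaries, with an extra separator marking the boundary,
    and no next-sentence-prediction pair or label is produced.
    """
    buffer: List[int] = []
    for doc in documents:
        for sentence in doc:
            # Emit the current sequence once the next sentence would overflow it.
            if buffer and len(buffer) + len(sentence) > max_len:
                yield buffer
                buffer = []
            buffer.extend(sentence[:max_len])  # sketch: truncate oversized sentences
        # End of document: add an extra separator but keep filling the sequence.
        if buffer and len(buffer) < max_len:
            buffer.append(sep_id)
    if buffer:
        yield buffer
```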
RoBERTa Configuration and Results
The combined improvements define RoBERTa, which is first pretrained for 100K steps over BERT's original data (BookCorpus plus English Wikipedia) and then scaled up with additional datasets (160GB of text in total) and longer training (up to 500K steps). The resulting model consistently outperforms the original BERT and other recent models across various benchmarks.
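Batches of 8K sequences do not fit on a single accelerator, so an effective batch of that size is typically assembled by accumulating gradients over many smaller micro-batches (and data-parallel workers). The loop below is a minimal, generic PyTorch sketch of that pattern; `model`, `batch_iter`, `optimizer`, and the micro-batch numbers are placeholders rather than values from the released training scripts.

```python
import torch
from itertools import islice

# Illustrative numbers: an 8,192-sequence effective batch assembled from
# micro-batches of 32 sequences => 256 gradient-accumulation steps per update.
EFFECTIVE_BATCH = 8192
MICRO_BATCH = 32
ACCUM_STEPS = EFFECTIVE_BATCH // MICRO_BATCH

def train_step(model, batch_iter, optimizer):
    """One optimizer update over an accumulated effective batch.

    Placeholders: a masked-LM model returning an output with a `.loss`,
    an iterator over (input_ids, labels) micro-batches, and any torch optimizer.
    """
    model.train()
    optimizer.zero_grad()
    for input_ids, labels in islice(batch_iter, ACCUM_STEPS):
        loss = model(input_ids, labels=labels).loss
        # Scale so the accumulated gradient matches one large-batch update.
        (loss / ACCUM_STEPS).backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```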
- GLUE Benchmark: RoBERTa achieves state-of-the-art results on all nine GLUE development sets, illustrating the effectiveness of the proposed modifications. On the GLUE leaderboard test sets, RoBERTa is competitive with or ahead of prior submissions on several tasks, notably without relying on multi-task finetuning (a fine-tuning sketch follows this list).
- SQuAD and RACE Benchmarks: On SQuAD and RACE, RoBERTa also sets new state-of-the-art scores, showcasing the robustness of the optimized pretraining approach.
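Because the pretrained checkpoints were released publicly, reproducing GLUE-style fine-tuning is straightforward. The snippet below assumes the checkpoints as redistributed through the Hugging Face transformers library (the paper itself releases code in fairseq); it is a minimal sketch of scoring a sentence pair with a classification head, not the authors' exact fine-tuning recipe.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "roberta-base" / "roberta-large" are the publicly hosted checkpoints.
# The classification head is freshly initialized and would normally be
# fine-tuned on the target GLUE task before its outputs are meaningful.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Example MNLI/RTE-style sentence pair.
inputs = tokenizer("A man is playing a guitar.",
                   "Someone is making music.",
                   return_tensors="pt",
                   truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))
```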
Practical and Theoretical Implications
The practical implications of this work are profound. By demonstrating that existing models can achieve superior performance through better-tuned pretraining procedures, the research underscores the importance of optimizing hyperparameters and training regimes over developing entirely new model architectures. This has significant cost-efficiency implications for large-scale model training in both academic and industry settings.
Theoretically, the findings challenge the narrative around the necessity of objectives like NSP, suggesting that simpler training objectives, when effectively optimized and supported by extensive and diverse data, can achieve competitive, if not superior, results.
Future Directions
Future research may build upon these findings by exploring even larger batch sizes and datasets, understanding the interplay between data size and diversity at a finer grain, and potentially developing new pretraining objectives that leverage the insights gained from this paper. Moreover, incorporating multi-task finetuning could further enhance the effectiveness of such pretrained models.
In conclusion, Liu et al.’s work on RoBERTa not only provides a compelling direction for pretraining optimization but also sets a high benchmark for future advancements in the field of natural language understanding. The open-sourcing of their models and code significantly enhances the accessibility and reproducibility of this research, promoting further exploration and innovation in the community.