DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing (2111.09543)
Published 18 Nov 2021 in cs.CL and cs.LG

Overview

  • The paper introduces DeBERTaV3, a pre-trained language model that enhances DeBERTa by adopting the replaced token detection (RTD) objective from ELECTRA and proposing a novel gradient-disentangled embedding sharing (GDES) mechanism to mitigate training inefficiencies.

  • Key innovations include replacing the masked language modeling (MLM) task with RTD and introducing GDES so that the conflicting generator and discriminator losses no longer pull the shared embeddings in opposite directions, yielding improved training efficiency, embedding quality, and downstream performance.

  • Experiments demonstrate that DeBERTaV3 and its multilingual variant mDeBERTaV3 achieve state-of-the-art results on major NLU benchmarks, including GLUE, SQuAD, and XNLI, highlighting the model's robustness, efficiency, and suitability for multilingual settings.

Improving DeBERTa with ELECTRA-Style Pre-Training via Gradient-Disentangled Embedding Sharing

The paper "DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing" introduces DeBERTaV3, a pre-trained language model that builds on DeBERTa by incorporating techniques from ELECTRA. Specifically, the authors replace DeBERTa's masked language modeling (MLM) objective with ELECTRA's replaced token detection (RTD) and propose a new embedding sharing mechanism, gradient-disentangled embedding sharing (GDES), which mitigates the training inefficiencies that arise from conflicting training losses between the generator and the discriminator.
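
To make the RTD setup concrete, here is a minimal PyTorch-style sketch of ELECTRA-style replaced token detection. The `generator` and `discriminator` modules, the boolean `mask_positions` tensor, and the loss weighting are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def rtd_step(generator, discriminator, input_ids, masked_ids, mask_positions,
             mlm_labels, rtd_weight=50.0):
    """One illustrative step combining the generator's MLM loss with the
    discriminator's RTD loss; rtd_weight=50 is ELECTRA's setting, assumed here."""
    # 1. Train the generator with masked language modeling.
    gen_logits = generator(masked_ids)                    # [batch, seq, vocab]
    mlm_loss = F.cross_entropy(gen_logits[mask_positions],
                               mlm_labels[mask_positions])

    # 2. Sample replacements for the masked positions from the generator;
    #    no gradient flows through the sampled tokens.
    with torch.no_grad():
        probs = F.softmax(gen_logits[mask_positions], dim=-1)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
    corrupted_ids = input_ids.clone()
    corrupted_ids[mask_positions] = sampled

    # 3. The discriminator classifies every token as original or replaced.
    is_replaced = (corrupted_ids != input_ids).float()
    disc_logits = discriminator(corrupted_ids)            # [batch, seq]
    rtd_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

    return mlm_loss + rtd_weight * rtd_loss
```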

Key Contributions

  1. Integration of RTD into DeBERTa: The authors show that replacing DeBERTa's MLM pre-training task with RTD significantly improves performance, leveraging ELECTRA's sample efficiency for more effective pre-training.

  2. Gradient-Disentangled Embedding Sharing (GDES): The central innovation is GDES, which addresses the tug-of-war dynamics caused by the opposing generator and discriminator losses under conventional embedding sharing. GDES lets gradients from the MLM loss update the shared embeddings while blocking gradients from the discriminator's RTD loss, preserving the semantic coherence of the embeddings (see the sketch after this list).

  3. Empirical Validation: Extensive experiments demonstrate that GDES outperforms traditional embedding sharing methods. The improvement is consistent across training efficiency, embedding quality, and fine-tuning performance on downstream tasks.
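
The following sketch illustrates the GDES idea: the discriminator reuses the generator's embedding table through a stop-gradient plus a zero-initialized residual, so RTD gradients can only update the residual and never flow back into the shared embeddings. Class and attribute names are hypothetical.

```python
import torch
import torch.nn as nn

class GDESEmbedding(nn.Module):
    """Sketch of gradient-disentangled embedding sharing (GDES)."""

    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        # E_G: token embeddings owned by the generator, updated only by the MLM loss.
        self.gen_embed = nn.Embedding(vocab_size, hidden_size)
        # E_Delta: residual embeddings, initialized to zero, updated only by the RTD loss.
        self.delta = nn.Parameter(torch.zeros(vocab_size, hidden_size))

    def generator_embed(self, input_ids):
        return self.gen_embed(input_ids)

    def discriminator_embed(self, input_ids):
        # E_D = stop_gradient(E_G) + E_Delta: the discriminator still benefits from
        # the generator's embeddings, but its RTD gradients can only touch E_Delta,
        # so they never pull E_G away from the MLM objective.
        e_d = self.gen_embed.weight.detach() + self.delta
        return e_d[input_ids]
```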

Detailed Evaluation

Performance on NLU Tasks

The authors pre-train several variants of DeBERTaV3 (large, base, small, and xsmall) and evaluate them on prominent NLU benchmarks, including GLUE, SQuAD, RACE, ReCoRD, SWAG, and CoNLL-2003. For instance, DeBERTaV3-large sets a new state of the art on the GLUE benchmark with an average score of 91.37%, surpassing DeBERTa by 1.37% and ELECTRA by 1.91%, underscoring the model's robustness and generalization capability.
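
For practitioners, the pre-trained checkpoints can be fine-tuned with standard tooling. Below is a minimal sketch assuming the Hugging Face `transformers` library and the publicly released `microsoft/deberta-v3-large` checkpoint; the toy two-sentence batch stands in for a GLUE-style dataset.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-large", num_labels=2)

# Toy binary-classification batch; in practice this comes from a GLUE-style dataset.
batch = tokenizer(
    ["The movie was surprisingly good.", "A tedious, overlong mess."],
    padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=torch.tensor([1, 0]))
outputs.loss.backward()  # hook this up to an optimizer or the Trainer API
```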

Multilingual Extension

The authors also extend their approach to multilingual settings with mDeBERTaV3-base, trained on the CC100 dataset. This model significantly outperforms XLM-R-base on the XNLI benchmark, achieving 3.6% higher zero-shot cross-lingual accuracy, which indicates the potential of GDES and RTD to enhance multilingual understanding.
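
A sketch of the zero-shot cross-lingual setup: the multilingual model is fine-tuned on English NLI data only and then applied directly to premise/hypothesis pairs in other languages. The checkpoint name `microsoft/mdeberta-v3-base` refers to the released multilingual model; the Spanish example is purely illustrative.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/mdeberta-v3-base", num_labels=3)  # entailment / neutral / contradiction

# After fine-tuning on English NLI data only, the same classification head is
# applied unchanged to other languages.
premise = "El gato duerme en el sofá."
hypothesis = "Un animal está descansando."
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    prediction = model(**inputs).logits.argmax(dim=-1)
```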

Implications and Future Directions

Theoretical Implications:

  • The success of GDES suggests that disentangling gradients in shared embeddings leads to better alignment between the generator's and discriminator's training objectives. This could be a key insight for future pre-training techniques built on generator-discriminator setups.
  • The findings also emphasize how pre-training tasks are selected and combined, demonstrating that integrating ELECTRA's RTD with DeBERTa's architecture creates a synergistic effect that boosts overall performance.

Practical Implications:

  • Models like DeBERTaV3 can be particularly useful in resource-constrained environments where pre-training efficiency is critical. By maintaining high performance with fewer parameters and lower computational cost, DeBERTaV3 offers a more economical alternative to other large-scale transformers.
  • The improved generalization of DeBERTaV3 makes it a valuable asset for a wide variety of NLP applications, from sentiment analysis to question answering and beyond.

Conclusion

This paper successfully combines the strengths of DeBERTa and ELECTRA by addressing the inefficiencies in their training regimes via GDES. The results highlight both practical and theoretical advancements in pre-trained language models, showcasing a path forward for creating more efficient and capable NLP systems. Future research could explore further optimizations in embedding sharing mechanisms and examine the applicability of GDES to other architectures and tasks, pushing the boundaries of what is achievable with pre-trained language models.

Authors
  1. Pengcheng He
  2. Jianfeng Gao
  3. Weizhu Chen