DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Introduction
In the landscape of pre-trained language models, the trend has shifted decisively toward Transformer architectures, as evidenced by BERT, RoBERTa, and GPT-3. These models have repeatedly shown that targeted architectural innovations yield performance gains across a wide range of NLP tasks. In this context, DeBERTa (Decoding-enhanced BERT with disentangled attention) represents a significant advancement, proposing two key architectural enhancements: a disentangled attention mechanism and an enhanced mask decoder.
Disentangled Attention Mechanism
Unlike traditional BERT, which sums content and positional embeddings into a single vector per token, DeBERTa keeps these two forms of information separate. Each token is represented by two distinct vectors: one encoding its content and one encoding its position. Attention weights are then computed from disentangled matrices that handle content-to-content, content-to-position, and position-to-content interactions independently, so the attention weight between two tokens depends on both what they are and how far apart they are. This disentanglement yields richer contextual encoding and more informative attention patterns.
Existing models such as RoBERTa and XLNet rely primarily on either absolute or relative positional encodings folded into a single representation. By modeling content and position with separate projection matrices, DeBERTa captures dependencies between tokens more effectively; for example, the dependency between the words "deep" and "learning" is much stronger when they appear next to each other than when they occur in different parts of the text. Empirical analyses show that this approach reduces redundancy and enhances representational capacity, which is particularly visible in the attention patterns the model produces.
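To make the three terms concrete, below is a minimal single-head sketch of how the disentangled attention scores could be assembled, assuming content hidden states Hc, a shared relative-position embedding table Pr, and a precomputed matrix rel_idx of bucketed relative distances; the projection names (Wq_c, Wk_c, Wq_r, Wk_r) are illustrative, not the released implementation.

```python
import torch

def disentangled_attention_scores(Hc, Pr, Wq_c, Wk_c, Wq_r, Wk_r, rel_idx):
    """Single-head sketch of DeBERTa-style disentangled attention scores.

    Hc:      (N, d)  content hidden states for a sequence of length N
    Pr:      (2k, d) relative-position embeddings shared across layers
    W*:      (d, d)  projection matrices (illustrative names)
    rel_idx: (N, N)  long tensor, rel_idx[i, j] = bucketed relative distance delta(i, j)
    """
    Qc, Kc = Hc @ Wq_c, Hc @ Wk_c                 # content queries and keys
    Qr, Kr = Pr @ Wq_r, Pr @ Wk_r                 # position queries and keys
    c2c = Qc @ Kc.T                               # content-to-content term
    c2p = torch.gather(Qc @ Kr.T, 1, rel_idx)     # content-to-position: Qc[i] . Kr[delta(i, j)]
    p2c = torch.gather(Kc @ Qr.T, 1, rel_idx).T   # position-to-content: Kc[j] . Qr[delta(j, i)]
    d = Hc.size(-1)
    return (c2c + c2p + p2c) / (3 * d) ** 0.5     # scores scaled by sqrt(3d)
```

In the full model these scores feed the usual softmax and multi-head machinery; the sketch omits masking, multiple heads, and batching for brevity.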
Enhanced Mask Decoder
The standard masked-language-modeling (MLM) pre-training objective used in BERT incorporates absolute positional information at the input layer, which can limit the model's ability to generalize. DeBERTa addresses this with an Enhanced Mask Decoder (EMD), which injects absolute position embeddings only at the decoding layer, right before the softmax that predicts the masked tokens. The model thus integrates both relative and absolute positional information: the former throughout the intermediate layers, the latter only during final token prediction. This modification helps the model capture sentence-level syntactic dependencies. Consider the sentence "a new store opened beside the new mall" with "store" and "mall" masked: their local contexts are nearly identical, so absolute positions are needed to distinguish the subject from the object and make accurate predictions even when contextual clues are ambiguous.
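As a rough illustration of where the absolute positions enter, the sketch below adds an absolute-position embedding to the encoder output and runs it through a single decoding layer before projecting to vocabulary logits; the class name, layer choice, and hyperparameters are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class EnhancedMaskDecoderSketch(nn.Module):
    """Illustrative sketch: absolute positions are injected only at decoding time,
    after the relative-position (disentangled-attention) encoder has run."""

    def __init__(self, d_model=768, n_heads=12, vocab_size=30522, max_len=512):
        super().__init__()
        self.abs_pos = nn.Embedding(max_len, d_model)            # absolute position table
        self.decoder_layer = nn.TransformerEncoderLayer(         # stand-in decoding layer
            d_model, n_heads, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)            # projection to vocab logits

    def forward(self, hidden, position_ids):
        # hidden:       (B, N, d) outputs of the disentangled-attention encoder
        # position_ids: (B, N)    absolute position indices
        query = hidden + self.abs_pos(position_ids)              # absolute info enters here
        decoded = self.decoder_layer(query)                      # decode with positional signal
        return self.lm_head(decoded)                             # logits fed to the MLM softmax
```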
Virtual Adversarial Training
For fine-tuning, DeBERTa employs a virtual adversarial training method called Scale-invariant Fine-Tuning (SiFT), which improves model generalization. SiFT normalizes the word embeddings before applying adversarial perturbations, which stabilizes training when embedding norms vary widely across models and tokens. The method is particularly useful for very large models, and it was applied to the 1.5-billion-parameter DeBERTa model for its SuperGLUE submission.
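A minimal sketch of a SiFT-style regularization step is given below, assuming a Hugging-Face-style model that accepts inputs_embeds and returns .logits; the perturbation schedule, step sizes, and loss weighting here are simplifications for illustration, not the authors' released procedure.

```python
import torch
import torch.nn.functional as F

def sift_regularizer(model, embeddings, clean_logits, eps=1e-3, step=1e-3):
    """One SiFT-style virtual adversarial step: perturb *normalized* embeddings
    and penalize divergence between clean and perturbed predictions."""
    normed = F.layer_norm(embeddings, embeddings.shape[-1:])         # scale-invariant view
    noise = (torch.randn_like(normed) * eps).requires_grad_()        # initial random perturbation
    adv_logits = model(inputs_embeds=normed + noise).logits          # assumed HF-style call
    kl = F.kl_div(F.log_softmax(adv_logits, dim=-1),
                  F.softmax(clean_logits.detach(), dim=-1),
                  reduction="batchmean")
    grad, = torch.autograd.grad(kl, noise)                           # direction of max divergence
    adv = (noise + step * grad / (grad.norm() + 1e-8)).detach()      # ascend, then stop gradient
    adv_logits = model(inputs_embeds=normed + adv).logits
    return F.kl_div(F.log_softmax(adv_logits, dim=-1),               # regularizer added to task loss
                    F.softmax(clean_logits.detach(), dim=-1),
                    reduction="batchmean")
```

In practice the returned divergence term would be added to the standard task loss with a weighting coefficient during fine-tuning.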
Empirical Performance and Implications
Comprehensive empirical evaluations show that DeBERTa outperforms strong baselines such as RoBERTa and XLNet across a broad set of benchmarks. Representative gains are observed on GLUE, SuperGLUE, and other NLU benchmarks:
- NLU tasks: Trained on half the data used for RoBERTa-large, DeBERTa showed consistent improvements, notably +0.9% accuracy on MNLI, +2.3% on SQuAD v2.0, and a sizeable +3.6% on RACE.
- SuperGLUE Benchmark: The larger DeBERTa model (1.5 billion parameters) surpassed human performance, achieving an average score of 89.9 compared to the human baseline of 89.8, and the ensemble model further improved to 90.3.
These results underscore DeBERTa's strength not only in natural language understanding but also in generation: on the Wikitext-103 language-modeling benchmark, it reduces perplexity from RoBERTa's 21.6 to 19.5.
Future Developments
While DeBERTa marks meaningful progress toward more general AI by modeling syntactic and contextual relationships effectively, open research avenues remain. One is the explicit incorporation of compositional structures, which could bring the model closer to human-like generalization. How neural and symbolic computation can be combined dynamically, in a manner similar to human language processing, remains an open challenge worth exploring.
Conclusion
DeBERTa stands as a landmark in the evolution of pre-trained language models, demonstrating substantial gains through its disentangled attention mechanism and enhanced mask decoder. These architectural advances, coupled with robust fine-tuning methods like SiFT, give the model strong capabilities for understanding and generating human language. As NLP research continues to evolve, DeBERTa offers a solid foundation for building more efficient, general-purpose systems that perform well across a diverse array of linguistic tasks.