- The paper introduces replaced token detection, a pre-training task in which a discriminator predicts, for every input token, whether it was replaced by a generator sample, offering better compute efficiency than traditional MLM.
- It employs a generator-discriminator framework that reduces the pre-train/fine-tune discrepancy: the discriminator used for downstream tasks never sees [MASK] tokens.
- Experiments on benchmarks such as GLUE and SQuAD show that ELECTRA outperforms comparable models, with ELECTRA-Small scoring about 5 GLUE points higher than a similarly sized BERT.
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Introduction
The paper "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators" (2003.10555) introduces a novel approach to pre-training text encoders that contrasts with traditional masked language modeling (MLM) tasks used in models such as BERT. While MLM replaces certain input tokens with the [MASK] symbol and predicts the masked tokens' identities, ELECTRA modifies this by training a model to discriminate between tokens that were part of the original input and those replaced by a generator model. This approach, termed replaced token detection, enhances computational efficiency as it operates over all tokens in the input sequence, unlike MLM which focuses on a subset of tokens.
Methodology
ELECTRA comprises two components: a generator and a discriminator. The generator, typically a small masked language model, proposes plausible token replacements for masked-out positions, while the discriminator predicts, for each token in the corrupted sequence, whether it was replaced. The setup resembles a GAN, but ELECTRA does not train the generator adversarially because of the difficulties of applying GANs to text (Figure 1).
Figure 1: An overview of replaced token detection.
The generator and discriminator are trained jointly: the generator corrupts the input sequence via its masked-LM predictions, and the discriminator learns to detect which tokens were replaced. After pre-training, the generator is discarded and only the discriminator is fine-tuned for downstream tasks. Importantly, this avoids the pre-train fine-tune discrepancy found in BERT, where [MASK] tokens appear during pre-training but never during fine-tuning.
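The joint objective can be sketched as follows, assuming hypothetical `generator` and `discriminator` modules that return MLM logits and per-token logits respectively (a sketch of the training step, not the released code). The paper weights the discriminator term with λ = 50, and no gradient flows back through the sampling step, which is why the generator is not trained adversarially.

```python
import torch
import torch.nn.functional as F

def pretraining_step(generator, discriminator, input_ids, mask_positions,
                     disc_weight=50.0):
    """One joint pre-training step (illustrative sketch; `generator` and
    `discriminator` are assumed PyTorch modules, not the released models)."""
    # Generator: standard masked-LM cross-entropy on the masked-out positions.
    gen_logits = generator(input_ids, mask_positions)        # (B, T, vocab)
    mlm_loss = F.cross_entropy(gen_logits[mask_positions],
                               input_ids[mask_positions])

    # Sample replacement tokens from the generator's output distribution.
    # No gradient flows through the sampling step, so the generator is not
    # trained adversarially against the discriminator.
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(mask_positions, sampled, input_ids)
    labels = (corrupted != input_ids).float()                 # 1 = replaced

    # Discriminator: binary classification over *all* tokens.
    disc_logits = discriminator(corrupted)                    # (B, T)
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, labels)

    # Combined objective; the discriminator term is weighted heavily
    # (the paper uses lambda = 50) because its per-token loss is small.
    return mlm_loss + disc_weight * disc_loss
```

After pre-training, only the discriminator would be kept and fine-tuned on downstream tasks.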
Experiments and Results
The utility of ELECTRA is demonstrated through extensive experiments on benchmarks such as GLUE and SQuAD. These show that ELECTRA not only reduces computational requirements relative to models such as BERT and XLNet but also achieves higher accuracy given the same compute budget, particularly at smaller model sizes. Remarkably, ELECTRA-Small outperformed a comparably sized BERT model by 5 points on GLUE and even surpassed the much larger GPT model (Figure 2).
Figure 2: Replaced token detection pre-training consistently outperforms masked language model pre-training given the same compute budget.
Large-scale experiments further show that ELECTRA remains compute-efficient at scale. For example, ELECTRA-Large was competitive with RoBERTa and XLNet while using less than a quarter of their pre-training compute. These findings indicate that the discriminative objective extracts rich contextual representations beyond those learned through the generative MLM task.
Model Extensions
To optimize performance, several modifications to the ELECTRA framework are evaluated, including weight sharing between the generator and discriminator, smaller generators, and alternative training algorithms. ELECTRA benefits notably from sharing token embeddings between the generator and discriminator, and it performs best when the generator is substantially smaller than the discriminator, roughly a quarter to half its size (Figure 3); a sketch of the embedding-sharing idea follows the figure.

Figure 3: GLUE scores for different generator/discriminator sizes.
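The embedding-sharing variant can be sketched as follows. This is a minimal illustration with hypothetical class names and sizes; the paper shares token (and positional) embeddings sized to the discriminator's hidden dimension when the generator is smaller, and the down-projection shown here is an assumption of this sketch rather than a confirmed detail of the released implementation.

```python
import torch.nn as nn

VOCAB_SIZE = 30522
DISC_HIDDEN, GEN_HIDDEN = 768, 256   # generator roughly 1/3 the discriminator width

# One embedding table, sized to the discriminator, used by both networks.
shared_token_embeddings = nn.Embedding(VOCAB_SIZE, DISC_HIDDEN)

class SmallGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embeddings = shared_token_embeddings
        # One way to reconcile the width mismatch is to project the shared
        # embeddings down to the generator's hidden size (an assumption of
        # this sketch).
        self.input_proj = nn.Linear(DISC_HIDDEN, GEN_HIDDEN)
        # ... transformer layers sized for the smaller generator would follow

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embeddings = shared_token_embeddings   # same table, no projection
        # ... full-size transformer layers would follow
```

The paper attributes the gain from sharing to the generator's MLM softmax updating all token embeddings, whereas the discriminator alone would only update embeddings of tokens that appear in its inputs.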
Implications and Future Directions
ELECTRA's approach holds significant implications for the development of efficient NLP models. By minimizing compute while maximizing accuracy, it broadens access to powerful language representations without prohibitive computational costs. Future work could explore extensions such as auto-regressive generators and cross-modal applications to other data forms.
Conclusion
The paper provides a compelling case for reconsidering the paradigm of pre-training text encoders. By making replaced token detection the central pre-training task, ELECTRA strikes a favorable balance between computational efficiency and task performance. The work also argues that efficiency should be reported alongside absolute performance, setting a precedent for future work on language representation learning.