ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators (2003.10555v1)

Published 23 Mar 2020 in cs.CL

Abstract: Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.

Analysis of the ELECTRA Pre-training Framework

The paper "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators" introduces an innovative approach to self-supervised learning for language representation, diverging from traditional masked LLMing (MLM) techniques like those employed in BERT. Instead of training models to generate masked tokens, ELECTRA optimizes them to distinguish between real input tokens and plausible replacements sampled from a small generator network. This methodological shift positions the model as a discriminator, enhancing its sample efficiency and computational performance.

Methodological Advancements

ELECTRA introduces a pre-training task known as replaced token detection. Unlike MLM, where a subset of tokens is masked out and reconstructed, ELECTRA replaces some tokens with samples from a generator and trains the model to predict whether each token in the input is authentic or synthesized. Because the training signal comes from every input token rather than the roughly 15% that MLM masks out, each example provides far more learning signal for the same amount of compute.
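
To make the task concrete, the following is a minimal sketch in PyTorch of how a replaced-token-detection example could be constructed. The function name build_rtd_example, the uniform stand-in for the generator's output distribution, and the specific token ids are illustrative assumptions rather than the authors' released code.

```python
import torch

def build_rtd_example(input_ids, mask_prob=0.15, vocab_size=30522):
    """Corrupt `input_ids` and return (corrupted_ids, labels) for the discriminator."""
    # 1. Choose positions to corrupt (~15% of tokens, as in BERT-style MLM).
    mask = torch.rand(input_ids.shape) < mask_prob

    # 2. A real system would run the small generator on the masked sequence and
    #    sample replacements from its output distribution; here the generator is
    #    faked with a uniform distribution over the vocabulary.
    uniform_logits = torch.zeros(*input_ids.shape, vocab_size)
    sampled = torch.distributions.Categorical(logits=uniform_logits).sample()

    # 3. Splice the sampled tokens into the chosen positions.
    corrupted = torch.where(mask, sampled, input_ids)

    # 4. Discriminator labels: 1 = token was replaced, 0 = token is original.
    #    A sampled token that happens to equal the original counts as original,
    #    so labels are defined by value, not by position.
    labels = (corrupted != input_ids).long()
    return corrupted, labels

# Toy usage: a batch of 2 sequences of length 8 with made-up token ids.
ids = torch.randint(1000, 2000, (2, 8))
corrupted, labels = build_rtd_example(ids)
print(corrupted.shape, labels.float().mean())
```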

The framework comprises a generator, which performs masked language modeling, and a discriminator, which identifies whether tokens have been replaced. Notably, the generator is trained via maximum likelihood rather than adversarially, since backpropagating through discrete sampled tokens is problematic and adversarial training performed worse in the authors' experiments. To keep the added cost modest, ELECTRA uses a generator that is smaller than the discriminator and shares token embeddings between the two networks; after pre-training, the generator is discarded and only the discriminator is fine-tuned on downstream tasks.
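
Under the same hypothetical naming as the sketch above, the joint objective can be illustrated as the sum of the generator's MLM cross-entropy on the masked positions and the discriminator's per-token binary cross-entropy over all positions, with the discriminator term up-weighted (the paper reports a weight of 50). The function below is a sketch of that combination, not the released implementation.

```python
import torch
import torch.nn.functional as F

def electra_loss(gen_logits, original_ids, mask, disc_logits, rtd_labels,
                 disc_weight=50.0):
    """Combine the generator (MLM) and discriminator (RTD) losses."""
    # Generator loss: cross-entropy only on the masked-out positions.
    mlm_loss = F.cross_entropy(
        gen_logits[mask],        # (num_masked, vocab_size)
        original_ids[mask],      # (num_masked,)
    )
    # Discriminator loss: binary cross-entropy on *every* input token,
    # which is why the task is defined over all tokens rather than the
    # masked subset.
    disc_loss = F.binary_cross_entropy_with_logits(
        disc_logits,             # (batch, seq_len)
        rtd_labels.float(),      # (batch, seq_len), 1 = replaced
    )
    return mlm_loss + disc_weight * disc_loss

# Toy usage with random tensors (batch=2, seq_len=8, vocab=30522).
B, T, V = 2, 8, 30522
gen_logits = torch.randn(B, T, V)
disc_logits = torch.randn(B, T)
original_ids = torch.randint(0, V, (B, T))
mask = torch.rand(B, T) < 0.15
mask[0, 0] = True  # ensure at least one masked position in the toy demo
rtd_labels = torch.randint(0, 2, (B, T))
print(electra_loss(gen_logits, original_ids, mask, disc_logits, rtd_labels))
```

Because the discriminator term is computed over every position, all tokens in every example contribute gradient signal, which is the source of the sample-efficiency gains described above.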

Empirical Evaluation

ELECTRA's efficacy is substantiated through extensive experiments on benchmarks such as GLUE and SQuAD. Results consistently indicate that ELECTRA outperforms contemporary approaches such as BERT, GPT, RoBERTa, and XLNet at matched model size and compute, with particularly strong gains for smaller, more computationally efficient models. For instance, ELECTRA-Small, trained on a single GPU for four days, surpasses GPT on the GLUE benchmark even though GPT was trained with roughly 30x more compute.

When scaled up, ELECTRA remains competitive with state-of-the-art models while requiring less pre-training compute. ELECTRA-400K matches the performance of RoBERTa and XLNet while using less than a quarter of their pre-training compute, and the more extensively trained ELECTRA-1.75M achieves state-of-the-art results on SQuAD 2.0, showing that the approach continues to pay off at larger scales.

Theoretical and Practical Implications

Practically, ELECTRA provides a more compute-efficient alternative to existing pre-training methods, reducing the cost and time barriers often associated with developing state-of-the-art language models. The model's ability to derive meaningful representations from all input tokens could democratize access to NLP models, enabling broader research and application by institutions with limited computational resources.

Theoretically, the results contribute insights into the potential for discriminative pre-training tasks to outperform generative ones in certain contexts. By dispensing with [MASK] tokens, ELECTRA mitigates the discrepancy between pre-training and fine-tuning inputs, addressing a prominent limitation of previous models such as BERT.

Future Prospects

Given ELECTRA's promising performance, several avenues for future development arise. Refinements could explore different configurations of generator-discriminator architecture, possibly integrating adversarial strategies more effectively without the pitfalls observed in current experimentation. Additionally, extending ELECTRA to multilingual domains could vastly broaden its applicability, capitalizing on its sample efficiency in diverse linguistic settings.

The paper's contributions are notable, demonstrating that a shift from generative to discriminative pre-training can improve representation quality without inflating computational costs. As the landscape of NLP continues to evolve, methodologies like ELECTRA offer significant gains in both efficiency and accuracy, pushing the boundaries of what is achievable with pre-trained language models.

Authors (4)
  1. Kevin Clark (16 papers)
  2. Minh-Thang Luong (32 papers)
  3. Quoc V. Le (128 papers)
  4. Christopher D. Manning (169 papers)