Replaced-Token Detection (RTD)

Updated 28 December 2025
  • RTD is a pre-training objective that detects replaced tokens using a generator-discriminator framework, offering dense per-token binary signals.
  • It underpins models like ELECTRA, CodeBERT, and BudgetLongformer, delivering faster convergence and improved downstream performance.
  • RTD extends its application to multimodal and geometric domains, supporting tasks such as prompt-based learning and robust commonsense reasoning.

Replaced-Token Detection (RTD) is a pre-training objective for transformer-based models in which a discriminative model learns to identify whether each token in a sequence is original or has been replaced by a generator. This design provides a dense, per-position learning signal, in contrast to masked language modeling (MLM) where the objective is restricted to reconstructing a small fraction of masked tokens. RTD forms the foundation of several high-performance architectures, including ELECTRA and its derivatives, and has been extended beyond natural language text to multimodal and non-linguistic domains.

1. Fundamental Principles of RTD

RTD operates by coupling two modules:

  • A generator G, typically a small masked language model, receives a sequence with a subset of positions masked and predicts plausible candidates for each mask.
  • A discriminator D ingests a “corrupted” input x̃, in which some tokens have been replaced by samples from G, and predicts for every position the probability that the token is original (r_t = 0) or replaced (r_t = 1).

The RTD loss is the sum of binary cross-entropy losses over all tokens:

L_RTD = − Σ_{t=1}^{n} [ r_t · log P(r_t = 1 | x̃) + (1 − r_t) · log P(r_t = 0 | x̃) ]

where r_t is the true label at position t, indicating whether x_t has been replaced. This structure engages the full sequence length during training, ensuring greater sample efficiency and faster convergence relative to classical MLM (He et al., 2021).
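
The loss above can be sketched in a few lines of plain Python. This is a minimal illustration of the per-position binary cross-entropy, not an excerpt from any of the cited implementations; the function name and example values are hypothetical.

```python
import math

def rtd_loss(probs_replaced, labels):
    """Binary cross-entropy summed over every token position.

    probs_replaced[t] is the discriminator's predicted probability that
    token t was replaced; labels[t] is 1 if it was replaced, else 0.
    Unlike MLM, every position contributes to the loss.
    """
    assert len(probs_replaced) == len(labels)
    loss = 0.0
    for p, r in zip(probs_replaced, labels):
        loss -= r * math.log(p) + (1 - r) * math.log(1 - p)
    return loss

# A 4-token sequence where only position 2 was replaced:
labels = [0, 0, 1, 0]
probs = [0.1, 0.2, 0.8, 0.05]
print(round(rtd_loss(probs, labels), 4))  # -> 0.6029
```

Note that the loss is small here because the discriminator is confident and correct at every position; a confident wrong prediction at any single position would dominate the sum.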

2. Architectural Instantiations and Training Protocols

RTD has been embedded within a variety of architectures across NLP, code, and multimodal settings. The most widely cited realization is ELECTRA, where G and D are both transformers, but G is smaller (fewer layers, parameters).

  • ELECTRA: Samples ~15% of input positions for masking; G replaces each with a candidate; D performs dense per-position binary classification (He et al., 2021).
  • DeBERTaV3: Employs gradient-disentangled embedding sharing, which decouples generator and discriminator gradients at the embedding level, resolving “tug-of-war” conflicts and yielding improved convergence and downstream accuracy (He et al., 2021).
  • CodeBERT: For code and natural language, RTD is adapted using n-gram language-model generators for both modalities, enabling efficient learning from large unimodal corpora via discriminator-based RTD loss (Feng et al., 2020).
  • BudgetLongformer: Incorporates RTD given Longformer's efficient sparse attention, scaling RTD to long-input domains such as legal text (sequence lengths up to 4096) and achieving fast convergence and strong downstream summarization and classification performance with drastically reduced compute (Niklaus et al., 2022).
  • SpacTor-T5: Combines RTD with span corruption in a two-stage curriculum, removing RTD after an initial phase. This hybrid objective on T5-like encoder-decoders dramatically reduces pretraining cost while maintaining or improving task performance (Ye et al., 24 Jan 2024).
  • Token Drop in NMT: Integrates RTD as an auxiliary objective where tokens are randomly replaced with a special marker (e.g., <unk>), and the encoder is regularized to predict dropped positions (Zhang et al., 2020).
  • Point-RTD: Extends the RTD paradigm to 3D point clouds, where tokens are geometric embeddings and corruption samples are drawn from other shapes or classes. The RTD discriminator operates on contextualized embeddings, and subsequent “cleaning” is performed via a generator, with notable gains in reconstruction and classification (Stone et al., 21 Sep 2025).
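
The ELECTRA-style corruption step shared by most of these instantiations can be sketched as follows. This is a toy illustration, not code from any cited model: `generator_sample` stands in for the small generator, and the convention that a "lucky guess" (the generator reproducing the original token) is labelled original follows the ELECTRA recipe.

```python
import random

def corrupt(tokens, generator_sample, mask_ratio=0.15, seed=0):
    """ELECTRA-style corruption sketch (toy, not the real models).

    Selects ~mask_ratio of positions, asks a 'generator' for a plausible
    replacement at each, and labels a position replaced (1) only if the
    sampled token differs from the original.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    corrupted, labels = list(tokens), [0] * len(tokens)
    for t in positions:
        sample = generator_sample(corrupted, t)
        if sample != tokens[t]:
            corrupted[t] = sample
            labels[t] = 1
    return corrupted, labels

# Hypothetical degenerate generator that always proposes "the":
toy_generator = lambda seq, t: "the"
corrupted, labels = corrupt("the cat sat on the mat".split(), toy_generator,
                            mask_ratio=0.34)
```

The discriminator is then trained on `(corrupted, labels)` pairs with the per-position binary loss; in practice the generator is a jointly trained masked language model rather than a fixed rule.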

3. Supervised and Prompt-Based Reformulations

RTD is not merely a pre-training strategy. It can be adapted for prompt-based, few-shot, and zero-shot learning:

  • Prompt Reformulation for Classification: By constructing inputs with all candidate label description words (“great”, “terrible”, etc.) filling pre-defined template slots, RTD models rank classes by the probability that each candidate word is “original” in context. This approach, used in both zero-shot and few-shot modes, delivers superior performance to MLM-based prompt methods (Li et al., 2022, Ni et al., 2022).
  • Regression Reformulation: For regression, RTD reduces the task to a binary choice between poles (e.g., “low” vs “high”), and interpolates the prediction from the normalized non-replacement probabilities (Li et al., 2022).
  • Zero-Shot Commonsense Reasoning: Using the Non-Replacement Confidence (NRC) metric, which averages the negative log-probabilities that each token is original, RTD-based models surpass MLM-based methods in commonsense reasoning, tuple and sentence selection, as well as multiple QA tasks. NRC avoids the sum-to-one constraint of softmax, allowing independent binary “in-context” judgments that better capture contextual fit, especially for rare or semantically distinctive tokens (Peng et al., 2022).
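
One possible scoring sketch for the prompt-based reformulation: fill each candidate label word into the template, read off the discriminator's per-token "original" probabilities, and rank candidates by average log-probability of being original (an NRC-style confidence; higher means the sequence fits its context better). The function names and probability values below are hypothetical.

```python
import math

def nrc_score(original_probs):
    """Confidence sketch: mean log-probability that each token is judged
    'original' by the discriminator. No softmax over candidates, so each
    label word gets an independent in-context judgment."""
    return sum(math.log(p) for p in original_probs) / len(original_probs)

def rank_labels(scores_per_label):
    """Pick the candidate label whose filled template scores highest."""
    return max(scores_per_label, key=lambda lbl: nrc_score(scores_per_label[lbl]))

# Hypothetical discriminator outputs for "It was [great|terrible]. <review>":
scores = {
    "great":    [0.95, 0.90, 0.92],  # all tokens judged likely original
    "terrible": [0.95, 0.40, 0.92],  # label word looks replaced in context
}
print(rank_labels(scores))  # -> great
```

The absence of a sum-to-one constraint is visible here: both templates can score high or low independently, which is what lets rare but well-fitting label words win.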

4. Corruption Strategies and Objective Variants

The generator’s mechanism for corruption and the discriminator’s input design vary across domains:

  • Generator Sampling: In standard RTD, generator samples are drawn from a small MLM (as in ELECTRA). CodeBERT uses n-gram LMs per modality for efficient corruption (Feng et al., 2020).
  • Special Token Replacement: In Token Drop, certain tokens are replaced by a fixed special token, enabling the RTD head to train on uniform corruption (as with <unk>), rather than model-generated samples (Zhang et al., 2020).
  • Cross-Sample Mixup: Point-RTD utilizes batchwise mixup—replacing geometric token embeddings with others from different shapes/classes—to suit unordered and non-linguistic domains (Stone et al., 21 Sep 2025).
  • Hybrid Objectives: SpacTor combines (i) span-masking, (ii) per-token corruption by generator sampling, and (iii) RTD loss, then curriculum switches to span corruption only after a set number of steps. Lambda weights are calibrated to balance loss contributions (Ye et al., 24 Jan 2024).
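
The simplest of these corruption variants, special-token replacement as in Token Drop, can be sketched directly; contrast it with generator sampling, which requires a trained model. This is an illustrative toy, not the cited implementation.

```python
import random

def token_drop(tokens, drop_token="<unk>", ratio=0.15, seed=0):
    """Token Drop sketch: replace a random subset of tokens with a fixed
    special token instead of generator samples; labels mark the dropped
    positions so an RTD-style head can learn to recover them."""
    rng = random.Random(seed)
    out, labels = list(tokens), [0] * len(tokens)
    for i in range(len(tokens)):
        if rng.random() < ratio:
            out[i], labels[i] = drop_token, 1
    return out, labels
```

Because every corrupted position carries the same surface form, the discriminative head must rely entirely on context, which is the regularization effect exploited in NMT.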

5. Sample Efficiency and Empirical Benefits

RTD yields denser supervision than MLM, as every token position provides an independent binary supervision signal. This increase in gradient flow consistently results in:

  • Faster convergence: Pre-training steps reduced (e.g., SpacTor achieves baseline downstream results with 40% fewer FLOPs and 50% fewer steps (Ye et al., 24 Jan 2024); BudgetLongformer reaches stable pretrain loss in 100k steps (Niklaus et al., 2022)).
  • Downstream Gains: RTD-based pretraining improves or matches prior SOTA on GLUE, SQuAD, and a multitude of NLU tasks (He et al., 2021).
  • Prompt Learning Outperformance: In prompt-based few-shot and zero-shot tasks, RTD methods outperform MLM-based alternatives by significant margins, particularly on binary classification and commonsense reasoning (Li et al., 2022, Ni et al., 2022, Peng et al., 2022).

Empirical results comparing RTD and MLM (all from He et al., 2021):

Model/Objective       SQuAD v2.0 F1 (base)   GLUE Avg (large)
MLM (base)            82.5                   —
RTD + ES (base)       86.3                   —
RTD + GDES (base)     87.2                   —
MLM (large)           —                      88.8
ELECTRA (large)       —                      89.46
DeBERTaV3 (large)     —                      91.37

RTD consistently yields either higher or comparable accuracy with faster convergence and better sample efficiency than MLM alone.

6. Domain Extensions and Modality Generalization

RTD has seen successful adaptation beyond classical text-based transformers:

  • Programming Languages: CodeBERT integrates RTD for cross-modal (natural and programming language) pre-training, enabling robust code search and documentation generation; the RTD objective enhances robustness to noisy, unimodal, and code-only data sources (Feng et al., 2020).
  • Point Clouds and Geometry: Point-RTD demonstrates that the RTD paradigm can handle unordered, high-dimensional geometric data by tokenizing local spatial regions, employing corruption using cross-sample semantics, and yielding state-of-the-art reconstruction and classification (Stone et al., 21 Sep 2025).
  • Efficient Models for Long Inputs: BudgetLongformer scales RTD to very long sequences using windowed attention, achieving domain-specific LLM pre-training with dramatically reduced computational resources (Niklaus et al., 2022).

7. RTD in Practice: Implementation and Hyperparameters

Certain practical decisions are common across RTD literature:

  • Discriminator Head: A light, per-token binary classifier—typically a single-layer MLP or linear projection—is appended to the transformer encoder at every position.
  • Generator Design: Always shallower and/or thinner than the discriminator to both speed corruption and reduce parameter redundancy.
  • Loss Weighting: Joint loss optimization usually weights the RTD component (e.g., λ=50 in ELECTRA and BudgetLongformer, λ₁=10, λ₂=10 in SpacTor-T5) to balance gradients and ensure the smaller signals from the discriminator are not swamped by the generator MLM loss.
  • Hyperparameter Sensitivity: Performance is highly sensitive to template and label word design for prompt-based learning, mask/corruption ratios, corruption method (random, class-based, or context-aware), and in multimodal regimes, to efficient negative example generation (Li et al., 2022, He et al., 2021, Ye et al., 24 Jan 2024).
  • Training Efficiency: Due to the per-token signal, RTD-based models often require significantly fewer pre-training steps to reach downstream parity or superiority compared to traditional MLM.
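
The loss-weighting convention described above amounts to a single weighted sum; a minimal sketch, assuming the λ=50 value reported for ELECTRA and BudgetLongformer:

```python
def joint_loss(mlm_loss, rtd_loss, lam=50.0):
    """Joint objective sketch: the generator's MLM loss plus lambda times
    the discriminator's RTD loss. The large lambda compensates for the
    small per-token magnitude of the binary RTD signal so it is not
    swamped by the MLM term."""
    return mlm_loss + lam * rtd_loss

# With lambda=50, a small RTD loss still contributes meaningfully:
total = joint_loss(mlm_loss=6.0, rtd_loss=0.1)  # 6.0 + 50 * 0.1 = 11.0
```

SpacTor-T5's two-weight variant (λ₁, λ₂ over its two corruption terms) follows the same pattern with one weight per loss component.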

References

  • "Pre-trained Token-replaced Detection Model as Few-shot Learner" (Li et al., 2022)
  • "DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing" (He et al., 2021)
  • "ELECTRA is a Zero-Shot Learner, Too" (Ni et al., 2022)
  • "Evaluate Confidence Instead of Perplexity for Zero-shot Commonsense Reasoning" (Peng et al., 2022)
  • "CodeBERT: A Pre-Trained Model for Programming and Natural Languages" (Feng et al., 2020)
  • "BudgetLongformer: Can we Cheaply Pretrain a SotA Legal LLM From Scratch?" (Niklaus et al., 2022)
  • "SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection" (Ye et al., 24 Jan 2024)
  • "Token Drop mechanism for Neural Machine Translation" (Zhang et al., 2020)
  • "Point-RTD: Replaced Token Denoising for Pretraining Transformer Models on Point Clouds" (Stone et al., 21 Sep 2025)