
XLNet: Generalized Autoregressive Pretraining

Updated 24 December 2025
  • The paper introduces a permutation-based pretraining objective that enables bidirectional context modeling without token masking.
  • The model leverages two-stream attention and Transformer-XL recurrence to capture long-range dependencies and stabilize training.
  • Empirical results demonstrate that XLNet outperforms BERT and RoBERTa across benchmarks such as SQuAD, RACE, and GLUE.

Generalized Autoregressive Pretraining for Language Understanding (XLNet) is a pretraining technique that unifies the strengths of autoregressive language models and bidirectional masked (autoencoding) language models within a single generalized framework. XLNet introduces a permutation-based training objective, a two-stream attention mechanism, and segment-level recurrence, yielding empirically superior performance across a broad range of language understanding benchmarks relative to prior approaches such as BERT. The objective is formally related to probabilistically masked language modeling with a uniform prior on the masking ratio, which yields an equivalence with permutation-based autoregressive objectives and supports flexible text generation and representation learning (Yang et al., 2019, Liao et al., 2020).

1. Permutation Language Modeling Objective

XLNet’s core pretraining objective replaces conventional left-to-right (or masked) language modeling with a generalized permutation-based factorization. Let $x = (x_1, \dots, x_T)$ represent a sequence of $T$ tokens, and let $Z_T$ denote the set of all permutations of $[1, \dots, T]$. For any permutation $z \in Z_T$, the autoregressive factorization under order $z$ is:

p_\theta(x \mid z) = \prod_{t=1}^{T} p_\theta(x_{z_t} \mid x_{z_{<t}})

The XLNet objective maximizes the expected log-likelihood over permutations $z$:

\mathcal{L}_{\text{XLNet}}(\theta) = \mathbb{E}_{z \sim \mathrm{Uniform}(Z_T)} \left[ \sum_{t=1}^{T} \log p_\theta( x_{z_t} \mid x_{z_{<t}} ) \right] \quad \text{(Eq. 3, [1906.08237])}

By sharing $\theta$ across all permutations, this objective exposes every token $x_t$ to both left and right context in expectation, while each prediction remains fully autoregressive with no independence assumptions among targets.

A formal connection is established to probabilistically masked language models (PMLM) under a uniform prior on the masking ratio (“u-PMLM”), as proven in (Liao et al., 2020): training with all possible binary masks drawn from a uniform masking ratio is equivalent, up to constants, to maximizing the log-likelihood over all token orderings, a direct correspondence to XLNet’s permutation objective.
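
To make Eq. 3 concrete, the following minimal sketch (PyTorch; the names toy_model and permutation_lm_loss, and the per-sequence Python loop, are illustrative assumptions rather than the released implementation) samples one factorization order per sequence and builds the visibility mask under which each position is predicted only from the tokens that precede it in the sampled order.

import torch

def permutation_lm_loss(toy_model, x):
    """x: LongTensor of token ids, shape (batch, T). Returns the negative of Eq. 3."""
    B, T = x.shape
    total = 0.0
    for b in range(B):
        z = torch.randperm(T)                        # sample z ~ Uniform(Z_T)
        rank = torch.empty(T, dtype=torch.long)
        rank[z] = torch.arange(T)                    # rank[i] = position of token i in z
        # visible[i, j] = True iff j precedes i in the sampled order,
        # i.e. x_j belongs to x_{z_{<t}} when predicting x_i.
        visible = rank.unsqueeze(1) > rank.unsqueeze(0)
        # toy_model is assumed to return per-position vocabulary logits (T, vocab),
        # computing position i's logits only from tokens j with visible[i, j] = True.
        logits = toy_model(x[b], visible)
        log_probs = torch.log_softmax(logits, dim=-1)
        # Full prediction over all T positions; XLNet itself predicts only the
        # last ~1/K tokens of each order (partial prediction, see Section 2).
        total = total - log_probs[torch.arange(T), x[b]].mean()
    return total / B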

2. Model Architecture: Two-Stream Attention and Long-Range Recurrence

To enable permutation-based factorization, XLNet modifies the standard Transformer architecture with two-stream self-attention and integrates mechanisms from Transformer-XL for long-range context:

  • Two-stream attention: At each layer $m$, two parallel streams are maintained per position $i$:
    • The content stream $h_i^{(m)}$ encodes the actual token $x_i$ together with all allowed context, analogous to conventional self-attention.
    • The query stream $g_i^{(m)}$ encodes only the context and the position $i$, excluding the token content $x_i$, which is critical for predicting arbitrary targets under a permutation order.

  • Layer-wise updates (a code sketch follows this list):
    • $g_{z_t}^{(m)} \leftarrow \mathrm{Attention}\big(Q = g_{z_t}^{(m-1)},\; KV = [\text{memory}^{(m-1)};\, h_{z_{<t}}^{(m-1)}]\big)$
    • $h_{z_t}^{(m)} \leftarrow \mathrm{Attention}\big(Q = h_{z_t}^{(m-1)},\; KV = [\text{memory}^{(m-1)};\, h_{z_{\le t}}^{(m-1)}]\big)$

  • Prediction: The conditional distribution for position $z_t$ is computed from the query stream:

p_\theta(X_{z_t} = x \mid x_{z_{<t}}) = \frac{\exp\big(e(x)^\top g_{z_t}^{(M)}\big)}{\sum_{x'} \exp\big(e(x')^\top g_{z_t}^{(M)}\big)}

where $e(x)$ is the embedding of token $x$ and $g_{z_t}^{(M)}$ is the final-layer query stream.

  • Transformer-XL recurrence: Relative positional encoding replaces absolute position vectors, and segment-level recurrence caches the previous segment’s activations, extending the effective context beyond the 512-token input length.
  • Partial prediction and multi-segment modeling: To stabilize and accelerate pretraining, only the last $1/K$ of the tokens in each sampled factorization order are predicted, where $K$ is a hyperparameter. Multi-segment inputs are handled with relative segment encodings, and cached memory is reused only within the same document.
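
The sketch below illustrates these layer-wise updates (single attention head, one layer, shared content-stream keys and values, no relative positional encoding; the function two_stream_layer and the weights W_q, W_k, W_v are illustrative assumptions, not the released model). The two streams differ only in their queries and in whether the “diagonal” of the visibility mask (the token’s own content) is included.

import math
import torch

def two_stream_layer(h, g, mem, rank, W_q, W_k, W_v):
    """h, g: (T, d) content / query streams; mem: (M, d) cached states from the
    previous segment (assume M >= 1); rank[i] = position of token i in order z."""
    T, d = h.shape
    kv = torch.cat([mem, h], dim=0)                          # keys/values: [memory; content]
    K, V = kv @ W_k, kv @ W_v
    mem_ok = torch.ones(T, mem.shape[0], dtype=torch.bool)   # memory is visible to all positions
    before = rank.unsqueeze(1) > rank.unsqueeze(0)           # z_{<t}: strictly earlier in order
    before_or_self = rank.unsqueeze(1) >= rank.unsqueeze(0)  # z_{<=t}: includes the token itself

    def attend(Q, mask):
        scores = (Q @ K.t()) / math.sqrt(d)
        scores = scores.masked_fill(~mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ V

    # Query stream: may not see its own token content (predictive representation).
    g_new = attend(g @ W_q, torch.cat([mem_ok, before], dim=1))
    # Content stream: sees its own token, like causal self-attention under order z.
    h_new = attend(h @ W_q, torch.cat([mem_ok, before_or_self], dim=1))
    return h_new, g_new

The final-layer query stream $g_{z_t}^{(M)}$ is the vector fed into the output softmax in the prediction equation above, while the content stream is what gets cached as memory for the next segment.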

3. Theoretical Comparison with Masked and Autoregressive Models

XLNet’s permutation-based objective generalizes and strictly subsumes both conventional masked language modeling (MLM) and unidirectional autoregressive (AR) objectives. BERT’s MLM corrupts the input by replacing a subset of tokens with [MASK] and predicts the masked tokens independently, assuming conditional independence among masked positions:

\log p_\theta(\text{masked tokens} \mid \text{corrupted input}) \approx \sum_{t\,:\,m_t = 1} \log p_\theta(x_t \mid \hat{x}) \quad \text{(Eq. 2, [1906.08237])}

where $\hat{x}$ denotes the [MASK]-corrupted input and $m_t = 1$ indicates that $x_t$ is masked.

This results in no modeling of dependencies among masked tokens and a discrepancy between pretraining and downstream finetuning (since [MASK] never appears at finetuning time).

In contrast, XLNet’s full autoregressive factorization over permutation orders:

  • Requires no input corruption
  • Makes no conditional independence assumptions
  • Achieves bidirectionality in expectation over the distribution of permutation orders
  • Enables training on more (token, context) pairs than MLM

Formally, given any set of target tokens $\mathcal{T}$ and non-target (context) tokens $\mathcal{N}$, BERT trains only conditionals of the form $p(x \mid \mathcal{N})$, while XLNet covers $p(x \mid \mathcal{N} \cup \text{earlier targets})$ across permutations, yielding strictly greater coverage of modeled dependencies.
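
A short worked example from the paper makes this concrete: for the sequence [New, York, is, a, city] with prediction targets {New, York} and context {is, a, city}, the two objectives factorize as

\mathcal{J}_{\text{BERT}} = \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{is a city})

\mathcal{J}_{\text{XLNet}} = \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{New, is a city})

so XLNet captures the dependency (New, York), which BERT’s independence assumption discards (assuming, as in the paper’s example, a sampled order in which New precedes York).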

4. Training Configuration and Implementation

XLNet is pretrained on large-scale corpora including BooksCorpus, English Wikipedia, Giga5, ClueWeb 2012-B, and filtered Common Crawl, totaling approximately 32B subword tokens. Tokenization uses SentencePiece with a 32K vocabulary. Pretraining runs on 512 TPU v3 chips with a batch size of 8,192 and input length 512, optimized with Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-6}$, weight decay 0.01). The learning rate is 4e-4 with a 40K-step warmup and linear decay, and training typically runs for 500K steps (about 5.5 days).

XLNet-Base and XLNet-Large mirror the BERT-Base (12 layers, hidden size 768, 12 heads) and BERT-Large (24 layers, hidden size 1024, 16 heads) model configurations, respectively. The partial prediction constant is set to $K = 6$ (Large) and $K = 7$ (Base). Fine-tuning procedures align with standard BERT protocols, utilizing task-specific output heads.
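
For reference, the reported recipe can be collected into a single configuration sketch; the dictionary layout and key names below are illustrative assumptions, only the values come from the description above.

# Pretraining configuration as reported for XLNet-Large; the dict structure and
# key names are illustrative, not the released hyperparameter format.
XLNET_LARGE_PRETRAIN = {
    "layers": 24, "hidden_size": 1024, "attention_heads": 16,   # BERT-Large sized
    "tokenizer": "SentencePiece, 32K vocabulary",
    "sequence_length": 512,
    "batch_size": 8192,
    "train_steps": 500_000,
    "optimizer": "Adam",
    "betas": (0.9, 0.999), "epsilon": 1e-6, "weight_decay": 0.01,
    "learning_rate": 4e-4, "warmup_steps": 40_000, "lr_schedule": "linear decay",
    "partial_prediction_K": 6,   # only the last ~1/K tokens per order are predicted
    "hardware": "512 TPU v3 chips, ~5.5 days",
}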

5. Empirical Performance and Comparative Results

XLNet demonstrates empirical gains over BERT and RoBERTa across diverse NLU benchmarks, including SQuAD, RACE, GLUE, and document ranking. For fair comparison, identical data sources and fine-tuning recipes are used.

Task                     BERT-Large   XLNet-Large   RoBERTa   XLNet (RoBERTa setting)
SQuAD 1.1 (F1)           90.9         95.1          93.3      95.1
SQuAD 2.0 (F1)           81.8         90.6          89.8      90.7
RACE (accuracy)          75.1         77.4          83.2      85.4
GLUE (overall)           ~82.1        ~86.3                   matches/exceeds RoBERTa
ClueWeb09-B (NDCG@20)    30.53        31.10
IMDB (error %)           4.51         3.20

XLNet matches or exceeds RoBERTa performance on all 9 single-task GLUE dev categories, e.g., MNLI (+0.6), QQP (+0.1), MRPC (+1.8), CoLA (+1.0) (Yang et al., 2019). The permutation objective and segment-level recurrence (Transformer-XL) contribute critically, as demonstrated by ablation: disabling the memory mechanism drops RACE by ~1.1 points, while removing span-based prediction or the bidirectional data pipeline individually costs ~0.2–0.5 points.

6. Relation to Probabilistically Masked Language Models

(Liao et al., 2020) establishes that XLNet’s objective is theoretically equivalent to training a u-PMLM, a masked language model whose masking ratio is drawn from a uniform prior $p(r) \sim \mathrm{Uniform}[0,1]$. Under this scheme, the expected log-likelihood objective over all binary masks $M \subseteq [N]$ (with $N$ the sequence length and $K = |M|$ the number of masked positions),

\mathcal{L}_{\text{u-PMLM}}(X;\theta) = \sum_{M \subseteq [N]} \frac{K!\,(N-K)!}{(N+1)!} \cdot \frac{1}{K} \sum_{n \in M} \log p(x_n \mid X_{\setminus M}; \theta)

is shown to be proportional to the permutation language modeling objective in XLNet, thereby unifying the two perspectives.
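
A minimal sketch of the corresponding training step follows (assuming a generic masked_lm callable that maps a corrupted token sequence to per-position vocabulary logits, and an illustrative MASK_ID; both names are assumptions, not Liao et al.’s released code): sample a masking ratio uniformly, mask that fraction of tokens, and average the log-likelihood over the masked set.

import torch

def u_pmlm_step(masked_lm, x, MASK_ID):
    """x: LongTensor of token ids, shape (T,). Returns the negative objective for one sample."""
    T = x.shape[0]
    r = torch.rand(1).item()                 # masking ratio r ~ Uniform[0, 1] (vs. BERT's fixed 15%)
    k = max(1, int(round(r * T)))            # number of masked positions K
    masked_idx = torch.randperm(T)[:k]       # a uniformly random mask M of size K
    corrupted = x.clone()
    corrupted[masked_idx] = MASK_ID
    logits = masked_lm(corrupted)            # assumed shape (T, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    # (1/K) * sum over n in M of log p(x_n | X_{\M})
    return -log_probs[masked_idx, x[masked_idx]].mean()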

u-PMLM, like XLNet, outperforms BERT on language modeling (WikiText-103, One Billion Word) and downstream NLU tasks (GLUE, SQuAD), and supports generation in arbitrary token order. Relative positional embeddings further boost NLU performance, while generation quality (measured via perplexity and generated-sequence quality) is largely insensitive to the choice of positional encoding scheme. A plausible implication is that the core gains of XLNet stem from the full coverage of permutation-dependent context sets implied by the equivalence to u-PMLM, rather than from architectural factors alone.

7. Analysis, Ablations, and Theoretical Insights

XLNet’s ablation studies reveal:

  • The permutation-based objective is essential; replacing it with BERT’s denoising autoencoding objective on a Transformer-XL backbone yields only partial improvements over BERT.
  • Excluding segment recurrence or memory caching impairs long-range task performance (e.g., RACE drops by ~1.1 points).
  • Next-sentence prediction heads confer little benefit.
  • Novel relative-position attention patterns emerge under the permutation objective (e.g., self-exclusion stripes, stride patterns), which are unattainable with standard Transformers and may aid long-range reasoning tasks.
  • Qualitative and quantitative analyses confirm that XLNet’s expected conditioning coverage is a strict superset of both left-to-right AR and MLM models; tokens are conditioned on arbitrary subsets of “target” tokens occurring before them in permutations, yielding strictly more modeled dependencies (Yang et al., 2019).

Empirical and theoretical evidence converges to support XLNet’s pretraining paradigm as a flexible and superior substitute for fixed-order AR and masked denoising pretraining. Its formal connections to probabilistically masked models generalize these previous approaches and extend the set of learnable token-context pairs for both NLU and NLG applications (Yang et al., 2019, Liao et al., 2020).
