
XLNet: Generalized Autoregressive Pretraining

Updated 24 December 2025
  • The paper introduces a permutation-based pretraining objective that enables bidirectional context modeling without token masking.
  • The model leverages two-stream attention and Transformer-XL recurrence to capture long-range dependencies and stabilize training.
  • Empirical results demonstrate that XLNet outperforms BERT and RoBERTa across benchmarks such as SQuAD, RACE, and GLUE.

Generalized Autoregressive Pretraining for Language Understanding (XLNet) is a pretraining technique that unifies the strengths of autoregressive language models and bidirectional masked (autoencoding) language models within a single generalized framework. XLNet introduces a permutation-based training objective, a two-stream attention mechanism, and segment-level recurrence, yielding empirically superior performance across a broad range of language understanding benchmarks relative to prior approaches such as BERT. The objective is formally related to probabilistically masked language modeling with a uniform prior on the masking ratio, which yields an equivalence with permutation-based autoregressive objectives and supports flexible text generation and representation learning (Yang et al., 2019, Liao et al., 2020).

1. Permutation Language Modeling Objective

XLNet’s core pretraining objective replaces conventional left-to-right (or masked) language modeling with a generalized permutation-based factorization. Let $x = (x_1, \dots, x_T)$ represent a sequence of $T$ tokens, and let $Z_T$ denote the set of all permutations of $[1, \dots, T]$. For any permutation $z \in Z_T$, the autoregressive factorization under order $z$ is:

p_\theta(x \mid z) = \prod_{t=1}^{T} p_\theta(x_{z_t} \mid x_{z_{<t}})

The XLNet objective maximizes the expected log-likelihood over permutations $z$:

\mathcal{L}_{\text{XLNet}}(\theta) = \mathbb{E}_{z \sim \mathrm{Uniform}(Z_T)} \left[ \sum_{t=1}^{T} \log p_\theta( x_{z_t} \mid x_{z_{<t}} ) \right] \quad \text{(Eq. 3, [1906.08237])}

By sharing $\theta$ across all permutations, this objective exposes every token $x_t$ to both left and right context in expectation, while each prediction remains fully autoregressive with no independence assumptions among targets.

A formal connection is established to probabilistically masked language models (PMLM) under a uniform prior on the masking ratio (“u-PMLM”), as proven in (Liao et al., 2020): training with all possible binary masks drawn from a uniform masking ratio is equivalent, up to constants, to maximizing the log-likelihood over all token orderings, a direct correspondence to XLNet’s permutation objective.
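
To make Eq. 3 concrete, the following minimal sketch (PyTorch; the names toy_model and permutation_lm_loss, and the per-sequence Python loop, are illustrative assumptions rather than the released implementation) samples one factorization order per sequence and builds the visibility mask under which each position is predicted only from the tokens that precede it in the sampled order.

import torch

def permutation_lm_loss(toy_model, x):
    """x: LongTensor of token ids, shape (batch, T). Returns the negative of Eq. 3."""
    B, T = x.shape
    total = 0.0
    for b in range(B):
        z = torch.randperm(T)                        # sample z ~ Uniform(Z_T)
        rank = torch.empty(T, dtype=torch.long)
        rank[z] = torch.arange(T)                    # rank[i] = position of token i in z
        # visible[i, j] = True iff j precedes i in the sampled order,
        # i.e. x_j belongs to x_{z_{<t}} when predicting x_i.
        visible = rank.unsqueeze(1) > rank.unsqueeze(0)
        # toy_model is assumed to return per-position vocabulary logits (T, vocab),
        # computing position i's logits only from tokens j with visible[i, j] = True.
        logits = toy_model(x[b], visible)
        log_probs = torch.log_softmax(logits, dim=-1)
        # Full prediction over all T positions; XLNet itself predicts only the
        # last ~1/K tokens of each order (partial prediction, see Section 2).
        total = total - log_probs[torch.arange(T), x[b]].mean()
    return total / B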

2. Model Architecture: Two-Stream Attention and Long-Range Recurrence

To enable permutation-based factorization, XLNet modifies the standard Transformer architecture with two-stream self-attention and integrates mechanisms from Transformer-XL for long-range context:

  • Two-stream attention: At each layer $m$, two parallel streams are maintained per position $i$:
    • The content stream $h_i^{(m)}$ encodes the actual token $x_i$ together with all allowed context, analogous to conventional self-attention.
    • The query stream $g_i^{(m)}$ encodes only the context and the position $i$, excluding the token content $x_i$, which is critical for predicting arbitrary targets under a permutation order.

  • Layer-wise updates (a code sketch follows this list):
    • $g_{z_t}^{(m)} \leftarrow \mathrm{Attention}\big(Q = g_{z_t}^{(m-1)},\; KV = [\text{memory}^{(m-1)};\, h_{z_{<t}}^{(m-1)}]\big)$
    • $h_{z_t}^{(m)} \leftarrow \mathrm{Attention}\big(Q = h_{z_t}^{(m-1)},\; KV = [\text{memory}^{(m-1)};\, h_{z_{\le t}}^{(m-1)}]\big)$

  • Prediction: The conditional distribution for position $z_t$ is computed from the query stream:

p_\theta(X_{z_t} = x \mid x_{z_{<t}}) = \frac{\exp\big(e(x)^\top g_{z_t}^{(M)}\big)}{\sum_{x'} \exp\big(e(x')^\top g_{z_t}^{(M)}\big)}

where $e(x)$ is the embedding of token $x$ and $g_{z_t}^{(M)}$ is the final-layer query stream.

  • Transformer-XL recurrence: Relative positional encoding replaces absolute position vectors, and segment-level recurrence caches the previous segment’s activations, extending the effective context beyond the 512-token input length.
  • Partial prediction and multi-segment modeling: To stabilize and accelerate pretraining, only the last $1/K$ of the tokens in each sampled factorization order are predicted, where $K$ is a hyperparameter. Multi-segment inputs are handled with relative segment encodings, and cached memory is reused only within the same document.
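
The sketch below illustrates these layer-wise updates (single attention head, one layer, shared content-stream keys and values, no relative positional encoding; the function two_stream_layer and the weights W_q, W_k, W_v are illustrative assumptions, not the released model). The two streams differ only in their queries and in whether the “diagonal” of the visibility mask (the token’s own content) is included.

import math
import torch

def two_stream_layer(h, g, mem, rank, W_q, W_k, W_v):
    """h, g: (T, d) content / query streams; mem: (M, d) cached states from the
    previous segment (assume M >= 1); rank[i] = position of token i in order z."""
    T, d = h.shape
    kv = torch.cat([mem, h], dim=0)                          # keys/values: [memory; content]
    K, V = kv @ W_k, kv @ W_v
    mem_ok = torch.ones(T, mem.shape[0], dtype=torch.bool)   # memory is visible to all positions
    before = rank.unsqueeze(1) > rank.unsqueeze(0)           # z_{<t}: strictly earlier in order
    before_or_self = rank.unsqueeze(1) >= rank.unsqueeze(0)  # z_{<=t}: includes the token itself

    def attend(Q, mask):
        scores = (Q @ K.t()) / math.sqrt(d)
        scores = scores.masked_fill(~mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ V

    # Query stream: may not see its own token content (predictive representation).
    g_new = attend(g @ W_q, torch.cat([mem_ok, before], dim=1))
    # Content stream: sees its own token, like causal self-attention under order z.
    h_new = attend(h @ W_q, torch.cat([mem_ok, before_or_self], dim=1))
    return h_new, g_new

The final-layer query stream $g_{z_t}^{(M)}$ is the vector fed into the output softmax in the prediction equation above, while the content stream is what gets cached as memory for the next segment.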

3. Theoretical Comparison with Masked and Autoregressive Models

XLNet’s permutation-based objective generalizes and strictly subsumes both conventional masked language modeling (MLM) and unidirectional autoregressive (AR) objectives. BERT’s MLM corrupts the input by replacing a subset of tokens with [MASK] and predicts the masked tokens independently, assuming conditional independence among masked positions:

\log p_\theta(\text{masked tokens} \mid \text{corrupted input}) \approx \sum_{t\,:\,m_t = 1} \log p_\theta(x_t \mid \hat{x}) \quad \text{(Eq. 2, [1906.08237])}

where $\hat{x}$ denotes the [MASK]-corrupted input and $m_t = 1$ indicates that $x_t$ is masked.

This results in no modeling of dependencies among masked tokens and a discrepancy between pretraining and downstream finetuning (since [MASK] never appears at finetuning time).

In contrast, XLNet’s full autoregressive factorization over permutation orders:

  • Requires no input corruption
  • Makes no conditional independence assumptions
  • Achieves bidirectionality in expectation over the distribution of permutation orders
  • Enables training on more (token, context) pairs than MLM

Formally, given any set of target tokens $\mathcal{T}$ and non-target (context) tokens $\mathcal{N}$, BERT trains only conditionals of the form $p(x \mid \mathcal{N})$, while XLNet covers $p(x \mid \mathcal{N} \cup \text{earlier targets})$ across permutations, yielding strictly greater coverage of modeled dependencies.
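
A short worked example from the paper makes this concrete: for the sequence [New, York, is, a, city] with prediction targets {New, York} and context {is, a, city}, the two objectives factorize as

\mathcal{J}_{\text{BERT}} = \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{is a city})

\mathcal{J}_{\text{XLNet}} = \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{New, is a city})

so XLNet captures the dependency (New, York), which BERT’s independence assumption discards (assuming, as in the paper’s example, a sampled order in which New precedes York).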

4. Training Configuration and Implementation

XLNet is pretrained on large-scale corpora including BooksCorpus, English Wikipedia, Giga5, ClueWeb 2012-B, and filtered Common Crawl, totaling approximately 32B subword tokens. Tokenization uses SentencePiece with a 32K vocabulary. Pretraining runs on 512 TPU v3 chips with a batch size of 8,192 and input length 512, optimized with Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-6}$, weight decay 0.01). The learning rate is 4e-4 with a 40K-step warmup and linear decay, and training typically runs for 500K steps (about 5.5 days).

XLNet-Base and XLNet-Large mirror the BERT-Base (12 layers, hidden size 768, 12 heads) and BERT-Large (24 layers, hidden size 1024, 16 heads) model configurations, respectively. The partial prediction constant is set to $K = 6$ (Large) and $K = 7$ (Base). Fine-tuning procedures align with standard BERT protocols, utilizing task-specific output heads.
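
For reference, the reported recipe can be collected into a single configuration sketch; the dictionary layout and key names below are illustrative assumptions, only the values come from the description above.

# Pretraining configuration as reported for XLNet-Large; the dict structure and
# key names are illustrative, not the released hyperparameter format.
XLNET_LARGE_PRETRAIN = {
    "layers": 24, "hidden_size": 1024, "attention_heads": 16,   # BERT-Large sized
    "tokenizer": "SentencePiece, 32K vocabulary",
    "sequence_length": 512,
    "batch_size": 8192,
    "train_steps": 500_000,
    "optimizer": "Adam",
    "betas": (0.9, 0.999), "epsilon": 1e-6, "weight_decay": 0.01,
    "learning_rate": 4e-4, "warmup_steps": 40_000, "lr_schedule": "linear decay",
    "partial_prediction_K": 6,   # only the last ~1/K tokens per order are predicted
    "hardware": "512 TPU v3 chips, ~5.5 days",
}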

5. Empirical Performance and Comparative Results

XLNet demonstrates empirical gains over BERT and RoBERTa across diverse NLU benchmarks, including SQuAD, RACE, GLUE, and document ranking. For fair comparison, identical data sources and fine-tuning recipes are used.

Task                     BERT-Large   XLNet-Large   RoBERTa   XLNet (RoBERTa setting)
SQuAD 1.1 (F1)           90.9         95.1          93.3      95.1
SQuAD 2.0 (F1)           81.8         90.6          89.8      90.7
RACE (accuracy)          75.1         77.4          83.2      85.4
GLUE (overall)           ~82.1        ~86.3                   matches/exceeds RoBERTa
ClueWeb09-B (NDCG@20)    30.53        31.10
IMDB (error %)           4.51         3.20

XLNet matches or exceeds RoBERTa performance on all 9 single-task GLUE dev categories, e.g., MNLI (+0.6), QQP (+0.1), MRPC (+1.8), CoLA (+1.0) (Yang et al., 2019). The permutation objective and segment-level recurrence (Transformer-XL) contribute critically, as demonstrated by ablation: disabling the memory mechanism drops RACE by ~1.1 points, while removing span-based prediction or the bidirectional data pipeline individually costs ~0.2–0.5 points.

6. Relation to Probabilistically Masked Language Models

(Liao et al., 2020) establishes that XLNet’s objective is theoretically equivalent to training a u-PMLM, a masked language model whose masking ratio is drawn from a uniform prior $p(r) \sim \mathrm{Uniform}[0,1]$. Under this scheme, the expected log-likelihood objective over all binary masks $M \subseteq [N]$ (with $N$ the sequence length and $K = |M|$ the number of masked positions),

\mathcal{L}_{\text{u-PMLM}}(X;\theta) = \sum_{M \subseteq [N]} \frac{K!\,(N-K)!}{(N+1)!} \cdot \frac{1}{K} \sum_{n \in M} \log p(x_n \mid X_{\setminus M}; \theta)

is shown to be proportional to the permutation language modeling objective in XLNet, thereby unifying the two perspectives.
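
A minimal sketch of the corresponding training step follows (assuming a generic masked_lm callable that maps a corrupted token sequence to per-position vocabulary logits, and an illustrative MASK_ID; both names are assumptions, not Liao et al.’s released code): sample a masking ratio uniformly, mask that fraction of tokens, and average the log-likelihood over the masked set.

import torch

def u_pmlm_step(masked_lm, x, MASK_ID):
    """x: LongTensor of token ids, shape (T,). Returns the negative objective for one sample."""
    T = x.shape[0]
    r = torch.rand(1).item()                 # masking ratio r ~ Uniform[0, 1] (vs. BERT's fixed 15%)
    k = max(1, int(round(r * T)))            # number of masked positions K
    masked_idx = torch.randperm(T)[:k]       # a uniformly random mask M of size K
    corrupted = x.clone()
    corrupted[masked_idx] = MASK_ID
    logits = masked_lm(corrupted)            # assumed shape (T, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    # (1/K) * sum over n in M of log p(x_n | X_{\M})
    return -log_probs[masked_idx, x[masked_idx]].mean()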

u-PMLM, like XLNet, outperforms BERT on language modeling (WikiText-103, One Billion Word) and downstream NLU tasks (GLUE, SQuAD), and supports generation in arbitrary token order. Relative positional embeddings further boost NLU performance, while generation quality (measured via perplexity and generated-sequence quality) is largely insensitive to the choice of positional encoding scheme. A plausible implication is that the core gains of XLNet stem from the full coverage of permutation-dependent context sets implied by the equivalence to u-PMLM, rather than from architectural factors alone.

7. Analysis, Ablations, and Theoretical Insights

XLNet’s ablation studies reveal:

  • The permutation-based objective is essential; replacing it with BERT’s denoising autoencoding objective on a Transformer-XL backbone yields only partial improvements over BERT.
  • Excluding segment recurrence or memory caching impairs long-range task performance (e.g., RACE drops by ~1.1 points).
  • Next-sentence prediction heads confer little benefit.
  • Novel relative-position attention patterns emerge under the permutation objective (e.g., self-exclusion stripes, stride patterns), which are unattainable with standard Transformers and may aid long-range reasoning tasks.
  • Qualitative and quantitative analyses confirm that XLNet’s expected conditioning coverage is a strict superset of both left-to-right AR and MLM models; tokens are conditioned on arbitrary subsets of “target” tokens occurring before them in permutations, yielding strictly more modeled dependencies (Yang et al., 2019).

Empirical and theoretical evidence converges to support XLNet’s pretraining paradigm as a flexible and superior substitute for fixed-order AR and masked denoising pretraining. Its formal connections to probabilistically masked models generalize these previous approaches and extend the set of learnable token-context pairs for both NLU and NLG applications (Yang et al., 2019, Liao et al., 2020).
