Large Language Diffusion Models (2502.09992v2)
Abstract: Autoregressive models (ARMs) are widely regarded as the cornerstone of LLMs. We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data masking process and a reverse process, parameterized by a vanilla Transformer to predict masked tokens. By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings establish diffusion models as a viable and promising alternative to ARMs, challenging the assumption that key LLM capabilities discussed above are inherently tied to ARMs. Project page and codes: https://ml-gsai.github.io/LLaDA-demo/.
Summary
- The paper presents LLaDA, a diffusion-based model that challenges autoregressive approaches by using a discrete token masking and reverse denoising process.
- It utilizes a standard Transformer architecture trained with an ELBO objective to iteratively reconstruct masked sequences.
- Empirical results show competitive in-context learning, instruction following, and superior reversal handling compared to traditional autoregressive models.
The paper "Large Language Diffusion Models" (2502.09992) introduces LLaDA (Large Language Diffusion Assistant), a diffusion-based model designed for large-scale language generation, presenting it as a potential alternative to the prevalent autoregressive models (ARMs) like GPT and LLaMA. The work challenges the assumption that core LLM capabilities such as in-context learning (ICL) and instruction following are intrinsically tied to the autoregressive paradigm. LLaDA is trained from scratch using standard pre-training and supervised fine-tuning (SFT) methodologies.
Methodology: LLaDA Diffusion Process
LLaDA employs a discrete diffusion process operating directly on token sequences. The core idea involves two stages: a forward noising process and a reverse denoising (generation) process.
Forward Process (Data Masking): The forward process, denoted as q(xt∣xt−1), progressively corrupts an initial clean sequence x0 by introducing mask tokens ([MASK]) over T discrete timesteps. This can be conceptualized as sampling from a transition kernel that replaces tokens with [MASK] according to a predefined schedule. Unlike continuous diffusion, which adds Gaussian noise, this process operates in the discrete token space. Let x0=(w1,w2,...,wL) be the original sequence of length L. At each step t, tokens are randomly selected and replaced with [MASK] based on a transition probability, leading to increasingly masked sequences x1,x2,...,xT. The final state xT typically approximates a sequence composed entirely or predominantly of mask tokens. The probability of transitioning from xt−1 to xt often follows a simple rule, like independently masking each non-masked token with a certain probability at step t. The overall distribution q(xt∣x0) can usually be computed in closed form, representing the probability of observing the masked sequence xt given the original x0 after t steps.
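To make this concrete, here is a minimal sketch (not the paper's code) of sampling xt from q(xt∣x0) under an assumed linear masking schedule, where each token is independently replaced with [MASK] with probability t/T:

```python
import random

MASK = "[MASK]"

def forward_mask(x0, t, num_steps, rng=random):
    """Sample x_t ~ q(x_t | x_0): independently replace each token of the
    clean sequence with [MASK] with probability t / num_steps (linear schedule)."""
    p_mask = t / num_steps
    return [MASK if rng.random() < p_mask else token for token in x0]

# Example: the sequence becomes progressively more masked as t approaches T.
x0 = ["the", "cat", "sat", "on", "the", "mat"]
for t in (2, 5, 10):
    print(t, forward_mask(x0, t, num_steps=10))
```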
Reverse Process (Mask Prediction): The reverse process, parameterized by a neural network pθ(xt−1∣xt,t), aims to reverse the masking process. Given a masked sequence xt and the timestep t, the model predicts the less masked sequence xt−1. Generation starts from a fully masked sequence xT (or a sequence sampled from the prior distribution p(xT)) and iteratively applies the learned reverse transitions pθ(xt−1∣xt,t) for t=T,T−1,...,1 to generate the final sequence x0.
Model Parameterization: LLaDA utilizes a standard Transformer architecture (referred to as "vanilla") to parameterize the reverse process pθ(xt−1∣xt,t). The Transformer takes the masked sequence xt and the timestep t (usually encoded as an embedding and added to the input) as input. Its objective is to predict the original tokens at the masked positions. Specifically, for each position i where xt(i)=[MASK], the model outputs a probability distribution over the vocabulary for the original token x0(i).
Training Objective: The model is trained by optimizing a variational lower bound (ELBO) on the data log-likelihood log pθ(x0). This objective typically simplifies to a weighted denoising (mask-prediction) objective averaged across timesteps. A common formulation minimizes the negative log-likelihood of the original tokens given the masked sequence xt:
$$
\mathcal{L}(\theta) = \mathbb{E}_{t \sim \mathcal{U}(1,T),\; x_0 \sim \mathcal{D},\; x_t \sim q(x_t \mid x_0)}\big[-\log p_\theta(x_0 \mid x_t, t)\big]
$$
Here, D is the training data distribution. The term pθ(x0∣xt,t) often simplifies to predicting the masked tokens. In practice, the model might predict the probability distribution for each masked token independently or jointly, depending on the specific factorization chosen. The paper frames this as optimizing a likelihood bound that "provides a principled generative approach for probabilistic inference."
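As a rough illustration (a sketch, not the paper's exact objective or weighting), the inner term reduces to a cross-entropy evaluated only at the positions that were masked in xt:

```python
import torch
import torch.nn.functional as F

def masked_denoising_loss(logits, x0_ids, is_masked):
    """Cross-entropy between the model's predictions for x_t and the original
    tokens x_0, averaged over the masked positions only.

    logits:    [batch, length, vocab_size] model outputs given (x_t, t)
    x0_ids:    [batch, length] original token ids
    is_masked: [batch, length] bool, True where x_t == [MASK]
    """
    batch, length, vocab_size = logits.shape
    per_token = F.cross_entropy(
        logits.reshape(-1, vocab_size),
        x0_ids.reshape(-1),
        reduction="none",
    ).reshape(batch, length)
    # Unmasked positions carry no learning signal in this formulation.
    return (per_token * is_masked).sum() / is_masked.sum().clamp(min=1)
```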
Implementation Details
Implementing LLaDA involves several key components:
1. Tokenization and Embedding: Standard subword tokenization (e.g., BPE) is used. Token embeddings are fed into the Transformer. A special [MASK] token is added to the vocabulary.
2. Masking Schedule: A noise or masking schedule determines the probability of masking tokens at each timestep t. Common schedules include linear, cosine, or square-root schedules, adapted for the discrete masking process. The choice of schedule and the total number of diffusion steps T can significantly impact performance and sampling speed.
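For illustration, a schedule can be expressed as a function mapping timestep t to a target masking fraction; the shapes below are common choices and assumptions, not the paper's specific schedule:

```python
import math

def linear_mask_ratio(t, num_steps):
    return t / num_steps

def cosine_mask_ratio(t, num_steps):
    # Masks slowly at first, then accelerates toward full masking.
    return 1.0 - math.cos(0.5 * math.pi * t / num_steps)

def sqrt_mask_ratio(t, num_steps):
    # Masks aggressively early, then levels off.
    return math.sqrt(t / num_steps)

# All three satisfy ratio(0) = 0 (clean sequence) and ratio(num_steps) = 1 (fully masked).
```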
3. Transformer Architecture: A standard decoder-only or encoder-decoder Transformer can be adapted. Given the task is to predict masked tokens based on the surrounding context (masked sequence xt), an encoder-like architecture (similar to BERT) or a non-causal decoder seems appropriate. The paper mentions a "vanilla Transformer," suggesting a standard architecture without major modifications specific to diffusion, potentially leveraging bidirectional attention over xt. Timestep t is typically incorporated via sinusoidal embeddings added to the token embeddings.
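A minimal sketch of such a parameterization, assuming a BERT-style bidirectional encoder with sinusoidal timestep embeddings added to the token embeddings (the actual LLaDA architecture and hyperparameters may differ):

```python
import math
import torch
import torch.nn as nn

class MaskPredictor(nn.Module):
    """Bidirectional Transformer mapping (x_t, t) to per-position logits over the vocabulary."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6, max_len=1024):
        super().__init__()
        self.d_model = d_model
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def timestep_embedding(self, t, device):
        """Standard sinusoidal embedding of the scalar timestep t."""
        half = self.d_model // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=device) / half)
        angles = t * freqs
        return torch.cat([torch.sin(angles), torch.cos(angles)])  # [d_model]

    def forward(self, x_t_ids, t):
        # x_t_ids: [batch, length] token ids of the (partially) masked sequence.
        batch, length = x_t_ids.shape
        positions = torch.arange(length, device=x_t_ids.device).unsqueeze(0)
        h = self.token_emb(x_t_ids) + self.pos_emb(positions)
        # Broadcast the timestep embedding over batch and positions.
        h = h + self.timestep_embedding(float(t), x_t_ids.device).view(1, 1, -1)
        h = self.encoder(h)   # no causal mask: full bidirectional attention over x_t
        return self.out(h)    # [batch, length, vocab_size]
```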
4. Training:
- Pre-training: LLaDA is pre-trained on large text corpora. During each training step:
- Sample a clean sequence x0 from the dataset.
- Sample a random timestep t∼U(1,T).
- Generate a masked sequence xt by applying the forward process: xt∼q(xt∣x0).
- Feed xt and t into the Transformer model pθ.
- Compute the loss, typically cross-entropy between the model's predicted distributions for masked tokens and the true original tokens in x0.
- Update model parameters θ using gradient descent.
- Supervised Fine-Tuning (SFT): After pre-training, LLaDA is fine-tuned on instruction-following datasets (e.g., question-answering pairs, dialogue data). The format concatenates prompt and response into a single sequence, with the diffusion masking objective applied to the response tokens while the prompt is left unmasked. This stage adapts the model to follow instructions and generate desired outputs; a sketch of this data preparation follows below.
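A sketch of that SFT data preparation, assuming only response tokens are eligible for masking (helper names and the linear schedule are illustrative assumptions):

```python
import torch

def build_sft_batch(prompt_ids, response_ids, t, num_steps, mask_id):
    """Prepare one SFT example: concatenate prompt and response, then mask only
    response tokens with probability t / num_steps.

    prompt_ids, response_ids: 1-D long tensors of token ids
    Returns (x_t_ids, x0_ids, is_masked) suitable for a masked denoising loss.
    """
    x0_ids = torch.cat([prompt_ids, response_ids])
    p_mask = t / num_steps

    # Only response positions are eligible for masking; the prompt stays clean.
    eligible = torch.zeros_like(x0_ids, dtype=torch.bool)
    eligible[len(prompt_ids):] = True
    is_masked = eligible & (torch.rand(x0_ids.shape, device=x0_ids.device) < p_mask)

    x_t_ids = torch.where(is_masked, torch.full_like(x0_ids, mask_id), x0_ids)
    return x_t_ids, x0_ids, is_masked
```

Pre-training follows the same recipe with an empty prompt, so every position is eligible for masking.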
5. Inference (Sampling): Generating text involves iteratively applying the learned reverse transition pθ(xt−1∣xt,t).
```python
import random

MASK = "[MASK]"

def generate(model, length, num_steps, vocab, rng=random):
    """Iteratively denoise a fully masked sequence into text (simplified sketch).

    model.predict(x_t, t) is assumed to return, for every position, a probability
    distribution over the vocabulary, i.e. a list of shape [length][vocab_size].
    """
    # Start from the fully masked sequence x_T.
    x_t = [MASK] * length

    for t in range(num_steps, 0, -1):
        # Predict p(x_0 | x_t, t) for every position given the current state.
        token_probabilities = model.predict(x_t, t)

        # Option 1: sample x_0 for all positions, then sample x_{t-1} from the
        #           posterior q(x_{t-1} | x_t, x_0).
        # Option 2 (simpler heuristic, used here): sample tokens for the currently
        #           masked positions and unmask only as many as the schedule allows.
        predicted_x0 = [
            rng.choices(vocab, weights=probs, k=1)[0]
            for probs in token_probabilities
        ]

        # Positions still masked at step t.
        masked_at_t = [i for i, tok in enumerate(x_t) if tok == MASK]
        # Under a linear schedule, roughly length * (t - 1) / num_steps positions
        # should remain masked at step t - 1.
        num_masked_next = (length * (t - 1)) // num_steps
        keep_masked = set(rng.sample(masked_at_t, min(num_masked_next, len(masked_at_t))))

        # Unmask the remaining masked positions with the sampled predictions.
        x_prev = list(x_t)
        for i in masked_at_t:
            if i not in keep_masked:
                x_prev[i] = predicted_x0[i]

        x_t = x_prev

    return x_t  # fully unmasked generated sequence
```
Empirical Evaluation and Results
LLaDA's performance was evaluated against self-constructed ARM baselines and existing strong LLMs.
- Scalability: The paper reports strong scalability, with LLaDA models outperforming their ARM counterparts (trained by the authors for direct comparison) across various model sizes. This suggests that the diffusion framework is amenable to large-scale training.
- In-Context Learning: LLaDA 8B demonstrated competitive performance on ICL benchmarks compared to established ARM models like LLaMA3 8B. This finding is significant as ICL is often considered a hallmark capability strongly associated with autoregressive generation.
- Instruction Following: After SFT, LLaDA showed proficiency in instruction following, illustrated through case studies involving multi-turn dialogues. This indicates that the diffusion framework can be successfully adapted via SFT to align with user intentions, similar to ARMs.
- Reversal Curse: A notable claim is LLaDA's ability to address the "reversal curse": the tendency of standard ARMs to fail on relations stated in the reverse of their training order (e.g., a model trained that "A is B" often cannot answer "What is B?" with A). The paper specifically highlights that LLaDA surpasses GPT-4o on a reversal poem completion task, where the model must produce the line preceding a given line. This suggests that attending bidirectionally to the full masked sequence xt during the reverse process may be advantageous for tasks requiring non-left-to-right reasoning or manipulation.
Discussion and Implications
The introduction of LLaDA presents several implications for the LLM field:
- Viability of Diffusion Models: LLaDA provides empirical evidence that diffusion models are a viable architectural choice for large-scale language modeling, capable of achieving performance comparable to strong ARMs on key benchmarks like ICL and instruction following. This potentially broadens the architectural search space for future foundational models.
- Challenging ARM Dominance: The results question the necessity of the autoregressive formulation for achieving advanced LLM capabilities. If capabilities like ICL are not exclusive to ARMs, it opens avenues for exploring alternative generative frameworks that might offer different trade-offs.
- Potential Advantages: Diffusion models might possess inherent advantages for certain tasks. The reported success on the reversal curse suggests better handling of bidirectional dependencies or non-sequential relationships compared to left-to-right ARMs. Furthermore, diffusion models offer possibilities for controllable generation by manipulating the sampling process or conditioning information. Non-autoregressive generation, characteristic of diffusion sampling (predicting multiple tokens in parallel within each step), could potentially lead to faster inference than sequential token-by-token generation in ARMs, although iterative refinement over many steps is still required.
- Limitations and Trade-offs: Diffusion models typically require multiple iterative refinement steps for generation, which can be computationally expensive; although each step predicts many tokens in parallel, total generation time depends on the number of steps T. The complexity of the sampling process and the choice of masking schedule are critical design decisions. Evaluating computational efficiency (training and inference FLOPs, latency) against optimized ARMs remains important.
Conclusion
LLaDA (2502.09992) positions diffusion models as a competitive alternative framework for building LLMs. By demonstrating strong performance in scalability, in-context learning, instruction following, and specific tasks like sequence reversal, the work challenges the prevailing dominance of autoregressive models. While further research is needed to fully understand the trade-offs and optimize diffusion-based LLMs, LLaDA signifies a potentially significant development in exploring diverse architectures for advanced language generation.