
Embedded Language Flows: Diffusion Models for NLP

Updated 13 May 2026
  • Embedded Language Flows (ELF) is a continuous-time diffusion model for language that operates in the embedding space using flow-matching ODEs and classifier-free guidance.
  • It integrates techniques from image-domain diffusion, achieving efficient generation with only 32 steps compared to over 1000 steps in traditional discrete models.
  • Empirical evaluations demonstrate ELF's superiority in generation quality, data efficiency, and speed, making it a promising approach for continuous language modeling.

Embedded Language Flows (ELF) is a class of continuous-time flow-matching diffusion models for language that operates in a continuous embedding space rather than directly on discrete tokens. ELF is designed to reuse techniques that have succeeded in image-domain diffusion models, such as flow-matching ODEs and classifier-free guidance. It stays in the embedding space throughout generation and resorts to a single discrete decoding step at the end. Empirical evaluations indicate that ELF achieves better generation quality, efficiency, and data utilization than both discrete and continuous diffusion language models (DLMs), offering new directions for continuous generative modeling in language domains (Hu et al., 11 May 2026).

1. Mathematical Formulation

The underlying space for ELF is the sequence of token embeddings. A sequence of discrete tokens $s = [s_1, \ldots, s_L]$, with $s_i \in V$, is mapped to an embedding $x \in \mathbb{R}^{L \times d}$ via a (typically frozen) encoder:

$$x = \mathrm{encode}(s)$$

Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ of the same shape as $x$ is introduced, and a linear "rectified-flow" interpolant is defined for continuous time $t \in [0, 1]$:

$$z_t = t\,x + (1-t)\,\epsilon$$

For $t=0$, $z_0 = \epsilon$ is pure noise; for $t=1$, $z_1 = x$ recovers the embedded sequence.
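The interpolant can be checked numerically at its two endpoints. The following is a minimal sketch using plain Python lists; the helper names `interpolate` and `sample_noise` and the toy embedding values are illustrative assumptions, not part of ELF.

```python
import random

def interpolate(x, eps, t):
    """Rectified-flow interpolant z_t = t * x + (1 - t) * eps, elementwise."""
    return [[t * xi + (1.0 - t) * ei for xi, ei in zip(x_row, e_row)]
            for x_row, e_row in zip(x, eps)]

def sample_noise(length, dim, rng):
    """Standard Gaussian noise with the same (L, d) shape as the embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]

rng = random.Random(0)
x = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]  # toy embeddings (assumed), L=2, d=3
eps = sample_noise(2, 3, rng)

z0 = interpolate(x, eps, 0.0)      # pure noise
z1 = interpolate(x, eps, 1.0)      # the clean embedding
z_half = interpolate(x, eps, 0.5)  # halfway between noise and data
```

At `t = 0.5` every entry of `z_half` is the average of the corresponding data and noise entries, which is what makes the path between the two endpoints a straight line.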

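Flow matching models generation as the ODE:

$$\frac{dz_t}{dt} = v(z_t, t)$$

where the true velocity field, obtained by differentiating the interpolant $z_t = t\,x + (1-t)\,\epsilon$ with respect to $t$, is:

$$v = x - \epsilon = \frac{x - z_t}{1 - t}$$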

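ELF employs an $x$-prediction parameterization, using a Transformer-based neural network $x_\theta(z_t, t)$ to approximate $x$, yielding a predicted velocity

$$v_\theta(z_t, t) = \frac{x_\theta(z_t, t) - z_t}{1 - t}$$

with the core objective being mean-squared denoising:

$$\mathcal{L}_{\mathrm{MSE}} = \mathbb{E}_{x, \epsilon, t}\left[\lVert v_\theta(z_t, t) - (x - \epsilon) \rVert^2\right]$$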
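To make the link between a clean-embedding prediction and a velocity concrete, here is a small sketch under the linear interpolant $z_t = t\,x + (1-t)\,\epsilon$. It assumes the velocity is recovered as $(\hat{x} - z_t)/(1-t)$, and it uses a perfect "oracle" prediction in place of a trained network. The function names and toy values are hypothetical.

```python
def velocity_from_x_pred(x_pred, z_t, t):
    """Convert a clean-embedding prediction into a velocity: (x_pred - z_t) / (1 - t)."""
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1) to avoid dividing by zero")
    return [(xp - z) / (1.0 - t) for xp, z in zip(x_pred, z_t)]

def mse(a, b):
    """Mean squared error between two equally sized vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

x = [1.0, -2.0, 0.5]     # toy clean embedding (assumed values)
eps = [0.3, 0.1, -0.7]   # toy noise draw (assumed values)
t = 0.25
z_t = [t * xi + (1 - t) * ei for xi, ei in zip(x, eps)]

true_v = [xi - ei for xi, ei in zip(x, eps)]   # d z_t / dt = x - eps
v_oracle = velocity_from_x_pred(x, z_t, t)     # velocity from a perfect prediction
loss_oracle = mse(v_oracle, true_v)            # zero up to rounding error
```

A perfect prediction of $x$ reproduces the true velocity $x - \epsilon$, so the mean-squared loss vanishes up to floating-point rounding. The division by $1 - t$ is why the sketch rejects $t = 1$.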

At $t = 1$, ELF performs a single-step decoding by corrupting $x$ at a per-token rate to obtain a perturbed embedding $\tilde{x}$, after which the network output $\hat{x} = x_\theta(\tilde{x}, 1)$ is projected to token logits by a learned matrix $W$, with cross-entropy loss:

$$p_\theta(s_i \mid \tilde{x}) = \mathrm{softmax}(\hat{x}_i W)_{s_i}$$

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{i=1}^{L} \log p_\theta(s_i \mid \tilde{x})$$

The network weights are shared for both MSE and CE branches, with a mode flag determining which loss is applied.
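The decoding branch amounts to a linear projection followed by a softmax cross-entropy against the original token ids. The sketch below illustrates that computation in plain Python. The matrix `W`, the vectors in `hidden`, and the use of a mean over positions are illustrative assumptions rather than ELF's actual configuration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax using the log-sum-exp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def project(h, W):
    """Project one d-dimensional vector h to |V| logits with W of shape (d, |V|)."""
    return [sum(h[k] * W[k][j] for k in range(len(h))) for j in range(len(W[0]))]

def cross_entropy(hidden, W, targets):
    """Mean negative log-likelihood of the target token ids over positions."""
    total = 0.0
    for h, tok in zip(hidden, targets):
        total -= log_softmax(project(h, W))[tok]
    return total / len(targets)

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # toy projection (assumed), d=2, |V|=3
hidden = [[5.0, 0.0], [0.0, 5.0]]       # toy network outputs for two positions

loss_right = cross_entropy(hidden, W, [0, 1])    # targets match the largest logits
loss_wrong = cross_entropy(hidden, W, [1, 0])    # targets mismatch
loss_uniform = cross_entropy([[0.0, 0.0]], W, [2])  # all logits equal
```

When all logits are equal the loss is the log of the vocabulary size, which is a convenient sanity check for any implementation of this branch.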

2. Model Architecture and Optimization

The denoiser $x_\theta$ is a Transformer configured as follows:

  • 12/24/32 layers for ELF-B/M/L, with hidden size 768/1056/1280 and multi-head self-attention
  • SwiGLU feed-forward, RMSNorm, rotary positional embeddings
  • Lightweight bottleneck projection

Training alternates between two modes, selected per batch with a Bernoulli coin:

  • Denoise mode (80%): applies $\mathcal{L}_{\mathrm{MSE}}$, samples $t$ from a logit-normal distribution, noise scale 2.0
  • Decode mode (20%): applies $\mathcal{L}_{\mathrm{CE}}$ at $t = 1$ with per-token corruption, noise scale 5.0 (1.0 for conditional tasks)
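The per-batch mode selection can be sketched as a biased coin flip followed by a time draw. This is an illustrative sketch only: it assumes the logit-normal distribution has underlying normal parameters mean 0 and standard deviation 1 (the source does not state them), and that decode mode always uses $t = 1$. The function names are hypothetical.

```python
import math
import random

def sample_mode(rng, p_denoise=0.8):
    """Pick the training mode for one batch with a Bernoulli coin."""
    return "denoise" if rng.random() < p_denoise else "decode"

def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw t = sigmoid(n) with n ~ N(mean, std^2); mean and std are assumed values."""
    n = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-n))

def sample_training_time(rng):
    """Return (mode, t): a logit-normal t for denoising, t = 1 for decoding."""
    mode = sample_mode(rng)
    t = sample_logit_normal(rng) if mode == "denoise" else 1.0
    return mode, t

rng = random.Random(0)
draws = [sample_training_time(rng) for _ in range(10_000)]
frac_denoise = sum(mode == "denoise" for mode, _ in draws) / len(draws)
```

Over many batches `frac_denoise` settles near 0.8, and every denoise-mode time lies strictly inside the unit interval, so the velocity conversion never divides by zero during training.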

Optimization uses the Muon optimizer (learning rate 0.002, batch size 512, zero weight decay, 5 epochs on OpenWebText). Self-conditioning is applied with probability 0.5: the previous prediction of $x$ is concatenated (without gradient) as extra input to $x_\theta$. Training and inference share weights across both branches; inference follows the ODE

$$\frac{dz_t}{dt} = v_\theta(z_t, t)$$

from $t = 0$ to $t = 1$, then applies final decoding.
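A minimal Euler integrator for this ODE is sketched below. It assumes a uniform time grid and a denoiser callable that returns a prediction of the clean embedding. A fixed "oracle" stands in for the trained network, and self-conditioning and the final decoding step are omitted. All names are hypothetical.

```python
def euler_sample(x_pred_fn, z0, num_steps=32):
    """Integrate dz/dt = (x_pred - z) / (1 - t) from t = 0 to t = 1 with Euler steps."""
    z = list(z0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                      # largest t used is 1 - dt, so 1 - t > 0
        x_hat = x_pred_fn(z, t)         # predicted clean embedding
        v = [(xh - zi) / (1.0 - t) for xh, zi in zip(x_hat, z)]
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z

target = [1.0, -2.0, 0.5]               # toy clean embedding (assumed values)

def oracle(z, t):
    """Stand-in denoiser that always predicts the target perfectly."""
    return target

z_final = euler_sample(oracle, [0.3, 0.1, -0.7], num_steps=32)
```

With a perfect predictor the velocity is constant along the straight path from noise to data, so Euler integration lands on the target after the last step. A learned denoiser only approximates this, which is why the number and placement of steps matter in practice.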

3. Integration of Image-Domain Techniques

ELF adapts classifier-free guidance (CFG) from image diffusion models. During inference, CFG combines conditional and unconditional velocity fields:

$$v_{\mathrm{cfg}} = v_{\mathrm{uncond}} + w\,(v_{\mathrm{cond}} - v_{\mathrm{uncond}})$$

Rather than requiring two forward passes, ELF is trained to predict the guided velocity directly:

$$v_\theta(z_t, t, w) \approx v_{\mathrm{cfg}}$$

Guidance scales $w$ are sampled at random during training. Self-conditioning is used instead of explicit conditional fields, serving as a conditioning mechanism in the absence of external signals.
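For reference, the standard two-pass CFG combination that ELF avoids at inference time can be written in a few lines. The sketch assumes the common convention in which a guidance scale of 0 yields the unconditional velocity and a scale of 1 yields the conditional one; the convention ELF itself uses may differ. `cfg_velocity` is a hypothetical helper name.

```python
def cfg_velocity(v_cond, v_uncond, w):
    """Classifier-free guidance: v_uncond + w * (v_cond - v_uncond), elementwise."""
    return [vu + w * (vc - vu) for vc, vu in zip(v_cond, v_uncond)]

v_cond = [1.0, 2.0]     # toy conditional velocity (assumed values)
v_uncond = [0.0, 1.0]   # toy unconditional velocity (assumed values)

v_w0 = cfg_velocity(v_cond, v_uncond, 0.0)  # unconditional
v_w1 = cfg_velocity(v_cond, v_uncond, 1.0)  # conditional
v_w2 = cfg_velocity(v_cond, v_uncond, 2.0)  # extrapolates past the conditional field
```

Scales above 1 push the sample further along the direction that separates the conditional field from the unconditional one. A model trained to output the guided velocity directly needs only one forward pass per step instead of the two that this combination requires.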

4. Inference, Sampling, and Computational Efficiency

ELF supports both ODE (Euler steps) and SDE-inspired samplers. At each step, the SDE sampler re-injects Gaussian noise into the current state $z_t$, then evaluates $v_\theta$ on the perturbed state and updates $z_t$. SDE sampling is empirically effective for small step counts.

Time steps $t_k$ are chosen via a logit-normal grid, which allocates more steps to the region of higher uncertainty.
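One way to build a logit-normal time grid is to place the interior time points at evenly spaced quantiles of a logit-normal distribution. The sketch below does this with assumed parameters (underlying normal with mean 0 and standard deviation 1), which concentrates steps around $t = 0.5$. The parameters and construction ELF actually uses are not given here, so `logit_normal_grid` is a hypothetical illustration.

```python
import math
from statistics import NormalDist

def logit_normal_grid(num_steps, mean=0.0, std=1.0):
    """Return num_steps + 1 increasing times from 0 to 1.

    Interior points are logit-normal quantiles at evenly spaced probabilities,
    so steps are denser where the logit-normal density is high.
    """
    normal = NormalDist(mean, std)
    interior = []
    for k in range(1, num_steps):
        q = normal.inv_cdf(k / num_steps)           # quantile of the underlying normal
        interior.append(1.0 / (1.0 + math.exp(-q))) # map to (0, 1) with a sigmoid
    return [0.0] + interior + [1.0]

grid = logit_normal_grid(32)
gaps = [b - a for a, b in zip(grid, grid[1:])]
```

Shifting `mean` moves the dense region toward either end of the interval, which is how such a grid could be tuned to spend more steps wherever the denoiser is least certain.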

ELF achieves a generation perplexity (Gen PPL) of approximately 24 with only 32 steps, in contrast with the roughly 1000 steps required by prior DLMs (e.g., MDLM, Duo). The cost of each step is comparable to a single Transformer forward pass, so overall inference is 10–30× faster than token-based DLM samplers.

5. Empirical Evaluation

Experimental results on OpenWebText and conditional tasks demonstrate ELF’s data and computational efficiency as well as generation quality.

| Setting | Baseline model | Metric | ELF-B value | Best competing value |
| --- | --- | --- | --- | --- |
| Unconditional gen. (32 steps) | GPT-2 Large | Gen PPL | 24.1 | MDLM (1000 steps): >30 |
| WMT’14 De→En | AR / MDLM / Duo | BLEU | 26.4 | 25.2 / 18.4 / 21.3 |
| XSum (ROUGE-1/2/L) | best baseline | ROUGE-1 | 36.0 | 33.4 |

ELF demonstrates improved sample diversity (higher unigram entropy at comparable PPL) relative to discrete step DLMs, advancing the quality–diversity tradeoff.

ELF requires only 45 billion training tokens (5 epochs × 9B text), compared with 500–600B for earlier DLMs.

6. Ablations and Observed Insights

Systematic ablations confirm critical design choices:

  • Embeddings: Pretrained contextual (T5) encodings significantly outperform randomly initialized or token-only embeddings.
  • Prediction target: $x$-prediction is stable across 512–1024 dimensions; $v$-prediction degrades above 512; $\epsilon$-prediction fails.
  • Bottleneck: 128-dimensional bottleneck optimal; 32 reduces diversity, 512 harms Gen PPL.
  • Denoise/decode: 80% MSE, 20% CE yields best results.
  • Samplers: SDE-based sampling superior to ODE at low step counts (around 32 steps).
  • CFG scale: An empirical sweep over the guidance scale traces the optimal frontier; separate scales are used for unconditional and conditional generation.
  • Conditioning: In-context tokens achieve performance similar to adaLN-Zero with 30% fewer parameters.
  • Optimizer: Muon demonstrates faster loss reduction and improved PPL–entropy tradeoff compared to AdamW.

7. Limitations and Future Prospects

ELF currently depends on a frozen encoder. Joint training of the encoder could enhance flexibility and expressiveness, particularly for end-to-end or multimodal generative tasks. Scaling ELF to billion-parameter flow models and longer context lengths remains open. Future research directions include theoretical analysis of flow matching in discrete output domains for improved complexity guarantees and extending ELF to fully end-to-end encoder–decoder frameworks or multimodal generation tasks (Hu et al., 11 May 2026).
