Embedded Language Flows: Diffusion Models for NLP

Updated 13 May 2026

Embedded Language Flows (ELF) is a continuous-time diffusion model for language that operates in the embedding space using flow-matching ODEs and classifier-free guidance.
It integrates techniques from image-domain diffusion and generates efficiently, needing only 32 sampling steps compared with the 1000 or more steps used by prior discrete diffusion models.
Empirical evaluations demonstrate ELF's superiority in generation quality, data efficiency, and speed, making it a promising approach for continuous language modeling.

Embedded Language Flows (ELF) are a class of continuous-time flow-matching diffusion models for language that operate in continuous embedding space rather than directly on discrete tokens. ELF is designed to leverage techniques that succeeded in image-domain diffusion models, such as flow-matching ODEs and classifier-free guidance, by remaining in embedding space throughout generation and resorting to a single discrete decoding step at the end. Empirical evaluations indicate that ELF achieves better generation quality, efficiency, and data utilization than both discrete and continuous diffusion language models (DLMs), opening new directions for continuous generative modeling of language (Hu et al., 11 May 2026).

1. Mathematical Formulation

The underlying space for ELF is the sequence of token embeddings. A sequence of discrete tokens $s = [s_1,\ldots,s_L]$, with each $s_i \in V$, is mapped to an embedding $x \in \mathbb{R}^{L \times d}$ via a (typically frozen) encoder:

$x = \mathrm{encode}(s)$

Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ of the same shape as $x$ is introduced, and a linear "rectified-flow" interpolant is defined for continuous time $t \in [0, 1]$:

$z_t = t x + (1-t)\epsilon$

For $t=0$, $z_0 = \epsilon$ is pure noise; for $t=1$, $z_1 = x$ recovers the embedded sequence.
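A minimal NumPy sketch of the interpolant may help make the endpoints concrete. The helper name `interpolate`, the toy shapes, and the random array standing in for the encoder output are illustrative assumptions, not part of the source.

```python
import numpy as np

def interpolate(x: np.ndarray, eps: np.ndarray, t: float) -> np.ndarray:
    """Linear interpolant z_t = t * x + (1 - t) * eps."""
    return t * x + (1.0 - t) * eps

rng = np.random.default_rng(0)
L, d = 8, 16                        # toy sequence length and embedding width
x = rng.standard_normal((L, d))     # stand-in for encode(s); a real encoder is assumed
eps = rng.standard_normal((L, d))   # Gaussian noise with the same shape as x

z0 = interpolate(x, eps, 0.0)       # pure noise
z1 = interpolate(x, eps, 1.0)       # clean embeddings
z_half = interpolate(x, eps, 0.5)   # equal mixture of data and noise
```

At $t=0$ the result equals the noise and at $t=1$ it equals the embeddings, matching the two endpoints of the interpolant.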

Flow matching models generation as the ODE:

$\frac{dz_t}{dt} = v(z_t, t)$

where the true velocity field is:

$v = x - \epsilon$
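The velocity follows from differentiating the interpolant $z_t = t x + (1-t)\epsilon$ with respect to $t$. Eliminating the noise via $\epsilon = (z_t - t x)/(1-t)$ gives an equivalent form in terms of the current state, valid for $t < 1$:

$\frac{dz_t}{dt} = \frac{d}{dt}\left[t x + (1-t)\epsilon\right] = x - \epsilon = x - \frac{z_t - t x}{1-t} = \frac{x - z_t}{1-t}$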

ELF employs an $x$-prediction parameterization, using a Transformer-based neural network $x_\theta(z_t, t)$ to approximate $x$, yielding a predicted velocity

$v_\theta(z_t, t) = \frac{x_\theta(z_t, t) - z_t}{1 - t}$

with the core objective being mean-squared denoising:

$\mathcal{L}_{\mathrm{MSE}} = \mathbb{E}_{x, \epsilon, t}\left[\left\| x_\theta(z_t, t) - x \right\|^2\right]$
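The conversion from an $x$-prediction to a velocity, $(x_\theta - z_t)/(1-t)$, and the mean-squared denoising objective can be sketched as follows. The function names are hypothetical, and the plain unweighted mean-squared error is an assumption; the source may use a different time weighting.

```python
import numpy as np

def velocity_from_x_pred(x_pred, z_t, t):
    """Convert an x-prediction into a velocity: (x_pred - z_t) / (1 - t), for t < 1."""
    return (x_pred - z_t) / (1.0 - t)

def mse_denoising_loss(x_pred, x):
    """Unweighted mean-squared error between predicted and clean embeddings."""
    return float(np.mean((x_pred - x) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))     # toy clean embeddings
eps = rng.standard_normal((8, 16))   # toy noise
t = 0.3
z_t = t * x + (1.0 - t) * eps        # noisy state at time t

# A perfect denoiser returns x itself; its velocity then equals x - eps.
v_oracle = velocity_from_x_pred(x, z_t, t)
loss_oracle = mse_denoising_loss(x, x)
```

With a perfect prediction the recovered velocity coincides with the true velocity $x - \epsilon$, which is why predicting $x$ is sufficient to drive the flow.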

At $t = 1$, ELF performs single-step decoding: $x$ is corrupted at a per-token rate to obtain $\tilde{x}$, after which the network output $x_\theta(\tilde{x}, 1)$ is projected to token logits $\ell$ by a learned matrix $W$ and trained with cross-entropy loss:

$\ell = x_\theta(\tilde{x}, 1)\, W$

$\mathcal{L}_{\mathrm{CE}} = -\sum_{i=1}^{L} \log \mathrm{softmax}(\ell_i)_{s_i}$

The network weights are shared for both MSE and CE branches, with a mode flag determining which loss is applied.
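A rough sketch of the decode branch, under stated assumptions: per-token corruption is modeled here as blending each token embedding with scaled Gaussian noise at its own rate, which is a guess at the form rather than the source's definition. The names `corrupt_per_token` and `cross_entropy`, the toy embedding table, and the projection matrix `W` are all hypothetical, and the shared network is replaced by an identity placeholder.

```python
import numpy as np

def corrupt_per_token(x, rates, noise_scale, rng):
    """Blend each token embedding with scaled Gaussian noise at its own rate (assumed form)."""
    noise = noise_scale * rng.standard_normal(x.shape)
    r = rates[:, None]                  # shape (L, 1), broadcast over the embedding axis
    return (1.0 - r) * x + r * noise

def cross_entropy(logits, targets):
    """Mean token-level cross-entropy, computed with a numerically stable log-softmax."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

rng = np.random.default_rng(0)
L, d, vocab = 6, 16, 50
tokens = rng.integers(0, vocab, size=L)
table = rng.standard_normal((vocab, d))   # toy embedding table standing in for the encoder
x = table[tokens]

rates = rng.uniform(0.0, 0.3, size=L)     # hypothetical per-token corruption rates
x_tilde = corrupt_per_token(x, rates, noise_scale=5.0, rng=rng)

denoised = x_tilde                        # placeholder for the shared network's output at t = 1
W = table.T                               # hypothetical (d, vocab) projection matrix
logits = denoised @ W                     # shape (L, vocab)
loss = cross_entropy(logits, tokens)
```

In a real model the placeholder would be a forward pass of the same network used for denoising, with the mode flag selecting the cross-entropy loss.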

2. Model Architecture and Optimization

The denoiser $x_\theta$ is a Transformer parameterized as follows:

12/24/32 layers for ELF-B/M/L, with hidden size 768/1056/1280 and multi-head self-attention
SwiGLU feed-forward, RMSNorm, rotary positional embeddings
Lightweight bottleneck layer

Training alternates between two modes, selected per batch with a Bernoulli coin:

Denoise mode (80%): applies the MSE loss $\mathcal{L}_{\mathrm{MSE}}$, samples $t$ from a logit-normal distribution, noise scale 2.0
Decode mode (20%): applies the cross-entropy loss $\mathcal{L}_{\mathrm{CE}}$ at $t = 1$ with per-token corruption, noise scale 5.0 (1.0 for conditional tasks)
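The per-batch mode selection and the logit-normal time sampling can be sketched as below. The function names are hypothetical, and the logit-normal mean and standard deviation (0 and 1) are assumed values, since the source does not state them.

```python
import numpy as np

def sample_mode(rng, p_denoise=0.8):
    """Pick the training branch for one batch with a Bernoulli coin."""
    return "denoise" if rng.random() < p_denoise else "decode"

def sample_logit_normal(rng, size, mean=0.0, std=1.0):
    """Draw t in (0, 1) as the sigmoid of a Gaussian; mean and std are assumed values."""
    u = rng.normal(mean, std, size=size)
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
modes = [sample_mode(rng) for _ in range(10_000)]
denoise_fraction = modes.count("denoise") / len(modes)   # close to 0.8 over many batches
t = sample_logit_normal(rng, size=4)                     # one time value per example
```

In denoise mode the sampled `t` would feed the interpolant and the MSE loss; in decode mode `t` is fixed at 1 and the cross-entropy loss is applied instead.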

Optimization uses the Muon optimizer (learning rate 0.002, batch size 512, zero weight decay, 5 epochs on OpenWebText). Self-conditioning is applied with probability 0.5: the previous $x$ prediction is concatenated (without gradient) as extra input to $x_\theta$. Training and inference share weights across both branches; inference follows the ODE

$\frac{dz_t}{dt} = \frac{x_\theta(z_t, t) - z_t}{1 - t}$

from $t = 0$ to $t = 1$, then applies final decoding.
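A minimal Euler integrator for this ODE, assuming an $x$-prediction network with velocity $(x_\theta(z_t, t) - z_t)/(1-t)$, might look like the following. The uniform time grid, the name `euler_sample`, and the toy denoiser that always predicts one fixed target are illustrative assumptions; the final discrete decoding step is not shown.

```python
import numpy as np

def euler_sample(x_theta, z0, num_steps=32):
    """Integrate dz/dt = (x_theta(z, t) - z) / (1 - t) from t = 0 to t = 1 with Euler steps."""
    ts = np.linspace(0.0, 1.0, num_steps + 1)   # uniform grid, assumed for simplicity
    z = z0.copy()
    for t, t_next in zip(ts[:-1], ts[1:]):
        v = (x_theta(z, t) - z) / (1.0 - t)     # velocity is only evaluated at t < 1
        z = z + (t_next - t) * v
    return z

rng = np.random.default_rng(0)
target = rng.standard_normal((8, 16))           # toy "data": a single fixed embedding

def toy_denoiser(z, t):
    """Stand-in for the trained network: always predicts the fixed target."""
    return target

z0 = rng.standard_normal((8, 16))               # start from Gaussian noise
sample = euler_sample(toy_denoiser, z0)
```

Because the last step spans exactly the remaining time $1 - t$, the final state equals the network's last $x$-prediction, so the division by $1 - t$ never has to be evaluated at $t = 1$.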

3. Integration of Image-Domain Techniques

ELF adapts classifier-free guidance (CFG) from image diffusion models. During inference, CFG combines conditional and unconditional velocity fields:

$v_{\mathrm{cfg}} = v_{\mathrm{uncond}} + w\,(v_{\mathrm{cond}} - v_{\mathrm{uncond}})$
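As an illustration of the standard two-pass form of CFG, the sketch below assumes the common convention $v_{\mathrm{uncond}} + w\,(v_{\mathrm{cond}} - v_{\mathrm{uncond}})$, in which $w = 1$ recovers the conditional velocity. The convention and the name `cfg_velocity` are assumptions; other papers write the same idea with a shifted scale.

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, w):
    """Guided velocity: v_uncond + w * (v_cond - v_uncond)."""
    return v_uncond + w * (v_cond - v_uncond)

rng = np.random.default_rng(0)
v_cond = rng.standard_normal((8, 16))     # toy conditional velocity
v_uncond = rng.standard_normal((8, 16))   # toy unconditional velocity

v_plain = cfg_velocity(v_cond, v_uncond, w=1.0)    # w = 1: conditional velocity, no extra guidance
v_guided = cfg_velocity(v_cond, v_uncond, w=2.0)   # w > 1: extrapolates past the conditional field
```

Each call here needs both velocities, which is the cost of two network evaluations per sampling step.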

Rather than requiring two forward passes, ELF is trained to predict the guided velocity directly:

$v_\theta(z_t, t, w) \approx v_{\mathrm{uncond}} + w\,(v_{\mathrm{cond}} - v_{\mathrm{uncond}})$

Guidance scales $w$ are sampled at random during training. Self-conditioning is used instead of explicit conditional fields, serving as the conditioning mechanism in the absence of external signals.
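The self-conditioning input described in Section 2 (the previous $x$ prediction concatenated without gradient, applied with probability 0.5) could be assembled roughly as follows. The name `build_self_cond_input`, the zero-filled slot used when self-conditioning is skipped, and the toy denoiser are assumptions made for illustration.

```python
import numpy as np

def build_self_cond_input(z_t, t, x_theta, rng, p_self_cond=0.5):
    """Concatenate a previous x-prediction (or zeros) to z_t along the feature axis."""
    x_prev = np.zeros_like(z_t)
    if rng.random() < p_self_cond:
        # First pass with an empty self-conditioning slot. In an autodiff
        # framework this result would be wrapped in a stop-gradient.
        first_input = np.concatenate([z_t, x_prev], axis=-1)
        x_prev = x_theta(first_input, t)
    return np.concatenate([z_t, x_prev], axis=-1)

def toy_denoiser(inp, t):
    """Stand-in network: maps an (L, 2d) input to an (L, d) prediction."""
    d = inp.shape[-1] // 2
    return 0.5 * (inp[:, :d] + inp[:, d:])

rng = np.random.default_rng(0)
z_t = rng.standard_normal((8, 16))
with_sc = build_self_cond_input(z_t, 0.4, toy_denoiser, rng, p_self_cond=1.0)
without_sc = build_self_cond_input(z_t, 0.4, toy_denoiser, rng, p_self_cond=0.0)
```

The network always sees an input of width $2d$; only the contents of the second half change, which lets one set of weights serve both the self-conditioned and the plain case.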

4. Inference, Sampling, and Computational Efficiency

ELF supports both ODE (Euler steps) and SDE-inspired samplers. The SDE sampler re-injects Gaussian noise into the current state at each step, then evaluates $x_\theta$ on the perturbed state and updates $z_t$. SDE sampling is empirically effective for small step counts (32 or fewer).
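The exact noise re-injection rule is not reproduced here. As a stand-in, the sketch below uses a generic DDIM-style stochastic step: it estimates the noise implied by the current state, mixes in fresh Gaussian noise with a hypothetical strength `gamma`, and re-forms the state at the next time. This is an illustrative substitute rather than ELF's sampler, and `stochastic_sample` and `gamma` are assumed names.

```python
import numpy as np

def stochastic_sample(x_theta, z0, num_steps=32, gamma=0.3, rng=None):
    """Sampler that mixes fresh noise into the implied noise at each step.

    gamma in [0, 1] is a hypothetical re-injection strength; gamma = 0
    reduces to the deterministic Euler ODE update.
    """
    if rng is None:
        rng = np.random.default_rng()
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    z = z0.copy()
    for t, t_next in zip(ts[:-1], ts[1:]):
        x_hat = x_theta(z, t)
        eps_hat = (z - t * x_hat) / (1.0 - t)            # noise implied by z = t*x + (1-t)*eps
        fresh = rng.standard_normal(z.shape)
        eps_mix = np.sqrt(1.0 - gamma**2) * eps_hat + gamma * fresh
        z = t_next * x_hat + (1.0 - t_next) * eps_mix    # re-noised state at the next time
    return z

rng = np.random.default_rng(0)
target = rng.standard_normal((8, 16))
z0 = rng.standard_normal((8, 16))

def toy_denoiser(z, t):
    """Stand-in for the trained network: always predicts the fixed target."""
    return target

sample_sde = stochastic_sample(toy_denoiser, z0, gamma=0.3, rng=rng)
sample_ode = stochastic_sample(toy_denoiser, z0, gamma=0.0, rng=rng)
```

The mixing weights keep the combined noise at unit variance when the implied and fresh noise are independent standard Gaussians, so the state stays on the same noise schedule as the interpolant.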

Time steps $t_k$ are chosen via a logit-normal grid, which allocates more steps to the region of higher uncertainty.
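One way to build such a grid, sketched under assumptions, is to place the interior time points at evenly spaced quantiles of a logit-normal distribution. The mean and standard deviation (0 and 1) and the name `logit_normal_grid` are illustrative; with mean 0 the grid is densest around $t = 0.5$, and shifting the mean moves the dense region.

```python
import math
from statistics import NormalDist

def logit_normal_grid(num_steps, mean=0.0, std=1.0):
    """Time grid from 0 to 1 whose interior points are logit-normal quantiles."""
    normal = NormalDist(mean, std)
    grid = [0.0]
    for k in range(1, num_steps):
        q = normal.inv_cdf(k / num_steps)          # Gaussian quantile at level k / num_steps
        grid.append(1.0 / (1.0 + math.exp(-q)))    # sigmoid maps it into (0, 1)
    grid.append(1.0)
    return grid

ts = logit_normal_grid(32)
steps = [b - a for a, b in zip(ts[:-1], ts[1:])]   # step sizes between consecutive times
```

The endpoints are set explicitly because the Gaussian quantile function is undefined at levels 0 and 1.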

ELF achieves a generation perplexity (Gen PPL) of approximately 24 with only 32 steps, in contrast to the 1000 or more steps required by prior DLMs (e.g., MDLM, Duo). Each step costs about one Transformer forward pass, so overall inference is 10–30× faster than token-based DLM samplers.
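For reference, perplexity is the exponential of the negative mean per-token log-probability. The sketch below assumes that generated samples have already been scored by an external language model; the scoring step itself and the name `perplexity` are not from the source.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean per-token natural-log probability."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Toy scores: every token is assigned probability 1/24 by the (hypothetical) scoring model.
log_probs = [math.log(1.0 / 24.0)] * 100
ppl = perplexity(log_probs)
```

A perplexity near 24 therefore corresponds to the scoring model being, on average, about as uncertain as a uniform choice among 24 tokens.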

5. Empirical Evaluation

Experimental results on OpenWebText and conditional tasks demonstrate ELF’s data and computational efficiency as well as generation quality.

Setting	Baseline Model	Metric	ELF-B Value	Best Competing Value
Unconditional gen. (32 steps)	GPT-2 Large	Gen PPL	24.1	MDLM (1000 steps): >30
WMT’14 De→En (BLEU)	AR / MDLM / Duo	BLEU	26.4	25.2 / 18.4 / 21.3
XSum (ROUGE-1/2/L)	best baseline	ROUGE-1	36.0	33.4

ELF also shows improved sample diversity (higher unigram entropy at comparable PPL) relative to discrete DLMs, advancing the quality–diversity tradeoff.
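Unigram entropy can be computed from token frequencies as in the sketch below. Measuring in bits and operating on a plain list of tokens are assumptions; the source does not state the unit or the tokenization used.

```python
import math
from collections import Counter

def unigram_entropy(tokens):
    """Shannon entropy (in bits) of the empirical unigram distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

repetitive = ["the", "the", "the", "the"]
varied = ["the", "cat", "sat", "down"]
h_rep = unigram_entropy(repetitive)   # a single repeated token carries no entropy
h_var = unigram_entropy(varied)       # four equally frequent tokens give 2 bits
```

Higher values indicate that generated text spreads its probability mass over more distinct tokens, which is why the metric is read alongside perplexity.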

ELF requires only 45 billion training tokens (5 epochs × 9B tokens), compared with 500–600B for earlier DLMs.
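Taking the figures above at face value, the training budget and the implied reduction work out to roughly an 11–13× saving in training tokens:

$5 \times 9\,\mathrm{B} = 45\,\mathrm{B}, \qquad \frac{500\,\mathrm{B}}{45\,\mathrm{B}} \approx 11.1, \qquad \frac{600\,\mathrm{B}}{45\,\mathrm{B}} \approx 13.3$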

6. Ablations and Observed Insights

Systematic ablations confirm critical design choices:

Embeddings: Pretrained contextual (T5) encodings significantly outperform randomly initialized or token-only embeddings.
Prediction target: $x$-prediction is stable across 512–1024 dimensions, whereas the alternative targets ($v$- and $\epsilon$-prediction) degrade above 512 dimensions or fail outright.
Bottleneck: 128-dimensional bottleneck optimal; 32 reduces diversity, 512 harms Gen PPL.
Denoise/decode: 80% MSE, 20% CE yields best results.
Samplers: SDE-based sampling is superior to ODE at 32 steps or fewer.
CFG scale: An empirical sweep over the guidance scale $w$ traces the optimal frontier; the best scale differs between unconditional and conditional generation.
Conditioning: In-context tokens achieve performance similar to adaLN-Zero with 30% fewer parameters.
Optimizer: Muon demonstrates faster loss reduction and improved PPL–entropy tradeoff compared to AdamW.

7. Limitations and Future Prospects

ELF currently depends on a frozen encoder. Joint training of the encoder could enhance flexibility and expressiveness, particularly for end-to-end or multimodal generative tasks. Scaling ELF to billion-parameter flow models and longer context lengths remains open. Future research directions include theoretical analysis of flow matching in discrete output domains for improved complexity guarantees and extending ELF to fully end-to-end encoder–decoder frameworks or multimodal generation tasks (Hu et al., 11 May 2026).


References

Hu et al. ELF: Embedded Language Flows (11 May 2026)
