Anchored Negative ELBO (ANELBO)
- ANELBO is a variational objective that augments the standard ELBO with an anchoring mechanism to emphasize key tokens in discrete diffusion models.
- It introduces an auxiliary anchor KL term and a two-stage denoising framework, achieving lower sample complexity and improved likelihood approximations.
- Empirical results demonstrate significant perplexity reductions and enhanced generalization across generative, autoregressive, and reasoning tasks in ADLM.
The Anchored Negative Evidence Lower Bound (ANELBO) is a variational objective introduced in the context of Anchored Diffusion LLMs (ADLM). ANELBO extends the standard Evidence Lower Bound (ELBO) formulation for discrete diffusion models by incorporating an explicit anchoring mechanism for important tokens within the input sequence. This objective underpins improvements in sample complexity, likelihood modeling, and empirical performance across generative, autoregressive (AR), and reasoning tasks within the ADLM framework (Rout et al., 24 May 2025).
1. Mathematical Definition and Formulation
The standard Negative ELBO (NELBO) for masked discrete diffusion models is augmented in ANELBO by introducing (a) an auxiliary anchor Kullback–Leibler (KL) term that supervises an anchor network $f_\varphi$ and (b) a two-stage composition in the prediction pipeline, where a denoiser network $g_\theta$ operates on the anchor logits $a_t = f_\varphi(z_t)$ produced by $f_\varphi$.
The ANELBO is given by

$$
\mathcal{L}_{\mathrm{ANELBO}}(\theta,\varphi)
= \mathbb{E}_q\big[-\log p_\theta(x \mid z_{t_1}, a_{t_1})\big]
+ \sum_{i=2}^{T} w_{t_i}\, \mathbb{E}_q\Big[\mathrm{KL}\big(q(z_{s_i} \mid z_{t_i}, x)\,\big\|\,p_\theta(z_{s_i} \mid z_{t_i}, a_{t_i})\big)\Big]
+ \lambda \sum_{i=2}^{T} \mathbb{E}_q\Big[\mathrm{KL}\big(q_A(\cdot \mid z_{t_i}, x)\,\big\|\,p_\varphi(\cdot \mid z_{t_i})\big)\Big],
$$

where $a_{t_i} = f_\varphi(z_{t_i})$ denotes the anchor logits produced by the anchor network.
Here, $\lambda \ge 0$ controls the anchor-alignment penalty, and the weight $w_{t_i}$ depends on the masking schedule $\alpha_t$ and remasking probability $\gamma$, with

$$
w_{t_i} = \frac{\alpha_{s_i} - \alpha_{t_i}}{1 - \alpha_{t_i}} \quad \text{when } \gamma = 0 .
$$

The objective reduces to the standard NELBO when $\lambda = 0$ and $\gamma = 0$.
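To make the step weighting concrete, the sketch below evaluates the per-step weight for a linear masking schedule $\alpha_t = 1 - t$ on a uniform grid $t_i = i/T$. The linear schedule and the multiplicative $(1-\gamma)$ remasking correction are illustrative assumptions, not forms taken from the source.

```python
# Sketch: per-step diffusion weights under a linear masking schedule
# alpha_t = 1 - t. The (1 - gamma) remasking correction is an
# illustrative assumption; the source defines the weight only as a
# function of the masking schedule and the remasking probability.

def alpha(t):
    """Linear masking schedule: fraction of tokens still unmasked at time t."""
    return 1.0 - t

def step_weight(i, T, gamma=0.0):
    """Weight w_{t_i} for the discretization t_i = i/T, s_i = (i-1)/T."""
    t_i, s_i = i / T, (i - 1) / T
    return (1.0 - gamma) * (alpha(s_i) - alpha(t_i)) / (1.0 - alpha(t_i))

T = 4
weights = [step_weight(i, T) for i in range(1, T + 1)]
print(weights)  # [1.0, 0.5, 0.333..., 0.25]
```

Under this schedule the weight simplifies to $w_{t_i} = 1/i$, so lightly masked early steps receive the largest weight.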
2. Derivation and Theoretical Justification
The ANELBO is derived as a variational bound for the ADLM framework; its negation, $-\mathcal{L}_{\mathrm{ANELBO}}$, lower-bounds the log-likelihood $\log p_\theta(x)$:
- Start from the negative log-likelihood $-\log p_\theta(x)$, introducing the forward joint $q(z_{t_{1:T}} \mid x)$.
- Apply Jensen’s inequality to obtain the usual ELBO/NELBO separation: a reconstruction loss plus a sum of KL terms over the reverse-process steps.
- Incorporate the anchor-alignment KL $\lambda\,\mathrm{KL}(q_A \,\|\, p_\varphi)$, encouraging the learned anchor transition $p_\varphi(\cdot \mid z_{t_i})$ to match a target anchor transition $q_A(\cdot \mid z_{t_i}, x)$ supported on the variables unmasked at step $s_i$.
- Set the variational reverse posterior to $p_\theta(z_{s_i} \mid z_{t_i}) = g_\theta(z_{s_i} \mid z_{t_i}, f_\varphi(z_{t_i}))$, parameterized by $\theta$ and $\varphi$.
- Expanding the KLs over the sum, contributions from already-unmasked tokens vanish, leaving only the two log-likelihood terms (anchor and denoiser cross-entropies) in $\mathcal{L}_{\mathrm{ANELBO}}$.
The bound formalized in Theorem 3.1 guarantees that optimization of ANELBO provides control over both marginal likelihood and anchor alignment (Rout et al., 24 May 2025).
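The steps above can be summarized in one display. The following is a schematic reconstruction consistent with standard masked-diffusion ELBO derivations, with $a_{t_i} = f_\varphi(z_{t_i})$, rather than a verbatim statement from the source:

```latex
-\log p_\theta(x)
  \;\le\; \underbrace{\mathbb{E}_q\big[-\log p_\theta(x \mid z_{t_1}, a_{t_1})\big]}_{\text{reconstruction}}
  \;+\; \underbrace{\sum_{i=2}^{T} w_{t_i}\,\mathbb{E}_q\,
        \mathrm{KL}\big(q(z_{s_i} \mid z_{t_i}, x)\,\big\|\,p_\theta(z_{s_i} \mid z_{t_i}, a_{t_i})\big)}_{\text{diffusion terms}}
  \;+\; \underbrace{\lambda \sum_{i=2}^{T} \mathbb{E}_q\,
        \mathrm{KL}\big(q_A(\cdot \mid z_{t_i}, x)\,\big\|\,p_\varphi(\cdot \mid z_{t_i})\big)}_{\text{anchor alignment}}
```

Because $\lambda \ge 0$ and every KL term is nonnegative, adding the anchor-alignment sum can only enlarge the right-hand side, so ANELBO remains a valid upper bound on $-\log p_\theta(x)$.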
3. Symbolic Definitions and Model Structure
A table of relevant quantities provides clarity on notation:
| Symbol | Definition | Role |
|---|---|---|
| $x$ | One-hot sequence in $\{0,1\}^{L \times \lvert V\rvert}$ | Input sequence |
| $z_t$ | Discrete noised sequence at time $t$; $z_0 = x$ | Forward process state |
| $\alpha_t$ | Masking schedule | Controls diffusion noise |
| $\gamma$ | Remasking probability at step $t$ | Modifies forward process |
| $f_\varphi$ | Anchor network, outputting anchor logits $a_t$ at each sequence position | Anchor prediction |
| $g_\theta$ | Denoiser network; outputs $p_\theta(x \mid z_t, a_t)$ given anchors | Final denoising step |
| $q_A$ | Target anchor transition | Supervises anchoring |
| $p_\varphi$ | Learned/parameterized anchor transition | Anchor KL target |
| $\lambda$ | Strength of anchor-KL term | Tunes supervision |
| $w_t$ | KL weighting in diffusion sum | Step reweighting |
The two-stage composition partitions modeling capacity, with $f_\varphi$ focusing on low-entropy anchor predictions and $g_\theta$ handling full-sequence denoising.
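A minimal sketch of this composition follows, with toy random linear maps standing in for $f_\varphi$ and $g_\theta$ and conditioning implemented by concatenation; these choices and all shapes are illustrative assumptions, as the source specifies only that the denoiser consumes the anchor logits:

```python
# Sketch of the two-stage anchor -> denoiser composition. The toy
# "networks" (random linear maps) and the concatenation-based
# conditioning are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
L, V = 6, 10                              # sequence length, vocab size

W_anchor = rng.normal(size=(V, V))        # stand-in for anchor net f_phi
W_denoise = rng.normal(size=(2 * V, V))   # stand-in for denoiser g_theta

def one_hot(ids, V):
    return np.eye(V)[ids]

z_t = one_hot(rng.integers(0, V, size=L), V)   # noised sequence (one-hot)

anchor_logits = z_t @ W_anchor                 # stage 1: a_t = f_phi(z_t)
denoiser_in = np.concatenate([z_t, anchor_logits], axis=-1)
denoiser_logits = denoiser_in @ W_denoise      # stage 2: g_theta(z_t, a_t)

print(anchor_logits.shape, denoiser_logits.shape)  # (6, 10) (6, 10)
```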
4. Implications for Sample Complexity and Likelihood Modeling
Anchoring yields an exponential reduction in sample complexity for both AR and DLM frameworks. In the standard discrete DAG/CPT parameterization, every conditional distribution can depend on up to $n-1$ other tokens, requiring $O(\lvert V\rvert^{n})$ parameters. By restricting conditionals to a fixed small parent set of size $k \ll n$, anchored models require only $O(n\,\lvert V\rvert^{k+1})$ parameters, dramatically lowering the sample complexity. This reduction is formalized in Proposition 3.2 and verified by practical performance. The ANELBO objective, by exposing and supervising key anchor tokens early in training, reduces denoiser entropy, leading to enhanced likelihood approximations and improved generalization in empirical benchmarks (Rout et al., 24 May 2025).
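The counting argument can be checked directly. The vocabulary size, sequence length, and parent-set size below are arbitrary illustrative values:

```python
# Illustration of the parameter-count argument: a conditional over all
# n-1 possible parents needs O(|V|^n) CPT entries in total, while
# restricting each token's conditional to k anchor parents needs only
# O(n * |V|^(k+1)) entries.

def full_cpt_params(n, V):
    """Entries when each token's conditional may depend on all others."""
    return V ** n

def anchored_cpt_params(n, V, k):
    """Entries across n per-token CPTs, each conditioned on k parents."""
    return n * V ** (k + 1)

n, V, k = 12, 50, 2
print(full_cpt_params(n, V))         # 50**12
print(anchored_cpt_params(n, V, k))  # 12 * 50**3 = 1500000
```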
5. Practical Training and Optimization Procedures
Optimization of the ANELBO follows these steps:
- Draw minibatches from the data.
- Sample the forward noising chain $z_{t_1}, \dots, z_{t_T}$ for each input $x$.
- Compute anchor logits $a_t = f_\varphi(z_t)$ and subsequent denoiser logits from $g_\theta(z_t, a_t)$.
- Evaluate the reconstruction loss $\mathbb{E}_q\big[-\log p_\theta(x \mid z_{t_1}, a_{t_1})\big]$.
- At each timestep $t_i$, accumulate the weighted terms $w_{t_i}\,\mathrm{KL}\big(q(z_{s_i} \mid z_{t_i}, x)\,\|\,p_\theta(z_{s_i} \mid z_{t_i}, a_{t_i})\big) + \lambda\,\mathrm{KL}\big(q_A \,\|\, p_\varphi\big)$.
- Gradients are backpropagated through both networks; all variables are discrete, and cross-entropies suffice for KL terms.
- Optimization is performed using Adam or equivalent optimizers.
- Monte Carlo estimates for expectations over $t$ and the noising chain use a single sample per instance, with batch averaging (Rout et al., 24 May 2025).
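The steps above can be sketched end to end as a single toy optimization step. The independent-masking forward process, the cross-entropy stand-ins for the KL terms, the $1/t$ weight, and all shapes are illustrative assumptions; real training would backpropagate through actual $f_\varphi$ and $g_\theta$ networks rather than use random placeholder logits:

```python
# Sketch of one ANELBO training step on toy data. The forward noising
# (independent masking at rate 1 - alpha_t), the cross-entropy stand-ins
# for the KL terms, and the placeholder logits are all illustrative
# assumptions layered on the procedure described above.
import numpy as np

rng = np.random.default_rng(1)
L, V = 8, 5
MASK = V  # extra index reserved for the mask token

def alpha(t):               # linear masking schedule (assumed)
    return 1.0 - t

def noise(x_ids, t):
    """Forward process: mask each token independently w.p. 1 - alpha_t."""
    keep = rng.random(L) < alpha(t)
    return np.where(keep, x_ids, MASK)

def cross_entropy(logits, targets):
    """Mean token-level cross-entropy; stands in for the KL terms."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

x_ids = rng.integers(0, V, size=L)
t = rng.uniform(0.05, 0.95)                # single Monte Carlo sample of t
z_t = noise(x_ids, t)                      # input the real networks would see

anchor_logits = rng.normal(size=(L, V))    # placeholder for f_phi(z_t)
denoiser_logits = rng.normal(size=(L, V))  # placeholder for g_theta(z_t, a_t)

lam, w_t = 0.1, 1.0 / max(t, 1e-6)         # anchor strength, step weight
loss = w_t * cross_entropy(denoiser_logits, x_ids) \
     + lam * cross_entropy(anchor_logits, x_ids)
print(float(loss))
```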
6. Relationship to Standard (Negative) ELBO
The ANELBO is a strict generalization of the NELBO for discrete diffusion LLMs. When the anchor-supervision parameter $\lambda = 0$ and the remasking probability $\gamma = 0$, ANELBO collapses to the standard NELBO. Conceptually, $-\mathcal{L}_{\mathrm{ANELBO}}$ is a variational lower bound on $\log p_\theta(x)$ augmented by explicit anchor supervision. In practice, the two-stage parameterization, anchor network $f_\varphi$ followed by denoiser $g_\theta$, concentrates modeling effort, increasing efficiency in capturing key tokens before full-sequence reconstruction.
7. Empirical Outcomes and Significance
Empirical validation of ANELBO centers on likelihood estimation, text generation quality, generalization, and task-specific performance:
- On LM1B, ADLM trained with ANELBO achieves perplexity reductions of up to 9.7% over MDLM baselines at 65B tokens.
- On OpenWebText, ADLM achieves up to 25.4% relative perplexity improvement over SEDD/MDLM, within 3 perplexity points of AR models.
- In text generation, ADLM attains higher MAUVE scores than autoregressive models on OWT, and surpasses prior DLMs in MAUVE, GPT-2 perplexity, and entropy for large numbers of sampling steps $T$.
- In zero-shot generalization, ADLM outperforms SEDD/MDLM/BD3LM on 6 of 7 benchmarks.
- For AR models fine-tuned using anchor supervision, perplexity is further reduced compared to standard AR.
- In mathematical and logical reasoning tasks, Anchored Chain-of-Thought (ACoT) achieves accuracy improvements without increasing token usage.
These results collectively demonstrate that ANELBO is both a principled extension of the diffusion-model negative ELBO and empirically effective in delivering improvements in sample efficiency, likelihood modeling, and downstream task quality (Rout et al., 24 May 2025).