Anchored Negative ELBO (ANELBO)

Updated 14 February 2026
  • ANELBO is a variational objective that augments the standard ELBO with an anchoring mechanism to emphasize key tokens in discrete diffusion models.
  • It introduces an auxiliary anchor KL term and a two-stage denoising framework, achieving lower sample complexity and improved likelihood approximations.
  • Empirical results demonstrate significant perplexity reductions and enhanced generalization across generative, autoregressive, and reasoning tasks in ADLM.

The Anchored Negative Evidence Lower Bound (ANELBO) is a variational objective introduced in the context of Anchored Diffusion LLMs (ADLM). ANELBO extends the standard Evidence Lower Bound (ELBO) formulation for discrete diffusion models by incorporating an explicit anchoring mechanism for important tokens within the input sequence. This objective underpins improvements in sample complexity, likelihood modeling, and empirical performance across generative, autoregressive (AR), and reasoning tasks within the ADLM framework (Rout et al., 24 May 2025).

1. Mathematical Definition and Formulation

The standard Negative ELBO (NELBO) for masked discrete diffusion models is augmented in ANELBO by introducing (a) an auxiliary anchor Kullback–Leibler (KL) term that supervises an anchor network $\varphi$ and (b) a two-stage composition in the prediction pipeline, in which a denoiser network $\psi$ operates on anchor logits from $\varphi$.

The ANELBO is given by

$$-\log p_{\psi,\varphi}(X) + \gamma\,\mathcal{L}_{\rm Anchor}(\varphi) \;\leq\; \mathcal{L}_{\rm ANELBO}(\psi, \varphi),$$

where

$$\begin{aligned} \mathcal{L}_{\rm ANELBO}(\psi,\varphi) &= \mathbb{E}_{Z_{0}\sim q(\cdot\mid X)}\left[-\log p_{\psi}(X\mid\varphi(Z_{0}))\right] \\ &\quad + \sum_{i=1}^{T} \mathbb{E}_{Z_{t(i)}\sim q(\cdot\mid X)} \Bigg[ \lambda_{t(i)} \sum_{l=1}^{L} \left( -\log\langle \pi^l_\psi(\varphi(Z_{t(i)})), X^l\rangle - \gamma \log\langle \pi^l_\varphi(Z_{t(i)}), X^l\rangle \right) \Bigg]. \end{aligned}$$

Here, $\gamma > 0$ controls the anchor-alignment penalty, and the weight $\lambda_{t(i)}$ depends on the masking schedule and the remasking probability $\sigma_{t(i)}$, with

$$\lambda_{t(i)} = \frac{(1-\sigma_{t(i)})\,\alpha_{t(i)}-\alpha_{s(i)}}{1-\alpha_{t(i)}}.$$

The objective reduces to the standard NELBO when $\gamma=0$ and $\sigma_{t}=0$.
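The step weight $\lambda_{t(i)}$ is a simple scalar function of the schedule. A minimal sketch, assuming a linear masking schedule $\alpha_t = 1 - t$ purely for illustration (the schedule used in ADLM may differ):

```python
import numpy as np

def lambda_weight(alpha_t, alpha_s, sigma_t):
    """Step weight lambda_{t(i)} = ((1 - sigma)*alpha_t - alpha_s) / (1 - alpha_t)."""
    return ((1.0 - sigma_t) * alpha_t - alpha_s) / (1.0 - alpha_t)

# Linear masking schedule alpha_t = 1 - t on a T-step grid (an illustrative
# assumption, not the paper's schedule).
T = 10
t = np.linspace(1.0 / T, 1.0, T)   # t(i), i = 1..T
s = t - 1.0 / T                    # s(i): the previous, less-noisy step
w = lambda_weight(1.0 - t, 1.0 - s, sigma_t=0.0)
print(w)  # with sigma_t = 0 these are the standard NELBO step weights
```

Setting `sigma_t=0.0` here recovers the plain NELBO weighting $(\alpha_{t(i)}-\alpha_{s(i)})/(1-\alpha_{t(i)})$, matching the reduction noted above.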

2. Derivation and Theoretical Justification

The ANELBO is derived as a variational upper bound on $-\log p(X)$ (i.e., the negative of an evidence lower bound) for the ADLM framework:

  1. Start from the negative log-likelihood $-\log p_\theta(X)$ and introduce the forward joint $q(Z_{0:T}\mid X)$.
  2. Apply Jensen’s inequality to obtain the usual ELBO/NELBO separation: a reconstruction loss plus a sum of KL terms over the reverse process steps.
  3. Incorporate the anchor-alignment KL $\mathcal{L}_{\rm Anchor}(\varphi)$, encouraging the anchor network $r_\varphi$ to match a target anchor transition $r(\cdot)$ via

$$\mathcal{L}_{\rm Anchor}(\varphi) = \mathbb{E}_{q(Z_{0:T}\mid X)} \sum_i D_{\mathrm{KL}}\big[\, r(Y_{s(i)}\mid Z_{t(i)},X) \,\big\|\, r_\varphi(Y_{s(i)}\mid Z_{t(i)}) \,\big],$$

where $Y_{s(i)}$ denotes the variables unmasked at step $s(i)$.

  4. Set the variational reverse posterior to $q(z_{s}\mid z_{t},\pi_\psi(\varphi(z_t)))$, parameterized by $\psi$ and $\varphi$.
  5. Expand the KL terms over the sum; contributions from already-unmasked tokens vanish, leaving only the two log-likelihood terms in $\mathcal{L}_{\rm ANELBO}$.
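The Jensen step in the derivation above is the standard variational-inference move; written out schematically:

```latex
-\log p_\theta(X)
  = -\log \mathbb{E}_{q(Z_{0:T}\mid X)}\!\left[\frac{p_\theta(X,\,Z_{0:T})}{q(Z_{0:T}\mid X)}\right]
  \;\le\; \mathbb{E}_{q(Z_{0:T}\mid X)}\!\left[-\log \frac{p_\theta(X,\,Z_{0:T})}{q(Z_{0:T}\mid X)}\right].
```

Factorizing the right-hand side over the reverse-process steps yields the reconstruction term plus the per-step KL divergences, to which the anchor-alignment KL is then added.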

The bound formalized in Theorem 3.1 guarantees that optimization of ANELBO provides control over both marginal likelihood and anchor alignment (Rout et al., 24 May 2025).

3. Symbolic Definitions and Model Structure

A table of relevant quantities provides clarity on notation:

| Symbol | Definition | Role |
|---|---|---|
| $X$ | $(X^1,\dots,X^L)$, one-hot sequence in $\{1,\dots,K\}^L$ | Input sequence |
| $Z_t$ | Discrete noised sequence at time $t$; $q(Z_t\mid Z_s) = \mathrm{Cat}(Z_t;\, \alpha_{t\mid s}Z_s + (1-\alpha_{t\mid s})\mathbf{m})$ | Forward process state |
| $\alpha_t$ | Masking schedule | Controls diffusion noise |
| $\sigma_t$ | Remasking probability at step $t$ | Modifies forward process |
| $\varphi$ | Anchor network, outputting $\pi^l_\varphi(Z_t) \in \Delta_K$ at each sequence position | Anchor prediction |
| $\psi$ | Denoiser network; outputs $\pi_\psi(\varphi(Z_t))$ given anchors | Final denoising step |
| $r(\cdot)$ | Target anchor transition | Supervises anchoring |
| $r_\varphi$ | Learned/parameterized anchor transition | Anchor KL target |
| $\gamma$ | Strength of anchor-KL term | Tunes supervision |
| $\lambda_t$ | KL weighting in diffusion sum | Step reweighting |

The two-stage composition $\psi\circ\varphi$ partitions modeling capacity, with $\varphi$ focusing on low-entropy anchor predictions and $\psi$ handling full sequence denoising.
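The composition $\psi\circ\varphi$ can be sketched with toy linear stand-ins for the two networks (hypothetical; ADLM uses transformer stacks, and these exist only to make the data flow concrete):

```python
import numpy as np

rng = np.random.default_rng(0)
K, L = 8, 5  # toy vocab size and sequence length (assumed for illustration)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical linear stand-ins for the anchor network phi and denoiser psi.
W_phi = rng.normal(size=(K, K))
W_psi = rng.normal(size=(2 * K, K))

def phi(z_onehot):
    """Anchor network: per-position anchor distribution pi_phi^l(Z_t) in Delta_K."""
    return softmax(z_onehot @ W_phi)

def psi(z_onehot, anchor_probs):
    """Denoiser: consumes the noised sequence together with phi's anchor outputs."""
    return softmax(np.concatenate([z_onehot, anchor_probs], axis=-1) @ W_psi)

z = np.eye(K)[rng.integers(0, K, size=L)]  # one-hot noised sequence Z_t
pi_phi = phi(z)                            # stage 1: anchor prediction
pi_psi = psi(z, pi_phi)                    # stage 2: denoising given anchors
```

The key structural point is that $\psi$ never sees $Z_t$ alone: its input is always conditioned on $\varphi$'s anchor predictions.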

4. Implications for Sample Complexity and Likelihood Modeling

Anchoring yields an exponential reduction in sample complexity for both AR and DLM frameworks. In the standard discrete DAG/CPT parameterization, every conditional distribution can depend on up to $L$ tokens, requiring $\mathcal{O}(L K^{L})$ parameters. By restricting conditionals to a fixed small parent set of size $d \ll L$, anchored models need only $\mathcal{O}(L K^{d+1})$ parameters, dramatically lowering the sample complexity. This reduction is formalized in Proposition 3.2 and borne out in practice. The ANELBO objective, by exposing and supervising key anchor tokens early in training, reduces denoiser entropy, leading to better likelihood approximations and improved generalization on empirical benchmarks (Rout et al., 24 May 2025).
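The counting argument can be made concrete with toy sizes (chosen here for illustration, not taken from the paper):

```python
# Parameter counts for the DAG/CPT argument with K-ary tokens, length L,
# and an anchored parent set of size d << L (toy sizes, assumed).
K, L, d = 100, 20, 2

full_cpt = L * K ** L        # O(L K^L): each conditional may depend on all L tokens
anchored = L * K ** (d + 1)  # O(L K^{d+1}): conditionals restricted to d parents
print(anchored)              # 20 * 100**3 = 20,000,000
print(full_cpt // anchored)  # reduction factor K**(L - d - 1) = 100**17
```

Even at these modest sizes the ratio between the two parameterizations is $K^{L-d-1}$, which is the exponential gap the proposition refers to.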

5. Practical Training and Optimization Procedures

Optimization of the ANELBO follows these steps:

  • Draw minibatches $X \sim q$ from the data.
  • Sample the forward noising chain $\{Z_{t(i)}\}_{i=0}^T$ for each input $X$.
  • Compute anchor logits $\varphi(Z_{t(i)})$ and subsequent denoiser logits $\psi(\varphi(Z_{t(i)}))$.
  • Evaluate the reconstruction loss $-\log p_\psi(X\mid\varphi(Z_0))$.
  • At each timestep $t(i)$, accumulate the weighted terms

$$\lambda_{t(i)}\,\bigl[-\log\langle\psi(\varphi(Z_{t(i)})),X\rangle - \gamma\,\log\langle\varphi(Z_{t(i)}),X\rangle\bigr].$$

  • Backpropagate gradients through both networks; all variables are discrete, so cross-entropies suffice for the KL terms.
  • Optimize with Adam or an equivalent optimizer.
  • Estimate expectations over $Z_t$ by Monte Carlo with a single sample per instance, averaged over the batch (Rout et al., 24 May 2025).
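The steps above can be sketched as a single-sample loss computation. This is a NumPy sketch under stated assumptions, not the paper's implementation: `phi` and `psi` are hypothetical linear stand-ins, and the step weights are uniform placeholders rather than the schedule-derived $\lambda_{t(i)}$.

```python
import numpy as np

rng = np.random.default_rng(0)
K, L, T, gamma = 8, 5, 4, 0.5  # toy vocab, length, steps, anchor-KL strength

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical linear stand-ins for the anchor network phi and denoiser psi
# (ADLM uses transformer stacks; these exist only to make the loss runnable).
W_phi = rng.normal(scale=0.1, size=(K, K))
W_psi = rng.normal(scale=0.1, size=(2 * K, K))
phi = lambda z: softmax(z @ W_phi)
psi = lambda z, a: softmax(np.concatenate([z, a], axis=-1) @ W_psi)

def anelbo_loss(X, Zs, lambdas):
    """Single-sample Monte Carlo estimate of L_ANELBO (a sketch, not the paper's code).

    X: (L, K) one-hot targets; Zs[0] = Z_0, Zs[i] = Z_{t(i)} for i = 1..T.
    """
    a0 = phi(Zs[0])
    loss = -np.log((psi(Zs[0], a0) * X).sum(-1)).sum()  # -log p_psi(X | phi(Z_0))
    for lam, Z in zip(lambdas, Zs[1:]):
        a = phi(Z)
        ce_psi = -np.log((psi(Z, a) * X).sum(-1))       # -log<pi_psi(phi(Z_t)), X^l>
        ce_phi = -np.log((a * X).sum(-1))               # -log<pi_phi(Z_t), X^l>
        loss += lam * (ce_psi + gamma * ce_phi).sum()
    return loss

X = np.eye(K)[rng.integers(0, K, size=L)]                           # one-hot data
Zs = [np.eye(K)[rng.integers(0, K, size=L)] for _ in range(T + 1)]  # stand-in noised chain
lambdas = np.full(T, 1.0 / T)  # uniform placeholders; lambda_{t(i)} from the schedule in practice
loss = anelbo_loss(X, Zs, lambdas)
print(float(loss))
```

In a real training loop the chain $\{Z_{t(i)}\}$ comes from the forward masking process $q$, and gradients of this loss flow into both $\varphi$ and $\psi$ via the optimizer.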

6. Relationship to Standard (Negative) ELBO

The ANELBO is a strict generalization of the NELBO for discrete diffusion LLMs. When the anchor-supervision parameter $\gamma=0$ and the remasking probability $\sigma_t=0$, ANELBO collapses to the standard NELBO. Conceptually, ANELBO is a variational upper bound on $-\log p(X)$ augmented by explicit anchor supervision. In practice, the two-stage parameterization (anchor followed by denoiser) concentrates modeling effort, increasing efficiency in capturing key tokens before full sequence reconstruction.

7. Empirical Outcomes and Significance

Empirical validation of ANELBO centers on likelihood estimation, text generation quality, generalization, and task-specific performance:

  • On LM1B, ADLM trained with ANELBO achieves perplexity reductions of up to 9.7% over MDLM baselines at 65B tokens.
  • On OpenWebText, ADLM achieves up to 25.4% relative perplexity improvement over SEDD/MDLM, within 3 perplexity points of AR models.
  • In text generation, ADLM attains higher MAUVE scores than autoregressive models on OWT, and surpasses prior DLMs in MAUVE, GPT-2 perplexity, and entropy for large $T$.
  • In zero-shot generalization, ADLM outperforms SEDD/MDLM/BD3LM on 6 of 7 benchmarks.
  • For AR models fine-tuned using anchor supervision, perplexity is further reduced compared to standard AR.
  • In mathematical and logical reasoning tasks, Anchored Chain-of-Thought (ACoT) achieves accuracy improvements without increasing token usage.

These results collectively demonstrate that ANELBO is both a principled extension of the diffusion-model negative ELBO and empirically effective in delivering improvements in sample efficiency, likelihood modeling, and downstream task quality (Rout et al., 24 May 2025).
