Eso-LMs (Esoteric Language Models)
Eso-LMs (Esoteric Language Models) are a class of language models that integrate the benefits of both autoregressive (AR) and masked diffusion modeling (MDM) paradigms. Designed for efficient, controllable, and high-fidelity text generation, Eso-LMs are distinguished by a two-phase hybrid generation mechanism and are the first diffusion-based language models to support full key-value (KV) caching while preserving parallel generation. This architecture achieves state-of-the-art results on standard language modeling benchmarks and offers substantial improvements in inference efficiency relative to pure MDMs and previous hybrid approaches (Sahoo et al., 2 Jun 2025).
1. Architectural Overview
Eso-LMs employ a unified Transformer backbone to merge the sequential, left-to-right prediction of AR models with the parallel denoising of MDMs. The generation process unfolds in two distinct phases (a minimal sampler sketch follows the list below):
- Diffusion Phase: A subset of the masked tokens in a sequence is denoised in parallel. The choice and order of tokens to denoise are flexible, supporting the parallelism characteristic of MDMs.
- Sequential Phase: Remaining masked tokens are resolved sequentially, typically in a left-to-right fashion, emulating standard AR inference.
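As a concrete illustration, the minimal Python sketch below runs a parallel diffusion pass over a randomly chosen subset of masked positions and then fills the remaining masks left to right. The `denoise` stub, the `diffusion_fraction` parameter, and the uniform per-step split are placeholders introduced here for illustration only; they do not reproduce the paper's model or schedule.

```python
import random

MASK = "<mask>"

def denoise(tokens, positions):
    """Placeholder denoiser: stands in for a Transformer forward pass that
    predicts the tokens at `positions` given the current partially masked sequence."""
    return {i: f"tok{i}" for i in positions}

def eso_lm_sample(length=16, diffusion_fraction=0.75, steps=4):
    seq = [MASK] * length

    # Diffusion phase: denoise a flexible subset of masked positions in parallel.
    diffusion_positions = random.sample(range(length), int(diffusion_fraction * length))
    per_step = max(1, len(diffusion_positions) // steps)
    for step in range(steps):
        batch = diffusion_positions[step * per_step:(step + 1) * per_step]
        for pos, tok in denoise(seq, batch).items():  # parallel in a real model
            seq[pos] = tok

    # Sequential phase: resolve the remaining masks left to right, AR-style.
    for pos in range(length):
        if seq[pos] == MASK:
            seq[pos] = denoise(seq, [pos])[pos]
    return seq

print(eso_lm_sample())
```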
The backbone incorporates an attention-biasing mechanism that dynamically adapts the self-attention mask to operate in one of three regimes (see the mask-construction sketch after the list):
- Bidirectional (for pure diffusion)
- Causal (for pure autoregression)
- Hybrid (for mixed phase generation)
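A minimal sketch of the attention-biasing idea, assuming additive biases (0 where attention is allowed, -inf where it is blocked) and a hypothetical rule for the hybrid regime in which already-denoised (clean) tokens remain visible to every query; the exact biasing used by Eso-LM (A) and (B) is specified in the paper and may differ.

```python
import torch

def attention_bias(seq_len, mode, clean=None):
    """Additive attention bias: 0.0 where attention is allowed, -inf where blocked.
    `clean` is a hypothetical boolean vector marking already-denoised positions."""
    causal = torch.tril(torch.ones(seq_len, seq_len)) > 0
    if mode == "bidirectional":   # pure diffusion: every position sees every position
        allowed = torch.ones(seq_len, seq_len) > 0
    elif mode == "causal":        # pure autoregression: left-to-right visibility only
        allowed = causal
    elif mode == "hybrid":        # mixed phase: causal, plus clean tokens visible to all queries
        allowed = causal.clone()
        if clean is not None:
            allowed |= clean.view(1, -1)
    else:
        raise ValueError(f"unknown mode: {mode}")
    bias = torch.zeros(seq_len, seq_len)
    bias[~allowed] = float("-inf")
    return bias

# Length-6 example where positions 1 and 4 are already clean.
clean = torch.tensor([False, True, False, False, True, False])
print(attention_bias(6, "hybrid", clean))
```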
Two prominent variants are described:
- Eso-LM (A): Prioritizes diffusion speed, disables bidirectional attention over mask tokens in the diffusion phase, and permits KV caching only during the sequential phase.
- Eso-LM (B): Applies causal attention throughout, enabling unified KV caching in both phases, with a minor trade-off in perplexity for pure diffusion settings.
2. Hybrid Generation and Key-Value Caching
A critical innovation of Eso-LMs is the introduction of KV caching—a standard acceleration technique for AR models—into the diffusion modeling context. This is accomplished by:
- Adopting Causal Attention during Diffusion: When clean tokens rely exclusively on causal information (i.e., do not attend forward), the model can cache previously computed key-value pairs and reuse them during subsequent steps.
- Unified Caching Across Phases: In Eso-LM (B), the same cache state is exploited for both the parallel diffusion phase and the sequential refinement phase, ensuring computational efficiency and minimal recomputation.
This mechanism allows Eso-LMs to achieve up to 65× faster inference than conventional MDMs and 4× faster than previous semi-autoregressive designs, notably BD3-LMs (Block Discrete Denoising Diffusion Language Models) (Sahoo et al., 2 Jun 2025).
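The toy example below (hypothetical projection weights, a tiny hidden dimension, and standard single-head attention) illustrates why causal attention makes caching possible: once a token's key and value vectors are computed, they never change, so every later query, whether issued during a diffusion step or during the sequential phase, can reuse them instead of recomputing.

```python
import torch

def attend(q, k, v):
    """Single-head scaled dot-product attention (masking omitted for brevity)."""
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

class KVCache:
    """Toy KV cache: keys/values of clean tokens are computed once and reused."""
    def __init__(self):
        self.k, self.v = [], []

    def append(self, k_new, v_new):
        self.k.append(k_new)
        self.v.append(v_new)

    def tensors(self):
        return torch.stack(self.k), torch.stack(self.v)

d = 8  # hypothetical hidden dimension
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

cache = KVCache()
for step, hidden in enumerate(torch.randn(5, d)):  # five newly denoised tokens, one per step
    cache.append(hidden @ W_k, hidden @ W_v)       # compute K/V once, keep them forever
    k, v = cache.tensors()
    out = attend((hidden @ W_q).unsqueeze(0), k, v)  # new query attends over all cached K/V
    print(step, out.shape)
```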
3. Performance Evaluation
Eso-LMs are evaluated on the One Billion Words (LM1B) and OpenWebText (OWT) benchmarks using test perplexity (PPL), inference speed, and generation quality metrics. Key findings include:
- Perplexity (LM1B):
  - AR Transformer: 22.83
  - Best Eso-LM (A): 24.51
  - Best Eso-LM (B): 35.00 (diffusion-only), but lower (approaching AR) in interpolated regimes
  - MDLM (state-of-the-art MDM): 31.78
  - BD3-LM: 28.23
- Perplexity (OWT):
  - AR Transformer: 17.90
  - Eso-LM (A): 26.21
  - Eso-LM (B): 21.87
  - MDLM: 25.76
  - BD3-LM: 23.57
Eso-LMs enable smooth interpolation between AR and MDM perplexities by varying the schedule parameter, outperforming prior diffusion-based and semi-autoregressive models in many settings.
- Inference speed (at a fixed sequence length):
  - MDLM: ~5438s
  - BD3-LM: ~268–312s
  - Eso-LM (B): 82.1s
  - AR Transformer: 54s
- Sampling Quality and Pareto Frontier: Eso-LMs, and Eso-LM (B) in particular, dominate the speed-quality Pareto frontier. They maintain high sample diversity (as measured by entropy and generative perplexity) without succumbing to mode collapse in high-speed, low-NFE (number of function evaluations) settings.
4. Optimized Sampling Schedule
The two-phase generation mechanism is regulated by a unified sampling schedule that partitions the denoising steps into a parallel (diffusion) subset and a sequential subset. At each diffusion step, the schedule determines how many masked tokens are denoised in parallel, and a single schedule parameter controls how the sequence is split between the diffusion and sequential phases.
This scheme allows for flexible, high-granularity control over the parallelism/sequentiality trade-off: predominantly diffusion-based schedules maximize speed, while greater reliance on the AR phase yields higher output fidelity.
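Because the exact schedule formula is not reproduced above, the sketch below only illustrates how such a scheduler could partition a sequence: a parameter (named `alpha0` here and assumed to be the fraction of tokens assigned to the diffusion phase) fixes the split between phases, and the diffusion tokens are spread as evenly as possible over the parallel steps. Both the parameter name and the even split are illustrative choices, not the paper's definition.

```python
def tokens_per_step(seq_len, alpha0, num_diffusion_steps):
    """Return the number of tokens denoised at each parallel diffusion step,
    plus the number of tokens left for the sequential phase."""
    diffusion_total = round(alpha0 * seq_len)
    sequential_total = seq_len - diffusion_total
    base, extra = divmod(diffusion_total, num_diffusion_steps)
    # Distribute diffusion tokens as evenly as possible across the parallel steps.
    diffusion_schedule = [base + (1 if i < extra else 0) for i in range(num_diffusion_steps)]
    return diffusion_schedule, sequential_total

sched, seq_tail = tokens_per_step(seq_len=1024, alpha0=0.75, num_diffusion_steps=16)
print(sched, seq_tail)  # 16 parallel step sizes, then 256 tokens generated one by one
```

Shrinking `alpha0` shifts work toward the sequential phase (more forward passes, higher fidelity); enlarging it shifts work toward the parallel diffusion phase (fewer forward passes, faster generation).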
5. Loss Formulation and Theoretical Guarantees
Eso-LMs optimize a hybrid objective that combines the strengths of AR and MDM training: the loss is a negative variational lower bound (NELBO) on the log-likelihood. The total loss is composed of:
- AR loss (over sequential tokens),
- MDM loss (integrated over the diffusion schedule),
- KL divergence for the partially masked intermediate sequence.
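Schematically, writing x_S for the sequentially generated tokens and x_D for the tokens handled by the diffusion phase (notation assumed here for illustration, not taken from the paper), the bound has the shape:

```latex
\mathcal{L}_{\text{Eso}}(x) \;=\;
\underbrace{-\log p_\theta\!\big(x_{\mathcal{S}} \mid x_{\mathcal{D}}\big)}_{\text{AR loss (sequential tokens)}}
\;+\;
\underbrace{\int_{0}^{1} \mathcal{L}_{\text{MDM}}\!\big(x_{\mathcal{D}}, t\big)\, \mathrm{d}t}_{\text{MDM loss (diffusion schedule)}}
\;+\;
\underbrace{D_{\mathrm{KL}}\!\big(q(z \mid x) \,\big\|\, p_\theta(z)\big)}_{\text{KL term (partially masked sequence)}}
```

The exact weighting and conditioning of each term depend on the sampling schedule and are given in the paper; this expression only mirrors the three-part decomposition listed above.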
6. Comparative Analysis
A summarized comparison with prior model classes:
| Feature | AR | MDMs | BD3-LMs | Eso-LM (B) |
|---|---|---|---|---|
| Generation | Sequential | Parallel | Blockwise | Hybrid (parallel + sequential) |
| Quality | SOTA | Lower | Interpolates | AR–MDM interpolation |
| Inference speed | Fast (KV) | Slow | Medium | Fastest (KV, both phases) |
| KV caching | Yes | No | Partial | Yes (full) |
| Controllability | No | Yes | Yes | Yes |
| Parallelism | No | Yes | Partial | Yes |
Eso-LMs uniquely combine full KV caching, efficient parallelism, and a tunable quality–speed trade-off, enabling substantially faster generation than prior diffusion-based designs across a range of quality settings.
7. Implications and Applications
The practical advancements of Eso-LMs include:
- Enabling real-time or low-latency generation with high output fidelity, especially beneficial for long-context or interactive NLP applications.
- Granting practitioners the ability to select a desired balance between generation quality and speed without retraining the model.
- Supporting controllable and out-of-order text editing, retained from the diffusion modeling lineage.
Deployment of Eso-LMs may significantly expand the feasible application scope for diffusion-based LLMs by resolving prior efficiency bottlenecks and offering a unified framework for hybrid inference with solid theoretical grounding and empirical performance. A plausible implication is the broadening of large-scale chatbot and bulk text-generation workflows that previously demanded either AR quality or MDM controllability but could not efficiently combine both.