
Autoregressive Language Models as Energy-Based Models

Updated 21 December 2025
  • The paper demonstrates the formal equivalence between ARLMs and EBMs, unifying token-wise and global sequence modeling within a single energy framework.
  • It introduces energy-based training objectives that reduce exposure bias by incorporating both positive data and negative sample gradients through autoregressive sampling.
  • The paper highlights architectural innovations, such as Energy-Based Transformers, and establishes links to soft Bellman equations, enhancing sequence planning and inference.

Autoregressive language models (ARLMs) and energy-based models (EBMs) are two foundational paradigms for modeling sequential data in natural language processing and related domains. While ARLMs have dominated large-scale language modeling by directly decomposing sequence probabilities into token-wise conditionals, an increasingly robust literature demonstrates that these models can also be rigorously interpreted, extended, and trained as energy-based models. This duality underlies a unified theoretical framework with implications for training objectives, inference strategies, and the incorporation of cognitive or global-sequence considerations.

1. Formal Equivalence of Autoregressive LLMs and Energy-Based Models

An ARLM specifies a probability distribution over sequences $x = (x_1, \dots, x_T)$ via the chain rule
$$p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}),$$
where each conditional $p_\theta(x_t \mid x_{<t})$ is parameterized, for example, by a neural network's output logits.

An EBM instead defines an unnormalized distribution using an energy function $E_\theta(x)$,

$$p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z_\theta},$$

where $Z_\theta$ is the (generally intractable) partition function.

The key insight is that every left-to-right ARLM can be embedded as a special case of an EBM with

$$E_\theta(x) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}),$$

effectively setting $Z_\theta = 1$ under this decomposition (Ou, 2024; Blondel et al., 17 Dec 2025). Conversely, one can always decompose a global sequence-level energy as a sum of token-wise pseudo-rewards, recovering an autoregressive factorization. This bijection is not only formal but functional: it underlies the equivalence of many learning and inference procedures between the ARLM and EBM regimes (Blondel et al., 17 Dec 2025).
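
To make this concrete, the toy sketch below (illustrative code, not taken from the cited papers) treats a small autoregressive model as an EBM with $E_\theta(x) = -\sum_t \log p_\theta(x_t \mid x_{<t})$ and checks by brute-force enumeration that the resulting partition function equals one; the vocabulary size, sequence length, and Markov-style conditionals are arbitrary simplifications.

```python
# Toy check that an autoregressive factorization, read as an EBM with
# E(x) = -sum_t log p(x_t | x_<t), has partition function Z = 1.
# All parameters here are made up; conditionals depend only on the previous
# token purely to keep enumeration cheap.
import itertools
import math
import torch
import torch.nn.functional as F

V, T = 4, 3                       # toy vocabulary size and sequence length
torch.manual_seed(0)
W = torch.randn(T, V, V)          # hypothetical per-position "logits"

def log_conditional(t, prev_token):
    # log p(. | x_<t), simplified to depend only on the previous token.
    return F.log_softmax(W[t, prev_token], dim=-1)

def energy(x):
    # Sequence energy E(x) = -sum_t log p(x_t | x_<t); token 0 acts as BOS.
    prev, e = 0, 0.0
    for t, tok in enumerate(x):
        e -= log_conditional(t, prev)[tok].item()
        prev = tok
    return e

# Z = sum_x exp(-E(x)) over all V**T sequences; equals 1 for an AR factorization.
Z = sum(math.exp(-energy(x)) for x in itertools.product(range(V), repeat=T))
print(f"partition function Z = {Z:.6f}")   # ~1.0 up to floating-point error
```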

2. Energy-Based Training Objectives for Autoregressive Models

Recasting ARLMs as EBMs enables EBM objectives, notably maximum likelihood, contrastive divergence, and score-matching schemes, to be applied directly to standard autoregressive architectures. For example, the E-ARM method augments the classical cross-entropy loss with a contrastive-divergence-inspired objective that introduces both "positive-phase" (data) and "negative-phase" (model-generated) gradients, leveraging the inherent degree of freedom in the softmax logits to define an explicit energy function $\phi_\theta(x_k, x_{<k}) = -z_{x_k}(x_{<k})$ (Wang et al., 2022).

Negative-phase samples are generated by autoregressive sampling from the model, with importance weighting used to approximate expectations under the unnormalized EBM. This reduces "exposure bias" by training the model on its own generations and enables learning of both locally and globally plausible distributions. Training thus becomes an alternation between fitting real data and penalizing (assigning higher energy to) the model's own, potentially erroneous samples.
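
A minimal sketch of such an objective is given below, assuming a standard decoder that returns per-position logits; the importance-weighting scheme and the coefficient `alpha` are illustrative placeholders rather than the exact formulation of Wang et al. (2022).

```python
# Hedged sketch of an E-ARM-style loss: cross-entropy ("positive phase") plus a
# contrastive term that raises the energy (lowers the logit) of tokens the model
# samples itself ("negative phase"). Weighting details are illustrative only.
import torch
import torch.nn.functional as F

def earm_style_loss(logits, targets, alpha=1.0):
    # logits:  (B, T, V) raw outputs of an autoregressive LM
    # targets: (B, T)    ground-truth next tokens
    B, T, V = logits.shape
    log_p = F.log_softmax(logits, dim=-1)

    # Positive phase: ordinary maximum-likelihood / cross-entropy term.
    ce = F.nll_loss(log_p.reshape(-1, V), targets.reshape(-1))

    # Negative phase: sample tokens from the model itself and importance-weight
    # them (proposal = model distribution, target = unnormalized exp(logit)).
    with torch.no_grad():
        neg = torch.distributions.Categorical(logits=logits).sample()        # (B, T)
        neg_log_p = log_p.gather(-1, neg.unsqueeze(-1)).squeeze(-1)
        neg_logit_const = logits.gather(-1, neg.unsqueeze(-1)).squeeze(-1)
        log_w = neg_logit_const - neg_log_p                                  # log[exp(-E)/q]
        w = torch.softmax(log_w.flatten(), dim=0).reshape(B, T)              # self-normalized

    # The energy of a token is its negative logit (phi = -z), so pushing energy
    # up on negatives means pushing their logits down relative to data tokens.
    pos_logit = logits.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    neg_logit = logits.gather(-1, neg.unsqueeze(-1)).squeeze(-1)
    cd_term = (w * neg_logit).sum() - pos_logit.mean()

    return ce + alpha * cd_term
```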

3. Extensions: Residual, Globally-Normalized, and Non-Autoregressive EBM Variants

Several research lines enrich the autoregressive-as-energy-based framework with global or residual energies and distinct normalization schemes. Residual EBMs introduce a learned whole-sequence energy Eθ(x)E_\theta(x) that corrects (via a globally normalized factor) a fixed, strong base ARLM. These are trained using conditional noise-contrastive estimation, with practical importance resampling strategies for sampling and evaluation (Bakhtin et al., 2020).
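As a rough illustration of the sampling side, the snippet below sketches importance resampling from a residual EBM $p(x) \propto p_{\text{base}}(x)\exp(-E_\theta(x))$; `base_lm_sample` and `residual_energy` are hypothetical callables, not part of any released codebase.

```python
# Sketch of importance resampling from p(x) ∝ p_base(x) exp(-E(x)): draw proposals
# from the base autoregressive LM, weight by exp(-E), and resample one sequence.
import torch

def residual_ebm_sample(base_lm_sample, residual_energy, n_proposals=64):
    # base_lm_sample() -> one sequence drawn from the base LM
    # residual_energy(x) -> scalar tensor, learned whole-sequence energy E(x)
    proposals = [base_lm_sample() for _ in range(n_proposals)]
    log_w = torch.stack([-residual_energy(x) for x in proposals])   # log exp(-E(x))
    probs = torch.softmax(log_w, dim=0)                             # self-normalized weights
    idx = torch.multinomial(probs, 1).item()
    return proposals[idx]
```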

Autoregressive Energy Machines generalize ARLMs to allow non-linear, nonparametric, or otherwise flexible energy functions over subsequences or tokens, normalizing each conditional with importance sampling. This approach can capture complex, multi-modal, or globally constrained distributions that standard softmax-based ARLMs cannot (Nash et al., 2019).
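A hedged sketch of the per-conditional normalization is shown below: each conditional is scored by an unnormalized energy network, and its partition function is estimated with importance samples from a proposal distribution; `energy_net` and `proposal` are placeholders for the learned components.

```python
# Sketch of AEM-style conditional estimation: log p(x_t | context) is the negative
# energy minus an importance-sampling estimate of the per-context log normalizer.
import math
import torch

def aem_log_conditional(energy_net, proposal, context, x_t, n_samples=128):
    # energy_net(context, v) -> scalar energy E(v | context), unnormalized
    # proposal: torch.distributions.Distribution over the next value given context
    samples = proposal.sample((n_samples,))                 # v_1..v_n ~ q(. | context)
    log_q = proposal.log_prob(samples)
    neg_E = torch.stack([-energy_net(context, v) for v in samples])
    # log Z(context) ≈ log mean_i exp(-E(v_i | context)) / q(v_i)
    log_Z = torch.logsumexp(neg_E - log_q, dim=0) - math.log(n_samples)
    return -energy_net(context, x_t) - log_Z
```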

In global/autoregressive additive schemes, the overall sequence energy aggregates both token-wise conditionals and hand-crafted or learned global features, supporting applications such as reward-augmented generation, constrained decoding, and plug-and-play control (Parshakova et al., 2019).
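For instance, under such an additive scheme the sequence energy might take the form sketched below; the feature functions and weights are purely illustrative.

```python
# Additive global/autoregressive energy (illustrative form):
# E(x) = -sum_t log p_ar(x_t | x_<t) - sum_k lambda_k * phi_k(x)
def additive_energy(ar_log_probs, global_features, lambdas):
    # ar_log_probs:    per-token log p(x_t | x_<t) for the sequence
    # global_features: whole-sequence feature values phi_k(x), e.g. a length
    #                  penalty or constraint score (hypothetical examples)
    # lambdas:         corresponding feature weights
    local_part = -sum(ar_log_probs)
    global_part = -sum(lam * phi for lam, phi in zip(lambdas, global_features))
    return local_part + global_part
```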

4. Algorithmic and Architectural Innovations: MCMC, Reconstruction, and Hybrid Inference

The reframing of the next-token prediction task in energy terms enables novel training and inference pipelines. Cognitively Inspired Energy-Based World Models (EBWM) supplant the conventional softmax with an energy-based compatibility score and deploy an Energy-Based Transformer (EBT) architecture that exposes future-state embeddings in the model's hidden layers (Gladstone et al., 2024). This architecture operates via both past-state self-attention (as in classical causal transformers) and future-state attention mechanisms, retaining parallelizability while supporting bi-directional influence between predicted and observed states.

Critically, instead of strictly maximizing likelihood or minimizing cross-entropy, EBWM employs a reconstruction-based objective: the model descends its energy surface (via MCMC in the embedding space) to reconstruct the true future-token embedding, sidestepping the instability of contrastive losses and the expense of global partition function estimation.

At inference, MCMC refinement iteratively minimizes the energy over continuous embeddings, followed by discrete token selection via Boltzmann sampling. Such procedures make uncertainty explicit (through sampling from the Boltzmann distribution) and make predictions assessable (each sequence receives an energy score reflecting its plausibility) (Gladstone et al., 2024).
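
A schematic version of this inference loop is sketched below, with plain gradient descent standing in for the paper's MCMC refinement; `energy_fn`, `token_embeddings`, the step count, and the temperature are all illustrative assumptions.

```python
# Sketch of EBWM-style inference: refine a continuous future-state embedding by
# descending the energy surface, then read out a discrete token via Boltzmann
# sampling over per-token energies. All components here are placeholders.
import torch

def ebwm_style_predict(energy_fn, context, token_embeddings,
                       steps=10, lr=0.1, temperature=1.0):
    # energy_fn(context, z) -> scalar energy of candidate future embedding z
    # token_embeddings: (V, d) table used for the final discrete readout
    d = token_embeddings.shape[1]
    z = torch.zeros(d, requires_grad=True)        # initial candidate embedding
    opt = torch.optim.SGD([z], lr=lr)

    # Refinement loop: plain gradient descent on the energy (the paper's MCMC
    # procedure may inject noise or use other samplers).
    for _ in range(steps):
        opt.zero_grad()
        energy_fn(context, z).backward()
        opt.step()

    # Discrete readout: Boltzmann distribution over tokens scored by energy.
    with torch.no_grad():
        energies = torch.stack([energy_fn(context, e) for e in token_embeddings])
        probs = torch.softmax(-energies / temperature, dim=0)
        token = torch.multinomial(probs, 1).item()
    return token, z.detach()
```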

5. Theoretical Connections: Soft Bellman Equations, Maximum-Entropy RL, and Sequence Planning

A rigorous formal equivalence exists between the autoregressive decomposition (chain rule) for LLMs and the soft Bellman recursion from maximum-entropy reinforcement learning (Blondel et al., 17 Dec 2025). Specifically, the ARLM's logits encode both immediate token-level energies/rewards and the soft value function of future continuations, with the local recurrence

$$q(s, y) = r_\phi(s, y) + V(s \oplus y), \qquad V(s) = \log \sum_{y'} \exp q(s, y'),$$

mirroring Bellman's recursive backup. Supervised training via cross-entropy for ARLMs thus coincides (up to constant offsets) with maximum-likelihood EBM training in function space, and their optimal solutions correspond under this mapping.
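
The correspondence can be checked on a toy example: the snippet below (made-up rewards, tiny vocabulary and horizon, not code from Blondel et al.) computes the soft value function by backward recursion and confirms that the induced conditional $\pi(y \mid s) = \exp(q(s, y) - V(s))$ matches the conditional of the sequence-level EBM $p(x) \propto \exp(\sum_t r(x_{<t}, x_t))$ obtained by brute-force marginalization.

```python
# Toy check that the soft Bellman recursion reproduces the autoregressive
# conditionals of the sequence-level EBM p(x) ∝ exp(sum_t r(x_<t, x_t)).
# Rewards are random; vocabulary and horizon are tiny so we can enumerate.
import itertools
import math
import random

V, T = 3, 3
random.seed(0)
_rewards = {}

def reward(prefix, y):
    # Memoized random per-step reward r(s, y); prefix is a tuple of token ids.
    if (prefix, y) not in _rewards:
        _rewards[(prefix, y)] = random.uniform(-1.0, 1.0)
    return _rewards[(prefix, y)]

def soft_value(prefix):
    # V(s) = log sum_y exp(q(s, y)); terminal states have V = 0.
    if len(prefix) == T:
        return 0.0
    return math.log(sum(math.exp(q_value(prefix, y)) for y in range(V)))

def q_value(prefix, y):
    # q(s, y) = r(s, y) + V(s ⊕ y)
    return reward(prefix, y) + soft_value(prefix + (y,))

def bellman_conditional(prefix, y):
    # pi(y | s) = exp(q(s, y) - V(s))
    return math.exp(q_value(prefix, y) - soft_value(prefix))

def ebm_conditional(prefix, y):
    # p(y | s) under p(x) ∝ exp(sum of per-step rewards), by brute force.
    def score(x):
        return math.exp(sum(reward(x[:t], x[t]) for t in range(T)))
    num = sum(score(prefix + (y,) + rest)
              for rest in itertools.product(range(V), repeat=T - len(prefix) - 1))
    den = sum(score(prefix + rest)
              for rest in itertools.product(range(V), repeat=T - len(prefix)))
    return num / den

print(bellman_conditional((), 0), ebm_conditional((), 0))   # the two agree
```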

This perspective provides an explanation for the empirical "lookahead" or planning capabilities of ARLMs: despite being trained for next-token prediction, the architecture internalizes global regularities of completion distributions, encoding soft sequence-level value functions in each local prediction (Blondel et al., 17 Dec 2025).

6. Empirical Advantages, Limitations, and Cognitive Analogies

Recent advances empirically validate the performance and representational advantages of EBM-inspired ARLM formulations:

  • E-ARM and residual EBM approaches achieve lower perplexity and superior BLEU scores in language modeling and translation, particularly on longer sequences and in low-resource regimes, without architectural changes (Wang et al., 2022, Bakhtin et al., 2020).
  • EBWM achieves steeper data-scaling exponents and overtakes standard Transformers at moderate-to-large compute budgets, with per-token perplexity consistently lower as data grows (Gladstone et al., 2024).
  • EBM-based training reduces exposure bias and increases long-range coherence, with negative-phase updates actively pushing down the probability of model-generated mistakes (Wang et al., 2022).

A summary of tradeoffs appears below:

| Approach | Key Strengths | Principal Limitations |
|---|---|---|
| EBM-style ARLM | Global coherence, exposure-bias mitigation | Intractable partition function, extra overhead |
| Residual EBM | Post-hoc corrections, controlled generation | Needs strong base LM, extra sampling |
| EBWM/EBT | Plausibility evaluation, adaptive computation, uncertainty modeling | Slower inference, hyperparameter tuning |

A notable analogy is drawn to human cognition: EBM-based world models can explicitly evaluate the plausibility of multiple futures, adapt computational effort (e.g., by refining more at uncertain tokens), and incorporate predictions back into internal state representations, paralleling features of human System 2 reasoning (Gladstone et al., 2024).

7. Future Directions and Open Problems

The alignment of ARLMs and EBMs positions future research to integrate advances from both paradigms: fine-tuning large LMs with EBM or RL objectives, improving planning and reasoning via explicit value-function or global energy terms, and developing scalable, stable algorithms for approximate normalization and sampling.

Practical barriers remain in partition function estimation, computational cost (especially in MCMC or sampling-intensive algorithms), and stability when combining loss functions. Recent proposals for distributional reinforcement learning distillation provide promising pathways for bridging EBM-based training and tractable autoregressive inference in domains requiring global semantic coherence, constraints, or reward-augmented text generation (Parshakova et al., 2019, Blondel et al., 17 Dec 2025).

Further cross-fertilization between the EBM, RL, and neural sequence modeling communities is anticipated to accelerate the development of LLMs capable of flexible, robust, and controllable sequence generation.
