Equilibrium Transformers (EqT)

Updated 20 May 2026

Equilibrium Transformers (EqT) are autoregressive sequence models that iteratively refine latent states using a closed-loop gradient descent process to reach energetic equilibrium before token prediction.
The architecture employs an Equilibrium Refinement Module (ERM) that combines attention, initial proposal, and iterative gradient updates driven by a learned energy function.
Empirical results demonstrate enhanced long-range reasoning with up to an 8% accuracy boost on hard tasks and rapid convergence in as few as 8 refinement steps.

Equilibrium Transformers (EqT) are a class of autoregressive sequence models that replace the open-loop, one-shot inference of standard transformers with a closed-loop paradigm based on iterative latent refinement. This approach enforces self-consistent hidden states at each sequence position prior to token commitment by minimizing a learned energy function over the latent space. In EqT, each hidden state is refined via gradient-based minimization until equilibrium is reached with respect to both the dynamical prior and a composite self-supervised energy. The method can be interpreted as approximate MAP inference in a latent energy-based model and converges geometrically under mild assumptions. It yields the greatest empirical benefits on hard prediction instances, thereby addressing key limitations of autoregressive transformers in long-range reasoning and multi-step planning tasks (Jafari et al., 26 Nov 2025).

1. Motivation: Closed-Loop Prediction versus Open-Loop Autoregressive Transformers

Standard autoregressive transformers operate in an open-loop manner, computing each hidden state $h_t = F_\theta(h_{t-1}, x_{\leq t})$ in a single forward pass and never revisiting it, committing irrevocably to an internal representation. This lack of revision propagates early errors forward and fundamentally limits the model’s ability to recover from mistakes, especially in settings that require long-range dependency tracking or factual consistency. EqT introduces the closed-loop prediction principle: rather than immediately generating output, the latent state at each step is iteratively refined until a self-consistent equilibrium is achieved, defined as the minimum of an energy-regularized objective. The equilibrium latent $z^*_t$ is obtained by

$z^*_t \in \arg\min_{z\in\mathbb{R}^d} E(z; x_{\leq t}) + \frac{1}{2\gamma} \|z - F_\theta(h_{t-1}, x_{\leq t})\|^2,$

and only then is the token emission $p(x_t \mid x_{< t}) = \mathrm{softmax}(W z^*_t)$ produced. This closed-loop process eliminates the “commitment bottleneck” of classical transformers (Jafari et al., 26 Nov 2025).
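
As a worked illustration (an assumed toy case, not an energy used in the source), take a quadratic energy $E(z) = \frac{\mu}{2}\|z - c\|^2$ with $\mu > 0$ and a hypothetical target $c$, and write $\hat z = F_\theta(h_{t-1}, x_{\leq t})$ for the open-loop proposal. Setting the gradient of the objective to zero gives

$\mu (z - c) + \frac{1}{\gamma}(z - \hat z) = 0 \;\Longrightarrow\; z^*_t = \frac{\gamma\mu\, c + \hat z}{1 + \gamma\mu}.$

In this toy case the equilibrium is a weighted average of the proposal and the energy minimizer: as $\gamma \to 0$ it collapses onto the proposal $\hat z$, and as $\gamma \to \infty$ it moves to $c$.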

2. Architecture and Energy Function Design

The core architectural modification in EqT is the Equilibrium Refinement Module (ERM), which replaces or augments the feed-forward sublayer in a standard transformer block. The computation at each position comprises three phases:

Attention & Proposal: Compute standard multi-head self-attention, yielding $a^{(0)}$, and add a feed-forward layer to propose an initial latent $\hat z^{(0)} = a^{(0)} + \mathrm{FFN}(a^{(0)})$.
Iterative Refinement: Starting from $z^{(0)} = \hat z^{(0)}$, perform $K$ steps of gradient descent with step size $\eta$:

$z^{(k+1)} = z^{(k)} - \eta \nabla_z \left[ E(z^{(k)}; x_{\leq t}) + \frac{1}{2\gamma} \|z^{(k)} - \hat z^{(0)}\|^2 \right].$

Output to Next Layer: Pass the equilibrium latent $z^*_t = z^{(K)}$ through layer norm for subsequent processing.
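
The refinement phase can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the attention and FFN phases are replaced by a given proposal vector, and the quadratic energy, its target `c`, and the values of `mu`, `eta`, `gamma`, and `K` are hypothetical choices made so the result can be checked against a closed form.

```python
def refine(z_hat, grad_energy, eta=0.1, gamma=1.0, K=50):
    """Run K gradient steps on E(z) + ||z - z_hat||^2 / (2 * gamma)."""
    z = list(z_hat)  # z^(0) is the proposal
    for _ in range(K):
        g_energy = grad_energy(z)
        # The proximity term contributes the gradient (z - z_hat) / gamma.
        z = [zi - eta * (gi + (zi - hi) / gamma)
             for zi, gi, hi in zip(z, g_energy, z_hat)]
    return z


# Hypothetical toy energy: E(z) = 0.5 * mu * ||z - c||^2, so grad E = mu * (z - c).
mu, c = 2.0, [1.0, -1.0]
gamma = 1.0


def grad_energy(z):
    return [mu * (zi - ci) for zi, ci in zip(z, c)]


z_hat = [0.0, 0.0]
z_star = refine(z_hat, grad_energy, eta=0.1, gamma=gamma, K=200)

# For this quadratic case the minimizer is (gamma*mu*c + z_hat) / (1 + gamma*mu).
expected = [(gamma * mu * ci + hi) / (1.0 + gamma * mu) for ci, hi in zip(c, z_hat)]
```

With these values the refined latent settles between the proposal and the energy minimizer `c`, which is the intended effect of the proximity term.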

The energy function $E(z; x_{\leq t})$ is a learned, differentiable combination of multiple self-supervised losses:

Reverse predictive coding: Recovers recent context via a small reverse transformer.
Masked reconstruction: Predicts masked tokens from the latent $z$ using a lightweight decoder.
Output confidence: Penalizes ambiguous, high-entropy predictions.
Episodic memory coherence: Pulls the latent $z$ toward relevant recent memory vectors.

Weighting parameters on each term allow principled tuning of these components (Jafari et al., 26 Nov 2025).
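
A weighted combination of this kind might look like the sketch below. Everything specific here is an assumption for illustration: the component names, the weights, and the placeholder values for three of the terms are hypothetical, and the entropy of the softmax output is used as one plausible form of the confidence penalty rather than the form defined in the source.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_energy(logits):
    """Shannon entropy of softmax(logits): large when the prediction is ambiguous."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def composite_energy(components, weights):
    """Weighted sum of component energies, each already evaluated at z."""
    return sum(weights[name] * value for name, value in components.items())


# Hypothetical component values at some latent z; only "confidence" is computed.
components = {
    "reverse_coding": 0.40,
    "masked_reconstruction": 0.25,
    "confidence": entropy_energy([2.0, 0.0, 0.0, 0.0]),
    "memory_coherence": 0.10,
}
weights = {
    "reverse_coding": 1.0,
    "masked_reconstruction": 1.0,
    "confidence": 0.5,
    "memory_coherence": 0.2,
}
total = composite_energy(components, weights)
```

Raising the weight on one term makes the refinement favor that notion of consistency; a peaked output distribution lowers the confidence term, while a uniform one maximizes it.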

3. Inference and Optimization Dynamics

The latent refinement is performed via first-order gradient descent, using the update $z^{(k+1)} = z^{(k)} - \eta \nabla_z \left[ E(z^{(k)}; x_{\leq t}) + \frac{1}{2\gamma} \|z^{(k)} - \hat z^{(0)}\|^2 \right]$. Iteration proceeds until either a maximum number of steps $K$ is reached or the update norm $\|z^{(k+1)} - z^{(k)}\|$ falls below a small convergence threshold. Empirical results indicate that most tokens converge within a small number of steps, balancing accuracy and computational efficiency. The refined equilibrium $z^*_t$ is then used for token prediction (Jafari et al., 26 Nov 2025).
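
The stopping rule can be illustrated with a small Python sketch. The tolerance `tol`, the cap `max_steps`, the step size, and the quadratic toy energy are assumptions chosen for the example, not values from the source. The point is only that a proposal already close to equilibrium exits early, while a distant one uses more of the step budget.

```python
import math


def refine_adaptive(z_hat, grad_energy, eta=0.25, gamma=1.0, max_steps=32, tol=1e-4):
    """Gradient refinement that stops at max_steps or when the update norm < tol."""
    z = list(z_hat)
    for step in range(1, max_steps + 1):
        g = grad_energy(z)
        z_new = [zi - eta * (gi + (zi - hi) / gamma)
                 for zi, gi, hi in zip(z, g, z_hat)]
        update_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(z_new, z)))
        z = z_new
        if update_norm < tol:
            return z, step  # converged before exhausting the budget
    return z, max_steps


def make_grad(c):
    """Gradient of the hypothetical energy 0.5 * ||z - c||^2."""
    return lambda z: [zi - ci for zi, ci in zip(z, c)]


# "Easy" position: the proposal is already near the energy minimum.
_, easy_steps = refine_adaptive([1.0], make_grad([1.001]))
# "Hard" position: the proposal is far from the energy minimum.
_, hard_steps = refine_adaptive([1.0], make_grad([5.0]))
```

Because the step count adapts per position, average inference cost depends on how many positions are hard rather than on the worst case.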

4. Theoretical Foundations

EqT is formally equivalent to performing MAP inference in a latent energy-based model with posterior $p(z \mid x_{\leq t}) \propto \exp\left(-E(z; x_{\leq t}) - \frac{1}{2\gamma} \|z - F_\theta(h_{t-1}, x_{\leq t})\|^2\right)$, and the closed-loop equilibrium produces $z^*_t \in \arg\max_{z} p(z \mid x_{\leq t})$. Under mild strong convexity and smoothness assumptions on $E$, standard gradient-descent analysis guarantees geometric (“linear”) convergence, with a contraction factor strictly below 1 for practical hyperparameters. Moreover, the benefit of refinement is greatest when the amortized proposal $\hat z^{(0)}$ is distant from the loss-optimal latent, i.e., for hard prediction instances. This effect is quantitatively validated on synthetic tasks (Jafari et al., 26 Nov 2025).
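
Geometric convergence is easy to check numerically in the simplest setting. The sketch below assumes a one-dimensional quadratic energy $E(z) = \frac{a}{2}(z - c)^2$, which is an illustrative choice and not the learned energy. In that case each gradient step shrinks the distance to equilibrium by exactly $|1 - \eta(a + 1/\gamma)|$, and the values of `a`, `c`, `eta`, and `gamma` are arbitrary.

```python
def step(z, z_hat, a, c, eta, gamma):
    """One gradient step on 0.5*a*(z - c)**2 + (z - z_hat)**2 / (2*gamma)."""
    return z - eta * (a * (z - c) + (z - z_hat) / gamma)


a, c, z_hat, eta, gamma = 3.0, 1.0, 0.0, 0.1, 0.5      # illustrative values
z_star = (gamma * a * c + z_hat) / (1.0 + gamma * a)   # exact equilibrium
predicted = abs(1.0 - eta * (a + 1.0 / gamma))         # per-step contraction factor

z, ratios = z_hat, []
for _ in range(10):
    z_next = step(z, z_hat, a, c, eta, gamma)
    # Ratio of successive distances to the equilibrium.
    ratios.append(abs(z_next - z_star) / abs(z - z_star))
    z = z_next
```

Every entry of `ratios` matches `predicted`, so the error decays like `predicted ** k`. For a general strongly convex and smooth energy the same analysis gives an upper bound on the ratio rather than an equality.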

The EqT framework generalizes several distinct architectures:

Standard Transformer: Setting $K = 0$ (or $E \equiv 0$) recovers the open-loop transformer.
Deep Equilibrium Models (DEQ): Under specific energy choices, the model reduces to a fixed-point system as in (Bai et al., 2019).
Diffusion LLMs: In an appropriate limiting regime, the refinement process becomes analogous to diffusion-based denoising sampling.
Test-Time Training (TTT): Treating model weights as latent variables and refining them via a similar energy minimization recapitulates TTT.
Energy-Based Models (EBM): Removing the prior ($\gamma \to \infty$) makes EqT a standard latent energy optimizer.

This closed-loop formalism establishes a unified framework parameterized by the energy form $E$ and refinement depth $K$, subsuming prominent families of autoregressive and energy-based network design (Jafari et al., 26 Nov 2025, Bai et al., 2019).
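
Three limiting cases can be checked directly from the refinement update, as in the scalar sketch below. The quadratic energy, its minimizer `c`, and all numeric settings are assumptions for illustration. The cases are: no refinement steps, a zero energy, and a very weak prior (large `gamma`).

```python
def refine(z_hat, grad_energy, eta, gamma, K):
    """K gradient steps on E(z) + (z - z_hat)**2 / (2*gamma), scalar case."""
    z = z_hat
    for _ in range(K):
        z = z - eta * (grad_energy(z) + (z - z_hat) / gamma)
    return z


z_hat, c = 0.0, 2.0


def quadratic_grad(z):
    return z - c  # gradient of 0.5*(z - c)**2, minimized at c


def zero_grad(z):
    return 0.0  # gradient of the constant energy E = 0


open_loop = refine(z_hat, quadratic_grad, eta=0.1, gamma=1.0, K=0)    # no refinement
no_energy = refine(z_hat, zero_grad, eta=0.1, gamma=1.0, K=100)       # E == 0
pure_ebm = refine(z_hat, quadratic_grad, eta=0.1, gamma=1e9, K=500)   # prior ~ removed
```

The first two runs return the proposal unchanged, which is the open-loop behavior, while the third converges to the energy minimizer `c` as a plain latent energy optimizer would.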

6. Empirical Results: Binary Parity Task

EqT's empirical evaluation employs the running XOR (“parity”) prediction task, a stringent probe of long-range sequential reasoning. The model configuration is as follows:

6-layer transformer, hidden dimension 256, 8 heads. The EqT version adds the ERM, which increases the parameter count.
25 training epochs on 32K random sequences, evaluation on 4K held-out sequences; sequence lengths range from 8 to 256.
EqT inference with up to a fixed maximum of $K$ refinement steps (fewer suffice for convergence in practice).
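
For concreteness, running-parity data of this kind can be generated as follows. This is a sketch of the task as commonly defined, where the target at each position is the XOR of all bits seen so far. The exact data format, the seed, and the function names are assumptions rather than details from the source.

```python
import random


def running_parity(bits):
    """Return the running XOR: target[t] = bits[0] ^ ... ^ bits[t]."""
    targets, acc = [], 0
    for b in bits:
        acc ^= b
        targets.append(acc)
    return targets


def make_example(length, rng):
    """Draw a random bit sequence and its per-position parity targets."""
    bits = [rng.randint(0, 1) for _ in range(length)]
    return bits, running_parity(bits)


rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
bits, targets = make_example(64, rng)
```

A single wrong intermediate parity flips every later target, which is why the task probes long-range state tracking and why per-token accuracy falls toward 50% (chance) as sequences grow.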

Performance summary (per-token accuracy as representative excerpt):

Sequence Length	Standard (%)	EqT (%)	Δ (abs)
8–48	>95	>95	small
64	88.15	92.81	+4.66
96	77.19	77.68	+0.49
128	64.64	67.04	+2.40
192	51.86	59.93	+8.07
256	55.79	56.60	+0.80

For sequence lengths of 64 and above, EqT achieves an average accuracy improvement of +3.28 percentage points over standard transformers, and a maximum gain of +8.07 points on the hardest subsets, where the standard model approaches random performance. Performance improvements scale with task difficulty, and nearly all tokens converge within the allotted refinement steps. Refinement adds inference overhead that grows with the number of steps $K$, and it raises the per-epoch training cost (training uses two refinement steps). Adaptive early stopping can reduce the inference cost (Jafari et al., 26 Nov 2025).
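
The summary figures can be re-derived from the table with a few lines of Python. The rows are copied from the table above; the variable names are illustrative.

```python
# (sequence length, standard %, EqT %, reported absolute gain) from the table.
rows = [
    (64, 88.15, 92.81, 4.66),
    (96, 77.19, 77.68, 0.49),
    (128, 64.64, 67.04, 2.40),
    (192, 51.86, 59.93, 8.07),
    (256, 55.79, 56.60, 0.80),
]

gains = [gain for _, _, _, gain in rows]
mean_gain = sum(gains) / len(gains)
best_length, best_gain = max(((r[0], r[3]) for r in rows), key=lambda p: p[1])

print(f"mean gain: {mean_gain:.2f}")
print(f"largest gain: {best_gain:.2f} at length {best_length}")
```

This reproduces the +3.28 average and places the +8.07 maximum at length 192, where the standard model sits near the 50% chance level. Recomputing the last row from the rounded accuracies gives 0.81 rather than the reported 0.80, presumably a rounding effect in the source.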

7. Connections to Deep Equilibrium Models and Memory Considerations

The DEQ-Transformer, introduced by Bai et al. (2019), is a special case of the EqT formalism under specific energy choices. DEQ models operate by finding the fixed point $z^* = f_\theta(z^*; x)$, leveraging root-finding solvers such as Broyden’s method to directly compute equilibrium without explicit layer unrolling. Gradients are computed via implicit differentiation, allowing constant $O(1)$ memory regardless of effective “depth.” On large-scale language modeling benchmarks (WikiText-103, PTB), DEQ-Transformers match or slightly surpass standard transformer performance with substantially reduced GPU memory usage. EqT relaxes the strict fixed-point requirement by introducing a learnable energy $E$ and thus accommodates broader classes of structured refinement, at the expense of some additional inference cost compared to traditional transformers (Bai et al., 2019, Jafari et al., 26 Nov 2025).
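
The fixed-point view and implicit differentiation can be illustrated on a scalar toy layer. This sketch makes several assumptions: the layer `f(z, x) = tanh(w*z + x)` and its weight `w` are hypothetical, and naive fixed-point iteration is swapped in for the Broyden-style solvers that DEQ actually uses. The gradient comes from the implicit function theorem, $dz^*/dx = f_x / (1 - f_z)$, evaluated at the fixed point.

```python
import math


def f(z, x, w=0.5):
    """Toy layer; |w| < 1 makes z -> f(z, x) a contraction."""
    return math.tanh(w * z + x)


def fixed_point(x, w=0.5, tol=1e-12, max_iter=1000):
    """Naive fixed-point iteration (a stand-in for a quasi-Newton root solver)."""
    z = 0.0
    for _ in range(max_iter):
        z_next = f(z, x, w)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z


def implicit_grad(x, w=0.5):
    """dz*/dx via the implicit function theorem, with no unrolling of iterations."""
    z_star = fixed_point(x, w)
    s = 1.0 - z_star ** 2  # tanh'(w*z* + x), since z* = tanh(w*z* + x)
    return s / (1.0 - w * s)


x = 0.3
z_star = fixed_point(x)
h = 1e-6
finite_diff = (fixed_point(x + h) - fixed_point(x - h)) / (2 * h)
```

The implicit gradient agrees with the finite-difference estimate while needing only the converged `z_star`, not the intermediate iterates. That is the source of the constant-memory property.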

References

Jafari et al. (2025). Closed-Loop Transformers: Autoregressive Modeling as Iterative Latent Equilibrium.

Bai et al. (2019). Deep Equilibrium Models.
