Equilibrium Transformers (EqT)

Updated 20 May 2026
  • Equilibrium Transformers (EqT) are autoregressive sequence models that iteratively refine latent states using a closed-loop gradient descent process to reach energetic equilibrium before token prediction.
  • The architecture employs an Equilibrium Refinement Module (ERM) that combines attention, initial proposal, and iterative gradient updates driven by a learned energy function.
  • Empirical results demonstrate enhanced long-range reasoning with up to an 8% accuracy boost on hard tasks and rapid convergence in as few as 8 refinement steps.

Equilibrium Transformers (EqT) are a class of autoregressive sequence models that replace the open-loop, one-shot inference of standard transformers with a closed-loop paradigm based on iterative latent refinement. This approach enforces self-consistent hidden states at each sequence position prior to token commitment by minimizing a learned energy function over the latent space. In EqT, each hidden state is refined via gradient-based minimization until equilibrium is reached with respect to both the dynamical prior and a composite self-supervised energy. The method provides theoretical guarantees as approximate MAP inference in a latent energy-based model, converges geometrically under mild assumptions, and yields the greatest empirical benefits on hard prediction instances—thereby addressing key limitations of autoregressive transformers in long-range reasoning and multi-step planning tasks (Jafari et al., 26 Nov 2025).

1. Motivation: Closed-Loop Prediction versus Open-Loop Autoregressive Transformers

Standard autoregressive transformers operate in an open-loop manner, computing each hidden state $h_t = F_\theta(h_{t-1}, x_{\leq t})$ in a single forward pass and never revisiting it, so the model commits irrevocably to an internal representation. This lack of revision propagates early errors forward and fundamentally limits the model’s ability to recover from mistakes, especially in settings that require long-range dependency tracking or factual consistency. EqT introduces the closed-loop prediction principle: rather than immediately generating output, the latent state at each step is iteratively refined until a self-consistent equilibrium is achieved, defined as the minimum of an energy-regularized objective. The equilibrium latent $z^*_t$ is obtained by

$$z^*_t \in \arg\min_{z\in\mathbb{R}^d} E(z; x_{\leq t}) + \frac{1}{2\gamma} \|z - F_\theta(h_{t-1}, x_{\leq t})\|^2,$$

and only then is the token distribution $p(x_t \mid x_{< t}) = \mathrm{softmax}(W z^*_t)$ produced. This closed-loop process eliminates the “commitment bottleneck” of classical transformers (Jafari et al., 26 Nov 2025).
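A short derivation may help show what the equilibrium condition implies. It assumes, beyond what the text states, that $E$ is differentiable in $z$; the proximal-operator reading in the last line additionally assumes $E$ is convex so that the minimizer is unique. Setting the gradient of the objective to zero gives an implicit gradient step taken from the open-loop state:

```latex
\nabla_z E(z^*_t; x_{\leq t}) + \frac{1}{\gamma}\bigl(z^*_t - F_\theta(h_{t-1}, x_{\leq t})\bigr) = 0
\quad\Longrightarrow\quad
z^*_t = F_\theta(h_{t-1}, x_{\leq t}) - \gamma\, \nabla_z E(z^*_t; x_{\leq t}),

z^*_t = \operatorname{prox}_{\gamma E}\bigl(F_\theta(h_{t-1}, x_{\leq t})\bigr).
```

Read this way, $\gamma$ controls how far the equilibrium may move from the open-loop state: a small $\gamma$ keeps $z^*_t$ close to $F_\theta(h_{t-1}, x_{\leq t})$, while a large $\gamma$ lets the energy dominate.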

2. Architecture and Energy Function Design

The core architectural modification in EqT is the Equilibrium Refinement Module (ERM), which replaces or augments the feed-forward sublayer in a standard transformer block. The computation at each position comprises three phases:

  1. Attention & Proposal: Compute standard multi-head self-attention and add a feed-forward layer to propose an initial latent $\hat z^{(0)} = a^{(0)} + \mathrm{FFN}(a^{(0)})$.
  2. Iterative Refinement: Starting from $z^{(0)} = \hat z^{(0)}$, perform $K$ steps of gradient descent:

$$z^{(k+1)} = z^{(k)} - \eta \nabla_z \left[ E(z^{(k)}; x_{\leq t}) + \frac{1}{2\gamma} \|z^{(k)} - \hat z^{(0)}\|^2 \right].$$

  3. Output to Next Layer: Pass the equilibrium latent $z^*_t = z^{(K)}$ through layer norm for subsequent processing.
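The refinement phase above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the quadratic toy energy, its parameters `a` and `c`, and the function name `refine` are all assumptions introduced here so that the result can be checked against a closed-form minimizer.

```python
def refine(z_hat, grad_energy, eta=0.1, gamma=1.0, num_steps=8):
    """Run num_steps gradient steps on E(z) + ||z - z_hat||^2 / (2 * gamma)."""
    z = list(z_hat)  # z^(0) is the proposal
    for _ in range(num_steps):
        g = grad_energy(z)
        # Gradient of the prior term is (z - z_hat) / gamma.
        z = [zi - eta * (gi + (zi - hi) / gamma)
             for zi, gi, hi in zip(z, g, z_hat)]
    return z


# Toy energy (an assumption for illustration): E(z) = 0.5 * a * ||z - c||^2,
# whose gradient is a * (z - c).
a = 2.0
c = [1.0, -1.0]
gamma = 1.0


def grad_energy(z):
    return [a * (zi - ci) for zi, ci in zip(z, c)]


z_hat = [0.0, 0.0]
z_star = refine(z_hat, grad_energy, eta=0.1, gamma=gamma, num_steps=200)

# For this quadratic energy the minimizer is available in closed form:
# z* = (a * c + z_hat / gamma) / (a + 1 / gamma).
closed_form = [(a * ci + hi / gamma) / (a + 1.0 / gamma)
               for ci, hi in zip(c, z_hat)]
```

In this toy setting the refined latent lands between the proposal `z_hat` and the energy minimum `c`, with the balance set by `gamma`.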

The energy function $E(z; x_{\leq t})$ is a learned, differentiable combination of multiple self-supervised losses:

  • Reverse predictive coding: Recovers recent context via a small reverse transformer.
  • Masked reconstruction: Predicts masked tokens from the latent using a lightweight decoder.
  • Output confidence: Penalizes ambiguous, high-entropy predictions.
  • Episodic memory coherence: Pulls the latent toward relevant recent memory vectors.

Weighting parameters allow principled tuning of these components (Jafari et al., 26 Nov 2025).
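One way to picture a weighted combination of energy terms is the sketch below. It is only illustrative: the two terms shown are simplified stand-ins (entropy of `softmax(W z)` for output confidence, squared distance to the nearest memory vector for memory coherence), and the names `composite_energy`, `entropy_energy`, `memory_energy`, the matrix `W`, and the weight values are hypothetical rather than taken from the paper.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_energy(z, W):
    """Entropy of softmax(W z): large when the prediction is ambiguous."""
    logits = [sum(w * zi for w, zi in zip(row, z)) for row in W]
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def memory_energy(z, memory):
    """Squared distance from z to its nearest memory vector (toy stand-in)."""
    return min(sum((zi - mi) ** 2 for zi, mi in zip(z, m)) for m in memory)


def composite_energy(z, terms, weights):
    """Weighted sum of energy terms; terms and weights are parallel lists."""
    return sum(w * term(z) for w, term in zip(weights, terms))


W = [[1.0, 0.0], [0.0, 1.0]]          # hypothetical output projection
memory = [[3.0, 0.0], [0.0, 3.0]]     # hypothetical memory vectors
terms = [lambda z: entropy_energy(z, W), lambda z: memory_energy(z, memory)]
weights = [1.0, 0.5]                  # hypothetical weighting parameters

ambiguous = composite_energy([0.0, 0.0], terms, weights)
confident = composite_energy([3.0, 0.0], terms, weights)
```

Here the latent `[3.0, 0.0]` scores a lower energy than `[0.0, 0.0]` because it yields a peaked output distribution and coincides with a memory vector.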

3. Inference and Optimization Dynamics

The latent refinement is performed via first-order gradient descent, with update step

$$z^{(k+1)} = z^{(k)} - \eta \nabla_z \left[ E(z^{(k)}; x_{\leq t}) + \frac{1}{2\gamma} \|z^{(k)} - \hat z^{(0)}\|^2 \right].$$

Iteration proceeds until either a maximum number of steps is reached or the update norm $\|z^{(k+1)} - z^{(k)}\|$ falls below a small threshold. Empirical results indicate that nearly all tokens converge within a small number of steps, balancing accuracy and computational efficiency. The refined equilibrium $z^*_t$ is then used for token prediction (Jafari et al., 26 Nov 2025).
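The stopping rule can be sketched as follows. This is a hedged illustration on a toy quadratic energy: the function name `refine_adaptive`, the default `max_steps` and `tol`, and the "easy" and "hard" targets are assumptions chosen for the example, not values from the paper.

```python
import math


def refine_adaptive(z_hat, grad_energy, eta=0.1, gamma=1.0,
                    max_steps=50, tol=1e-4):
    """Gradient refinement that stops once the update norm drops below tol.

    Returns the refined latent and the number of steps actually taken.
    """
    z = list(z_hat)
    for step in range(1, max_steps + 1):
        g = grad_energy(z)
        z_next = [zi - eta * (gi + (zi - hi) / gamma)
                  for zi, gi, hi in zip(z, g, z_hat)]
        update_norm = math.sqrt(sum((n - o) ** 2 for n, o in zip(z_next, z)))
        z = z_next
        if update_norm < tol:
            return z, step
    return z, max_steps


def make_grad(c):
    """Gradient of the toy energy E(z) = 0.5 * ||z - c||^2 (an assumption)."""
    return lambda z: [zi - ci for zi, ci in zip(z, c)]


z_hat = [0.0, 0.0]
# "Easy" position: the proposal already sits almost at the energy minimum.
_, easy_steps = refine_adaptive(z_hat, make_grad([0.0002, 0.0]))
# "Hard" position: the proposal is far from the energy minimum.
_, hard_steps = refine_adaptive(z_hat, make_grad([2.0, 0.0]))
```

In this toy case the easy position stops after a single step while the hard one uses many more, which is the behavior adaptive stopping is meant to exploit.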

4. Theoretical Foundations

EqT is formally equivalent to performing MAP inference in a latent energy-based model with posterior

$$p(z \mid x_{\leq t}) \propto \exp\!\left(-E(z; x_{\leq t}) - \frac{1}{2\gamma} \|z - F_\theta(h_{t-1}, x_{\leq t})\|^2\right),$$

and the closed-loop equilibrium produces $z^*_t = \arg\max_z p(z \mid x_{\leq t})$. Under mild strong convexity and smoothness assumptions on $E$, standard gradient-descent analysis guarantees geometric (“linear”) convergence, with a contraction factor below one for practical hyperparameters. Moreover, the benefit of refinement is greatest when the amortized proposal $\hat z^{(0)}$ is distant from the loss-optimal latent, i.e., for hard prediction instances. This effect is quantitatively validated on synthetic tasks (Jafari et al., 26 Nov 2025).
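The geometric rate can be made concrete with the textbook gradient-descent bound. The constants $\mu$ and $L$ below are assumptions introduced for this sketch (curvature bounds on a twice-differentiable $E$); they are not symbols taken from the paper. Writing $J$ for the refinement objective:

```latex
J(z) = E(z; x_{\leq t}) + \frac{1}{2\gamma}\|z - \hat z^{(0)}\|^2,
\qquad
\mu I \preceq \nabla_z^2 E \preceq L I
\;\Longrightarrow\;
\Bigl(\mu + \tfrac{1}{\gamma}\Bigr) I \preceq \nabla_z^2 J \preceq \Bigl(L + \tfrac{1}{\gamma}\Bigr) I,

\|z^{(k+1)} - z^*\| \le \rho\, \|z^{(k)} - z^*\|,
\qquad
\rho = \max\Bigl\{\bigl|1 - \eta(\mu + \tfrac{1}{\gamma})\bigr|,\; \bigl|1 - \eta(L + \tfrac{1}{\gamma})\bigr|\Bigr\},

\rho < 1 \iff 0 < \eta < \frac{2}{L + 1/\gamma}.
```

Under these assumptions the prior term adds $1/\gamma$ to both curvature bounds, so a smaller $\gamma$ improves the conditioning of the refinement problem but also tightens the admissible step size $\eta$.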

The EqT framework generalizes several distinct architectures:

  • Standard Transformer: Disabling refinement (for example, $K = 0$ steps) recovers the open-loop transformer, since the latent is then the unrefined proposal.
  • Deep Equilibrium Models (DEQ): Under specific energy choices, the model reduces to a fixed-point system as in (Bai et al., 2019).
  • Diffusion LLMs: In an appropriate limiting regime, the refinement process becomes analogous to diffusion-based denoising sampling.
  • Test-Time Training (TTT): Treating model weights as latent variables and refining them via a similar energy minimization recapitulates TTT.
  • Energy-Based Models (EBM): Removing the prior ($\gamma \to \infty$) makes EqT a standard latent energy optimizer.

This closed-loop formalism establishes a unified framework parameterized by the energy form $E$ and refinement depth $K$, subsuming prominent families of autoregressive and energy-based network design (Jafari et al., 26 Nov 2025, Bai et al., 2019).

6. Empirical Results: Binary Parity Task

EqT's empirical evaluation employs the running XOR (“parity”) prediction task, a stringent probe of long-range sequential reasoning. The model configuration is as follows:

  • 6-layer transformer, hidden dimension 256, 8 heads. The EqT version adds the ERM, which increases the parameter count.
  • 25 training epochs on 32K random sequences, evaluation on 4K held-out sequences; sequence lengths range from 8 to 256.
  • EqT inference with a bounded number of refinement steps; convergence is reported in as few as 8 steps.

Performance summary (per-token accuracy, representative excerpt):

Sequence Length | Standard (%) | EqT (%) | Δ (abs)
8–48 | >95 | >95 | small
64 | 88.15 | 92.81 | +4.66
96 | 77.19 | 77.68 | +0.49
128 | 64.64 | 67.04 | +2.40
192 | 51.86 | 59.93 | +8.07
256 | 55.79 | 56.60 | +0.80

For sequence lengths of 64 and above, EqT achieves an average accuracy improvement of +3.28 percentage points over standard transformers, with a maximum gain of +8.07 points at length 192, where the standard transformer approaches random performance. Performance improvements tend to grow with task difficulty, and nearly all tokens converge within a small number of refinement steps. Refinement adds computational overhead at inference, and training (performed with two refinement steps) costs more per epoch; adaptive early stopping can reduce the inference cost (Jafari et al., 26 Nov 2025).
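The average and maximum gains can be recomputed from the reported Δ column as a quick consistency check. The snippet below uses only the numbers in the table above; the variable names are illustrative.

```python
# Reported absolute gains (EqT minus standard), keyed by sequence length.
reported_gain = {64: 4.66, 96: 0.49, 128: 2.40, 192: 8.07, 256: 0.80}

mean_gain = sum(reported_gain.values()) / len(reported_gain)
best_length = max(reported_gain, key=reported_gain.get)
best_gain = reported_gain[best_length]

print(f"mean gain over lengths 64-256: {mean_gain:.2f} points")
print(f"largest gain: {best_gain:.2f} points at length {best_length}")
```

The mean of the five reported gains rounds to 3.28, matching the stated average. The length-256 row reports +0.80 although the rounded accuracies differ by 0.81, presumably because the gain was computed before rounding.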

7. Connections to Deep Equilibrium Models and Memory Considerations

The DEQ-Transformer, as introduced by Bai et al. (Bai et al., 2019), is a special case of the EqT formalism under specific energy choices. DEQ models operate by finding the fixed point $z^* = f_\theta(z^*; x)$, leveraging root-finding solvers such as Broyden’s method to directly compute equilibrium without explicit layer unrolling. Gradients are computed via implicit differentiation, allowing constant $O(1)$ memory regardless of effective “depth.” On large-scale language modeling benchmarks (WikiText-103, PTB), DEQ-Transformers match or slightly surpass standard transformer performance with a substantial reduction in GPU memory usage. EqT relaxes the strict fixed-point requirement by introducing a learnable energy $E$ and thus accommodates broader classes of structured refinement, at the expense of some additional inference cost compared to traditional transformers (Bai et al., 2019, Jafari et al., 26 Nov 2025).
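The fixed-point view can be illustrated with a tiny example. Note the substitution: this sketch uses naive fixed-point iteration rather than Broyden’s method, and the layer `tanh(W z + x)` with its small weight matrix is a hypothetical contraction chosen so that the iteration is guaranteed to converge; it is not the DEQ-Transformer layer.

```python
import math


def fixed_point(f, z0, tol=1e-10, max_iter=1000):
    """Iterate z <- f(z) until successive iterates differ by less than tol."""
    z = list(z0)
    for it in range(1, max_iter + 1):
        z_next = f(z)
        change = max(abs(n - o) for n, o in zip(z_next, z))
        z = z_next
        if change < tol:
            return z, it
    return z, max_iter


# Hypothetical weights: every row has absolute sum <= 0.5 and tanh is
# 1-Lipschitz, so the layer is a contraction in the max norm.
W = [[0.3, -0.2], [0.1, 0.4]]
x = [0.5, -0.5]


def layer(z):
    return [math.tanh(sum(w * zi for w, zi in zip(row, z)) + xi)
            for row, xi in zip(W, x)]


z_star, iters = fixed_point(layer, [0.0, 0.0])
# Residual of the fixed-point equation z* = layer(z*).
residual = max(abs(a - b) for a, b in zip(layer(z_star), z_star))
```

Only the current iterate is stored during the solve, which is the intuition behind the constant-memory property; training a real DEQ additionally relies on implicit differentiation, which this sketch does not cover.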
