- The paper introduces Energy-Based Transformers that integrate energy-based optimization with Transformers to enable dynamic computation and self-verification.
- The model scales more efficiently than the standard Transformer++ recipe across text and vision tasks, achieving up to 35% higher scaling rates and robust out-of-distribution generalization.
- Regularization techniques such as replay buffers, Langevin dynamics, and randomized optimization schedules encourage smooth energy landscapes, improving optimization stability and feature learning.
The paper "Energy-Based Transformers are Scalable Learners and Thinkers" (2507.02092) introduces Energy-Based Transformers (EBTs), a new class of models that integrate the energy-based modeling framework with Transformer architectures. The central claim is that EBTs not only scale more efficiently than standard Transformer-based models across both discrete (text) and continuous (vision) modalities, but also natively support "System 2 Thinking"—the ability to dynamically allocate computation and verify predictions at inference time, analogous to deliberate human reasoning.
Motivation and Theoretical Foundations
The authors identify three key cognitive facets necessary for advanced reasoning in AI systems:
- Dynamic Compute Allocation: The ability to spend variable computational effort per prediction, adapting to task difficulty.
- Uncertainty Modeling: Explicit estimation of prediction uncertainty, especially in continuous state spaces.
- Prediction Verification: The capacity to verify the quality of candidate predictions, enabling self-evaluation and selective refinement.
Traditional autoregressive Transformers and RNNs are limited in these respects: they allocate fixed compute per prediction, struggle with uncertainty estimation in continuous domains, and lack explicit verification mechanisms. Diffusion models offer iterative inference but require external verifiers and are not modality-agnostic.
EBTs address these limitations by reframing prediction as an optimization problem over a learned energy landscape. The model assigns an energy (unnormalized negative log-likelihood) to each input–candidate pair, and predictions are obtained by minimizing this energy via gradient descent. This process enables dynamic computation, uncertainty quantification, and self-verification within a unified, unsupervised learning framework.
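To make this concrete, here is a minimal PyTorch-style sketch of prediction-as-optimization, assuming a hypothetical `energy_model(context, y)` that returns one scalar energy per sample; the function name, step size, and step count are illustrative assumptions, not the authors' exact settings.

```python
import torch

def predict_by_energy_descent(energy_model, context, y_init, n_steps=8, step_size=0.1):
    """Refine a candidate prediction by gradient descent on a learned energy.

    energy_model(context, y) is assumed to return one scalar energy per sample;
    lower energy means the candidate is judged more compatible with the context.
    """
    y = y_init.detach().clone().requires_grad_(True)
    for _ in range(n_steps):
        energy = energy_model(context, y).sum()       # scalar objective for autograd
        (grad_y,) = torch.autograd.grad(energy, y)    # dE/dy, the "verifier's feedback"
        y = (y - step_size * grad_y).detach().requires_grad_(True)
    final_energy = energy_model(context, y).detach()  # per-sample energies of the result
    return y.detach(), final_energy
```

The number of descent steps is exactly the knob that later enables "thinking longer" at inference time, and the final energy doubles as a verification score for the prediction.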
Architecture and Training
EBTs are implemented as Transformer-based energy functions, with two main variants:
- Decoder-only (autoregressive) EBTs: Parallelize all next-token predictions, inspired by GPT architectures.
- Bidirectional EBTs: Enable masked modeling and infilling, similar to BERT and Diffusion Transformers.
Training involves initializing candidate predictions (e.g., random noise for text tokens or image patches), then iteratively refining them by descending the energy landscape. The loss is computed between the final refined prediction and the ground truth, and gradients are backpropagated through the entire optimization trajectory, requiring efficient Hessian-vector product computation.
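The sentence above compresses several moving parts; the following hedged sketch shows how such a training step could look in PyTorch, with a differentiable inner loop so the outer loss can backpropagate through every refinement step (this is where Hessian-vector products arise, via `create_graph=True`). The helper names, loss choice, and hyperparameters are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def ebt_training_step(energy_model, optimizer, context, target,
                      n_inner_steps=2, step_size=0.1):
    """One outer step: refine a random candidate by energy descent, then
    backpropagate the prediction loss through the whole inner trajectory."""
    y = torch.randn_like(target).requires_grad_(True)   # random initial candidate

    for _ in range(n_inner_steps):
        energy = energy_model(context, y).sum()
        # create_graph=True keeps the inner graph so the outer loss can
        # differentiate through these updates (second-order / Hessian-vector terms).
        (grad_y,) = torch.autograd.grad(energy, y, create_graph=True)
        y = y - step_size * grad_y

    loss = F.mse_loss(y, target)   # continuous targets; discrete tokens would use cross-entropy
    optimizer.zero_grad()
    loss.backward()                # gradients flow through every inner gradient step
    optimizer.step()
    return loss.detach()
```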
To encourage smooth, approximately convex energy landscapes, which are critical for stable optimization and effective "thinking", the authors introduce several regularization techniques (sketched in code after this list):
- Replay buffers to simulate longer optimization trajectories.
- Langevin dynamics (adding noise during optimization) to encourage exploration.
- Randomization of step size and number of optimization steps to improve generalization.
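A rough illustration of how these regularizers might enter the descent loop, using the same hypothetical `energy_model` interface as above; the probabilities, noise scale, and ranges are placeholder assumptions:

```python
import random
import torch

def regularized_energy_descent(energy_model, context, y_init, replay_buffer,
                               max_steps=10, base_step=0.1, noise_scale=0.01):
    """Training-time energy descent with the regularizers listed above (illustrative)."""
    # Replay buffer: occasionally resume from a previously refined candidate,
    # which effectively simulates a longer optimization trajectory.
    if replay_buffer and random.random() < 0.5:
        y = random.choice(replay_buffer).clone()
    else:
        y = y_init.detach().clone()
    y.requires_grad_(True)

    n_steps = random.randint(1, max_steps)               # randomized trajectory length
    for _ in range(n_steps):
        step = base_step * random.uniform(0.5, 1.5)      # randomized step size
        energy = energy_model(context, y).sum()
        (grad_y,) = torch.autograd.grad(energy, y, create_graph=True)
        # Langevin-style noise encourages exploration of the energy landscape.
        y = y - step * grad_y + noise_scale * torch.randn_like(y)

    replay_buffer.append(y.detach())                     # reuse as a future starting point
    return y
```

In practice the buffer would be bounded and the schedules tuned per modality; the sketch only shows where each regularizer plugs into the descent loop.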
Empirical Results
Learning Scalability
Across both language and vision domains, EBTs demonstrate superior scaling properties compared to the Transformer++ recipe (the current standard for large-scale Transformer training):
- Language modeling: EBTs achieve up to 35% higher scaling rates than Transformer++ with respect to data, batch size, model depth, parameter count, FLOPs, and embedding dimension. This suggests improved data and compute efficiency, and the authors expect the gap to widen at foundation-model scale.
- Video prediction: EBTs scale over 33% faster than Transformer++ in next-frame prediction tasks, particularly with respect to embedding dimension and parameter count.
System 2 Thinking and Inference-Time Computation
EBTs exhibit emergent System 2 Thinking capabilities:
- Thinking longer (more optimization steps): EBTs improve per-token performance by up to 29% on language tasks when allowed additional inference-time computation, whereas Transformer++ models show no such improvement.
- Self-verification (Best-of-N sampling): EBTs can generate multiple candidate predictions and select the one with the lowest energy, yielding further performance gains; the benefit of self-verification grows with model and data scale (see the sketch after this list).
- Out-of-distribution (OOD) generalization: The performance improvement from System 2 Thinking scales linearly with the degree of distributional shift, indicating robust generalization to novel or challenging data.
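As a minimal illustration of Best-of-N self-verification, reusing the hypothetical `predict_by_energy_descent` sketch from earlier and assuming batch size 1, candidates are refined from different random initializations and the lowest-energy one is returned:

```python
import torch

def best_of_n_prediction(energy_model, context, y_shape, n_candidates=8, **descent_kwargs):
    """Generate N refined candidates and keep the one the energy function scores lowest."""
    best_y, best_energy = None, float("inf")
    for _ in range(n_candidates):
        y_init = torch.randn(y_shape)                 # fresh random starting candidate
        y, energy = predict_by_energy_descent(energy_model, context, y_init, **descent_kwargs)
        if energy.sum().item() < best_energy:         # lowest total energy wins
            best_y, best_energy = y, energy.sum().item()
    return best_y, best_energy
```

No external reward model is involved: the same energy function that drives the refinement also scores and ranks the candidates.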
Continuous Domains and Bidirectional Modeling
- Image denoising: Bidirectional EBTs outperform Diffusion Transformers (DiTs) on both in-distribution and OOD noise levels, achieving higher PSNR and lower MSE with 99% fewer forward passes.
- Representation learning: EBTs yield significantly better linear probe accuracy on ImageNet-1k than DiTs, indicating superior feature learning and understanding of generated content.
Notably, EBTs often achieve better downstream task performance than Transformer++ models even when their pretraining perplexity is slightly worse. This suggests that explicit verification and uncertainty modeling confer generalization benefits not captured by standard likelihood-based training.
Implementation Considerations
- Computational cost: Training and inference with EBTs are more expensive than standard Transformers due to the need for multiple optimization steps and second-order derivatives. However, the improved scaling rates and data efficiency may offset these costs at large scale.
- Hyperparameter sensitivity: Stability is sensitive to optimization step size, number of steps, and regularization parameters. Careful tuning is required, especially for large models or long optimization trajectories.
- Scalability: The authors demonstrate scaling up to 800M parameters, but further work is needed to validate EBTs at multi-billion parameter, trillion-token scale.
Limitations and Open Questions
- Mode collapse in multimodal distributions: EBTs struggle with highly multimodal output spaces (e.g., unconditional image generation), likely due to the convexity bias in the energy landscape.
- Inference latency: The iterative optimization process increases inference time, which may be prohibitive for latency-sensitive applications.
- Integration with existing models: It remains open how best to combine EBTs with standard feed-forward models, for example as verifiers or refinement modules applied to their outputs, rather than as standalone predictors.
Implications and Future Directions
EBTs represent a principled approach to integrating learning and inference-time reasoning within a single, scalable architecture. By unifying prediction, verification, and uncertainty modeling, they offer a path toward more robust, adaptable, and generalizable AI systems.
Potential future developments include:
- Scaling to foundation model regimes to empirically validate the predicted performance gains.
- Hybrid architectures combining EBTs with standard models for efficient System 1/System 2 trade-offs.
- Application to world modeling, planning, and control where explicit verification and dynamic computation are critical.
- Extension to multimodal and multi-agent settings leveraging the compositionality and flexibility of energy-based objectives.
The explicit demonstration that unsupervised learning can give rise to System 2 Thinking, without domain-specific supervision or external verifiers, challenges prevailing assumptions in large-scale model training. EBTs thus provide a compelling framework for the next generation of scalable, reasoning-capable AI systems.