- The paper introduces Energy-Based Transformers that integrate energy-based optimization with Transformers to enable dynamic computation and self-verification.
- The model scales more efficiently than the standard Transformer++ recipe across text and vision tasks, achieving up to 35% higher scaling rates and robust out-of-distribution generalization.
- Regularization techniques such as replay buffers, Langevin dynamics, and randomized optimization schedules encourage smooth energy landscapes, improving optimization stability and feature learning.
The paper "Energy-Based Transformers are Scalable Learners and Thinkers" (2507.02092) introduces Energy-Based Transformers (EBTs), a new class of models that integrate the energy-based modeling framework with Transformer architectures. The central claim is that EBTs not only scale more efficiently than standard Transformer-based models across both discrete (text) and continuous (vision) modalities, but also natively support "System 2 Thinking"—the ability to dynamically allocate computation and verify predictions at inference time, analogous to deliberate human reasoning.
Motivation and Theoretical Foundations
The authors identify three key cognitive facets necessary for advanced reasoning in AI systems:
- Dynamic Compute Allocation: The ability to spend variable computational effort per prediction, adapting to task difficulty.
- Uncertainty Modeling: Explicit estimation of prediction uncertainty, especially in continuous state spaces.
- Prediction Verification: The capacity to verify the quality of candidate predictions, enabling self-evaluation and selective refinement.
Traditional autoregressive Transformers and RNNs are limited in these respects: they allocate fixed compute per prediction, struggle with uncertainty estimation in continuous domains, and lack explicit verification mechanisms. Diffusion models offer iterative inference but require external verifiers and are not modality-agnostic.
EBTs address these limitations by reframing prediction as an optimization problem over a learned energy landscape. The model assigns an energy (unnormalized negative log-likelihood) to each input–candidate pair, and predictions are obtained by minimizing this energy via gradient descent. This process enables dynamic computation, uncertainty quantification, and self-verification within a unified, unsupervised learning framework.
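To make this concrete, here is a minimal PyTorch-style sketch of prediction-as-optimization, assuming a hypothetical `energy_model(context, y)` that returns one scalar energy per sample; the function name, step size, and step count are illustrative assumptions, not the authors' exact settings.

```python
import torch

def predict_by_energy_descent(energy_model, context, y_init, n_steps=8, step_size=0.1):
    """Refine a candidate prediction by gradient descent on a learned energy.

    energy_model(context, y) is assumed to return one scalar energy per sample;
    lower energy means the candidate is judged more compatible with the context.
    """
    y = y_init.detach().clone().requires_grad_(True)
    for _ in range(n_steps):
        energy = energy_model(context, y).sum()       # scalar objective for autograd
        (grad_y,) = torch.autograd.grad(energy, y)    # dE/dy, the "verifier's feedback"
        y = (y - step_size * grad_y).detach().requires_grad_(True)
    final_energy = energy_model(context, y).detach()  # per-sample energies of the result
    return y.detach(), final_energy
```

The number of descent steps is exactly the knob that later enables "thinking longer" at inference time, and the final energy doubles as a verification score for the prediction.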
Architecture and Training
EBTs are implemented as Transformer-based energy functions, with two main variants:
- Decoder-only (autoregressive) EBTs: Parallelize all next-token predictions, inspired by GPT architectures.
- Bidirectional EBTs: Enable masked modeling and infilling, similar to BERT and Diffusion Transformers.
Training involves initializing candidate predictions (e.g., random noise for text tokens or image patches), then iteratively refining them by descending the energy landscape. The loss is computed between the final refined prediction and the ground truth, and gradients are backpropagated through the entire optimization trajectory, requiring efficient Hessian-vector product computation.
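The sentence above compresses several moving parts; the following hedged sketch shows how such a training step could look in PyTorch, with a differentiable inner loop so the outer loss can backpropagate through every refinement step (this is where Hessian-vector products arise, via `create_graph=True`). The helper names, loss choice, and hyperparameters are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def ebt_training_step(energy_model, optimizer, context, target,
                      n_inner_steps=2, step_size=0.1):
    """One outer step: refine a random candidate by energy descent, then
    backpropagate the prediction loss through the whole inner trajectory."""
    y = torch.randn_like(target).requires_grad_(True)   # random initial candidate

    for _ in range(n_inner_steps):
        energy = energy_model(context, y).sum()
        # create_graph=True keeps the inner graph so the outer loss can
        # differentiate through these updates (second-order / Hessian-vector terms).
        (grad_y,) = torch.autograd.grad(energy, y, create_graph=True)
        y = y - step_size * grad_y

    loss = F.mse_loss(y, target)   # continuous targets; discrete tokens would use cross-entropy
    optimizer.zero_grad()
    loss.backward()                # gradients flow through every inner gradient step
    optimizer.step()
    return loss.detach()
```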
To encourage smooth, approximately convex energy landscapes, which are critical for stable optimization and effective "thinking", the authors introduce several regularization techniques (sketched in code after this list):
- Replay buffers to simulate longer optimization trajectories.
- Langevin dynamics (adding noise during optimization) to encourage exploration.
- Randomization of step size and number of optimization steps to improve generalization.
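A rough illustration of how these regularizers might enter the descent loop, using the same hypothetical `energy_model` interface as above; the probabilities, noise scale, and ranges are placeholder assumptions:

```python
import random
import torch

def regularized_energy_descent(energy_model, context, y_init, replay_buffer,
                               max_steps=10, base_step=0.1, noise_scale=0.01):
    """Training-time energy descent with the regularizers listed above (illustrative)."""
    # Replay buffer: occasionally resume from a previously refined candidate,
    # which effectively simulates a longer optimization trajectory.
    if replay_buffer and random.random() < 0.5:
        y = random.choice(replay_buffer).clone()
    else:
        y = y_init.detach().clone()
    y.requires_grad_(True)

    n_steps = random.randint(1, max_steps)               # randomized trajectory length
    for _ in range(n_steps):
        step = base_step * random.uniform(0.5, 1.5)      # randomized step size
        energy = energy_model(context, y).sum()
        (grad_y,) = torch.autograd.grad(energy, y, create_graph=True)
        # Langevin-style noise encourages exploration of the energy landscape.
        y = y - step * grad_y + noise_scale * torch.randn_like(y)

    replay_buffer.append(y.detach())                     # reuse as a future starting point
    return y
```

In practice the buffer would be bounded and the schedules tuned per modality; the sketch only shows where each regularizer plugs into the descent loop.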
Empirical Results
Learning Scalability
Across both language and vision domains, EBTs demonstrate superior scaling properties compared to the Transformer++ recipe (the current standard for large-scale Transformer training):
- Language modeling: EBTs achieve up to 35% higher scaling rates than Transformer++ with respect to data, batch size, model depth, parameter count, FLOPs, and embedding dimension. This suggests improved data and compute efficiency, and the authors expect the gap to widen at foundation-model scale.
- Video prediction: EBTs scale over 33% faster than Transformer++ in next-frame prediction tasks, particularly with respect to embedding dimension and parameter count.
System 2 Thinking and Inference-Time Computation
EBTs exhibit emergent System 2 Thinking capabilities:
- Thinking longer (more optimization steps): EBTs improve per-token performance by up to 29% on language tasks when allowed additional inference-time computation, whereas Transformer++ models show no such improvement.
- Self-verification (Best-of-N sampling): EBTs can generate multiple candidate predictions and select the one with the lowest energy, yielding further performance gains; the benefit of self-verification grows with model and data scale (see the sketch after this list).
- Out-of-distribution (OOD) generalization: The performance improvement from System 2 Thinking scales linearly with the degree of distributional shift, indicating robust generalization to novel or challenging data.
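As a minimal illustration of Best-of-N self-verification, reusing the hypothetical `predict_by_energy_descent` sketch from earlier and assuming batch size 1, candidates are refined from different random initializations and the lowest-energy one is returned:

```python
import torch

def best_of_n_prediction(energy_model, context, y_shape, n_candidates=8, **descent_kwargs):
    """Generate N refined candidates and keep the one the energy function scores lowest."""
    best_y, best_energy = None, float("inf")
    for _ in range(n_candidates):
        y_init = torch.randn(y_shape)                 # fresh random starting candidate
        y, energy = predict_by_energy_descent(energy_model, context, y_init, **descent_kwargs)
        if energy.sum().item() < best_energy:         # lowest total energy wins
            best_y, best_energy = y, energy.sum().item()
    return best_y, best_energy
```

No external reward model is involved: the same energy function that drives the refinement also scores and ranks the candidates.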
Continuous Domains and Bidirectional Modeling
- Image denoising: Bidirectional EBTs outperform Diffusion Transformers (DiTs) on both in-distribution and OOD noise levels, achieving higher PSNR and lower MSE with 99% fewer forward passes.
- Representation learning: EBTs yield significantly better linear probe accuracy on ImageNet-1k than DiTs, indicating superior feature learning and understanding of generated content.
Notably, EBTs often achieve better downstream task performance than Transformer++ models even when their pretraining perplexity is slightly worse. This suggests that explicit verification and uncertainty modeling confer generalization benefits not captured by standard likelihood-based training.
Implementation Considerations
- Computational cost: Training and inference with EBTs are more expensive than standard Transformers due to the need for multiple optimization steps and second-order derivatives. However, the improved scaling rates and data efficiency may offset these costs at large scale.
- Hyperparameter sensitivity: Stability is sensitive to optimization step size, number of steps, and regularization parameters. Careful tuning is required, especially for large models or long optimization trajectories.
- Scalability: The authors demonstrate scaling up to 800M parameters, but further work is needed to validate EBTs at multi-billion parameter, trillion-token scale.
Limitations and Open Questions
- Mode collapse in multimodal distributions: EBTs struggle with highly multimodal output spaces (e.g., unconditional image generation), likely due to the convexity bias in the energy landscape.
- Inference latency: The iterative optimization process increases inference time, which may be prohibitive for latency-sensitive applications.
- Integration with existing models: It remains open how best to combine EBTs with standard feed-forward models, for example as verifiers or refinement modules applied to their outputs, rather than as standalone predictors.
Implications and Future Directions
EBTs represent a principled approach to integrating learning and inference-time reasoning within a single, scalable architecture. By unifying prediction, verification, and uncertainty modeling, they offer a path toward more robust, adaptable, and generalizable AI systems.
Potential future developments include:
- Scaling to foundation model regimes to empirically validate the predicted performance gains.
- Hybrid architectures combining EBTs with standard models for efficient System 1/System 2 trade-offs.
- Application to world modeling, planning, and control where explicit verification and dynamic computation are critical.
- Extension to multimodal and multi-agent settings leveraging the compositionality and flexibility of energy-based objectives.
The explicit demonstration that unsupervised learning can give rise to System 2 Thinking, without domain-specific supervision or external verifiers, challenges prevailing assumptions in large-scale model training. EBTs thus provide a compelling framework for the next generation of scalable, reasoning-capable AI systems.