Compute-Optimal Transformer Shapes

Updated 2 October 2025

The paper presents a framework for compute-optimal transformer shapes by balancing model width, depth, and MLP dimensions to minimize task loss under fixed compute constraints.
It leverages scaling-law theory with empirical scaling exponents—such as MLP scaling of approximately 0.6—to optimize architectural configurations and improve compute efficiency.
The methodology integrates adaptive training, dynamic shape scheduling, and hardware co-design to achieve significant reductions in FLOPs and energy consumption across diverse tasks.

Compute-optimal transformer shapes refer to architectural and scaling configurations of transformer models that maximize task performance for a fixed compute budget. This principle encompasses the interplay between model parameterization, training steps, dataset size, architectural dimensioning (such as width, depth, and MLP size), and hardware efficiency. Recent research integrates scaling-law theory, optimization perspectives, information-theoretic bounds, full-stack systems co-design, and specialized parameterizations to characterize and systematically realize compute-optimal transformer architectures. The following sections synthesize theoretical frameworks, empirical methodologies, architectural recipes, and implications for transformer design, as established in contemporary literature.

1. Theoretical Foundations: Scaling Laws and Information-Theoretic Trade-Offs

A rigorous foundation for compute-optimal transformer shapes derives from empirical and theoretical scaling laws that describe how task loss $L$ decays as a function of compute $C$ , model parameters $p$ , and training tokens $t$ . Classical studies establish error bounds for neural predictor parameterizations $P_t$ via decompositions of the form

$\mathbb{E}[L_t(P_t)] = L^* + \text{misspecification error} + \text{estimation error} + \text{inferential error}$

where $L^*$ is the irreducible Bayes error. The misspecification error decays roughly as $O(1/n)$ , where $n$ is model width, while the estimation error scales as $O(1/t)$ with dataset size. Under a compute budget constraint $C \sim n d t$ , minimizing both errors leads to the nearly linear prescription

$n d \sim \tilde{O}(\sqrt{C}), \qquad t \sim \tilde{O}(\sqrt{C})$

yielding a scaling relationship $p \sim t$ (Jeon et al., 2022). As input dimension $d$ or latent complexity $K$ increase, the optimal allocation tilts toward larger model sizes rather than solely expanding dataset size. This framework generalizes to multi-layer and transformer models, explaining the empirical scaling observed in high-capacity LLMs.

2. Architectural Shape Scaling: Width, Depth, and MLP Dimension

Contemporary work extends classical scaling laws to explicitly account for model “shape” — that is, the allocation of compute across width $(w)$ , depth $(d)$ , and MLP hidden dimension $(m)$ . For vision transformers (ViTs), the task loss as a function of a shape parameter $x_k$ and compute $t$ is modeled as

$f_k(x_k, t) \sim \alpha_k x_k^{-a_k} + (\beta_k x_k^{b_k} + \xi_k)t^{-c} + \epsilon_k$

(Alabdulmohsin et al., 2023). Minimizing $f_k$ for fixed $t$ yields compute-optimal scaling exponents

$x_k^* = \left( \frac{\alpha_k a_k t^c}{\beta_k b_k} \right)^{1/(a_k + b_k)} \propto t^{s_k}, \qquad s_k = \frac{c}{a_k + b_k}$

For image classification, experimentally determined scaling rates indicate that the MLP hidden dimension should be scaled most aggressively, followed by depth and then width: $s_\mathrm{MLP} \approx 0.6$ , $s_\mathrm{depth} \approx 0.45$ , $s_\mathrm{width} \approx 0.22$ . This structured scaling surpasses naïve model size increases, resulting in models such as SoViT-400m/14 that match or exceed much larger models in accuracy and require less than half the inference compute.

3. Optimization Perspective: Energy Functions and Unfolding

The optimization perspective recasts the transformer forward pass as iterative descent on an energy function $E(Y)$ . For example, by constructing

$E_1(Y) = \sum_{i,j} \rho(\tfrac{1}{2} \|y_i - y_j\|^2) + R(Y) \quad \text{with} \quad \rho(z) = -\exp(-z),\; R(Y)=\tfrac{1}{2}\|Y\|_F^2,$

and defining updates via (approximate) gradient or proximal steps, the forward pass can be interpreted as alternating minimization over self-attention and feed-forward modules, unfolding into the canonical transformer architecture (Yang et al., 2022). Each layer executes a (possibly inexact) descent step, with the network shape (number of layers) corresponding to the iterative optimization trajectory. This provides both a theoretical grounding for transformer inductive bias and a design map for determining depth relative to task difficulty and optimization convergence.

4. Parameterization and the Role of Depth

The depth and residual parameterization critically impact compute-optimal transformer shapes. Recent work identifies the CompleteP parameterization, which sets block updates as $h^{(\ell+1)} = h^{(\ell)} + L^{-1} \mathcal{F}_\ell(h^{(\ell)})$ (with $L$ total layers), as uniquely enabling both depth-wise hyperparameter transfer and non-lazy (strongly nonlinear) feature learning (Dey et al., 2 May 2025). When $\alpha=1$ in

$h^{(\ell+1)} = h^{(\ell)} + L^{-\alpha} \mathcal{F}_\ell(h^{(\ell)}),$

the same base learning rate and initialization scales function optimally at all depths, and each layer contributes meaningfully to non-linear representation improvement. This allows for a significantly wider range of efficient width/depth ratios, with empirically observed FLOP savings of 12–34% compared to prior standard parameterizations. For compute-constrained deployment, aspect ratios as low as $N:L \approx 12$ remain within $1\%$ of the optimal loss frontier, supporting both shallow–wide and deep–narrow models tuned to hardware and operational constraints.

5. Adaptive Training and Dynamic Shape Scheduling

Compute-optimality can be further improved by adaptively varying model shape during training. Rather than fixing a single configuration (e.g., static patch size, width, or context length), models may interpolate between scaling regimes, scheduling shape adjustments to maximally exploit the compute-efficiency offered by different configurations (Anagnostidis et al., 2023). Let $E(C) = a(C+d)^{-b} + c$ describe task error as a function of compute $C$ given shape parameters; by evaluating the derivative of the inverse scaling law, $q_P(E^*) = \frac{\partial}{\partial E} f_P^{-1}(E)\big|_{E=E^*}$ , one can greedily select the most compute-efficient configuration at each stage. Empirically, such adaptively scheduled models can reduce required FLOPs by 40–60% over static strategies in both vision and language domains.

6. Hardware and Systems Co-Design

Compute-optimal transformer shapes also depend on hardware mapping and hardware-aware shape selection. Full-stack frameworks combine architecture search with hardware profiling, optimization, and scheduling, as in EdgeTran (Tuli et al., 2023) and Gemmini case studies (Kim et al., 2023). Techniques include quantization, pruning, operator fusion, and the addition of on-chip accelerators for non-linear functions (e.g., Softmax, GELU) to achieve efficient inference. Automated neural architecture search (NAS) operates over discrete choices—number of layers, dimension per layer, head count, FFN size—subject to fleet- or device-specific constraints (latency, energy, power). For edge or embedded deployment, hardware-aware co-design can yield models that are $2.8\times$ smaller, $10\times$ more energy-efficient, and $0.8\%$ higher in accuracy than baseline architectures for equivalent tasks.

Framework	Design Space	Hardware Metrics
EdgeTran	Num. layers, heads, FFN	Latency, energy, power
Gemmini	Dimension, ops fusion	EDP, HW utilization
NAS methods	Layerwise parameters	Accuracy/hardware Pareto

7. Task-Driven and Algorithmic Optimality

Certain algorithmic or task constraints inform compute-optimal shape beyond generic scaling laws. In in-context estimation for wireless (Kunde et al., 2023), a single-layer softmax-attention transformer with engineered parameterization is provably optimal for large prompt lengths, and deeper or more complex (multi-layer) models can efficiently accommodate context-varying dynamics. For algorithmic problems such as optimal transport, transformer depth explicitly controls the number of simulated gradient descent steps; the approximation error for solved Wasserstein-2 transport with entropic regularization scales as $O(n^{3/2}/\sqrt{\text{depth}})$ (Daneshmand, 25 Oct 2024). Prompt engineering, by structuring input features and auxiliary memory, can maximize this algorithmic expressivity, further optimizing the computational return of added depth.

8. Implications and Limitations

Compute-optimal transformer shapes enable principled and resource-efficient scaling for large-scale pretraining and downstream adaptation. They provide a quantitative foundation for hyperparameter tuning, guide architectural scaling across width, depth, and feed-forward dimension, and link systems-level choices to model design. However, optimal shapes are task- and domain-dependent; shapes found optimal for image classification may be sub-optimal for dense tasks such as panoptic segmentation (Alabdulmohsin et al., 2023). Compute-optimal scaling laws hold over observed data ranges and may require further refinement for regimes far outside empirical bounds.

By formally connecting information-theoretic error bounds, iterative optimization principles, adaptive scheduling, parameterization strategies, and hardware-aligned architecture search, the field has established a comprehensive recipe for compute-optimal transformer shapes—paving the way for next-generation, scalable, and efficient neural models.