Meta-Attention: Bayesian Per-Token Routing for Efficient Transformer Inference

Published 27 May 2026 in cs.LG | (2605.28384v1)

Abstract: Standard transformer architectures apply a single attention mechanism uniformly across all tokens and sequence positions, irrespective of local context or computational budget. We propose Meta-Attention, a framework that dynamically routes each token to the most appropriate attention strategy -- full softmax attention, linear (kernel) attention, or sliding-window local attention -- via a Bayesian Meta-Controller. Unlike prior routing approaches that use deterministic or prior-free learned routing, the Meta-Controller treats per-token mechanism selection as posterior inference under a compute-aware Dirichlet prior: routing weights are the output of an amortised variational posterior q(alpha | x_t; phi) trained with an Evidence Lower Bound (ELBO) objective that jointly encodes task performance and attention-mechanism cost. This design produces principled routing uncertainty estimates that govern the soft-to-hard routing transition, mitigates routing collapse without ad hoc load-balancing losses, and yields better compute-performance trade-offs than deterministic or prior-free learned routing at negligible overhead. Phase 1 empirical results on a Tiny LM benchmark confirm core predictions: the Bayesian controller's learned routing distribution implies a projected normalised FLOP cost of 25.1% under hard routing, vs. 59.3% for the prior-free baseline (-34.2 pp), and reduces routing entropy from 55.8% to 43.3% (-12.5 pp), demonstrating that the Dirichlet prior prevents routing collapse while the non-Bayesian model defaults to full attention. We present the Bayesian architecture, ELBO training objective, and a Phase 1 PyTorch prototype validating forward-pass correctness, posterior diversity, and a controlled ablation against a prior-free baseline. Code available at: https://github.com/KFEAL/meta-attention

Authors (1)

Alan Ferrari

Summary

The paper presents a Bayesian Meta-Controller that dynamically routes each token to the most appropriate attention mechanism, using a compute-aware Dirichlet prior to improve efficiency.
It reports a projected 2.4× reduction in normalized FLOP cost and lower routing entropy than a prior-free baseline, at the price of a modest perplexity increase.
The method makes transformer inference adaptive in practice and, by framing per-token routing as Bayesian model comparison, gives a theoretical basis for extending the expert set.

Motivation and Conceptual Framework

The Meta-Attention architecture introduces a principled Bayesian approach to per-token routing across several attention algorithms within transformer models. Unlike standard transformer designs, which commit to a single attention mechanism applied homogeneously, Meta-Attention employs a Bayesian Meta-Controller that dynamically selects among full softmax attention, linear (kernel) attention, and sliding-window local attention on a per-token basis. Routing decisions are framed as posterior inference over an expert set under a compute-aware Dirichlet prior, optimizing an ELBO objective that jointly captures task performance and computational cost. This framework directly addresses two limitations in existing efficient attention literature: indiscriminate application of attention algorithms regardless of token-context salience, and unprincipled routing collapse when learned routers are unconstrained by priors.
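
As a reading aid, one generic form such an objective could take is the standard $\beta$-weighted ELBO below. This is a sketch rather than the paper's stated loss: the likelihood term $p(y_t \mid x_t, \alpha)$, the prior concentration $\delta^{0}$, and the way $\delta^{0}$ would be derived from the mechanism costs are assumptions that this summary does not specify.

$$
\mathcal{L}(\phi) = \sum_t \Big( \mathbb{E}_{q_\phi(\alpha \mid x_t)}\big[\log p(y_t \mid x_t, \alpha)\big] - \beta_{\mathrm{elbo}}\, \mathrm{KL}\big(q_\phi(\alpha \mid x_t) \,\|\, \mathrm{Dir}(\delta^{0})\big) \Big)
$$

Under this reading, the first term rewards task performance, while the KL term pulls each token's routing posterior toward the compute-aware prior, which is where mechanism cost would enter the objective.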

Bayesian Meta-Controller Architecture

At the core of Meta-Attention is the Bayesian Meta-Controller, a two-layer MLP that maps each token embedding $x_t$ to the concentration parameters of a Dirichlet posterior. The prior encodes normalized FLOP costs: $c_1 = 1.0$ (full attention), $c_2 = 0.15$ (linear), $c_3 = 0.30$ (local). To avoid degeneracy, all concentration parameters are bounded below by a nonzero floor, which keeps the KL divergence term in the ELBO well-defined. The posterior is $q_\phi(\alpha \mid x_t)$, a Dirichlet whose concentration $\delta_\phi(x_t) > 0$ is produced via softplus; its mean gives the routing weights $\alpha_{t,i}$ used to merge expert outputs per token under soft routing. Training uses the ELBO loss, with the strength of the regularization controlled by $\beta_{\mathrm{elbo}}$.
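
The forward pass described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the paper's PyTorch prototype: the hidden width, the ReLU activation, the value of `FLOOR`, the random initialisation, and the helper names (`controller`, `soft_merge`, `init_layer`) are all hypothetical. Only the three cost values, the softplus positivity constraint, the floor, and the use of the Dirichlet mean as routing weights come from the text.

```python
import math
import random

# Normalised FLOP costs from the text: full, linear, local attention.
COSTS = [1.0, 0.15, 0.30]
FLOOR = 1e-3  # hypothetical floor on the concentration parameters


def softplus(z):
    """Numerically stable softplus, log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def linear(x, weight, bias):
    """Dense layer; weight holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialisation only)."""
    weight = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def controller(x_t, params):
    """Map one token embedding to (concentration parameters, routing weights)."""
    (w1, b1), (w2, b2) = params
    hidden = [max(h, 0.0) for h in linear(x_t, w1, b1)]  # ReLU is an assumption
    # Softplus keeps every concentration positive; the floor bounds it away from 0.
    delta = [softplus(z) + FLOOR for z in linear(hidden, w2, b2)]
    total = sum(delta)
    alpha = [d / total for d in delta]  # mean of Dirichlet(delta)
    return delta, alpha


def soft_merge(alpha, expert_outputs):
    """Soft routing: weighted sum of the per-expert output vectors for one token."""
    dim = len(expert_outputs[0])
    return [sum(a * out[j] for a, out in zip(alpha, expert_outputs)) for j in range(dim)]


rng = random.Random(0)
params = (init_layer(8, 16, rng), init_layer(16, len(COSTS), rng))
x_t = [rng.gauss(0.0, 1.0) for _ in range(8)]
delta, alpha = controller(x_t, params)

# Cost implied by the routing weights if compute were spent in proportion to them.
expected_cost = sum(a * c for a, c in zip(alpha, COSTS))
```

Note that under soft routing all three experts still run for every token, so `expected_cost` is a projection of what hard routing would spend, not a measured saving.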

The design also facilitates uncertainty estimation for each token's routing distribution via entropy of the Dirichlet posterior, enabling a principled transition from soft to hard routing: tokens with low uncertainty are routed deterministically to their selected expert, substantially reducing compute overhead by only executing the necessary expert.
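
A minimal sketch of such an uncertainty gate follows. It makes two simplifying assumptions that are not taken from the paper: uncertainty is measured by the Shannon entropy of the mean routing weights (normalised by $\log K$) rather than by the differential entropy of the full Dirichlet posterior, and the cut-off `threshold=0.5` is a hypothetical value chosen only for illustration.

```python
import math


def normalised_entropy(weights):
    """Shannon entropy of the routing weights divided by log(K), so it lies in [0, 1]."""
    h = -sum(w * math.log(w) for w in weights if w > 0.0)
    return h / math.log(len(weights))


def route(weights, threshold=0.5):
    """Return a one-hot vector when uncertainty is low, otherwise the soft weights."""
    if normalised_entropy(weights) < threshold:
        best = max(range(len(weights)), key=weights.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(weights))]
    return list(weights)


confident = route([0.90, 0.05, 0.05])  # low entropy: only expert 0 needs to run
uncertain = route([0.40, 0.30, 0.30])  # high entropy: keep the soft mixture
```

With a one-hot result, only the selected expert has to be executed for that token, which is where the compute reduction from hard routing would come from.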

Empirical Validation and Numerical Results

The initial Phase 1 experimental validation was conducted on a Tiny LM benchmark, contrasting the Bayesian controller (with ELBO regularization and compute-aware Dirichlet prior) against a prior-free baseline. The key findings are:

Projected normalized FLOP cost: 25.1% for the Bayesian controller vs. 59.3% for prior-free baseline, reflecting a $2.4\times$ projected reduction.
Routing entropy: 43.3% (Bayesian) vs. 55.8% (prior-free), supporting the claim that the Dirichlet prior prevents routing collapse and encourages committed routing distributions.
Perplexity: The Bayesian approach incurs a modest 6.3% overhead (normalized PPL 1.07 vs. 1.00), attributed to regularization pressure favoring less expensive experts.

These results indicate that Bayesian routing both prevents collapse to the most expensive attention mechanism and yields substantial projected compute savings at a modest perplexity cost.
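
The headline figures can be checked with simple arithmetic, and the projected cost itself can be read as a cost-weighted average over experts. The sketch below assumes that reading (the fraction of tokens sent to each expert times that expert's normalised cost); the routing fractions in `example_fractions` are hypothetical and are not reported in the paper.

```python
# Normalised FLOP costs from the text.
COSTS = {"full": 1.0, "linear": 0.15, "local": 0.30}


def projected_cost(fractions):
    """Expected normalised FLOP cost when each token runs exactly one expert."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("routing fractions must sum to 1")
    return sum(fractions[name] * COSTS[name] for name in COSTS)


# Reported projected costs under hard routing.
bayes_cost, baseline_cost = 0.251, 0.593
reduction_factor = baseline_cost / bayes_cost  # about 2.36, reported as 2.4x
gap_pp = (baseline_cost - bayes_cost) * 100    # 34.2 percentage points

# Hypothetical routing fractions, for illustration only.
example_fractions = {"full": 0.10, "linear": 0.60, "local": 0.30}
example_cost = projected_cost(example_fractions)  # 0.10 + 0.09 + 0.09 = 0.28
```

A router that sends every token to full attention has a projected cost of 1.0, so the prior-free baseline's 59.3% already reflects some use of the cheaper experts.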

Relationship to Prior Work and Orthogonality

Meta-Attention is positioned as orthogonal and composable with techniques such as Mixture of Depths (MoD) [raposo2024mod], which varies computation along transformer depth, and Attention Residuals [chen2026attnres], which modifies residual pathways. The architecture permits stacking, enabling depth savings (via MoD) multiplied by mechanism savings (Meta-Attention). Furthermore, recent advances such as Gated Attention [qiu2025gated], sparse attention emergence [zucchet2025sparse], and polynomial-time learnability of linear attention [yau2024linear] strongly inform the expert set and theoretical underpinnings of this work.

Notably, the use of Bayesian inference for expert selection in routing aligns with empirical findings in VMoER [li2026vmoe] and theoretical analysis in [agarwal2025bayesian], providing a robust geometric interpretation and improved calibration, stability, and OOD detection.

Practical and Theoretical Implications

The practical implication of Meta-Attention is clear: adaptive routing enables more efficient transformer inference, particularly for long sequence tasks, by dynamically allocating compute to the contextually appropriate attention mechanism for each token. The Bayesian controller's uncertainty estimation provides a natural pathway for gating the transition to hard routing, maximizing efficiency without manual intervention or ad hoc regularization.

Theoretically, the framework formalizes the variable selection problem across mechanisms as Bayesian model comparison, with direct ties to geometric and learnability properties of attention mechanisms. This paves the way for further integration of state space models and more heterogeneous expert sets, as well as improved interpretability and calibration of routing decisions.

Limitations and Future Work

Current results are limited to small-scale language modeling and soft routing. Absolute perplexity values and direct wall-clock FLOP savings (requiring hard routing) are targeted for Phase 2 and Phase 3 experiments. Gradient variance in Dirichlet reparameterization, prior sensitivity, adequacy of salience proxies, and scaling behavior of routing entropy phase transitions remain open problems. Expansion of the expert set (e.g., gated attention, state space models) and more sophisticated prior scheduling are identified as future directions.

Conclusion

Meta-Attention advances efficient transformer inference via Bayesian per-token routing, offering principled compute-performance trade-offs and preventing routing collapse through a compute-aware Dirichlet prior. Empirical validation demonstrates substantial projected efficiency gains at modest perplexity overhead, with robust theoretical and empirical support for Bayesian routing. The proposed architecture is highly composable and interpretable, with clear implications for further reductions in transformer inference cost and enhanced adaptivity across diverse downstream tasks. The experimental roadmap lays the foundation for expanded benchmarks and expert sets, with anticipated improvements in both practical efficiency and theoretical understanding of adaptive computation in sequence models.
