- The paper introduces Mixture-of-Recursions, a framework that dynamically assigns recursive depths to tokens for efficient computation in Transformer models.
- It employs expert-choice and token-choice routing strategies, together with recursion-aware KV caching schemes, to raise throughput and reduce memory usage.
- Empirical results demonstrate lower validation perplexity and improved scalability across models from 135M to 1.7B parameters, establishing a new compute-accuracy Pareto frontier.
Mixture-of-Recursions: Dynamic Recursive Depths for Adaptive Token-Level Computation
The Mixture-of-Recursions (MoR) framework introduces a unified approach to parameter and computational efficiency in Transformer-based LLMs by combining recursive parameter sharing with token-level adaptive computation. This architecture addresses the dual challenge of reducing both the parameter count and the computational/memory overhead associated with large-scale LLMs, while maintaining or improving model quality.
Core Methodology
MoR builds upon Recursive Transformers, which reuse a shared stack of layers across multiple recursion steps, thereby achieving significant parameter efficiency. The key innovation in MoR is the integration of lightweight, trainable routers that dynamically assign recursion depths to individual tokens. This enables the model to allocate more computation to complex tokens and less to simpler ones, effectively implementing token-level adaptive computation within a parameter-shared architecture.
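To make the architecture concrete, the following minimal PyTorch sketch shows a single shared block reused across recursion steps, with a lightweight linear router gating which tokens continue. The class name MoRBlock, the sigmoid threshold, and applying the block to all tokens before masking are simplifications for illustration only, not the authors' implementation (the paper selects tokens via top-k or pre-assigned depths, as described next).

```python
import torch
import torch.nn as nn

class MoRBlock(nn.Module):
    """Illustrative sketch: one shared Transformer block reused across
    recursion steps, with a lightweight router deciding per token whether
    to keep recursing. Not the paper's released implementation."""

    def __init__(self, d_model: int, n_heads: int, max_recursions: int):
        super().__init__()
        # A single parameter-shared block applied at every recursion step.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        # Lightweight linear router producing one score per token.
        self.router = nn.Linear(d_model, 1)
        self.max_recursions = max_recursions

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model)
        active = torch.ones(h.shape[:2], dtype=torch.bool, device=h.device)
        for _ in range(self.max_recursions):
            scores = torch.sigmoid(self.router(h)).squeeze(-1)  # (B, T)
            # Simplification: threshold the router score; the paper instead
            # uses top-k (expert-choice) or pre-assigned depths (token-choice).
            keep = active & (scores > 0.5)
            if not keep.any():
                break
            # For clarity the block is applied to all tokens; a real
            # implementation would gather only the active ones.
            updated = self.shared_block(h)
            h = torch.where(keep.unsqueeze(-1), updated, h)
            active = keep
        return h

h = torch.randn(2, 16, 64)
print(MoRBlock(d_model=64, n_heads=4, max_recursions=3)(h).shape)
```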
Two primary routing strategies are explored (contrasted in the code sketch after this list):
- Expert-choice routing: At each recursion step, a router selects the top-k tokens to continue, progressively narrowing the set of active tokens. This approach ensures a static compute budget and perfect load balancing but introduces potential causality violations during training, which are mitigated via auxiliary losses.
- Token-choice routing: Each token is assigned a fixed recursion depth at the outset, determining its full compute path. This avoids causality issues but can suffer from load imbalance, addressed through balancing losses or loss-free algorithms.
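The difference between the two selection rules can be sketched as follows; the function names and the capacity parameter are illustrative assumptions rather than the paper's API.

```python
import torch

def expert_choice_step(scores: torch.Tensor, capacity: int) -> torch.Tensor:
    """Expert-choice: at each recursion step the router keeps the top-k
    highest-scoring tokens, so the per-step compute budget is fixed.
    Returns a boolean mask of tokens that continue recursing."""
    # scores: (batch, seq_len) router scores for the currently active tokens
    topk = torch.topk(scores, k=capacity, dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    return mask.scatter(-1, topk, True)

def token_choice_assignment(depth_logits: torch.Tensor) -> torch.Tensor:
    """Token-choice: each token picks one of the candidate recursion depths
    up front (its full compute path), which avoids causality issues but can
    leave some depths over- or under-subscribed."""
    # depth_logits: (batch, seq_len, num_recursions), one logit per depth
    return depth_logits.argmax(dim=-1) + 1  # assigned depth in [1, num_recursions]

scores = torch.randn(2, 16)
print(expert_choice_step(scores, capacity=8).sum(dim=-1))  # exactly 8 tokens per sequence
depth_logits = torch.randn(2, 16, 3)
print(token_choice_assignment(depth_logits).shape)         # (2, 16)
```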
MoR also introduces two key-value (KV) caching strategies to further improve memory and compute efficiency (sketched after this list):
- Recursion-wise KV caching: Only tokens routed to a given recursion step store their KV pairs at that level, reducing memory and I/O requirements.
- Recursive KV sharing: All tokens cache KV pairs at the first recursion step, which are then reused in subsequent recursions, minimizing prefill latency and memory footprint.
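A rough sketch of the two caching policies follows, assuming a per-depth cache keyed by recursion step and stand-in key/value projections; the function name, shapes, and masks are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

def build_kv_caches(hidden, active_per_depth, share_first_step=False):
    """Illustrative sketch of the two caching policies.
    hidden: (batch, seq_len, d_model); active_per_depth[d] is a boolean
    mask of tokens routed to recursion depth d (all tokens at depth 0)."""
    B, T, D = hidden.shape
    k_proj, v_proj = nn.Linear(D, D), nn.Linear(D, D)  # stand-in projections
    caches = {}
    for depth, mask in enumerate(active_per_depth):
        if share_first_step and depth > 0:
            # Recursive KV sharing: reuse the depth-0 entries at every later
            # depth, so prefill computes KV only once per token.
            caches[depth] = caches[0]
            continue
        # Recursion-wise caching: only tokens still active at this depth
        # write KV entries, shrinking memory and attention I/O with depth.
        idx = mask.nonzero(as_tuple=True)
        caches[depth] = (k_proj(hidden)[idx], v_proj(hidden)[idx])
    return caches

h = torch.randn(2, 16, 64)
masks = [torch.ones(2, 16, dtype=torch.bool)]           # all tokens at depth 0
masks += [torch.rand(2, 16) > t for t in (0.5, 0.75)]   # shrinking active sets
per_depth = build_kv_caches(h, masks)
print([k.shape[0] for k, _ in per_depth.values()])      # typically fewer entries at deeper steps
shared = build_kv_caches(h, masks, share_first_step=True)  # depth-0 cache reused everywhere
```

The trade-off the sketch encodes matches the summary above: recursion-wise caching shrinks the per-depth cache as tokens exit, while recursive sharing computes KV once and reuses it, lowering prefill cost at some risk to quality.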
Empirical Results
MoR demonstrates strong empirical performance across model scales (135M to 1.7B parameters):
- Efficiency: At equal training FLOPs, MoR achieves lower validation perplexity and higher few-shot accuracy than both vanilla and recursive Transformer baselines, despite using up to 50% fewer parameters.
- Throughput: Inference throughput is significantly improved (up to 2.18× over baselines) due to reduced KV cache sizes and the ability to batch tokens at different recursion depths.
- Scalability: MoR maintains or exceeds the performance of standard Transformers at larger scales, establishing a new Pareto frontier for compute-accuracy trade-offs.
The ablation studies reveal that the Middle-Cycle parameter sharing strategy is most effective, and that expert-choice routing with auxiliary loss and a linear router yields the best performance. While recursion-wise KV caching is generally superior for expert-choice routing, recursive KV sharing can benefit token-choice routing, especially when routing decisions are less accurate.
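For reference, here is a minimal sketch of the Middle-Cycle sharing pattern, assuming (as in the recursive-Transformer literature) that unique first and last layers wrap a shared middle block that is cycled across recursion steps; the class name and the single-layer middle block are illustrative simplifications.

```python
import torch
import torch.nn as nn

class MiddleCycleStack(nn.Module):
    """Sketch of Middle-Cycle sharing: unique first/last layers wrap a
    shared middle block reused for every recursion step."""

    def __init__(self, d_model: int, n_heads: int, num_recursions: int):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        self.first = make_layer()          # unique parameters
        self.shared_middle = make_layer()  # one set of weights, reused N_r times
        self.last = make_layer()           # unique parameters
        self.num_recursions = num_recursions

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        h = self.first(h)
        for _ in range(self.num_recursions):  # cycle the shared middle block
            h = self.shared_middle(h)
        return self.last(h)

model = MiddleCycleStack(d_model=64, n_heads=4, num_recursions=3)
print(model(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```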
Practical Implications
MoR's design has several practical advantages for real-world deployment:
- Reduced Memory and Compute: By adaptively allocating computation and memory only where needed, MoR enables the deployment of high-quality LLMs on resource-constrained hardware.
- Flexible Inference: The architecture supports test-time scaling, allowing the number of recursion steps to be increased for higher quality or decreased for faster inference, without retraining.
- Compatibility with Sparse Methods: MoR's token-level adaptivity is complementary to structured sparsity and quantization techniques, offering further avenues for efficiency gains.
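As a usage note on test-time scaling, continuing the illustrative MoRBlock sketch from the methodology section (again an assumption, not the released implementation), adjusting the recursion budget at inference amounts to changing a single attribute on the already trained module:

```python
import torch

# Hypothetical: the trained illustrative MoRBlock from the earlier sketch.
model = MoRBlock(d_model=64, n_heads=4, max_recursions=3)
x = torch.randn(1, 16, 64)

model.max_recursions = 2   # shallower recursion: faster, cheaper inference
fast_out = model(x)

model.max_recursions = 6   # deeper recursion: more compute per token, same weights
deep_out = model(x)
```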
Theoretical and Future Directions
Theoretically, MoR bridges the gap between parameter-efficient and compute-adaptive architectures, providing a foundation for latent reasoning in LLMs. The dynamic allocation of recursion depth aligns with the semantic importance of tokens, suggesting a form of implicit token-level difficulty estimation.
Future research directions include:
- Scaling to Larger Models: Extending MoR to models beyond 3B parameters and integrating with uptraining from existing checkpoints.
- Advanced Routing Strategies: Developing routers that better align recursion depth with reasoning complexity, particularly for chain-of-thought tasks.
- Adaptive Capacity Control: Enabling more flexible adjustment of compute allocation at inference time.
- Multimodal and Non-Text Applications: Applying MoR's modality-agnostic recursion blocks to vision, speech, and multimodal tasks.
- Integration with Sparse Algorithms: Combining MoR with pruning and quantization for further efficiency.
Conclusion
Mixture-of-Recursions represents a significant step toward efficient, scalable, and adaptive language modeling. By unifying parameter sharing, token-level adaptive computation, and efficient KV caching, MoR achieves large-model quality at a fraction of the computational and memory cost. Its practical benefits and extensibility position it as a promising architecture for both research and deployment in large-scale AI systems.