- The paper introduces Mixture-of-Recursions, a framework that dynamically assigns recursive depths to tokens for efficient computation in Transformer models.
- It employs expert-choice and token-choice routing strategies, together with recursion-aware KV caching schemes, to raise throughput and reduce memory usage.
- Empirical results demonstrate lower validation perplexity and improved scalability across models from 135M to 1.7B parameters, establishing a new compute-accuracy Pareto frontier.
Mixture-of-Recursions: Dynamic Recursive Depths for Adaptive Token-Level Computation
The Mixture-of-Recursions (MoR) framework introduces a unified approach to parameter and computational efficiency in Transformer-based LLMs by combining recursive parameter sharing with token-level adaptive computation. This architecture addresses the dual challenge of reducing both the parameter count and the computational/memory overhead associated with large-scale LLMs, while maintaining or improving model quality.
Core Methodology
MoR builds upon Recursive Transformers, which reuse a shared stack of layers across multiple recursion steps, thereby achieving significant parameter efficiency. The key innovation in MoR is the integration of lightweight, trainable routers that dynamically assign recursion depths to individual tokens. This enables the model to allocate more computation to complex tokens and less to simpler ones, effectively implementing token-level adaptive computation within a parameter-shared architecture.
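To make the architecture concrete, the following minimal PyTorch sketch shows a single shared block reused across recursion steps, with a lightweight linear router gating which tokens continue. The class name MoRBlock, the sigmoid threshold, and applying the block to all tokens before masking are simplifications for illustration only, not the authors' implementation (the paper selects tokens via top-k or pre-assigned depths, as described next).

```python
import torch
import torch.nn as nn

class MoRBlock(nn.Module):
    """Illustrative sketch: one shared Transformer block reused across
    recursion steps, with a lightweight router deciding per token whether
    to keep recursing. Not the paper's released implementation."""

    def __init__(self, d_model: int, n_heads: int, max_recursions: int):
        super().__init__()
        # A single parameter-shared block applied at every recursion step.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        # Lightweight linear router producing one score per token.
        self.router = nn.Linear(d_model, 1)
        self.max_recursions = max_recursions

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model)
        active = torch.ones(h.shape[:2], dtype=torch.bool, device=h.device)
        for _ in range(self.max_recursions):
            scores = torch.sigmoid(self.router(h)).squeeze(-1)  # (B, T)
            # Simplification: threshold the router score; the paper instead
            # uses top-k (expert-choice) or pre-assigned depths (token-choice).
            keep = active & (scores > 0.5)
            if not keep.any():
                break
            # For clarity the block is applied to all tokens; a real
            # implementation would gather only the active ones.
            updated = self.shared_block(h)
            h = torch.where(keep.unsqueeze(-1), updated, h)
            active = keep
        return h

h = torch.randn(2, 16, 64)
print(MoRBlock(d_model=64, n_heads=4, max_recursions=3)(h).shape)
```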
Two primary routing strategies are explored (contrasted in the code sketch after this list):
- Expert-choice routing: At each recursion step, a router selects the top-k tokens to continue, progressively narrowing the set of active tokens. This approach ensures a static compute budget and perfect load balancing but introduces potential causality violations during training, which are mitigated via auxiliary losses.
- Token-choice routing: Each token is assigned a fixed recursion depth at the outset, determining its full compute path. This avoids causality issues but can suffer from load imbalance, addressed through balancing losses or loss-free algorithms.
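The difference between the two selection rules can be sketched as follows; the function names and the capacity parameter are illustrative assumptions rather than the paper's API.

```python
import torch

def expert_choice_step(scores: torch.Tensor, capacity: int) -> torch.Tensor:
    """Expert-choice: at each recursion step the router keeps the top-k
    highest-scoring tokens, so the per-step compute budget is fixed.
    Returns a boolean mask of tokens that continue recursing."""
    # scores: (batch, seq_len) router scores for the currently active tokens
    topk = torch.topk(scores, k=capacity, dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    return mask.scatter(-1, topk, True)

def token_choice_assignment(depth_logits: torch.Tensor) -> torch.Tensor:
    """Token-choice: each token picks one of the candidate recursion depths
    up front (its full compute path), which avoids causality issues but can
    leave some depths over- or under-subscribed."""
    # depth_logits: (batch, seq_len, num_recursions), one logit per depth
    return depth_logits.argmax(dim=-1) + 1  # assigned depth in [1, num_recursions]

scores = torch.randn(2, 16)
print(expert_choice_step(scores, capacity=8).sum(dim=-1))  # exactly 8 tokens per sequence
depth_logits = torch.randn(2, 16, 3)
print(token_choice_assignment(depth_logits).shape)         # (2, 16)
```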
MoR also introduces two key-value (KV) caching strategies to further improve memory and compute efficiency (sketched after this list):
- Recursion-wise KV caching: Only tokens routed to a given recursion step store their KV pairs at that level, reducing memory and I/O requirements.
- Recursive KV sharing: All tokens cache KV pairs at the first recursion step, which are then reused in subsequent recursions, minimizing prefill latency and memory footprint.
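A rough sketch of the two caching policies follows, assuming a per-depth cache keyed by recursion step and stand-in key/value projections; the function name, shapes, and masks are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

def build_kv_caches(hidden, active_per_depth, share_first_step=False):
    """Illustrative sketch of the two caching policies.
    hidden: (batch, seq_len, d_model); active_per_depth[d] is a boolean
    mask of tokens routed to recursion depth d (all tokens at depth 0)."""
    B, T, D = hidden.shape
    k_proj, v_proj = nn.Linear(D, D), nn.Linear(D, D)  # stand-in projections
    caches = {}
    for depth, mask in enumerate(active_per_depth):
        if share_first_step and depth > 0:
            # Recursive KV sharing: reuse the depth-0 entries at every later
            # depth, so prefill computes KV only once per token.
            caches[depth] = caches[0]
            continue
        # Recursion-wise caching: only tokens still active at this depth
        # write KV entries, shrinking memory and attention I/O with depth.
        idx = mask.nonzero(as_tuple=True)
        caches[depth] = (k_proj(hidden)[idx], v_proj(hidden)[idx])
    return caches

h = torch.randn(2, 16, 64)
masks = [torch.ones(2, 16, dtype=torch.bool)]           # all tokens at depth 0
masks += [torch.rand(2, 16) > t for t in (0.5, 0.75)]   # shrinking active sets
per_depth = build_kv_caches(h, masks)
print([k.shape[0] for k, _ in per_depth.values()])      # typically fewer entries at deeper steps
shared = build_kv_caches(h, masks, share_first_step=True)  # depth-0 cache reused everywhere
```

The trade-off the sketch encodes matches the summary above: recursion-wise caching shrinks the per-depth cache as tokens exit, while recursive sharing computes KV once and reuses it, lowering prefill cost at some risk to quality.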
Empirical Results
MoR demonstrates strong empirical performance across model scales (135M to 1.7B parameters):
- Efficiency: At equal training FLOPs, MoR achieves lower validation perplexity and higher few-shot accuracy than both vanilla and recursive Transformer baselines, despite using up to 50% fewer parameters.
- Throughput: Inference throughput is significantly improved (up to 2.18× over baselines) due to reduced KV cache sizes and the ability to batch tokens at different recursion depths.
- Scalability: MoR maintains or exceeds the performance of standard Transformers at larger scales, establishing a new Pareto frontier for compute-accuracy trade-offs.
The ablation studies reveal that the Middle-Cycle parameter sharing strategy is most effective, and that expert-choice routing with auxiliary loss and a linear router yields the best performance. While recursion-wise KV caching is generally superior for expert-choice routing, recursive KV sharing can benefit token-choice routing, especially when routing decisions are less accurate.
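For reference, here is a minimal sketch of the Middle-Cycle sharing pattern, assuming (as in the recursive-Transformer literature) that unique first and last layers wrap a shared middle block that is cycled across recursion steps; the class name and the single-layer middle block are illustrative simplifications.

```python
import torch
import torch.nn as nn

class MiddleCycleStack(nn.Module):
    """Sketch of Middle-Cycle sharing: unique first/last layers wrap a
    shared middle block reused for every recursion step."""

    def __init__(self, d_model: int, n_heads: int, num_recursions: int):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        self.first = make_layer()          # unique parameters
        self.shared_middle = make_layer()  # one set of weights, reused N_r times
        self.last = make_layer()           # unique parameters
        self.num_recursions = num_recursions

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        h = self.first(h)
        for _ in range(self.num_recursions):  # cycle the shared middle block
            h = self.shared_middle(h)
        return self.last(h)

model = MiddleCycleStack(d_model=64, n_heads=4, num_recursions=3)
print(model(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```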
Practical Implications
MoR's design has several practical advantages for real-world deployment:
- Reduced Memory and Compute: By adaptively allocating computation and memory only where needed, MoR enables the deployment of high-quality LLMs on resource-constrained hardware.
- Flexible Inference: The architecture supports test-time scaling, allowing the number of recursion steps to be increased for higher quality or decreased for faster inference, without retraining.
- Compatibility with Sparse Methods: MoR's token-level adaptivity is complementary to structured sparsity and quantization techniques, offering further avenues for efficiency gains.
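As a usage note on test-time scaling, continuing the illustrative MoRBlock sketch from the methodology section (again an assumption, not the released implementation), adjusting the recursion budget at inference amounts to changing a single attribute on the already trained module:

```python
import torch

# Hypothetical: the trained illustrative MoRBlock from the earlier sketch.
model = MoRBlock(d_model=64, n_heads=4, max_recursions=3)
x = torch.randn(1, 16, 64)

model.max_recursions = 2   # shallower recursion: faster, cheaper inference
fast_out = model(x)

model.max_recursions = 6   # deeper recursion: more compute per token, same weights
deep_out = model(x)
```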
Theoretical and Future Directions
Theoretically, MoR bridges the gap between parameter-efficient and compute-adaptive architectures, providing a foundation for latent reasoning in LLMs. The dynamic allocation of recursion depth aligns with the semantic importance of tokens, suggesting a form of implicit token-level difficulty estimation.
Future research directions include:
- Scaling to Larger Models: Extending MoR to models beyond 3B parameters and integrating with uptraining from existing checkpoints.
- Advanced Routing Strategies: Developing routers that better align recursion depth with reasoning complexity, particularly for chain-of-thought tasks.
- Adaptive Capacity Control: Enabling more flexible adjustment of compute allocation at inference time.
- Multimodal and Non-Text Applications: Applying MoR's modality-agnostic recursion blocks to vision, speech, and multimodal tasks.
- Integration with Sparse Algorithms: Combining MoR with pruning and quantization for further efficiency.
Conclusion
Mixture-of-Recursions represents a significant step toward efficient, scalable, and adaptive language modeling. By unifying parameter sharing, token-level adaptive computation, and efficient KV caching, MoR achieves large-model quality at a fraction of the computational and memory cost. Its practical benefits and extensibility position it as a promising architecture for both research and deployment in large-scale AI systems.