Mixture-of-Depths: Dynamically allocating compute in transformer-based language models (2404.02258v1)
Abstract: Transformer-based language models spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth. Our method enforces a total compute budget by capping the number of tokens ($k$) that can participate in the self-attention and MLP computations at a given layer. The tokens to be processed are determined by the network using a top-$k$ routing mechanism. Since $k$ is defined a priori, this simple procedure uses a static computation graph with known tensor sizes, unlike other conditional-computation techniques. Nevertheless, because the identities of the $k$ tokens are fluid, the method can expend FLOPs non-uniformly across both the time and model-depth dimensions. Compute expenditure is therefore entirely predictable in sum total, but dynamic and context-sensitive at the token level. Models trained in this way not only learn to dynamically allocate compute, they do so efficiently: they match baseline performance for equivalent training FLOPs and wall-clock time, require only a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling.
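The routing described in the abstract can be made concrete with a short sketch. The code below is an illustrative PyTorch implementation, not the paper's own: the class name `MixtureOfDepthsBlock`, the use of a sigmoid on the routing logits, and the omission of causal masking are simplifying assumptions. It shows the key mechanism: a per-token linear router, top-$k$ selection of a fixed number of tokens, a standard attention + MLP block applied only to those tokens, and a gather/scatter back into the residual stream.

```python
# Minimal sketch of a Mixture-of-Depths-style block (illustrative, not the
# paper's exact implementation; causal masking is omitted for brevity).
import torch
import torch.nn as nn


class MixtureOfDepthsBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, capacity: int):
        super().__init__()
        self.capacity = capacity                       # k: tokens processed per sequence
        self.router = nn.Linear(d_model, 1)            # scalar routing logit per token
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, d_model]
        logits = self.router(x).squeeze(-1)                     # [B, T] routing scores
        k = min(self.capacity, x.shape[1])
        topk_logits, topk_idx = torch.topk(logits, k, dim=-1)   # pick k tokens per sequence

        # Gather the selected tokens into a smaller [B, k, d_model] tensor,
        # so attention and MLP FLOPs scale with k rather than with seq_len.
        idx = topk_idx.unsqueeze(-1).expand(-1, -1, x.shape[-1])
        selected = torch.gather(x, 1, idx)

        # Ordinary pre-norm transformer computation on the selected tokens only.
        h = self.norm1(selected)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        h = selected + attn_out
        h = h + self.mlp(self.norm2(h))

        # Scale the update by a function of the routing score so the router
        # receives gradients, then scatter the processed tokens back into the
        # residual stream; unselected tokens pass through unchanged.
        update = (h - selected) * torch.sigmoid(topk_logits).unsqueeze(-1)
        return x + torch.zeros_like(x).scatter(1, idx, update)
```

Because `capacity` is fixed in advance, the gather/scatter shapes are static, which is what keeps the computation graph and total compute predictable even though the identities of the selected tokens vary with the input.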
Authors: David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro