Mixture-of-Depths: Dynamically allocating compute in transformer-based language models (2404.02258v1)

Published 2 Apr 2024 in cs.LG and cs.CL

Abstract: Transformer-based LLMs spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth. Our method enforces a total compute budget by capping the number of tokens ($k$) that can participate in the self-attention and MLP computations at a given layer. The tokens to be processed are determined by the network using a top-$k$ routing mechanism. Since $k$ is defined a priori, this simple procedure uses a static computation graph with known tensor sizes, unlike other conditional computation techniques. Nevertheless, since the identities of the $k$ tokens are fluid, this method can expend FLOPs non-uniformly across the time and model depth dimensions. Thus, compute expenditure is entirely predictable in sum total, but dynamic and context-sensitive at the token-level. Not only do models trained in this way learn to dynamically allocate compute, they do so efficiently. These models match baseline performance for equivalent FLOPs and wall-clock times to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling.

Authors (6)
  1. David Raposo (14 papers)
  2. Sam Ritter (4 papers)
  3. Blake Richards (17 papers)
  4. Timothy Lillicrap (60 papers)
  5. Peter Conway Humphreys (1 paper)
  6. Adam Santoro (32 papers)
Citations (41)

Summary

Efficient Compute Allocation in Transformer-Based LLMs through Mixture-of-Depths

Introduction

The paper introduces a novel approach for optimizing computational expenditure in transformer-based LLMs, called Mixture-of-Depths (MoD). This methodology dynamically allocates floating-point operations (FLOPs) across different positions in a sequence by limiting the number of tokens that participate in self-attention and MLP computations at any given layer. Unlike traditional conditional computation methods, MoD maintains a fixed and predictable compute budget, enhancing both training efficiency and inference speed without sacrificing model performance.
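
To make the compute budget concrete: if a routed block keeps only a fraction $c$ of the $T$ tokens in a sequence, it processes $k = \lceil cT \rceil$ tokens. The quadratic attention-score term then shrinks by roughly a factor of $c^2$, while the attention projections and the MLP shrink by a factor of $c$; at an illustrative capacity of $c = 0.125$, for example, a routed block performs about 1/64 of the baseline attention-score FLOPs and 1/8 of the baseline MLP FLOPs. (The specific capacity value here is an assumption for illustration, not a figure quoted in this summary.)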

Implementing Mixture-of-Depths Transformers

MoD transformers employ a mechanism where each transformer block makes an independent decision to process only a subset of tokens (determined by a top-$k$ routing mechanism), while the rest bypass the block via residual connections. This decision-making process is based on router weights assigned to each token, effectively allowing the model to focus its resources on tokens that require more processing. The approach is underpinned by a few key strategies:

  • Defining a Compute Budget: The paper details how total compute is controlled by setting each block's token capacity, i.e. the number of tokens $k$ it is allowed to process.
  • Routing Around Transformer Blocks: A dual-pathway approach is implemented where tokens either undergo the usual transformer block computations or bypass them via a residual connection.
  • Routing Schemes: The paper explores different routing schemes, culminating in the adoption of expert-choice routing, which fills each block's token budget exactly while keeping the computation graph static.

This routing mechanism is implemented through a linear projection that assigns weights to tokens, which are then used to determine their routing path based on the top-$k$ selection. This scheme allows for compute optimization while retaining the model's static computation graph, a crucial factor for maintaining hardware efficiency.
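
As a concrete illustration, the PyTorch sketch below wraps a single transformer block with expert-choice top-$k$ routing. The 12.5% default capacity, the sigmoid gating of router scores, and the assumption that `block` returns only the block's update (attention plus MLP, without the outer residual) are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Mixture-of-Depths wrapper: only the top-k tokens (by router score)
    pass through the wrapped block; all other tokens ride the residual stream."""

    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.125):
        super().__init__()
        self.block = block                    # assumed to return the block's update (no outer residual)
        self.router = nn.Linear(d_model, 1)   # one scalar routing weight per token
        self.capacity = capacity              # fraction of tokens this block processes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, d_model]
        b, s, d = x.shape
        k = max(1, int(self.capacity * s))            # static, known-ahead-of-time budget

        scores = self.router(x).squeeze(-1)           # [batch, seq_len]
        top = torch.topk(scores, k, dim=-1)           # expert-choice: the block picks its tokens
        idx, order = torch.sort(top.indices, dim=-1)  # restore sequence order for causal attention
        vals = torch.gather(top.values, -1, order)

        gather_idx = idx.unsqueeze(-1).expand(-1, -1, d)
        selected = torch.gather(x, 1, gather_idx)     # [batch, k, d_model]

        # Weight the update by the (squashed) router score so the router stays
        # on the gradient path; the sigmoid here is an illustrative assumption.
        update = self.block(selected) * torch.sigmoid(vals).unsqueeze(-1)

        # Routed tokens receive x + update; all other tokens pass through unchanged.
        out = x.clone()
        out.scatter_add_(1, gather_idx, update)
        return out
```

Because $k$ is fixed before the forward pass, every tensor above has a static shape, which is the property that keeps the computation graph hardware-friendly.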

Results

The paper showcases several critical findings:

  1. Training Efficiency and Model Performance: MoD transformers can match or exceed the baseline models' performance with significantly fewer FLOPs per forward pass, demonstrating that transformers traditionally expend more compute than necessary.
  2. IsoFLOP Comparisons: Through comprehensive isoFLOP analyses, the paper illustrates that optimally configured MoD models—utilizing aggressive capacity reductions—are both faster (in terms of step time) and more effective than their vanilla counterparts, regardless of the total FLOPs budget.
  3. Learned Routing's Crucial Role: The success of MoD heavily relies on learned routing decisions, underscoring that indiscriminate compute reduction without intelligent allocation can degrade performance.
  4. Auto-regressive Evaluation: The transition from training routing schemes to causal predictor-based approaches for auto-regressive sampling incurs minimal performance degradation, suggesting that MoD models preserve their computational advantages in inference settings; a sketch of such a predictor follows this list.
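
Top-$k$ selection is non-causal (a token's inclusion depends on the scores of later tokens), so autoregressive sampling needs a per-token decision that looks only at the current token. A minimal sketch of one such predictor is below, assuming it is a small MLP trained with binary cross-entropy against the router's top-$k$ choices; the architecture and training target are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalRoutePredictor(nn.Module):
    """Tiny per-token classifier that predicts, from the current token alone,
    whether the (non-causal) top-k router would have selected it."""

    def __init__(self, d_model: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x).squeeze(-1)        # logits, shape [batch, seq_len]

def predictor_loss(predictor, x, topk_indices, seq_len):
    """Auxiliary loss: binary cross-entropy against the router's top-k decisions.
    The stop-gradient into the main model is an assumption of this sketch."""
    target = torch.zeros(x.shape[0], seq_len, device=x.device)
    target.scatter_(1, topk_indices, 1.0)     # 1 where the router kept the token
    logits = predictor(x.detach())
    return F.binary_cross_entropy_with_logits(logits, target)

# At sampling time each new token routes itself, e.g.:
#   use_block = torch.sigmoid(predictor(h_t)) > 0.5
```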

Implications and Future Directions

The paper's findings suggest significant implications for the design and operation of efficient AI models. First, it argues for a reconsideration of compute allocation strategies in transformer models, highlighting the potential for substantial efficiency gains without performance trade-offs. Second, it opens avenues for future research into more complex routing mechanisms, potentially expanding beyond binary decisions (compute vs. bypass) to a more nuanced spectrum of computational pathways.

The integration of MoD with other conditional computation frameworks, particularly Mixture-of-Experts (MoE), further illustrates the versatility of this approach. By allowing for even finer-grained control over compute expenditure, such integrations could lead to more sophisticated models that leverage the strengths of both methodologies.
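
One minimal way to express such an integration, assuming switch-style top-1 routing and an identity "no-op" path added to the expert pool, is sketched below. This illustrates the general idea of letting a single router choose between spending and withholding compute; it is not a description of the paper's specific MoDE design.

```python
import torch
import torch.nn as nn

class MoDEMixture(nn.Module):
    """MoE-style layer whose routing choices include an identity 'no-op' path,
    so the router can withhold compute from a token entirely."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # Router chooses among n_experts real experts plus one no-op path.
        self.router = nn.Linear(d_model, n_experts + 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, d_model]
        probs = torch.softmax(self.router(x), dim=-1)    # [b, s, n_experts + 1]
        choice = probs.argmax(dim=-1)                    # top-1 choice per token
        gate = probs.max(dim=-1).values.unsqueeze(-1)    # keeps the router on the gradient path

        out = x.clone()                                  # tokens sent to the no-op path stay as-is
        for e, expert in enumerate(self.experts):
            mask = choice == e                           # tokens assigned to expert e
            if mask.any():
                out[mask] = x[mask] + gate[mask] * expert(x[mask])
        return out
```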

Conclusion

Mixture-of-Depths presents a promising avenue for enhancing the efficiency of transformer-based LLMs. By dynamically allocating compute resources where they are most needed, MoD transformers offer a pragmatic solution to the challenges of training large-scale models, suggesting a pathway towards more sustainable and economically viable AI systems. As the field continues to evolve, it will be intriguing to see how these concepts are applied and extended in the development of next-generation AI models.
