Block Transformer: Global-to-Local Language Modeling for Fast Inference (2406.02657v2)

Published 4 Jun 2024 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce the Block Transformer which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks associated with self-attention. Self-attention requires the key-value (KV) cache of all previous sequences to be retrieved from memory at every decoding step to retrieve context information, leading to two primary bottlenecks during batch inference. First, there is a significant delay in obtaining the first token, as the information of the entire prompt must first be processed to prefill the KV cache. Second, computation of subsequent tokens is bottlenecked by the high memory I/O demand of fetching the entire KV cache, which grows linearly with sequence length, incurring quadratic memory reads overall. We design the Block Transformer to strategically mitigate these costs, by incorporating coarsity and locality into an integrated global-to-local architecture. At the lower layers, we aggregate tokens into fixed size blocks to apply attention across the entire sequence at coarse-grained detail, to capture the global context while minimizing KV cache overhead. At upper layers, we apply attention within each block to decode individual tokens, to model fine-grained details with a lightweight local KV cache. We pretrain vanilla and Block Transformers from scratch and demonstrate that Block Transformers reach 10--20x inference throughput compared to vanilla transformers with equivalent perplexity and zero-shot task performance. Code is available at https://github.com/itsnamgyu/block-transformer.

Block Transformer: Global-to-Local Language Modeling for Fast Inference

The paper presents a new architecture, the Block Transformer, which addresses inference bottlenecks in autoregressive transformers caused primarily by the self-attention mechanism and its KV-cache overhead. The key innovation is a hierarchical global-to-local design that separates the transformer's computation into global and local context modeling, yielding substantial improvements in inference throughput without compromising performance.

Architecture and Methodology

The Block Transformer adopts a hierarchical modeling approach: global context is modeled at the lower layers using coarse-grained blocks of tokens, while local modeling is performed at the upper layers with fine-grained attention over individual tokens within each block. The architecture consists of three components (a minimal sketch of the resulting flow follows the list):

  1. Embedder:
    • Converts blocks of tokens into block embeddings.
    • Utilizes a lookup table strategy for simplicity and efficiency.
  2. Block Decoder:
    • Contextualizes block embeddings by attending to preceding blocks.
    • Operates at a coarser granularity and effectively reduces KV-cache overhead by a factor proportional to the block length.
  3. Token Decoder:
    • Decodes individual tokens within each block using the context embedding from the block decoder.
    • Local attention within blocks eliminates the need for global KV-cache, thereby significantly boosting inference throughput.
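
The following PyTorch sketch illustrates this three-stage flow. It is a minimal sketch under simplified assumptions: the class name `BlockTransformerSketch`, the layer counts, the linear block embedder, and the additive injection of block context into the token decoder are illustrative choices rather than the authors' reference implementation (see the linked repository for the official code).

```python
import torch
import torch.nn as nn


def causal_mask(size: int) -> torch.Tensor:
    # Upper-triangular -inf mask so position i cannot attend to positions > i.
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)


class BlockTransformerSketch(nn.Module):
    """Hypothetical global-to-local decoder: embedder -> block decoder -> token decoder."""

    def __init__(self, vocab_size=32000, d_model=512, block_len=4,
                 n_global_layers=6, n_local_layers=6, n_heads=8):
        super().__init__()
        self.block_len = block_len
        # 1. Embedder: look up token embeddings and fuse each block into one embedding.
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.block_proj = nn.Linear(block_len * d_model, d_model)
        # 2. Block decoder: causal attention over block embeddings (coarse global context).
        self.block_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_global_layers)
        # 3. Token decoder: causal attention restricted to the tokens of a single block.
        self.token_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_local_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        B, T = tokens.shape
        L = self.block_len
        assert T % L == 0, "sequence length must be a multiple of block_len"
        x = self.tok_emb(tokens)                                      # (B, T, d)
        # Embedder: concatenate each block of L token embeddings -> (B, T/L, d).
        blocks = self.block_proj(x.view(B, T // L, L * x.size(-1)))
        # Block decoder: its KV cache grows per block, not per token.
        ctx = self.block_decoder(blocks, mask=causal_mask(T // L))    # (B, T/L, d)
        # Shift so tokens in block i only condition on blocks < i (causality).
        ctx = torch.cat([torch.zeros_like(ctx[:, :1]), ctx[:, :-1]], dim=1)
        # Token decoder: fold blocks into the batch dim; attention never crosses blocks,
        # so only a small local KV cache (at most L entries) is needed while decoding.
        local = x.view(B * (T // L), L, -1) + ctx.reshape(B * (T // L), 1, -1)
        h = self.token_decoder(local, mask=causal_mask(L))            # (B*T/L, L, d)
        return self.lm_head(h).view(B, T, -1)                         # next-token logits


# Toy usage: batch of 2 sequences of 32 tokens (8 blocks of length 4).
logits = BlockTransformerSketch()(torch.randint(0, 32000, (2, 32)))
```

During batched decoding, the block decoder's KV cache holds one entry per block and the token decoder's cache is bounded by the block length, which is where the throughput gains described below come from.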

Numerical Results and Comparisons

Extensive experiments demonstrate the effectiveness of the Block Transformer architecture:

  • The Block Transformer exhibits 10-20x gains in inference throughput compared to vanilla transformers.
  • Block Transformers with 420M parameters, roughly three times the size of comparable vanilla models, match their perplexity and zero-shot task performance while retaining the throughput advantage.
  • The hierarchical global-to-local approach allows for efficient batching and reduced memory overhead, with maximum batch sizes being approximately six times larger than those of vanilla transformers.

The performance and throughput gains are evaluated under varying context and block lengths, demonstrating consistent improvements across different settings. This becomes particularly advantageous when handling longer sequence lengths or deploying in production settings with heavy throughput requirements.
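
To make the memory argument concrete, a rough back-of-envelope estimate of the per-sequence KV cache is given below. All dimensions (24 layers, 16 heads, head size 64, fp16 values, a hypothetical 12/12 layer split between block and token decoder, block length 4) are illustrative assumptions, not the paper's exact configurations.

```python
def kv_cache_bytes(cached_positions, n_layers, n_heads, head_dim, dtype_bytes=2):
    # Keys and values (factor 2) are cached for every layer, head, and cached position.
    return 2 * cached_positions * n_layers * n_heads * head_dim * dtype_bytes

seq_len, block_len = 4096, 4

# Vanilla transformer: every layer caches K/V for all seq_len tokens.
vanilla = kv_cache_bytes(seq_len, n_layers=24, n_heads=16, head_dim=64)

# Block Transformer (hypothetical 12/12 layer split):
#   - block decoder caches one entry per block (seq_len / block_len positions)
#   - token decoder caches at most block_len positions (local attention only)
block = (kv_cache_bytes(seq_len // block_len, n_layers=12, n_heads=16, head_dim=64)
         + kv_cache_bytes(block_len, n_layers=12, n_heads=16, head_dim=64))

print(f"vanilla KV cache per sequence:           {vanilla / 2**20:.1f} MiB")  # ~384 MiB
print(f"block transformer KV cache per sequence: {block / 2**20:.1f} MiB")    # ~48 MiB
```

Under these assumptions the per-sequence cache shrinks by roughly an order of magnitude, which is the kind of reduction that lets far larger batches fit in memory and drives the throughput gains reported above.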

Implications and Future Directions

The Block Transformer architecture is notable for its optimization of the trade-off between model performance and inference efficiency, making it a promising candidate for real-world applications that require fast and efficient LLM inference.

Theoretical Implications:

  • By distinguishing between global and local dependencies, the architecture exploits the reduced cost of local attention: attention within a block of length L_B scales with L_B^2 rather than with the square of the full sequence length. This suggests that further exploration of hierarchical models could substantially improve transformer efficiency.
  • The balanced allocation of parameters between the block and token decoders underscores the potential of hierarchical models in distributing computational resources more effectively.

Practical Implications:

  • The Block Transformer can significantly reduce inference costs in deployment scenarios, particularly for applications that require fast response times and can operate with large batch sizes.
  • The flexible design allows for further optimizations, such as adaptive block lengths and dynamic allocation of computational resources based on sequence characteristics and hardware constraints.

Conclusion

The Block Transformer introduces a novel hierarchical approach to language modeling, emphasizing the role of global-to-local modeling in improving inference throughput while maintaining robust performance. This architecture addresses critical bottlenecks in autoregressive transformers, paving the way for efficient and scalable LLMs suitable for widespread deployment. Future work could focus on refining the hierarchical components, exploring adaptive mechanisms, and extending the architecture to longer context lengths and various downstream applications.

Authors (9)
  1. Namgyu Ho
  2. Sangmin Bae
  3. Taehyeon Kim
  4. Hyunjik Jo
  5. Yireun Kim
  6. Tal Schuster
  7. Adam Fisch
  8. James Thorne
  9. Se-Young Yun