- The paper’s main contribution is a Transformer variant with a fixed degree of intra-layer parallelism that overlaps computation with collective communication, reducing Time To First Token by an average of 35.6%.
- The architecture divides each Transformer layer into sub-layers, eliminating inter-device dependencies except for a single AllReduce at the end of each layer, thereby improving hardware utilization.
- Experiments confirm that Kraken matches standard Transformers in language modeling quality, measured by OpenWebText perplexity and SuperGLUE scores, while improving multi-device inference efficiency.
The paper "Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference" introduces Kraken, a novel Transformer architecture aimed at optimizing inference efficiency across multi-device systems. As Transformer models grow larger, they demand increased computational resources, resulting in substantial latency during inference—especially when deployed via multi-device contexts such as in GPU nodes. The authors propose Kraken to improve hardware utilization and minimize latency by structural parallelism within the Transformer layers.
Key Contributions and Results
Kraken distinguishes itself from standard Transformers by building in a fixed degree of intra-layer parallelism, which lets computation overlap with collective communication during inference. This design choice also lets Kraken benefit from existing tensor parallelism optimizations. Notably, Kraken achieved an average 35.6% reduction in Time To First Token (TTFT) when evaluated on multi-GPU systems using TensorRT-LLM engines, a substantial latency improvement that comes without sacrificing language modeling performance.
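The kind of overlap the paper relies on can be illustrated with a short, hedged sketch using PyTorch's asynchronous collectives. The function and tensor names here are illustrative assumptions, not the authors' implementation (which targets TensorRT-LLM); the sketch only shows the general pattern of launching a collective and computing on device-local activations while it is in flight.

```python
# Illustrative sketch (not the paper's code): overlapping a layer-final AllReduce
# with computation that depends only on device-local activations.
# Assumes torch.distributed is already initialized (e.g., via torchrun).
import torch
import torch.distributed as dist


def overlapped_step(local_out: torch.Tensor, local_compute) -> tuple[torch.Tensor, torch.Tensor]:
    """Start the AllReduce asynchronously, then run compute that only needs
    the device-local activations while the collective is in flight."""
    reduced = local_out.clone()
    work = dist.all_reduce(reduced, op=dist.ReduceOp.SUM, async_op=True)

    # Work that does not depend on the reduced result, e.g. the next layer's
    # attention over the local sub-layer output, can proceed immediately.
    independent = local_compute(local_out)

    work.wait()  # Synchronize only when the summed activations are actually needed.
    return reduced, independent
```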
The paper reports that Kraken matches the language modeling quality of standard Transformers, achieving similar perplexity on OpenWebText and robust performance on the SuperGLUE benchmark suite, across varying degrees of parallelism and parameter counts.
Architecture and Implementation
Kraken's architecture divides each layer into multiple smaller sub-layers, eliminating inter-device dependencies except for a single AllReduce at the end of each layer. Because that collective is kept off the critical execution path, communication latency is largely hidden, in contrast to standard GPT-style architectures or GPT-J's parallel attention/MLP formulation. By matching the degree of parallelism to the underlying hardware topology, Kraken improves compute utilization, letting multi-head attention (MHA) and the residual path proceed without waiting on collective synchronizations.
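A minimal per-device sketch of this layer structure is shown below. The module names, normalization placement, and how the reduced activations re-enter the computation are assumptions made for illustration; the only detail taken from the summary above is that each device owns its own sub-layer (attention plus MLP) and that the lone inter-device step per layer is an AllReduce over sub-layer outputs.

```python
# Hedged sketch of one Kraken-style sub-layer as seen by a single device.
# Residual wiring and layer-norm placement are illustrative assumptions.
import torch
import torch.nn as nn
import torch.distributed as dist


class KrakenSubLayer(nn.Module):
    """One of N parallel sub-layers per Transformer layer; each device owns one."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ln_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln_mlp = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, local_x: torch.Tensor):
        # Attention and MLP run entirely on device-local activations,
        # so no other device is waited on inside the layer body.
        h = self.ln_attn(local_x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        h = local_x + attn_out
        local_out = h + self.mlp(self.ln_mlp(h))

        # The single inter-device step: sum sub-layer outputs across devices.
        # In practice this collective would be launched asynchronously and
        # overlapped with the next layer's attention, as in the earlier sketch.
        reduced = local_out.clone()
        dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
        return local_out, reduced
```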
Practical Implications
The Kraken architecture offers several practical benefits:
- Efficiency in Distributed Systems: By minimizing communication bottlenecks in multi-device deployments, Kraken offers significant improvements in latency, making it suitable for applications where fast response times are critical.
- Modular Scaling: The degree of sub-layer parallelism is fixed at model design time, so it can be chosen to match the target hardware (e.g., the number of GPUs per node); a hypothetical configuration sketch follows this list.
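As a hedged illustration of that design-time choice, a model configuration might expose the degree of sub-layer parallelism as a hyperparameter and check it against the deployment hardware. The class, field names, and validation logic below are hypothetical, not taken from the paper.

```python
# Hypothetical configuration sketch: the degree of sub-layer parallelism is a
# model hyperparameter chosen to suit the target hardware, not a runtime
# sharding decision. All names and defaults are assumptions for illustration.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class KrakenConfig:
    n_layers: int = 24
    d_model: int = 2048
    n_heads: int = 16
    parallel_degree: int = 4  # sub-layers per layer, fixed when the model is built

    def sublayers_per_device(self, available_devices: Optional[int] = None) -> int:
        """Check that the fixed degree of parallelism maps evenly onto the devices."""
        devices = available_devices or torch.cuda.device_count()
        if devices == 0 or self.parallel_degree % devices != 0:
            raise ValueError(
                f"{self.parallel_degree} sub-layers per layer cannot be split "
                f"evenly across {devices} visible devices."
            )
        return self.parallel_degree // devices
```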
Future Prospects
Looking ahead, initialization schemes or weight distillation from pretrained standard Transformers could reduce the cost of training Kraken models. Additionally, Kraken's modular approach to scaling and latency optimization could be applied beyond the attention and MLP layers of standard Transformers, potentially benefiting a broader class of deep learning models.
Integration with techniques such as Multi-Query Attention or FlashAttention could further improve Kraken's efficiency by shrinking memory footprints and reducing attention overheads. As deep learning models continue to evolve, Kraken points toward a paradigm of multi-device-aware architectures, underscoring the value of aligning model design with hardware capabilities.
In conclusion, "Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference" demonstrates meaningful reductions in inference latency through architectural parallelism. Its effectiveness across model sizes, degrees of parallelism, and tasks makes it a useful reference point for future Transformer deployments and for efficient model serving more broadly.