HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference (2402.09360v1)

Published 14 Feb 2024 in cs.LG and cs.AI

Abstract: Autoregressive decoding with generative LLMs on accelerators (GPUs/TPUs) is often memory-bound, where most of the time is spent on transferring model parameters from high bandwidth memory (HBM) to cache. On the other hand, recent works show that LLMs can maintain quality with significant sparsity/redundancy in the feedforward (FFN) layers by appropriately training the model to operate on a top-$k$ fraction of rows/columns (where $k \approx 0.05$), thereby suggesting a way to reduce the transfer of model parameters, and hence latency. However, exploiting this sparsity for improving latency is hindered by the fact that identifying top rows/columns is data-dependent and is usually performed using full matrix operations, severely limiting potential gains. To address these issues, we introduce HiRE (High Recall Approximate Top-$k$ Estimation). HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator. We demonstrate that on a one billion parameter model, HiRE applied to both the softmax as well as feedforward layers, achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.

Summary

  • The paper introduces HiRE, a method that approximates top-$k$ selection with high recall to reduce LLM inference latency.
  • It employs a compression scheme with low-rank approximations to focus computation on the key FFN and softmax components while preserving output quality.
  • The distributed DA-TOP-$k$ operator scales performance across multiple devices, achieving up to a 2.27x speedup in real-world deployments.

High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference

The paper "HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference" addresses a critical challenge in deploying autoregressive LLMs: the substantial latency of the inference phase. This latency predominantly arises from transferring large model parameters from high-bandwidth memory (HBM) to cache, a process that is often memory-bound on standard accelerators such as GPUs and TPUs.
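
To see why decoding is memory-bound, a rough bandwidth calculation helps. The numbers below are illustrative assumptions (bf16 weights, approximate TPUv5e HBM bandwidth), not figures from the paper:

```python
# Back-of-envelope sketch: if every decode step streams all weights from HBM,
# memory bandwidth alone lower-bounds per-token latency (illustrative numbers).
params = 1e9            # ~1B-parameter model, matching the paper's experimental scale
bytes_per_param = 2     # bf16 weights (assumption)
hbm_bandwidth = 819e9   # bytes/sec, approximate TPUv5e HBM bandwidth (assumption)

per_token_ms = params * bytes_per_param / hbm_bandwidth * 1e3
print(f"bandwidth-bound lower bound: {per_token_ms:.2f} ms per token")
# Loading only a small top-k fraction of FFN rows/columns shrinks this dominant term.
```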

Contributions of the Paper

The authors propose a novel approach called HiRE, which stands for High Recall Approximate Top-$k$ Estimation. HiRE consists of two main components:

  1. Compression Scheme: This enables the prediction of top-$k$ rows or columns with a high recall rate, thereby limiting full computation to these predicted subsets.
  2. DA-TOP-$k$ Operator: A distributed, approximate top-$k$ operator for multi-device environments that enables efficient selection across multiple accelerators.

Key Ideas and Methodology

The paper builds on the inherent sparsity and redundancy in the feedforward (FFN) and softmax layers of LLMs: models can be trained to operate efficiently on only a small top fraction of these components. HiRE leverages this by performing approximate top-$k$ estimation to restrict computation to the components that matter. Specifically, it uses low-rank approximations and quantization to cheaply predict the top-$k$ indices, followed by exact computation within this subset, preserving output quality.
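
As a concrete illustration, here is a minimal JAX sketch of such a compression scheme for the FFN layer. It is a reconstruction from the description above, not the authors' implementation; the low-rank factors U and V, the function name, and the use of ReLU are assumptions:

```python
import jax
import jax.numpy as jnp

def hire_ffn_sketch(x, W_in, W_out, U, V, k):
    """Approximate top-k FFN: cheap prediction of active units, exact compute on them.

    x: (d_model,) activations; W_in: (d_model, d_ff); W_out: (d_ff, d_model);
    U: (d_model, r) and V: (r, d_ff): precomputed low-rank proxy for W_in (r << d_model).
    """
    # (i) Cheap scoring pass over all d_ff hidden units via the low-rank proxy,
    #     costing O(d_model * r + r * d_ff) instead of O(d_model * d_ff).
    approx_scores = (x @ U) @ V                  # (d_ff,)
    # Keep the k highest-scoring units; in practice k carries some slack so that
    # the truly active units are recovered with high recall.
    _, idx = jax.lax.top_k(approx_scores, k)
    # (ii) Exact FFN computation restricted to the predicted subset.
    h = jax.nn.relu(x @ W_in[:, idx])            # (k,)
    return h @ W_out[idx, :]                     # (d_model,)
```

The same pattern applies to the softmax layer, where only the predicted subset of output-embedding rows needs to be multiplied against the final hidden state.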

For large models sharded across multiple devices, HiRE introduces DA-TOP-$k$, which reduces communication overhead by performing approximate top-$k$ selection locally on each device and then aggregating only the selected entries, rather than gathering the full output on a single device.
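
A toy sketch of this communication pattern follows; it is again an assumption-laden reconstruction rather than the paper's code, and the equal per-device budget is a simplification:

```python
import jax
import jax.numpy as jnp

def da_top_k_sketch(logit_shards, k):
    """logit_shards: per-device 1-D slices of the output logits; k: global budget."""
    per_device_k = k // len(logit_shards)           # simplification: k divides evenly
    values, indices, offset = [], [], 0
    for shard in logit_shards:                      # in a real system, each iteration runs on its own device
        v, i = jax.lax.top_k(shard, per_device_k)   # local top-k over the local shard only
        values.append(v)
        indices.append(i + offset)                  # map local positions to global ids
        offset += shard.shape[0]
    # Only len(logit_shards) * per_device_k entries cross the interconnect,
    # instead of the full logit vector.
    return jnp.concatenate(values), jnp.concatenate(indices)

# Example: two "devices", each holding half of a 16-entry logit vector.
shards = [jnp.arange(8.0), jnp.arange(8.0, 16.0)]
print(da_top_k_sketch(shards, 4))   # per shard, the global ids of its 2 largest entries
```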

Empirical Results

  1. Latency Improvement: HiRE achieves a 1.47x speedup in inference latency for a one-billion-parameter model on a single TPUv5e device, without degrading pretraining or downstream task performance.
  2. Accuracy Retention: Despite the approximations, HiRE maintains quality that nearly matches full computation, indicating that high-recall approximation preserves the accuracy of the softmax and FFN outputs.
  3. Scalability with DA-TOP-$k$: Across a cluster of TPU devices, DA-TOP-$k$ further increases the speedup to 2.27x, demonstrating the scalability and efficiency of the distributed approximation in real-world deployment environments.

Implications and Future Directions

The research highlights the potential of exploiting sparsity within LLM architectures to significantly enhance inference efficiency. This efficiency gain has practical implications in reducing the computation cost and energy consumption associated with large-scale model deployment, thereby supporting more sustainable AI practices. Theoretically, this work presents a compelling case for focusing future model architecture designs on inherently sparse computations.

Future research could extend these findings to the attention layers of LLMs or explore more sophisticated training mechanisms that naturally induce sparsity across the model's components. Additionally, investigating the interplay between model compression techniques and HiRE could yield further reductions in parameter size and computational overhead.

In summary, the paper provides a detailed methodology and empirical evaluation of a technique poised to redefine efficient inference in the deployment of LLMs, paving the way for broader accessibility and integration of AI technologies in resource-constrained settings.
