
TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices (2410.00531v1)

Published 1 Oct 2024 in cs.DC and cs.AI

Abstract: Large model inference is shifting from cloud to edge due to concerns about the privacy of user interaction data. However, edge devices often struggle with limited computing power, memory, and bandwidth, requiring collaboration across multiple devices to run and speed up LLM inference. Pipeline parallelism, the mainstream solution, is inefficient for single-user scenarios, while tensor parallelism struggles with frequent communications. In this paper, we argue that tensor parallelism can be more effective than pipeline on low-resource devices, and present a compute- and memory-efficient tensor parallel inference system, named TPI-LLM, to serve 70B-scale models. TPI-LLM keeps sensitive raw data local in the users' devices and introduces a sliding window memory scheduler to dynamically manage layer weights during inference, with disk I/O latency overlapped with the computation and communication. This allows larger models to run smoothly on memory-limited devices. We analyze the communication bottleneck and find that link latency, not bandwidth, emerges as the main issue, so a star-based allreduce algorithm is implemented. Through extensive experiments on both emulated and real testbeds, TPI-LLM demonstrated over 80% less time-to-first-token and token latency compared to Accelerate, and over 90% compared to Transformers and Galaxy, while cutting the peak memory footprint of Llama 2-70B by 90%, requiring only 3.1 GB of memory for 70B-scale models.

Efficient Inference of LLMs on Low-Resource Edge Devices: Insights from TPI-LLM

This paper by Zonghang Li et al. presents TPI-LLM, a tensor parallel inference system designed to serve 70-billion-parameter LLMs efficiently on low-resource edge devices. The authors propose techniques that manage the substantial compute and memory demands of such large models in resource-constrained environments, reporting large reductions in latency and peak memory footprint.

Background and Motivation

LLMs like Llama 2-70B and Yi-34B are typically deployed in cloud environments where high-performance GPUs handle the inference tasks. However, this approach can introduce privacy concerns, as user data must be transmitted to and processed in the cloud. There is a growing interest in shifting the inference of LLMs to edge devices, such as laptops and mobile phones. This transition, however, faces significant challenges due to the limited memory and computational power available on such devices.

The prevalent solution, pipeline parallelism, is ill-suited to single-user scenarios and does not scale efficiently on devices with restricted resources. The authors argue that tensor parallelism, when appropriately optimized, can outperform pipeline parallelism in these environments. TPI-LLM introduces a memory-efficient inference framework with several key innovations that address both the computational and communication bottlenecks of edge deployments.

System Design and Innovations

Tensor Parallel Framework

The core idea of TPI-LLM is to leverage tensor parallelism, which distributes the computational workload across multiple devices by splitting model layers along their attention heads and feed-forward network (FFN) dimensions. This keeps all devices engaged throughout inference, unlike pipeline parallelism, which can leave devices idle during certain stages when only a single user's request is in flight.
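As a concrete illustration of this splitting, the sketch below shards an FFN along its intermediate dimension in the familiar column/row (Megatron-style) layout and checks that summing the per-device partial outputs reproduces the unsharded result. The shapes, function names, and the ReLU stand-in activation are illustrative assumptions, not TPI-LLM's actual code:

```python
import numpy as np

def shard_ffn_weights(w_up: np.ndarray, w_down: np.ndarray, num_devices: int):
    """Illustrative split: partition the up projection by rows and the down
    projection by columns along the intermediate dimension, so each device
    produces a partial output that a later allreduce sums."""
    up_shards = np.array_split(w_up, num_devices, axis=0)      # (inter/N, hidden) each
    down_shards = np.array_split(w_down, num_devices, axis=1)  # (hidden, inter/N) each
    return list(zip(up_shards, down_shards))

def ffn_partial(x: np.ndarray, up: np.ndarray, down: np.ndarray) -> np.ndarray:
    """One device's share of the FFN: activation applied locally, partial sum returned."""
    h = np.maximum(up @ x, 0.0)   # stand-in activation (ReLU) for simplicity
    return down @ h               # partial (hidden,) output; summing over devices
                                  # reproduces the unsharded FFN exactly

# Quick check that the sharded computation matches the unsharded one.
hidden, inter, n = 16, 64, 4
w_up, w_down = np.random.randn(inter, hidden), np.random.randn(hidden, inter)
x = np.random.randn(hidden)
partials = [ffn_partial(x, up, dn) for up, dn in shard_ffn_weights(w_up, w_down, n)]
assert np.allclose(sum(partials), w_down @ np.maximum(w_up @ x, 0.0))
```

Attention heads are split analogously, with each device holding a contiguous slice of heads and contributing a partial output to the same allreduce.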

The tensor parallel framework consists of a master node, which typically runs on the user's device that issues the prompt, and several worker nodes that share the computational load. The master node keeps sensitive raw data local to preserve privacy, while each worker node performs computations on its partition of the model weights. A star-based allreduce algorithm aggregates the partial results, minimizing the impact of the high link latencies typical of edge networks.
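A minimal sketch of the star pattern as described, using in-memory arrays as a stand-in for network transfers (the exact message scheduling in TPI-LLM may differ): every worker sends its partial tensor to the master, the master sums, and the result is broadcast back, so each allreduce costs only two latency-bound hops regardless of device count, whereas a ring allreduce pays link latency on 2(N-1) sequential steps.

```python
import numpy as np

def star_allreduce(partials: list[np.ndarray]) -> list[np.ndarray]:
    """Star-based allreduce (illustrative, in-memory stand-in for network sends).

    Gather phase:    every worker sends its partial tensor to the master
                     (one hop; all transfers proceed in parallel).
    Broadcast phase: the master returns the summed tensor to every node (one hop).
    """
    reduced = np.sum(np.stack(partials), axis=0)   # master sums the partial results
    return [reduced.copy() for _ in partials]      # every node ends with the full sum

# Example: three devices each hold a partial attention/FFN output.
outs = star_allreduce([np.ones(4), 2 * np.ones(4), 3 * np.ones(4)])
assert all(np.array_equal(o, np.full(4, 6.0)) for o in outs)
```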

Memory Management with Sliding Window Scheduler

To address the memory constraints on edge devices, TPI-LLM introduces a sliding window memory scheduler. This scheduler dynamically loads and unloads model weights during inference, only keeping necessary parts in memory at any time. This approach effectively reduces the peak memory footprint, allowing models as large as 70B parameters to run on devices with limited memory. For instance, the Llama 2-70B model can be executed with just 3.1 GB of memory per device, significantly lower than the theoretical requirements.
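A minimal sketch of such a scheduler, assuming a user-supplied `load_fn` that reads one layer's weights from disk (the class name and API are hypothetical, not the paper's implementation):

```python
import threading
from collections import OrderedDict

class SlidingWindowScheduler:
    """Keep only a window of consecutive layers' weights in RAM.
    Layers ahead of the compute pointer are prefetched on background
    threads so disk I/O overlaps with computation and communication."""

    def __init__(self, num_layers, window, load_fn):
        self.num_layers = num_layers
        self.window = window
        self.load_fn = load_fn          # hypothetical: reads one layer's weights from disk
        self.cache = OrderedDict()      # layer_id -> weights currently resident in memory
        self.pending = set()            # layer_ids with an in-flight disk read
        self.cond = threading.Condition()

    def _prefetch(self, layer_id):
        weights = self.load_fn(layer_id)            # blocking disk read, off the critical path
        with self.cond:
            self.cache[layer_id] = weights
            self.pending.discard(layer_id)
            self.cond.notify_all()

    def get(self, layer_id):
        """Return weights for layer_id; evict layers that fell behind the window."""
        with self.cond:
            # Schedule asynchronous loads for every layer inside the window.
            for lid in range(layer_id, min(layer_id + self.window, self.num_layers)):
                if lid not in self.cache and lid not in self.pending:
                    self.pending.add(lid)
                    threading.Thread(target=self._prefetch, args=(lid,), daemon=True).start()
            # Block only if the requested layer's read has not finished yet.
            while layer_id not in self.cache:
                self.cond.wait()
            for lid in [l for l in self.cache if l < layer_id]:
                del self.cache[lid]
            return self.cache[layer_id]
```

In a forward pass, each transformer block would call `scheduler.get(i)` just before executing layer `i`, so upcoming layers stream in from disk while the current layer computes and communicates, and layers already consumed are evicted to bound peak memory.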

Experimental Results

The authors provide extensive experimental validation of TPI-LLM on various models and configurations. Key findings include:

  • Efficiency: TPI-LLM demonstrates over 80% reduction in time-to-first-token and token latency compared to Accelerate and over 90% compared to Transformers and Galaxy.
  • Scalability: The implementation efficiently handles large-scale models like Llama 2-70B across low-resource devices with a minimal memory footprint.
  • Robustness: The framework performs well under varying edge conditions, maintaining low token latency even with limited network bandwidth.

For instance, Llama 2-70B achieved a peak memory footprint of only 3.1 GB across eight devices, a drastic reduction from the 34.9 GB required without the memory scheduler.

Implications and Future Directions

The practical implications of TPI-LLM are significant. By enabling the deployment of large-scale LLMs on edge devices, the framework addresses privacy concerns associated with cloud-based inference. Moreover, it leverages existing hardware infrastructure more efficiently, potentially broadening the accessibility and utility of powerful LLMs to a wider array of applications and users.

On the theoretical front, the insights into the balance of computation and communication in tensor parallelism on low-resource devices offer valuable guidance for future research in distributed computing and model optimization. The use of star-based allreduce to address link latency challenges also opens avenues for further exploration in communication-efficient algorithms for edge networks.

Future work may focus on further reducing the memory footprints and enhancing the robustness of the system under more varied network conditions. Additionally, exploring the integration of hardware accelerators like FPGAs or specialized inference chips could yield further performance gains, bridging the gap between high-resource cloud environments and low-resource edge devices.

In summary, TPI-LLM represents a significant step forward in the efficient deployment of LLMs on low-resource edge devices, balancing computational efficiency, memory management, and communication overheads. This work lays a strong foundation for future advancements in edge AI, enabling more secure and accessible AI services.

Authors (4)
  1. Zonghang Li
  2. Wenjiao Feng
  3. Mohsen Guizani
  4. Hongfang Yu