HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices (2403.01164v1)
Abstract: The recent emergence of LLMs has led to increasingly large model sizes, posing challenges for inference on low-resource devices. Prior approaches have explored offloading to enable low-memory inference, but they often suffer from poor efficiency due to I/O bottlenecks. To achieve low-latency LLM inference on resource-constrained devices, we introduce HeteGen, a principled framework for heterogeneous parallel computing across CPUs and GPUs. Building on this framework, HeteGen employs heterogeneous parallelism and asynchronous overlap to mitigate I/O bottlenecks. Our experiments demonstrate a substantial improvement in inference speed, surpassing state-of-the-art methods by up to 317%.
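The abstract only sketches the approach at a high level, but the core idea of overlapping CPU computation with asynchronous CPU-GPU weight transfers can be illustrated with a short sketch. The PyTorch snippet below is a minimal illustration and not HeteGen's actual implementation: the `HeteroLinear` wrapper and its `split_ratio` parameter are hypothetical names introduced here. It splits a linear layer's weight between host and device, launches the host-to-device copy on a side CUDA stream, and runs the CPU share of the matrix multiply while that copy is in flight.

```python
# Minimal sketch (not HeteGen's actual code) of heterogeneous parallel
# inference with asynchronous overlap: part of a linear layer is computed
# on the CPU while the remaining weights are streamed to the GPU.

import torch


class HeteroLinear:
    def __init__(self, weight_cpu: torch.Tensor, split_ratio: float = 0.5):
        # Partition the output dimension: one slice stays on the host for
        # CPU compute, the other is pinned and streamed to the GPU on demand.
        # split_ratio is an illustrative knob, not a HeteGen parameter name.
        n_out = weight_cpu.shape[0]
        cut = int(n_out * split_ratio)
        self.w_cpu = weight_cpu[:cut].contiguous()                      # computed on CPU
        self.w_gpu_host = weight_cpu[cut:].contiguous().pin_memory()    # staged for async H2D copy
        self.copy_stream = torch.cuda.Stream()

    def forward(self, x_gpu: torch.Tensor) -> torch.Tensor:
        # 1) Kick off the weight transfer on a side stream so it overlaps
        #    with the CPU work below.
        with torch.cuda.stream(self.copy_stream):
            w_gpu = self.w_gpu_host.to("cuda", non_blocking=True)

        # 2) Meanwhile, compute the CPU share of the output on the host.
        x_cpu = x_gpu.to("cpu")
        y_cpu = x_cpu @ self.w_cpu.t()

        # 3) Compute the GPU share once the copy stream has delivered the weights.
        torch.cuda.current_stream().wait_stream(self.copy_stream)
        y_gpu = x_gpu @ w_gpu.t()

        # 4) Concatenate the two partial outputs on the GPU.
        return torch.cat([y_cpu.to("cuda"), y_gpu], dim=-1)
```

In practice, the split between CPU and GPU work would need to be tuned so that the CPU compute time roughly matches the transfer time it hides; balancing these two costs is the kind of scheduling decision a heterogeneous framework like HeteGen must make.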