
EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices (2507.01438v1)

Published 2 Jul 2025 in cs.DC, cs.AI, and cs.LG

Abstract: LLMs have gained significant attention due to their versatility across a wide array of applications. Fine-tuning LLMs with parameter-efficient adapters, such as Low-Rank Adaptation (LoRA), enables these models to efficiently adapt to downstream tasks without extensive retraining. Deploying fine-tuned LLMs on multi-tenant edge devices offers substantial benefits, such as reduced latency, enhanced privacy, and personalized responses. However, serving LLMs efficiently on resource-constrained edge devices presents critical challenges, including the complexity of adapter selection for different tasks and memory overhead from frequent adapter swapping. Moreover, given the multiple requests in multi-tenant settings, processing requests sequentially results in underutilization of computational resources and increased latency. This paper introduces EdgeLoRA, an efficient system for serving LLMs on edge devices in multi-tenant environments. EdgeLoRA incorporates three key innovations: (1) an adaptive adapter selection mechanism to streamline the adapter configuration process; (2) heterogeneous memory management, leveraging intelligent adapter caching and pooling to mitigate memory operation overhead; and (3) batch LoRA inference, enabling efficient batch processing to significantly reduce computational latency. Comprehensive evaluations using the Llama3.1-8B model demonstrate that EdgeLoRA significantly outperforms the status quo (i.e., llama.cpp) in terms of both latency and throughput. The results demonstrate that EdgeLoRA can achieve up to a 4 times boost in throughput. Even more impressively, it can serve several orders of magnitude more adapters simultaneously. These results highlight EdgeLoRA's potential to transform edge deployment of LLMs in multi-tenant scenarios, offering a scalable and efficient solution for resource-constrained environments.

Summary

  • The paper demonstrates that EdgeLoRA streamlines multi-tenant LLM serving on edge devices by integrating adaptive adapter selection, heterogeneous memory management, and batch LoRA inference.
  • It achieves up to 4× throughput improvements and reduced latency, validating its efficacy in resource-constrained scenarios.
  • Its design supports thousands of adapters simultaneously while optimizing energy consumption for scalable edge deployments.

EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices

The paper "EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices" proposes a system designed to address the challenges of deploying LLMs with Low-Rank Adaptation (LoRA) on resource-constrained edge devices. By integrating three main innovations — adaptive adapter selection, heterogeneous memory management, and batch LoRA inference — the proposed system significantly boosts performance, achieving substantial improvements in latency, throughput, and scalability.

Introduction

EdgeLoRA is designed to efficiently serve LLMs with multiple LoRA adapters in multi-tenant edge environments, a setting where computational resources are limited. The paper identifies critical challenges such as the complexity of selecting appropriate adapters for various tasks, excessive memory overhead due to frequent adapter swaps, and underutilization of computational resources due to sequential request processing.

The system combines several innovations:

  • Adaptive Adapter Selection: Automatically identifies and deploys optimal adapters based on request-specific requirements.
  • Heterogeneous Memory Management: Utilizes intelligent caching and pooling techniques to reduce memory operation overhead.
  • Batch LoRA Inference: Enables efficient batch processing to significantly reduce computational latency and improve resource utilization.

Overall, EdgeLoRA addresses the distinct needs of serving LLMs on edge devices by effectively managing adapters and optimizing inference processes (Figure 1).

Figure 1: Multi-tenant LLM Serving on Edge Devices.

System Design

Adaptive Adapter Selection

The adaptive adapter selection component optimizes the choice of the most suitable LoRA adapter for each incoming request. The mechanism analyzes incoming prompts and selects adapters based on their availability and suitability, leveraging a profiling-based method to train an adapter router. The router scores each adapter's expected performance and preferentially selects high-scoring adapters already resident in memory, minimizing both latency and manual intervention (Figure 2).

Figure 2: The workflow of adaptive adapter selection.
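
The paper summary does not spell out the router's interface, but the selection logic it describes can be sketched as follows. This is a minimal Python illustration under stated assumptions: `score_fn` stands in for the profiling-based router, and the `margin` parameter (how much score the system will sacrifice to avoid loading an adapter from storage) is hypothetical, not a documented EdgeLoRA setting.

```python
# Hypothetical sketch of adaptive adapter selection: score candidate
# adapters for a prompt and prefer a well-scoring adapter that is already
# resident in memory, avoiding the cost of swapping one in from storage.

def select_adapter(prompt_features, adapters, in_memory, score_fn, margin=0.05):
    """Return the adapter id to serve a request with.

    adapters:   iterable of adapter ids registered with the router
    in_memory:  set of adapter ids currently resident in memory (subset of adapters)
    score_fn:   profiling-based router, maps (features, adapter_id) -> score
    margin:     score we are willing to give up to avoid an adapter swap
    """
    scores = {a: score_fn(prompt_features, a) for a in adapters}
    best = max(scores, key=scores.get)
    # Prefer a resident adapter whose score is within `margin` of the best.
    resident = [a for a in in_memory if scores[a] >= scores[best] - margin]
    if resident:
        return max(resident, key=scores.get)
    return best  # otherwise load the globally best adapter
```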

Heterogeneous Memory Management

Heterogeneous memory management combines memory caching with pre-allocated memory pools to maximize efficiency and minimize runtime memory allocation overhead. The memory cache employs an LRU policy, so frequently accessed adapters stay resident in memory, optimizing resource utilization under dynamic workloads (Figure 3).

Figure 3: The adapter memory manager evicts the least frequently used adapter and loads the newly required one into a free memory block in the pool.
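
As a rough illustration of how a pre-allocated pool and an eviction-based cache can interact, the following Python sketch keeps adapters in fixed-size blocks drawn from a pool and evicts under an LRU policy, matching the text above (the figure caption refers to least-frequently-used eviction, so the exact policy may differ). `load_fn` and the block layout are assumptions; EdgeLoRA's actual implementation is C++ inside llama.cpp.

```python
from collections import OrderedDict

# Minimal sketch: adapter weights live in fixed-size blocks from a
# pre-allocated pool, so serving never allocates memory at runtime;
# an LRU policy picks the victim when the pool is exhausted.

class AdapterCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pre-allocated pool
        self.resident = OrderedDict()               # adapter_id -> block, in LRU order

    def get(self, adapter_id, load_fn):
        if adapter_id in self.resident:
            self.resident.move_to_end(adapter_id)   # mark as most recently used
            return self.resident[adapter_id]
        if not self.free_blocks:                    # pool full: evict LRU adapter
            _, freed = self.resident.popitem(last=False)
            self.free_blocks.append(freed)
        block = self.free_blocks.pop()
        load_fn(adapter_id, block)                  # copy weights into the free block
        self.resident[adapter_id] = block
        return block
```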

Batch LoRA Inference

Batch LoRA inference improves computational efficiency in multi-tenant environments by processing multiple requests in a single batch. This approach exploits the parallelism of modern GPU hardware, reducing per-request latency and enhancing throughput. Requests using different adapters are batched together: the shared pre-trained weights are applied to the whole batch in one pass, while each request's LoRA-specific low-rank weights are applied individually (Figure 4).

Figure 4: Batch LoRA inference.
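
Since a LoRA layer computes h = xW + xAB, the base term xW can be shared across the whole batch even when every request uses a different adapter. The numpy sketch below illustrates that split; the function name, shapes, and dict-based adapter lookup are illustrative assumptions, not EdgeLoRA's API.

```python
import numpy as np

# Illustrative batched LoRA forward pass: one GEMM against the shared base
# weight W for the whole batch, plus a cheap per-request rank-r update
# through that request's adapter matrices A and B.

def batch_lora_forward(x, W, adapters, adapter_ids):
    """x: (batch, d_in); W: (d_in, d_out) shared base weight.
    adapters: dict id -> (A, B) with A: (d_in, r), B: (r, d_out).
    adapter_ids: length-`batch` list giving each request's adapter."""
    base = x @ W                           # shared computation for all requests
    out = np.empty_like(base)
    for i, aid in enumerate(adapter_ids):  # per-request low-rank path
        A, B = adapters[aid]
        out[i] = base[i] + (x[i] @ A) @ B  # rank-r update, cheap vs. the full GEMM
    return out
```

In a real serving system this per-request loop would itself be fused into batched kernels, but the sketch captures the key design choice: diverse adapters no longer force sequential processing.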

Implementation and Evaluation

EdgeLoRA was implemented through extensive modifications to the llama.cpp framework, with C++ code for efficiently handling multi-adapter workloads. Evaluation across diverse edge devices establishes EdgeLoRA's advantages in throughput, latency, and energy efficiency over existing solutions.

EdgeLoRA demonstrated strong scalability, supporting thousands of adapters simultaneously while achieving throughput improvements of up to 4× over conventional methods. These results underscore the system's suitability for large-scale edge deployments (Figure 5).

Figure 5: Throughput and average request latency of EdgeLoRA and EdgeLoRA (w/o AAS) under varying numbers of adapters. Both demonstrate scalability to a large number of adapters with similar throughput.

Conclusion

EdgeLoRA offers a robust solution for efficiently serving LoRA-adapted LLMs on edge devices, addressing the challenges of adapter selection, memory management, and inference efficiency. The system achieves significant gains in throughput and energy efficiency, positioning it to transform LLM deployment in resource-constrained settings. Future work may extend EdgeLoRA's capabilities to further enhance adaptability and efficiency in broader edge computing scenarios.


