WindVE: Enhancing Vector Embedding Services through CPU-NPU Collaboration
The paper studies how to optimize the vector embedding stage of Retrieval-Augmented Generation (RAG) systems. RAG integrates information retrieval with LLMs, improving the models' ability to generate informed, coherent content by mapping queries to relevant information through vector embeddings. The core challenge the paper addresses is the substantial latency that embedding and retrieval add to inference services, often up to 20% of total latency. This latency directly degrades the cost-performance ratio, a critical metric for the competitiveness of commercial LLM-based inference services.
To address these challenges, the authors propose WindVE, a system that improves hardware utilization through a CPU-NPU heterogeneous architecture. They argue that reducing the deployment cost of vector embedding requires increasing the number of concurrent queries each instance can handle. WindVE therefore offloads peak concurrent queries from NPUs or GPUs to the multi-core CPUs already present in the server, raising concurrency without compromising performance and thereby lowering deployment costs.
Contributions of WindVE
- Queue Management and Optimization: The paper introduces a queue manager that uses a linear regression model to determine optimal queue depths, so that queries are dispatched efficiently and concurrency peaks spill over to CPUs once NPUs/GPUs reach capacity. The regression model predicts the maximum concurrency a device can handle without exceeding a predefined latency limit, a critical parameter for keeping the system within its service-level targets.
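The depth-selection idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, the specific slope/intercept values, and the admit/reject interface are all assumptions; only the underlying idea (a linear fit from concurrency to latency, inverted to get a maximum queue depth under a latency budget) comes from the paper.

```python
from collections import deque

class QueueManager:
    """Illustrative queue manager: a per-device linear latency model
    latency ≈ slope * concurrency + intercept (calibrated offline)
    is inverted to find the deepest queue that stays under the budget."""

    def __init__(self, slope, intercept, latency_limit):
        # Round to the nearest integer depth; the fit is approximate anyway.
        self.max_depth = int(round((latency_limit - intercept) / slope))
        self.queue = deque()

    def admit(self, query):
        if len(self.queue) < self.max_depth:
            self.queue.append(query)
            return True   # within the latency budget for this device
        return False      # caller should offload to the fallback device

# Example devices under a 2-second budget (coefficients are made up):
npu = QueueManager(slope=0.05, intercept=0.2, latency_limit=2.0)  # depth 36
cpu = QueueManager(slope=0.25, intercept=0.5, latency_limit=2.0)  # depth 6
```

In this sketch the NPU, with its shallower latency-versus-concurrency slope, earns a much deeper queue than the CPU under the same budget, which is exactly why the CPU serves only as overflow capacity.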
- CPU-NPU Heterogeneous Architecture: WindVE employs a heterogeneous computing model that integrates both CPUs and NPUs, mitigating the risk of NPU/GPU instances becoming bottlenecks during peak loads. By absorbing traffic surges through CPU offloading, the system improves throughput by up to 22.3% over a conventional vector embedding framework such as FlagEmbedding, without additional hardware.
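The offloading policy itself reduces to a simple priority rule: fill the NPU first, spill to the CPU, and only reject when both are saturated. The sketch below is self-contained and illustrative; the capacity constants and the reject path are assumptions, not details from the paper.

```python
# Assumed per-device in-flight limits (illustrative values only).
NPU_CAPACITY = 36
CPU_CAPACITY = 6

def dispatch(query, npu_queue, cpu_queue):
    """Route one query: prefer the NPU, spill overflow to the CPU."""
    if len(npu_queue) < NPU_CAPACITY:
        npu_queue.append(query)
        return "npu"
    if len(cpu_queue) < CPU_CAPACITY:
        cpu_queue.append(query)
        return "cpu"
    return "reject"  # both devices saturated

# During a burst, the first NPU_CAPACITY queries land on the NPU and the
# next CPU_CAPACITY land on the CPU, raising effective peak concurrency.
```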
- Resource Cost Benefits: Leveraging the multi-core CPUs already present in each server to absorb query bursts significantly reduces deployment costs. The experiments demonstrate savings of up to 18.6% in hardware deployment costs, improving the economic feasibility of high-performance LLM services in cost-sensitive settings.
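The cost argument follows directly from the concurrency gain: if each instance absorbs more peak queries, fewer instances are needed for the same load. The back-of-envelope below uses entirely assumed numbers (peak load, per-instance capacities) purely to show the shape of the calculation; it does not reproduce the paper's 18.6% figure.

```python
def instances_needed(peak_queries, per_instance_concurrency):
    # Ceiling division: enough instances to cover the peak load.
    return -(-peak_queries // per_instance_concurrency)

# Hypothetical numbers: 1000 peak concurrent queries, NPU-only capacity
# of 36 per instance, plus 8 extra queries absorbed by CPU offloading.
baseline = instances_needed(1000, 36)       # NPU-only deployment
with_offload = instances_needed(1000, 36 + 8)
saving = 1 - with_offload / baseline        # fractional cost reduction
```

Under these assumed numbers the saving comes out near the paper's reported range, but the exact figure depends entirely on the real capacities and load profile.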
Experimental Validation
The authors conducted a comprehensive experimental evaluation of WindVE against the state-of-the-art vector embedding framework FlagEmbedding. The experiments show a concurrency improvement of up to 22.3% with WindVE's CPU-NPU collaborative architecture, especially under high query load and tight latency constraints (e.g., within 2 seconds). These results support the hypothesis that distributing workload across a heterogeneous architecture can meet high-volume, low-latency demands without incurring new hardware costs.
Theoretical and Practical Implications
The findings carry several implications for both theoretical modeling and practical deployment of AI systems:
- Theoretical Model for Concurrency Enhancement: The paper's formulation of the relationship between latency, concurrency, and hardware resource allocation underscores opportunities for advanced predictive modeling in cloud and edge computing settings.
- Scalable Distributed Systems: The results pave the way for more scalable distributed systems capable of supporting increasingly data-intensive AI applications by strategically utilizing the diverse processing capabilities of heterogeneous architectures.
- Future Scope in AI Deployment: This approach can serve as a precursor for other AI applications seeking sustainable methods to balance performance demands with resource expenditures.
Overall, the study deepens our understanding of how heterogeneous architectures can improve the efficiency of real-time data processing, particularly vector embedding for LLM-based retrieval. Future work could extend the model to GPUs, FPGAs, and other specialized AI processors, broadening the scope of efficient AI deployments across diverse computational environments.