HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing

Published 20 Apr 2026 in cs.PF and cs.DC | (2604.18529v1)

Abstract: As modern LLMs support thousands to millions of tokens, KV caches grow to hundreds of gigabytes, stressing memory capacity and bandwidth. Existing solutions, such as KV cache pruning and offloading, alleviate these but underutilize hardware by relying solely on either GPU or CPU for attention computing, and considering yet limited CPU local memory for KV cache storage. We propose HybridGen, an efficient hybrid attention framework for long-context LLM inference. HybridGen enables CPU-GPU collaborative attention on systems with expanded tiered memory (e.g., CXL memory), addressing three key challenges: (1) multi-dimensional attention dependencies, (2) intensifying CPU-GPU load imbalance with longer sequences, and (3) NUMA penalty of tiered memories. HybridGen tackles these by introducing attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping. Experiments with three LLM models with eleven different sizes on three GPU platforms with a CXL-expanded memory show that HybridGen outperforms six state-of-the-art KV cache management methods by 1.41×–3.2× on average while maintaining superior accuracy.

Summary

  • The paper introduces a novel CPU-GPU cooperative attention mechanism that alleviates memory pressure from KV cache growth.
  • The approach leverages pipelined logit computation and semantic-aware KV caching to balance workload and reduce latency.
  • Experimental results demonstrate 1.41×–3.2× latency improvements while maintaining accuracy with less than 0.02 absolute degradation.

Motivation and Problem Analysis

The expansion of LLM context lengths has created severe memory pressure due to key-value (KV) cache growth. While model weights remain fixed, KV caches scale linearly with sequence length and batch size, quickly exceeding GPU memory capacity. Existing solutions such as pruning or offloading alleviate resource constraints but fail to exploit the full computing potential of heterogeneous systems. Pruning often degrades accuracy by discarding context. Offloading, whether of computation or of cache storage, suffers from either GPU transfer bottlenecks or CPU compute limitations at long sequences.

HybridGen proposes CPU-GPU collaborative attention execution, in which each processor operates on the tokens in its local memory. This approach must address three challenges: (1) multi-dimensional processor and memory dependencies within and across transformer layers; (2) CPU-GPU load imbalance, which worsens as the generation context grows; and (3) NUMA penalties incurred by tiered memory systems (e.g., CXL). Figure 1 illustrates the memory demand imposed by KV cache growth.

Figure 1: KV cache memory consumption of OPT-13B across varying sequence lengths and batch sizes. The dashed line indicates the model's weight size, highlighting the scalability issue of KV cache.
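
To make the linear scaling concrete, the following is a minimal sketch of the usual KV cache size estimate for standard multi-head attention. The layer count and hidden size (40 and 5120) are assumed values for an OPT-13B-like configuration, fp16 storage is assumed, and the name kv_cache_bytes is illustrative, not taken from the paper.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes for standard multi-head attention."""
    # Factor 2: each token stores one key and one value vector per layer.
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_elem


# Assumed OPT-13B-like shape (not stated in the summary): 40 layers, hidden size 5120.
ASSUMED_LAYERS = 40
ASSUMED_HIDDEN = 5120

for batch_size in (1, 8, 32):
    for seq_len in (2048, 32768):
        size = kv_cache_bytes(ASSUMED_LAYERS, ASSUMED_HIDDEN, seq_len, batch_size)
        print(f"batch={batch_size:>2} seq={seq_len:>6}: {size / 2**30:.2f} GiB")
```

Because the estimate is a plain product, doubling either the sequence length or the batch size doubles the cache, while the weight size stays constant.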

Through detailed breakdowns, the paper shows that neither complete offloading to the CPU nor full GPU attention computation suffices: the CPU becomes compute-bound at long contexts, while the GPU is bottlenecked by memory transfers. Figures 2, 3, and 4 quantify the data movement, computation distribution, and latency under pure CPU or GPU affinity and show that hybrid execution is more efficient.

Figure 2: Estimated data traffic during attention layer computation per iteration under different KV cache management strategies.

Figure 3: Computation distribution on transformer block computation per iteration under different strategies, revealing load imbalance in AoG and AoC.

Figure 4: Estimated latency under CPU, GPU, and hybrid strategies; hybrid achieves lowest end-to-end runtime.

HybridGen Architecture and Core Methodology

HybridGen introduces a lightweight framework for pipelined CPU-GPU parallel attention. The architecture (Figure 5) decouples attention logit computation, so the CPU computes logits for offloaded tokens while the GPU computes logits for tokens resident in its local memory.

Figure 5: Architecture of HybridGen, illustrating components for logit calculation, token mapping, and feedback scheduling.
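
The decoupling can be illustrated with a small pure-Python sketch. It is not the paper's implementation: the function names and the single-query, single-head setup are assumptions. It only shows that logits for two disjoint token partitions can be computed independently and then combined by one global softmax, giving the same output as attention over all tokens.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def compute_logits(query, keys):
    """Scaled dot-product logits; each key is handled independently."""
    scale = 1.0 / math.sqrt(len(query))
    return [dot(query, k) * scale for k in keys]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(weights, values):
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def full_attention(query, keys, values):
    return aggregate(softmax(compute_logits(query, keys)), values)


def hybrid_attention(query, gpu_keys, gpu_values, cpu_keys, cpu_values):
    # These two calls have no data dependency on each other, so in a real
    # system one could run on the GPU and the other on the CPU.
    gpu_logits = compute_logits(query, gpu_keys)
    cpu_logits = compute_logits(query, cpu_keys)
    # Softmax and value aggregation need every logit, so they run once, globally.
    weights = softmax(gpu_logits + cpu_logits)
    return aggregate(weights, gpu_values + cpu_values)


query = [0.3, -0.7, 1.1]
keys = [[0.1, 0.2, 0.3], [1.0, -1.0, 0.5], [-0.4, 0.9, 0.0], [0.7, 0.7, -0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

reference = full_attention(query, keys, values)
hybrid = hybrid_attention(query, keys[:2], values[:2], keys[2:], values[2:])
print(max(abs(r - h) for r, h in zip(reference, hybrid)))
```

In this toy version the split point (two tokens per side) is arbitrary; choosing it well is the load-balancing problem the scheduler addresses.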

Attention logit computation is parallelized: softmax and value aggregation depend on all tokens, but the logit for each token can be computed independently. HybridGen leverages the high cosine similarity between the inputs of consecutive layers (Figure 6) to pipeline logit computation across layers, allowing the CPU to prepare attention logits for the next layer while the GPU is still computing the current one.

Figure 6: Cosine similarity between inputs of consecutive transformer layers for OPT-6.7B, Qwen2.5-7B, and Llama-3.1-8B, validating input reuse across layers for hybrid scheduling.
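
For reference, the similarity metric in Figure 6 can be computed as in the sketch below. The vectors are made-up numbers, and modeling the next layer's input as the previous input plus a small residual update is an assumption used only to show why consecutive inputs could stay nearly parallel.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up hidden states: the next input is the previous input plus a small update.
layer_input = [1.0, 2.0, -1.0, 0.5]
residual_update = [0.05, -0.10, 0.02, 0.03]
next_layer_input = [x + d for x, d in zip(layer_input, residual_update)]

print(cosine_similarity(layer_input, next_layer_input))
```

A value close to 1 means the current layer's input is a reasonable stand-in for the next layer's, which is the property the pipelining relies on.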

The workflow (Figure 7) overlaps CPU and GPU stages. The CPU selects or computes logits for important tokens according to the scheduler policy and token mapping, then transfers the compact logits and value vectors to the GPU, which completes the global softmax and aggregation.

Figure 7: Workflow of HybridGen illustrating computational pipeline and DMA-based fusion of CPU and GPU outputs.
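
The CPU-to-GPU handoff described above might look like the following sketch. It is a simplified, single-query illustration under assumed names (select_top_k, fuse_on_gpu): the CPU keeps only its k largest logits with the matching value vectors, and the GPU side runs one softmax over its own logits plus the ones it received.

```python
import heapq
import math


def select_top_k(cpu_logits, cpu_values, k):
    """CPU side: keep the k tokens with the largest logits (post-logit selection)."""
    idx = heapq.nlargest(k, range(len(cpu_logits)), key=cpu_logits.__getitem__)
    idx.sort()  # preserve the original token order
    return [cpu_logits[i] for i in idx], [cpu_values[i] for i in idx]


def fuse_on_gpu(gpu_logits, gpu_values, sent_logits, sent_values):
    """GPU side: global softmax and value aggregation over local and received tokens.

    Assumes the GPU holds at least one token.
    """
    logits = gpu_logits + sent_logits
    values = gpu_values + sent_values
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[i] for e, v in zip(exps, values)) for i in range(dim)]


# Made-up logits and values for tokens held on each side.
gpu_logits, gpu_values = [1.2, 0.3], [[1.0, 0.0], [0.0, 1.0]]
cpu_logits, cpu_values = [0.1, 5.0, -2.0, 3.0], [[1.0, 1.0], [2.0, 0.0], [0.0, 2.0], [3.0, 3.0]]

sent_logits, sent_values = select_top_k(cpu_logits, cpu_values, k=2)
print(sent_logits, sent_values)
print(fuse_on_gpu(gpu_logits, gpu_values, sent_logits, sent_values))
```

Only k logits and k value vectors cross the interconnect per step, which is why the transferred data stays compact even when the CPU holds many tokens.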

HybridGen’s feedback scheduler (Figure 8 and the algorithm in the paper) dynamically adapts the CPU workload and the token selection strategy, balancing load and maximizing throughput while enforcing accuracy constraints. Top-K token selection runs either after or before logit computation (post- or pre-selection), depending on bottleneck analysis.

Figure 8: Attention logits computation under post- and pre-token selection mechanisms; the feedback scheduler selects appropriate strategy according to runtime metrics.
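
The paper's scheduling algorithm is not reproduced in this summary, so the following is only a hypothetical sketch of what a feedback loop of this kind could look like. The class name, the fixed step size, the tolerance, and the rule that a CPU bottleneck switches to pre-selection are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FeedbackScheduler:
    """Hypothetical controller: adjusts the CPU's token budget k from timings."""

    k: int                    # current number of CPU tokens selected per step
    k_min: int                # accuracy floor; k never drops below this
    k_max: int                # upper bound on the CPU token budget
    step: int = 8             # tokens added or removed per update (assumed)
    tolerance: float = 0.05   # relative imbalance that is ignored (assumed)

    def update(self, cpu_time, gpu_time):
        """Return (k, mode) for the next iteration given the last timings."""
        if cpu_time > gpu_time * (1 + self.tolerance):
            # CPU is the straggler: shrink its share but keep the accuracy floor,
            # and select tokens before computing logits to save CPU work.
            self.k = max(self.k_min, self.k - self.step)
            return self.k, "pre"
        if cpu_time < gpu_time * (1 - self.tolerance):
            # CPU has slack: give it more tokens and select after logits.
            self.k = min(self.k_max, self.k + self.step)
        return self.k, "post"


scheduler = FeedbackScheduler(k=100, k_min=90, k_max=120)
for cpu_time, gpu_time in [(2.0, 1.0), (2.0, 1.0), (1.0, 2.0), (1.0, 1.0)]:
    print(scheduler.update(cpu_time, gpu_time))
```

The floor k_min plays the role of the accuracy constraint: however slow the CPU is, the budget is never reduced below it.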

Semantic-aware KV cache mapping is introduced for tiered memory (CXL): $\mathbf{K}$ vectors are retained in CPU DRAM, while $\mathbf{V}$ vectors are placed in CXL memory by design, removing NUMA latency from the critical path. This departs from hotness-based mapping: the placement logic is driven by attention semantics, not runtime memory profiling.
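
One way to read this design is sketched below, with hypothetical names (Tier, place, bytes_read_per_step). The byte count encodes an assumption that is not stated explicitly in the summary: logit computation touches the key of every CPU-side token, whereas value vectors are only fetched for the tokens that survive top-K selection, so the slower tier sees far fewer reads.

```python
from enum import Enum


class Tier(Enum):
    DRAM = "cpu_dram"
    CXL = "cxl"


def place(kind):
    """Static placement by tensor role instead of by access hotness."""
    if kind == "K":
        return Tier.DRAM
    if kind == "V":
        return Tier.CXL
    raise ValueError(f"unknown tensor kind: {kind!r}")


def bytes_read_per_step(num_tokens, top_k, head_dim, bytes_per_elem=2):
    """Bytes read from each tier in one decode step for one head (assumed model)."""
    per_vector = head_dim * bytes_per_elem
    selected = min(top_k, num_tokens)
    return {
        place("K"): num_tokens * per_vector,  # every key is needed for logits
        place("V"): selected * per_vector,    # only selected values are fetched
    }


print(bytes_read_per_step(num_tokens=100_000, top_k=512, head_dim=128))
```

Under this assumed access pattern the bulk of the traffic stays in local DRAM, and the CXL tier serves only the small selected set of value vectors.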

Experimental Results and Numerical Evidence

HybridGen was evaluated on three LLM families in eleven model sizes across three GPU platforms with CXL-expanded memory. As context length and batch size grow, HybridGen preserves low inference latency and high throughput, outperforming six state-of-the-art baselines (pruning, selective offloading, etc.) by 1.41×–3.2× on average.

Figure 9 summarizes normalized end-to-end latency across models; HybridGen achieves the lowest latency in every configuration shown.

Figure 9: End-to-end latency of different models (normalized to baseline); HybridGen achieves lowest latency under all configurations.

Performance scaling with sequence length and batch size further demonstrates HybridGen’s robustness. Figure 10 presents latency breakdowns for OPT-13B under varying batch sizes.

Figure 10: Performance of OPT-13B across batch sizes, showing HybridGen’s advantage, particularly at scale.

Semantic-aware mapping (Figure 11) achieves measurable speedup over default page interleaving, especially as model size grows and KV cache consumes non-local memory tiers.

Figure 11: Speedup attained by semantic-aware mapping vs. standard page mapping for large KV cache scenarios.

Accuracy is maintained because the feedback scheduler enforces per-model $K_{\min}$ thresholds, mitigating pruning-induced degradation. Across standard benchmarks, HybridGen stays within 0.02 absolute accuracy of full-context baselines and outperforms static pruning and offloading methods where long-range contextual dependencies are critical.

Practical and Theoretical Implications

HybridGen’s hybrid scheduling and semantic awareness unlock efficient long-context inference for LLMs even as context sizes reach hundreds of thousands or millions of tokens. This enables realistic deployment scenarios where inference must scale beyond GPU memory, leveraging expanded CPU and CXL resources without sacrificing accuracy or throughput.

The design is framework-agnostic and extends to vLLM, SGLang, and other modern serving stacks. The introduction of feedback-driven scheduling establishes a principled approach to dynamic resource management, inherently supporting new hardware architectures (Grace-Hopper, etc.) and emerging system topologies.

HybridGen challenges the traditional boundaries of attention computation affinity, demonstrating that decoupling logit calculation and embracing pipelined, collaborative execution on heterogeneous processors is fundamentally superior to monolithic or static approaches. The semantic-aware KV cache mapping lays groundwork for future NUMA-optimized memory hierarchies tailored specifically to LLM inference primitives.

Future Considerations

Potential future directions include:

  • Further exploitation of CXL-expanded memory, including deep integration with persistent memory devices.
  • Algorithmic extension to support even greater concurrency across multiple CPUs/GPUs per node.
  • Integration with speculative decoding and multi-turn dialogue systems for low-latency interactive inference.
  • Exploration of continuous batching, prefix caching, and paged KV-cache metadata for seamless scaling.

The architecture is well-suited for adaptation to evolving hardware landscapes and for inclusion in system-level scheduling frameworks that optimize for both throughput and latency under resource constraints.

Conclusion

HybridGen introduces a CPU-GPU hybrid attention framework for efficient generative inference in LLMs, leveraging pipelined execution, feedback-driven load balancing, and semantic-aware tiered memory mapping. The approach yields substantial performance improvements without sacrificing accuracy, robustly scaling across model sizes, sequence lengths, batch sizes, and memory configurations. HybridGen sets a new technical standard for resource-efficient, scalable LLM inference on modern heterogeneous systems (2604.18529).
