
Cache-Aware Request Scheduling

Updated 31 August 2025
  • Cache-aware request scheduling is a strategy that aligns task sequencing with cache organization and data reuse to minimize conflicts and improve data locality.
  • The approach employs methods ranging from greedy heuristics to formal integer programming, achieving significant throughput gains and reduced latency across varied environments.
  • It is applied in domains from embedded multiprocessor systems to large-scale machine learning serving, emphasizing dynamic partitioning, load balancing, and unified memory pooling.

Cache-aware request scheduling refers to scheduling strategies that explicitly leverage information about the memory hierarchy—especially cache state and data reuse—when assigning and sequencing computational or content-delivery requests across processing or network resources. Across hardware and software layers, such approaches seek to minimize cache conflicts and maximize data locality, thereby increasing throughput, reducing latency, and improving power efficiency. Research in this area spans operating systems, multiprocessor architectures, real-time embedded systems, web/content delivery platforms, and large-scale machine learning serving environments.

1. Design Principles and Motivations

Cache-aware scheduling aims to align request (task, thread, process, or content fetch) sequencing and placement with underlying cache organization, data sharing patterns, and resource limitations. Two main principles emerge:

  • Separation of non-sharing workloads: Processes or requests that do not reuse or share data are preferentially assigned to different cache domains, cores, or processing nodes, reducing capacity and conflict misses (e.g., in embedded MPSoCs (0710.4652)).
  • Co-location of sharing workloads: Requests that exhibit high inter-request or process data sharing (as identified by static or dynamic analysis) are intentionally mapped to the same execution context so as to exploit temporal and spatial locality in the cache (e.g., mapping dependent processes to one core in (0710.4652); co-locating tasks with high communication affinity on the same core in (Zaourar et al., 2013)).

The importance arises in diverse scenarios:

  • Embedded multiprocessor systems, where on-chip cache is limited and cache misses are highly penalizing.
  • Multicore hard real-time systems, where cache uncertainty induces unpredictability in worst-case execution times (WCET).
  • Modern LLM inference serving, where the cumulative size of key-value (KV) caches often dominates GPU memory budgets.
  • High-performance wireless or content delivery networks, where cache placement and request scheduling interact to impact latency and throughput.

2. Algorithms and Mathematical Formulations

Research in cache-aware scheduling formalizes scheduling decisions with a range of algorithms from greedy heuristics to complex combinatorial optimizations:

Locality-aware Scheduling in MPSoCs

(0710.4652) defines a greedy scheduling algorithm operating in two phases:

    1. Identify independent processes and, if their number exceeds the number of cores, remove those with maximal shared data until parity is reached.
    2. Iteratively chain processes on each core by maximizing the pairwise shared data $|\text{SS}_{i,j}|$ (the size of the set shared between processes $i$ and $j$).

The process is formalized in pseudocode involving selection and scheduling based on a sharing matrix $M[i][j]$.
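
A minimal Python sketch of this two-phase procedure, assuming a symmetric sharing matrix M with M[i][j] ≈ |SS_{i,j}| (the phase-1 seeding rule and round-robin chaining below are illustrative simplifications, not the paper's exact pseudocode):

```python
import itertools

def greedy_locality_schedule(processes, M, num_cores):
    """Two-phase greedy sketch; M[i][j] approximates the shared-set size |SS_{i,j}|."""
    # Phase 1 (simplified): seed each core with a process that shares little data
    # with the rest, so non-sharing workloads land in different cache domains.
    total_sharing = {p: sum(M[p][q] for q in processes if q != p) for p in processes}
    seeds = sorted(processes, key=lambda p: total_sharing[p])[:num_cores]
    chains = {core: [seed] for core, seed in enumerate(seeds)}
    remaining = [p for p in processes if p not in seeds]

    # Phase 2: visit cores round-robin and chain on the remaining process that
    # maximizes pairwise shared data with the core's last-scheduled process.
    for core in itertools.cycle(chains):
        if not remaining:
            break
        last = chains[core][-1]
        best = max(remaining, key=lambda q: M[last][q])
        chains[core].append(best)
        remaining.remove(best)
    return chains

# Toy instance: processes 0/2 and 1/3 share data heavily; 0 and 1 are independent.
M = [[0, 0, 8, 1],
     [0, 0, 1, 8],
     [8, 1, 0, 0],
     [1, 8, 0, 0]]
print(greedy_locality_schedule([0, 1, 2, 3], M, num_cores=2))  # {0: [0, 2], 1: [1, 3]}
```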

Data Re-mapping to Reduce Conflict Misses

(0710.4652) introduces a formula for address re-mapping among arrays:

$$\text{addr}'(A[x,y]) = 2\cdot\text{addr}(A[x,y]) - \bigl(\text{addr}(A[x,y])\bmod(C/2)\bigr) + b$$

where $b \in \{0,\, C/2\}$ biases the mapping to avoid overlapping cache sets for different arrays scheduled in the same core context.
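
A small sketch, assuming $C$ is the cache capacity expressed in the same units as the array addresses, illustrates that the two bias values place the arrays in disjoint halves of the cache's set-index space:

```python
def remap(addr, C, b):
    """addr' = 2*addr - (addr mod (C/2)) + b, with b in {0, C/2} chosen per array."""
    assert b in (0, C // 2)
    return 2 * addr - (addr % (C // 2)) + b

C = 64  # toy cache size, in the same (word-granular) units as the addresses
sets_a = {remap(a, C, 0) % C for a in range(256)}        # array biased with b = 0
sets_b = {remap(a, C, C // 2) % C for a in range(256)}   # array biased with b = C/2
print(sorted(sets_a)[:4], sorted(sets_b)[:4], sets_a.isdisjoint(sets_b))
# -> [0, 1, 2, 3] [32, 33, 34, 35] True: the two arrays never collide in the cache
```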

Real-time and Affinity-aware Scheduling

Static task scheduling in hard real-time multicore systems (Zaourar et al., 2013) employs integer (or quadratic) programming, with variables $x_{i,j,k}$ indicating per-interval, per-core assignment and quantities such as working set size (WSS) and communication affinities $a_{i,i'}$. The objective function:

$$Z = \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k=1}^{T} \sum_{i'} a_{i,i'} \times x_{i,j,k} \times x_{i',j,k}$$

is optimized subject to cache and temporal constraints.
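
A hedged sketch of evaluating this objective for a given binary assignment tensor (the ILP/QP solver and the WSS/deadline constraints are omitted; the instance and names are illustrative):

```python
import numpy as np

def affinity_objective(x, a):
    """Z for a binary assignment tensor x of shape (tasks, cores, intervals) and an
    affinity matrix a with zero diagonal: sums a[i, i'] over co-located task pairs."""
    n_tasks, n_cores, n_intervals = x.shape
    Z = 0.0
    for j in range(n_cores):
        for k in range(n_intervals):
            v = x[:, j, k]      # indicator of the tasks occupying core j in interval k
            Z += v @ a @ v      # adds a[i, i'] for every pair mapped to (j, k)
    return Z

# Toy instance: tasks 0 and 1 communicate heavily (affinity 5), tasks 1 and 2 barely (1).
a = np.array([[0, 5, 0],
              [5, 0, 1],
              [0, 1, 0]], dtype=float)
x_colocated = np.zeros((3, 2, 1))
x_colocated[0, 0, 0] = x_colocated[1, 0, 0] = x_colocated[2, 1, 0] = 1   # {0,1} | {2}
x_split = np.zeros((3, 2, 1))
x_split[0, 0, 0] = x_split[2, 0, 0] = x_split[1, 1, 0] = 1               # {0,2} | {1}
print(affinity_objective(x_colocated, a), affinity_objective(x_split, a))  # 10.0 0.0
```

The affinity-maximizing solver would prefer the first assignment, which co-schedules the communicating tasks on one core.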

Stack Distance Distribution Modeling

(0902.4822) introduces a model for cache-miss probability via the stack distance CDF:

$$P(\text{miss}) = 1 - \text{cdf}\!\left(\frac{cs}{ls}\right)$$

where $cs$ is the cache size, $ls$ is the line size, and $\text{cdf}$ is the cumulative probability of stack distances less than $cs/ls$.
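
A minimal sketch using an empirical CDF over a toy trace of stack distances measured in cache lines (illustrative; not the cited paper's exact instrumentation or distribution model):

```python
import bisect

def miss_probability(stack_distances, cache_size, line_size):
    """P(miss) = 1 - cdf(cs / ls), with an empirical cdf over observed stack distances
    (distances measured in cache lines)."""
    xs = sorted(stack_distances)
    lines_in_cache = cache_size / line_size
    below = bisect.bisect_left(xs, lines_in_cache)   # distances that fit in the cache
    return 1.0 - below / len(xs)

# Toy reuse-distance trace, in cache lines, for a 32 KiB cache with 64 B lines:
trace = [1, 2, 2, 3, 5, 7, 9, 40, 100, 512]
print(round(miss_probability(trace, cache_size=32 * 1024, line_size=64), 3))
# -> 0.1: only the distance-512 reuse exceeds the 512 lines the cache can hold
```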

Multicast and Edge/Content-aware Scheduling

For multicasting and edge cache scheduling, Markov decision processes (MDPs) optimize over queue states, action costs, and reward functions (see (Zhou et al., 2015, Zhou et al., 2015)). The Bellman equation formalism allows derivation of threshold (“switch”) structures for optimal scheduling policies.
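
A toy value-iteration sketch shows how a threshold (“switch”) policy emerges in a simplified batch-service queue (a stand-in model, not the exact MDP or cost structure of the cited papers):

```python
import numpy as np

# Toy batch-service queue: one broadcast serves every queued request at a fixed cost,
# waiting accrues a per-request holding cost, and arrivals are Bernoulli(p).
N, p, hold, serve_cost, gamma = 20, 0.5, 1.0, 5.0, 0.95

V = np.zeros(N + 1)                               # expected discounted cost per queue length
next_state = np.minimum(np.arange(N + 1) + 1, N)  # queue length after one arrival (capped at N)
for _ in range(2000):                             # value iteration
    wait = hold * np.arange(N + 1) + gamma * ((1 - p) * V + p * V[next_state])
    serve = serve_cost + gamma * ((1 - p) * V[0] + p * V[1])
    V = np.minimum(wait, serve)

policy = ["serve" if serve < wait[q] else "wait" for q in range(N + 1)]
print(policy)   # flips from "wait" to "serve" once the queue length crosses a threshold
```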

3. Application Contexts and Experimental Findings

Cache-aware request scheduling spans numerous runtimes and workload types:

Embedded MPSoCs

In an 8-core configuration with 8KB, 2-way caches, (0710.4652) reports that both locality-aware scheduling (LS) and locality-aware scheduling with data mapping (LSM) outperform round-robin (RRS) and random scheduling (RS) by exploiting intra-task data reuse and minimizing conflicts. When tasks run concurrently, LSM’s data layout component becomes vital, as unrelated process interleaving would otherwise increase conflict misses.

Real-time Scheduling

(Zaourar et al., 2013) demonstrates that affinity-aware static scheduling not only meets real-time deadlines but also reduces L1 cache misses by co-scheduling tasks exhibiting communication flows on the same core/cache, thereby providing deterministic cache use in hard real-time multicore environments.

Multi-level and Dynamic Partitioning

(Awan et al., 2017) presents dynamic cache partitioning in mixed-criticality systems—at mode change (e.g., entering high-criticality mode), cache portions reserved for low-criticality tasks are immediately reassigned to high-criticality ones, potentially reducing the latter’s WCET. Integer linear programming heuristics drive both the initial and dynamic allocations, improving weighted schedulability by up to 30.6% compared to static allocation.
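
A schematic sketch of the mode-change step (the even split below is purely illustrative; the cited work drives both the initial and dynamic allocations with ILP-based heuristics):

```python
def reallocate_on_mode_change(partitions, criticality):
    """At a switch to high-criticality mode, pool the cache ways reserved for
    low-criticality tasks and hand them to high-criticality tasks (toy even split)."""
    freed = sum(w for t, w in partitions.items() if criticality[t] == "LO")
    hi_tasks = [t for t in partitions if criticality[t] == "HI"]
    new = {t: (0 if criticality[t] == "LO" else w) for t, w in partitions.items()}
    for i, t in enumerate(hi_tasks):                  # distribute the freed ways evenly
        new[t] += freed // len(hi_tasks) + (1 if i < freed % len(hi_tasks) else 0)
    return new

partitions  = {"t1": 4, "t2": 2, "t3": 2}             # cache ways per task in LO mode
criticality = {"t1": "HI", "t2": "HI", "t3": "LO"}
print(reallocate_on_mode_change(partitions, criticality))  # {'t1': 5, 't2': 3, 't3': 0}
```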

Large-scale Serving and Online Systems

Recent advances leverage cache-aware scheduling to maximize utilization and minimize cost in data center AI serving. Systems such as Mell (Qianli et al., 12 Jan 2025), FlowKV (Li et al., 3 Apr 2025), and Apt-Serve (Gao et al., 10 Apr 2025) use knowledge of the KV cache (its size, sharing patterns, or type) to:

  • Balance request loads across GPUs to maximize KV cache fit (Mell).
  • Selectively use memory-efficient variants (e.g., hidden caches vs. KV cache) based on runtime pressure and batch composition (Apt-Serve).
  • Assign/route requests elastically in response to cluster cache status, minimizing transfer overhead (FlowKV, TokenLake (Wu et al., 24 Aug 2025)).

Empirical results demonstrate up to 8.8× throughput improvement (Apt-Serve), 25%–48.9% inference acceleration (FlowKV), and 2.0×–2.6× throughput and hit rate increases for prefix caching (TokenLake).
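
The KV-cache-aware load-balancing idea can be sketched as a simple dispatcher that routes each request to the GPU with the most free KV-cache memory (a toy illustration; systems such as Mell additionally bound migrations and account for per-token cache growth):

```python
from dataclasses import dataclass, field

@dataclass
class GPU:
    name: str
    kv_capacity_gb: float
    kv_used_gb: float = 0.0
    requests: list = field(default_factory=list)

    @property
    def kv_free_gb(self):
        return self.kv_capacity_gb - self.kv_used_gb

def dispatch(request_id, est_kv_gb, gpus):
    """Route a request to the GPU with the most free KV-cache memory that still fits
    its estimated KV footprint; return None if nothing fits (caller would queue it)."""
    candidates = [g for g in gpus if g.kv_free_gb >= est_kv_gb]
    if not candidates:
        return None
    target = max(candidates, key=lambda g: g.kv_free_gb)
    target.kv_used_gb += est_kv_gb
    target.requests.append(request_id)
    return target

gpus = [GPU("gpu0", kv_capacity_gb=24.0), GPU("gpu1", kv_capacity_gb=24.0)]
for rid, kv in enumerate([6.0, 10.0, 5.0, 8.0]):
    chosen = dispatch(rid, kv, gpus)
    print(rid, "->", chosen.name if chosen else "queued")
# -> requests alternate between gpu0 and gpu1, keeping KV usage balanced
```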

4. Abstractions and System Integration

A common trajectory has been toward greater abstraction and decoupling between cache management and request scheduling:

  • TokenLake (Wu et al., 24 Aug 2025) shifts from joint cache-scheduler coupling to a declarative unified prefix cache pool, exposing global cache load and segment status to the scheduler. A heavy-hitter-aware replication and caching algorithm performs dynamic balancing, deduplication, and defragmentation, while the scheduler can remain stateless with respect to underlying cache layout.
  • FlowKV (Li et al., 3 Apr 2025) eliminates bottlenecks in KV cache transfer between prefill and decode nodes through contiguous memory layout transformations and segment alignment, enabling flexible PD node allocation driven by real-time load-aware scheduling.

Such decoupling allows stateless, elastic, and scalable scheduling schemes while maximizing hardware utilization and minimizing redundant cache storage.
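
As a toy illustration of such decoupling (not TokenLake's actual interface or algorithm), a stateless scheduler can query a global prefix-segment index for the instance holding the longest cached prefix, while the pool tracks heavy-hitter segments for replication:

```python
from collections import defaultdict

class PrefixCachePool:
    """Toy global prefix-segment index: tracks which serving instance caches which
    segment and how often segments are requested, so hot ("heavy hitter") segments
    can be replicated across instances."""

    def __init__(self, replicate_threshold=3):
        self.holders = defaultdict(set)   # segment hash -> instances caching it
        self.hits = defaultdict(int)      # segment hash -> request count
        self.replicate_threshold = replicate_threshold

    def register(self, instance, segments):
        for seg in segments:
            self.holders[seg].add(instance)

    def best_instance(self, segments):
        """Return the instance caching the longest leading prefix of `segments`."""
        for seg in segments:
            self.hits[seg] += 1
        best, best_len = None, 0
        for inst in {i for s in segments for i in self.holders.get(s, ())}:
            n = 0
            for seg in segments:
                if inst not in self.holders.get(seg, ()):
                    break
                n += 1
            if n > best_len:
                best, best_len = inst, n
        return best, best_len

    def heavy_hitters(self):
        return [s for s, h in self.hits.items() if h >= self.replicate_threshold]

# A stateless scheduler only consults the pool's global view:
pool = PrefixCachePool()
pool.register("inst-A", ["sys-prompt", "few-shot"])
pool.register("inst-B", ["sys-prompt"])
print(pool.best_instance(["sys-prompt", "few-shot", "user-query"]))  # ('inst-A', 2)
```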

5. Methodological Advances and Broader Implications

Cache-aware request scheduling has driven several notable advancements:

  • Lightweight cache stress analysis: By modeling stack distance distributions with minimal instrumentation (0902.4822), systems can obtain fast, constant-time cache performance predictions useful for online scheduling in heterogeneous clusters.
  • Switch/threshold policy design: For content-centric networks, analysis demonstrates that optimal or near-optimal policies reduce to low-complexity threshold-based rules (queue exceeds threshold ⇒ schedule request), facilitating scalable online implementation (Zhou et al., 2015, Zhou et al., 2015).
  • Unified pooling and load balancing: Modern frameworks like TokenLake (Wu et al., 24 Aug 2025) apply global memory pooling and segment-level partitioning—avoiding data redundancy and memory fragmentation when sharing long prefixes across LLM serving clusters.
  • Theoretical performance guarantees: New online scheduling algorithms for LLMs offer formal competitive ratio bounds with respect to computed hindsight optima (Jaillet et al., 10 Feb 2025), while multi-GPU KV cache schedulers guarantee bounded migration count and close-to-optimal resource usage (Qianli et al., 12 Jan 2025).

6. Limitations and Open Challenges

While performance improvements are substantial, several limitations or research challenges persist:

  • Accurate inference of future request behavior, input sizes, and output lengths remains an open problem in online systems (Jaillet et al., 10 Feb 2025).
  • Cache-aware heuristics often rely on accurate static knowledge (e.g., WSS, data affinity, sharing matrices); adapting to non-stationary or input-dependent behavior is nontrivial (0902.4822).
  • Scalable global coordination (e.g., in multi-cache or multi-instance clusters) can introduce time or computational overhead if not managed carefully (Wu et al., 24 Aug 2025).
  • For generic adversarial inputs in online paging with multiple caches, the competitive ratio is provably unbounded (Vaze et al., 2016); optimality usually depends on i.i.d. or stochastic request assumptions.

A plausible implication is that continued advances will rely on closed-loop, data-driven methods and flexible, segment-level memory management abstractions that further decouple cache management from higher-level scheduling logic.

7. Extensions and Future Directions

Prospective directions emerging from recent research include:

  • Exploiting fine-grained pooling and rapid migration for elastic LLM serving at scale (Wu et al., 24 Aug 2025), potentially with dynamic segment sizing and adaptive scheduling to match workload patterns.
  • Integrating machine learning-based predictors for cache pressure, tail latency, or per-request resource demand, enhancing online scheduling robustness.
  • Adapting threshold/switch policy designs from wireless/content-centric networks to broader classes of distributed systems, including disaggregated compute environments.
  • Refining memory layout transformation and address remapping techniques to minimize inter-process or inter-instance conflict misses (e.g., as in (0710.4652, Li et al., 3 Apr 2025)).

This synthesis reflects the ongoing evolution of cache-aware request scheduling from localized OS or hardware patterns toward system-wide, abstracted, and stateless scheduling supported by unified memory pooling, probabilistic analysis, and declarative interfaces. Results in both experimental and theoretical domains indicate that such cache-conscious policies are central to the scalability and efficiency of modern computing and service architectures.
