HyperFlexis: Unified LLM Serving System

Updated 25 August 2025
  • The paper introduces HyperFlexis, a unified system that optimizes LLM serving by integrating multi-SLO-aware scheduling with rapid scaling and device-to-device weight transfer.
  • It employs a centralized dispatcher with token budget estimation, adaptive prefill/decode disaggregation, and dynamic monitoring to handle heterogeneous requests and strict SLOs.
  • Experimental results show up to 4.44× higher SLO attainment, 65.82% lower latency, and a 19.39× reduction in cold-start latency compared to baseline scheduling and weight-loading approaches.

HyperFlexis refers to a unified system and algorithmic design that addresses the multifaceted requirements of serving LLMs in environments characterized by workload variability and strict multi-stage service-level objectives (SLOs) (Yousefijamarani et al., 21 Aug 2025). Modern deployments of LLMs are constrained by the need to handle heterogeneous requests—varying in input length, priority, and SLOs—while maintaining low latency, high SLO compliance, and cost-effective scalability. HyperFlexis presents a tightly integrated software and architectural solution that jointly optimizes per-request scheduling and system-level scaling, incorporates novel device-to-device (D2D) weight transfer for reduced instance bring-up latency, and explicitly supports both collocated and Prefill/Decode (P/D) disaggregated serving architectures.

1. Architectural Foundations

HyperFlexis is structured around several interacting subsystems that collectively enable its multi-SLO and fast-scaling capabilities. Client requests enter through an API Server, specifying their individualized SLOs. A centralized Multi-SLO-Aware Dispatcher orchestrates execution using a global Request Priority Queue (Q_R) for management of incoming requests and a Worker Priority Queue (Q_W) to govern available compute resources. The dispatcher relies on a runtime latency predictor and token budget estimator to make admission and assignment decisions, balancing queue depth, worker utilization, and maturity times to proactively preserve each request’s SLO.
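
For concreteness, a minimal sketch of the two priority queues is shown below; the `Request` and `Worker` fields and their ordering keys are illustrative assumptions rather than the paper's exact interfaces:

```python
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    # Requests with harder SLOs and earlier arrivals sort first (illustrative ordering).
    slo_hardness: float                       # e.g., inverse of the TTFT budget
    arrival_time: float
    prompt: str = field(compare=False)
    ttft_slo: float = field(compare=False)    # time-to-first-token budget (s)
    tpot_slo: float = field(compare=False)    # time-per-output-token budget (s)

@dataclass(order=True)
class Worker:
    # Workers with the earliest maturity time (earliest safe availability) sort first.
    maturity_time: float
    worker_id: str = field(compare=False)
    utilization: float = field(compare=False, default=0.0)

request_queue: list[Request] = []   # Q_R: global request priority queue
worker_queue: list[Worker] = []     # Q_W: worker priority queue

heapq.heappush(request_queue, Request(slo_hardness=1 / 0.5, arrival_time=time.time(),
                                      prompt="...", ttft_slo=0.5, tpot_slo=0.05))
heapq.heappush(worker_queue, Worker(maturity_time=time.time(), worker_id="w0"))
```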

Support is provided for two execution topologies:

  • Collocated serving: Both prompt (prefill) and token (decode) stages are computed sequentially on a single worker.
  • P/D-disaggregated serving: The dispatcher schedules prompt computation on dedicated prefill workers; upon completion, the request is rescheduled via a Migrator module to decode workers, with a TransferLink Manager (TLManager) handling requisite KV cache (key/value) data movement.

A Monitoring subsystem tracks per-worker metrics—utilization, queue lengths, and latency distributions—and feeds them into a Scaling Controller that dynamically provisions and decommissions worker instances based on real-time workload analysis. The interface between these modules is designed for rapid transitions between operating modes, which is crucial for scaling elasticity.

2. Multi-SLO-Aware Scheduling

HyperFlexis optimizes scheduling for heterogeneous SLOs at both prompt (prefill) and decode stages, supporting collocated and disaggregated deployment. The system employs the following scheduling innovations:

  • Token Budget Estimation: On each scheduling event, the dispatcher predicts the time required to serve existing and candidate requests, using a model with empirically fitted coefficients reflecting batching and input length. Only requests that can be satisfied within their SLOs are admitted.
  • Request Prioritization: Requests are sorted by SLO hardness and arrival order, with prioritization directly influencing dispatch decisions. For each worker, a maturity time estimator calculates the earliest moment it would be available for new assignments without risking deadline violations.
  • Adaptive Dispatching: The dispatcher tests each request against the estimated TTL (time-to-live) margin, ensuring proactive TTFT (time-to-first-token) compliance.
  • Rolling SLO Maintenance: Both scheduling and rescheduling (particularly in P/D-disaggregated mode) guarantee that ongoing requests are continuously reassessed for SLO attainment.

This dynamic, multi-stage prioritization demonstrably increases the fraction of requests that achieve both TTFT and TPOT (time-per-output-token) targets, outperforming round-robin, FIFO, and specialized schedulers in head-to-head experiments.
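
To make the admission check concrete, the sketch below follows the token-budget idea described above; the linear latency model and its coefficients are placeholder assumptions, not the empirically fitted values from the paper:

```python
from dataclasses import dataclass

@dataclass
class PendingRequest:
    arrival_time: float
    prompt_tokens: int
    ttft_slo: float          # time-to-first-token budget, in seconds

@dataclass
class WorkerState:
    maturity_time: float     # earliest safe availability for new work
    queued_tokens: int       # tokens already admitted to this worker's batch

def predict_prefill_latency(prompt_tokens: int, batch_tokens: int,
                            a: float = 2e-4, b: float = 5e-5, c: float = 0.01) -> float:
    # Toy linear model in the request's prompt length and the worker's batched tokens;
    # the coefficients here stand in for empirically fitted values.
    return a * prompt_tokens + b * batch_tokens + c

def admit(req: PendingRequest, worker: WorkerState, now: float) -> bool:
    # Admit only if the predicted time-to-first-token still fits the SLO
    # (the proactive TTFT compliance test described above).
    start = max(now, worker.maturity_time)
    predicted_ttft = (start - req.arrival_time) + predict_prefill_latency(
        req.prompt_tokens, worker.queued_tokens)
    return predicted_ttft <= req.ttft_slo
```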

3. Prefill/Decode Disaggregation and Role Transitioning

A defining aspect of HyperFlexis is explicit support for disaggregated prefill and decode serving, which is particularly relevant for high-throughput environments. Under P/D-disaggregation:

  • Stage-Specific Pools: Prefill and decode computations are assigned to isolated pools of workers, improving resource utilization and enabling finer-grained prioritization across stages.
  • Migrator-Orchestrated Rescheduling: Upon completion of prefill, the Migrator module initiates rescheduling of requests for decode, decoupling admission and execution timing between the two stages.
  • KV Cache and Weight Coordination: The TLManager handles transfer of KV cache and model weights between prefill and decode instances. This transfer is designed to accommodate both collocated (intra-node) and distributed (inter-node) operation.

The two-stage scheduling mechanism mitigates unpredictability in decode admission following prompt processing, and instance linking ensures minimal disruption when worker roles change dynamically during scaling events.
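
A simplified sketch of this handoff is shown below; the Migrator and TLManager calls are paraphrased with hypothetical method names, since the paper does not expose an exact API:

```python
def migrate_to_decode(request, prefill_worker, decode_pool, dispatcher, tl_manager):
    """Hedged sketch of the prefill-to-decode handoff: once prefill completes,
    the request is rescheduled onto a decode worker and its KV cache is moved
    by the transfer manager. All object and method names are illustrative."""
    # Pick a decode worker whose availability still fits the request's TPOT SLO.
    decode_worker = dispatcher.select_decode_worker(request, decode_pool)

    # Move the KV cache over a device-to-device link (intra- or inter-node).
    tl_manager.transfer_kv_cache(src=prefill_worker, dst=decode_worker,
                                 request_id=request.request_id)

    # Re-enqueue for decode; admission for this stage is reassessed here,
    # decoupling decode timing from prefill timing.
    dispatcher.enqueue_decode(request, decode_worker)
```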

4. Cost-Effective and Rapid Scaling

HyperFlexis introduces a scaling controller that evaluates system load using metrics including worker utilization, request queue length, and active/wait times. Scaling decisions are governed by defined thresholds ε_o (scaling out) and ε_i (scaling in), and operate asynchronously via timed evaluation epochs (every τ seconds).
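
A minimal sketch of such a loop is given below, reusing the ε_o, ε_i, and τ notation; how utilization and queue length are aggregated into a single load signal is an assumption:

```python
import time

def scaling_loop(monitor, controller, eps_out: float = 0.85,
                 eps_in: float = 0.30, tau: float = 5.0) -> None:
    """Hedged sketch of threshold-based scaling: every tau seconds an aggregate
    load signal is compared against the scale-out (eps_out) and scale-in (eps_in)
    thresholds, and the worker pool is adjusted accordingly."""
    while True:
        m = monitor.snapshot()                      # per-worker utilization, queue lengths, wait times
        load = max(m.mean_utilization,
                   m.queue_length / max(m.queue_capacity, 1))
        if load > eps_out:
            controller.scale_out(n=1)               # add a worker (weights arrive via D2D transfer)
        elif load < eps_in:
            controller.scale_in(n=1)                # retire an under-utilized worker
        time.sleep(tau)
```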

  • Dynamic Worker Provisioning: The system efficiently adds or removes worker instances in response to observed demand, maintaining cost parity with round-robin and outperforming more complex baseline schedulers.
  • Rapid Role Transitions: In P/D-disaggregated mode, scaling events trigger fast transitions of workers between prefill and decode roles, coordinated by the TLManager.
  • Device-to-Device (D2D) Weight Transfer: Scaling out introduces new instances into the serving pool via D2D weight transfer. Instead of incurring high latency by loading model weights from disk or CPU memory, weights are transferred directly from running instances via high-bandwidth device connections, managed asynchronously by the TLManager and mapped tensor-by-tensor via the WeightManager.

Performance results indicate clear benefits: cold-start latency is reduced by up to 19.39× compared to disk-based loading, and scaling transitions become nearly instantaneous.
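
One way to realize device-to-device weight loading is sketched below using `torch.distributed` broadcasts; this is an illustrative stand-in, not the TLManager/WeightManager implementation from the paper:

```python
import torch
import torch.distributed as dist

def d2d_load_weights(model: torch.nn.Module, src_rank: int = 0) -> None:
    """Fill a newly launched instance's parameters by broadcasting each tensor
    from an already-running instance (src_rank) over the device interconnect,
    instead of reloading from disk or host memory. Assumes a NCCL process group
    spanning the running and joining instances has already been initialized."""
    for _, param in model.named_parameters():
        dist.broadcast(param.data, src=src_rank)    # direct device-to-device copy
    for _, buf in model.named_buffers():
        dist.broadcast(buf, src=src_rank)
```

In this sketch the joining instance would first construct the model on its GPU with uninitialized parameters, which the broadcasts then overwrite in place.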

5. Performance Evaluation

Experimental evaluation provides quantitative measurements of HyperFlexis across multiple dimensions:

| Metric | Attainment / Reduction | Baseline Comparison |
|---|---|---|
| SLO attainment | Up to 4.44× higher | vs. state-of-the-art |
| Request latency | Up to 65.82% lower | vs. round-robin / SCORPIO |
| Cost efficiency | Parity or ~50% reduction | vs. round-robin / SCORPIO |
| D2D scaling speed | Up to 19.39× cold-start reduction | vs. disk or CPU reload |

Additional results detail improved performance in priority-aware scheduling (up to 7.02× improvement under heavy load), robustness across monitor/scaling interval choices, and consistent gains in both collocated and P/D-disaggregated deployments. Cost is quantified as the product of cumulative active worker time and a per-unit cost, as defined in the paper.
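
Written out from that description (the paper's equation itself is referenced but not reproduced here, so this is a reconstruction from the prose):

$$\text{Cost} = c_{\text{unit}} \cdot \sum_{w \in \mathcal{W}} T^{\text{active}}_{w},$$

where $\mathcal{W}$ is the set of workers provisioned during the run, $T^{\text{active}}_{w}$ is worker $w$'s cumulative active time, and $c_{\text{unit}}$ is the per-unit cost.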

6. Implications and Future Directions

The integration of multi-SLO-aware scheduling, explicit stage decoupling, and rapid D2D scaling mechanisms in HyperFlexis yields substantial improvements in LLM serving environments. The current design meets heterogeneous service requirements while minimizing latency and cost, with experimental validation indicating strong practical utility.

The authors highlight several directions for future work: expanding HyperFlexis to heterogeneous models and hardware (including simultaneous serving of diverse model sizes), refining latency prediction under varied cross-node scenarios, and additional performance optimization across device classes. The anticipated code release will facilitate reproducibility and further exploration by both academic and industrial practitioners.

A plausible implication is that the architecture is readily extendable to forthcoming LLM workloads, including those with more complex SLO hierarchies and dynamic hardware capabilities.

7. Context Within Flexible System Design

Although it bears nominal similarity to strain-energy-based flexure synthesis in mechanism design (Koppen et al., 2021), HyperFlexis is distinct as an LLM serving system. Its defining innovations—priority mapping, rapid scaling, and D2D transfer—address high-throughput, low-latency requirements in AI inference infrastructure, not physical constraint satisfaction in manufactured components. This suggests that future convergence between algorithmic flexibility in software systems and adaptable topology optimization in hardware design may merit comparative study.

HyperFlexis represents an empirically validated answer to contemporary challenges in LLM serving strategy, balancing multi-SLO scheduling, cost-effective scaling, and adaptive pod role assignment in diverse deployment environments.
