Ascendra: Dynamic Request Prioritization
- Ascendra is a dynamic request prioritization system that jointly meets dual SLOs for TTFT and TBT, ensuring efficient LLM serving.
- It organizes GPUs into high- and low-priority pools and leverages real-time urgency modeling to offload urgent requests from the low- to the high-priority pool, balancing latency with throughput.
- Empirical results demonstrate up to 1.7× throughput improvement and reduced scheduling delays, outperforming existing frameworks under high-load conditions.
Ascendra is a dynamic request prioritization system for efficient LLM serving, designed to jointly satisfy service-level objectives for both time-to-first-token (TTFT) and time-between-tokens (TBT) across high-throughput, latency-sensitive workloads. Its architecture leverages real-time urgency modeling and resource partitioning to minimize violation rates for both key metrics, outperforming prior art in aggregate throughput and goodput under realistic serving conditions (Ikram et al., 29 Apr 2025).
1. Dual Service-Level Objectives: TTFT and TBT
Modern LLM serving workloads impose two primary latency requirements:
- Time-to-First-Token (TTFT) SLO: Maximum allowable time from request arrival to emission of the first token.
- Time-Between-Tokens (TBT) SLO: Upper bound on latency between successive token generations during autoregressive decoding.
A serving system is evaluated by goodput, defined as the fraction of requests meeting both TTFT and TBT SLOs, and by total throughput (tokens per second). Contemporary systems suffer trade-offs: frameworks like vLLM minimize TTFT through aggressive prefill prioritization, often at the risk of TBT violations caused by decode interruptions; decode-centric frameworks such as Sarathi-Serve prioritize decode batching for low TBT, but induce inflated TTFT by delaying admission of new requests until batch completion. Disaggregated designs such as DistServe decouple prefill and decode across separate GPUs but require expensive, high-bandwidth interconnects that impair resource efficiency. Thus, joint satisfaction of TTFT and TBT SLOs without resource overcommitment remains a central challenge (Ikram et al., 29 Apr 2025).
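As a concrete illustration of these two metrics, the sketch below computes goodput and aggregate throughput from per-request traces. The `Request` record, its field names, and the use of worst-case per-request TBT are illustrative assumptions, not the paper's measurement code:

```python
from dataclasses import dataclass

@dataclass
class Request:
    ttft: float      # observed time-to-first-token, seconds
    max_tbt: float   # worst observed time-between-tokens, seconds
    tokens: int      # tokens generated for this request

def goodput(reqs: list[Request], ttft_slo: float, tbt_slo: float) -> float:
    """Fraction of requests meeting BOTH the TTFT and TBT SLOs."""
    met = sum(1 for r in reqs if r.ttft <= ttft_slo and r.max_tbt <= tbt_slo)
    return met / len(reqs)

def throughput(reqs: list[Request], wall_time_s: float) -> float:
    """Aggregate tokens per second over the measurement window."""
    return sum(r.tokens for r in reqs) / wall_time_s
```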
2. Two-Tier GPU Partitioning
Ascendra organizes the GPU cluster into two disjoint pools:
- Low-Priority (LP) Instances: Configured for maximal throughput, admitting requests in a round-robin manner and exploiting out-of-order (OoO) batching (“piggybacked” prefill and decode) to maximize utilization. The risk is head-of-line blocking under heavy load, which leads to TTFT SLO violations.
- High-Priority (HP) Instances: Provisioned for minimal latency, with a single-batch ticketing mechanism that guarantees immediate prefill admission for any offloaded urgent request. HP forgoes throughput in favor of TTFT SLO adherence.
A centralized controller routes all new requests to LP by default. Each LP instance detects imminent TTFT deadline violations based on a dynamically computed urgency score and, when needed, offloads such requests back to the controller for HP processing. When HP is idle, it issues a ticket to opportunistically accept a freshly arriving request, boosting resource utilization under low load (Ikram et al., 29 Apr 2025).
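A minimal sketch of this control flow, assuming hypothetical LP/HP instance handles with `enqueue` and `has_ticket` methods; the actual controller interface is not specified in the source:

```python
import itertools

class Controller:
    def __init__(self, lp_instances, hp_instance):
        self.lp_rr = itertools.cycle(lp_instances)  # round-robin over the LP pool
        self.hp = hp_instance

    def admit(self, request):
        # When HP is idle it issues a ticket, letting a fresh arrival be
        # served directly on HP to smooth utilization under low load.
        if self.hp.has_ticket():
            self.hp.enqueue(request)
        else:
            next(self.lp_rr).enqueue(request)  # default path: LP, round-robin

    def offload(self, request):
        # LP instances hand back requests whose TTFT slack is running out;
        # HP's single-batch ticketing guarantees immediate prefill admission.
        self.hp.enqueue(request, urgent=True)
```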
3. Performance Modeling and Urgency Scoring
Ascendra constructs an analytical-regression model to predict prefill and decode batch latencies:
- Prefill: For a batch of $B$ prefill requests with prompt lengths $s_1, \dots, s_B$, the compute cost is modeled as
$$F_{\text{prefill}} = \sum_{i=1}^{B} \Big( c_{\text{attn}}\, h\, \Big\lceil \tfrac{s_i}{b} \Big\rceil\, s_i + c_{\text{lin}}\, s_i + c_{\text{ffn}}\, s_i \Big),$$
where $h$ is the transformer head size, $b$ is the FlashAttention block size, and $c_{\text{lin}}$, $c_{\text{ffn}}$ cover the linear and FFN operations.
- Decode: With $D$ ongoing decodes using effective sequence lengths $\tilde{s}_1, \dots, \tilde{s}_D$,
$$F_{\text{decode}} = \sum_{j=1}^{D} \big( c_{\text{attn}}\, h\, \tilde{s}_j + c_{\text{lin}} + c_{\text{ffn}} \big).$$
- Batch Latency Prediction:
$$T_{\text{batch}} = \alpha_m\, \frac{M_{\text{batch}}}{BW_{\text{mem}}} + \alpha_c\, \frac{F_{\text{batch}}}{P_{\text{peak}}} + \alpha_0,$$
with $BW_{\text{mem}}$ and $P_{\text{peak}}$ the hardware memory bandwidth and peak FLOPs, and the coefficients $\alpha_m, \alpha_c, \alpha_0$ fit via linear regression on live traces. The model achieves ≤10% relative error over diverse batch settings (Ikram et al., 29 Apr 2025).
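Under the linear form above, the coefficients can be fit with ordinary least squares. The sketch below is a minimal stand-in for the paper's live-trace fitting, with approximate A100 hardware constants as assumptions:

```python
import numpy as np

BW_MEM = 2.0e12      # A100-80GB HBM bandwidth, bytes/s (approximate)
PEAK_FLOPS = 312e12  # A100 FP16 tensor-core peak, FLOP/s (approximate)

def features(flops: float, mem_bytes: float) -> np.ndarray:
    # Normalized memory and compute terms plus an intercept.
    return np.array([mem_bytes / BW_MEM, flops / PEAK_FLOPS, 1.0])

def fit_coefficients(samples):
    """Least-squares fit of (a_m, a_c, a_0) from observed
    (flops, mem_bytes, latency_s) triples."""
    X = np.stack([features(f, m) for f, m, _ in samples])
    y = np.array([t for _, _, t in samples])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict_batch_latency(coef, flops, mem_bytes) -> float:
    """Predicted T_batch in seconds for a candidate batch."""
    return float(features(flops, mem_bytes) @ coef)
```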
For each request $r$ with arrival time $t_{\text{arr}}$, configured TTFT SLO $S_{\text{TTFT}}$, and current time $t$, the remaining TTFT slack is
$$\text{slack}(r) = \big(t_{\text{arr}} + S_{\text{TTFT}}\big) - \big(t + \hat{T}_{\text{prefill}}(r)\big),$$
with $\hat{T}_{\text{prefill}}(r)$ the estimated remaining prefill time. The urgency score is defined as $U(r) = -\text{slack}(r)$ (equivalently, requests are ranked by ascending slack), with higher values indicating greater urgency for TTFT SLO compliance (Ikram et al., 29 Apr 2025).
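The slack and urgency definitions translate directly into code; `predicted_prefill_s` is an assumed per-request field standing in for the regression model's estimate:

```python
import time

def ttft_slack(req, now: float | None = None) -> float:
    """Remaining TTFT slack: time left until the first-token deadline,
    net of the predicted remaining prefill time."""
    now = time.monotonic() if now is None else now
    deadline = req.arrival_time + req.ttft_slo
    return deadline - (now + req.predicted_prefill_s)

def urgency(req, now: float | None = None) -> float:
    # Higher urgency = less slack; sorting by descending urgency is EDF.
    return -ttft_slack(req, now)
```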
4. Dynamic Prioritization Algorithm
Scheduling on each LP instance proceeds through three main steps at every epoch:
a) Batch-Level Estimation: For every waiting request $w_i$, compute the predicted prefill and decode batch times using the regression model, evaluate the remaining slack $\text{slack}(w_i)$, and determine the urgency $U_i = -\text{slack}(w_i)$.
b) Out-of-Order Hybrid Selection: Sort the waiting set by descending urgency $U_i$ (i.e., Earliest Deadline First). Greedily pack the batch to fit within compute, memory, and token constraints (C, M, N) via a token budget heuristic:
```python
# Greedy EDF packing: admit the most urgent waiting requests first,
# subject to the compute (C), memory (M), and token (N) budgets.
selected = []
for w in sorted(waiting, key=urgency, reverse=True):
    if fits_in_remaining_budget(w, C, M, N):
        selected.append(w)
        deduct_budget(w, C, M, N)
```
c) Proactive Offloading: Any request whose slack dips below a configurable threshold is removed from the LP queue and offloaded to the controller for prompt HP dispatch, as sketched below. Offloaded requests are queued on HP; an idle HP instance issues tickets allowing the next inbound request direct HP service for utilization smoothing (Ikram et al., 29 Apr 2025).
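A minimal sketch of this offload check, reusing `ttft_slack` from the Section 3 sketch and the controller's `offload` hook; the threshold value is an assumed placeholder, not from the paper:

```python
SLACK_THRESHOLD_S = 0.2  # configurable LP->HP migration threshold (assumed)

def maybe_offload(lp_wait_queue, controller, now: float) -> None:
    """Remove requests whose TTFT slack has dipped below the threshold
    from the LP queue and hand them to the controller for HP dispatch."""
    for req in list(lp_wait_queue):       # copy: we mutate while iterating
        if ttft_slack(req, now) < SLACK_THRESHOLD_S:
            lp_wait_queue.remove(req)
            controller.offload(req)
```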
This dynamic system ensures that requests transition from low-urgency, high-throughput processing to high-urgency, low-latency processing as their TTFT slack declines, robustly maintaining joint SLO satisfaction even as load fluctuates.
5. Comparative Performance and Empirical Results
Experiments deployed Ascendra, vLLM, and Sarathi-Serve across three NVIDIA A100-80GB GPUs, serving Mistral-7B, LLaMA3.1-8B, and Qwen-14B under realistic Poisson arrival processes using the ShareGPT4 and LongBench benchmarks. Key findings (Ikram et al., 29 Apr 2025):
- Throughput: Ascendra achieved up to 1.7× higher token/sec throughput compared to both vLLM and Sarathi under high load while sustaining both SLOs.
- Goodput: At the 90% SLO threshold, Ascendra provided 19.1% higher goodput than vLLM and 17.4% higher than Sarathi-Serve on Mistral-7B; on LLaMA3.1-8B, it exceeded both baselines by 15.4%.
- TTFT and TBT SLOs: Ascendra kept p99 TTFT below 1s at arrival rates where vLLM experienced tails above 2s; mean TBT remained under 0.15s even with aggressive LP batching, leveraging continuous decode piggybacking and HP relief for urgent requests.
- Scheduling Delay: Urgent requests offloaded to HP experienced 4× lower scheduling delays compared to remaining on LP.
- Policy Ablation: Elastic batching on HP increased goodput by 5–10% at high QPS; Earliest Deadline First (EDF) outperformed Shortest Job First (SJF) policies by several goodput percentage points, underscoring the impact of deadline-driven scheduling.
- Resource Efficiency: Ascendra delivers substantial goodput and throughput improvements with no requirement for high-bandwidth interconnects, relying only on a small fraction of GPUs dedicated to HP tasks.
The table below summarizes core performance metrics:
| System | Goodput (vs. baselines) | Max Throughput (relative) | p99 TTFT (s) | Mean TBT (s) |
|---|---|---|---|---|
| Ascendra | +15–20% | up to 1.7× | <1 | <0.15 |
| vLLM / Sarathi-Serve | baseline | baseline (1×) | >2 | variable |
Ascendra's dynamic approach consistently outperforms static or single-metric baselines along both axes (Ikram et al., 29 Apr 2025).
6. Related Models and Theoretical Context
Ascendra builds on a broad body of dynamic prioritization and deadline-aware resource allocation. Patience-aware and expectation-aware scheduling models reorder request queues by either per-user patience or soft deadlines (Cardonha et al., 2013), demonstrating analytically that such policies, e.g., EDF, can maximally preserve “happiness” or minimize deadline violations under certain conditions. In Ascendra, urgency scoring by TTFT slack and dispatch to HP instances generalize this approach to SLO-centric LLM workloads with time-varying urgency.
Dynamic prioritization schemes akin to strict-priority queueing for spectrum sharing (Shnayder et al., 2014) ensure monotonicity and (in market scenarios) incentive compatibility, supporting the thesis that real-time, value-derived request scheduling can maximize system-wide efficiency subject to fairness or economic constraints.
A plausible implication is that methods of urgency quantification and out-of-order preemption—whether via patience indices, EDF, or market bidding—provide a unified lens through which to interpret the efficiency gains of Ascendra’s architecture when compared to FIFO or static scheduling models.
7. Limitations and Potential Extensions
Ascendra’s performance and design are contingent on several factors:
- Model fidelity: The execution time regression model requires accurate, continual fitting to maintain ≤10% error, especially under nonstationary loads.
- Queueing complexity: Dynamic prioritization (EDF-style) in large-scale environments may necessitate bucketed or approximate queue sorting (see the sketch after this list); overhead must be managed.
- Offload sensitivity: The placement and adaptation of the threshold for LP→HP migration directly impacts both resource utilization and SLO attainment.
- Resource fragmentation: Under heavy load, contention for HP capacity may risk starvation or head-of-line blocking for certain workloads, requiring ongoing adaptive tuning.
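As one concrete mitigation for the queueing-complexity concern above, deadlines can be hashed into fixed-width buckets so that insertion is O(1) and dispatch scans buckets in deadline order. This is a sketch of one such approximation, not the paper's implementation; the bucket width is an assumed tuning parameter:

```python
import collections
import math

class BucketedEDF:
    """Approximate EDF queue: exact ordering within a bucket is dropped
    in exchange for O(1) insertion."""
    def __init__(self, bucket_width_s: float = 0.05):
        self.width = bucket_width_s
        self.buckets = collections.defaultdict(collections.deque)

    def push(self, req, deadline_s: float) -> None:
        self.buckets[math.floor(deadline_s / self.width)].append(req)

    def pop_most_urgent(self):
        if not self.buckets:
            return None
        key = min(self.buckets)            # earliest non-empty bucket
        req = self.buckets[key].popleft()  # FIFO within a bucket
        if not self.buckets[key]:
            del self.buckets[key]
        return req
```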
Opportunities for further work include integrating patience-aware or expectation-aware enhancements for more general user-centric QoS optimization (Cardonha et al., 2013); deploying economic prioritization or truthful bidding schemes to address multi-tenant serving (Shnayder et al., 2014); and formally quantifying the trade-off between HP pool size and system-wide goodput under new request intensity distributions.
Ascendra constitutes a robust, empirically validated solution for joint TTFT/TBT SLO satisfaction and high-throughput LLM serving, establishing a new baseline for dynamic deadline-driven resource allocation in autoregressive inference platforms (Ikram et al., 29 Apr 2025).