Ascendra: Dynamic Request Prioritization

Updated 18 November 2025
  • Ascendra is a dynamic request prioritization system that jointly meets dual SLOs for TTFT and TBT, ensuring efficient LLM serving.
  • It organizes GPUs into high- and low-priority pools and leverages real-time urgency modeling to offload urgent requests, balancing latency with throughput.
  • Empirical results demonstrate up to 1.7× throughput improvement and reduced scheduling delays, outperforming existing frameworks under high-load conditions.

Ascendra is a dynamic request prioritization system for efficient LLM serving, designed to jointly satisfy service-level objectives for both time-to-first-token (TTFT) and time-between-tokens (TBT) across high-throughput, latency-sensitive workloads. Its architecture leverages real-time urgency modeling and resource partitioning to minimize violation rates for both key metrics, outperforming prior art in aggregate throughput and goodput under realistic serving conditions (Ikram et al., 29 Apr 2025).

1. Dual Service-Level Objectives: TTFT and TBT

Modern LLM serving workloads impose two primary latency requirements:

  • Time-to-First-Token (TTFT) SLO: Maximum allowable time from request arrival to emission of the first token.
  • Time-Between-Tokens (TBT) SLO: Upper bound on latency between successive token generations during autoregressive decoding.

A serving system is evaluated by goodput, defined as the fraction of requests meeting both TTFT and TBT SLOs, and by total throughput (tokens per second). Contemporary systems suffer trade-offs: frameworks like vLLM minimize TTFT through aggressive prefill prioritization, often at the risk of TBT violations caused by decode interruptions; decode-centric frameworks such as Sarathi-Serve prioritize decode batching for low TBT, but induce inflated TTFT by delaying admission of new requests until batch completion. Disaggregated designs such as DistServe decouple prefill and decode across separate GPUs but require expensive, high-bandwidth interconnects that impair resource efficiency. Thus, joint satisfaction of TTFT and TBT SLOs without resource overcommitment remains a central challenge (Ikram et al., 29 Apr 2025).
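
To make these metrics concrete, the following minimal sketch computes goodput and throughput from per-request measurements; the RequestStats fields and function signatures are illustrative, not taken from the paper:

from dataclasses import dataclass

@dataclass
class RequestStats:
    ttft: float      # seconds from arrival to first token
    max_tbt: float   # worst observed gap between successive tokens (seconds)
    tokens: int      # total tokens generated

def goodput(stats, ttft_slo, tbt_slo):
    """Fraction of requests meeting BOTH the TTFT and TBT SLOs."""
    met = sum(1 for r in stats if r.ttft <= ttft_slo and r.max_tbt <= tbt_slo)
    return met / len(stats)

def throughput(stats, wall_seconds):
    """Aggregate tokens per second over the measurement window."""
    return sum(r.tokens for r in stats) / wall_seconds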

2. Two-Tier GPU Partitioning

Ascendra organizes the GPU cluster into two disjoint pools:

  • Low-Priority (LP) Instances: Configured for maximal throughput, admitting requests in a round-robin manner and exploiting out-of-order (OoO) batching (“piggybacked” prefill and decode) to maximize utilization. The risk is head-of-line blocking under heavy load, which leads to TTFT SLO violations.
  • High-Priority (HP) Instances: Provisioned for minimal latency, with a single-batch ticketing mechanism that guarantees immediate prefill admission for any offloaded urgent request. HP forgoes throughput in favor of TTFT SLO adherence.

A centralized controller routes all new requests to LP by default. Each LP instance detects imminent TTFT deadline violations based on a dynamically computed urgency score and, when needed, offloads such requests back to the controller for HP processing. When HP is idle, it issues a ticket to opportunistically accept a freshly arriving request, boosting resource utilization under low load (Ikram et al., 29 Apr 2025).
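
A Python sketch of this control flow follows; the Controller class and the instance methods (has_open_ticket, admit, enqueue, queue_len) are hypothetical names for illustration, not APIs from the paper:

import itertools

class Controller:
    def __init__(self, lp_instances, hp_instances):
        self.lp_cycle = itertools.cycle(lp_instances)  # round-robin over the LP pool
        self.hp = hp_instances

    def on_new_request(self, req):
        # Under low load, an idle HP instance holds an open ticket and may
        # take a fresh arrival directly, smoothing utilization.
        for hp in self.hp:
            if hp.has_open_ticket():
                hp.admit(req)
                return
        next(self.lp_cycle).enqueue(req)  # default path: LP, round-robin

    def on_offload(self, req):
        # Urgent request escalated by an LP instance; HP's single-batch
        # design guarantees immediate prefill admission.
        target = min(self.hp, key=lambda h: h.queue_len())
        target.admit(req)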

3. Performance Modeling and Urgency Scoring

Ascendra constructs an analytical-regression model to predict prefill and decode batch latencies:

  • Prefill: For a batch of size $B_p$ and prompt lengths $\{l_i\}$,

$$M_p = \sum_{i=0}^{B_p-1} \left[ 2 s l_i + 3 s l_i \frac{l_i}{b} \right] + M_{\mathrm{GEMM}}, \qquad F_p = \sum_{i=0}^{B_p-1} 2 s l_i^2 + F_{\mathrm{GEMM}}$$

where $s$ is the transformer head size, $b$ is the FlashAttention block size, and $M_{\mathrm{GEMM}}, F_{\mathrm{GEMM}}$ cover the linear and FFN operations.

  • Decode: With $B_d$ ongoing decodes and effective sequence lengths $\{\hat l_i\}$,

$$M_d = \sum_{i=0}^{B_d-1} \left( 2 s \hat l_i + 2 s \right) + M_{\mathrm{GEMM}}, \qquad F_d = \sum_{i=0}^{B_d-1} 2 s \hat l_i + F_{\mathrm{GEMM}}$$

  • Batch Latency Prediction:

$$t_{\mathrm{pred}} = C_1 (t_M + t_F) + C_2 \max(t_M, t_F) + C_3 t_M + C_4 t_F + C_5$$

with $t_M = M / M_H$ and $t_F = F / F_H$ (where $M_H$ and $F_H$ are the hardware memory bandwidth and peak FLOPs), and coefficients $\{C_j\}$ fit via linear regression on live traces. The model achieves $\leq 10\%$ relative error over diverse batch settings (Ikram et al., 29 Apr 2025).
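
This model translates directly into code. The sketch below computes the analytical memory/FLOP estimates and fits the latency coefficients; the helper names are illustrative, and scikit-learn's LinearRegression is one plausible choice (the paper specifies only linear regression on live traces):

import numpy as np
from sklearn.linear_model import LinearRegression

def prefill_mem_flops(lengths, s, b, M_gemm, F_gemm):
    """Analytical memory traffic (M_p) and FLOPs (F_p) for a prefill batch."""
    M = sum(2 * s * l + 3 * s * l * (l / b) for l in lengths) + M_gemm
    F = sum(2 * s * l**2 for l in lengths) + F_gemm
    return M, F

def decode_mem_flops(eff_lengths, s, M_gemm, F_gemm):
    """Analytical memory traffic (M_d) and FLOPs (F_d) for a decode batch."""
    M = sum(2 * s * lh + 2 * s for lh in eff_lengths) + M_gemm
    F = sum(2 * s * lh for lh in eff_lengths) + F_gemm
    return M, F

def features(M, F, M_H, F_H):
    """Regression features; t_M, t_F are ideal memory- and compute-bound times."""
    t_M, t_F = M / M_H, F / F_H
    return [t_M + t_F, max(t_M, t_F), t_M, t_F]

def fit_predictor(samples, M_H, F_H):
    """Fit C_1..C_4 (with C_5 as the intercept) on (M, F, latency) trace samples."""
    X = np.array([features(M, F, M_H, F_H) for M, F, _ in samples])
    y = np.array([t for _, _, t in samples])
    return LinearRegression().fit(X, y)

def predict_latency(model, M, F, M_H, F_H):
    return float(model.predict(np.array([features(M, F, M_H, F_H)]))[0])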

For each request $i$ (with arrival time $a_i$, configured TTFT SLO $\Delta_i^{\mathrm{TTFT}}$, and current time $t_{\mathrm{now}}$), the remaining TTFT slack is

$$\mathrm{slack}_i = \Delta_i^{\mathrm{TTFT}} - (t_{\mathrm{now}} - a_i) - \tau_i^{\mathrm{prefill}}$$

with $\tau_i^{\mathrm{prefill}}$ the estimated remaining prefill time. The urgency score is defined as $U_i = -\mathrm{slack}_i$ (or equivalently $U_i = 1/(\mathrm{slack}_i + \varepsilon)$), with higher values indicating greater urgency for TTFT SLO compliance (Ikram et al., 29 Apr 2025).
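
In code, slack and urgency reduce to a few lines (a sketch; time.monotonic() and the request attribute names are assumptions for illustration):

import time

EPS = 1e-6  # guards the reciprocal form as slack approaches zero

def urgency(req, tau_prefill_est, now=None):
    """U_i = -slack_i: larger values mean the TTFT deadline is nearer."""
    now = time.monotonic() if now is None else now
    slack = req.ttft_slo - (now - req.arrival) - tau_prefill_est
    return -slack  # equivalently: 1.0 / (slack + EPS)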

4. Dynamic Prioritization Algorithm

Scheduling on each LP instance proceeds through three main steps at every epoch:

a) Batch-Level Estimation: For every waiting request $w_i$, compute $\tau_i^{\mathrm{prefill}}$ and $\tau_i^{\mathrm{decode}}$ using the regression model, evaluate $\mathrm{slack}_i$, and determine $U_i$.

b) Out-of-Order Hybrid Selection: Sort the waiting set $W$ by descending $U_i$ (i.e., Earliest Deadline First). Greedily pack the batch to fit within the compute, memory, and token constraints $(C, M, N)$ via a token budget heuristic:

# Greedy EDF-style packing: admit the most urgent waiting requests first,
# subject to the remaining compute (C), memory (M), and token (N) budgets.
selected = []
for w in sorted(W, key=lambda r: U[r], reverse=True):
    if fits_in_remaining_budget(w, C, M, N):
        selected.append(w)
        deduct_budget(w, C, M, N)

c) Proactive Offloading: Any $w_j$ whose slack dips below a configurable threshold $\theta$ is removed from the LP queue and offloaded to the controller for prompt HP dispatch. Offloaded requests are queued on HP (queue length $\leq 1$); an idle HP instance issues tickets allowing the next inbound request direct HP service for utilization smoothing (Ikram et al., 29 Apr 2025).
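
A sketch of this per-epoch offloading scan under the definitions above (THETA and the attribute/method names are illustrative; the paper leaves the threshold configurable):

THETA = 0.05  # slack threshold in seconds; a tunable deployment parameter

def proactive_offload(lp_queue, controller, now):
    """Escalate any LP request whose remaining TTFT slack falls below theta."""
    for w in list(lp_queue):  # iterate over a copy: the queue is mutated below
        slack = w.ttft_slo - (now - w.arrival) - w.tau_prefill_est
        if slack < THETA:
            lp_queue.remove(w)
            controller.on_offload(w)  # prompt HP dispatch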

This dynamic system ensures that requests transition from low-urgency, high-throughput processing to high-urgency, low-latency processing as their TTFT slack declines, robustly maintaining joint SLO satisfaction even as load fluctuates.

5. Comparative Performance and Empirical Results

Experiments deployed Ascendra, vLLM, and Sarathi-Serve across three NVIDIA A100-80GB GPUs, serving Mistral-7B, LLaMA3.1-8B, and Qwen-14B under realistic Poisson arrival processes using the ShareGPT4 and LongBench benchmarks. Key findings (Ikram et al., 29 Apr 2025):

  • Throughput: Ascendra achieved up to 1.7× higher tokens/s throughput than both vLLM and Sarathi-Serve under high load while sustaining both SLOs.
  • Goodput: At the 90% SLO threshold, Ascendra provided 19.1% higher goodput than vLLM and 17.4% higher than Sarathi-Serve on Mistral-7B; on LLaMA3.1-8B, its goodput exceeded both baselines by 15.4%.
  • TTFT and TBT SLOs: Ascendra kept p99 TTFT below 1s at arrival rates where vLLM experienced tails above 2s; mean TBT remained under 0.15s even with aggressive LP batching, leveraging continuous decode piggybacking and HP relief for urgent requests.
  • Scheduling Delay: Urgent requests offloaded to HP experienced 4× lower scheduling delays compared to remaining on LP.
  • Policy Ablation: Elastic batching on HP increased goodput by 5–10% at high QPS; Earliest Deadline First (EDF) outperformed Shortest Job First (SJF) policies by several goodput percentage points, underscoring the impact of deadline-driven scheduling.
  • Resource Efficiency: Ascendra delivers substantial goodput and throughput improvements with no requirement for high-bandwidth interconnects, relying only on a small fraction of GPUs dedicated to HP tasks.

The table below summarizes core performance metrics:

System         SLO Goodput (%)      Max Throughput (tokens/s)   p99 TTFT (s)   Mean TBT (s)
Ascendra       +15–20 over SOTA     up to 1.7× baseline          <1             <0.15
vLLM/Sarathi   baseline             baseline                     >2             variable

Ascendra's dynamic approach consistently outperforms static baselines and those tuned to a single latency metric along both axes (Ikram et al., 29 Apr 2025).

6. Relation to Prior Scheduling Models

Ascendra builds on a broad body of work in dynamic prioritization and deadline-aware resource allocation. Patience-aware and expectation-aware scheduling models reorder request queues by per-user patience or soft deadlines (Cardonha et al., 2013), showing analytically that such policies, e.g., EDF, can maximally preserve "happiness" or minimize deadline violations under certain conditions. In Ascendra, urgency scoring by TTFT slack and dispatch to HP instances generalize this approach to SLO-centric LLM workloads with time-varying urgency.

Dynamic prioritization schemes akin to strict-priority queueing for spectrum sharing (Shnayder et al., 2014) ensure monotonicity and (in market scenarios) incentive compatibility, supporting the thesis that real-time, value-derived request scheduling can maximize system-wide efficiency subject to fairness or economic constraints.

A plausible implication is that methods of urgency quantification and out-of-order preemption—whether via patience indices, EDF, or market bidding—provide a unified lens through which to interpret the efficiency gains of Ascendra’s architecture when compared to FIFO or static scheduling models.

7. Limitations and Potential Extensions

Ascendra’s performance and design are contingent on several factors:

  • Model fidelity: The execution time regression model requires accurate, continual fitting to maintain ≤10% error, especially under nonstationary loads.
  • Queueing complexity: Dynamic prioritization (EDF-style) in large-scale environments may necessitate bucketed or approximate queue sorting; overhead must be managed.
  • Offload sensitivity: The placement and adaptation of the threshold $\theta$ for LP→HP migration directly impact both resource utilization and SLO attainment.
  • Resource fragmentation: Under heavy load, contention for HP capacity may risk starvation or head-of-line blocking for certain workloads, requiring ongoing adaptive tuning.

Opportunities for further work include integrating patience-aware or expectation-aware enhancements for more general user-centric QoS optimization (Cardonha et al., 2013); deploying economic prioritization or truthful bidding schemes to address multi-tenant serving (Shnayder et al., 2014); and formally quantifying the trade-off between HP pool size and system-wide goodput under new request intensity distributions.

Ascendra constitutes a robust, empirically validated solution for joint TTFT/TBT SLO satisfaction and high-throughput LLM serving, establishing a new baseline for dynamic deadline-driven resource allocation in autoregressive inference platforms (Ikram et al., 29 Apr 2025).
