SLO-Aware LLM Inference (SLAI)
- SLAI is a framework that ensures LLM inference workloads meet explicit SLOs by managing latency, throughput, and reliability through integrated measurement and control.
- It employs real-time monitoring, predictive modeling, and adaptive control loops to perform latency sculpting, dynamic batching, and speculative decoding for efficient service.
- Empirical evaluations demonstrate that SLAI frameworks significantly reduce p95 latency and SLO violations while boosting throughput and overall system efficiency.
SLO-Aware LLM Inference (SLAI) refers to the integrated stack of techniques, algorithms, and system architectures that ensure LLM serving workloads consistently meet explicit Service-Level Objectives (SLOs) for latency, throughput, and reliability. The primary driver is that even small SLO violations, especially in tail latency, can compromise user experience and economic efficiency, particularly in production-scale deployments spanning heterogeneous hardware and workloads. SLAI frameworks integrate measurement, prediction, scheduling, profiling, and real-time control loops to systematically align inference engine behavior with multi-dimensional SLO constraints. Representative implementations include LatencyPrism (Yin et al., 14 Jan 2026), which targets online, non-intrusive batch-level anomaly detection and latency sculpting; adaptive SLO-oriented speculative decoding in SpecServe (Huang et al., 7 Mar 2025); bucket-based dynamic batching as in BucketServe (Zheng et al., 23 Jul 2025); and resource-aware placement and proactive SLO-aware rotary scheduling exemplified in SuperInfer (Yu et al., 28 Jan 2026).
1. Fundamental SLO Metrics in LLM Inference
LLM inference SLOs are typically specified in several interrelated metrics, reflecting both user-observable and system-level requirements:
- Time-to-First-Token (TTFT): Wall-clock time from request arrival to emission of the first output token, a critical interactive metric (Yin et al., 14 Jan 2026, Huang et al., 7 Mar 2025, Yu et al., 28 Jan 2026).
- Time-Per-Output-Token (TPOT)/Time-Between-Tokens (TBT): Per-token generation time; regulated as TPOT ≤ τ, where τ is a configurable per-token latency bound (Huang et al., 7 Mar 2025, Yu et al., 28 Jan 2026, Zheng et al., 23 Jul 2025).
- Percentile SLOs: Attainment is usually stated as “p95 latency” or an “SLO attainment rate,” e.g., “p95 ≤ 200 ms” or a minimum fraction of requests meeting their SLO (Yin et al., 14 Jan 2026, Huang et al., 7 Mar 2025, Yu et al., 28 Jan 2026, Zheng et al., 23 Jul 2025).
- Violated Fraction: Fraction of requests/batches exceeding the SLO (e.g., a 7.2% baseline violation rate vs. 1.1% with SLO-aware sculpting (Yin et al., 14 Jan 2026)).
- System Capacity: Maximum sustainable request rate under which SLO violation stays below a target (e.g., BucketServe sustaining a 1.93× higher request rate than baselines at 80% SLO attainment (Zheng et al., 23 Jul 2025)).
SLAI systems usually monitor these metrics in real time, triggering adaptive control, scheduling, or root-cause analysis if violations or anomalies are detected (Yin et al., 14 Jan 2026).
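As a concrete illustration, the metrics above can be computed offline from per-request timing traces. The trace tuple layout and the nearest-rank percentile used here are illustrative choices, not drawn from any of the cited systems:

```python
def percentile(values, p):
    """Nearest-rank percentile of a list of values (p in [0, 100])."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, int(round(p / 100.0 * len(s))) - 1))
    return s[k]

def slo_report(traces, ttft_slo, tpot_slo):
    """Compute TTFT/TPOT tail percentiles and the SLO violation fraction.

    Each trace is (arrival_ts, first_token_ts, last_token_ts, n_output_tokens),
    timestamps in seconds. A request violates its SLO if either its TTFT or
    its mean TPOT exceeds the corresponding bound.
    """
    ttfts, tpots, violations = [], [], 0
    for arrival, first, last, n_out in traces:
        ttft = first - arrival
        tpot = (last - first) / max(1, n_out - 1)  # mean time between tokens
        ttfts.append(ttft)
        tpots.append(tpot)
        if ttft > ttft_slo or tpot > tpot_slo:
            violations += 1
    return {
        "p95_ttft": percentile(ttfts, 95),
        "p95_tpot": percentile(tpots, 95),
        "violation_rate": violations / len(traces),
    }
```

Online systems track the same quantities over sliding windows rather than full traces, but the metric definitions are identical.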
2. Core Architectural and Algorithmic Components
SLAI frameworks operate through tightly integrated, multi-layer system components. A canonical architecture, as in LatencyPrism (Yin et al., 14 Jan 2026), includes:
Perception (Measurement) Layer:
- Non-intrusive instrumentation via eBPF hooks, dynamic ptrace probes, CUPTI/ROCm for GPU, and periodic telemetry collection. This enables high-resolution tracking of scheduling, kernel, and pipeline events without code modification or service restart.
Comprehension (Modeling & Alignment) Layer:
- Multi-domain trace alignment that merges CPU, Python, and GPU timestamps into a unified timeline.
- Classification of inference cycles into granular stages (Prefill, Decode), with explicit identification of pipeline bottlenecks using function signatures (e.g., forward_prefill) (Yin et al., 14 Jan 2026, Shen et al., 17 Mar 2025).
- Physically-informed baseline latency models, often trained via GBDT or similar regressors using features such as batch size, prompt+output length, and aggregate KV traffic.
Adaptation (Anomaly Detection & SLO Enforcement) Layer:
- Real-time monitoring of residual error between observed and predicted latency, smoothed via windowed moving averages.
- Dynamic control charts with computed upper/lower bounds for anomaly alerting; e.g., an upper control limit of μ + k·σ over a sliding window.
- If predicted violations occur, systems trigger deeper (high-fidelity) tracing or throttle lower-priority jobs, ensuring SLOs are enforced as hard constraints.
Root Cause Analysis:
- Per-operation cycle attribution via time-share, utilization, anomaly z-scores, and suspicion scoring, yielding ranked potential sources (e.g., kernel stalls or interconnect congestion).
This stack can be mapped, with local adaptations, to other major SLAI architectures: BucketServe’s bucket-based batching and dynamic splitting (Zheng et al., 23 Jul 2025); SuperInfer’s rotary scheduler and full-duplex memory management (Yu et al., 28 Jan 2026); and SpecServe’s analytic-constrained adaptive decoding (Huang et al., 7 Mar 2025).
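To make the baseline-modeling idea concrete, the sketch below fits a physically-informed latency baseline from observed samples. It uses a pure-Python least-squares linear model over batch size and token count as a stand-in for the GBDT regressor described above; the linear form and two-feature set are simplifying assumptions:

```python
def fit_baseline(samples):
    """Least-squares fit of T ≈ w0 + w1·batch + w2·tokens (pure Python).

    A linear stand-in for the GBDT baseline latency regressor; `samples`
    is a list of (batch_size, total_tokens, observed_latency_ms).
    """
    rows = [(1.0, b, t) for b, t, _ in samples]
    y = [lat for _, _, lat in samples]
    n = 3
    # Normal equations: (X^T X) w = X^T y
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    v = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            v[r] -= f * v[col]
    # Back substitution
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (v[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w

def predict(w, batch, tokens):
    """Predicted baseline latency for a (batch, tokens) operating point."""
    return w[0] + w[1] * batch + w[2] * tokens
```

The adaptation layer then compares live batch latencies against `predict(...)` and alerts on sustained positive residuals.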
3. Online Latency Sculpting, Anomaly Detection, and SLO Control
LatencyPrism exemplifies the state of the art in online latency sculpting (Yin et al., 14 Jan 2026):
- Latency Decomposition: Total inference time per batch is measured as T_batch = Σ_s T_s, where each T_s denotes a pipeline stage. Timestamps aligned at the granularity of function entry/exit points deliver precise, minimally-invasive attribution.
- Residual Monitoring and Sculpting: For each cycle, the Positive Prediction Error is PPE = max(0, T_observed − T_predicted), aggregated as a moving average. Dynamic window-based control charts determine if the system is deviating from its SLO.
- Triggering Response: If the error exceeds upper limits, the system initiates deep traces (stack, kernel stream capture) or performs batch throttling.
- Distinguishing Anomaly Types: By comparing to baseline models trained per XPU/type and updating on-the-fly using physical features, the system separates workload-driven noise from genuine breakdowns.
- Quantitative Effectiveness: Production deployment across 1,000+ NVIDIA A100 XPUs yielded p95 latency improvements (from 210 ms to 198 ms), reduced SLO violation rates (7.2% → 1.1%), and an anomaly-detection F1-score of 0.985 at under 0.1% added latency overhead.
This architectural style is notable for delivering zero-intrusion operation, millisecond-scale alerting, and actionable operator feedback at cluster scale.
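The residual-monitoring loop described above can be sketched as a windowed control chart over the positive prediction error. The window size, warm-up length, and k·σ threshold below are illustrative values, not LatencyPrism's actual configuration:

```python
from collections import deque
from statistics import mean, pstdev

class ResidualControlChart:
    """Windowed control chart over positive prediction error (PPE).

    Flags an anomaly when the residual between observed and predicted
    batch latency exceeds mu + k*sigma of the recent residual window.
    """

    def __init__(self, window=50, k=3.0, warmup=10):
        self.residuals = deque(maxlen=window)
        self.k = k
        self.warmup = warmup

    def observe(self, observed_ms, predicted_ms):
        """Record one cycle; return True if it should trigger escalation."""
        ppe = max(0.0, observed_ms - predicted_ms)  # positive prediction error
        alert = False
        if len(self.residuals) >= self.warmup:
            mu, sigma = mean(self.residuals), pstdev(self.residuals)
            alert = ppe > mu + self.k * max(sigma, 1e-9)
        self.residuals.append(ppe)
        return alert  # caller escalates to deep tracing or batch throttling
```

On an alert, the adaptation layer would switch from low-overhead "Sentinel" monitoring to high-fidelity tracing, mirroring the two-mode operation described in Section 5.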
4. Dynamic, SLO-Aware Decoding, Batching, and Scheduling
Beyond measurement and detection, SLAI emphasizes real-time adaptive mechanisms:
- Adaptive Speculative Decoding (SpecServe): Dynamically searches for the optimal speculative length (SL) during drafting, using closed-form models of drafting/verification time and token acceptance probability (Huang et al., 7 Mar 2025). SLOs are enforced as hard constraints during the SL search: candidate lengths whose predicted latency would violate the SLO are discarded, avoiding over-speculation that would bloat per-step latency.
- Dynamic Batch Sizing (BucketServe): Requests are bucketed by sequence length, and batch sizes per bucket are continuously optimized relative to current memory availability (“safe” margin) and real-time observed parameters. Splitting/merging buckets minimizes GPU utilization loss from heterogeneous input lengths (Zheng et al., 23 Jul 2025).
- SLO-Aware Rotary Scheduling (SuperInfer): In memory-constrained Superchip environments, “RotaSched” computes a per-request virtual lag time (VLT) to prioritize swap-in/out decisions, maximizing SLO attainment under strict KV-cache and bandwidth budgets. Proactive rotation is guided by SLO slack, not just head-of-line pressure (Yu et al., 28 Jan 2026).
- Iteration/Cycle-Aware Prioritization: Token/batch admission is based on predicted remaining slack to the request's SLO deadline; critical decode cycles are always scheduled, with non-critical work admitted only if resources suffice.
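A minimal sketch of bucket-based batch planning in the style described above: requests are grouped into sequence-length buckets, and each batch is sized so its padded KV-cache footprint stays under a safe memory margin. The per-token memory model and bucket width are assumptions for illustration, not BucketServe's actual parameters:

```python
def plan_buckets(requests, mem_budget_mb, kv_mb_per_token, safety=0.9,
                 bucket_width=256):
    """Group requests into length buckets and size batches under memory.

    requests: list of (request_id, seq_len). Each bucket pads its sequences
    to the bucket's upper bound; batch size is capped so the estimated
    KV-cache footprint stays within safety * mem_budget_mb.
    """
    buckets = {}
    for rid, seq_len in requests:
        ub = ((seq_len // bucket_width) + 1) * bucket_width  # bucket upper bound
        buckets.setdefault(ub, []).append(rid)
    batches = []
    for ub in sorted(buckets):
        per_seq_mb = ub * kv_mb_per_token       # padded KV cost per sequence
        max_bs = max(1, int(safety * mem_budget_mb / per_seq_mb))
        ids = buckets[ub]
        for i in range(0, len(ids), max_bs):    # split oversized buckets
            batches.append((ub, ids[i:i + max_bs]))
    return batches
```

Grouping by length bounds padding waste within each batch, which is the core of the utilization argument for bucketing.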
Extensive production and simulation evidence indicates that SLAI-style adaptive mechanisms outpace static or naively parameterized baselines in both SLO attainment and utilization, across varied models and heterogeneous hardware (Zheng et al., 23 Jul 2025, Yu et al., 28 Jan 2026, Huang et al., 7 Mar 2025).
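The SLO-constrained speculative-length search can be sketched using the standard closed-form estimate of expected accepted tokens per step, E(k) = (1 − α^(k+1)) / (1 − α) for acceptance probability α, with a linear step-cost model. Both models are simplifying assumptions for illustration, not SpecServe's exact formulation:

```python
def choose_spec_len(alpha, t_draft_ms, t_verify_ms, tpot_slo_ms, max_sl=8):
    """Pick the speculation length (SL) maximizing expected goodput while
    keeping expected per-token latency under the TPOT SLO.

    alpha: per-token draft acceptance probability (0 <= alpha < 1).
    Step cost model: step_time = k * t_draft_ms + t_verify_ms.
    Returns the best SL, or None if no SL can meet the SLO.
    """
    best_sl, best_goodput = None, 0.0
    for k in range(1, max_sl + 1):
        expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
        step_time = k * t_draft_ms + t_verify_ms
        per_token = step_time / expected_tokens
        if per_token > tpot_slo_ms:
            continue  # SLO is a hard constraint: discard violating SLs
        goodput = expected_tokens / step_time
        if goodput > best_goodput:
            best_sl, best_goodput = k, goodput
    return best_sl
```

Longer SLs raise expected accepted tokens with diminishing returns while step time grows linearly, so goodput peaks at a finite SL; the SLO filter then trims any candidates whose per-token latency would breach the bound.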
5. Evaluation, Overheads, and Quantitative Impact
SLAI frameworks consistently demonstrate significant improvements in empirical SLO satisfaction, throughput, and system utilization, while maintaining very low overhead:
| System | p95 Latency | SLO Violations (%) | Overhead |
|---|---|---|---|
| Baseline | 210 ms | 7.2 | -- |
| LatencyPrism | 198 ms | 1.1 | +0.1% latency |
- Overhead: LatencyPrism reports CPU overhead below 0.5% and latency overhead below 0.1% in “Sentinel” mode. High-fidelity probing (“Deep-Dive”) adds up to 7% CPU, but only during anomaly windows (Yin et al., 14 Jan 2026).
- SLO Attainment: Across 1,000,000 requests, SLO violations are cut by an order of magnitude (from 7.2% to 1.1%) (Yin et al., 14 Jan 2026).
- Detection F1-score: Anomaly detection maintains high precision (0.971), recall (0.999), and F1 (0.985) (Yin et al., 14 Jan 2026).
- Performance vs. Baselines: BucketServe achieves 3.58× throughput and supports 1.93× more requests per second than best-known baselines at the same SLO attainment (80% at p90 ≤ 200 ms) (Zheng et al., 23 Jul 2025).
- Responsiveness and Generality: Adaptivity sustains rapid detection (≤2 ms alert lag) and robust SLO tracking under batch size, workload, and architecture heterogeneity (Yin et al., 14 Jan 2026, Zheng et al., 23 Jul 2025, Yu et al., 28 Jan 2026).
6. Challenges, Trade-offs, and Future Directions
SLAI research highlights several ongoing technical challenges:
- Scalability and Heterogeneity: Scalability to multi-node, multi-XPU clusters demands decentralized coordination of measurement, control, and SLO enforcement, as well as integration with diverse LLM engines and hardware (XPUs, Superchips, etc.) (Yin et al., 14 Jan 2026, Zheng et al., 23 Jul 2025, Yu et al., 28 Jan 2026).
- Anomaly Attribution and Mitigation: Accurately separating incident-induced anomalies from background workload variance is critical; root-cause scoring must remain robust across complex, highly multiplexed serving environments.
- SLO Policy Flexibility: Current practice primarily focuses on tail-latency SLOs; other SLOs (e.g., cost per request, energy, or even composite utility functions) need additional model features and control logic.
- Minimizing Monitoring/Adaptation Overhead: As system complexity grows, minimizing the overhead from measurement (e.g., deep tracing) and online inference for adaptation while retaining high-frequency responsiveness becomes nontrivial.
- Hybrid, Hierarchical Control: Richer hierarchy-aware orchestrators (e.g., integrating node-local and cluster-global SLO models) are yet to be fully explored.
- Limitations: Current deployment is focused on batch (cycle) granularity; per-token granularity, improved clustering for bucketing, and model heterogeneity remain explicit directions for extension (Yin et al., 14 Jan 2026, Zheng et al., 23 Jul 2025).
Future work is expected to extend SLAI frameworks to cover more sophisticated multi-objective SLOs, tighter multi-node resource scaling, and integration with energy/user-experience-aware controls.
7. Representative Systems and Empirical Evidence
Selected systems that have defined the frontier of SLO-aware LLM inference include:
- LatencyPrism: Zero-intrusion cross-stack latency sculpting, deployed across 1,000+ NVIDIA A100 XPUs, delivering sub-0.1% latency overhead and a 0.985 anomaly-detection F1-score (Yin et al., 14 Jan 2026).
- BucketServe: Bucket-based dynamic batching, achieving up to 3.58× throughput improvement at roughly 1% scheduling overhead (Zheng et al., 23 Jul 2025).
- SuperInfer: Proactive rotary scheduling and full-duplex memory rotation achieve substantial TTFT SLO-attainment gains on GH200 Superchips (Yu et al., 28 Jan 2026).
- SpecServe: Online-adaptive speculative decoding boosts throughput under tight TPOT SLO constraints (Huang et al., 7 Mar 2025).
These systems provide definitive blueprints and reference data for constructing high-reliability, production-grade SLO-aware LLM inference platforms.
References:
- (Yin et al., 14 Jan 2026) LatencyPrism: Online Non-intrusive Latency Sculpting for SLO-Guaranteed LLM Inference
- (Huang et al., 7 Mar 2025) SpecServe: Efficient and SLO-Aware LLM Serving with Adaptive Speculative Decoding
- (Zheng et al., 23 Jul 2025) BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving
- (Yu et al., 28 Jan 2026) SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips