SLO-Aware Feedback: Control & Optimization
- SLO-aware feedback is a systems paradigm that embeds real-time performance metrics in automated control loops to ensure service level objectives are met.
- It incorporates domain-specific models and per-request adjustments that optimize resource allocation to reduce latency and operational costs.
- Empirical evaluations show significant reductions in SLO violations and resource expenses by dynamically tuning system configurations under varying workloads.
SLO-aware feedback is an overarching systems and optimization paradigm in which real-time observations of Service Level Objective (SLO) attainment close a feedback loop that dynamically steers system configuration, resource allocation, and workload management so as to maximize the likelihood that operational SLOs—quantitative guarantees on metrics such as end-to-end latency, tail response times, throughput, or accuracy levels—are satisfied. This design pattern appears across serverless memory tuners, LLM serving stacks, autoscalers, accelerator orchestration, 5G edge stacks, storage controllers, and collaborative vehicular systems. Core elements include direct measurement and modeling of SLO metrics, per-request or per-batch feedback, continuous adjustment or control driven by observed or predicted SLO violation probability, and explicit optimization under SLO constraints.
1. SLO-aware Feedback Control: Definition and General Principles
SLO-aware feedback control consists of embedding SLO metrics—such as high-percentile tail latencies, per-token throughput, or accuracy floors—into system optimization, typically via a continuous measurement-feedback-action loop. The process includes:
- SLO measurement: Directly observe operational metrics (e.g., 95th/99th percentile latency, goodput, error rate) per function, request, or workload segment.
- Predictive or compositional modeling: Use statistical models (performance predictors, queuing models, regression forests, Bayesian networks) to map configuration or resource allocation to SLO metrics.
- Constraint-driven adjustment: Select configurations (e.g., memory, quota, replica count, batching) that provably (or with high probability) keep measured or predicted SLO metrics within declared targets.
- Closed-loop operation: Reapply the measurement/modeling/adjustment process periodically or on events, integrating real-time system telemetry as feedback (a minimal sketch of such a loop follows at the end of this section).
This methodology subsumes both classical feedback control (e.g., PI/PID controllers) and application-aware resource optimization, extending them with workload-specific, per-SLO logic and typically non-convex or combinatorial search over discrete system parameters. It also enables rapid reaction to workload bursts, SLO drift, and environmental changes.
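As a concrete illustration of the loop above, the following minimal sketch drives a replica-count actuator from a windowed p99 latency measurement. The callbacks `measure_latencies` and `apply_config`, the thresholds, and the replica-count knob are illustrative assumptions, not the interface of any system surveyed here.

```python
# Minimal sketch of an SLO-aware measurement/modeling/adjustment loop.
# `measure_latencies()` and `apply_config(replicas)` are injected callbacks
# (assumed, not from any cited system); the actuator here is a replica count.
import time
import numpy as np

def slo_control_loop(measure_latencies, apply_config,
                     slo_p99_ms=200.0, interval_s=5.0,
                     min_replicas=1, max_replicas=64, rounds=None):
    replicas = min_replicas
    step = 0
    while rounds is None or step < rounds:
        latencies = measure_latencies()                      # 1. SLO measurement
        p99 = float(np.percentile(latencies, 99)) if len(latencies) else 0.0
        error = (p99 - slo_p99_ms) / slo_p99_ms              # 2. compare against the target
        if error > 0:                                        # 3. constraint-driven adjustment
            replicas = min(max_replicas,
                           replicas + max(1, int(np.ceil(error * replicas))))
        elif error < -0.3:                                   # ample headroom: reclaim resources
            replicas = max(min_replicas, replicas - 1)
        apply_config(replicas)                               # 4. actuate, then close the loop
        time.sleep(interval_s)
        step += 1
    return replicas
```

Real systems replace this simple proportional step with the domain-specific models and policies discussed in the following sections.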
2. Domain-specific SLO Models and Metrics
SLO-aware feedback is instantiated with domain- and application-specific SLO metrics and models:
- Serverless Function Latency: SLAM (Safaryan et al., 2022) targets a high-percentile (e.g., 95th) end-to-end application latency, composing per-function latency statistics into a call-graph-based estimator.
- GPU-driven Inference: HAS-GPU (Gu et al., 4 May 2025) predicts per-inference latency with a Graph Attention Network and enforces that the predicted latency stays within the declared SLO; cost minimization is performed subject to throughput and tail-latency constraints.
- Batch Analytics/Video Inference: Tangram (Peng et al., 2024) ensures that the latency bound is met for each batch and for all of its constituent patches.
- Multi-LLM Routers: PROTEUS (Bhatti et al., 27 Jan 2026) enforces user-declared per-query accuracy targets with a dual Lagrange-multiplier-driven RL policy (tracking accuracy vs. cost).
- Per-token Latency in LLMs: BrownoutServe (Hu et al., 23 Jul 2025), OrbitFlow (Ma et al., 5 Jan 2026), SpecServe (Huang et al., 7 Mar 2025), and Tempo (Zhang et al., 24 Apr 2025) focus on time-to-first-token (TTFT), time-per-output-token (TPOT), and other per-stage SLOs.
- Storage/Network/Accelerator Backends: QWin (Ma et al., 2021) and Arcus (Zhao et al., 2024) target backend 99th/99.9th-percentile (or stricter) tail latencies, converting SLOs into resource budgets (CPU cores, bandwidth, token-bucket rates) and applying per-interval feedback.
- Kubernetes Autoscalers: Explicitly collect high-percentile end-to-end latency, backlog depth, and resource utilization to trigger proportional scaling for SLO satisfaction (Punniyamoorthy et al., 29 Dec 2025); a windowed tail-latency monitor of this kind is sketched after this list.
- Vehicular/Edge/MEC: Hybrid SLOs include processing time, energy, and quality, typically composed via utility functions or SLO fulfillment indicators, with per-request feedback (Sedlak et al., 2024, Zhang et al., 27 Jan 2026).
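Common to these instantiations is a windowed estimate of a high-percentile SLO metric, together with the empirical violation probability, that feeds the controller. A minimal sketch, assuming a simple in-memory sliding window (class and method names are illustrative, not from any cited system):

```python
# Sliding-window monitor for a tail-latency SLO (illustrative sketch).
from collections import deque

class SloWindow:
    def __init__(self, slo_ms, percentile=99.0, max_samples=10_000):
        self.slo_ms = slo_ms
        self.percentile = percentile
        self.samples = deque(maxlen=max_samples)  # bounded window of recent latencies

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def tail_latency(self):
        """Empirical percentile (e.g., p99) over the current window."""
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        idx = min(len(ordered) - 1, int(len(ordered) * self.percentile / 100.0))
        return ordered[idx]

    def violation_probability(self):
        """Fraction of recent requests that exceeded the SLO target."""
        if not self.samples:
            return 0.0
        return sum(1 for s in self.samples if s > self.slo_ms) / len(self.samples)
```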
3. Feedback Loop Architectures and Algorithms
SLO-aware feedback controllers exhibit several architectural/algorithmic patterns:
| System | Feedback Loop Core | Actuation Targets |
|---|---|---|
| SLAM | Trace-based per-function profiling | FaaS function memory |
| HAS-GPU | Hybrid vertical/horizontal scaling | SM/time quota, pods |
| QWin | Per-window request analysis | CPU core assignment |
| Arcus | Hardware PI controller | Token-bucket rates |
| BrownoutServe | AIMD threshold tuning | MoE expert routing |
| OrbitFlow | ILP-based quick reconfiguration | Layer KV placement |
| Tempo | Density-first dynamic admission | Request scheduling |
| Kubernetes SLO-Auto | Proportional/signal-driven control | Replica count, nodes |
| Faro | Smoothed utility+probabilistic pred. | Replicas per job |
Across these systems, key elements include:
- Event-driven or periodic triggers: Feedback is invoked on arrival, request completion, SLO violation, or at fixed intervals.
- Fine-grained per-metric monitoring: Histograms, percentiles, and sliding-window averages adjust quickly to workload spikes or resource contention.
- Control policies: Additive-increase/multiplicative-decrease (AIMD), proportional control, integer linear programming (ILP), and online binary or combinatorial search (AIMD and proportional control steps are sketched below).
- Rapid actuation: Hardware-offloaded actions (Arcus), on-the-fly policy changes (BrownoutServe), and asynchronous configuration (SLAM).
The models are often stochastic and exploit offline/online profilers, neural predictors (HAS-GPU's RaPP, Faro's N-HiTS), or data-driven conditional probability tables (Bayesian networks in collaborative vehicle platoons).
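The AIMD and proportional policies listed above admit very compact realizations. The sketch below is a generic illustration under assumed knobs (a generation "quota" and a replica count); the constants and bounds are placeholders, not values from BrownoutServe or the Kubernetes autoscaler.

```python
# Illustrative control-policy primitives (assumed knobs and constants).

def aimd_step(quota, slo_violated, add=1.0, mult=0.5, lo=1.0, hi=128.0):
    """Additive increase while the SLO is met (regain quality/throughput),
    multiplicative decrease on a violation (shed load quickly)."""
    if slo_violated:
        return max(lo, quota * mult)
    return min(hi, quota + add)

def proportional_step(replicas, p99_ms, slo_ms, k_p=0.5, lo=1, hi=256):
    """Bounded proportional scaling on the relative SLO error."""
    error = (p99_ms - slo_ms) / slo_ms
    return int(min(hi, max(lo, round(replicas * (1.0 + k_p * error)))))
```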
4. Evaluation Metrics, Empirical Results, and Effectiveness
Systems implementing SLO-aware feedback realize substantial empirical gains:
- SLO Attainment: For example, SLAM achieves a high fraction of requests meeting their SLOs (Safaryan et al., 2022), HAS-GPU reduces SLO violations by 4.8× versus prior methods (violation rate 1.2%) (Gu et al., 4 May 2025), and BrownoutServe cuts SLO violations by 66–90% under bursts (Hu et al., 23 Jul 2025); a sketch of how such attainment and violation metrics are derived from a trace follows this list.
- Resource Efficiency and Cost: HAS-GPU delivers 10.8× lower GPU cost (Gu et al., 4 May 2025), SLAM's resource-optimized configurations reduce Lambda cost by 1.6–3× (Safaryan et al., 2022), and Arcus holds tail latency within 1% of the SLO threshold while achieving up to 45% lower 99.9th-percentile latency (Zhao et al., 2024).
- Throughput vs. SLO Trade-off: BrownoutServe increases MoE throughput by 1.58–2.07× while keeping accuracy loss within 5%, with fewer than 4% additional SLO violations in all cases (Hu et al., 23 Jul 2025).
- Adaptivity and Scalability: Many designs (QWin (Ma et al., 2021), Tempo (Zhang et al., 24 Apr 2025), PROTEUS (Bhatti et al., 27 Jan 2026)) achieve microsecond to millisecond response and scale to hundreds of jobs or functions, maintaining SLO compliance with minimal operator tuning.
- Fairness and Predictability: Faro enforces per-job SLOs with 2–23× fewer violations than per-job greedy baselines (dynamically smoothing utility step-functions for fast control) (Jeon et al., 2024).
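For reference, the attainment, violation-rate, and goodput figures above are typically derived from a request trace along the lines below; the trace format (latency plus success flag per request) is an assumption for illustration.

```python
# Evaluation metrics from a request trace: SLO attainment, violation rate,
# and goodput (SLO-compliant requests per second). Trace format is assumed.
def evaluate_slo(trace, slo_ms, window_s):
    met = [ok and (lat_ms <= slo_ms) for lat_ms, ok in trace]
    attainment = sum(met) / len(trace)
    violation_rate = 1.0 - attainment
    goodput = sum(met) / window_s
    return attainment, violation_rate, goodput

# Example: 4 requests observed over a 2 s window with a 200 ms SLO.
print(evaluate_slo([(120, True), (180, True), (250, True), (90, False)],
                   slo_ms=200, window_s=2.0))  # (0.5, 0.5, 1.0)
```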
5. Illustrative Techniques: Modeling, Optimization, and Measurement
SLO-aware feedback systems rely on modeling, prediction, and optimization techniques specific to their constraints:
- Percentile-based Modeling: High-percentile statistics (e.g., 95th, 99th) are systematically chosen as proxy metrics to absorb stochastic tail events and cold-start effects (Safaryan et al., 2022, Ma et al., 2021).
- Call Graph Aggregation: Serverless applications' latencies are modeled by call-graph analysis: parallel branches compose by max-aggregation, sequences by additive structure (Safaryan et al., 2022) (see the sketch after this list).
- Graph Neural Predictors: Fine-grained resource-to-latency mapping via GAT+MLP or similar architectures (HAS-GPU’s RaPP (Gu et al., 4 May 2025)).
- ILP and Constrained Optimization: OrbitFlow's per-decode-step ILP (minimize batch time under memory/SLO constraints) (Ma et al., 5 Jan 2026), BrownoutServe’s partitioning and assignment (Hu et al., 23 Jul 2025).
- Hybrid Symbolic-Empirical Approaches: Mixing closed-form controllers (AIMD, bounded P-control) with data-driven empirical policies (offline microbenchmarks, banded control for frequency scaling in GreenLLM (Liu et al., 22 Aug 2025)).
- Token-by-token or batchwise adaptation: In LLM serving (Tempo, SpecServe) SLO-aware feedback is exerted at both fine (token) and coarse (request, batch) timescales, using per-token measurements to re-tune schedules or batch composition (Zhang et al., 24 Apr 2025, Huang et al., 7 Mar 2025).
- Probabilistic Prediction and Sloppification: Faro’s “sloppification” relaxes plateaued utility functions to ensure the optimization process is tractable and responsive (Jeon et al., 2024).
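As an illustration of the call-graph aggregation rule above (sum over sequential stages, max over parallel branches), consider the following sketch; the tuple-based workflow representation and the per-function p95 estimates are assumptions for the example, not SLAM's actual data model.

```python
# Call-graph latency composition: sequential stages add, parallel branches max.
def estimate_latency(node, per_function_latency):
    """node is ('func', name) | ('seq', [children]) | ('par', [children])."""
    kind, payload = node
    if kind == 'func':
        return per_function_latency[payload]          # e.g., a per-function p95 estimate
    if kind == 'seq':
        return sum(estimate_latency(c, per_function_latency) for c in payload)
    if kind == 'par':
        return max(estimate_latency(c, per_function_latency) for c in payload)
    raise ValueError(f"unknown node kind: {kind!r}")

# Example workflow A -> (B || C) -> D with assumed p95 estimates in ms.
workflow = ('seq', [('func', 'A'),
                    ('par', [('func', 'B'), ('func', 'C')]),
                    ('func', 'D')])
p95_ms = {'A': 40.0, 'B': 120.0, 'C': 90.0, 'D': 30.0}
print(estimate_latency(workflow, p95_ms))  # 40 + max(120, 90) + 30 = 190.0
```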
6. Representative Applications across System Layers
SLO-aware feedback is implemented in diverse environments:
- Cloud and Serverless: SLAM automates memory sizing for Lambda workflows under end-to-end SLOs (Safaryan et al., 2022), while Kubernetes SLO-driven autoscaling augments standard HPA/VPA for SLO compliance with guardrails and explainability (Punniyamoorthy et al., 29 Dec 2025).
- Accelerator and Storage Backends: Arcus delivers microsecond-latency traffic shaping for shared accelerators with per-flow SLO guarantees under multi-tenant contention (Zhao et al., 2024), while QWin dynamically partitions CPU cores for latency-critical (LC) tenants in storage clusters (Ma et al., 2021).
- LLM Inference and Serving: Tempo, BrownoutServe, SpecServe, GreenLLM, and OrbitFlow apply SLO-aware adaptive batching, frequency scaling, speculative decoding, and KV-cache management to maximize SLO success rate and throughput for heterogeneous model serving (Zhang et al., 24 Apr 2025, Huang et al., 7 Mar 2025, Liu et al., 22 Aug 2025, Hu et al., 23 Jul 2025, Ma et al., 5 Jan 2026).
- Edge and Vehicular Systems: SMEC uses decomposed RAN/Edge feedback to guarantee deadline satisfaction under high-latency variance in 5G MEC (Zhang et al., 27 Jan 2026); BNs underpin probabilistic SLO-aware offloading in AV platoons (Sedlak et al., 2024).
7. Limitations, Trade-offs, and Future Directions
Limitations and open challenges in SLO-aware feedback systems include:
- Modeling Fidelity and Domains: Some systems (e.g., SLAM, QWin) assume statically profiled workload models; extensions to dynamic or time-varying workloads require online adaptation or retraining (Safaryan et al., 2022, Ma et al., 2021).
- Control Granularity: Discrete resource configuration (e.g., only eight AWS Lambda memory sizes) may limit solution optimality without continuous optimization (Safaryan et al., 2022).
- Complexity of Compositionality: As workflow graphs and resource interdependencies grow complex, tractable real-time solutions may require pruning (OrbitFlow prunes search space conservatively at minor cost (Ma et al., 5 Jan 2026)).
- Tradeoffs: There are often Pareto trade-offs between SLO compliance, resource cost, accuracy, and throughput. BrownoutServe exposes controllable tradeoff knobs to allow tuning along this frontier (Hu et al., 23 Jul 2025).
- Cross-domain Applicability: While SLO-aware feedback designs generalize, care is required in specifying how SLOs are monitored, composed, and enforced per domain/function.
Possible future directions suggested in the literature include more adaptive models (continuous optimization, retraining-aware feedback), richer SLO specification (deadline distribution, adaptive decay models), and broader generality to multi-modal objective spaces (cost, quality, and fairness simultaneously).
References:
(Safaryan et al., 2022, Gu et al., 4 May 2025, Peng et al., 2024, Bhatti et al., 27 Jan 2026, Ma et al., 5 Jan 2026, Zhang et al., 24 Apr 2025, Ma et al., 2021, Zhang et al., 27 Jan 2026, Zhao et al., 2024, Liu et al., 22 Aug 2025, Sedlak et al., 2024, Huang et al., 7 Mar 2025, Hu et al., 23 Jul 2025, Punniyamoorthy et al., 29 Dec 2025, Jeon et al., 2024)