
SLO-Aware Feedback: Control & Optimization

Updated 28 January 2026
  • SLO-aware feedback is a systems paradigm that embeds real-time performance metrics in automated control loops to ensure service level objectives are met.
  • It incorporates domain-specific models and per-request adjustments that optimize resource allocation to reduce latency and operational costs.
  • Empirical evaluations show significant reductions in SLO violations and resource expenses by dynamically tuning system configurations under varying workloads.

SLO-aware feedback is an overarching systems and optimization paradigm in which real-time observations of Service Level Objective (SLO) attainment close a feedback loop that dynamically steers system configuration, resource allocation, and workload management so as to maximize the likelihood that operational SLOs (quantitative guarantees on metrics such as end-to-end latency, tail response times, throughput, or accuracy levels) are satisfied. This design pattern appears across serverless memory tuners, LLM serving stacks, autoscalers, accelerator orchestration, 5G edge stacks, storage controllers, and collaborative vehicular systems. Core elements include direct measurement and modeling of SLO metrics, per-request or per-batch feedback, continuous adjustment or control based on observed or predicted SLO violation probability, and explicit optimization under SLO constraints.

1. SLO-aware Feedback Control: Definition and General Principles

SLO-aware feedback control consists of embedding SLO metrics (such as $p$-th-percentile tail latencies, per-token throughput, or accuracy floors) into system optimization, typically via a continuous measurement-feedback-action loop; a minimal generic sketch of such a loop follows the list below. The process includes:

  • SLO measurement: Directly observe operational metrics (e.g., 95th/99th percentile latency, goodput, error rate) per function, request, or workload segment.
  • Predictive or compositional modeling: Use statistical models (performance predictors, queuing models, regression forests, Bayesian networks) to map configuration or resource allocation to SLO metrics.
  • Constraint-driven adjustment: Select configurations (e.g., memory, quota, replica count, batching) that provably (or with high probability) keep measured or predicted SLO metrics within declared targets.
  • Closed-loop operation: Re-apply the measurement, modeling, and adjustment steps periodically or in response to events, integrating real-time system telemetry as feedback.
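
The following is a minimal, generic sketch of this measure-model-adjust loop. The callables `measure_latency_p95`, `predict_latency`, and `apply_config`, the candidate configuration list, and the 250 ms target are hypothetical placeholders, not interfaces from any cited system.

```python
import time

SLO_MS = 250.0                                   # assumed p95 latency target (ms)
CANDIDATE_CONFIGS = [128, 256, 512, 1024, 2048]  # assumed discrete configs (e.g., MB of memory)

def control_loop(measure_latency_p95, predict_latency, apply_config, interval_s=30.0):
    """Generic SLO-aware feedback loop: measure, model, adjust, repeat.

    measure_latency_p95() -> observed p95 latency (ms) over the last window
    predict_latency(cfg)  -> model-predicted p95 latency (ms) under config cfg
    apply_config(cfg)     -> actuate the chosen configuration
    All three callables are placeholders for system-specific hooks.
    """
    current = max(CANDIDATE_CONFIGS)             # start with the most generous config
    apply_config(current)
    while True:
        observed = measure_latency_p95()         # 1. SLO measurement
        # 2. Predictive modeling: configs the model expects to satisfy the SLO.
        feasible = [c for c in CANDIDATE_CONFIGS if predict_latency(c) <= SLO_MS]
        # 3. Constraint-driven adjustment: cheapest feasible config, falling back
        #    to the most generous one if the model deems none feasible.
        target = min(feasible) if feasible else max(CANDIDATE_CONFIGS)
        # Correct the model with observed telemetry: if the SLO is being violated
        # anyway, step up to the next larger configuration.
        if observed > SLO_MS:
            bigger = [c for c in CANDIDATE_CONFIGS if c > target]
            if bigger:
                target = min(bigger)
        if target != current:
            apply_config(target)
            current = target
        time.sleep(interval_s)                   # 4. closed-loop re-evaluation
```

Real systems replace this naive cheapest-feasible search with the profilers, predictors, and solvers described in the following sections.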

This methodology subsumes both classical feedback control (e.g., PI/PID controllers) and application-aware resource optimization, extending them with workload-specific, per-SLO logic and typically non-convex or combinatorial search over discrete system parameters. It also enables rapid reaction to workload bursts, SLO drift, and environmental changes.
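
For the classical-control side, the fragment below is a minimal discrete-time PI controller that maps tail-latency error to a resource budget (e.g., CPU cores or a token-bucket rate), in the spirit of the hardware PI controller used by Arcus (Zhao et al., 2024) but not its actual implementation; the gains, normalization, and bounds are illustrative assumptions.

```python
class TailLatencyPIController:
    """Discrete-time PI controller mapping tail-latency error to a resource budget."""

    def __init__(self, slo_ms, kp=0.5, ki=0.1, min_budget=1.0, max_budget=64.0):
        self.slo_ms = slo_ms
        self.kp, self.ki = kp, ki                # proportional / integral gains (assumed)
        self.min_budget, self.max_budget = min_budget, max_budget
        self.integral = 0.0
        self.budget = min_budget

    def update(self, observed_p99_ms):
        # Positive error: observed tail latency exceeds the SLO, so grant more resources.
        error = (observed_p99_ms - self.slo_ms) / self.slo_ms   # normalized error
        self.integral += error
        self.budget += self.kp * error + self.ki * self.integral
        # Clamp to the actuator's valid range (e.g., available cores or max token rate).
        self.budget = min(self.max_budget, max(self.min_budget, self.budget))
        return self.budget

# Example: tail latency spikes above a 10 ms SLO, then recovers.
ctrl = TailLatencyPIController(slo_ms=10.0)
for p99 in (9.0, 14.0, 18.0, 12.0, 9.5):
    print(f"p99={p99:4.1f} ms -> budget={ctrl.update(p99):5.2f}")
```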

2. Domain-specific SLO Models and Metrics

SLO-aware feedback is instantiated with domain- and application-specific SLO metrics and models:

  • Serverless Function Latency: SLAM (Safaryan et al., 2022) targets $p$-th-percentile end-to-end application latency, composing per-function $(\text{memory}, t_i(m))$ statistics into a call-graph-based estimator used to enforce (a toy composition sketch follows this list):

$$\mathbb{P}[\text{Latency} \leq \text{SLO}] \geq 95\%.$$

  • GPU-driven Inference: HAS-GPU (Gu et al., 4 May 2025) models per-inference latency as $L_i(x_i, t_i, b_i)$, predicted via a Graph Attention Network, and enforces $\forall i,\ L_i(\cdot) \leq \text{SLO}$; cost minimization is carried out subject to throughput and tail-latency constraints.
  • Batch Analytics/Video Inference: Tangram (Peng et al., 2024) ensures that, for each batch, $T_{\text{wait},i} + T_f(\text{batch}(i)) \leq \text{SLO}_i$ for all constituent patches.
  • Multi-LLM Routers: PROTEUS (Bhatti et al., 27 Jan 2026) enforces user-declared per-query accuracy targets $\tau$ with a dual Lagrange-multiplier-driven RL policy (tracking accuracy vs. cost).
  • Per-token Latency in LLMs: BrownoutServe (Hu et al., 23 Jul 2025), OrbitFlow (Ma et al., 5 Jan 2026), SpecServe (Huang et al., 7 Mar 2025), and Tempo (Zhang et al., 24 Apr 2025) focus on time-to-first-token (TTFT), time-per-output-token (TPOT), and other per-stage SLOs.
  • Storage/Network/Accelerator Backends: QWin (Ma et al., 2021) and Arcus (Zhao et al., 2024) target backend 99th/99.9th-percentile (or stricter) tail latencies, converting SLOs into resource budgets (CPU cores, bandwidth, token-bucket rates) and applying per-interval feedback.
  • Kubernetes Autoscalers: Explicitly collect $p$-th-percentile end-to-end latency, backlog depth, and resource utilization to trigger proportional scaling for SLO satisfaction (Punniyamoorthy et al., 29 Dec 2025).
  • Vehicular/Edge/MEC: Hybrid SLOs include processing time, energy, and quality, typically composed via utility functions or SLO fulfillment indicators, with per-request feedback (Sedlak et al., 2024, Zhang et al., 27 Jan 2026).
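
To make the compositional latency modeling concrete, the toy sketch below composes per-function latency samples over a small call graph, summing along sequential stages and taking the maximum across parallel branches before checking percentile attainment against an SLO. The graph, distributions, and numbers are invented for illustration and do not reproduce SLAM's estimator.

```python
import numpy as np

def sequence(*stages):
    """Sequential composition: end-to-end latency is the sum of the stages."""
    return np.sum(stages, axis=0)

def parallel(*branches):
    """Parallel fan-out/fan-in: end-to-end latency is the slowest branch."""
    return np.max(branches, axis=0)

# Per-function latency samples (ms) at some fixed configuration -- invented data.
rng = np.random.default_rng(0)
f_auth   = rng.gamma(shape=2.0, scale=10.0, size=10_000)
f_resize = rng.gamma(shape=3.0, scale=15.0, size=10_000)
f_store  = rng.gamma(shape=2.5, scale=12.0, size=10_000)

# Toy application call graph: auth, then (resize || store).
app_latency = sequence(f_auth, parallel(f_resize, f_store))

slo_ms = 150.0
attainment = np.mean(app_latency <= slo_ms)
print(f"p95 latency: {np.percentile(app_latency, 95):.1f} ms, "
      f"SLO attainment: {attainment:.1%}")
```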

3. Feedback Loop Architectures and Algorithms

SLO-aware feedback controllers exhibit several architectural/algorithmic patterns:

| System | Feedback Loop Core | Actuation Targets |
| --- | --- | --- |
| SLAM | Trace-based per-function profiling | FaaS function memory |
| HAS-GPU | Hybrid vertical/horizontal scaling | SM/time quota, pods |
| QWin | Per-window request analysis | CPU core assignment |
| Arcus | Hardware PI controller | Token-bucket rates |
| BrownoutServe | AIMD threshold tuning | MoE expert routing |
| OrbitFlow | ILP-based quick reconfiguration | Layer KV placement |
| Tempo | Density-first dynamic admission | Request scheduling |
| Kubernetes SLO-Auto | Proportional/signal-driven control | Replica count, nodes |
| Faro | Smoothed utility + probabilistic prediction | Replicas per job |
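
As one concrete actuation policy from the table above, a proportional replica controller in the spirit of the Kubernetes SLO-Auto row can be sketched as follows; the scaling rule, bounds, and numbers are generic assumptions rather than the cited autoscaler's exact logic.

```python
import math

def desired_replicas(current, observed_p95_ms, target_p95_ms,
                     min_replicas=1, max_replicas=100):
    """Proportional scaling: scale replicas by the ratio of observed to target
    p95 latency (a generic rule, analogous in spirit to utilization-based HPA)."""
    if observed_p95_ms <= 0:
        return current
    proposal = math.ceil(current * observed_p95_ms / target_p95_ms)
    return max(min_replicas, min(max_replicas, proposal))

# Observed p95 of 380 ms against a 250 ms target scales 4 replicas up to 7.
print(desired_replicas(current=4, observed_p95_ms=380.0, target_p95_ms=250.0))
```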

Across these systems, key elements include:

  • Event-driven or periodic triggers: Feedback is invoked on arrival, request completion, SLO violation, or at fixed intervals.
  • Fine-grained per-metric monitoring: Histograms, percentiles, and sliding-window averages capture workload spikes or resource contention quickly enough for timely adjustment.
  • Control policies: Additive-increase/multiplicative-decrease (AIMD), proportional control, integer linear programming (ILP), and online binary or combinatorial search (a minimal AIMD step is sketched at the end of this section).
  • Rapid actuation: Hardware-offloaded actions (Arcus), on-the-fly policy changes (BrownoutServe), and asynchronous configuration (SLAM).

The models are often stochastic and exploit offline/online profilers, neural predictors (HAS-GPU's RaPP, Faro's N-HiTS), or data-driven conditional probability tables (Bayesian networks in collaborative platoons).
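
As a concrete instance of the AIMD pattern referenced above, the sketch below additively relaxes a control knob while SLO violations stay below a target rate and multiplicatively tightens it otherwise, loosely in the style of BrownoutServe's AIMD threshold tuning (Hu et al., 23 Jul 2025); the knob semantics, step sizes, and bounds are illustrative assumptions.

```python
def aimd_step(violation_rate, knob, target_violation=0.01,
              add_step=0.02, mult_factor=0.5, lo=0.0, hi=1.0):
    """One AIMD update of a bounded control knob.

    violation_rate: fraction of requests that missed their SLO in the last window.
    knob:           current value (e.g., assumed share of full-quality processing).
    """
    if violation_rate > target_violation:
        knob *= mult_factor          # multiplicative decrease: back off under SLO pressure
    else:
        knob += add_step             # additive increase: cautiously restore headroom
    return min(hi, max(lo, knob))

# Example: three monitoring windows with a violation burst in the middle.
knob = 0.8
for rate in (0.0, 0.12, 0.0):
    knob = aimd_step(rate, knob)
    print(f"violation_rate={rate:.2f} -> knob={knob:.2f}")
```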

4. Evaluation Metrics, Empirical Results, and Effectiveness

Systems implementing SLO-aware feedback realize substantial empirical gains:

  • SLO Attainment: For example, SLAM achieves $\geq 95\%$ of requests meeting SLOs (Safaryan et al., 2022), HAS-GPU reduces SLO violations by 4.8× vs. previous methods (violation rate $\approx 1.2\%$) (Gu et al., 4 May 2025), and BrownoutServe cuts SLO violations by 66–90% under bursts (Hu et al., 23 Jul 2025).
  • Resource Efficiency and Cost: HAS-GPU delivers 10.8× lower GPU cost (Gu et al., 4 May 2025), SLAM's resource-optimized configurations reduce Lambda cost by 1.6–3× (Safaryan et al., 2022), and Arcus keeps tail latency within 1% of the threshold while achieving up to 45% lower 99.9th-percentile latency (Zhao et al., 2024).
  • Throughput vs. SLO Trade-off: BrownoutServe increases MoE throughput 1.58–2.07× while sustaining $\leq 5\%$ accuracy loss, with sub-4% additional SLO violations in all cases (Hu et al., 23 Jul 2025).
  • Adaptivity and Scalability: Many designs (QWin (Ma et al., 2021), Tempo (Zhang et al., 24 Apr 2025), PROTEUS (Bhatti et al., 27 Jan 2026)) achieve microsecond to millisecond response and scale to hundreds of jobs or functions, maintaining SLO compliance with minimal operator tuning.
  • Fairness and Predictability: Faro enforces per-job SLOs with 2–23× fewer violations than per-job greedy baselines (dynamically smoothing utility step-functions for fast control) (Jeon et al., 2024).

5. Illustrative Techniques: Modeling, Optimization, and Measurement

SLO-aware feedback systems rely on modeling, prediction, and optimization techniques specific to their constraints:

  • Percentile-based Modeling: High-percentile statistics (e.g., 95th, 99th) are systematically chosen as proxy metrics to absorb stochastic tail events and cold-start effects (Safaryan et al., 2022, Ma et al., 2021).
  • Call Graph Aggregation: Serverless applications’ latencies are modeled by call-graph analysis: parallel branches compose by max-aggregation, sequences by additive structure (Safaryan et al., 2022).
  • Graph Neural Predictors: Fine-grained resource-to-latency mapping via GAT+MLP or similar architectures (HAS-GPU’s RaPP (Gu et al., 4 May 2025)).
  • ILP and Constrained Optimization: OrbitFlow's per-decode-step ILP (minimize batch time under memory/SLO constraints) (Ma et al., 5 Jan 2026), BrownoutServe’s partitioning and assignment (Hu et al., 23 Jul 2025).
  • Hybrid Symbolic-Empirical Approaches: Mixing closed-form controllers (AIMD, bounded P-control) with data-driven empirical policies (offline microbenchmarks, banded control for frequency scaling in GreenLLM (Liu et al., 22 Aug 2025)).
  • Token-by-token or batchwise adaptation: In LLM serving (Tempo, SpecServe) SLO-aware feedback is exerted at both fine (token) and coarse (request, batch) timescales, using per-token measurements to re-tune schedules or batch composition (Zhang et al., 24 Apr 2025, Huang et al., 7 Mar 2025).
  • Probabilistic Prediction and Sloppification: Faro’s “sloppification” relaxes plateaued utility functions to ensure the optimization process is tractable and responsive (Jeon et al., 2024).
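
The sketch below illustrates the general idea behind such utility relaxation: a hard step utility that plateaus once a requirement is met gives the optimizer no signal near the knee, whereas a smoothed (logistic) version does. The parameterization is an invented illustration, not Faro's actual sloppification.

```python
import math

def step_utility(alloc, required):
    """Idealized SLO utility: full value once the allocation meets the requirement."""
    return 1.0 if alloc >= required else 0.0

def smoothed_utility(alloc, required, softness=0.15):
    """Logistic relaxation of the step utility.

    Near the knee, small allocation changes now change utility, giving the
    optimizer a usable gradient instead of a flat plateau. 'softness' is an
    assumed knob controlling the ramp width relative to the requirement.
    """
    scale = softness * required
    return 1.0 / (1.0 + math.exp(-(alloc - required) / scale))

for replicas in (6, 8, 10, 12):
    print(replicas,
          step_utility(replicas, required=10),
          round(smoothed_utility(replicas, required=10), 3))
```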

6. Representative Applications across System Layers

SLO-aware feedback is implemented in diverse environments, spanning serverless platforms (SLAM), GPU and accelerator orchestration (HAS-GPU, Arcus), LLM serving stacks and routers (BrownoutServe, OrbitFlow, SpecServe, Tempo, PROTEUS), batch and video analytics (Tangram), storage and network backends (QWin), Kubernetes autoscaling, and vehicular/edge/MEC systems, as detailed in the preceding sections.

7. Limitations, Trade-offs, and Future Directions

Limitations and open challenges in SLO-aware feedback systems include:

  • Modeling Fidelity and Domains: Some systems (e.g., SLAM, QWin) assume statically profiled workload models; extensions to dynamic or time-varying workloads require online adaptation or retraining (Safaryan et al., 2022, Ma et al., 2021).
  • Control Granularity: Discrete resource configuration (e.g., only eight AWS Lambda memory sizes) may limit solution optimality without continuous optimization (Safaryan et al., 2022).
  • Complexity of Compositionality: As workflow graphs and resource interdependencies grow complex, tractable real-time solutions may require pruning (OrbitFlow prunes search space conservatively at minor cost (Ma et al., 5 Jan 2026)).
  • Trade-offs: There are often Pareto trade-offs between SLO compliance, resource cost, accuracy, and throughput. BrownoutServe exposes controllable trade-off knobs $(k, \text{threshold})$ to allow tuning along this frontier (Hu et al., 23 Jul 2025).
  • Cross-domain Applicability: While SLO-aware feedback designs generalize, care is required in specifying how SLOs are monitored, composed, and enforced per domain/function.

Possible future directions suggested in the literature include more adaptive models (continuous optimization, retraining-aware feedback), richer SLO specification (deadline distribution, adaptive decay models), and broader generality to multi-modal objective spaces (cost, quality, and fairness simultaneously).


References:

(Safaryan et al., 2022, Gu et al., 4 May 2025, Peng et al., 2024, Bhatti et al., 27 Jan 2026, Ma et al., 5 Jan 2026, Zhang et al., 24 Apr 2025, Ma et al., 2021, Zhang et al., 27 Jan 2026, Zhao et al., 2024, Liu et al., 22 Aug 2025, Sedlak et al., 2024, Huang et al., 7 Mar 2025, Hu et al., 23 Jul 2025, Punniyamoorthy et al., 29 Dec 2025, Jeon et al., 2024)
