SLO-Aware Dynamic Frequency Scaling
- SLO-aware DFS is a dynamic method that adjusts hardware frequency and voltage based on strict SLOs to balance energy efficiency with performance.
- It employs profiling, predictive modeling, and real-time feedback control to safely reduce energy consumption without breaching response time or throughput limits.
- Reported energy savings range from 4–7% on CPU benchmarks to as much as 34–44% for GPU LLM serving, with minimal performance impact.
Service-Level Objective (SLO)-aware Dynamic Frequency Scaling (DFS) refers to runtime adaptation of processor or accelerator frequency and voltage parameters, subject to domain-appropriate performance constraints formulated as SLOs. These techniques balance energy efficiency with strict guarantees on response time, throughput, or other user-facing metrics. Recent work extends classic DVFS (Dynamic Voltage and Frequency Scaling) by fusing profiling, online inference, and feedback control with SLO compliance, achieving significant energy reduction in both CPU and modern GPU contexts.
1. Fundamental Concepts
SLO-aware DFS dynamically tunes hardware performance parameters (voltage and frequency on CPUs; Streaming Multiprocessor clocks and engine parallelism on GPUs) so that specified SLOs, such as a maximum average or tail latency per request or a minimum token rate, continue to be met. SLOs are typically formalized as upper bounds on response-time percentiles (such as p99 end-to-end latency, or per-iteration time between tokens in LLM inference) and are dictated by application semantics or user-experience constraints.
DVFS techniques, when applied naïvely, risk breaching performance constraints or incurring excessive runtime overhead. SLO-aware strategies mitigate this through predictive modeling, fine-grained phase detection, and admission and adaptation protocols designed to keep the workload within its SLOs.
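To make the percentile formulation concrete, the following minimal sketch checks a window of latency samples against a percentile bound. The nearest-rank percentile definition, the function names, and the sample values are illustrative assumptions, not taken from any of the cited systems.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def meets_slo(latencies_ms, q, bound_ms):
    """True if the q-th percentile latency is within the SLO bound."""
    return percentile(latencies_ms, q) <= bound_ms


# Hypothetical window of request latencies (ms) with one slow outlier.
window = [12, 15, 11, 14, 13, 90, 12, 13, 14, 12]
median_ok = meets_slo(window, 50, 20)  # the median is unaffected by the outlier
tail_ok = meets_slo(window, 99, 20)    # the p99 check is dominated by the outlier
```

The same window can pass a median-based SLO and fail a tail-based one, which is why the choice of percentile matters when deciding how far frequency can be lowered.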
2. Methodological Approaches
SLO-aware DFS methodologies span several key strategies:
- Profiling and Modeling:
- For CPU systems (e.g., Yadav et al., 2019), profiling captures phase-dependent metrics such as memory accesses per instruction (MAPI). Offline characterization across all P-states produces look-up tables (LUTs) that map workload features to "safe" frequencies, i.e., frequencies whose slowdown stays within a set bound.
- For GPU-based LLM inference (Kakolyris et al., 2024, Liu et al., 22 Aug 2025), system-specific microbenchmarks are used to fit latency and power models as a function of frequency, engine size, and working set (e.g., KV-cache footprint, batch size).
- Prediction and Admission Control:
- Workload phases are dynamically profiled. Predictive models (e.g., XGBoost regressors for per-iteration throughput in "throttLL'eM" (Kakolyris et al., 2024)) or latency–power polynomials (in GreenLLM (Liu et al., 22 Aug 2025)) forecast the effect of DFS actions before they are applied.
- At admission, new queries are simulated at maximum frequency to preempt SLO violations; only after passing the SLO checks are they subject to frequency reduction attempts.
- Fine-Grained, Feedback-Driven Adjustment:
- Timeslice partitioning: CPU workloads adapt the frequency at sub-second granularity (10–100 ms timeslice) based on recent MAPI history, maintaining per-slice and cumulative slowdown within user-set SLOs (Yadav et al., 2019).
- GPU LLM serving splits control across prompt (“prefill”) and decode phases. Prefill frequencies are assigned for prompt length classes by solving energy minimization under queuing-based tail-latency constraints (Liu et al., 22 Aug 2025). Decode employs a dual-loop controller: coarse-grained throughput bucketing, and a fine-grained feedback loop tracking p95 time-between-tokens (TBT), with hysteresis and sub-20 ms correction lags.
- Discrete Search and Controller Design:
- For fast search over discrete frequency settings, binary search is commonly applied (as in "throttLL'eM" (Kakolyris et al., 2024)). This works because, for a single SLO, feasibility is monotone in frequency: if a setting meets the SLO, every higher setting is expected to meet it as well.
- Controllers exploit workload characteristics: for memory-bound decode loops, frequency can be throttled aggressively, whereas compute-bound prefill requires maintaining higher clocks.
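The admission step listed above can be sketched as follows. This is a toy illustration only: `predicted_tbt_ms` is a hypothetical stand-in for a learned performance predictor, and the maximum frequency and model coefficients are invented for the example.

```python
F_MAX_MHZ = 1400  # hypothetical maximum GPU clock


def predicted_tbt_ms(batch_size, freq_mhz):
    """Toy stand-in for a learned predictor: time between tokens grows
    with batch size and shrinks in proportion to clock frequency."""
    return (8.0 + 0.5 * batch_size) * (F_MAX_MHZ / freq_mhz)


def admit(current_batch, tbt_slo_ms):
    """Admit a new query only if the enlarged batch is predicted to meet
    the TBT SLO at maximum frequency. Lowering the frequency is only
    considered after this check has passed."""
    return predicted_tbt_ms(current_batch + 1, F_MAX_MHZ) <= tbt_slo_ms
```

If a query would violate the SLO even at the highest clock, no frequency setting can rescue it, so it is better queued or rejected than admitted.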
3. Algorithmic Foundations and Formulations
The core of SLO-aware DFS algorithms is the mapping from phase or workload metrics to frequency settings, subject to SLO constraints. Canonical examples:
CPU Timeslice Mapping (Yadav et al., 2019)
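- Profiling: an offline pass maps MAPI to the slowdown observed at each P-state.
- Runtime algorithm, for each timeslice:
  1. Predict MAPI from the most recent observations.
  2. Select a frequency from the LUT such that the projected cumulative slowdown stays within the SLO; if it does not, step up to a higher frequency.
  3. Enact the frequency change (via a privileged register write).
  4. Monitor overhead: control actions introduce less than 1% runtime penalty.
- Mathematical model: the selected frequency is constrained so that both the per-slice slowdown and the cumulative slowdown, as predicted from the MAPI-indexed LUT, remain within the user-set SLO bound.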
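A minimal sketch of such a timeslice controller is shown below. The LUT contents, the frequency list, the moving-average MAPI predictor, and the definition of cumulative slowdown as the mean of per-slice slowdowns are all assumptions made for illustration; they are not the values or the exact model of Yadav et al. (2019).

```python
from collections import deque

FREQS_GHZ = [1.2, 1.6, 2.0, 2.4]  # hypothetical P-states, ascending

# Hypothetical LUT rows: (MAPI upper bound, {frequency: profiled slowdown}).
LUT = [
    (0.01, {1.2: 0.30, 1.6: 0.18, 2.0: 0.08, 2.4: 0.0}),          # compute-bound
    (0.05, {1.2: 0.12, 1.6: 0.06, 2.0: 0.02, 2.4: 0.0}),
    (float("inf"), {1.2: 0.03, 1.6: 0.01, 2.0: 0.0, 2.4: 0.0}),   # memory-bound
]


def lookup_slowdowns(mapi):
    """Return the profiled slowdown-per-frequency row for a MAPI value."""
    for upper, row in LUT:
        if mapi <= upper:
            return row


class TimesliceController:
    def __init__(self, slo_slowdown, window=4):
        self.slo = slo_slowdown
        self.history = deque(maxlen=window)  # recent MAPI observations
        self.slices = 0
        self.total_slowdown = 0.0            # sum of predicted per-slice slowdowns

    def next_frequency(self, observed_mapi):
        """Pick the lowest frequency that keeps per-slice and cumulative
        (mean) slowdown within the SLO for the next timeslice."""
        self.history.append(observed_mapi)
        predicted_mapi = sum(self.history) / len(self.history)
        row = lookup_slowdowns(predicted_mapi)
        for f in FREQS_GHZ:
            projected = (self.total_slowdown + row[f]) / (self.slices + 1)
            if row[f] <= self.slo and projected <= self.slo:
                break  # lowest safe frequency found
        # If nothing qualified, f is the highest frequency (the fallback).
        self.total_slowdown += row[f]
        self.slices += 1
        return f
```

In a real deployment the accumulated slowdown would come from measurement rather than from the LUT prediction, and the chosen frequency would be applied through the platform's privileged frequency-control interface.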
LLM Serving with Iteration-Level Control (Kakolyris et al., 2024)
- Performance constraints:
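- TBT SLO: the time between tokens of each decode iteration must not exceed its target.
- E2E SLO: the end-to-end latency of each request must not exceed its target.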
- Autonomous Throttling:
- For new queries, simulate execution at maximum frequency with ML-based IPS predictors.
- Binary search over the discrete frequency settings to identify the minimal setting that meets all SLOs.
- All frequency changes are accompanied by explicit prediction of future batch size and KV-cache demand via scoreboard projection.
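The binary search over discrete frequency settings can be sketched as below. It assumes that feasibility is monotone in frequency (once a setting meets the SLOs, every higher setting does too). The frequency list and the toy latency model are hypothetical; a real system would call its performance predictor inside `meets_slos`.

```python
def min_feasible_frequency(freqs_mhz, meets_slos):
    """Return the lowest frequency in the ascending list freqs_mhz for
    which meets_slos(freq) is True, or None if no setting qualifies.
    Assumes meets_slos is monotone: False ... False True ... True."""
    lo, hi = 0, len(freqs_mhz) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if meets_slos(freqs_mhz[mid]):
            best = freqs_mhz[mid]  # feasible: try to go lower
            hi = mid - 1
        else:
            lo = mid + 1           # infeasible: must go higher
    return best


# Hypothetical settings and a toy model in which TBT scales as 1/frequency.
FREQS_MHZ = [300, 500, 700, 900, 1100, 1300, 1400]


def toy_tbt_ms(freq_mhz):
    return 14000.0 / freq_mhz


chosen = min_feasible_frequency(FREQS_MHZ, lambda f: toy_tbt_ms(f) <= 20.0)
```

With n discrete settings the search needs about log2(n) predictor calls rather than n, which matters when the decision is made at iteration granularity.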
Two-Stage LLM Phase-Aware DVFS (Liu et al., 22 Aug 2025)
- Prefill Optimization:
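- Service time model: prefill service time is fitted as a polynomial in frequency for each prompt-length class.
- Power: power draw is likewise fitted as a polynomial in frequency.
- Optimization: minimize energy over the frequency assigned to each prompt-length class, s.t. the queueing-based tail-latency constraint holds.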
- Decode Dual-Loop Controller:
- Coarse (200 ms): Bucket throughput, map to a frequency band.
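- Fine (20 ms): Track TBT p95; ramp frequency up when it rises above an upper threshold tied to the TBT SLO, and ramp down when it falls below a lower threshold, with hysteresis between the two.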
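The dual-loop decode controller can be illustrated with the sketch below. The throughput buckets, frequency bands, step size, and threshold fractions are hypothetical values chosen for the example; only the overall structure (a coarse band selection plus a fine hysteresis loop on p95 TBT) follows the description above.

```python
FREQ_STEP_MHZ = 15  # hypothetical clock step

# Hypothetical buckets: (throughput upper bound in tokens/s, (band_lo, band_hi) in MHz).
BANDS = [
    (500.0, (600, 900)),
    (1500.0, (900, 1200)),
    (float("inf"), (1200, 1410)),
]


def band_for(throughput_tps):
    """Map observed decode throughput to a frequency band."""
    for upper, band in BANDS:
        if throughput_tps <= upper:
            return band


class DecodeController:
    def __init__(self, tbt_slo_ms, up_frac=0.95, down_frac=0.80):
        self.up_ms = up_frac * tbt_slo_ms      # ramp up above this p95 TBT
        self.down_ms = down_frac * tbt_slo_ms  # ramp down below this p95 TBT
        self.band = BANDS[-1][1]               # start in the highest band
        self.freq = self.band[1]

    def coarse_tick(self, throughput_tps):
        """Slow loop (e.g., every 200 ms): choose the band, clamp the clock."""
        self.band = band_for(throughput_tps)
        self.freq = min(max(self.freq, self.band[0]), self.band[1])
        return self.freq

    def fine_tick(self, p95_tbt_ms):
        """Fast loop (e.g., every 20 ms): one step up or down inside the band;
        no change while p95 TBT sits between the two thresholds."""
        lo, hi = self.band
        if p95_tbt_ms > self.up_ms:
            self.freq = min(hi, self.freq + FREQ_STEP_MHZ)
        elif p95_tbt_ms < self.down_ms:
            self.freq = max(lo, self.freq - FREQ_STEP_MHZ)
        return self.freq
```

The gap between the two thresholds is the hysteresis region: it keeps the clock from oscillating when p95 TBT hovers near the target. This sketch does not escalate to a higher band when the fine loop saturates at the band ceiling; a production controller would need some such escape path.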
4. Empirical Results and Practical Impact
Multiple studies report measurable success for SLO-aware DFS:
| System | Workload | Energy Savings (%) | SLO Miss (%) | Perf. Loss (%) / Comments |
|---|---|---|---|---|
| CPU Timeslice (Yadav et al., 2019) | NAS NPB (CG, FT, MG, SP) | 4–7 | 0 | Perf. loss <3% in all cases (mean 2.4%) |
| throttLL'eM (Kakolyris et al., 2024) | LLM inference (Azure 60-min trace) | 24.7–43.8 | 0 | 1.71–1.78× energy efficiency |
| GreenLLM (Liu et al., 22 Aug 2025) | LLMs (Alibaba/Azure traces) | 6–34 | <3.5 | TTFT, TBT pass rates >96% |
In Yadav et al. (2019), the per-timeslice overhead of the control logic is measured at under 0.2% per slice, counting both the DVFS transition and the performance-counter read, which makes the approach practical for fine-grained adaptation on CPU-based systems. In GPU LLM serving, SLO-aware DFS strategies outperform default hardware governors: GreenLLM demonstrates up to 34% total GPU energy reduction purely by separating phase treatment and tightly controlling per-request power delivery (Liu et al., 22 Aug 2025). Importantly, the incidence of SLO violations does not increase significantly; time-to-first-token (TTFT) and TBT pass rates remain above 96%, even at high load or during token generation phases.
5. System-Specific Design Considerations
Phase Awareness
- In LLM inference, differential control per phase (prefill vs. decode) is crucial. The prefill phase, compute-bound and latency-critical, is optimized via static SLO-constrained frequency choices determined by an analytic queueing and latency-power model. Decode, being memory-bound with unpredictable length, admits more aggressive, fine-grained throttling and real-time feedback.
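As an illustration of static, SLO-constrained frequency selection for prefill, the sketch below enumerates discrete frequencies and picks the one with the lowest energy per request among those that satisfy a tail-latency bound. Every model here is an assumption made for the example: the service-time and power curves use invented coefficients, and the tail latency uses the M/M/1 result that sojourn time is exponentially distributed with rate (service rate minus arrival rate). The cited work fits its own latency-power and queueing models.

```python
import math


def service_time_s(f_ghz, a=0.12, b=0.02):
    """Hypothetical prefill service time: a frequency-scaled compute part
    plus a fixed part."""
    return a / f_ghz + b


def power_w(f_ghz, idle=80.0, c=60.0):
    """Hypothetical power curve: static power plus a cubic dynamic term."""
    return idle + c * f_ghz ** 3


def tail_latency_s(f_ghz, arrival_rate, q=0.99):
    """q-quantile of M/M/1 sojourn time; infinite if the queue is unstable."""
    mu = 1.0 / service_time_s(f_ghz)
    if mu <= arrival_rate:
        return math.inf
    return -math.log(1.0 - q) / (mu - arrival_rate)


def pick_prefill_frequency(freqs_ghz, arrival_rate, ttft_slo_s):
    """Lowest-energy frequency whose predicted tail latency meets the SLO."""
    feasible = [f for f in freqs_ghz
                if tail_latency_s(f, arrival_rate) <= ttft_slo_s]
    if not feasible:
        return None
    # Energy per request is approximated as power times service time.
    return min(feasible, key=lambda f: power_w(f) * service_time_s(f))


FREQS_GHZ = [0.6, 0.9, 1.2, 1.5]  # hypothetical settings
```

Running the selection once per prompt-length class, with class-specific coefficients, yields a small static table of prefill frequencies. Note that the lowest frequency is not always the most energy-efficient one: with a non-zero static power term, running slower stretches the time over which that static power is paid.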
Predictor and Model Selection
- Lightweight learned regressors (e.g., XGBoost, as chosen in Kakolyris et al., 2024) are favored over black-box neural networks when predictor inference must itself be low-latency.
- For simpler CPU tasks, direct metric-to-frequency lookup tables suffice, provided hardware phase-change behavior is well-characterized.
Overhead Management
- Control actions (DVFS frequency/voltage changes and monitoring counter reads) are amortized over timeslice or token-window durations, keeping total system overhead well below 1% of runtime.
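The amortization argument is simple arithmetic, sketched below. The transition and counter-read costs used in the test values are hypothetical placeholders, not measurements from the cited studies.

```python
def overhead_fraction(transition_us, counter_read_us, slice_ms):
    """Fraction of one timeslice spent on control actions, assuming one
    frequency transition and one counter read per slice."""
    if slice_ms <= 0:
        raise ValueError("slice_ms must be positive")
    return (transition_us + counter_read_us) / (slice_ms * 1000.0)
```

For example, with a hypothetical 20 µs transition and 1 µs counter read, a 10 ms slice spends 21 µs out of 10,000 µs on control, and a 100 ms slice spends a tenth of that fraction. Longer slices lower the overhead but react more slowly to phase changes.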
Scalability and Applicability
- These techniques are naturally extensible to exascale or multi-engine contexts (as in per-core or per-GPU DFS), provided per-task frequency domains are hardware-exposed.
- Queueing-aware and percentile-based SLO mechanisms facilitate generalization beyond single request-average latency to p99 or more sophisticated service constraints.
6. Limitations and Future Directions
Several limitations are documented:
- Offline mapping or profiling in CPUs (Yadav et al., 2019) couples the method to the specific hardware and workloads used during characterization; dynamic workload changes are less well tolerated.
- For memory-bound phases, the performance–energy decoupling becomes less predictable; tuning for minimal energy can underexploit compute-bound slices, leaving some efficiency gains untapped.
- Addressing complex SLOs (e.g., multi-tenant fairness, soft real-time, or percentile-based latency targets) may require hierarchical or per-task enforcement logic, potentially incorporating model-based or RL predictors.
Promising extensions include on-the-fly phase model fitting, exploitation of richer hardware metrics (IPC, cache miss rates), multi-level coordination across memory and network DVFS domains, and further splitting control to sub-modules (e.g., separate SM cluster clocks on future GPUs) (Yadav et al., 2019, Liu et al., 22 Aug 2025). Increased DFS granularity and hybrid model feedback (classical control + ML) are anticipated to enable even tighter SLO compliance with minimal energy.
7. Broader Significance
SLO-aware DFS exemplifies a transition toward tightly coupled, model-driven, and user-centric power optimization frameworks for modern AI and HPC serving scenarios. These designs enable substantial energy reduction (up to 34–44% in the GPU LLM studies above, and 4–7% on the CPU benchmarks) while keeping SLO adherence largely intact. The separation of phase treatment in workloads with distinct compute/memory characteristics, highlighted by GreenLLM's dual-pool architecture, is increasingly relevant in the context of LLM inference and other deep learning serving systems. As new hardware generations expose finer-grained DFS controls and telemetry, these approaches are poised for further sophistication and broader deployment (Liu et al., 22 Aug 2025, Kakolyris et al., 2024, Yadav et al., 2019).