SLO-Aware Dynamic Frequency Scaling

Updated 3 April 2026

SLO-aware DFS is a dynamic method that adjusts hardware frequency and voltage based on strict SLOs to balance energy efficiency with performance.
It employs profiling, predictive modeling, and real-time feedback control to safely reduce energy consumption without breaching response time or throughput limits.
Empirical studies report energy savings of 4–7% on CPU benchmarks and up to roughly 34–44% for GPU LLM serving, with minimal performance impact, highlighting its practical value for CPU and GPU workloads.

Service-Level Objective (SLO)-aware Dynamic Frequency Scaling (DFS) refers to runtime adaptation of processor or accelerator frequency and voltage parameters, subject to domain-appropriate performance constraints formulated as SLOs. These techniques balance energy efficiency with strict guarantees on response time, throughput, or other user-facing metrics. Recent work extends classic DVFS (Dynamic Voltage and Frequency Scaling) by fusing profiling, online inference, and feedback control with SLO compliance, achieving significant energy reduction in both CPU and modern GPU contexts.

1. Fundamental Concepts

SLO-aware DFS operates by dynamically tuning hardware performance parameters (for CPUs: voltage/frequency; for GPUs: Streaming Multiprocessor clocks and engine parallelism) so that specified SLOs are met, e.g., a maximum allowable average or tail latency per request, or a minimum number of tokens per second. SLOs are typically formalized as upper bounds on response time percentiles (such as p99 for end-to-end latency, or per-iteration time between tokens in LLM inference) and are dictated by application semantics or user experience constraints.

DVFS techniques, when applied naïvely, risk breaching performance constraints or incurring excessive runtime overhead. SLO-aware strategies avoid this via predictive modeling, fine-grained phase detection, and admission and adaptation protocols designed to keep execution within the SLO.
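A percentile-based SLO of the kind described above can be checked in a few lines. The following is a minimal sketch, assuming the nearest-rank percentile definition; the function names and the latency samples are illustrative and do not come from the cited systems.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def meets_slo(latencies_ms, p, bound_ms):
    """True if the p-th percentile latency is within the SLO bound."""
    return percentile(latencies_ms, p) <= bound_ms

# Illustrative samples: one slow outlier among otherwise fast requests.
latencies_ms = [12, 15, 11, 14, 90, 13, 12, 16, 15, 14]
print(meets_slo(latencies_ms, 50, 20))  # median-based check
print(meets_slo(latencies_ms, 99, 20))  # tail-based check
```

The same window of requests can pass a median-based target while failing a tail-based one, which is why the choice of percentile is part of the SLO definition.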

2. Methodological Approaches

SLO-aware DFS methodologies span several key strategies:

Profiling and Modeling:
- For CPU systems (e.g., Yadav et al., 2019), profiling captures phase-dependent metrics such as memory-accesses-per-instruction (MAPI). Offline characterization across all P-states produces look-up tables (LUTs) that map workload features to "safe" frequencies under bounded slowdown.
- For GPU-based LLM inference (Kakolyris et al., 2024, Liu et al., 22 Aug 2025), system-specific microbenchmarks are used to fit latency and power models as a function of frequency, engine size, and working set (e.g., KV-cache footprint, batch size).
Prediction and Admission Control:
- Workload phases are dynamically profiled. Predictive models (e.g., XGBoost regressors for per-iteration throughput in "throttLL'eM" (Kakolyris et al., 2024)) or latency–power polynomials (in GreenLLM (Liu et al., 22 Aug 2025)) forecast the effect of DFS actions before they are applied.
- At admission, new queries are simulated at maximum frequency to preempt SLO violations; only after passing the SLO checks are they subject to frequency reduction attempts.
Fine-Grained, Feedback-Driven Adjustment:
- Timeslice partitioning: CPU workloads adapt the frequency at sub-second granularity (10–100 ms timeslice) based on recent MAPI history, maintaining per-slice and cumulative slowdown within user-set SLOs (Yadav et al., 2019).
- GPU LLM serving splits control across prompt (“prefill”) and decode phases. Prefill frequencies are assigned for prompt length classes by solving energy minimization under queuing-based tail-latency constraints (Liu et al., 22 Aug 2025). Decode employs a dual-loop controller: coarse-grained throughput bucketing, and a fine-grained feedback loop tracking p95 time-between-tokens (TBT), with hysteresis and sub-20 ms correction lags.
Discrete Search and Controller Design:
- For fast search over discrete frequency settings, binary search is commonly applied (as in "throttLL'eM" (Kakolyris et al., 2024)), since for a single SLO feasibility is monotone in frequency: any setting above the minimal feasible one also meets the SLO.
- Controllers exploit workload characteristics: for memory-bound decode loops, frequency can be throttled aggressively, whereas compute-bound prefill requires maintaining higher clocks.
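The profiling-and-modeling step can be illustrated with a small fit. This is a sketch under stated assumptions: it uses a simple latency = a / f + b form and ordinary least squares, covers frequency only (not engine size or working set), and runs on synthetic data. The cited systems fit their own models; the function names here are hypothetical.

```python
def fit_inverse_frequency_model(freqs_mhz, latencies_ms):
    """Least-squares fit of latency = a / f + b, which is linear in x = 1 / f."""
    if len(freqs_mhz) != len(latencies_ms) or len(freqs_mhz) < 2:
        raise ValueError("need at least two (frequency, latency) pairs")
    xs = [1.0 / f for f in freqs_mhz]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(latencies_ms) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, latencies_ms))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def predict_latency_ms(model, freq_mhz):
    """Evaluate the fitted model at a frequency that was not necessarily profiled."""
    a, b = model
    return a / freq_mhz + b

# Synthetic "microbenchmark" data generated from a = 12000, b = 5 (illustration only).
freqs_mhz = [600, 900, 1200, 1500]
latencies_ms = [12000 / f + 5 for f in freqs_mhz]
model = fit_inverse_frequency_model(freqs_mhz, latencies_ms)
print(predict_latency_ms(model, 1000))
```

A fitted model of this kind lets a controller forecast the latency effect of a frequency change before applying it.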

3. Algorithmic Foundations and Formulations

The core of SLO-aware DFS algorithms is the mapping from phase or workload metrics to frequency settings, subject to SLO constraints. Canonical examples:

CPU Timeslice Mapping (Yadav et al., 2019)

Profiling: Offline mapping of (MAPI, slowdown) pairs at each P-state.
Runtime Algorithm:

For each timeslice $i$ of length $\Delta$:
1. Predict MAPI using the last $n$ observations.
2. Select $f_{sel}$ from the LUT such that the projected cumulative slowdown is $<$ SLO; if it is not, increment $f_{sel}$.
3. Enact the frequency change (via privileged register write).
4. Monitor overhead: control actions introduce $<$ 1% runtime penalty.
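The timeslice loop above can be sketched as a small controller. This is a minimal illustration, not the authors' implementation: the LUT values, the frequency grid, the mean-of-history MAPI predictor, and the treatment of cumulative slowdown as the average of per-slice slowdowns are all assumptions, and the privileged register write is left out.

```python
from collections import deque

# Hypothetical LUT rows: (MAPI upper bound, {frequency_GHz: profiled slowdown vs. f_max}).
LUT = [
    (0.01, {1.2: 0.30, 1.8: 0.12, 2.4: 0.0}),          # compute-bound phase
    (0.05, {1.2: 0.10, 1.8: 0.04, 2.4: 0.0}),
    (float("inf"), {1.2: 0.03, 1.8: 0.01, 2.4: 0.0}),  # memory-bound phase
]
FREQS = [1.2, 1.8, 2.4]  # ascending; the last entry is f_max

class TimesliceController:
    def __init__(self, slo_slowdown, history=3):
        self.slo = slo_slowdown
        self.mapi_history = deque(maxlen=history)
        self.slices = 0
        self.total_slowdown = 0.0  # sum of per-slice slowdowns observed so far

    def predict_mapi(self):
        # Step 1: predict the next MAPI as the mean of the last n observations.
        return sum(self.mapi_history) / len(self.mapi_history)

    def select_frequency(self):
        # With no history yet, stay at f_max to be safe.
        if not self.mapi_history:
            return FREQS[-1]
        mapi = self.predict_mapi()
        row = next(slowdowns for bound, slowdowns in LUT if mapi <= bound)
        # Step 2: try the lowest frequency first and step up until the
        # projected cumulative slowdown is below the SLO.
        for f in FREQS:
            projected = (self.total_slowdown + row[f]) / (self.slices + 1)
            if projected < self.slo:
                return f
        return FREQS[-1]

    def end_of_slice(self, observed_mapi, observed_slowdown):
        # Record what actually happened in the slice that just finished.
        self.mapi_history.append(observed_mapi)
        self.slices += 1
        self.total_slowdown += observed_slowdown

ctl = TimesliceController(slo_slowdown=0.05)
first = ctl.select_frequency()
ctl.end_of_slice(observed_mapi=0.08, observed_slowdown=0.0)
second = ctl.select_frequency()
print(first, second)
```

In this sketch a memory-bound observation lets the controller drop to the lowest frequency, while a compute-bound one keeps it at the maximum because no lower setting fits the slowdown budget.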

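Mathematical Model:

$t(f) = t_{on} \cdot \frac{f_{max}}{f} + t_{off}$

Here $t_{on}$ is the portion of execution time that scales inversely with frequency $f$, and $t_{off}$ is the portion that is unaffected by it, so phases dominated by $t_{off}$ (memory-bound phases) tolerate lower frequencies with little slowdown.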
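As a worked example of the model $t(f) = t_{on} \cdot \frac{f_{max}}{f} + t_{off}$, the sketch below inverts it to find the lowest frequency that keeps slowdown within a bound. Requiring $t(f) / t(f_{max}) \le 1 + \delta$ gives $f \ge f_{max} \cdot t_{on} / ((1 + \delta)(t_{on} + t_{off}) - t_{off})$. The function names and the numeric inputs are illustrative assumptions.

```python
def exec_time(f, f_max, t_on, t_off):
    """Execution time at frequency f under t(f) = t_on * f_max / f + t_off."""
    return t_on * f_max / f + t_off

def min_frequency(f_max, t_on, t_off, max_slowdown):
    """Lowest continuous frequency whose slowdown vs. f_max stays within max_slowdown."""
    budget = (1.0 + max_slowdown) * (t_on + t_off) - t_off
    return f_max * t_on / budget

# Illustrative phases with the same total time at f_max = 2.4 GHz.
memory_bound = min_frequency(2.4, t_on=2.0, t_off=8.0, max_slowdown=0.05)
compute_bound = min_frequency(2.4, t_on=8.0, t_off=2.0, max_slowdown=0.05)
print(memory_bound, compute_bound)
```

For the same 5% slowdown budget, the memory-bound phase admits a much lower frequency than the compute-bound one, which is the effect the MAPI-based lookup exploits. On real hardware the result would be rounded up to the next available P-state.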

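LLM Serving with Iteration-Level Control (Kakolyris et al., 2024)

Performance constraints:
- TBT SLO: an upper bound on the time between successive tokens.
- E2E SLO: an upper bound on end-to-end request latency.
Autonomous Throttling:
- New queries are first simulated at maximum frequency with ML-based IPS predictors.
- Binary search over the discrete frequency settings identifies the minimal setting that meets all SLOs.
- All frequency changes are accompanied by explicit prediction of future batch size and KV-cache demand via scoreboard projection.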
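The binary search over frequency settings can be sketched as follows. This is a minimal illustration assuming that SLO feasibility is monotone in frequency; the toy latency predictor, the frequency grid, and the SLO values are hypothetical stand-ins for the ML-based predictors and hardware settings used in practice.

```python
def min_feasible_frequency(freqs_mhz, meets_slos):
    """Lowest frequency for which meets_slos(f) holds, or None if even the maximum fails.

    freqs_mhz must be sorted ascending and meets_slos must be monotone:
    once it is True at some frequency it stays True at every higher one.
    """
    if not freqs_mhz or not meets_slos(freqs_mhz[-1]):
        return None
    lo, hi = 0, len(freqs_mhz) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if meets_slos(freqs_mhz[mid]):
            hi = mid          # feasible: try lower settings
        else:
            lo = mid + 1      # infeasible: must go higher
    return freqs_mhz[lo]

# Toy predictor (assumption): time between tokens shrinks as frequency rises.
def predicted_tbt_ms(freq_mhz):
    return 9000 / freq_mhz + 20

TBT_SLO_MS = 30
E2E_BUDGET_MS = 3200
TOKENS_REMAINING = 100

def meets_slos(freq_mhz):
    tbt = predicted_tbt_ms(freq_mhz)
    return tbt <= TBT_SLO_MS and tbt * TOKENS_REMAINING <= E2E_BUDGET_MS

freqs_mhz = list(range(300, 1501, 100))
chosen = min_feasible_frequency(freqs_mhz, meets_slos)
print(chosen)
```

Checking the maximum frequency first mirrors the admission step: if the SLOs cannot be met even there, no amount of search will find a feasible setting.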

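Two-Stage LLM Phase-Aware DVFS (Liu et al., 22 Aug 2025)

Prefill Optimization:
- Service time model: prefill latency fitted as a polynomial in frequency from microbenchmarks.
- Power: power draw fitted as a polynomial in frequency.
- Optimization: minimize energy over the frequency assigned to each prompt-length class, s.t. the queueing-based tail-latency constraint.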
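To make the prefill optimization concrete, here is a sketch that picks the minimum-energy frequency for one prompt-length class under a tail-latency bound. Every modeling choice is an assumption made for illustration and is not taken from GreenLLM: an M/M/1 queue (whose sojourn-time percentile is $-\ln(1-p)/(\mu-\lambda)$), a service time of the form work / f + fixed, a cubic dynamic power term, and all constants.

```python
import math

def prefill_frequency(freqs_ghz, arrival_rate, ttft_slo_s, percentile=0.99,
                      work=0.08, fixed=0.01, p_static=60.0, k_dyn=40.0):
    """Return (frequency, energy per request in J) minimizing energy under the tail SLO.

    Returns None when no frequency in freqs_ghz satisfies the constraint.
    """
    best = None
    for f in freqs_ghz:
        service_s = work / f + fixed            # assumed service time model
        mu = 1.0 / service_s                    # service rate in requests/s
        if mu <= arrival_rate:
            continue                            # queue would be unstable
        # M/M/1 sojourn time is exponential with rate (mu - lambda).
        tail_s = -math.log(1.0 - percentile) / (mu - arrival_rate)
        if tail_s > ttft_slo_s:
            continue                            # tail-latency constraint violated
        power_w = p_static + k_dyn * f ** 3     # assumed power model
        energy_j = power_w * service_s
        if best is None or energy_j < best[1]:
            best = (f, energy_j)
    return best

freqs_ghz = [0.6, 0.9, 1.2, 1.5]
relaxed = prefill_frequency(freqs_ghz, arrival_rate=5.0, ttft_slo_s=1.0)
tight = prefill_frequency(freqs_ghz, arrival_rate=5.0, ttft_slo_s=0.5)
print(relaxed, tight)
```

With these toy numbers a relaxed latency target admits a mid-range clock, while a tighter target forces the highest one, reflecting the compute-bound nature of prefill.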
Decode Dual-Loop Controller:
- Coarse (200 ms): Bucket throughput, map to a frequency band.
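- Fine (20 ms): Track TBT p95; ramp frequency up when it rises above the upper threshold and ramp down when it falls below the lower threshold, with the gap between the two acting as hysteresis.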
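The dual-loop structure can be sketched as a small state machine. This is an illustration under assumptions: the band table, the step size, and the two thresholds (ramp up above the SLO, ramp down below 80% of it) are hypothetical choices, and the class name is not from the cited work.

```python
class DecodeController:
    def __init__(self, tbt_slo_ms, bands, step_mhz=15, low_water=0.8):
        self.tbt_slo_ms = tbt_slo_ms
        # bands: list of (max tokens/s, (f_min_mhz, f_max_mhz)), ascending,
        # with the last limit set to infinity so every throughput maps to a band.
        self.bands = bands
        self.step_mhz = step_mhz
        self.low_water = low_water
        self.band = bands[-1][1]
        self.freq_mhz = self.band[1]  # start at the highest clock

    def coarse_tick(self, tokens_per_s):
        """Coarse loop (e.g. every 200 ms): bucket throughput into a frequency band."""
        self.band = next(band for limit, band in self.bands if tokens_per_s <= limit)
        self.freq_mhz = min(max(self.freq_mhz, self.band[0]), self.band[1])
        return self.freq_mhz

    def fine_tick(self, tbt_p95_ms):
        """Fine loop (e.g. every 20 ms): nudge frequency within the current band."""
        if tbt_p95_ms > self.tbt_slo_ms:
            self.freq_mhz = min(self.freq_mhz + self.step_mhz, self.band[1])
        elif tbt_p95_ms < self.low_water * self.tbt_slo_ms:
            self.freq_mhz = max(self.freq_mhz - self.step_mhz, self.band[0])
        # Between the two thresholds the frequency is held (hysteresis).
        return self.freq_mhz

bands = [(500, (600, 900)), (2000, (900, 1200)), (float("inf"), (1200, 1410))]
ctl = DecodeController(tbt_slo_ms=50, bands=bands)
after_coarse = ctl.coarse_tick(tokens_per_s=300)
after_slack = ctl.fine_tick(tbt_p95_ms=30)
after_hold = ctl.fine_tick(tbt_p95_ms=45)
after_breach = ctl.fine_tick(tbt_p95_ms=60)
print(after_coarse, after_slack, after_hold, after_breach)
```

The coarse loop bounds how far the fine loop can move, which keeps short-lived TBT spikes from pushing the clock far outside the range suited to the current load.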

4. Empirical Results and Practical Impact

Multiple studies report measurable success for SLO-aware DFS:

System	Workload	Energy Savings (%)	SLO Miss (%)	Perf. Loss (%) / Comments
CPU Timeslice (Yadav et al., 2019)	NAS NPB (CG, FT, MG, SP)	4–7	0	All perf. loss $\le$ 3% (mean 2.4%)
throttLL'eM (Kakolyris et al., 2024)	LLM inference (Azure 60-min trace)	24.7–43.8	0	1.71–1.78$\times$ energy efficiency
GreenLLM (Liu et al., 22 Aug 2025)	LLMs (Alibaba/Azure traces)	6–34	$\le$ 3.5	TTFT, TBT pass rates $\ge$ 96%

In (Yadav et al., 2019), the per-timeslice overhead of the control logic, which covers the DVFS transition and the performance-counter read, is measured at around 0.2% per slice, making the approach scalable to fine-grained adaptation on CPU-based systems. In GPU LLM serving, SLO-aware DFS strategies outperform default hardware governors: GreenLLM demonstrates up to 34% total GPU energy reduction by separating phase treatment and tightly controlling per-request power delivery (Liu et al., 22 Aug 2025). Importantly, the incidence of SLO violation is not significantly increased; TTFT and TBT pass rates remain at or above 96%, even at high load or during token generation phases.

5. System-Specific Design Considerations

Phase Awareness

In LLM inference, differential control per phase (prefill vs. decode) is crucial. The prefill phase, compute-bound and latency-critical, is optimized via static SLO-constrained frequency choices determined by an analytic queueing and latency-power model. Decode, being memory-bound with unpredictable length, admits more aggressive, fine-grained throttling and real-time feedback.

Predictor and Model Selection

Lightweight gradient-boosted regressors (e.g., XGBoost, as chosen in Kakolyris et al., 2024) are preferred over heavier black-box neural networks when the prediction itself must be low-latency.
For simpler CPU tasks, direct metric-to-frequency lookup tables suffice, provided hardware phase-change behavior is well-characterized.

Overhead Management

Control actions (DVFS frequency/voltage changes and monitoring counter reads) are amortized over timeslice or token window durations, keeping total system overhead well below 1% of runtime.
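The amortization argument is simple arithmetic: a fixed per-action cost divided by the length of the control interval. The sketch below uses made-up costs (50 µs per transition, 2 µs per counter read) purely to show the calculation; they are not measurements from the cited papers.

```python
def control_overhead_fraction(transition_us, counter_read_us, timeslice_ms):
    """Fraction of each timeslice spent on one frequency change plus one counter read."""
    return (transition_us + counter_read_us) / (timeslice_ms * 1000.0)

# Hypothetical per-action costs, evaluated at the two ends of a 10-100 ms timeslice range.
long_slice = control_overhead_fraction(50, 2, timeslice_ms=100)
short_slice = control_overhead_fraction(50, 2, timeslice_ms=10)
print(f"{long_slice:.3%} at 100 ms, {short_slice:.3%} at 10 ms")
```

Shorter timeslices react faster to phase changes but amortize the same fixed cost over less work, so the timeslice length sets the trade-off between responsiveness and overhead.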

Scalability and Applicability

These techniques are naturally extensible to exascale or multi-engine contexts (as in per-core or per-GPU DFS), provided per-task frequency domains are hardware-exposed.
Queueing-aware and percentile-based SLO mechanisms facilitate generalization beyond single request-average latency to p99 or more sophisticated service constraints.

6. Limitations and Future Directions

Several limitations are documented:

Offline mapping or profiling in CPUs (Yadav et al., 2019) couples the method to the specific hardware and workloads used during characterization; dynamic workload changes are less well tolerated.
For memory-bound phases, the performance–energy decoupling becomes less predictable; tuning for minimal energy can underexploit compute-bound slices, leaving some efficiency gains untapped.
Addressing complex SLOs (e.g., multi-tenant fairness, soft real-time, or percentile-based latency targets) may require hierarchical or per-task enforcement logic, potentially incorporating model-based or RL predictors.

Promising extensions include on-the-fly phase model fitting, exploitation of richer hardware metrics (IPC, cache miss rates), multi-level coordination across memory and network DVFS domains, and further splitting control to sub-modules (e.g., separate SM cluster clocks on future GPUs) (Yadav et al., 2019, Liu et al., 22 Aug 2025). Increased DFS granularity and hybrid model feedback (classical control + ML) are anticipated to enable even tighter SLO compliance with minimal energy.

7. Broader Significance

SLO-aware DFS exemplifies a transition toward tightly-coupled, model-driven, and user-centric power optimization frameworks for modern AI and HPC serving scenarios. These designs enable substantive energy reduction (4–7% on the CPU benchmarks and up to roughly 34–44% in the LLM serving studies above) while keeping SLO misses at or near zero. The separation of phase treatment in workloads with distinct compute/memory characteristics, highlighted by GreenLLM’s dual-pool architecture, is increasingly relevant in the context of LLM inference and other deep learning serving systems. As new hardware generations expose finer-grained DFS controls and telemetry, these approaches are poised for further sophistication and broader deployment (Liu et al., 22 Aug 2025, Kakolyris et al., 2024, Yadav et al., 2019).


References (3)

Yadav et al. (2019). Energy Saving Strategy Based on Profiling.

Kakolyris et al. (2024). SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving.

Liu et al. (2025). GreenLLM: SLO-Aware Dynamic Frequency Scaling for Energy-Efficient LLM Serving.
