VoltanaLLM: Energy-Efficient LLM Serving

Updated 9 September 2025
  • VoltanaLLM is an energy-efficient serving system for LLM inference that dynamically adjusts GPU frequency and request routing to meet strict service-level objectives.
  • It employs a dual-module design with EcoFreq for per-instance GPU frequency tuning and EcoRoute for state-space routing to minimize energy consumption.
  • Leveraging control-theoretic feedback mechanisms, VoltanaLLM achieves up to 36.3% energy savings while maintaining nearly perfect SLO attainment in disaggregated architectures.

VoltanaLLM is an energy-efficient serving system for LLM inference, designed to optimize the trade-off between energy consumption and request latency under service-level objectives (SLOs). VoltanaLLM jointly manages GPU frequency scaling and request routing in prefill/decode disaggregated architectures, enabling dynamic, phase-aware, fine-grained control that leverages principles from control theory. Its design and implementation achieve up to 36.3% energy savings while maintaining near-perfect SLO attainment, representing a significant advance in sustainable and intelligent LLM deployment (Yu et al., 5 Sep 2025).

1. System Architecture and Control Loop

VoltanaLLM is structured around two core modules that interact in the online serving loop:

  • EcoFreq (Feedback-Driven Frequency Controller): Runs on each serving instance, adaptively selecting the GPU frequency for its assigned phase (prefill or decode). EcoFreq uses lightweight latency predictors to select the minimum frequency at which batch processing can be completed within the prescribed SLO.
  • EcoRoute (State-Space Router): Makes global routing decisions for the decode phase. Given the current state (including each instance’s GPU frequency and load), EcoRoute performs a “what-if” analysis, simulating the assignment of an incoming request to each decode instance, and chooses the destination that minimizes incremental energy by avoiding frequency “boundary” transitions.

The overall workflow is as follows: when a request arrives, the prefill phase is handled round-robin, while the decode phase is routed through EcoRoute using a state-aware policy. Each instance runs a control thread that samples local state (batch size, token count, wait times) every iteration, performs a fast what-if analysis over frequency choices, and sets the GPU clock accordingly. This per-batch-iteration control loop is central to VoltanaLLM’s adaptive, closed-loop optimization.
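As a concrete illustration of the dispatch side of this loop, the following minimal Python sketch mirrors the described workflow; the class and method names (Dispatcher, eco_route.pick) are assumptions for illustration, not the system’s actual API.

```python
import itertools

# Illustrative dispatcher for a prefill/decode disaggregated deployment:
# prefill targets are chosen round-robin, decode targets via EcoRoute.
class Dispatcher:
    def __init__(self, prefill_instances, decode_instances, eco_route):
        self._prefill_rr = itertools.cycle(prefill_instances)  # round-robin prefill
        self._decode_instances = decode_instances
        self._eco_route = eco_route  # state-space router (see Section 3)

    def dispatch(self, request):
        prefill_target = next(self._prefill_rr)
        # EcoRoute inspects each decode instance's frequency/load state.
        decode_target = self._eco_route.pick(self._decode_instances, request)
        return prefill_target, decode_target
```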

2. Frequency Scaling via Feedback Control

EcoFreq formulates fine-grained GPU frequency tuning as a real-time control problem. At each iteration, it predicts latency as a function of frequency and batch parameters, using offline-calibrated models such as:

T^{\text{prefill}}(f, N^{\text{bt}}) = a_f \, N^{\text{bt}} + c_f

T^{\text{decode}}(f, N^{\text{req}}, N^{\text{kv}}) = a_f \, N^{\text{req}} + b_f \, N^{\text{kv}} + c_f

where T^{(\cdot)} is the predicted latency, f is the GPU frequency, N^{\text{bt}} is the number of batched tokens, N^{\text{req}} is the decode batch size (number of requests), and N^{\text{kv}} is the size of the key–value cache.
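A minimal Python sketch of these predictors follows; the function and field names are illustrative assumptions, with the coefficients a_f, b_f, c_f understood to be profiled offline for each supported frequency.

```python
from dataclasses import dataclass

@dataclass
class FreqCoeffs:
    a: float  # slope per batched token (prefill) or per request (decode)
    b: float  # slope per KV-cache token (decode only)
    c: float  # constant per-iteration overhead

def predict_prefill_latency(coeffs: FreqCoeffs, n_bt: int) -> float:
    # T_prefill(f, N_bt) = a_f * N_bt + c_f
    return coeffs.a * n_bt + coeffs.c

def predict_decode_latency(coeffs: FreqCoeffs, n_req: int, n_kv: int) -> float:
    # T_decode(f, N_req, N_kv) = a_f * N_req + b_f * N_kv + c_f
    return coeffs.a * n_req + coeffs.b * n_kv + coeffs.c
```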

The controller samples the system state and, for each candidate frequency, uses these models to check if the SLO is met. It then selects the lowest such frequency (“greedy” energy-minimization) while increasing frequency preemptively when SLO violations are predicted, realizing a robust, closed-loop design. The controller’s low computational overhead (decision cycle in milliseconds) permits per-iteration scaling.
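Building on the predictor sketch above, the greedy rule can be illustrated as follows; coeffs_by_freq (a mapping from each supported GPU frequency to its offline-profiled coefficients) and the ITL SLO threshold are assumed inputs rather than the paper’s exact interface.

```python
def choose_decode_frequency(coeffs_by_freq, n_req, n_kv, itl_slo_s):
    # Scan supported frequencies from lowest to highest and return the
    # first whose predicted inter-token latency satisfies the SLO.
    for freq in sorted(coeffs_by_freq):
        if predict_decode_latency(coeffs_by_freq[freq], n_req, n_kv) <= itl_slo_s:
            return freq
    # No frequency meets the SLO: run at maximum to minimize the violation.
    return max(coeffs_by_freq)
```

Since each prediction is a two-term linear model evaluated over a small set of supported frequencies, the scan fits comfortably within the millisecond-scale decision cycle noted above.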

3. State-Space Routing Across Decode Instances

During the decode phase, each instance can operate at a different batch size and frequency. Energy–latency profiles exhibit “boundary effects,” with energy cost sometimes increasing abruptly when the batch size passes a threshold requiring a higher frequency.

EcoRoute maintains a discrete state space parameterized by:

  • Current GPU frequency f of each instance
  • Batch size (number of requests)
  • Key–value cache size

When a request arrives, EcoRoute performs a “what-if” analysis, simulating the allocation of the request to each decode instance and predicting whether it would trigger a frequency increase. It chooses the assignment that avoids crossing costly boundaries and minimizes overall frequency (and thus energy). If every assignment would necessitate a frequency jump, fallback heuristics (e.g., round-robin or minimum incremental frequency change) are used.

This global “navigation” in the state space enables balanced load distribution, preventing clusters of requests from forcing all decode instances to maximum frequency.
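An illustrative sketch of this decision, reusing choose_decode_frequency from the Section 2 sketch; the instance fields (freq, n_req, n_kv) and the exact fallback ordering are assumptions, not the paper’s precise policy.

```python
def route_decode_request(instances, req_kv, itl_slo_s, coeffs_by_freq):
    # What-if analysis: predict the frequency each instance would need
    # if it accepted this request (one more request, req_kv more KV tokens).
    candidates = [
        (inst, choose_decode_frequency(coeffs_by_freq,
                                       inst.n_req + 1,
                                       inst.n_kv + req_kv,
                                       itl_slo_s))
        for inst in instances
    ]
    # Prefer destinations whose current clock already covers the new load,
    # i.e., assignments that avoid crossing a frequency "boundary".
    no_jump = [(inst, f) for inst, f in candidates if f <= inst.freq]
    if no_jump:
        return min(no_jump, key=lambda c: c[1])[0]  # lowest feasible frequency
    # Fallback heuristic: smallest incremental frequency change.
    return min(candidates, key=lambda c: c[1] - c[0].freq)[0]
```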

4. Performance Evaluation and Metrics

VoltanaLLM was implemented in SGLang and benchmarked on production workloads using multiple state-of-the-art LLMs. Key metrics include:

  • Latency: Time-To-First-Token (TTFT) for prefill and Inter-Token Latency (ITL) for decode, with SLO attainment defined as the percentage of requests within the latency constraint (a minimal sketch of this computation follows the list).
  • Energy Consumption: Measured in joules per inference batch, focusing on GPU energy, the main contributor to inference cost.
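SLO attainment as defined above reduces to a simple percentage; this minimal Python sketch (function name and arguments are illustrative assumptions) makes the definition concrete:

```python
def slo_attainment(latencies_s, slo_s):
    """Percentage of requests whose latency (TTFT or ITL) is within the SLO."""
    met = sum(1 for t in latencies_s if t <= slo_s)
    return 100.0 * met / len(latencies_s)
```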

Empirical results:

| LLM Family | Baseline Energy (Max Frequency) | VoltanaLLM Energy | Relative Energy Savings | SLO Attainment Rate |
|------------|---------------------------------|-------------------|-------------------------|---------------------|
| SOTA LLMs  | 100%                            | ≈63.7%            | up to 36.3%             | ≈99.7–100%          |

VoltanaLLM dynamically ramps frequency up or down in response to request load, maintaining SLOs even during bursty traffic.

5. Technical Innovations

  • Phase-Decoupled Fine-Grained Control: Distinguishes between compute-bound prefill and memory-bound decode, enabling per-phase optimization not available in monolithic or coarse-grained frequency scaling.
  • Predictive Closed-Loop Feedback: Applies control theory by sampling current metrics, computing the predicted error (predicted latency minus SLO target), and adaptively tuning the process variable (GPU frequency).
  • State-Space Routing: First to cast LLM request routing as navigation in a state space defined by instance load and energy–latency boundaries, enabling energy-aware global optimization during decode.
  • Lightweight Latency Models: Uses compact, regression-based predictors for latency, parameterized by frequency and load, tractable for real-time inference scheduling.
  • Practical System Integration: Implemented in SGLang, VoltanaLLM operates in batch-serving, disaggregated (prefill/decode) LLM architectures increasingly adopted for efficient high-throughput inference.

6. Implications for Sustainable and Scalable LLM Serving

VoltanaLLM’s approach demonstrates that the combination of real-time, feedback-driven frequency tuning and global, state-aware routing can dramatically lower the energy cost of interactive LLM inference without sacrificing user-perceived latency or violating SLOs. This architecture is directly relevant for data centers aiming to manage operational costs and environmental impact as LLM deployment scales.

A plausible implication is that similar control-theoretic and state-space optimization strategies may generalize to other mixed-phase, performance-critical inference workloads beyond LLM serving—whenever computation and memory characteristics differ across processing phases.

7. Summary Table: Main Components and Dependencies

| Module        | Input State                                            | Control Output                | Objective                      |
|---------------|--------------------------------------------------------|-------------------------------|--------------------------------|
| EcoFreq       | Per-instance: frequency, load, wait time, batch        | GPU frequency (per-iteration) | Minimize energy, ensure SLOs   |
| EcoRoute      | All decode instance states: frequency, load, KV cache  | Routing decision (assignment) | Minimize system-wide frequency |
| Latency Model | Offline-profiled coefficients (a_f, b_f, c_f)          | Used by EcoFreq and EcoRoute  | Predict phase-specific latency |

VoltanaLLM establishes a foundation for control-theoretic, phase-disaggregated optimization in LLM inference, modeling energy–latency trade-offs with explicit regression and feedback mechanisms and achieving state-of-the-art energy reductions at near-maximum SLO fulfillment (Yu et al., 5 Sep 2025).
