VoltanaLLM: Energy-Efficient LLM Serving
- VoltanaLLM is an energy-efficient serving system for LLM inference that dynamically adjusts GPU frequency and request routing to meet strict service-level objectives.
- It employs a dual-module design with EcoFreq for per-instance GPU frequency tuning and EcoRoute for state-space routing to minimize energy consumption.
- Leveraging control-theoretic feedback mechanisms, VoltanaLLM achieves up to 36.3% energy savings while maintaining nearly perfect SLO attainment in disaggregated architectures.
VoltanaLLM is an energy-efficient serving system for LLM inference, designed to optimize the trade-off between energy consumption and request latency under service-level objectives (SLOs). VoltanaLLM jointly manages GPU frequency scaling and request routing in prefill/decode disaggregated architectures, enabling dynamic, phase-aware, fine-grained control that leverages principles from control theory. Its design and implementation achieve up to 36.3% energy savings while maintaining near-perfect SLO attainment, representing a significant advance in sustainable and intelligent LLM deployment (Yu et al., 5 Sep 2025).
1. System Architecture and Control Loop
VoltanaLLM is structured around two core modules that interact in the online serving loop:
- EcoFreq (Feedback-Driven Frequency Controller): Runs on each serving instance, adaptively selecting the GPU frequency for its assigned phase (prefill or decode). EcoFreq uses lightweight latency predictors to select the minimum frequency at which batch processing can be completed within the prescribed SLO.
- EcoRoute (State-Space Router): Makes global routing decisions for the decode phase. Given the current state (including each instance’s GPU frequency and load), EcoRoute simulates (“what-if” analysis) the assignment of an incoming request to each decode instance and chooses the destination that minimizes incremental energy by avoiding frequency “boundary” transitions.
The overall workflow is as follows: when a request arrives, it is dispatched to a prefill instance in round-robin fashion, while its decode phase is routed by EcoRoute using a state-aware policy. Each instance runs a control thread that samples local state (batch size, token count, wait times) every iteration, performs a fast what-if analysis over candidate frequencies, and sets the GPU clock accordingly. This per-iteration control loop is central to VoltanaLLM’s adaptive, closed-loop optimization.
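A minimal Python sketch of this dispatch-and-control structure is shown below; it is an illustrative skeleton rather than VoltanaLLM’s actual code, and helper names such as `sample_state`, `select_frequency`, and `set_gpu_clock` are hypothetical placeholders.

```python
# Illustrative skeleton (not VoltanaLLM's actual implementation): one router
# dispatches requests; each serving instance runs a per-iteration control thread.
import itertools
import threading
import time

class Dispatcher:
    def __init__(self, prefill_instances, decode_instances, ecoroute):
        self._prefill_rr = itertools.cycle(prefill_instances)   # round-robin over prefill
        self.decode_instances = decode_instances
        self.ecoroute = ecoroute                                 # state-space router (Section 3)

    def dispatch(self, request):
        prefill = next(self._prefill_rr)                         # prefill: round-robin
        decode = self.ecoroute.choose(request, self.decode_instances)  # decode: state-aware
        return prefill, decode

def control_loop(instance, ecofreq, period_s=0.005):
    """Per-instance control thread: sample state, pick a clock, apply it."""
    while instance.running:
        state = instance.sample_state()          # batch size, token count, wait times
        freq = ecofreq.select_frequency(state)   # what-if analysis over candidate clocks
        instance.set_gpu_clock(freq)             # e.g., via a locked-clock API such as NVML
        time.sleep(period_s)                     # in practice, runs once per batch iteration

# Example wiring (hypothetical objects):
# threading.Thread(target=control_loop, args=(instance, ecofreq), daemon=True).start()
```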
2. Frequency Scaling via Feedback Control
EcoFreq formulates fine-grained GPU frequency tuning as a real-time control problem. At each iteration, it predicts latency as a function of frequency and batch parameters, using offline-calibrated regression models of the form

$$\hat{T}_{\text{prefill}}(f, n) \approx a_f\, n + b_f, \qquad \hat{T}_{\text{decode}}(f, b, c) \approx a_f\, b + b_f\, c + c_f,$$

where $\hat{T}$ is the predicted latency, $f$ is the GPU frequency, $n$ is the number of batched tokens, $b$ is the decode batch size (number of requests), $c$ is the size of the key–value cache, and $a_f, b_f, c_f$ are frequency-specific coefficients profiled offline.
The controller samples the system state and, for each candidate frequency, uses these models to check whether the SLO would be met. It then selects the lowest frequency that satisfies the SLO (“greedy” energy minimization) and raises the clock preemptively whenever a violation is predicted, realizing a robust, closed-loop design. The controller’s low computational overhead (decision cycle in milliseconds) permits per-iteration scaling.
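A minimal sketch of this greedy selection for the decode phase, assuming per-frequency linear coefficients of the form above; the candidate clocks and coefficient values are hypothetical, not measured profiles.

```python
# Illustrative EcoFreq-style frequency selection (hypothetical coefficients and clocks,
# not VoltanaLLM's actual profiling data).

# Per-frequency decode-latency coefficients a_f, b_f, c_f fitted offline:
# predicted ITL (ms) = a_f * batch_size + b_f * kv_cache_tokens + c_f
DECODE_COEFFS = {
    1005: (0.20, 3e-5, 8.0),   # MHz -> (a_f, b_f, c_f), hypothetical values
    1410: (0.14, 2e-5, 5.5),
    1980: (0.10, 1e-5, 4.0),
}

def predict_itl_ms(freq_mhz, batch_size, kv_cache_tokens):
    """Evaluate the offline-calibrated linear latency model at one clock."""
    a, b, c = DECODE_COEFFS[freq_mhz]
    return a * batch_size + b * kv_cache_tokens + c

def select_frequency(batch_size, kv_cache_tokens, slo_itl_ms):
    """Greedy: lowest candidate clock whose predicted ITL still meets the SLO."""
    for freq in sorted(DECODE_COEFFS):                            # ascending frequency
        if predict_itl_ms(freq, batch_size, kv_cache_tokens) <= slo_itl_ms:
            return freq
    return max(DECODE_COEFFS)              # violation predicted at every clock: go to max

print(select_frequency(batch_size=32, kv_cache_tokens=200_000, slo_itl_ms=20.0))  # -> 1410
```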
3. State-Space Routing Across Decode Instances
During the decode phase, each instance can operate at a different batch size and frequency. Energy–latency profiles exhibit “boundary effects,” with energy cost sometimes increasing abruptly when the batch size passes a threshold requiring a higher frequency.
EcoRoute maintains a discrete state space parameterized by:
- Current GPU frequency of each instance
- Batch size (number of requests)
- Key–value cache size
When a request arrives, EcoRoute simulates (“what-if” analysis) allocation to each decode instance, predicting whether it will trigger a frequency increase. It chooses the assignment that avoids crossing costly boundaries and minimizes overall frequency (and thus energy). If all assignments necessitate a frequency jump, fallback heuristics (e.g., round-robin or minimum incremental frequency change) are used.
This global “navigation” in the state space enables balanced load distribution, preventing clusters of requests from forcing all decode instances to maximum frequency.
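The following sketch illustrates such a what-if routing decision. It reuses the hypothetical `select_frequency` predictor from the EcoFreq example above and deliberately simplifies the actual policy.

```python
# Illustrative EcoRoute-style decision (a simplification, not the actual policy).
from dataclasses import dataclass

@dataclass
class DecodeInstance:
    name: str
    freq_mhz: int          # current GPU clock
    batch_size: int        # in-flight requests
    kv_cache_tokens: int   # resident KV-cache size

def route(instances, new_req_tokens, slo_itl_ms):
    """Pick the instance whose post-admission minimum feasible clock is lowest,
    preferring instances that need no frequency increase (no "boundary" crossing)."""
    best, best_key = None, None
    for inst in instances:
        # What-if: minimum feasible clock after admitting the new request
        needed = select_frequency(inst.batch_size + 1,
                                  inst.kv_cache_tokens + new_req_tokens,
                                  slo_itl_ms)
        crosses_boundary = needed > inst.freq_mhz            # admission forces a clock bump
        key = (crosses_boundary, needed, inst.batch_size)    # avoid bumps, then lowest clock
        if best_key is None or key < best_key:
            best, best_key = inst, key
    return best   # if every choice forces a bump, the lowest resulting clock wins
```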
4. Performance Evaluation and Metrics
VoltanaLLM was implemented in SGLang and benchmarked on production workloads using multiple state-of-the-art LLMs. Key metrics include:
- Latency: Time-To-First-Token (TTFT) for prefill and Inter-Token Latency (ITL) for decode, with SLO attainment defined as the percentage of requests within the latency constraint.
- Energy Consumption: Measured in joules per inference batch, focusing on GPU energy, the main contributor to inference cost.
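For concreteness, SLO attainment can be computed from per-request latency samples as follows; this is a generic illustration with hypothetical numbers, not the paper's evaluation harness.

```python
def slo_attainment(latencies_ms, slo_ms):
    """Fraction of requests whose measured latency (TTFT or ITL) meets the SLO."""
    met = sum(1 for t in latencies_ms if t <= slo_ms)
    return met / len(latencies_ms)

# e.g., per-request TTFTs against a 500 ms TTFT SLO (hypothetical numbers)
print(slo_attainment([120.0, 480.0, 310.0, 520.0], slo_ms=500.0))  # -> 0.75
```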
Empirical results:
| LLM Family | Max-Frequency Baseline Energy | VoltanaLLM Energy | Relative Energy Savings | SLO Attainment Rate |
|---|---|---|---|---|
| SOTA LLMs | 100% | ≈63.7% | up to 36.3% | ≈99.7–100% |
VoltanaLLM dynamically ramps frequency up or down in response to request load, maintaining SLOs even during bursty traffic.
5. Technical Innovations
- Phase-Decoupled Fine-Grained Control: Distinguishes between compute-bound prefill and memory-bound decode, enabling per-phase optimization not available in monolithic or coarse-grained frequency scaling.
- Predictive Closed-Loop Feedback: Applies control theory by sampling current metrics, computing the predicted error (predicted latency minus SLO target), and adaptively tuning the control variable (GPU frequency).
- State-Space Routing: First to cast LLM request routing as navigation in a state space defined by instance load and energy–latency boundaries, enabling energy-aware global optimization during decode.
- Lightweight Latency Models: Uses compact, regression-based predictors for latency, parameterized by frequency and load, tractable for real-time inference scheduling.
- Practical System Integration: Implemented in SGLang, VoltanaLLM operates in batch-serving, disaggregated (prefill/decode) LLM architectures increasingly adopted for efficient high-throughput inference.
6. Implications for Sustainable and Scalable LLM Serving
VoltanaLLM’s approach demonstrates that the combination of real-time, feedback-driven frequency tuning and global, state-aware routing can dramatically lower the energy cost of interactive LLM inference without sacrificing user-perceived latency or violating SLOs. This architecture is directly relevant for data centers aiming to manage operational costs and environmental impact as LLM deployment scales.
A plausible implication is that similar control-theoretic and state-space optimization strategies may generalize to other mixed-phase, performance-critical inference workloads beyond LLM serving—whenever computation and memory characteristics differ across processing phases.
7. Summary Table: Main Components and Dependencies
| Module | Input State | Output | Objective |
|---|---|---|---|
| EcoFreq | Per-instance: frequency, load, wait time, batch | GPU frequency setting (per iteration) | Minimize energy while meeting SLOs |
| EcoRoute | All decode-instance states: frequency, load, KV cache | Routing decision (instance assignment) | Minimize system-wide frequency |
| Latency model | Offline-profiled coefficients (a_f, b_f, c_f, etc.) | Latency predictions consumed by EcoFreq and EcoRoute | Predict phase-specific latency |
VoltanaLLM establishes a foundation for control-theoretic, phase-disaggregated optimization in LLM inference, modeling energy–latency trade-offs with explicit regression and feedback mechanisms, achieving state-of-the-art energy reductions at near-maximum SLO fulfillment (Yu et al., 5 Sep 2025).