GreenLLM: Sustainable LLM Serving
- GreenLLM is a framework designed to minimize energy consumption and carbon footprints using disaggregation and dynamic frequency scaling techniques while ensuring SLO compliance.
- It employs phase-aware resource management by splitting LLM inference into prefill and decode phases across heterogeneous GPUs to optimize performance and reduce energy draw.
- Empirical evaluations report up to 40% carbon reduction and 34% energy savings, demonstrating the framework's potential for sustainable, large-scale LLM deployments.
GreenLLM refers to a class of serving frameworks for LLMs, each designed to reduce energy consumption or carbon emissions during inference while meeting service-level objectives (SLOs) for latency and throughput. The most prominent instantiations are: (1) disaggregation across heterogeneous GPUs for carbon minimization (Shi et al., 2024), and (2) SLO-aware dynamic frequency scaling for GPU energy reduction (Liu et al., 22 Aug 2025). Both frameworks target the computational and environmental inefficiencies inherent to contemporary LLM deployment at scale, leveraging phase-aware resource management and data-driven scheduling.
1. System Architectures and Design Principles
Disaggregation-Based GreenLLM
The GreenLLM architecture for carbon minimization comprises three principal components (Shi et al., 2024):
- Disaggregated Execution Layer: Employs two optimizers—Disg-Pref-Decode and Disg-Spec-Decode—that disaggregate computation between a new, high-performance GPU (e.g., NVIDIA A100) and an older, lower-performance GPU (e.g., T4 or V100).
- Profiler: A lightweight module that exhaustively measures per-phase latency (TTFT: time-to-first-token, TPOT: time-per-output-token), per-phase energy draw, and per-request carbon breakdown across a grid of request parameters.
- SLO-Aware Scheduler: A runtime engine that, for each incoming request characterized by input length and QPS, and given a target latency SLO, selects the configuration that minimizes total carbon emissions while satisfying SLO constraints via lookup tables and collaborative filtering for missing profiles.
A disaggregated cluster typically involves at least one node with a new GPU and one with an old GPU, connected by ≥10 Gbps network bandwidth, with orchestration and inter-GPU data transfer managed by the disaggregators.
DVFS-Based GreenLLM
The energy-focused GreenLLM variant (Liu et al., 22 Aug 2025) integrates as a thin control layer in the serving stack (e.g., NVIDIA Dynamo + TensorRT-LLM):
- Ingress and Length-Based Routing: Requests are classified based on prompt length, enabling queueing to separate short from long prompts to mitigate head-of-line blocking.
- Prefill and Decode Pools: Separate worker pools for compute-bound prefill and memory-bound decode phases, each with phase-specific resource management.
- Telemetry and Profiling: Real-time collection of performance and energy metrics.
- Control Plane: Issues streaming-multiprocessor (SM) frequency updates to prefill and decode worker pools via NVML app-clocks, informed by latency, queue, and energy modeling.
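The length-based routing step can be illustrated with a minimal sketch. The 512-token cutoff, the `Request` and `LengthRouter` names, and the two-queue layout are assumptions made for illustration; the source does not state a threshold or a queue structure.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: str
    prompt_tokens: int


@dataclass
class LengthRouter:
    # Hypothetical cutoff between "short" and "long" prompts.
    short_max_tokens: int = 512
    short_queue: deque = field(default_factory=deque)
    long_queue: deque = field(default_factory=deque)

    def route(self, req: Request) -> str:
        """Place the request in the queue matching its prompt length."""
        if req.prompt_tokens <= self.short_max_tokens:
            self.short_queue.append(req)
            return "short"
        self.long_queue.append(req)
        return "long"


router = LengthRouter()
labels = [router.route(Request(f"r{i}", n)) for i, n in enumerate([32, 4096, 512, 513])]
```

Keeping long prompts in their own queue means a short request never waits behind a multi-thousand-token prefill, which is the head-of-line blocking the routing stage is meant to mitigate.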
2. Disaggregation Use Cases and Phase Optimization
Phase-Splitting (Disg-Pref-Decode)
- Prefill Phase: Processes the input prompt and builds the key-value (KV) cache, running on the new GPU because of its high TFLOPs demand.
- Decoding Phase: Autoregressive generation is memory-bound and runs on the old GPU. At the prefill/decoding boundary, the entire KV cache is serialized and DMA-transferred, requiring ≥10 Gbps for 7B-parameter models at 1 QPS.
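A back-of-the-envelope estimate shows why the KV cache handoff is bandwidth-sensitive. The sketch below assumes a Llama-7B-like shape (32 layers, hidden size 4096, standard multi-head attention, FP16 values) and a 1024-token prompt; these figures are illustrative assumptions, not measurements from the source.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int = 32,
                   hidden_size: int = 4096, bytes_per_value: int = 2) -> int:
    """Size of the KV cache: one K and one V tensor per layer, per token."""
    return 2 * num_layers * hidden_size * bytes_per_value * num_tokens


def transfer_seconds(num_bytes: int, link_gbps: float) -> float:
    """Idealized transfer time over a link, ignoring protocol overhead."""
    return num_bytes * 8 / (link_gbps * 1e9)


size = kv_cache_bytes(1024)            # 1024-token prompt (assumed)
seconds = transfer_seconds(size, 10.0)  # 10 Gbps link
```

Under these assumptions the cache is 512 MiB and takes roughly 0.43 s to move at 10 Gbps, so at slower links the transfer alone can consume a large share of a TTFT budget.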
Speculative-Splitting (Disg-Spec-Decode)
- Draft Model on Old GPU: Emits speculative tokens and draft probabilities.
- Target Model on New GPU: Only token IDs are initially transferred; the validation and final output generation occur on the high-performance node.
- Bandwidth Efficiency: Communication consists only of token IDs and draft probabilities, which reduces inter-GPU bandwidth demand compared to the phase split.
Phase-Specific DVFS
In the dynamic frequency scaling variant (Liu et al., 22 Aug 2025):
- Prefill: Latency-power models (quadratic in prompt length) determine the energy-optimal clock frequency for a batch of prefill jobs, within SLO constraints.
- Decode: A dual-loop controller uses tokens-per-second (TPS) measurements and tail time-between-tokens (TBT) latency to adjust GPU frequency dynamically, combining a coarse loop (200 ms, TPS-bucketed) with a fine loop (20 ms, TBT-based).
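One way to picture the dual-loop decode controller is the sketch below. The frequency list, TPS bucket edges, slack factor, and single-step nudging rule are all hypothetical; the source only specifies the loop periods and the signals each loop consumes.

```python
from bisect import bisect_right
from dataclasses import dataclass

FREQS_MHZ = [900, 1050, 1200, 1350, 1500]    # hypothetical supported SM clocks
TPS_BUCKET_EDGES = [500, 1000, 2000, 4000]   # hypothetical TPS bucket boundaries


@dataclass
class DecodeController:
    tbt_slo_ms: float
    slack: float = 0.8                 # step down only when tail TBT < slack * SLO
    idx: int = len(FREQS_MHZ) - 1      # start at the highest clock

    def coarse_update(self, tps: float) -> int:
        """Coarse loop (about every 200 ms): pick a base clock from the TPS bucket."""
        self.idx = bisect_right(TPS_BUCKET_EDGES, tps)
        return FREQS_MHZ[self.idx]

    def fine_update(self, tail_tbt_ms: float) -> int:
        """Fine loop (about every 20 ms): nudge the clock by one step."""
        if tail_tbt_ms > self.tbt_slo_ms:
            self.idx = min(self.idx + 1, len(FREQS_MHZ) - 1)
        elif tail_tbt_ms < self.slack * self.tbt_slo_ms:
            self.idx = max(self.idx - 1, 0)
        return FREQS_MHZ[self.idx]


ctrl = DecodeController(tbt_slo_ms=100.0)
f_base = ctrl.coarse_update(tps=800.0)   # bucket 1 -> 1050 MHz
f_up = ctrl.fine_update(120.0)           # SLO violated -> 1200 MHz
f_hold = ctrl.fine_update(90.0)          # within the slack band -> unchanged
f_down = ctrl.fine_update(50.0)          # ample headroom -> 1050 MHz
```

The slack band between the step-down threshold and the SLO is a design choice in this sketch that avoids oscillating between adjacent clocks when TBT hovers near the target.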
3. SLO-Aware Scheduling and Control Algorithms
Both GreenLLM variants operationalize SLO-aware optimization:
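- Scheduler for Disaggregated GreenLLM: Maintains two matrices, one holding carbon per token and one holding the fraction of requests that meet the SLO, indexed by configuration and workload bin. At runtime, it:
- Loads the metrics and completes missing entries using collaborative filtering.
- Identifies the feasible configurations, i.e., those whose SLO compliance meets the target.
- Selects the feasible configuration with the lowest carbon per token, falling back to the configuration with maximal SLO compliance if none is feasible.
- DVFS-Selection for Energy-GreenLLM: For prefill, an explicit optimization is solved per queue: it minimizes the energy of the queued prefill work over the candidate SM frequencies, subject to a busy-time constraint that keeps pending jobs within their deadlines. For decode, dual nested loops maintain TBT within the SLO through discrete frequency nudges.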
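The selection step of the disaggregated scheduler can be sketched as a lookup over profiled configurations for one workload bin. The configuration names, the profile values, and the 0.9 compliance threshold are hypothetical placeholders, and profile completion by collaborative filtering is assumed to have already happened.

```python
from typing import Dict, Tuple

# (carbon per token, fraction of requests meeting the SLO) for one workload bin
Profile = Tuple[float, float]


def select_config(profiles: Dict[str, Profile], min_compliance: float = 0.9) -> str:
    """Pick the lowest-carbon feasible configuration, else the most compliant one."""
    feasible = {name: p for name, p in profiles.items() if p[1] >= min_compliance}
    if feasible:
        return min(feasible, key=lambda name: feasible[name][0])
    # No configuration meets the target: fall back to maximal SLO compliance.
    return max(profiles, key=lambda name: profiles[name][1])


# Hypothetical profile values, for illustration only.
profiles = {
    "standalone_new": (1.00, 0.99),
    "disg_pref_decode": (0.70, 0.93),
    "disg_spec_decode": (0.65, 0.85),
}
chosen = select_config(profiles)
fallback = select_config(profiles, min_compliance=0.995)
```

Here `disg_spec_decode` has the lowest carbon but misses the compliance target, so the scheduler picks `disg_pref_decode`; with an unreachable target it falls back to `standalone_new`.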
4. Theoretical Analysis of Carbon and Energy Savings
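- Carbon Model for Disaggregated Serving (Shi et al., 2024):
- Operational Carbon: $C_{\mathrm{op},i} = E_i \cdot CI$ for GPU $i$ (energy $E_i$, carbon intensity $CI$),
- Embodied Carbon: $C_{\mathrm{em},i} = C_{\mathrm{manuf},i} \cdot t_i / LT_i$ (manufacturing carbon $C_{\mathrm{manuf},i}$ amortized over lifetime $LT_i$ and usage time $t_i$).
- Total Carbon per Request: $C_{\mathrm{total}} = \sum_i \left( C_{\mathrm{op},i} + C_{\mathrm{em},i} \right)$
- Savings Condition: Disaggregation strictly reduces carbon if
$C'_{\mathrm{op,new}} + C'_{\mathrm{em,new}} + C_{\mathrm{op,old}} + C_{\mathrm{em,old}} < C_{\mathrm{op,new}} + C_{\mathrm{em,new}}$
where primed variables are those for the new GPU under disaggregation.
- Savings Scale with Grid Intensity and Lifetime: Savings increase with higher grid carbon ($CI$), older GPU reuse ($LT_{\mathrm{old}}$ up), and shorter new GPU amortization ($LT_{\mathrm{new}}$ down).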
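As a worked illustration of this carbon accounting, the sketch below compares a standalone new-GPU request with a disaggregated one. Every number (energy, busy time, manufacturing carbon, lifetime, grid intensity) is a made-up placeholder, and the `GpuUsage` structure is hypothetical; only the shape of the calculation follows the model described above.

```python
from dataclasses import dataclass

SECONDS_PER_YEAR = 365 * 24 * 3600


@dataclass
class GpuUsage:
    energy_kwh: float       # energy drawn by this GPU for the request
    busy_seconds: float     # time the GPU is occupied by the request
    embodied_kgco2: float   # manufacturing carbon of the GPU
    lifetime_years: float   # amortization lifetime


def request_carbon_g(u: GpuUsage, ci_g_per_kwh: float) -> float:
    """Operational carbon plus embodied carbon amortized over busy time."""
    operational = u.energy_kwh * ci_g_per_kwh
    embodied = u.embodied_kgco2 * 1000 * u.busy_seconds / (u.lifetime_years * SECONDS_PER_YEAR)
    return operational + embodied


CI = 400.0  # gCO2/kWh, hypothetical grid intensity

# Hypothetical per-request usage figures.
standalone = request_carbon_g(GpuUsage(0.0005, 6.0, 150.0, 5.0), CI)
new_part = request_carbon_g(GpuUsage(0.0002, 1.0, 150.0, 5.0), CI)
old_part = request_carbon_g(GpuUsage(0.0002, 7.0, 100.0, 10.0), CI)
disaggregated = new_part + old_part
saves_carbon = disaggregated < standalone
```

With these placeholder inputs the disaggregated path emits less carbon than the standalone path; changing `CI` or the lifetimes shows how the comparison shifts with grid intensity and amortization.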
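- Energy Model for DVFS (Liu et al., 22 Aug 2025):
- For compute-bound prefill, latency is modeled as a function of prompt length and SM frequency, and energy as the power drawn at that frequency multiplied by the busy time. The optimal frequency is the minimal-energy frequency meeting all pending jobs' deadlines.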
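The idea of choosing the minimal-energy frequency that still meets every pending deadline can be sketched by enumerating candidate clocks. The latency coefficients, the cubic power curve, the frequency list, and FIFO processing are assumptions chosen for illustration; they are not the fitted models from the source.

```python
from typing import List, Optional, Tuple

F_MAX_MHZ = 1500.0
FREQS_MHZ = [900.0, 1050.0, 1200.0, 1350.0, 1500.0]  # hypothetical candidate clocks


def prefill_latency_s(prompt_len: int, f_mhz: float) -> float:
    """Assumed model: quadratic in prompt length, inversely proportional to clock."""
    base = 2e-8 * prompt_len**2 + 1e-4 * prompt_len + 0.01  # latency at F_MAX_MHZ
    return base * F_MAX_MHZ / f_mhz


def power_w(f_mhz: float) -> float:
    """Assumed model: static power plus a cubic dynamic term."""
    return 80.0 + 220.0 * (f_mhz / F_MAX_MHZ) ** 3


def pick_prefill_frequency(jobs: List[Tuple[int, float]]) -> Optional[float]:
    """jobs: (prompt_len, deadline in seconds from now), served in FIFO order.

    Returns the feasible clock with the lowest energy, or None if even the
    candidates cannot meet every deadline.
    """
    best: Optional[Tuple[float, float]] = None
    for f in FREQS_MHZ:
        busy = 0.0
        feasible = True
        for prompt_len, deadline in jobs:
            busy += prefill_latency_s(prompt_len, f)
            if busy > deadline:
                feasible = False
                break
        if not feasible:
            continue
        energy = power_w(f) * busy
        if best is None or energy < best[0]:
            best = (energy, f)
    return None if best is None else best[1]


relaxed = pick_prefill_frequency([(1000, 0.3), (2000, 0.8)])
tight = pick_prefill_frequency([(1000, 0.15), (2000, 0.5)])
impossible = pick_prefill_frequency([(1000, 0.1)])
```

With loose deadlines the lowest clock wins; tightening them pushes the choice to 1350 MHz, and when no clock is fast enough the function returns `None` so the caller can apply its own fallback.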
This suggests that programmer-level and operational control at the request and phase levels can be systematically leveraged for resource, energy, and carbon optimization in LLM serving.
5. Implementation
Disaggregation (Carbon-Focused) GreenLLM:
- Built atop vLLM (Shi et al., 2024), refactored to enable phase-split and spec-split plugin execution models.
- Profiler (Python + Shell) polls GPU power via pynvml at 200 ms intervals; utilizes REST for aggregation.
- Heterogeneous processing with Ray actors, NCCL and PyTorch nonblocking communication for overlap of compute and transfer.
- Scheduler in Python (~200 LOC) with collaborative filtering for profile completion (Paragon-style).
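The profiler's conversion from polled power readings to per-phase energy and carbon can be sketched as follows. The sample values and the 360 gCO2/kWh intensity are hypothetical, and trapezoidal integration is one reasonable choice rather than the documented method; in a live profiler the samples would come from a call such as `pynvml.nvmlDeviceGetPowerUsage`, which reports milliwatts.

```python
from typing import Sequence


def energy_joules(power_mw_samples: Sequence[float], interval_s: float = 0.2) -> float:
    """Trapezoidal integration of evenly spaced power samples given in milliwatts."""
    if len(power_mw_samples) < 2:
        return 0.0
    watts = [p / 1000.0 for p in power_mw_samples]
    return sum((a + b) / 2.0 * interval_s for a, b in zip(watts, watts[1:]))


def carbon_grams(energy_j: float, ci_g_per_kwh: float) -> float:
    """Convert joules to kWh (1 kWh = 3.6e6 J) and apply the grid carbon intensity."""
    return energy_j / 3.6e6 * ci_g_per_kwh


# Hypothetical samples taken every 200 ms during one phase.
samples_mw = [100_000, 200_000, 200_000, 100_000]
phase_energy = energy_joules(samples_mw)        # about 100 J
phase_carbon = carbon_grams(phase_energy, 360)  # about 0.01 g
```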
DVFS (Energy-Focused) GreenLLM:
- Integrated as a control plane over commodity GPU serving stacks (e.g., TensorRT-LLM).
- Implements length-based routing, per-phase SM frequency updates via NVML.
- Real-time phase-specific energy and latency profiling, convex modeling, enumerative frequency selection, and dual-level feedback control.
6. Empirical Evaluation and Reported Effectiveness
Carbon Reduction (Disaggregation; Shi et al., 2024)
- Up to 40.6% total carbon reduction (ShareGPT app, under ≥90% SLO compliance).
- Savings up to 27.9% at lowest regional grid carbon intensity (NCSW, 17 gCO2/kWh).
- Bandwidth ≥8 Gbps required for DPD (Disg-Pref-Decode) mode at appreciable QPS; at lower bandwidths or higher QPS, SpecDecode and DSD (Disg-Spec-Decode) dominate.
- Carbon savings scale with increasing old-GPU reuse (lifetime up to 10 years) and decrease for shortened new-GPU amortization (2 years).
Energy Reduction (DVFS; Liu et al., 22 Aug 2025)
- Up to 34% total energy reduction for Qwen3-14B on Alibaba and Azure workloads, with <3.5% additional SLO violations.
- At low/mid QPS and TPS, energy-optimal frequencies reduce prefill energy by 10–30% and decode energy by 8–25%.
- TTFT/TBT SLOs are met for ≥88–94% of requests, up to saturation.
General Observations
- Both frameworks maintain model-agnosticity, requiring no changes to LLM internals.
- Efficacy increases as operators can exploit request/phase heterogeneity, grid carbon knowledge, and hardware lifecycles.
- All results hold across several representative LLMs (Qwen3-14B, Llama-7B), workloads (chatbot, code gen, summarization), and system regimes.
7. Significance, Scope, and Extensions
GreenLLM frameworks demonstrate that SLO-aware, phase-specific resource management—either across heterogeneous hardware (disaggregation) or via dynamic device-level control (DVFS)—can realize substantial operational efficiency gains for LLM inference workloads. Their architectures present validated paths for carbon or energy abatement at scale without sacrificing service-level objectives. The analytical models provide quantitative bounds and inform principled scheduling policy. A plausible implication is that these approaches generalize to broader distributed DNN serving contexts, subject to extensions for emerging accelerator and networking substrates. The frameworks’ reliance on data-driven modeling, collaborative profile completion, and lightweight runtime controls positions them as practical solutions for sustainable, production-grade LLM deployment (Shi et al., 2024, Liu et al., 22 Aug 2025).