
GreenLLM: Sustainable LLM Serving

Updated 8 October 2025
  • GreenLLM is a comprehensive framework for energy and carbon efficient LLM serving using SLO-aware scheduling and heterogeneous GPU utilization.
  • It leverages disaggregated serving and dynamic, phase-specific frequency scaling to minimize both operational and embodied carbon emissions.
  • The system achieves up to 40.6% carbon savings and 34% energy reduction by balancing throughput, latency, and environmental impact.

GreenLLM refers to two distinct but closely related SLO-aware frameworks for energy- and carbon-efficient serving of LLMs, each targeting optimization across environmental and operational dimensions. The original GreenLLM framework (Shi et al., 29 Dec 2024) centers on minimizing overall carbon emissions by leveraging heterogeneous GPU fleets and computation disaggregation. The subsequent GreenLLM work (Liu et al., 22 Aug 2025) focuses on dynamic, phase-specific frequency scaling to minimize GPU energy without compromising latency guarantees. Both systems explicitly balance throughput, latency service-level objectives, and environmental impact through hardware-aware scheduling and control strategies.

1. Motivation and Problem Formulation

The widespread adoption of LLMs in production settings has resulted in significant environmental costs. Two primary sources of carbon emission are addressed: operational carbon—the emissions from power consumed during inference—and embodied carbon, incurred from the manufacturing and premature disposal of GPUs. Traditional LLM serving systems exclusively utilize high-performing, newer GPUs for all inference stages, which amplifies electronic waste and inflates embodied carbon, as older GPUs become underutilized or discarded.

The GreenLLM frameworks posit that reusing older, less power-efficient GPUs in tandem with new ones can substantially reduce overall carbon emissions, provided system-level performance objectives are maintained. Additionally, classic GPU governors treat LLM inference as a uniform task, failing to account for unique phase asymmetries—especially the quadratic scaling of the prefill phase with prompt length and the unpredictable, iterative nature of decoding—which leads to suboptimal voltage-frequency settings, energy waste, and service latency violations.

2. Framework Architectures and Key Components

Both versions of GreenLLM implement SLO-aware logic for minimizing environmental impact:

GreenLLM (Carbon Emissions Reduction) (Shi et al., 29 Dec 2024):

  • Disaggregated Serving System: Supports multiple configurations (Disg-Pref-Decode, Disg-Spec-Decode), allocating different computation phases or model components to GPUs of varying generations.
  • Profiler: Continuously measures latency, energy consumption, and carbon emissions per GPU/workload combination.
  • SLO-Aware Scheduler: Selects a serving configuration that minimizes total carbon cost while guaranteeing latency SLOs.
  • Carbon Cost Model:

C_{\text{req}} = C_{\text{req,e}} + C_{\text{req,o}} = \left(\frac{t_{\text{req}}}{LT} \cdot C_e\right) + \left(E_{\text{req}} \cdot CI\right)

where $t_{\text{req}}$ is the request execution time, $LT$ is the hardware lifetime, $C_e$ is the embodied carbon, $E_{\text{req}}$ is the energy consumed, and $CI$ is the carbon intensity of the power grid.
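
Concretely, the per-request cost can be evaluated directly from profiled quantities. The sketch below is a minimal illustration of the model above; the GPU lifetime, embodied-carbon figure, per-request energy, and grid carbon intensity are assumed placeholder values, not measurements from the paper.

```python
# Minimal sketch of the per-request carbon cost C_req = C_req,e + C_req,o.
# All numeric inputs below are illustrative assumptions, not paper measurements.

def request_carbon(t_req_s: float, lifetime_s: float, embodied_kg_co2: float,
                   energy_kwh: float, carbon_intensity_kg_per_kwh: float) -> float:
    """Total carbon (kg CO2e) attributed to one request."""
    embodied_share = (t_req_s / lifetime_s) * embodied_kg_co2       # C_req,e
    operational = energy_kwh * carbon_intensity_kg_per_kwh          # C_req,o
    return embodied_share + operational

# Hypothetical example: a 2 s request on a GPU with a 5-year lifetime,
# 150 kg CO2e embodied carbon, 0.0002 kWh consumed, on a 0.4 kg CO2e/kWh grid.
lifetime_s = 5 * 365 * 24 * 3600
print(request_carbon(2.0, lifetime_s, 150.0, 0.0002, 0.4))
```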

GreenLLM (Energy-Efficient Frequency Scaling) (Liu et al., 22 Aug 2025):

  • Adaptive Routing: Requests are partitioned into queues based on prompt length to alleviate head-of-line blocking.
  • Prefill Phase Controller: Uses short traces to fit latency-power-frequency models (e.g., $t_k^{\text{ref}} \approx aL_k^2 + bL_k + c$ and $P(f) \approx k_3 f^3 + k_2 f^2 + k_1 f + k_0$) and solves an energy-minimization problem under SLO constraints (see the sketch after this list):

E_{\text{total}}(f) = P(f) \cdot \text{busy}(f) + P_{\text{idle}} \cdot (D - \text{busy}(f)), \quad \text{subject to } \text{busy}(f) \leq D

  • Decode Phase Controller: Employs a dual-loop, token-throughput-tracking mechanism to adjust GPU frequency in fine-grained steps, maintaining the desired token-level latency (typically 95th percentile TBT within target bounds).
  • Explicit SLO Constraints: Both the prefill and decode controllers enforce latency targets on TTFT (time-to-first-token) and TBT (time-between-tokens).
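
The following sketch illustrates how a prefill-phase frequency choice of this kind could be made: fit coefficients for the quadratic latency model and cubic power model, then scan a frequency grid for the lowest predicted energy that still meets the TTFT budget. The coefficient values, idle power, frequency grid, and the assumption that prefill latency scales roughly inversely with clock frequency are illustrative simplifications, not details taken from the paper.

```python
import numpy as np

# Sketch of prefill-phase frequency selection under a TTFT budget. It assumes
# the quadratic latency fit t_ref(L) = a*L^2 + b*L + c at a reference clock
# f_ref, that latency scales roughly as f_ref/f at other clocks, and the cubic
# power model P(f) = k3*f^3 + k2*f^2 + k1*f + k0. All coefficients, the idle
# power, and the frequency grid are illustrative placeholders.

def pick_prefill_frequency(L, slo_ttft_s, freqs_ghz,
                           lat_coef=(2e-7, 1e-4, 0.02), f_ref=1.4,
                           pow_coef=(40.0, -20.0, 60.0, 50.0),
                           p_idle_w=60.0, window_s=1.0):
    a, b, c = lat_coef
    k3, k2, k1, k0 = pow_coef
    best_f, best_energy = None, float("inf")
    for f in freqs_ghz:
        t = (a * L**2 + b * L + c) * (f_ref / f)        # predicted prefill latency
        if t > slo_ttft_s or t > window_s:              # would violate the TTFT SLO
            continue
        p_busy = k3 * f**3 + k2 * f**2 + k1 * f + k0    # predicted active power (W)
        energy = p_busy * t + p_idle_w * (window_s - t) # E_total over the window (J)
        if energy < best_energy:
            best_f, best_energy = f, energy
    return best_f

# Usage: choose a clock for a 1024-token prompt with a 500 ms TTFT budget.
print(pick_prefill_frequency(L=1024, slo_ttft_s=0.5,
                             freqs_ghz=np.linspace(0.8, 1.4, 7)))
```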

3. Use Case Analysis and Disaggregation Strategies

Disaggregated Prefill and Decoding (Disg-Pref-Decode) (Shi et al., 29 Dec 2024):

  • Prefill (prompt processing) is compute-bound and latency-critical, thus runs on a new GPU (e.g., NVIDIA A100).
  • Decoding (token generation) is memory-bound; offloading to an older GPU (T4/V100) extends hardware lifetimes.
  • Requires high network bandwidth for large KV cache transfers.

Disaggregated Speculative Decoding (Disg-Spec-Decode) (Shi et al., 29 Dec 2024):

  • Draws on speculative decoding: a small draft model on the old GPU generates candidate tokens, verified by a larger target model on the new GPU.
  • Only token IDs and probability distributions are transferred, reducing communication overhead—especially beneficial when bandwidth is constrained.

Both strategies are evaluated for their efficiency in balancing operational versus embodied carbon, with the choice being contingent on application latency requirements and hardware resource profiles.
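
A plausible shape for this choice is sketched below: the scheduler profiles each candidate configuration and selects the lowest-carbon one that satisfies the latency SLO. The candidate names and the profiled latency/carbon numbers are hypothetical placeholders, not measurements from the paper.

```python
# Hypothetical sketch of SLO-aware configuration selection across serving modes.
# The candidate names and the profiled p99 latency / per-request carbon figures
# are illustrative placeholders.

candidates = [
    # (name, p99 latency in seconds, carbon per request in g CO2e)
    ("colocated-new-gpu",   0.90, 1.80),
    ("disg-prefill-decode", 1.10, 1.20),
    ("disg-spec-decode",    1.25, 1.05),
]

def select_config(profiles, latency_slo_s):
    feasible = [c for c in profiles if c[1] <= latency_slo_s]
    if not feasible:
        # No configuration meets the SLO: fall back to the fastest one.
        return min(profiles, key=lambda c: c[1])[0]
    return min(feasible, key=lambda c: c[2])[0]  # lowest carbon among SLO-feasible

print(select_config(candidates, latency_slo_s=1.2))  # -> "disg-prefill-decode"
```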

4. Theoretical Framework: Carbon Savings and Parameter Dependencies

Formal analysis (Shi et al., 29 Dec 2024) establishes when and how disaggregation yields carbon savings:

  • Operational Carbon Decrease Condition:

O_A - (O'_A + O_B) > 0

The operational carbon of the new GPU serving alone ($O_A$) must exceed the combined operational carbon of the split tasks ($O'_A + O_B$).

  • Carbon Intensity Sensitivity:

\frac{O'_A + E'_A + O_B + E_B}{O_A + E_A} = \frac{N'_A + N_B}{N_A} + \frac{E'_A + E_B - \left(\frac{N'_A + N_B}{N_A}\right) E_A}{N_A \alpha + E_A}

Higher grid carbon intensity ($\alpha$) amplifies the savings from energy-efficient allocation.

  • Hardware Lifetime Impact:

\text{Savings} \propto \left(\frac{t'_A}{T_A} - \frac{N'_A + N_B}{N_A} \cdot \frac{t_A}{T_A}\right)\mathcal{A} + \left(\frac{t_B}{T_B}\right)\mathcal{B}

Maximal savings arise when the new GPU has a short accumulated lifetime (high embodied carbon per request) and the reused GPU operates over a longer lifetime.

These results clarify that practical carbon reductions depend on energy usage, device lifetimes, and grid composition.
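
As a toy illustration of these dependencies, the arithmetic below plugs assumed per-request energies, execution times, lifetimes, and embodied-carbon figures into the per-request cost model from Section 2 and evaluates the disaggregation savings at two grid carbon intensities. Every number is a placeholder chosen only to show the direction of the effect, not a figure from either paper.

```python
# Toy arithmetic for the dependencies above, reusing the per-request cost model
# from Section 2. Every number (energies in kWh, embodied carbon in kg CO2e,
# lifetimes, execution times, grid intensities) is an assumed placeholder.

LIFE_NEW = 3 * 365 * 24 * 3600   # seconds of remaining lifetime, new GPU
LIFE_OLD = 6 * 365 * 24 * 3600   # seconds of extended lifetime, reused GPU

def per_request_carbon(t_s, lifetime_s, embodied_kg, energy_kwh, ci):
    return (t_s / lifetime_s) * embodied_kg + energy_kwh * ci

for ci in (0.1, 0.7):  # kg CO2e/kWh: low-carbon grid vs. coal-heavy grid
    solo  = per_request_carbon(2.0, LIFE_NEW, 150.0, 0.00050, ci)
    split = (per_request_carbon(0.6, LIFE_NEW, 150.0, 0.00018, ci)    # new GPU, prefill only
             + per_request_carbon(1.8, LIFE_OLD, 100.0, 0.00024, ci)) # reused GPU, decode
    print(f"CI={ci}: saved {1e3 * (solo - split):.4f} g CO2e per request")
```

Under these assumptions the absolute per-request saving grows with grid carbon intensity, consistent with the sensitivity relation above.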

5. SLO-Aware Scheduling and Frequency Control

GreenLLM incorporates SLO-aware scheduling to ensure latency targets while minimizing energy/carbon:

  • Prefill Phase:

Uses latency and power models to solve a constrained optimization for each request class, partitioned by prompt length.

  • Decode Phase:

Dual-loop controller periodically adapts the frequency band and fine-tunes it in small steps based on observed token throughput and TBT statistics (a sketch follows this list).

  • Queueing Mechanisms: Adaptive routing into length-based queues prevents head-of-line blocking.
  • Implementation Overheads: While the phase-aware adaptive controls introduce additional runtime complexity (profiling and decision logic), empirical results indicate that overheads are outweighed by reduction in energy/carbon per request.
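
A minimal sketch of what such a dual-loop decode controller could look like is given below: an outer loop re-centers the frequency band from observed p95 TBT, while an inner loop nudges the clock in fine steps toward a token-throughput target. The frequency bounds, step sizes, slack threshold, and controller interface are assumptions for illustration; the papers' actual controller details may differ.

```python
# Hedged sketch of a dual-loop decode-phase frequency controller. The frequency
# bounds, step sizes, slack threshold, and controller interface are assumptions
# for illustration only.

class DecodeFreqController:
    def __init__(self, f_min_mhz=800, f_max_mhz=1400, coarse_step=100, fine_step=15):
        self.f = f_max_mhz
        self.f_min, self.f_max = f_min_mhz, f_max_mhz
        self.coarse, self.fine = coarse_step, fine_step

    def outer_loop(self, p95_tbt_ms, tbt_slo_ms):
        """Periodically re-center the frequency band from tail (p95) TBT latency."""
        if p95_tbt_ms > tbt_slo_ms:            # tail latency violated: raise the band
            self.f = min(self.f + self.coarse, self.f_max)
        elif p95_tbt_ms < 0.8 * tbt_slo_ms:    # ample slack: lower the band to save energy
            self.f = max(self.f - self.coarse, self.f_min)
        return self.f

    def inner_loop(self, tokens_per_s, target_tokens_per_s):
        """Between outer updates, nudge the clock toward the token-throughput target."""
        if tokens_per_s < target_tokens_per_s:
            self.f = min(self.f + self.fine, self.f_max)
        else:
            self.f = max(self.f - self.fine, self.f_min)
        return self.f

# Usage sketch: feed observed statistics each control interval, then apply the
# returned clock through the platform's DVFS interface (e.g., NVML on NVIDIA GPUs).
ctrl = DecodeFreqController()
print(ctrl.outer_loop(p95_tbt_ms=38.0, tbt_slo_ms=50.0))             # slack -> lower band (1300)
print(ctrl.inner_loop(tokens_per_s=95.0, target_tokens_per_s=100.0)) # throughput low -> nudge up (1315)
```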

6. Experimental Evaluation and Performance Metrics

GreenLLM has undergone extensive evaluation across a range of applications (chatbots, code assistants, document summarization) and hardware types (A100, V100, T4) (Shi et al., 29 Dec 2024, Liu et al., 22 Aug 2025):

  • Carbon Emissions Reduction: Up to 40.6% lower overall carbon emissions compared to serving solely on new GPUs; savings manifest in both operational and embodied carbon components.
  • Energy Savings: Up to 34% total GPU energy reduction compared with the default DVFS baseline under Alibaba and Azure production traces.
  • SLO Compliance: Latency SLOs are met for more than 90% of requests; TTFT and TBT remain within target bounds, with less than a 3.5% increase in deadline violations.
  • Bandwidth Sensitivity: Speculative decoding configuration demonstrates robust carbon savings even at constrained network bandwidth.
  • Scaling Implications: As SLO slack increases, additional energy/carbon reductions are attainable at moderate latency penalty—a fundamental energy-latency tradeoff.

7. Future Directions and System Extensibility

GreenLLM's principles extend to broader contexts and future hardware:

  • Multi-node Scheduling: Scaling SLO-constrained, phase-aware adaptation to distributed clusters and heterogeneous fleets.
  • Modeling Refinements: Enhanced latency-power prediction (potentially via ML-based regime recognition) to track dynamic workload characteristics.
  • Integration: Combining the scheduling approach with queueing/batching algorithms, or extending frameworks such as Splitwise and DynamoLLM.
  • Generality: The methodology applies not only to LLM serving but also to any workload with heterogeneous compute and memory phases, subject to SLO requirements.

GreenLLM's frameworks thus represent a comprehensive and modular approach to sustainable AI inference, directly linking workload scheduling, hardware reuse, and dynamic frequency control with quantifiable improvements in carbon and energy efficiency.
