
How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference (2505.09598v2)

Published 14 May 2025 in cs.CY and cs.AI

Abstract: This paper introduces a novel infrastructure-aware benchmarking framework for quantifying the environmental footprint of LLM inference across 30 state-of-the-art models as deployed in commercial data centers. Our framework combines public API performance data with region-specific environmental multipliers and statistical inference of hardware configurations. We additionally utilize cross-efficiency Data Envelopment Analysis (DEA) to rank models by performance relative to environmental cost. Our results show that o3 and DeepSeek-R1 emerge as the most energy-intensive models, consuming over 33 Wh per long prompt, more than 70 times the consumption of GPT-4.1 nano, and that Claude-3.7 Sonnet ranks highest in eco-efficiency. While a single short GPT-4o query consumes 0.43 Wh, scaling this to 700 million queries/day results in substantial annual environmental impacts. These include electricity use comparable to 35,000 U.S. homes, freshwater evaporation matching the annual drinking needs of 1.2 million people, and carbon emissions requiring a Chicago-sized forest to offset. These findings illustrate a growing paradox: Although AI is becoming cheaper and faster, its global adoption drives disproportionate resource consumption. Our study provides a standardized, empirically grounded methodology for benchmarking the sustainability of LLM deployments, laying a foundation for future environmental accountability in AI development and sustainability standards.

This paper, "How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference" (Jegham et al., 14 May 2025 ), addresses the critical need for a standardized methodology to quantify the environmental footprint of LLM inference at the per-query level, particularly for models deployed in commercial data centers. While past research has focused on the environmental costs of LLM training, inference is becoming the dominant contributor due to its continuous, large-scale nature. Existing benchmarking methods are often limited to training, lack real-time granularity for inference, are restricted to local setups, or cannot benchmark proprietary models, largely due to the opacity of commercial AI providers regarding model-specific inference data.

To overcome these limitations, the authors introduce a novel infrastructure-aware benchmarking framework. This framework integrates several data sources:

  1. Performance Metrics: Latency and tokens-per-second (TPS) data for 30 state-of-the-art models (including proprietary models from OpenAI and Anthropic as well as open-source models such as LLaMA and DeepSeek) are obtained from public API performance evaluations under standardized short, medium, and long prompt configurations.
  2. Hardware Specifications: Published GPU and system power specifications for typical data center hardware like NVIDIA DGX systems (A100, H100, H200, H800) are incorporated.
  3. Statistical Inference: Two-way ANOVA and Tukey HSD analysis are used to statistically estimate the underlying hardware configurations for models where this information is not disclosed (e.g., attributing GPT-4, GPT-4 Turbo, and GPT-4o mini to A100 systems based on performance comparisons against H100/H200 deployments); a minimal sketch of this kind of analysis follows the list.
  4. Environmental Multipliers: Region-specific data center overheads are accounted for using standard multipliers:
    • Power Usage Effectiveness (PUE): Ratio of total data center energy to IT energy.
    • Water Usage Effectiveness (WUE): Water used per kWh of IT energy, considering both on-site cooling (site WUE) and water embedded in electricity generation (source WUE). The paper focuses on water consumption (evaporation).
    • Carbon Intensity Factor (CIF): Carbon emissions (kg CO$_2$e) per kWh, reflecting the regional electricity mix (focusing on Scope 2 emissions from purchased electricity).
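
The statistical-inference step (item 3 above) can be illustrated with standard tools. The snippet below is not the authors' exact pipeline: it uses hypothetical throughput observations and off-the-shelf statsmodels routines to show how a two-way ANOVA followed by a Tukey HSD post-hoc test could indicate whether TPS differences across models are consistent with different underlying hardware.

```python
# Illustrative only: hypothetical TPS observations, not the authors' data or pipeline.
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical throughput measurements per (hardware, prompt length) cell.
df = pd.DataFrame({
    "tps":          [60, 62, 58, 41, 43, 40, 90, 88, 92, 70, 72, 69],
    "hardware":     ["A100"] * 6 + ["H100"] * 6,
    "prompt_class": (["short"] * 3 + ["long"] * 3) * 2,
})

# Two-way ANOVA: do hardware, prompt length, and their interaction explain TPS?
fit = ols("tps ~ C(hardware) * C(prompt_class)", data=df).fit()
print(anova_lm(fit, typ=2))

# Tukey HSD post-hoc comparison across hardware groups.
print(pairwise_tukeyhsd(endog=df["tps"], groups=df["hardware"]))
```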

The core of the framework is a formula to estimate per-query energy consumption, $E_{\text{query}}$, in kWh:

$$E_{\text{query}}\ (\text{kWh}) = \left( \frac{\text{Output Length}}{\text{TPS}} + \text{Latency} \right) \times \frac{1}{3600} \times \left( P_{\text{GPU}} \times U_{\text{GPU total}} + P_{\text{non-GPU}} \times U_{\text{non-GPU total}} \right) \times \text{PUE}$$

where $P_{\text{GPU}}$ and $P_{\text{non-GPU}}$ are maximum rated powers, $U_{\text{GPU total}}$ and $U_{\text{non-GPU total}}$ are total power utilization fractions based on assigned GPUs, node size, and batch size, and the first term gives the total inference time in hours. Water consumption (L) is then calculated as $E_{\text{query}} \cdot \text{PUE} \cdot \text{WUE}_{\text{site}} + E_{\text{query}} \cdot \text{WUE}_{\text{source}}$, and carbon emissions (kg CO$_2$e) as $E_{\text{query}} \cdot \text{CIF}$.
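
As a concrete illustration, the sketch below transcribes these formulas into a small calculator. The function follows the equations as stated; every numeric value in the example call (token count, throughput, power draws, utilization fractions, PUE, WUE, CIF) is a hypothetical placeholder rather than one of the paper's calibrated figures.

```python
# Direct transcription of the per-query formulas above; all numeric inputs in the
# example call are hypothetical placeholders, not the paper's calibrated values.

def query_footprint(output_tokens, tps, latency_s,
                    p_gpu_w, u_gpu_total, p_non_gpu_w, u_non_gpu_total,
                    pue, wue_site, wue_source, cif):
    """Return (energy_kwh, water_l, carbon_kgco2e) for one query."""
    # Total inference time in hours: generation time plus latency.
    hours = (output_tokens / tps + latency_s) / 3600.0
    # Utilization-scaled IT power (W), then data-center overhead via PUE.
    it_power_w = p_gpu_w * u_gpu_total + p_non_gpu_w * u_non_gpu_total
    energy_kwh = hours * it_power_w / 1000.0 * pue
    # Water: on-site cooling plus water embedded in electricity generation.
    water_l = energy_kwh * pue * wue_site + energy_kwh * wue_source
    # Carbon: regional grid carbon intensity (Scope 2).
    carbon_kg = energy_kwh * cif
    return energy_kwh, water_l, carbon_kg

e, w, c = query_footprint(
    output_tokens=300, tps=75, latency_s=0.5,  # medium-length response (placeholder)
    p_gpu_w=5600, u_gpu_total=0.10,            # e.g. 8 x 700 W GPUs, 10% attributed to this query
    p_non_gpu_w=2000, u_non_gpu_total=0.10,    # hypothetical non-GPU node power and share
    pue=1.2, wue_site=0.3, wue_source=1.0, cif=0.4,
)
print(f"{e * 1000:.2f} Wh, {w * 1000:.0f} mL water, {c * 1000:.2f} g CO2e")
```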

The paper benchmarks 30 models across short, medium, and long prompts. Key findings include:

  • Energy Consumption: Significant variations exist. GPT-4.1 nano is highly efficient (0.454 Wh for long prompts), while models like o3 (39.223 Wh), DeepSeek-R1 (33.634 Wh), and GPT-4.5 (30.495 Wh) consume substantially more, over 70 times that of GPT-4.1 nano for long prompts. Deployment infrastructure significantly impacts energy; GPT-4o mini uses more energy than GPT-4o on long queries due to its presumed A100 deployment versus GPT-4o's H100/H200 deployment.
  • Water and Carbon Emissions: These metrics generally follow energy consumption trends, but are also heavily influenced by the regional environmental multipliers of the data center location. DeepSeek models, deployed in China, exhibit higher emissions and water use partly due to regional grid intensity and data center efficiencies. DeepSeek-R1 can emit over 14 grams of CO$_2$e and consume over 150 mL of water per query.

To contextualize environmental impact relative to performance, the authors apply cross-efficiency Data Envelopment Analysis (DEA), using environmental factors as inputs and a composite AI Index score (reflecting reasoning, math, and coding abilities) as output. This analysis revealed that eco-efficiency depends on both strong performance and low environmental cost. Claude-3.7 Sonnet emerged as the most eco-efficient (0.886 score), followed by OpenAI's smaller reasoning models like o4-mini (high) (0.867). DeepSeek-R1 and DeepSeek-V3 had the lowest scores, highlighting their infrastructural inefficiencies relative to their capability.
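
Cross-efficiency DEA itself can be sketched as a loop of small linear programs. The snippet below implements a plain (unweighted) cross-efficiency scheme with a single output: each model's optimal weights come from the standard CCR multiplier LP, and every model is then re-scored under every other model's weights. The input/output matrix is hypothetical; it does not reproduce the paper's measured energy, water, carbon, or AI Index values.

```python
# Minimal cross-efficiency DEA sketch: m environmental inputs, one output (capability score).
# Hypothetical data; not the paper's measured values.
import numpy as np
from scipy.optimize import linprog

# Rows = models (DMUs); columns = inputs [energy_wh, water_ml, carbon_g].
X = np.array([[0.5,   2.0,  0.3],
              [5.0,  20.0,  3.0],
              [33.0, 150.0, 14.0]])
# Single output per model: composite capability score.
y = np.array([40.0, 60.0, 70.0])
n, m = X.shape

def ccr_weights(o):
    """Solve the input-oriented CCR multiplier LP for DMU o.

    Variables z = [u, v_1..v_m]; maximize u*y_o subject to
    v.x_o = 1 and u*y_j - v.x_j <= 0 for all j, with u, v >= 0.
    """
    c = np.concatenate(([-y[o]], np.zeros(m)))           # linprog minimizes
    A_ub = np.column_stack((y, -X))                       # u*y_j - v.x_j <= 0
    b_ub = np.zeros(n)
    A_eq = np.concatenate(([0.0], X[o])).reshape(1, -1)   # v.x_o = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (m + 1), method="highs")
    return res.x[0], res.x[1:]

# Cross-efficiency: score every DMU k under every DMU o's optimal weights.
E = np.zeros((n, n))
for o in range(n):
    u, v = ccr_weights(o)
    E[o] = (u * y) / (X @ v)
cross_eff = E.mean(axis=0)   # peer-appraised efficiency per model
print(cross_eff)
```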

A significant practical application of the framework is a case study estimating the annual environmental footprint of GPT-4o inference at scale in 2025. Assuming 700 million daily queries to GPT-4o and a modest growth rate, the paper projects (see the scaling sketch after this list):

  • Annual energy consumption between 391,509 MWh and 463,269 MWh, comparable to the total electricity use of 35,000 U.S. homes.
  • Annual water consumption (evaporation) between 1,334,991 kL and 1,579,680 kL, equivalent to over 500 Olympic-sized swimming pools or the annual drinking needs of nearly 1.2 million people.
  • Annual carbon emissions between 138,125 and 163,441 tons of CO$_2$e, comparable to the emissions from 30,000 gasoline cars or requiring a Chicago-sized forest to offset.
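
These aggregate figures come from scaling per-query costs to annual query volume. The back-of-envelope sketch below shows the arithmetic; the per-query values and growth factor are placeholders, so it will not reproduce the paper's exact range, which reflects the authors' prompt-mix and growth assumptions.

```python
# Back-of-envelope annual scaling; per-query values and growth factor are placeholders.
queries_per_day = 700e6        # assumed daily GPT-4o query volume
growth_factor   = 1.15         # hypothetical modest growth over the year

per_query_wh    = 0.43         # short-query energy figure from the abstract (Wh)
per_query_ml    = 1.5          # hypothetical water per query (mL)
per_query_g_co2 = 0.15         # hypothetical carbon per query (g CO2e)

queries_per_year = queries_per_day * 365 * growth_factor

energy_mwh = queries_per_year * per_query_wh / 1e6      # Wh  -> MWh
water_kl   = queries_per_year * per_query_ml / 1e6      # mL  -> kL
carbon_t   = queries_per_year * per_query_g_co2 / 1e6   # g   -> metric tons

print(f"{energy_mwh:,.0f} MWh, {water_kl:,.0f} kL water, {carbon_t:,.0f} t CO2e")
```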

These results underscore a crucial point for practitioners: even low per-query costs accumulate to massive aggregate impacts due to the sheer scale of LLM adoption, illustrating the Jevons Paradox. Increased efficiency per task drives total usage, potentially increasing overall resource consumption.

For developers and engineers, the paper highlights the importance of considering the deployment infrastructure and model choice beyond just performance. Choosing models known for higher efficiency (like GPT-4.1 nano or Claude-3.7 Sonnet) or deploying on more efficient hardware with lower PUE, WUE, and CIF can significantly reduce the environmental footprint. The paper also implicitly points to optimization strategies like dynamic batching, which can reduce per-query energy by increasing hardware utilization, though this involves trade-offs with latency (discussed in Appendix B).
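
To illustrate the batching intuition, the toy sketch below amortizes a node's power draw across the queries served concurrently; this is only an illustration of the trade-off, not the batching model discussed in the paper's Appendix B.

```python
# Toy amortization of node power across concurrent queries; illustrative only.
node_power_kw = 10.2     # hypothetical fully loaded 8-GPU node (kW)
pue = 1.2

for i, batch_size in enumerate((1, 4, 16, 64)):
    batch_time_s = 4.0 * (1.10 ** i)   # crude 10% latency penalty per 4x batch growth
    node_energy_kwh = node_power_kw * batch_time_s / 3600 * pue
    per_query_wh = node_energy_kwh / batch_size * 1000
    print(f"batch={batch_size:>3}  latency={batch_time_s:.2f}s  "
          f"per-query energy={per_query_wh:.2f} Wh")
```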

The authors note limitations, including conservatively excluding idle GPU power, estimating non-GPU power and hardware for proprietary models, and relying on regional averages for environmental multipliers when facility-specific data is unavailable. Future work should leverage more detailed telemetry and facility-level reporting, and extend analysis to other modalities like image or video generation.

In conclusion, this paper provides a valuable, empirically grounded framework for benchmarking the environmental cost of LLM inference, revealing that infrastructure plays a critical role alongside model architecture. The paper serves as a foundational step towards enabling infrastructure-aware decision-making, enhancing accountability, and developing sustainability standards in AI deployment.

Authors (4)
  1. Nidhal Jegham
  2. Marwen Abdelatti
  3. Lassad Elmoubarki
  4. Abdeltawab Hendawi