
Measuring the environmental impact of delivering AI at Google Scale (2508.15734v1)

Published 21 Aug 2025 in cs.AI

Abstract: The transformative power of AI is undeniable - but as user adoption accelerates, so does the need to understand and mitigate the environmental impact of AI serving. However, no studies have measured AI serving environmental metrics in a production environment. This paper addresses this gap by proposing and executing a comprehensive methodology for measuring the energy usage, carbon emissions, and water consumption of AI inference workloads in a large-scale, AI production environment. Our approach accounts for the full stack of AI serving infrastructure - including active AI accelerator power, host system energy, idle machine capacity, and data center energy overhead. Through detailed instrumentation of Google's AI infrastructure for serving the Gemini AI assistant, we find the median Gemini Apps text prompt consumes 0.24 Wh of energy - a figure substantially lower than many public estimates. We also show that Google's software efficiency efforts and clean energy procurement have driven a 33x reduction in energy consumption and a 44x reduction in carbon footprint for the median Gemini Apps text prompt over one year. We identify that the median Gemini Apps text prompt uses less energy than watching nine seconds of television (0.24 Wh) and consumes the equivalent of five drops of water (0.26 mL). While these impacts are low compared to other daily activities, reducing the environmental impact of AI serving continues to warrant important attention. Towards this objective, we propose that a comprehensive measurement of AI serving environmental metrics is critical for accurately comparing models, and to properly incentivize efficiency gains across the full AI serving stack.


Summary

  • The paper introduces a full-stack measurement method that captures energy consumption, carbon emissions, and water usage across the entire AI serving infrastructure.
  • It leverages internal telemetry to include active accelerators, idle machines, and data center overhead, revealing a 33x energy and 44x emissions reduction over time.
  • Empirical results from Gemini Apps highlight significant efficiency gains driven by hardware-software optimizations and aggressive clean energy procurement.

Comprehensive Measurement of AI Serving Environmental Impact at Google Scale

Introduction

This paper presents a rigorous, production-scale methodology for quantifying the environmental impact of AI inference, specifically focusing on energy consumption, carbon emissions, and water usage associated with serving LLMs at Google. The work addresses the lack of first-party, in-situ data from major AI providers and proposes a comprehensive measurement boundary that encompasses all material energy sources under operational control, including active AI accelerators, host CPU/DRAM, idle machine capacity, and data center overhead. The methodology is applied to Google's Gemini Apps product, yielding detailed metrics and revealing substantial efficiency improvements over time.

Measurement Boundaries and Methodological Advances

A central contribution is the explicit definition and justification of the measurement boundary for AI inference energy. Previous studies have typically restricted measurement to active AI accelerator energy, omitting host system power, idle capacity, and data center overhead. This paper argues that such omissions lead to significant underestimation and misallocation of environmental impact, impeding both transparency and optimization efforts.

Figure 1: Existing and proposed boundaries for AI inference energy measurements. The comprehensive approach includes all components of the serving stack, not just active accelerators.

The comprehensive approach leverages internal telemetry to capture real-time power draw across the entire serving stack, mapping LLM jobs to machine IDs and aggregating energy consumption over both active and idle periods. Overhead energy is incorporated via the Power Usage Effectiveness (PUE) metric, and exclusions are transparently documented (e.g., external networking, end-user devices, and training energy).
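The accounting described above can be sketched as a simple calculation: sum the IT-level energy components, scale by PUE to capture data center overhead, and divide by prompt count. This is a minimal illustration with made-up telemetry values, not the paper's internal pipeline; only the PUE of 1.09 comes from the paper.

```python
def energy_per_prompt_wh(accel_wh: float, host_wh: float,
                         idle_wh: float, pue: float,
                         num_prompts: int) -> float:
    """Full-stack energy per prompt: IT energy scaled by data center PUE."""
    it_energy_wh = accel_wh + host_wh + idle_wh
    return it_energy_wh * pue / num_prompts

# Hypothetical fleet slice serving one million prompts in an interval.
total = energy_per_prompt_wh(
    accel_wh=150_000,   # active AI accelerator energy (illustrative)
    host_wh=60_000,     # host CPU/DRAM energy (illustrative)
    idle_wh=25_000,     # idle reserved-machine energy (illustrative)
    pue=1.09,           # fleet-wide PUE reported in the paper
    num_prompts=1_000_000,
)
```

The key point the boundary argument makes is that dropping any of the three IT terms, or the PUE multiplier, silently shrinks the reported per-prompt figure.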

Empirical Results: Energy, Emissions, and Water Consumption

Applying the methodology to Gemini Apps, the median text prompt in May 2025 consumed 0.24 Wh of energy, generated 0.03 gCO2e, and used 0.26 mL of water. These values are substantially lower than many public estimates and prior measurements, which often report per-prompt energy consumption an order of magnitude higher due to narrower boundaries and less efficient serving conditions.

Figure 2: Energy per prompt for large production AI models versus LMArena score, illustrating wide variability in published metrics and the impact of measurement boundary.
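As a sanity check, the abstract's "nine seconds of television" comparison can be reproduced with simple arithmetic; the television wattage here is our assumption, not a figure from the paper.

```python
# Assuming a ~100 W television (our assumption), how long can it run
# on the energy of one median Gemini Apps text prompt?
prompt_wh = 0.24        # median prompt energy reported in the paper
tv_watts = 100.0        # assumed TV power draw
seconds_of_tv = prompt_wh / tv_watts * 3600  # Wh -> W, then hours -> seconds
```

This yields roughly 8.6 seconds, consistent with the paper's "nine seconds" framing.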

The breakdown of energy consumption per prompt is as follows: active AI accelerators (58%), host CPU/DRAM (25%), idle machines (10%), and data center overhead (8%).

Figure 3: Components of total LLM energy consumption per prompt across the production serving stack, based on the comprehensive measurement approach.
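Applying the reported shares to the 0.24 Wh median prompt gives absolute per-component energy. Note the published percentages sum to 101% due to rounding; this sketch uses them as-is.

```python
# Per-prompt energy split, using the paper's reported shares and the
# 0.24 Wh median prompt figure (shares sum to ~101% from rounding).
shares = {
    "active_accelerators": 0.58,
    "host_cpu_dram": 0.25,
    "idle_machines": 0.10,
    "dc_overhead": 0.08,
}
median_prompt_wh = 0.24
breakdown_wh = {k: v * median_prompt_wh for k, v in shares.items()}
```

Roughly 0.14 Wh of the median prompt is attributable to the active accelerators themselves; the remainder, over 40%, would be invisible under the narrower accelerator-only boundary.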

Emissions are calculated using market-based (MB) grid emission factors and include both operational (Scope 2) and embodied (Scope 1+3) contributions. Water consumption is normalized via the Water Usage Effectiveness (WUE) metric, with Google's data centers averaging 1.15 L/kWh for consumptive use.
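The emissions and water normalizations described above reduce to multiplying per-prompt energy by a grid emission factor and by WUE. The WUE of 1.15 L/kWh is from the paper; the emission factor below is an illustrative assumption chosen to roughly reproduce the reported 0.03 gCO2e.

```python
def prompt_footprint(energy_wh: float,
                     grid_gco2e_per_kwh: float,
                     wue_l_per_kwh: float) -> dict:
    """Normalize per-prompt energy into emissions and water consumption."""
    kwh = energy_wh / 1000.0
    return {
        "gco2e": kwh * grid_gco2e_per_kwh,          # market-based operational emissions
        "water_ml": kwh * wue_l_per_kwh * 1000.0,   # consumptive water, L -> mL
    }

# 0.24 Wh median prompt; emission factor of 125 gCO2e/kWh is an assumption.
fp = prompt_footprint(0.24, grid_gco2e_per_kwh=125.0, wue_l_per_kwh=1.15)
```

This simplified sketch covers only the operational (Scope 2 MB) term; the paper's full accounting also folds in embodied (Scope 1+3) emissions.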

A notable finding is the dramatic improvement in environmental efficiency over a 12-month period. The median Gemini Apps prompt saw a 33x reduction in energy consumption and a 44x reduction in total emissions, driven by software optimizations, improved hardware utilization, clean energy procurement, and reductions in embodied emissions.

Figure 4: Median Gemini Apps text prompt emissions over time, showing a 47x reduction in Scope 2 MB emissions and a 36x reduction in Scope 1+3 emissions per prompt.

Key contributors to these gains include:

  • Model architecture improvements (e.g., Mixture-of-Experts, efficient attention mechanisms)
  • Quantization and distillation for serving-optimized models
  • Speculative decoding and batch inference for higher hardware utilization
  • Custom TPU hardware co-designed for energy efficiency
  • Dynamic resource allocation to minimize idle energy
  • Ultra-efficient data centers (fleet-wide PUE of 1.09)
  • Aggressive clean energy procurement (30% reduction in MB emissions factor from 2023 to 2024)

Implications and Future Directions

The results demonstrate that comprehensive, in-situ measurement yields more accurate and actionable environmental metrics for AI serving. The findings challenge prevailing estimates and highlight the necessity of standardized measurement boundaries for cross-provider comparability and effective optimization. The methodology enables identification of efficiency hotspots and incentivizes improvements across the full serving stack, not just model or hardware design.

Theoretical implications include the need to revisit lifecycle analyses of AI systems, incorporating operational realities of large-scale deployment. Practically, the approach provides a template for other providers to adopt, facilitating industry-wide transparency and accountability. Future work should extend comprehensive measurement to training workloads, integrate site-specific normalization for water and emissions, and explore further hardware-software co-design for environmental optimization.

Conclusion

This paper establishes a comprehensive framework for measuring the environmental impact of AI inference at production scale, revealing that existing, narrow approaches substantially underestimate true costs. For Gemini Apps, the median prompt's energy, emissions, and water usage are lower than previously reported, attributable to efficient serving practices and holistic measurement. The work underscores the importance of standardized, full-stack metrics for driving environmental efficiency in AI deployment and advocates for their widespread adoption as AI continues to scale globally.
