Measuring the environmental impact of delivering AI at Google Scale (2508.15734v1)

Published 21 Aug 2025 in cs.AI

Abstract: The transformative power of AI is undeniable - but as user adoption accelerates, so does the need to understand and mitigate the environmental impact of AI serving. However, no studies have measured AI serving environmental metrics in a production environment. This paper addresses this gap by proposing and executing a comprehensive methodology for measuring the energy usage, carbon emissions, and water consumption of AI inference workloads in a large-scale, AI production environment. Our approach accounts for the full stack of AI serving infrastructure - including active AI accelerator power, host system energy, idle machine capacity, and data center energy overhead. Through detailed instrumentation of Google's AI infrastructure for serving the Gemini AI assistant, we find the median Gemini Apps text prompt consumes 0.24 Wh of energy - a figure substantially lower than many public estimates. We also show that Google's software efficiency efforts and clean energy procurement have driven a 33x reduction in energy consumption and a 44x reduction in carbon footprint for the median Gemini Apps text prompt over one year. We identify that the median Gemini Apps text prompt uses less energy than watching nine seconds of television (0.24 Wh) and consumes the equivalent of five drops of water (0.26 mL). While these impacts are low compared to other daily activities, reducing the environmental impact of AI serving continues to warrant important attention. Towards this objective, we propose that a comprehensive measurement of AI serving environmental metrics is critical for accurately comparing models, and to properly incentivize efficiency gains across the full AI serving stack.

Summary

  • The paper establishes a comprehensive, production-scale framework for measuring AI inference's environmental impact, including energy, emissions, and water use.
  • It integrates energy data from AI accelerators, host CPU/DRAM, idle machine capacity, and data center overhead, showing that narrower, accelerator-only measurements underestimate per-prompt energy by 2.4x.
  • Empirical findings demonstrate a 44x reduction in per-prompt emissions and a 33x reduction in per-prompt energy over one year, achieved through model, hardware, and software optimizations alongside clean energy procurement.

Comprehensive Environmental Impact Assessment of AI Inference at Google Scale

Introduction

The rapid proliferation of large-scale AI systems, particularly LLMs, has shifted the focus of environmental impact analysis from training to inference. As AI products serve billions of prompts globally, quantifying and mitigating the energy, carbon, and water footprint of inference is critical for both operational sustainability and policy development. This paper presents a rigorous, production-scale methodology for measuring the environmental impact of AI inference at Google, with a focus on the Gemini Apps product. The paper introduces a comprehensive measurement boundary, empirically quantifies per-prompt energy, emissions, and water consumption, and demonstrates the substantial efficiency gains achieved through full-stack optimizations.

Measurement Boundaries and Methodological Advances

A central contribution of this work is the explicit definition and operationalization of a comprehensive measurement boundary for AI inference energy accounting. Prior studies have typically restricted measurement to the active AI accelerator, omitting host CPU/DRAM, idle capacity, and data center overhead. This narrow focus leads to significant underestimation and poor comparability across studies.

The proposed methodology expands the boundary to include:

  • Active AI Accelerator energy: Direct measurement of all accelerators involved in inference.
  • Active CPU/DRAM energy: Host system energy required for accelerator operation.
  • Idle Machine energy: Energy consumed by provisioned-but-idle capacity, maintained to ensure reliability and low latency.
  • Overhead energy: Data center infrastructure overhead (cooling, power distribution), normalized via PUE (power usage effectiveness).

Figure 1: Existing and proposed boundaries for AI inference energy measurements, highlighting the inclusion of all serving stack components in the comprehensive approach.

This boundary is operationalized through internal telemetry, mapping LLM jobs to machine IDs and collecting PSU-level power data. The methodology also transparently excludes external networking, end-user devices, and training energy, focusing strictly on inference.
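
To make the boundary concrete, the following minimal Python sketch composes the four components into a per-prompt figure. The function name and component values are illustrative placeholders, not Google's internal telemetry API or measured data; they are chosen so the totals line up with the figures quoted below, and the PUE of 1.09 is an assumption based on Google's publicly reported fleet average.

    def per_prompt_energy_wh(accel_wh, host_wh, idle_wh, pue):
        """Comprehensive boundary: active accelerator + host CPU/DRAM +
        provisioned-but-idle capacity, all scaled by data center PUE."""
        return (accel_wh + host_wh + idle_wh) * pue

    # Illustrative per-prompt components in Wh (placeholders, not
    # measured data), with an assumed fleet PUE of 1.09.
    accel, host, idle, pue = 0.14, 0.06, 0.02, 1.09

    total = per_prompt_energy_wh(accel, host, idle, pue)
    print(f"comprehensive: {total:.2f} Wh/prompt")             # -> 0.24
    print(f"accelerator share of total: {accel / total:.0%}")  # -> 58%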

Empirical Results: Energy, Emissions, and Water Consumption

Applying the comprehensive methodology to Gemini Apps, the median text prompt in May 2025 consumed 0.24 Wh of energy, generated 0.03 gCO₂e of emissions, and used 0.26 mL of water. Notably, the active AI accelerator accounts for only 58% of total energy; host CPU/DRAM, idle capacity, and data center overhead contribute the remainder.

Figure 2: Components of the total LLM energy consumption per prompt across a production LLM serving stack, as measured for Gemini Apps.

Comparison with a narrower, accelerator-only approach yields just 0.10 Wh/prompt, a 2.4x underestimate relative to the comprehensive figure, underscoring the necessity of full-stack measurement for accurate environmental accounting.
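
Spelled out, the underestimation factor is simply the ratio of the two boundaries' per-prompt figures:

    0.24 Wh (comprehensive) / 0.10 Wh (accelerator-only) = 2.4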

When benchmarked against public estimates and measurements for comparable models, the Gemini Apps per-prompt energy is one to two orders of magnitude lower than many prior results. This discrepancy is attributed to three factors: (1) in-situ, production-scale measurement; (2) use of highly optimized, proprietary models and hardware; and (3) efficient batching and utilization in production environments.

Figure 3: Energy per prompt for large production AI models versus LMArena score, illustrating the impact of measurement boundary and methodology.

A longitudinal analysis reveals a 44x reduction in per-prompt emissions and a 33x reduction in per-prompt energy for Gemini Apps over a 12-month period. These gains decompose into the following factors (a brief arithmetic check of how they compose follows Figure 4):

  • 23x reduction from model and software improvements
  • 1.4x reduction from improved machine utilization
  • 1.4x reduction in emissions intensity via clean energy procurement
  • 36x reduction in Scope 1+3 (embodied) emissions per prompt

Figure 4: Median Gemini Apps text prompt emissions over time, showing a 47x reduction in Scope 2 MB (market-based) emissions and a 36x reduction in Scope 1+3 emissions per prompt.
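
As a rough consistency check (arithmetic composed here, not a calculation reproduced from the paper), the reported factors combine approximately multiplicatively, with residuals attributable to rounding of the individual factors:

    23 (model/software) x 1.4 (machine utilization) ~ 32   (reported: 33x energy reduction)
    32 x 1.4 (grid emissions intensity)             ~ 45   (reported: 44x emissions reduction)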

Key drivers include architectural advances (e.g., mixture-of-experts (MoE) models, quantization, speculative decoding), custom hardware (TPUs), optimized software stacks (XLA, Pallas, Pathways), dynamic resource allocation, and aggressive clean energy procurement. The paper also highlights Google's water stewardship initiatives, including a shift toward air-cooled data centers in high-water-stress regions and a fleetwide WUE (water usage effectiveness) of 1.15 L/kWh.
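
The per-prompt water figure follows directly from the per-prompt energy figure. As a back-of-the-envelope check (again arithmetic composed here, assuming WUE is defined against IT energy, i.e., total energy divided by an assumed fleet PUE of roughly 1.09, a value not stated in this summary; note that 1.15 L/kWh equals 1.15 mL/Wh):

    (0.24 Wh / 1.09) x 1.15 mL/Wh ~ 0.25 mL per prompt

which is consistent with the reported 0.26 mL, the small residual plausibly reflecting rounding of the inputs.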

Implications and Future Directions

The findings have several important implications:

  • Standardization: The order-of-magnitude variability in published per-prompt energy and emissions metrics is primarily due to inconsistent measurement boundaries. Adoption of comprehensive, production-scale methodologies is essential for meaningful cross-model and cross-provider comparisons.
  • Optimization Incentives: Full-stack measurement exposes new levers for efficiency, incentivizing optimizations beyond the accelerator (e.g., host utilization, idle management, data center operations).
  • Policy and Reporting: Accurate, comprehensive metrics are necessary for regulatory compliance, sustainability reporting, and public transparency.
  • Scalability: While per-prompt impacts are low relative to other activities, the aggregate effect at global scale remains significant, justifying continued focus on efficiency and decarbonization.

Future work should extend comprehensive measurement to training, incorporate end-to-end lifecycle analysis, and develop open standards for environmental reporting in AI.

Conclusion

This paper establishes a rigorous, production-scale methodology for measuring the environmental impact of AI inference, demonstrating that existing, narrow approaches substantially underestimate true costs. For Gemini Apps, the median prompt's energy, emissions, and water footprint are lower than most public estimates, due to both methodological comprehensiveness and operational efficiency. The demonstrated 44x reduction in per-prompt emissions over one year highlights the potential for rapid, compounding gains when full-stack metrics are used to guide optimization. Widespread adoption of such comprehensive frameworks is critical for ensuring that AI's environmental efficiency keeps pace with its growing capabilities and societal impact.
