This paper introduces Google's Carbon-Intelligent Compute Management System (CICS), designed to reduce the carbon footprint and operational costs of its global datacenter fleet by shifting temporally flexible workloads to times when grid electricity is less carbon-intensive (Radovanovic et al., 2021). The system addresses the growing energy consumption of datacenters and the variability of grid carbon intensity across time and location.
The core mechanism employed by CICS is the Virtual Capacity Curve (VCC). A VCC is an hourly limit imposed on the total compute resources (specifically CPU, measured in Google Compute Units or GCUs) available to flexible workloads within a datacenter cluster for the next day. These flexible workloads typically include batch processing jobs like data compaction, machine learning training, simulations, and video processing, which can tolerate delays as long as they complete within a 24-hour window. User-facing services and customer VMs are classified as inflexible and are not affected.
System Architecture and Implementation:
CICS operates through a suite of analytical pipelines executed daily:
- Carbon Fetching Pipeline: Retrieves hourly, day-ahead forecasts of average carbon intensity (kgCO₂e/kWh) for the grid zones where Google's datacenters are located. The primary source for this data is Tomorrow (electricityMap.org).
- Power Modeling Pipeline: Trains and updates models that map cluster-level CPU usage to power consumption. The paper highlights that a piecewise linear model accurately estimates Power Distribution Unit (PDU) power based on CPU usage alone, with a daily Mean Absolute Percent Error (MAPE) below 5% for most PDUs (Radovanovic et al., 2021; see also Daltro et al., 2021). The relationship between cluster CPU usage u and power P is approximated locally as:

  P(u) ≈ P(ū) + m · (u − ū)

  where m is the cluster power sensitivity, derived from the individual PDU models and their average CPU usage fractions, and ū is the average usage level around which the model is linearized. This accurate mapping is crucial for the optimization process.
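As an illustration, the per-PDU fit and its local linearization can be sketched as follows. The hinge-basis least-squares formulation and the knot placement are our assumptions for the sketch, not the paper's exact fitting procedure:

```python
import numpy as np

def fit_power_model(cpu_usage, pdu_power, knots):
    """Fit a piecewise-linear map from CPU usage to PDU power by least
    squares over hinge basis functions max(0, u - k). The basis choice
    and the knot locations are illustrative assumptions."""
    X = np.column_stack(
        [np.ones_like(cpu_usage), cpu_usage]
        + [np.maximum(0.0, cpu_usage - k) for k in knots]
    )
    coef, *_ = np.linalg.lstsq(X, pdu_power, rcond=None)
    return coef

def predict_power(coef, u, knots):
    """Evaluate the fitted model at a single usage level u."""
    x = np.concatenate(([1.0, u], [max(0.0, u - k) for k in knots]))
    return float(x @ coef)

def local_sensitivity(coef, u, knots, eps=1e-3):
    """Local slope m of the power curve: P(u + d) ≈ P(u) + m * d."""
    return (predict_power(coef, u + eps, knots)
            - predict_power(coef, u, knots)) / eps
```

The local slope returned by `local_sensitivity` plays the role of the cluster power sensitivity used by the optimization.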
- Load Forecasting Pipeline: Predicts cluster-level compute demand for the next day. Key forecasts include:
  - Hourly inflexible CPU usage, u_c(t).
  - Total daily flexible CPU usage, F_c.
  - Total daily CPU reservations, R_c (reservations are typically higher than actual usage, to guarantee resources).
  - Hourly CPU reservation-to-usage ratio, ρ_c(t).

  Forecasting uses methods like Exponentially Weighted Moving Averages (EWMA) and linear models to capture weekly patterns and daily deviations. Forecast accuracy is generally high (median APE below 10% for most clusters), which is vital for the system's effectiveness.
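A minimal sketch of the EWMA-plus-weekly-pattern idea described above (the smoothing factor and the share-based hourly split are our assumptions, not the paper's exact models):

```python
def ewma_forecast(daily_totals, alpha=0.3):
    """Exponentially weighted moving average of past daily totals;
    alpha is an assumed smoothing factor, not the paper's value."""
    level = daily_totals[0]
    for x in daily_totals[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

def forecast_next_day(daily_totals, hourly_shares):
    """Split the smoothed daily total across 24 hours using average
    hourly usage shares (e.g. taken from the same weekday), which is
    one way to capture the weekly pattern the paper mentions."""
    total = ewma_forecast(daily_totals)
    return [total * share for share in hourly_shares]
```

The hourly shares would be estimated from historical same-weekday usage profiles; here they are simply passed in.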
- Optimization Pipeline: This daily process computes the optimal VCCs for all clusters. It chooses, for each cluster c and hour t, the deviation Δ_c(t) of flexible usage from its daily average so as to minimize a weighted sum of the expected fleetwide carbon footprint and peak power consumption costs:

  minimize  Σ_c Σ_t w_CO2 · CI_c(t) · P_c( u_c(t) + F_c/24 + Δ_c(t) ) + w_peak · Σ_c P̂_c

  where:
  - CI_c(t): forecasted carbon intensity;
  - P_c(·): power model output, with sensitivity m_c;
  - Δ_c(t): optimized hourly deviation of flexible usage from its daily average;
  - F_c: risk-aware forecast of daily flexible usage;
  - P̂_c: cluster peak power upper bound;
  - w_CO2, w_peak: weights for the carbon and peak power costs.

  The optimization is subject to constraints:
  - Daily flexible usage conservation: Σ_t Δ_c(t) = 0.
  - Risk-aware SLOs: total daily capacity must cover the 97th percentile of predicted reservation demand (R_c), preventing frequent violations of flexible workload completion.
  - Power capping: cluster power limits must not be exceeded, based on inflexible load quantiles.
  - Campus power contracts: total peak power for clusters within a datacenter campus is limited.
  - Machine capacity: the VCC cannot exceed physical machine capacity.

  The final VCC for cluster c at hour t converts the planned usage into a reservation limit via the reservation-to-usage ratio:

  VCC_c(t) = ρ_c(t) · ( u_c(t) + F_c/24 + Δ_c(t) )
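To build intuition for the carbon term alone: for a single cluster with constant power sensitivity, symmetric per-hour shift bounds, and none of the other constraints, minimizing Σ_t CI(t)·Δ(t) subject to Σ_t Δ(t) = 0 simply moves flexible usage from the most to the least carbon-intense hours. A stdlib-only sketch of that special case (the real system solves the full constrained program jointly across all clusters):

```python
def plan_flexible_shift(carbon_intensity, max_shift):
    """Minimize sum_t CI(t) * delta(t) with sum_t delta(t) = 0 and
    |delta(t)| <= max_shift: push +max_shift into the cheapest half
    of the hours and -max_shift into the most expensive half (the
    middle hour stays at 0 when the number of hours is odd)."""
    hours = len(carbon_intensity)
    order = sorted(range(hours), key=lambda t: carbon_intensity[t])
    delta = [0.0] * hours
    k = hours // 2
    for t in order[:k]:
        delta[t] = +max_shift   # low-carbon hours absorb extra flexible work
    for t in order[hours - k:]:
        delta[t] = -max_shift   # high-carbon hours shed flexible work
    return delta
```

With distinct carbon intensities this greedy assignment is the exact optimum of the simplified linear program; adding the SLO, power-capping, and contract constraints is what makes the production problem require a proper solver.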
- SLO Violation Detection: Monitors if clusters consistently fail to meet their daily flexible compute targets. If violations persist (e.g., due to unpredicted demand growth), shaping for that cluster is temporarily paused to allow forecasts to adapt.
Operation and Impact:
The computed VCCs are pushed daily to Google's cluster management system (Borg). Borg's scheduler uses the VCC as the upper limit for total CPU reservations at any given hour. When the VCC is low (typically during high carbon intensity periods), the admission controller queues new flexible tasks or potentially preempts running ones, delaying their execution until the VCC increases (during lower carbon intensity periods).
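A toy sketch of the admission-control behavior described above. This is not Borg's actual interface; for brevity it tracks only flexible reservations against the VCC and omits inflexible load and preemption:

```python
from collections import deque

class VCCAdmissionController:
    """Illustrative sketch of VCC enforcement: flexible tasks are
    admitted only while flexible reservations stay under the current
    hour's VCC value; otherwise they are queued for a later hour."""

    def __init__(self, vcc):
        self.vcc = vcc          # hourly reservation limits (GCUs)
        self.flexible = 0.0     # currently admitted flexible reservations
        self.queue = deque()    # deferred flexible tasks (GCU sizes)

    def submit(self, hour, task_gcus):
        """Admit one flexible task if it fits under this hour's cap."""
        if self.flexible + task_gcus <= self.vcc[hour]:
            self.flexible += task_gcus
            return True
        self.queue.append(task_gcus)
        return False

    def on_hour(self, hour):
        """At an hour boundary, drain queued tasks that now fit."""
        admitted = []
        while self.queue and self.flexible + self.queue[0] <= self.vcc[hour]:
            admitted.append(self.queue.popleft())
            self.flexible += admitted[-1]
        return admitted
```

In this sketch a task queued during a low-VCC (high carbon intensity) hour is automatically admitted once the curve rises, which is the delay-and-shift behavior the paper describes.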
The paper demonstrates the system's effectiveness using operational data:
- Cluster Examples: Shows how VCCs successfully reduce CPU reservations and power consumption during peak carbon hours in clusters with sufficient flexible load and predictable demand. Effectiveness diminishes if flexible load is small or demand forecasts have high uncertainty (requiring higher VCC headroom).
- Campus-Level Impact: A controlled experiment showed that activating CICS resulted in an average power drop of 1-2% during the highest carbon intensity hours compared to control days.
- Trade-offs: Optimizing solely for carbon might increase peak resource needs, whereas the dual objective balances environmental goals with infrastructure efficiency. More aggressive shaping might lead to a slight decrease in total daily flexible work completed, potentially due to jobs migrating or task intolerance to longer delays.
Practical Considerations:
- Scheduler-Agnostic: CICS provides capacity constraints (VCCs) but doesn't modify the underlying scheduling algorithms.
- Reliability: Designed with gradual rollouts, monitoring, and feedback loops to ensure stability and adherence to SLOs.
- Scalability: Centralized optimization based on aggregate cluster-level forecasts is more scalable than job-level approaches.
- Day-Ahead Planning: Leverages the predictability of aggregate demand and day-ahead carbon forecasts, decoupling complex optimization from real-time scheduling.
The paper concludes that CICS effectively shifts load temporally to reduce carbon emissions and improve efficiency, demonstrating a scalable, first-of-its-kind implementation. Future work includes incorporating spatial load shifting (moving jobs between datacenters).