CarbonScaler: Leveraging Cloud Workload Elasticity for Optimizing Carbon-Efficiency (2302.08681v2)
Abstract: Cloud platforms are increasing their emphasis on sustainability and reducing their operational carbon footprint. A common approach for reducing carbon emissions is to exploit the temporal flexibility inherent to many cloud workloads by executing them in periods with the greenest energy and suspending them at other times. Since such suspend-resume approaches can incur long delays in job completion times, we present a new approach that exploits the elasticity of batch workloads in the cloud to optimize their carbon emissions. Our approach is based on the notion of "carbon scaling," similar to cloud autoscaling, where a job dynamically varies its server allocation based on fluctuations in the carbon cost of the grid's energy. We develop a greedy algorithm for minimizing a job's carbon emissions via carbon scaling that is based on the well-known problem of marginal resource allocation. We implement a CarbonScaler prototype in Kubernetes using its autoscaling capabilities and an analytic tool to guide the carbon-efficient deployment of batch applications in the cloud. We then evaluate CarbonScaler using real-world machine learning training and MPI jobs on a commercial cloud platform and show that it can yield i) 51% carbon savings over carbon-agnostic execution; ii) 37% over a state-of-the-art suspend-resume policy; and iii) 8% over the best static scaling policy.
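The abstract describes carbon scaling as a greedy, marginal-allocation procedure but gives no details, so the following is only a minimal sketch of that general idea, not the authors' algorithm. It assumes a known carbon-intensity forecast per time slot, one unit of energy per server per slot, and a job whose k-th server adds diminishing work. The names `carbon_scale`, `emissions`, `carbon`, `marginal_work`, and `work_needed` are hypothetical.

```python
import heapq


def carbon_scale(carbon, marginal_work, work_needed):
    """Greedily choose how many servers to run in each time slot.

    carbon[t]        -- assumed carbon intensity forecast for slot t
    marginal_work[k] -- assumed work added by the (k+1)-th server in a slot,
                        non-increasing in k (diminishing returns)
    work_needed      -- total work the job must complete
    """
    schedule = [0] * len(carbon)
    # Max-heap (via negated keys) of candidate (slot, server) pairs,
    # ordered by work gained per unit of carbon emitted.
    heap = [(-marginal_work[0] / c, t, 0) for t, c in enumerate(carbon)]
    heapq.heapify(heap)
    done = 0.0
    while done < work_needed and heap:
        _, t, k = heapq.heappop(heap)
        schedule[t] += 1
        done += marginal_work[k]
        # Only after server k is allocated does server k+1 become a candidate.
        if k + 1 < len(marginal_work):
            heapq.heappush(heap, (-marginal_work[k + 1] / carbon[t], t, k + 1))
    if done < work_needed:
        raise ValueError("job cannot finish within the given window")
    return schedule


def emissions(schedule, carbon):
    """Total carbon, assuming one energy unit per server per slot."""
    return sum(n * c for n, c in zip(schedule, carbon))


carbon = [100, 40, 200, 400]       # illustrative forecast, not real data
marginal_work = [1.0, 0.8, 0.5]    # illustrative scaling profile
schedule = carbon_scale(carbon, marginal_work, work_needed=3.0)
print(schedule, emissions(schedule, carbon))
```

With these illustrative numbers, the sketch concentrates servers in the lowest-carbon slot and adds one server in the next-cleanest slot, whereas running a single server from the start of the window would take three slots and emit more under the same assumptions.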
Authors:
- Walid A. Hanafy
- Qianlin Liang
- Noman Bashir
- David Irwin
- Prashant Shenoy