
Perseus: Reducing Energy Bloat in Large Model Training

Published 12 Dec 2023 in cs.LG and cs.DC | (2312.06902v2)

Abstract: Training large AI models on numerous GPUs consumes a massive amount of energy, making power delivery one of the largest limiting factors in building and operating datacenters for AI workloads. However, we observe that not all energy consumed during training directly contributes to end-to-end throughput, and that a significant portion, which we call energy bloat, can be removed without slowing down training. In this work, we identify two independent sources of energy bloat in large model training and propose Perseus, a training system that mitigates both. To do this, Perseus obtains the "iteration time-energy" Pareto frontier of any large model training job using an efficient graph cut-based algorithm and schedules the energy consumption of computations across time to remove both types of energy bloat. Evaluation on large models including GPT-3 and Bloom shows that Perseus reduces the energy consumption of large model training by up to 30% without any throughput loss or hardware modification, enabling energy reduction -- and therefore cost savings -- that were previously unattainable.
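To make the "iteration time-energy" Pareto frontier concrete, here is a minimal sketch. It is not the graph cut-based algorithm the abstract refers to. It assumes a hypothetical list of already-measured candidate execution plans, each summarized as an (iteration time, energy) pair, and keeps only the plans that no other plan beats on both axes. The function names and all numbers are made up for illustration.

```python
from typing import List, Tuple

Plan = Tuple[float, float]  # (iteration_time_s, energy_j), both to be minimized


def pareto_frontier(plans: List[Plan]) -> List[Plan]:
    """Return the plans that are not dominated in (time, energy)."""
    frontier: List[Plan] = []
    best_energy = float("inf")
    # Sort by time, then energy; a plan survives only if it uses strictly
    # less energy than every faster (or equally fast) plan seen so far.
    for time_s, energy_j in sorted(plans):
        if energy_j < best_energy:
            frontier.append((time_s, energy_j))
            best_energy = energy_j
    return frontier


def min_energy_within(frontier: List[Plan], deadline_s: float) -> Plan:
    """Pick the lowest-energy frontier plan whose time meets the deadline."""
    feasible = [p for p in frontier if p[0] <= deadline_s]
    if not feasible:
        raise ValueError("no plan meets the deadline")
    return min(feasible, key=lambda p: p[1])


# Hypothetical measurements, not taken from the paper.
candidates = [
    (1.0, 950.0),  # e.g. every computation at maximum speed
    (1.0, 900.0),  # same iteration time, less energy
    (1.1, 700.0),
    (1.2, 750.0),  # dominated by (1.1, 700.0)
    (1.3, 600.0),
]

frontier = pareto_frontier(candidates)
print(frontier)
print(min_energy_within(frontier, deadline_s=1.0))
```

In this toy data, the gap between the two plans with an iteration time of 1.0 s plays the role of energy that can be removed without slowing down training: the second plan finishes just as fast while consuming less.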
