Efficiency and cost impacts of cluster “slicing” for training
Characterize the efficiency and cost impacts of training AI models on a large number of less powerful chips (cluster slicing) versus a smaller number of more powerful chips with the same aggregate theoretical throughput, including the implications for decentralized or disaggregated training configurations.
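To make the comparison concrete, the following is a minimal toy sketch, not a method from the cited paper. It assumes pure data-parallel training with a ring all-reduce of the gradients after each step, no overlap between compute and communication, and identical per-chip link bandwidth and latency in both clusters. The `Cluster` class, the function names, and every number in the usage example are hypothetical placeholders chosen only so that both clusters have the same aggregate peak throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical cluster description; all fields are illustrative assumptions."""
    num_chips: int
    peak_flops_per_chip: float  # FLOP/s per chip
    link_bandwidth: float       # bytes/s per chip (assumed equal across clusters)
    link_latency: float         # seconds per message (assumed equal across clusters)

    @property
    def aggregate_peak_flops(self) -> float:
        return self.num_chips * self.peak_flops_per_chip


def allreduce_time(cluster: Cluster, gradient_bytes: float) -> float:
    """Ring all-reduce cost: 2*(n-1) rounds, each moving a chunk of size S/n."""
    n = cluster.num_chips
    if n == 1:
        return 0.0
    chunk = gradient_bytes / n
    return 2 * (n - 1) * (cluster.link_latency + chunk / cluster.link_bandwidth)


def step_time(cluster: Cluster, flops_per_step: float, gradient_bytes: float) -> float:
    """Compute time at peak throughput plus non-overlapped communication time."""
    compute = flops_per_step / cluster.aggregate_peak_flops
    return compute + allreduce_time(cluster, gradient_bytes)


def efficiency(cluster: Cluster, flops_per_step: float, gradient_bytes: float) -> float:
    """Fraction of the step spent on useful compute (1.0 means no overhead)."""
    ideal = flops_per_step / cluster.aggregate_peak_flops
    return ideal / step_time(cluster, flops_per_step, gradient_bytes)


def cost_per_step(cluster: Cluster, flops_per_step: float, gradient_bytes: float,
                  price_per_chip_hour: float) -> float:
    """Rental-style cost of one step, given a hypothetical hourly price per chip."""
    hours = step_time(cluster, flops_per_step, gradient_bytes) / 3600.0
    return hours * cluster.num_chips * price_per_chip_hour


# Two hypothetical clusters with the same aggregate peak throughput (8e15 FLOP/s).
few_strong = Cluster(num_chips=8, peak_flops_per_chip=1e15,
                     link_bandwidth=1e11, link_latency=1e-5)
many_weak = Cluster(num_chips=64, peak_flops_per_chip=1.25e14,
                    link_bandwidth=1e11, link_latency=1e-5)

for name, cluster in [("few_strong", few_strong), ("many_weak", many_weak)]:
    eff = efficiency(cluster, flops_per_step=1e18, gradient_bytes=2e10)
    print(f"{name}: efficiency = {eff:.4f}")
```

Under these assumptions the sliced cluster always pays more communication overhead per step, because both the number of all-reduce rounds and the total bytes each chip moves grow with the chip count. Real systems differ in ways this sketch ignores (weaker chips often ship with slower interconnects, communication can overlap with compute, and memory capacity limits parallelism choices), which is part of why the question remains open.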
References
Another open problem is the efficiency and cost impact of using a larger number of less powerful chips within a cluster, as opposed to using a smaller number of more powerful chips totaling the same theoretical throughput, sometimes known as slicing.
— Open Problems in Technical AI Governance
(arXiv:2407.14981, Reuel et al., 20 Jul 2024), Section 3.2.1, "Definition of Chip and Cluster Specifications for Model Training"