cedar: Optimized and Unified Machine Learning Input Data Pipelines (2401.08895v4)

Published 17 Jan 2024 in cs.LG, cs.DC, and cs.PF

Abstract: The input data pipeline is an essential component of each ML training job. It is responsible for reading massive amounts of training data, processing batches of samples using complex transformations, and loading them onto training nodes at low latency and high throughput. Performant input data systems are becoming increasingly critical, driven by skyrocketing data volumes and training throughput demands. Unfortunately, current input data systems cannot fully leverage key performance optimizations, resulting in hugely inefficient infrastructures that require significant resources - or worse - underutilize expensive accelerators. To address these demands, we present cedar, an optimized and unified programming framework for ML input data pipelines. cedar allows users to define input data pipelines using composable operators that support arbitrary ML frameworks and libraries. cedar introduces an extensible optimizer that systematically applies a complex combination of optimizations (e.g., offloading, caching, prefetching, fusion, and reordering). It orchestrates processing across a customizable set of local and distributed compute resources in order to improve processing performance and efficiency, all without user input. Across eight pipelines, cedar improves performance by up to 1.87x to 10.65x compared to state-of-the-art input data systems.
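As a rough, hypothetical illustration of the composable-operator style the abstract describes, the sketch below shows how an input data pipeline might be composed from per-sample transformations and batching in plain Python. The Pipeline class, its map/batch methods, and the toy data are assumptions made for illustration only; they are not cedar's actual API.

    # Minimal sketch of a composable input-data pipeline (illustrative only;
    # class and method names are assumed and are NOT cedar's API).
    from typing import Callable, Iterable, Iterator, List

    class Pipeline:
        """Chains composable operators (map, batch) over a data source."""

        def __init__(self, source: Iterable):
            self._source = source
            self._ops: List[Callable[[Iterator], Iterator]] = []

        def map(self, fn: Callable) -> "Pipeline":
            # Per-sample transformation (e.g., decode, augment, tokenize).
            self._ops.append(lambda it, f=fn: (f(x) for x in it))
            return self

        def batch(self, size: int) -> "Pipeline":
            # Group consecutive samples into fixed-size batches.
            def _batch(it: Iterator) -> Iterator:
                buf = []
                for x in it:
                    buf.append(x)
                    if len(buf) == size:
                        yield buf
                        buf = []
                if buf:
                    yield buf
            self._ops.append(_batch)
            return self

        def __iter__(self) -> Iterator:
            # Apply the chained operators lazily over the source.
            it: Iterator = iter(self._source)
            for op in self._ops:
                it = op(it)
            return it

    # Toy usage: scale raw samples, then batch them for a training loop.
    pipe = Pipeline(range(10)).map(lambda x: x * 2).batch(4)
    for b in pipe:
        print(b)  # [0, 2, 4, 6], then [8, 10, 12, 14], then [16, 18]

An optimizer of the kind the abstract describes would operate on such an operator graph, for example fusing adjacent map operators, inserting prefetching between stages, or offloading selected operators to remote workers, without the user changing the pipeline definition.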

Authors (3)
  1. Mark Zhao (10 papers)
  2. Emanuel Adamiak (1 paper)
  3. Christos Kozyrakis (31 papers)
Citations (2)