cedar: Optimized and Unified Machine Learning Input Data Pipelines (2401.08895v4)

Published 17 Jan 2024 in cs.LG, cs.DC, and cs.PF

Abstract: The input data pipeline is an essential component of each ML training job. It is responsible for reading massive amounts of training data, processing batches of samples using complex transformations, and loading them onto training nodes at low latency and high throughput. Performant input data systems are becoming increasingly critical, driven by skyrocketing data volumes and training throughput demands. Unfortunately, current input data systems cannot fully leverage key performance optimizations, resulting in hugely inefficient infrastructures that require significant resources - or worse - underutilize expensive accelerators. To address these demands, we present cedar, an optimized and unified programming framework for ML input data pipelines. cedar allows users to define input data pipelines using composable operators that support arbitrary ML frameworks and libraries. cedar introduces an extensible optimizer that systematically applies a complex combination of optimizations (e.g., offloading, caching, prefetching, fusion, and reordering). It orchestrates processing across a customizable set of local and distributed compute resources in order to improve processing performance and efficiency, all without user input. Across eight pipelines, cedar improves performance by up to 1.87x to 10.65x compared to state-of-the-art input data systems.
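As a rough, hypothetical illustration of the composable-operator style the abstract describes, the sketch below shows how an input data pipeline might be composed from per-sample transformations and batching in plain Python. The Pipeline class, its map/batch methods, and the toy data are assumptions made for illustration only; they are not cedar's actual API.

    # Minimal sketch of a composable input-data pipeline (illustrative only;
    # class and method names are assumed and are NOT cedar's API).
    from typing import Callable, Iterable, Iterator, List

    class Pipeline:
        """Chains composable operators (map, batch) over a data source."""

        def __init__(self, source: Iterable):
            self._source = source
            self._ops: List[Callable[[Iterator], Iterator]] = []

        def map(self, fn: Callable) -> "Pipeline":
            # Per-sample transformation (e.g., decode, augment, tokenize).
            self._ops.append(lambda it, f=fn: (f(x) for x in it))
            return self

        def batch(self, size: int) -> "Pipeline":
            # Group consecutive samples into fixed-size batches.
            def _batch(it: Iterator) -> Iterator:
                buf = []
                for x in it:
                    buf.append(x)
                    if len(buf) == size:
                        yield buf
                        buf = []
                if buf:
                    yield buf
            self._ops.append(_batch)
            return self

        def __iter__(self) -> Iterator:
            # Apply the chained operators lazily over the source.
            it: Iterator = iter(self._source)
            for op in self._ops:
                it = op(it)
            return it

    # Toy usage: scale raw samples, then batch them for a training loop.
    pipe = Pipeline(range(10)).map(lambda x: x * 2).batch(4)
    for b in pipe:
        print(b)  # [0, 2, 4, 6], then [8, 10, 12, 14], then [16, 18]

An optimizer of the kind the abstract describes would operate on such an operator graph, for example fusing adjacent map operators, inserting prefetching between stages, or offloading selected operators to remote workers, without the user changing the pipeline definition.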

Authors (3)
  1. Mark Zhao (10 papers)
  2. Emanuel Adamiak (1 paper)
  3. Christos Kozyrakis (31 papers)
Citations (2)