tf.data service: A Case for Disaggregating ML Input Data Processing
Abstract: Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. For cost efficiency, it is essential to keep these accelerators highly utilized, which requires preprocessing input data at the rate at which the accelerators can ingest it and perform ML computations on it. The host CPU and RAM needed per accelerator core to avoid data stalls vary across jobs, so the traditional approach of processing input data on ML accelerator hosts with a fixed hardware ratio under-utilizes either the accelerators or the host CPU and RAM. In this paper, we address these concerns by building a disaggregated ML data processing system. We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow. We show that disaggregating data preprocessing has three key advantages for large-scale ML training jobs. First, the service can horizontally scale out to right-size CPU/RAM host resources for data processing in each job, reducing training time by 32x and cost by 26x, on average. Second, the service can share ephemeral preprocessed data across jobs, saving CPU and avoiding redundant computation. Finally, the service supports coordinated reads, a technique that avoids stragglers caused by differing input sizes in distributed training, reducing training time by 2.2x, on average. Our design is informed by lessons learned from deploying tf.data service in production, including relaxing data visitation guarantees without impacting model accuracy.
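The service is driven through the tf.data API: a job builds its input pipeline as usual, then applies `tf.data.experimental.service.distribute` to offload the upstream transformations to remote workers. Below is a minimal sketch; the dispatcher address and `job_name` value are illustrative assumptions, and a real deployment would first start a `DispatchServer` and `WorkerServer`s on CPU hosts separate from the accelerator hosts.

```python
import tensorflow as tf

# Assumed dispatcher address; a real deployment starts a
# tf.data.experimental.service.DispatchServer plus WorkerServers
# on CPU hosts separate from the accelerator hosts.
DISPATCHER = "grpc://dispatcher.example.com:5000"

def preprocess(x):
    # Stand-in for an expensive per-element transformation
    # (decoding, augmentation, ...).
    return x * 2

dataset = (
    tf.data.Dataset.range(1000)
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
)

# Everything above this point now runs on remote tf.data service
# workers; the trainer host only fetches ready batches. A shared
# job_name lets concurrent trainers consume the same ephemeral
# preprocessed stream instead of recomputing it. distribute() also
# accepts num_consumers/consumer_index for coordinated reads, so
# synchronized consumers receive same-sized batches each step.
dataset = dataset.apply(
    tf.data.experimental.service.distribute(
        processing_mode="distributed_epoch",  # service shards the source
        service=DISPATCHER,
        job_name="shared_job",  # illustrative; omit for a private job
    )
).prefetch(tf.data.AUTOTUNE)

for batch in dataset:
    pass  # training step consumes preprocessed batches
```

Because the workers scale horizontally and independently of the accelerator hosts, the same pipeline definition can be right-sized per job without changing the training code.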