Modyn: Data-Centric Machine Learning Pipeline Orchestration (2312.06254v3)

Published 11 Dec 2023 in cs.LG, cs.AI, cs.DB, cs.DC, and stat.ML

Abstract: In real-world ML pipelines, datasets are continuously growing. Models must incorporate this new training data to improve generalization and adapt to potential distribution shifts. The cost of model retraining is proportional to how frequently the model is retrained and how much data it is trained on, which makes the naive approach of retraining from scratch each time impractical. We present Modyn, a data-centric end-to-end machine learning platform. Modyn's ML pipeline abstraction enables users to declaratively describe policies for continuously training a model on a growing dataset. Modyn pipelines allow users to apply data selection policies (to reduce the number of data points) and triggering policies (to reduce the number of trainings). Modyn executes and orchestrates these continuous ML training pipelines. The system is open-source and comes with an ecosystem of benchmark datasets, models, and tooling. We formally discuss how to measure the performance of ML pipelines by introducing the concept of composite models, enabling fair comparison of pipelines with different data selection and triggering policies. We empirically analyze how various data selection and triggering policies impact model accuracy, and also show that Modyn enables high throughput training with sample-level data selection.
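
The abstraction described above pairs two orthogonal policies: a triggering policy that decides when a retraining runs, and a data selection policy that decides which samples that training sees. The following is a minimal sketch of this split in plain Python. It is not Modyn's actual API or configuration schema (the real system uses declarative pipeline definitions documented in its open-source repository); all names here (Pipeline, Sample, time_trigger, newest_k) are hypothetical.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Sample:
        features: list
        label: int
        timestamp: int  # arrival time of the data point

    @dataclass
    class Pipeline:
        # Triggering policy: when to retrain (fewer triggers => fewer trainings).
        should_trigger: Callable[[List[Sample]], bool]
        # Selection policy: which samples to train on
        # (smaller selections => fewer data points per training).
        select: Callable[[List[Sample]], List[Sample]]

    def time_trigger(every: int) -> Callable[[List[Sample]], bool]:
        """Fire once `every` new samples have arrived since the last training."""
        return lambda new_samples: len(new_samples) >= every

    def newest_k(k: int) -> Callable[[List[Sample]], List[Sample]]:
        """Keep only the k most recently arrived samples."""
        return lambda samples: sorted(samples, key=lambda s: s.timestamp)[-k:]

    pipeline = Pipeline(should_trigger=time_trigger(every=1_000),
                        select=newest_k(k=10_000))

An orchestrator would buffer incoming samples, consult should_trigger as data arrives, and launch a training job on select(buffer) whenever the policy fires; varying the two policies independently is what lets the paper study their cost and accuracy trade-offs.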
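
The composite-model idea can likewise be made concrete. A continuous pipeline produces a sequence of models over time, and a natural reading of the abstract is that the composite model stitches these together by time, scoring each evaluation sample with the model that was most recently trained before the sample arrived. The router below sketches that reading under this assumption; composite_predict and the Model protocol are illustrative names, not the paper's formal notation.

    import bisect
    from typing import Any, List, Protocol

    class Model(Protocol):
        def predict(self, features: Any) -> int: ...

    def composite_predict(models: List[Model],
                          deploy_times: List[int],  # sorted ascending
                          features: Any,
                          timestamp: int) -> int:
        """Score a sample with the model live at its timestamp, i.e. the
        latest model deployed at or before the sample arrived."""
        i = bisect.bisect_right(deploy_times, timestamp) - 1
        if i < 0:
            raise ValueError("no model had been trained by this timestamp")
        return models[i].predict(features)

Running every held-out sample through such a router yields a single accuracy figure per pipeline, which is what makes pipelines with different trigger frequencies and selection budgets directly comparable.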
