Couler: Unified Machine Learning Workflow Optimization in Cloud (2403.07608v1)

Published 12 Mar 2024 in cs.DB, cs.AI, and cs.LG

Abstract: Machine Learning (ML) has become ubiquitous, fueling data-driven applications across organizations. Contrary to its traditional perception as lightweight research code, a production ML workflow can be complex, resource-intensive, and time-consuming, and expanding a workflow to cover a wider range of data infrastructure and data types further increases workloads and deployment costs. Numerous workflow engines are currently available (more than ten are widely recognized), and this variety makes it challenging for end-users to master the different engine APIs. While prior efforts have focused on optimizing ML Operations (MLOps) for a specific workflow engine, they largely overlook workflow optimization across engines. In this work, we design and implement Couler, a system for unified ML workflow optimization in the cloud. Our key insight is that an ML workflow can be generated from a natural language (NL) description. We integrate LLMs into workflow generation and provide a unified programming interface over various workflow engines, alleviating the need to learn each engine's API. Moreover, Couler improves workflow computation efficiency by introducing automated caching at multiple stages, auto-parallelization of large workflows, and automatic hyperparameter tuning. These enhancements reduce redundant computational costs and improve fault tolerance during deep learning workflow training. Couler is extensively deployed in real-world production scenarios at Ant Group, handling approximately 22k workflows daily, and has improved CPU/memory utilization by more than 15% and the workflow completion rate by around 17%.
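To make the "unified programming interface" concrete, the sketch below shows how a workflow is authored once in Python and then handed to an engine-specific submitter. It is a minimal illustration assuming the API of the open-source couler-proj/couler project (`couler.run_container`, `ArgoSubmitter`), not the exact code from the paper; the image, step name, and namespace are placeholder choices.

```python
# Minimal sketch of Couler's unified workflow interface, assuming the
# open-source couler-proj/couler Python API; values below are illustrative.
import couler.argo as couler
from couler.argo_submitter import ArgoSubmitter


def say(message):
    # Each step is declared once in Python; Couler translates it into
    # the target engine's representation (here, an Argo Workflows step).
    couler.run_container(
        image="docker/whalesay:latest",
        command=["cowsay"],
        args=[message],
        step_name="say",
    )


say("hello from a unified workflow API")

# Submit to Argo Workflows; swapping in a different submitter retargets
# the same workflow definition without rewriting the steps above.
submitter = ArgoSubmitter(namespace="argo")
couler.run(submitter=submitter)
```

Decoupling the step definitions from the submitter is what lets a single workflow description run on different backends, which is the engine-agnostic design the abstract describes.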

