Xorbits: Automating Operator Tiling for Distributed Data Science (2401.00865v2)
Abstract: Data science pipelines rely heavily on dataframe and array operations for tasks such as data preprocessing, analysis, and machine learning, with pandas and NumPy being the most popular tools for these tasks. However, both are limited to single-node execution, making them unsuitable for processing large-scale data. Several systems attempt to distribute data science applications across clusters while maintaining interfaces similar to the single-node libraries, enabling data scientists to scale their workloads without significant effort. Yet these systems often struggle with large datasets, running into Out-of-Memory (OOM) errors caused by poor data partitioning. To overcome these challenges, we develop Xorbits, a high-performance, scalable data science framework designed to distribute data science workloads across clusters while retaining familiar APIs. The key differentiator of Xorbits is its ability to dynamically switch between graph construction and graph execution. Xorbits has been deployed in production environments with up to 5k CPU cores; its applications span user behavior analysis and recommendation systems in e-commerce, as well as credit assessment and risk management in finance. Users can scale their pandas and NumPy code simply by changing the import line. Our experiments demonstrate that Xorbits processes very large datasets without encountering OOM or data-skew problems, achieving a 2.66× average speedup over the fastest state-of-the-art solutions. In terms of API coverage, Xorbits attains a compatibility rate of 96.7%, surpassing the fastest framework by 60 percentage points. Xorbits is available at https://github.com/xorbitsai/xorbits.
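The "change the import line" claim above can be illustrated with a minimal sketch. The runnable body below uses plain pandas; the swapped import shown in the comments follows the drop-in module naming from the Xorbits repository (`xorbits.pandas`, `xorbits.numpy`), and the sample data and column names are purely illustrative:

```python
# Single-node pandas code becomes distributed by swapping one line:
#   import pandas as pd   ->  import xorbits.pandas as pd
#   import numpy as np    ->  import xorbits.numpy as np
# (module names as in the Xorbits README; data below is illustrative)
import pandas as pd  # swap this import to run the same code on a cluster

# A toy user-behavior aggregation, as in the e-commerce use case
df = pd.DataFrame({"user": ["a", "b", "a"], "amount": [10, 20, 30]})
total = df.groupby("user")["amount"].sum()
print(total.to_dict())  # per-user spend totals
```

The rest of the script is unchanged: Xorbits mirrors the pandas API, so only the import differs between the single-node and distributed versions.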