Evaluation of Dataframe Libraries for Data Preparation on a Single Machine (2312.11122v3)
Abstract: Data preparation is a trial-and-error process that typically involves countless iterations over the data to define the best pipeline of operators for a given task. With tabular data, practitioners often perform that burdensome activity on local machines by writing ad hoc scripts with libraries based on the Pandas dataframe API and testing them on samples of the entire dataset: the faster the library, the less idle time its users have. In this paper, we evaluate the most popular Python dataframe libraries in general data preparation use cases to assess how they perform on a single machine. To do so, we employ 4 real-world datasets with heterogeneous features, covering a variety of scenarios, and the TPC-H benchmark. The insights gained from this experimentation are useful to data scientists who need to choose which dataframe library best suits the data preparation task at hand. In a nutshell, we found that: for small datasets, Pandas consistently proves to be the best choice, with the richest API; when data fits in RAM and there is no need for complete compatibility with the Pandas API, Polars is the go-to choice thanks to its in-memory execution and query optimizations; when a GPU is available, CuDF often yields the best performance; and for very large datasets that cannot fit in GPU memory or RAM, PySpark proves to be the best option thanks to its multithreaded execution and query optimizer.
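The findings summarized in the abstract can be read as a rough selection rule. The sketch below encodes them as a small Python function purely for illustration. The `Workload` fields, the `suggest_library` name, and the order in which the rules are checked are assumptions, not part of the paper; in particular, the abstract does not quantify "small", and it makes no recommendation for non-small data that fits in RAM but requires full Pandas compatibility.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    """Hypothetical description of a data preparation task."""
    is_small: bool               # "small" is not quantified in the abstract
    fits_in_ram: bool
    fits_in_gpu_memory: bool     # set to False when no GPU is available
    needs_full_pandas_api: bool


def suggest_library(w: Workload) -> Optional[str]:
    """Map a workload to the library the abstract reports as the best fit.

    The precedence of the checks is an assumption made for this sketch.
    """
    if w.is_small:
        return "Pandas"
    if w.fits_in_gpu_memory:
        return "CuDF"            # reported as "often" best, not always
    if not w.fits_in_ram:
        return "PySpark"
    if not w.needs_full_pandas_api:
        return "Polars"
    return None                  # case not covered by the abstract


if __name__ == "__main__":
    task = Workload(is_small=False, fits_in_ram=True,
                    fits_in_gpu_memory=False, needs_full_pandas_api=False)
    print(suggest_library(task))
```

A rule like this is only a starting point: the abstract itself hedges the CuDF result with "often", so measuring on the actual dataset remains the safer way to choose.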