
Evaluation of Dataframe Libraries for Data Preparation on a Single Machine (2312.11122v3)

Published 18 Dec 2023 in cs.DB

Abstract: Data preparation is a trial-and-error process that typically involves countless iterations over the data to define the best pipeline of operators for a given task. With tabular data, practitioners often perform that burdensome activity on local machines by writing ad hoc scripts with libraries based on the Pandas dataframe API and testing them on samples of the entire dataset: the faster the library, the less idle time its users have. In this paper, we evaluate the most popular Python dataframe libraries in general data preparation use cases to assess how they perform on a single machine. To do so, we employ 4 real-world datasets with heterogeneous features, covering a variety of scenarios, and the TPC-H benchmark. The insights gained from this experimentation are useful to data scientists who need to choose which dataframe library best suits the data preparation task at hand. In a nutshell, we found that: for small datasets, Pandas consistently proves to be the best choice, with the richest API; when data fits in RAM and there is no need for complete compatibility with the Pandas API, Polars is the go-to choice thanks to its in-memory execution and query optimizations; when a GPU is available, CuDF often yields the best performance; and for very large datasets that fit neither in GPU memory nor in RAM, PySpark (thanks to its multithreaded execution and query optimizer) proves to be the best option.
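The guidance at the end of the abstract can be read as a rough decision rule. The following is only an illustrative sketch of that reading, not a procedure taken from the paper: the function name `suggest_library`, its boolean parameters, and the order in which the conditions are checked are all assumptions, and the abstract does not say what to pick when none of its cases applies (the sketch returns `None` there).

```python
from typing import Optional


def suggest_library(
    small: bool,
    fits_in_ram: bool,
    needs_full_pandas_api: bool,
    gpu_available: bool,
    fits_in_gpu_memory: bool,
) -> Optional[str]:
    """Hypothetical helper mirroring the abstract's summary of findings.

    The precedence of the checks is an assumption; the abstract lists
    the cases without ranking them against each other.
    """
    # Small datasets: Pandas, which also has the richest API.
    if small:
        return "pandas"
    # A GPU is available and the data fits in its memory: CuDF often wins.
    if gpu_available and fits_in_gpu_memory:
        return "cudf"
    # Data fits in RAM and full Pandas compatibility is not required: Polars.
    if fits_in_ram and not needs_full_pandas_api:
        return "polars"
    # Data fits neither in GPU memory nor in RAM: PySpark.
    if not fits_in_ram:
        return "pyspark"
    # Case not covered by the abstract (fits in RAM, full Pandas API needed).
    return None


if __name__ == "__main__":
    print(suggest_library(False, True, False, False, False))
```

What counts as "small" or "very large" is left to the caller here, since the abstract gives no numeric thresholds.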

