In-depth Analysis On Parallel Processing Patterns for High-Performance Dataframes (2307.01394v1)

Published 3 Jul 2023 in cs.DC, cs.AI, cs.IR, and cs.LG

Abstract: The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. AI and Machine Learning (ML) are bringing more complexities to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amount of time is spent on data preprocessing in these pipelines, and hence improving its efficiency directly impacts the overall pipeline performance. The community has recently embraced the concept of Dataframes as the de facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) experience performance limitations while working on even moderately large data sets. We believe that there is plenty of room for improvement by taking a look at this problem from a high-performance computing point of view. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon [1]. In this paper, we expand on the initial concept by introducing a cost model for evaluating the said patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer.

References (15)
  1. Apache hadoop, https://hadoop.apache.org/.
  2. Apache spark™ - unified engine for large-scale data analytics, https://spark.apache.org/.
  3. Apache flink: Stateful computations over data streams, https://flink.apache.org/.
  4. rapidsai/cudf: cudf - gpu dataframe library, https://github.com/rapidsai/cudf.
  5. Shuffling for groupby and join — dask documentation, https://docs.dask.org/en/stable/dataframe-groupby.html.
  6. Performance tips and tuning — ray 2.0.0, https://docs.ray.io/en/latest/data/performance-tips.html.
  7. Open mpi: Open source high performance computing, https://www.open-mpi.org/.
  8. facebookincubator/gloo: Collective communications library with various primitives for multi-machine training., https://github.com/facebookincubator/gloo.
  9. Pmix — process management interface - exascale copyright 2017-2020 pmix community, https://pmix.github.io/.
  10. Summit user guide - olcf user documentation, https://docs.olcf.ornl.gov/.
  11. Conda - conda documentation, https://docs.conda.io/.
  12. Pypi - the python package index, https://pypi.org/.
  13. pandas - python data analysis library, https://pandas.pydata.org/.
  14. Tpc-homepage, https://www.tpc.org/default5.asp.
  15. Dask — scale the python tools you love, https://www.dask.org/.