DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python (2104.00841v2)

Published 2 Apr 2021 in cs.DB

Abstract: Exploratory Data Analysis (EDA) is a crucial step in any data science project. However, existing Python libraries fall short in supporting data scientists to complete common EDA tasks for statistical modeling. Their API design is either too low level, which is optimized for plotting rather than EDA, or too high level, which is hard to specify more fine-grained EDA tasks. In response, we propose DataPrep.EDA, a novel task-centric EDA system in Python. DataPrep.EDA allows data scientists to declaratively specify a wide range of EDA tasks in different granularity with a single function call. We identify a number of challenges to implement DataPrep.EDA, and propose effective solutions to improve the scalability, usability, customizability of the system. In particular, we discuss some lessons learned from using Dask to build the data processing pipelines for EDA tasks and describe our approaches to accelerate the pipelines. We conduct extensive experiments to compare DataPrep.EDA with Pandas-profiling, the state-of-the-art EDA system in Python. The experiments show that DataPrep.EDA significantly outperforms Pandas-profiling in terms of both speed and user experience. DataPrep.EDA is open-sourced as an EDA component of DataPrep: https://github.com/sfu-db/dataprep.

Authors (9)

Jinglin Peng
Weiyuan Wu
Brandon Lockhart
Song Bian
Jing Nathan Yan
Linghao Xu
Zhixuan Chi
Jeffrey Rzeszotarski
Jiannan Wang

Citations (29)


Summary

Task-Centric Exploratory Data Analysis for Statistical Modeling in Python

This paper introduces DataPrep.EDA, a task-centric exploratory data analysis (EDA) tool designed to address limitations in existing Python EDA libraries. The authors identify that current libraries either provide low-level APIs optimized for plotting rather than EDA or high-level APIs that lack flexibility. DataPrep.EDA seeks to bridge this gap by enabling data scientists to specify diverse EDA tasks at different granularities with a single function call.

Key Contributions and Methodology

DataPrep.EDA offers a framework tailored towards enhancing scalability, usability, and customizability. At its core, the tool maps EDA tasks directly to specific function calls, allowing for tasks such as univariate analysis, correlation analysis, and missing value analysis to be performed efficiently. The system leverages Dask to optimize data processing, enabling parallelization and lazy evaluation to boost performance compared to existing tools like Pandas-profiling.
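To make the idea of mapping tasks to function calls concrete, here is a minimal sketch using only the Python standard library. It is not the DataPrep.EDA API: the function `eda_task`, the dictionary-of-lists table, and the returned fields are all hypothetical and chosen for illustration. The point is only that the arguments the user passes determine which EDA task runs, so one call can express coarse or fine granularity.

```python
from statistics import mean, median


def eda_task(table, *columns):
    """Hypothetical toy dispatcher: the number of columns selects the EDA task."""
    if len(columns) == 0:
        # Overview task: report the value type of every column.
        return {
            "task": "overview",
            "columns": {name: type(values[0]).__name__ for name, values in table.items()},
        }
    if len(columns) == 1:
        # Univariate task: basic statistics for one numeric column.
        values = table[columns[0]]
        return {
            "task": "univariate",
            "mean": mean(values),
            "median": median(values),
            "min": min(values),
            "max": max(values),
        }
    if len(columns) == 2:
        # Bivariate task: pair up the two columns for a scatter-style view.
        x, y = (table[c] for c in columns)
        return {"task": "bivariate", "pairs": list(zip(x, y))}
    raise ValueError("this sketch supports at most two columns")


# Made-up example data, not taken from the paper.
table = {"age": [23, 35, 41, 29], "income": [40.0, 62.5, 80.0, 55.0]}
overview = eda_task(table)                    # whole-table granularity
univariate = eda_task(table, "age")           # single-column granularity
bivariate = eda_task(table, "age", "income")  # column-pair granularity
```

The same function name serves three granularities, which is the declarative style the summary describes: the user states what to analyze and the system decides what to compute.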

The authors emphasize the following contributions:

Development of a task-centric EDA framework that simplifies the execution of common statistical modeling tasks.
Design of a declarative API that enables users to perform complex EDA operations with concise function calls.
Implementation of solutions to overcome computational challenges, particularly through effective use of Dask's capabilities.
Comprehensive evaluation against Pandas-profiling, demonstrating significant improvements in both speed and user experience.
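The partition-and-combine pattern behind this kind of parallel data processing can be sketched without Dask. The example below is an assumption-laden illustration, not the authors' pipeline: it uses a standard-library thread pool in place of Dask, and the helpers `partial_stats`, `merge_stats`, and `chunked` are hypothetical names. Each partition produces small mergeable aggregates, and a final step combines them into global statistics.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(chunk):
    """Per-partition aggregates (count, sum, sum of squares) that can be merged later."""
    return len(chunk), sum(chunk), sum(v * v for v in chunk)


def merge_stats(parts):
    """Combine per-partition aggregates into a global mean and variance."""
    n = sum(p[0] for p in parts)
    total = sum(p[1] for p in parts)
    total_sq = sum(p[2] for p in parts)
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return {"count": n, "mean": mean, "variance": variance}


def chunked(values, size):
    """Split a list into consecutive partitions of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


# Made-up data: the numbers 1..100 split into four partitions.
values = [float(v) for v in range(1, 101)]
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(partial_stats, chunked(values, 25)))
stats = merge_stats(parts)
```

Threads in CPython do not speed up pure-Python arithmetic, so this sketch shows the structure of the computation rather than a performance gain. A framework such as Dask applies the same structure to partitions of a large dataset.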

Experimental Results and Performance

The experimental evaluation covers 15 real-world datasets, on which DataPrep.EDA consistently outperforms Pandas-profiling, often running 4 to 20 times faster. The efficiency improvements are most notable on numerical data and on datasets with fewer categorical columns.

The usability enhancements are underscored by a user study, in which participants completed more tasks with greater accuracy using DataPrep.EDA than with Pandas-profiling. This suggests that the task-centric design reduces the likelihood of false discoveries and enhances user engagement and satisfaction.

Implications and Future Directions

DataPrep.EDA represents a significant step towards more efficient task-centric approaches in EDA, providing a tool that integrates seamlessly into the Python data science ecosystem. The system's design greatly improves the interactive speed and usability, making it a suitable choice for both novice and expert users.

The potential applications of DataPrep.EDA extend beyond immediate improvements in EDA practices. The system's architecture can be adapted for other data-intensive tasks, suggesting future exploration into areas such as time-series analysis and multivariate analysis.

The scalability challenges, which become apparent when I/O is the bottleneck, might be further addressed through data compression techniques and enhanced data storage strategies. Additionally, investigating sampling and sketching methods could offer further computational benefits, particularly in large-scale data scenarios.
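As one example of what a sampling method could look like in this setting, the sketch below implements reservoir sampling (Algorithm R), which keeps a fixed-size uniform sample from a stream in a single pass. This is a generic textbook technique shown for illustration only. The summary does not say which sampling or sketching methods the authors have in mind, and the function name and parameters are assumptions.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Made-up data: estimate the mean of 0..99999 from a 1000-item sample.
sample = reservoir_sample(range(100_000), k=1000)
estimated_mean = sum(sample) / len(sample)
```

Summary statistics or plots computed on the sample approximate those of the full data while touching each record only once and holding just `k` items in memory.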

Conclusion

DataPrep.EDA successfully showcases the advantages of a task-centric approach in exploratory data analysis, offering a robust and efficient alternative to existing Python libraries. Its focus on user-centric design and computational efficiency holds promise for continued development and application in the broader field of data science. The work sets a precedent for future research in refining EDA tools and methodologies, potentially impacting various domains reliant on data analysis and machine learning.


GitHub

GitHub - sfu-db/dataprep: Open-source low code data preparation library in python. Collect, clean and visualization your data in python with a few lines of code. (1,944 stars)