Solving Data Quality Problems with Desbordante: a Demo (2307.14935v2)

Published 27 Jul 2023 in cs.DB, cs.AI, cs.CE, and cs.LG

Abstract: Data profiling is an essential process in modern data-driven industries. One of its critical components is the discovery and validation of complex statistics, including functional dependencies, data constraints, association rules, and others. However, most existing data profiling systems that focus on complex statistics do not provide proper integration with the tools used by contemporary data scientists. This creates a significant barrier to the adoption of these tools in the industry. Moreover, existing systems were not created with industrial-grade workloads in mind. Finally, they do not aim to provide descriptive explanations, i.e., why a given pattern is not found. This is a significant issue, as understanding the underlying reasons for a specific pattern's absence is essential to making informed decisions based on the data. Because of that, these patterns effectively rest in thin air: their application scope is rather limited, and they are rarely used by the broader public. At the same time, as we are going to demonstrate in this presentation, complex statistics can be efficiently used to solve many classic data quality problems. Desbordante is an open-source data profiler that aims to close this gap. It is built with emphasis on industrial application: it is efficient, scalable, resilient to crashes, and provides explanations. Furthermore, it provides seamless Python integration by offloading various costly operations to the C++ core, not only mining. In this demonstration, we show several scenarios that allow end users to solve different data quality problems. Namely, we showcase typo detection, data deduplication, and data anomaly detection scenarios.

Summary

The paper presents an innovative, open-source data profiler that uses advanced C++ techniques to achieve up to 3.43 times faster dependency discovery than Java-based systems.
Desbordante offers a seamless Python interface and supports a variety of primitives, enabling typo detection, data deduplication, and comprehensive profiling.
The paper demonstrates robust anomaly detection with detailed explanations and user-friendly interfaces, enhancing data quality insights for industrial applications.

Overview of Desbordante: Addressing Data Quality Problems

The paper introduces Desbordante, an open-source data profiler focused on solving data quality issues through science-intensive profiling methodologies. The authors emphasize the limitations of existing data profiling systems, noting that they were not designed for industrial-grade workloads and integrate poorly with the tools data scientists use. Desbordante aims to bridge this gap with an efficient and scalable design, providing functionality such as typo detection, data deduplication, and anomaly detection via seamless Python integration.

Core Features of Desbordante

Desbordante distinguishes itself in several ways:

Efficiency and Scalability: Implemented in C++, Desbordante prioritizes performance and resource optimization. It addresses complex computational tasks through techniques such as vectorization and cache-conscious programming. Benchmarking results show a significant speed-up in functional dependency discovery, with up to 3.43 times faster performance than Java-based systems like Metanome.
Comprehensive Primitive Handling: It supports a diverse array of primitives, including functional dependencies, inclusion dependencies, and association rules, among others. This capability enables Desbordante to perform exhaustive data profiling and discern complex metadata structures often overlooked by other systems.
Python Integration: By offloading computationally intensive tasks to its C++ core, Desbordante ensures smooth integration with Python, allowing data scientists to utilize libraries like Pandas effectively. This integration facilitates rapid prototyping and ad-hoc data quality solutions.
Explainability and User Interface: The tool enhances user experience by providing explanations for validation processes, offering insights into why certain patterns do not hold. This feature, combined with its web, console, and Python interfaces, makes Desbordante accessible for industrial applications.
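To make the idea of an explained validation concrete, here is a minimal sketch in plain Python. It does not use Desbordante's API; the function name `validate_fd` and the toy table are assumptions made for illustration. It checks whether a functional dependency holds and, when it does not, returns the groups of rows that break it, which is the kind of "why does this pattern not hold" information described above.

```python
from collections import defaultdict


def validate_fd(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in rows.

    rows: list of dicts (one per record); lhs: tuple of column names;
    rhs: a single column name. Hypothetical helper, not Desbordante's API.
    Returns (holds, violations), where violations maps each offending
    LHS value to the set of conflicting RHS values found for it.
    """
    clusters = defaultdict(set)
    for row in rows:
        key = tuple(row[col] for col in lhs)
        clusters[key].add(row[rhs])
    # An FD is violated wherever one LHS value maps to several RHS values.
    violations = {key: vals for key, vals in clusters.items() if len(vals) > 1}
    return (not violations, violations)


# Toy data (made up for this example): zip should determine city.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "94105", "city": "San Francisco"},
    {"zip": "94105", "city": "San Francsico"},
]

holds, violations = validate_fd(rows, ("zip",), "city")
print("zip -> city holds:", holds)
for key, values in violations.items():
    print("violated for", key, "with values", sorted(values))
```

Returning the violating groups rather than a bare yes/no is the design point: the user can inspect exactly which records prevent the dependency from holding.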

Demonstration Scenarios

The paper demonstrates Desbordante's capabilities through three scenarios, showcasing real-world applications:

Typo Detection: Using approximate functional dependencies, Desbordante can identify likely data entry errors by examining clusters of rows in which a dependency almost holds, that is, where only a few rows violate it. Narrowing the search to these clusters allows for targeted data cleaning.
Data Deduplication: By discovering approximate dependencies and leveraging a sorted-neighborhood-like method, Desbordante enhances deduplication processes, enabling users to consolidate duplicate records effectively.
Anomaly Detection: The tool employs a dynamic mine-explore-validate cycle that aids users in identifying and validating functional dependencies in evolving datasets, providing a means to detect anomalies or shifts in data norms.
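The typo detection scenario can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the paper's implementation and not Desbordante's API: the helper names, the toy table, and the 0.8 similarity threshold are all hypothetical. The error measure used is the fraction of rows that would have to be removed for the dependency to hold exactly (often called g3); the paper may rely on a different measure. String similarity comes from the standard library's `difflib.SequenceMatcher`.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def cluster_counts(rows, lhs, rhs):
    """Group rows by their lhs value and count the rhs values in each group."""
    clusters = defaultdict(Counter)
    for row in rows:
        clusters[row[lhs]][row[rhs]] += 1
    return clusters


def fd_error(rows, lhs, rhs):
    """Fraction of rows to remove so that lhs -> rhs holds exactly (g3-style)."""
    clusters = cluster_counts(rows, lhs, rhs)
    # In each cluster we can keep only the rows with the most frequent rhs value.
    kept = sum(counts.most_common(1)[0][1] for counts in clusters.values())
    return 1 - kept / len(rows)


def typo_candidates(rows, lhs, rhs, min_similarity=0.8):
    """Return (lhs value, suspicious rhs value, majority rhs value) triples.

    A minority value is flagged only if it is textually close to the
    majority value of its cluster; min_similarity is an assumed threshold.
    """
    found = []
    for key, counts in cluster_counts(rows, lhs, rhs).items():
        majority = counts.most_common(1)[0][0]
        for value in counts:
            if value == majority:
                continue
            if SequenceMatcher(None, value, majority).ratio() >= min_similarity:
                found.append((key, value, majority))
    return found


# Toy data (made up): (zip, city) pairs with one typo and one unrelated conflict.
pairs = [
    ("10001", "New York"), ("10001", "New York"),
    ("94105", "San Francisco"), ("94105", "San Francisco"),
    ("94105", "San Francsico"),
    ("60601", "Chicago"), ("60601", "Chicago"), ("60601", "Boston"),
]
table = [{"zip": z, "city": c} for z, c in pairs]

error = fd_error(table, "zip", "city")
candidates = typo_candidates(table, "zip", "city")
print("error of zip -> city:", error)
for key, value, majority in candidates:
    print(f"zip {key}: {value!r} may be a typo of {majority!r}")
```

On this toy table the dependency almost holds, and only the `San Francsico` row is flagged. The `Boston` row also violates the dependency but is not similar to the majority value, so it is left for a different kind of review.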

Implications and Future Directions

Desbordante has significant implications for both practical and theoretical advancements in data profiling. It provides a robust framework for addressing data quality issues, expanding the usability of complex statistics in industrial settings. Its emphasis on explainability and efficiency contributes to the broader adaptability of data profiling tools in various sectors.

Looking forward, there are opportunities to enhance Desbordante's capabilities further. Potential developments may include improving integration with other data science platforms, expanding the range of supported primitives, and refining algorithmic efficiency. As more data-driven industries recognize the importance of high-quality data, tools like Desbordante will become increasingly essential.

Conclusion

Desbordante represents a significant step toward bridging existing gaps in data profiling, offering a comprehensive, efficient, and user-friendly tool for tackling data quality challenges. By integrating science-intensive methodologies with practical applications, Desbordante provides a valuable resource for researchers and industry professionals alike.

GitHub

GitHub - Mstrutov/Desbordante: Desbordante is a high-performance data profiler that is capable of discovering many different patterns in data using various algorithms. It also allows to run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application. (406 stars)