A Novel Metric for Measuring Data Quality in Classification Applications (extended version) (2312.08066v1)
Abstract: Data quality is a key element in building and optimizing good learning models. Despite many attempts to characterize data quality, there is still a need for a rigorous formalization and for an efficient measure of quality computed from available observations. Indeed, without a clear understanding of the training and testing processes, it is hard to evaluate the intrinsic performance of a model. Moreover, tools for measuring data quality that are specific to machine learning are still lacking. In this paper, we introduce and explain a novel metric to measure data quality. This metric is based on the correlated evolution between classification performance and the deterioration of the data. The proposed method has the major advantage of being model-independent. Furthermore, we provide an interpretation of each criterion and examples of assessment levels. We confirm the utility of the proposed metric with extensive numerical experiments and detail some illustrative cases with controlled and interpretable qualities.
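To make the idea of "correlated evolution between classification performance and the deterioration of the data" concrete, the following is a minimal sketch and not the metric defined in the paper. It assumes a synthetic two-class dataset, a hypothetical deterioration step (Gaussian noise added to the features of both the training and test sets), and a simple nearest-centroid classifier. All function names here are illustrative assumptions. It only records how accuracy changes as the deterioration level grows.

```python
import random


def make_data(n, rng):
    """Synthetic 2D data: two Gaussian blobs, labels 0 and 1 (assumption)."""
    data = []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 3.0
        features = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((features, label))
    return data


def deteriorate(data, sigma, rng):
    """Hypothetical deterioration: add N(0, sigma^2) noise to every feature."""
    return [
        (tuple(v + rng.gauss(0.0, sigma) for v in x), y)
        for x, y in data
    ]


def fit_centroids(train):
    """Nearest-centroid classifier: one mean point per class."""
    centroids = {}
    for label in (0, 1):
        pts = [x for x, y in train if y == label]
        centroids[label] = tuple(sum(c) / len(c) for c in zip(*pts))
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


def accuracy(centroids, test):
    return sum(predict(centroids, x) == y for x, y in test) / len(test)


def degradation_curve(sigmas, seed=0):
    """Accuracy measured at each deterioration level in `sigmas`."""
    rng = random.Random(seed)
    train = make_data(400, rng)
    test = make_data(400, rng)
    curve = []
    for sigma in sigmas:
        noisy_train = deteriorate(train, sigma, rng)
        noisy_test = deteriorate(test, sigma, rng)
        model = fit_centroids(noisy_train)
        curve.append((sigma, accuracy(model, noisy_test)))
    return curve


if __name__ == "__main__":
    for sigma, acc in degradation_curve([0.0, 1.0, 2.0, 4.0]):
        print(f"noise sigma={sigma:.1f}  accuracy={acc:.3f}")
```

In this toy setting, accuracy is expected to fall as the noise level rises. A model-independent measure would presumably repeat such a curve over several classifiers and deterioration types, but how the paper aggregates these observations is not stated in the abstract.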
- Additional resources for the reproducibility of the experiment. https://gitlab.com/roxane.jouseau/measuring-data-quality-for-classification-tasks.
- Jouseau Roxane
- Salva Sébastien
- Samir Chafik