A Novel Metric for Measuring Data Quality in Classification Applications (extended version) (2312.08066v1)

Published 13 Dec 2023 in cs.LG and cs.AI

Abstract: Data quality is a key element in building and optimizing good learning models. Despite many attempts to characterize data quality, there is still a need for a rigorous formalization and an efficient measure of quality from available observations. Indeed, without a clear understanding of the training and testing processes, it is hard to evaluate the intrinsic performance of a model. Moreover, tools for measuring data quality specifically for machine learning are still lacking. In this paper, we introduce and explain a novel metric to measure data quality. This metric is based on the correlated evolution between classification performance and the deterioration of data. The proposed method has the major advantage of being model-independent. Furthermore, we provide an interpretation of each criterion and examples of assessment levels. We confirm the utility of the proposed metric with intensive numerical experiments and detail some illustrative cases with controlled and interpretable qualities.
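The abstract's central idea, tracking how classification performance evolves as data is deliberately deteriorated, can be illustrated with a small sketch. This is not the paper's metric or its experimental protocol. It is a minimal illustration under stated assumptions: the data are synthetic 1-D Gaussians, the deterioration is additive feature noise applied to both training and test data, and the classifier is a simple nearest-centroid rule. All function names (`make_data`, `add_noise`, `degradation_curve`, and so on) are hypothetical.

```python
import random

def make_data(n, rng):
    """Two 1-D Gaussian classes centred at -2 and +2 (synthetic, illustration only)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(2.0 if label else -2.0, 1.0)
        data.append((x, label))
    return data

def add_noise(data, std, rng):
    """Assumed deterioration: add zero-mean Gaussian noise to every feature value."""
    return [(x + rng.gauss(0.0, std), y) for x, y in data]

def fit_centroids(data):
    """Nearest-centroid classifier: store the mean feature value of each class."""
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for x, y in data:
        sums[y] += x
        counts[y] += 1
    return {c: sums[c] / counts[c] for c in sums if counts[c]}

def accuracy(centroids, data):
    """Fraction of points whose closest centroid matches their label."""
    correct = 0
    for x, y in data:
        pred = min(centroids, key=lambda c: abs(x - centroids[c]))
        correct += pred == y
    return correct / len(data)

def degradation_curve(noise_levels, n_train=2000, n_test=2000, seed=0):
    """Return (noise level, accuracy) pairs as the data is progressively deteriorated."""
    rng = random.Random(seed)
    train = make_data(n_train, rng)
    test = make_data(n_test, rng)
    curve = []
    for std in noise_levels:
        noisy_train = add_noise(train, std, rng)
        noisy_test = add_noise(test, std, rng)
        curve.append((std, accuracy(fit_centroids(noisy_train), noisy_test)))
    return curve

if __name__ == "__main__":
    for std, acc in degradation_curve([0.0, 1.0, 2.0, 4.0]):
        print(f"noise std {std:.1f}: accuracy {acc:.3f}")
```

A curve that drops quickly as noise is added suggests the classifier depends heavily on the undamaged signal, while a flat curve suggests further deterioration changes little. How such curves are turned into a quality score is the subject of the paper itself and is not reproduced here.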

Authors
  1. Jouseau Roxane
  2. Salva Sébastien
  3. Samir Chafik
