Data Quality Assessment: Challenges and Opportunities (2403.00526v2)
Abstract: Data-oriented applications, their users, and even the law require data of high quality. Research has divided the rather vague notion of data quality into various dimensions, such as accuracy, consistency, and reputation. To achieve the goal of high data quality, many tools and techniques exist to clean and otherwise improve data. Yet, systematic research on actually assessing data quality in its dimensions is largely absent, and with it, the ability to gauge the success of any data cleaning effort. We propose five facets as ingredients to assess data quality: data, source, system, task, and human. Tapping each facet for data quality assessment poses its own challenges. We show how overcoming these challenges helps data quality assessment for those data quality dimensions mentioned in Europe's AI Act. Our work concludes with a proposal for a comprehensive data quality assessment framework.