
Framework to Automatically Determine the Quality of Open Data Catalogs (2307.15464v7)

Published 28 Jul 2023 in cs.IR

Abstract: Data catalogs play a crucial role in modern data-driven organizations by facilitating the discovery, understanding, and utilization of diverse data assets. However, ensuring their quality and reliability is complex, especially in open and large-scale data environments. This paper proposes a framework to automatically determine the quality of open data catalogs, addressing the need for efficient and reliable quality assessment mechanisms. The framework analyzes core quality dimensions such as accuracy, completeness, consistency, scalability, and timeliness; offers several alternatives for assessing compatibility and similarity across catalogs; and implements a set of non-core quality dimensions such as provenance, readability, and licensing. The goal is to empower data-driven organizations to make informed decisions based on trustworthy and well-curated data assets. The source code that illustrates the approach can be downloaded from https://www.github.com/jorge-martinez-gil/dataq/.
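
To make two of the core dimensions named in the abstract concrete, the following is a minimal sketch of how completeness and timeliness could be scored for individual catalog records. It is not taken from the paper or its repository: the field list `REQUIRED_FIELDS`, the function names, the linear decay, and the `max_age_days` threshold are all assumptions chosen for illustration.

```python
from datetime import date

# Hypothetical set of required metadata fields (an assumption, not from the paper).
REQUIRED_FIELDS = ("title", "description", "license", "modified")


def completeness(record: dict) -> float:
    """Fraction of required fields that are present and non-empty."""
    filled = sum(1 for f in REQUIRED_FIELDS if str(record.get(f) or "").strip())
    return filled / len(REQUIRED_FIELDS)


def timeliness(record: dict, today: date, max_age_days: int = 365) -> float:
    """Linear decay from 1.0 (modified today) to 0.0 (max_age_days old or older)."""
    modified = record.get("modified")
    if not modified:
        return 0.0  # no modification date: treat as not timely
    age = (today - date.fromisoformat(modified)).days
    return max(0.0, 1.0 - max(age, 0) / max_age_days)


# Two made-up records, used only to exercise the functions.
catalog = [
    {"title": "Air quality", "description": "Hourly readings",
     "license": "CC-BY-4.0", "modified": "2023-07-01"},
    {"title": "Bus stops", "description": "", "license": None,
     "modified": "2021-01-15"},
]

for rec in catalog:
    print(rec["title"], completeness(rec), timeliness(rec, date(2023, 7, 28)))
```

A catalog-level score could then be an average over records, although how the paper aggregates or weights dimensions is not stated in this excerpt.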

