
JSONoid: Monoid-based Enrichment for Configurable and Scalable Data-Driven Schema Discovery (2307.03113v1)

Published 6 Jul 2023 in cs.DB

Abstract: Schema discovery is an important part of working with data in formats such as JSON. Unlike relational databases, JSON datasets often lack associated structural information. Consumers of such datasets are often left to browse the data, looking for structure common across documents, before they can write suitable data-processing code. This process is time-consuming and error-prone. Existing distributed approaches to mining schemas offer a significant usability advantage because they provide useful metadata for large data sources. However, depending on the data source, the ad hoc queries needed to estimate other properties that help in crafting an efficient data pipeline can be expensive. We propose JSONoid, a distributed schema discovery process augmented with additional metadata in the form of monoid data structures that are easily maintained in a distributed setting. JSONoid subsumes several existing approaches to distributed schema discovery while offering similar performance. Our approach also enriches discovered schemas with substantial additional information about data values while scaling linearly.
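
To illustrate why monoid-shaped metadata is easy to maintain in a distributed setting, here is a minimal Python sketch. It is not JSONoid's implementation: the names `FieldSummary`, `merge_schemas`, `summarize_document`, and `summarize_partition` are hypothetical, and the tracked properties (occurrence count, observed types, numeric minimum and maximum) are assumptions chosen for illustration. The idea shown is only that each summary has an identity element and an associative merge, so partitions can be summarized independently and combined in any grouping.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable, Dict, FrozenSet, Iterable, Optional


def _combine(fn: Callable[[float, float], float],
             a: Optional[float], b: Optional[float]) -> Optional[float]:
    """Apply fn to two optional numbers, treating None as 'no value seen'."""
    if a is None:
        return b
    if b is None:
        return a
    return fn(a, b)


@dataclass(frozen=True)
class FieldSummary:
    """Hypothetical per-field summary; the default instance is the identity."""
    count: int = 0
    types: FrozenSet[str] = frozenset()
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def merge(self, other: "FieldSummary") -> "FieldSummary":
        # Each component merge (sum, set union, min, max) is associative,
        # so the combined merge is associative as well.
        return FieldSummary(
            count=self.count + other.count,
            types=self.types | other.types,
            minimum=_combine(min, self.minimum, other.minimum),
            maximum=_combine(max, self.maximum, other.maximum),
        )


Schema = Dict[str, FieldSummary]


def summarize_value(value: Any) -> FieldSummary:
    """Summarize a single observed value (bool is not treated as a number)."""
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    bound = value if is_number else None
    return FieldSummary(1, frozenset({type(value).__name__}), bound, bound)


def summarize_document(doc: Dict[str, Any]) -> Schema:
    """Summarize the top-level fields of one JSON object (assumed flat)."""
    return {key: summarize_value(value) for key, value in doc.items()}


def merge_schemas(a: Schema, b: Schema) -> Schema:
    """Merge two schemas field by field; the empty dict is the identity."""
    empty = FieldSummary()
    return {key: a.get(key, empty).merge(b.get(key, empty))
            for key in set(a) | set(b)}


def summarize_partition(docs: Iterable[Dict[str, Any]]) -> Schema:
    """Fold all documents of one partition into a single schema."""
    return reduce(merge_schemas, map(summarize_document, docs), {})


if __name__ == "__main__":
    partition_a = [{"id": 1, "name": "a"}, {"id": 5}]
    partition_b = [{"id": 3, "name": 7}]
    combined = merge_schemas(summarize_partition(partition_a),
                             summarize_partition(partition_b))
    # Merging per-partition summaries matches summarizing everything at once.
    assert combined == summarize_partition(partition_a + partition_b)
```

Because the merge is associative and has an identity, the same fold could in principle run as a reduce step in a framework such as Apache Spark, with each worker summarizing its own partition before the partial results are combined.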
