
A Polystore Architecture Using Knowledge Graphs to Support Queries on Heterogeneous Data Stores (2308.03584v2)

Published 7 Aug 2023 in cs.DB

Abstract: Modern applications commonly need to manage datasets composed of heterogeneous data and schemas, which makes them difficult to access in an integrated way. A single data store that manages heterogeneous data under a common data model is not effective in such a scenario, so the domain data ends up fragmented across the data stores that best fit each fragment's storage and access requirements (e.g., NoSQL, relational DBMS, or HDFS). Moreover, organizational workflows consume these fragments independently, and there is usually no explicit link among the fragments that could support an integrated view. The research challenge tackled by this work is to provide the means to query heterogeneous data residing in distinct data repositories that are not explicitly connected. We propose a federated database architecture that presents users with a single abstract global conceptual schema against which they write their queries. The architecture encapsulates data heterogeneity, location, and linkage by employing: (i) meta-models to represent the global conceptual schema, the local conceptual schemas of the remote data, and the mappings among them; (ii) provenance to create explicit links among the consumed and generated data residing in separate datasets. We evaluated the architecture through its implementation as a polystore service, following a microservice architecture approach, in a scenario that simulates a real case in the Oil & Gas industry. We also compared the proposed architecture to a relational multidatabase system based on foreign data wrappers, measuring the user's cognitive load to write a query (the query complexity) and the query processing time. The results show that queries written for the proposed architecture are half as complex as the equivalent queries written for the relational multidatabase system, at a cost of no more than 30% additional query processing time.
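
The two mechanisms named in the abstract, schema mappings and provenance links, can be illustrated with a minimal sketch. This is not the paper's implementation: the store names, record keys, field names, and the `query_global` helper are all hypothetical, and in-memory dictionaries stand in for the remote data stores. The sketch only shows how a query phrased in global attributes might be resolved against fragments that live in different stores and are connected solely through recorded links.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """Maps one attribute of the global schema to a field in a local store."""
    global_attr: str
    store: str
    local_field: str


# Hypothetical local stores, represented here by in-memory records.
STORES = {
    "relational": {"w1": {"well_name": "Well-A"}},
    "document": {"s9": {"survey_file": "survey_009.sgy"}},
}

# Hypothetical mappings from global attributes to local fields.
MAPPINGS = [
    Mapping("name", "relational", "well_name"),
    Mapping("seismic_file", "document", "survey_file"),
]

# Provenance-style links: explicit records of which fragments in
# different stores are related (assumed to be captured elsewhere).
PROVENANCE_LINKS = {("relational", "w1"): [("document", "s9")]}


def query_global(store, key, attrs):
    """Resolve global attributes starting from one record and following links."""
    # Gather the starting record plus every record linked to it.
    records = {store: STORES[store][key]}
    for linked_store, linked_key in PROVENANCE_LINKS.get((store, key), []):
        records[linked_store] = STORES[linked_store][linked_key]

    # Translate each requested global attribute through its mapping.
    result = {}
    for m in MAPPINGS:
        if m.global_attr in attrs and m.store in records:
            result[m.global_attr] = records[m.store][m.local_field]
    return result


print(query_global("relational", "w1", ["name", "seismic_file"]))
```

The caller names only global attributes (`name`, `seismic_file`); which store holds each value, and how the two records relate, stays hidden behind the mappings and the link table.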

