
Graph integration of structured, semistructured and unstructured data for data journalism

Published 23 Jul 2020 in cs.DB, cs.AI, and cs.CY | (2007.12488v2)

Abstract: Nowadays, journalism is facilitated by the existence of large amounts of digital data sources, including many Open Data ones. Such data sources are extremely heterogeneous, ranging from highly structured (relational databases), semi-structured (JSON, XML, HTML), graphs (e.g., RDF), and text. Journalists (and other classes of users lacking advanced IT expertise, such as most non-governmental-organizations, or small public administrations) need to be able to make sense of such heterogeneous corpora, even if they lack the ability to define and deploy custom extract-transform-load workflows. These are difficult to set up not only for arbitrary heterogeneous inputs, but also given that users may want to add (or remove) datasets to (from) the corpus. We describe a complete approach for integrating dynamic sets of heterogeneous data sources along the lines described above: the challenges we faced to make such graphs useful, allow their integration to scale, and the solutions we proposed for these problems. Our approach is implemented within the ConnectionLens system; we validate it through a set of experiments.

Citations (38)
