From Data Fusion to Knowledge Fusion (1503.00302v1)

Published 1 Mar 2015 in cs.DB

Abstract: The task of data fusion is to identify the true values of data items (e.g., the true date of birth for Tom Cruise) among multiple observed values drawn from different sources (e.g., Web sites) of varying (and unknown) reliability. A recent survey [LDL+12] has provided a detailed comparison of various fusion methods on Deep Web data. In this paper, we study the applicability and limitations of different fusion techniques on a more challenging problem: knowledge fusion. Knowledge fusion identifies true subject-predicate-object triples extracted by multiple information extractors from multiple information sources. These extractors perform the tasks of entity linkage and schema alignment, thus introducing an additional source of noise that is quite different from that traditionally considered in the data fusion literature, which only focuses on factual errors in the original sources. We adapt state-of-the-art data fusion techniques and apply them to a knowledge base with 1.6B unique knowledge triples extracted by 12 extractors from over 1B Web pages, which is three orders of magnitude larger than the data sets used in previous data fusion papers. We show great promise of the data fusion approaches in solving the knowledge fusion problem, and suggest interesting research directions through a detailed error analysis of the methods.

From Data Fusion to Knowledge Fusion

The paper, "From Data Fusion to Knowledge Fusion," authored by a research team at Google, explores the development of techniques to address the complex problem of knowledge fusion, extending beyond traditional data fusion methodologies. The paper emphasizes the identification of accurate subject-predicate-object triples across multiple data sources and extractors, contending with the noise introduced in such large-scale information extraction systems.
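To make the setting concrete, the following is a minimal sketch of the input a knowledge fusion system consumes, using illustrative field names rather than the paper's actual schema: each extracted triple carries provenance identifying both the extractor that produced it and the Web page it came from, and all triples sharing a (subject, predicate) pair form one data item to be resolved.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractedTriple:
    """One extracted claim plus its provenance (illustrative field names)."""
    subject: str    # e.g. "Tom Cruise"
    predicate: str  # e.g. "date_of_birth"
    obj: str        # the claimed value, e.g. "1962-07-03"
    extractor: str  # which extractor produced this triple
    source: str     # URL of the Web page the triple was extracted from

# In data fusion, evidence is keyed by source alone; in knowledge fusion,
# the (extractor, source) pair is the unit of evidence, so extractor errors
# and source errors can both inflate support for a wrong value.
```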

Overview and Contributions

The authors highlight three primary contributions. First, they redefine the knowledge fusion problem, focusing on its distinctions from data fusion, including its multi-dimensional nature and the intricacies added by information extractors. Second, they propose modifications to existing data fusion methods—such as Bayesian approaches—to cater to the scale and complexity of knowledge bases, leveraging substantial datasets extracted from over a billion web pages. Third, they conduct a comprehensive error analysis, suggesting research directions to address the limitations observed in current techniques.

Numerical Results and Analysis

In their empirical evaluations, the authors adapt data fusion techniques such as Vote, Accu, and PopAccu, modifying them to work with the provenance information attached to extracted triples. Among their findings, the refined version of PopAccu shows a smaller deviation between predicted and observed probabilities and a larger area under the precision-recall curve, reflecting better-calibrated results.
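As a rough illustration of the Accu-style Bayesian update the paper adapts, the sketch below scores the candidate values of a single data item from the accuracies of the provenances that support them. The vote-count form, the default accuracy, and the assumed number of false values are simplifying assumptions made here for illustration, not the paper's production implementation.

```python
import math

def accu_fuse(claims, accuracy, n_false=10, default_acc=0.8):
    """Accu-style fusion for one data item (one subject-predicate pair).

    claims:   dict mapping each candidate value to the set of (extractor,
              source) provenances that asserted it.
    accuracy: dict mapping a provenance to its estimated accuracy.
    n_false:  assumed number of wrong values per data item (uniform prior).
    Returns a dict mapping each candidate value to its probability of truth.
    """
    scores = {}
    for value, provenances in claims.items():
        score = 0.0
        for p in provenances:
            a = min(max(accuracy.get(p, default_acc), 1e-6), 1 - 1e-6)
            # Each supporting provenance adds a log-odds "vote count".
            score += math.log(n_false * a / (1.0 - a))
        scores[value] = score
    # Softmax-normalize the vote counts into probabilities.
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return {v: math.exp(s - m) / z for v, s in scores.items()}

claims = {
    "1962-07-03": {("extractor_A", "site1.com"), ("extractor_B", "site2.com")},
    "1963-07-03": {("extractor_C", "site3.com")},
}
accuracy = {("extractor_A", "site1.com"): 0.9,
            ("extractor_B", "site2.com"): 0.7,
            ("extractor_C", "site3.com"): 0.6}
print(accu_fuse(claims, accuracy))  # assigns most probability to "1962-07-03"
```

Vote corresponds to scoring each value by its raw share of provenances, while PopAccu, as described in the data fusion literature, replaces the uniform false-value assumption with the observed popularity of each value.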

Using a MapReduce framework, the authors manage the scale of knowledge fusion, processing input roughly three orders of magnitude larger than the data sets used in previous data fusion work. The evaluation suggests that their approach, even under the simplifying assumption of functional predicates, yields reasonable accuracy, with PopAccu producing well-calibrated probability estimates over the corpus of 1.6 billion extracted triples.
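A minimal sketch of that decomposition is shown below, assuming the ExtractedTriple records from the earlier sketch and using a simple Vote-style reducer; Accu or PopAccu would replace the provenance fraction with the Bayesian update sketched above. This mirrors the map/reduce split in spirit only and is not the authors' production pipeline.

```python
from collections import defaultdict

def map_phase(triples):
    """Map: key each extracted triple by its data item (subject, predicate)."""
    for t in triples:
        yield (t.subject, t.predicate), (t.obj, (t.extractor, t.source))

def reduce_phase(item_key, emitted):
    """Reduce: gather the conflicting values for one data item and fuse them
    with a Vote-style score (share of distinct provenances per value)."""
    support = defaultdict(set)
    for obj, provenance in emitted:
        support[obj].add(provenance)
    total = sum(len(provs) for provs in support.values())
    return item_key, {value: len(provs) / total
                      for value, provs in support.items()}
```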

Implications and Future Directions

The proposed knowledge fusion methodologies have significant implications both in theory and practice. The scalable adaptations of data fusion techniques, tailored to handle the noise and complexity of automated knowledge extraction, could be crucial for maintaining the integrity and usability of dynamic, large-scale knowledge bases.

Looking forward, several avenues for future research are identified, such as refining the extraction process to distinguish errors introduced by the extractors from errors truly present in the original sources. Modeling correlations and dependencies between extractors could also reduce the advantage that false triples gain simply from being corroborated by multiple extractors. Relaxing the assumption of functional predicates to handle non-functional ones, and adopting a semi-open-world model of ground truth, could further extend the applicability of the technique to open-domain knowledge extraction.

The paper delineates a forward-looking research agenda while successfully demonstrating the feasibility of adapting data fusion techniques for knowledge fusion, depicting a promising path toward more robust knowledge base construction methodologies. The work encourages the ongoing refinement of these methods to effectively manage uncertainties and complexities at scale in automated knowledge extraction efforts.

Authors (7)
  1. Xin Luna Dong
  2. Evgeniy Gabrilovich
  3. Geremy Heitz
  4. Wilko Horn
  5. Kevin Murphy
  6. Shaohua Sun
  7. Wei Zhang
Citations (238)