A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration (1203.0058v1)

Published 1 Mar 2012 in cs.DB and cs.LG

Abstract: In practical data integration systems, it is common for the data sources being integrated to provide conflicting information about the same entity. Consequently, a major challenge for data integration is to derive the most complete and accurate integrated records from diverse and sometimes conflicting sources. We term this challenge the truth finding problem. We observe that some sources are generally more reliable than others, and therefore a good model of source quality is the key to solving the truth finding problem. In this work, we propose a probabilistic graphical model that can automatically infer true records and source quality without any supervision. In contrast to previous methods, our principled approach leverages a generative process of two types of errors (false positive and false negative) by modeling two different aspects of source quality. In so doing, ours is also the first approach designed to merge multi-valued attribute types. Our method is scalable, due to an efficient sampling-based inference algorithm that needs very few iterations in practice and enjoys linear time complexity, with an even faster incremental variant. Experiments on two real world datasets show that our new method outperforms existing state-of-the-art approaches to the truth finding problem.

A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration

This paper introduces a novel methodology to address the challenge of determining accurate information from unreliable and conflicting data sources in data integration systems. The core problem identified by the authors is the "truth finding problem," where conflicting claims about the same entity necessitate a mechanism to discern the most complete and accurate set of true records. To tackle this issue, the authors propose a Bayesian probabilistic graphical model called the Latent Truth Model (LTM), which is capable of inferring both the underlying truths and the quality of data sources without supervision.

The LTM stands out by modeling source quality with two separate components: specificity, which reflects how well a source avoids false positives (asserting values that are not true), and sensitivity, which reflects how well it avoids false negatives (omitting values that are true). Decoupling these two error types lets the model handle settings in which several values can be simultaneously true, as is frequently the case for multi-valued attribute types such as the authors of a book.
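To make this concrete, here is a minimal sketch of a two-sided generative process in this spirit, written in Python with NumPy. The Beta hyperparameters, the uniform truth prior, and the assumption that every source comments on every candidate fact are illustrative choices, not the paper's exact model or notation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-sided generative sketch (hypothetical hyperparameters):
# each source has a sensitivity (chance of asserting a fact that is true)
# and a specificity (chance of staying silent on a fact that is false).
n_sources, n_facts = 5, 1000
prior_truth = 0.5                             # prior probability a candidate fact is true
sensitivity = rng.beta(5, 2, size=n_sources)  # per-source recall of true facts
specificity = rng.beta(8, 2, size=n_sources)  # per-source avoidance of false facts

truth = rng.random(n_facts) < prior_truth     # latent truth of each candidate fact

# A source asserts a fact with probability sensitivity if the fact is true,
# and with probability (1 - specificity) if it is false.
p_claim = np.where(truth, sensitivity[:, None], 1 - specificity[:, None])
claims = rng.random((n_sources, n_facts)) < p_claim  # observed source-by-fact claim matrix
```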

Traditional methods, such as majority voting or single-number quality scores based on accuracy or precision, prove inadequate, especially when false positive and false negative rates vary significantly across sources. The strength of LTM lies in jointly and iteratively refining its estimates of the truths and of source quality, exploiting the fact that each informs the other: knowing which sources are reliable sharpens the inferred truths, and knowing the truths sharpens the quality estimates. Because both are latent variables in a single probabilistic graphical model, this mutual refinement follows from standard Bayesian inference rather than ad hoc heuristics; a toy illustration of why uniform voting breaks down is sketched below.
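The following toy example uses hypothetical sources and values (not taken from the paper): two sloppy sources that repeat the same spurious authors tie with a true value under plain majority voting, whereas weighting claims by an assumed estimate of each source's specificity separates them.

```python
from collections import Counter

# Toy counterexample with hypothetical sources and values (not from the paper).
# Four sources list the authors of one book; the true author set is {"A", "B"}.
listed_authors = {
    "s1": ["A", "B"],       # careful source
    "s2": ["A", "B"],       # careful source
    "s3": ["A", "C", "D"],  # sloppy source: spurious authors, misses "B"
    "s4": ["A", "C", "D"],  # another sloppy source repeating the same errors
}

# Majority voting counts every claim equally, so the spurious "C" and "D"
# tie with the true "B" at two votes each and cannot be separated.
votes = Counter(v for values in listed_authors.values() for v in values)
print(votes.most_common())  # [('A', 4), ('B', 2), ('C', 2), ('D', 2)]

# Weighting each claim by an estimate of its source's specificity
# (its tendency to avoid false positives) breaks the tie in favour of "B".
est_specificity = {"s1": 0.95, "s2": 0.95, "s3": 0.4, "s4": 0.4}  # assumed estimates
scores = Counter()
for source, values in listed_authors.items():
    for v in values:
        scores[v] += est_specificity[source]
print(scores.most_common())  # "B" scores about 1.9; "C" and "D" about 0.8
```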

Empirically, the model performed strongly on two real-world datasets: a book author dataset gathered from online booksellers and a movie director dataset from Bing. On both, LTM outperformed existing methods, maintaining high precision and recall across a range of decision thresholds. Practically, these results point to its usefulness in applications that depend on consistent integrated data, such as knowledge bases, large-scale data merging, and web indexing systems.

Additionally, LTM employs a scalable inference algorithm based on collapsed Gibbs sampling. Each iteration runs in time linear in the size of the input, and the sampler requires very few iterations in practice, making the method feasible for large collections of sources. An incremental variant further supports streaming data, updating the model as new claims arrive without re-running inference over all accumulated data; a simplified sketch of a sampling pass is given below.
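The sketch below shows one sweep of a simplified, uncollapsed Gibbs-style sampler over the variables from the earlier generative sketch. The function name gibbs_pass, the fixed Beta hyperparameters, and the assumption that every source votes on every fact are illustrative; the paper's actual algorithm is a collapsed sampler over a richer model.

```python
import numpy as np

def gibbs_pass(claims, truth, sens, spec, prior=0.5,
               a_s=5, b_s=2, a_p=8, b_p=2, rng=None):
    """One sweep of a simplified (uncollapsed) Gibbs-style sampler:
    resample each latent truth given current source quality, then
    resample each source's sensitivity/specificity from Beta posteriors."""
    if rng is None:
        rng = np.random.default_rng(1)
    n_sources, n_facts = claims.shape

    # Resample each fact's truth label from its conditional posterior.
    for f in range(n_facts):
        c = claims[:, f]
        log_p1 = np.log(prior) + np.sum(np.where(c, np.log(sens), np.log(1 - sens)))
        log_p0 = np.log(1 - prior) + np.sum(np.where(c, np.log(1 - spec), np.log(spec)))
        p_true = 1.0 / (1.0 + np.exp(log_p0 - log_p1))
        truth[f] = rng.random() < p_true

    # Resample each source's quality from its conjugate Beta posterior.
    for s in range(n_sources):
        tp = np.sum(claims[s] & truth)    # claimed and true
        fn = np.sum(~claims[s] & truth)   # missed a true fact
        fp = np.sum(claims[s] & ~truth)   # claimed a false fact
        tn = np.sum(~claims[s] & ~truth)  # correctly silent
        sens[s] = rng.beta(a_s + tp, b_s + fn)
        spec[s] = rng.beta(a_p + tn, b_p + fp)
    return truth, sens, spec

# Example use with the claim matrix from the earlier sketch:
# t = np.zeros(n_facts, dtype=bool)
# s_est = np.full(n_sources, 0.8)
# p_est = np.full(n_sources, 0.8)
# for _ in range(10):
#     t, s_est, p_est = gibbs_pass(claims, t, s_est, p_est)
```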

Theoretically, the framework points toward more refined approaches to probabilistic reasoning and decision-making under uncertainty. Future enhancements could extend the model to capture dependencies across multiple attributes or to handle error types beyond binary truth assessments. Investigations into adversarial data sources and into integrating real-valued loss into the truth-discovery process are further promising avenues for expanding the robustness and utility of this Bayesian approach.

In conclusion, the paper's contribution lies in its nuanced treatment of the truth-finding problem, proposing a solution that combines the strengths of advanced probabilistic modeling with practical scalability and accuracy. It charts a path for future research and application development in data integration and collective intelligence systems, emphasizing the importance of source reliability and truth estimation in an era of data abundance and diversity.

Authors (4)
  1. Bo Zhao (242 papers)
  2. Benjamin I. P. Rubinstein (69 papers)
  3. Jim Gemmell (4 papers)
  4. Jiawei Han (263 papers)
Citations (343)