
A Survey on Truth Discovery (1505.02463v2)

Published 11 May 2015 in cs.DB

Abstract: Thanks to information explosion, data for the objects of interest can be collected from increasingly more sources. However, for the same object, there usually exist conflicts among the collected multi-source information. To tackle this challenge, truth discovery, which integrates multi-source noisy information by estimating the reliability of each source, has emerged as a hot topic. Several truth discovery methods have been proposed for various scenarios, and they have been successfully applied in diverse application domains. In this survey, we focus on providing a comprehensive overview of truth discovery methods, and summarizing them from different aspects. We also discuss some future directions of truth discovery research. We hope that this survey will promote a better understanding of the current progress on truth discovery, and offer some guidelines on how to apply these approaches in application domains.

Authors (8)
  1. Yaliang Li (117 papers)
  2. Jing Gao (98 papers)
  3. Chuishi Meng (5 papers)
  4. Qi Li (354 papers)
  5. Lu Su (18 papers)
  6. Bo Zhao (242 papers)
  7. Wei Fan (160 papers)
  8. Jiawei Han (263 papers)
Citations (407)

Summary

Survey and Analysis of Truth Discovery Methods

The paper "A Survey on Truth Discovery" by Yaliang Li et al. presents an exhaustive overview of the field of truth discovery, a significant topic in data integration aimed at resolving conflicts in multi-source information by evaluating the reliability of each source. The authors systematically dissect current methodologies, propose a taxonomy of truth discovery techniques based on various aspects, and provide valuable insights into the challenges and future directions of this research domain.

Core Methodological Insights

One of the pivotal contributions of this paper is its articulation of the general principle of truth discovery: sources that frequently provide true information are more reliable, and, conversely, information supported by reliable sources is more likely to be true. The authors delineate three prominent families of methods that encapsulate this principle: iterative methods, optimization-based methods, and probabilistic graphical model methods.

  1. Iterative Methods - These methods alternately estimate source reliability and infer truths, refining both until a convergence criterion is met (a minimal sketch of this alternating scheme appears after this list).
  2. Optimization-Based Methods - In these approaches, the problem is formulated as an optimization task where the goal is to minimize the weighted discrepancy between the observed data and the discovered truths.
  3. Probabilistic Graphical Models (PGMs) - These methods represent the claimed data, latent truths, and source reliabilities through structured probabilistic dependencies, allowing inference of the latent variables that capture source reliability and the true values.
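
To make the alternating scheme concrete, the following is a minimal sketch for categorical claims: truths are inferred by weighted voting under the current source weights, and each source's weight is then re-estimated from its agreement with the current truths. The function name, input layout, and the specific agreement-based weight update are illustrative assumptions rather than the exact rule of any particular method in the survey.

```python
from collections import defaultdict

def iterative_truth_discovery(claims, num_iters=20, eps=1e-6):
    """Illustrative alternating update for categorical claims.

    claims: {object_id: {source_id: claimed_value}}
    Returns (truths, source_weights). A generic sketch, not the exact
    update rule of any specific method covered in the survey.
    """
    sources = {s for votes in claims.values() for s in votes}
    weights = {s: 1.0 for s in sources}           # start with uniform reliability
    truths = {}

    for _ in range(num_iters):
        # Step 1: infer truths by weighted voting under current source weights.
        for obj, votes in claims.items():
            scores = defaultdict(float)
            for src, val in votes.items():
                scores[val] += weights[src]
            truths[obj] = max(scores, key=scores.get)

        # Step 2: re-estimate each source's reliability as its agreement
        # with the currently estimated truths.
        for src in sources:
            total = sum(1 for votes in claims.values() if src in votes)
            correct = sum(1 for obj, votes in claims.items()
                          if src in votes and votes[src] == truths[obj])
            weights[src] = (correct + eps) / (total + eps)

    return truths, weights

# Example: three sources make conflicting claims about two objects.
claims = {
    "obj1": {"s1": "A", "s2": "A", "s3": "B"},
    "obj2": {"s1": "X", "s2": "Y", "s3": "X"},
}
print(iterative_truth_discovery(claims))
```

In practice, methods differ mainly in how the second step maps agreement to a weight (for example, through log-scaling or probabilistic modeling), but the alternating structure shown here is common across the iterative family.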

Analytical Perspectives

The survey also categorizes truth discovery methods based on key variables: input data, source reliability, object relations, claimed value characteristics, and output formats. The rigour with which the authors handle each category provides a clear understanding of how different conditions and assumptions can affect the implementation and outcomes of truth discovery algorithms. Some notable analytical perspectives include:

  • Data Considerations: The authors note the heterogeneity of data types (categorical, continuous) and underscore the necessity of incorporating structural or temporal correlations when applicable; a loss-function sketch for mixed data types follows this list.
  • Source Dependency: They highlight algorithms that attempt to consider dependencies among sources, such as copying relationships and source correlations.
  • Object Relations: The survey discusses the potential improvements in truth estimation by considering relational data or constraints among objects.
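
As an illustration of the data-heterogeneity point noted above, the sketch below shows how a weighted-disagreement objective can accommodate mixed data types by applying a type-appropriate distance per object: a 0/1 loss for categorical values and a spread-normalized squared loss for continuous values. The function name, input layout, and normalization choice are assumptions made for this example, not a specific formulation from the survey.

```python
import statistics

def mixed_type_loss(claims, truths, weights, types):
    """Weighted disagreement between claims and current truth estimates,
    using a type-appropriate distance per object (illustrative sketch).

    claims:  {object_id: {source_id: claimed_value}}
    truths:  {object_id: current truth estimate}
    weights: {source_id: source reliability weight}
    types:   {object_id: "categorical" or "continuous"}
    """
    loss = 0.0
    for obj, votes in claims.items():
        if types[obj] == "continuous":
            # Normalize by the spread of claims so objects measured on
            # large scales do not dominate the objective.
            spread = statistics.pstdev(votes.values()) or 1.0
        for src, val in votes.items():
            if types[obj] == "categorical":
                d = 0.0 if val == truths[obj] else 1.0        # 0/1 loss
            else:
                d = ((val - truths[obj]) / spread) ** 2       # normalized squared loss
            loss += weights[src] * d
    return loss
```

Minimizing such an objective over truths (with weights fixed) and then over weights (with truths fixed) yields the same alternating structure described in the method list above.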

Challenges and Future Directions

The paper outlines several challenges that present opportunities for further research:

  • Unstructured Data: Moving beyond databases, the integration of unstructured data such as textual information requires novel truth discovery methodologies that account for inherent data uncertainties.
  • Object and Source Correlations: Developing methods that can automatically identify and utilize object and source correlations without explicit prior knowledge remains a prominent challenge.
  • Initial Source Reliability: The initialization of source reliability affects the performance of truth discovery methods, indicating a need for adaptive or learning-based initializations.

Practical Implications and Speculations

Truth discovery plays a crucial role across various domains including healthcare, crowd/social sensing, crowdsourcing, information extraction, and knowledge base construction. The techniques discussed have the potential to improve outcomes in data fusion, offering more reliable aggregated information which is crucial in decision-making processes. The paper postulates that advancements in truth discovery will further refine large-scale data processing, leading to more robust systems capable of handling increasing amounts of data with conflicting information.

Conclusion

This survey not only serves as a valuable resource for understanding the current state and methodologies of truth discovery but also stimulates further research by outlining unresolved challenges in the field. The side-by-side comparison of methods offers researchers a clear starting point for application-specific problems, aiding the selection of approaches tailored to the characteristics of their data and objectives.