
Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources

Published 12 Feb 2015 in cs.DB and cs.IR (arXiv:1502.03519v1)

Abstract: The quality of web sources has been traditionally evaluated using exogenous signals such as the hyperlink structure of the graph. We propose a new approach that relies on endogenous signals, namely, the correctness of factual information provided by the source. A source that has few false facts is considered to be trustworthy. The facts are automatically extracted from each source by information extraction methods commonly used to construct knowledge bases. We propose a way to distinguish errors made in the extraction process from factual errors in the web source per se, by using joint inference in a novel multi-layer probabilistic model. We call the trustworthiness score we computed Knowledge-Based Trust (KBT). On synthetic data, we show that our method can reliably compute the true trustworthiness levels of the sources. We then apply it to a database of 2.8B facts extracted from the web, and thereby estimate the trustworthiness of 119M webpages. Manual evaluation of a subset of the results confirms the effectiveness of the method.

Citations (217)

Summary

  • The paper introduces Knowledge-Based Trust (KBT) to estimate web source reliability by evaluating the factual correctness of content.
  • It employs a probabilistic multi-layer model that differentiates factual errors from extraction errors and is validated on billions of data triples.
  • The framework enhances practical applications in web search and content curation by complementing traditional popularity-based metrics.


In this paper, the authors investigate the challenging task of assessing the trustworthiness of web sources by introducing a novel concept known as Knowledge-Based Trust (KBT). This framework aims to move beyond traditional exogenous metrics such as hyperlink structure and browsing history, which primarily evaluate the popularity rather than the reliability of information sources.

Methodological Framework

The central premise of KBT is to estimate reliability through endogenous signals, specifically the factual correctness of content. The paper outlines a probabilistic multi-layer model that distinguishes factual errors within web content from errors arising during the extraction process. This approach leverages the large number of facts extracted from webpages and employs joint inference to simultaneously evaluate the accuracy of extracted facts and the trustworthiness of their sources.

The framework consists of several components:

  1. Fact Extraction: This involves extracting (subject, predicate, object) triples from web pages using methods developed under the Knowledge Vault (KV) project, where subjects and predicates are defined within Freebase.
  2. Inference Model: The model is engineered to differentiate between factual inaccuracies originating from the source itself and errors introduced during data extraction. This is accomplished via a probabilistic model that applies joint inference across multiple layers.
  3. Granularity Control: An adaptive mechanism is implemented to determine the granularity at which information is aggregated within sources. This ensures computational efficiency and robust statistical estimation by merging data from similar sources or splitting excessively large sources.
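The joint-inference idea behind the components above can be illustrated with a much-simplified truth-discovery loop: the probability that a claim is true and the accuracy of each source are re-estimated in alternation, so that the two estimates reinforce each other. This is a hypothetical single-layer sketch (the function name `estimate_trust` and all data are illustrative), not the paper's full multi-layer model, which additionally separates extractor errors from source errors.

```python
from collections import defaultdict

def estimate_trust(claims, n_iters=20):
    """claims: list of (source, subject, value) triples, where different
    values for the same subject conflict.
    Returns (accuracy per source, confidence per (subject, value))."""
    sources = {s for s, _, _ in claims}
    accuracy = {s: 0.8 for s in sources}          # uniform prior

    values_by_subject = defaultdict(set)
    supporters = defaultdict(set)                 # (subject, value) -> sources
    claims_by_source = defaultdict(set)           # source -> {(subject, value)}
    for s, subj, val in claims:
        values_by_subject[subj].add(val)
        supporters[(subj, val)].add(s)
        claims_by_source[s].add((subj, val))

    confidence = {}
    for _ in range(n_iters):
        # E-step: each candidate value is scored by the accuracy of the
        # sources supporting it, normalized over competing values.
        for subj, vals in values_by_subject.items():
            scores = {v: sum(accuracy[s] for s in supporters[(subj, v)])
                      for v in vals}
            total = sum(scores.values())
            for v in vals:
                confidence[(subj, v)] = scores[v] / total
        # M-step: a source's accuracy is the mean confidence of its claims.
        for s in sources:
            cs = claims_by_source[s]
            accuracy[s] = sum(confidence[c] for c in cs) / len(cs)
    return accuracy, confidence
```

Running this on a toy corpus where two sources agree on one fact and a third contradicts them shows the intended behavior: the outvoted source ends up with lower estimated accuracy, and its claim receives lower confidence.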

Empirical Evaluation

The model's efficacy was first validated on synthetic data, where it reliably recovered the true trustworthiness levels of the sources. It was then applied at scale to a dataset of 2.8 billion triples extracted from the web, yielding trustworthiness estimates for 119 million webpages and 5.6 million websites. Manual evaluation of a sample of the results confirms the method's practical viability.

Implications and Future Directions

Practically, this research has significant implications for information retrieval and web search. By offering a content-based measure of source trustworthiness, KBT can augment popularity-based metrics such as PageRank, enriching the evaluation framework used for web content. The model can also serve as a stepping stone for future developments in upstream data integration and downstream tasks such as automated content curation.
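Since KBT is proposed as a complement to popularity signals rather than a replacement, one could imagine ranking pages by a weighted blend of the two. The sketch below is purely illustrative; the paper does not prescribe any particular combination, and the function name `rank_pages` and the example scores are assumptions.

```python
def rank_pages(pages, alpha=0.5):
    """pages: list of (url, kbt, popularity), both scores assumed in [0, 1].
    Returns the pages sorted by a hypothetical weighted blend, where alpha
    weights trustworthiness (KBT) against popularity (e.g. PageRank-like).
    """
    return sorted(pages,
                  key=lambda p: alpha * p[1] + (1 - alpha) * p[2],
                  reverse=True)

# A trust-weighted ranking (alpha high) can surface a low-popularity but
# factually reliable page above a popular but unreliable one.
pages = [("gossip.example", 0.1, 0.9), ("ref.example", 0.9, 0.3)]
```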

Theoretically, the multi-layer approach presents a significant advancement in the field of data integration, suggesting future research directions for more nuanced models that capture additional layers of noise and errors. Handling open-IE-style extractions, managing the increased noise they introduce, and performing copy detection at scale are promising avenues for further development.

In conclusion, this paper provides a robust framework for estimating the reliability of web sources by focusing on the correctness of the content they provide. Through a sophisticated probabilistic approach, it demonstrates the potential to redefine how we assess trustworthiness in the digital ecosystem.
