Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources (1502.03519v1)

Published 12 Feb 2015 in cs.DB and cs.IR

Abstract: The quality of web sources has been traditionally evaluated using exogenous signals such as the hyperlink structure of the graph. We propose a new approach that relies on endogenous signals, namely, the correctness of factual information provided by the source. A source that has few false facts is considered to be trustworthy. The facts are automatically extracted from each source by information extraction methods commonly used to construct knowledge bases. We propose a way to distinguish errors made in the extraction process from factual errors in the web source per se, by using joint inference in a novel multi-layer probabilistic model. We call the trustworthiness score we computed Knowledge-Based Trust (KBT). On synthetic data, we show that our method can reliably compute the true trustworthiness levels of the sources. We then apply it to a database of 2.8B facts extracted from the web, and thereby estimate the trustworthiness of 119M webpages. Manual evaluation of a subset of the results confirms the effectiveness of the method.

Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources

In this paper, the authors investigate the challenging task of assessing the trustworthiness of web sources by introducing a novel concept known as Knowledge-Based Trust (KBT). This framework aims to move beyond traditional exogenous metrics such as hyperlink structure and browsing history, which primarily evaluate the popularity rather than the reliability of information sources.

Methodological Framework

The central premise of KBT is to estimate reliability through endogenous signals, specifically the factual correctness of the content a source provides. The paper outlines a probabilistic multi-layer model that distinguishes factual errors within the web content itself from errors arising during the extraction process. The approach leverages the large number of facts extracted from webpages and employs joint inference to evaluate both the accuracy of the extracted facts and the trustworthiness of their sources.
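
To make the idea concrete, the following is a minimal, self-contained sketch of such joint inference, not the authors' actual model or code: extractor precision, source accuracy, and the truth of candidate values are estimated in an alternating, EM-style loop. The data, priors, and update rules are simplified assumptions for illustration (the conflicting "nationality" values echo the paper's running example).

```python
"""
Illustrative sketch (assumptions, not the paper's exact model) of joint
inference over two layers: whether an extraction is correct, and whether
the underlying source is accurate.
"""
from collections import defaultdict

# Each observation: extractor e claims that source s provides value v
# for data item d (a (subject, predicate) pair).
observations = [
    # (extractor, source, data_item, value)
    ("ext1", "siteA.com/p1", ("obama", "nationality"), "USA"),
    ("ext2", "siteA.com/p1", ("obama", "nationality"), "USA"),
    ("ext1", "siteB.com/p9", ("obama", "nationality"), "Kenya"),
    ("ext2", "siteB.com/p9", ("obama", "nationality"), "USA"),
]

source_accuracy = defaultdict(lambda: 0.8)      # prior trust in each source
extractor_precision = defaultdict(lambda: 0.8)  # prior trust in each extractor

for _ in range(20):  # fixed-point / EM-style iteration
    # Step 1: probability that source s really provides value v for item d,
    # approximated by the average precision of the extractors reporting it.
    provides = defaultdict(float)
    counts = defaultdict(int)
    for e, s, d, v in observations:
        provides[(s, d, v)] += extractor_precision[e]
        counts[(s, d, v)] += 1
    provides = {k: p / counts[k] for k, p in provides.items()}

    # Step 2: belief that value v is the true value of item d, via a vote
    # weighted by source accuracy and extraction confidence.
    vote = defaultdict(float)
    for (s, d, v), p in provides.items():
        vote[(d, v)] += p * source_accuracy[s]
    truth = {}
    for (d, v), w in vote.items():
        total = sum(w2 for (d2, _), w2 in vote.items() if d2 == d)
        truth[(d, v)] = w / total if total > 0 else 0.0

    # Step 3: re-estimate source accuracy (the KBT-style score) and
    # extractor precision from the current beliefs.
    acc_num, acc_den = defaultdict(float), defaultdict(float)
    for (s, d, v), p in provides.items():
        acc_num[s] += p * truth[(d, v)]
        acc_den[s] += p
    for s in acc_den:
        source_accuracy[s] = acc_num[s] / acc_den[s]

    prec_num, prec_den = defaultdict(float), defaultdict(float)
    for e, s, d, v in observations:
        # Proxy: an extraction looks correct if the inferred truth supports it.
        prec_num[e] += truth[(d, v)]
        prec_den[e] += 1.0
    for e in prec_den:
        extractor_precision[e] = prec_num[e] / prec_den[e]

print(dict(source_accuracy))    # per-source trust estimates
print(dict(extractor_precision))
```

The key point the sketch illustrates is the separation of the two error layers: a value that is frequently mis-extracted drags down the responsible extractor's precision rather than the source's accuracy, whereas a value the source genuinely provides but that contradicts the inferred truth lowers the source's score.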

The framework consists of several components:

  1. Fact Extraction: This involves extracting (subject, predicate, object) triples from web pages using methods developed under the Knowledge Vault (KV) project, where subjects and predicates are defined within Freebase.
  2. Inference Model: The model is engineered to differentiate between factual inaccuracies originating from the source itself and errors introduced during data extraction. This is accomplished via a probabilistic model that applies joint inference across multiple layers.
  3. Granularity Control: An adaptive mechanism is implemented to determine the granularity at which information is aggregated within sources. This ensures computational efficiency and robust statistical estimation by merging data from similar sources or splitting excessively large sources (a minimal sketch of this decision follows the list).
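
As a concrete picture of the granularity decision in item 3, the sketch below pools triples from sparsely-populated pages into their parent website, while data-rich pages (or pages on oversized sites) keep page-level granularity. The thresholds, function names, and URL-based grouping rule are illustrative assumptions rather than the paper's actual procedure.

```python
"""
Illustrative sketch of granularity control: pages with too little data are
pooled at the website level; pages with enough data stay page-level.
"""
from collections import defaultdict
from urllib.parse import urlparse

MIN_TRIPLES_PER_SOURCE = 5      # hypothetical threshold: below this, merge up
MAX_TRIPLES_PER_SOURCE = 10000  # hypothetical threshold: above this, split down

def site_of(url: str) -> str:
    """Coarse source identifier: the website hosting the page."""
    return urlparse(url).netloc

def choose_sources(triples_per_page: dict) -> dict:
    """Map each page URL to the source granularity used for trust estimation."""
    triples_per_site = defaultdict(int)
    for url, n in triples_per_page.items():
        triples_per_site[site_of(url)] += n

    assignment = {}
    for url, n in triples_per_page.items():
        site = site_of(url)
        if n >= MIN_TRIPLES_PER_SOURCE or triples_per_site[site] > MAX_TRIPLES_PER_SOURCE:
            assignment[url] = url    # enough data (or site too large): keep page-level
        else:
            assignment[url] = site   # too sparse: pool with the rest of the site
    return assignment

# Example usage with made-up triple counts.
pages = {"http://a.com/p1": 2, "http://a.com/p2": 3, "http://b.com/big": 50000}
print(choose_sources(pages))
# {'http://a.com/p1': 'a.com', 'http://a.com/p2': 'a.com', 'http://b.com/big': 'http://b.com/big'}
```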

Empirical Evaluation

The model's efficacy was first validated on synthetic data, where it recovers the true trustworthiness levels of the sources, and then applied to a dataset of 2.8 billion triples extracted from the web. Using these endogenous signals alone, the method estimates the trustworthiness of 119 million webpages and 5.6 million websites. Manual evaluation of a sample of the results confirms the effectiveness of the approach and its practical viability.

Implications and Future Directions

Practically, the implications of this research are significant within the domain of information retrieval and web search. By offering a reliable measure of source trustworthiness, KBT can augment existing metrics such as PageRank, enriching the evaluation framework used for web content. Furthermore, the model can be a stepping stone for future developments in upstream data integration and downstream tasks like automated content curation.

Theoretically, the multi-layer approach represents a notable advance in data integration, pointing toward more nuanced models that capture additional layers of noise and error. Extending the approach to open-IE style extractions, coping with the additional noise this introduces, and performing copy detection at scale are promising avenues for further development.

In conclusion, this paper provides a robust framework for estimating the reliability of web sources by focusing on the correctness of the content they provide. Through a sophisticated probabilistic approach, it demonstrates the potential to redefine how we assess trustworthiness in the digital ecosystem.

Authors (8)
  1. Xin Luna Dong (46 papers)
  2. Evgeniy Gabrilovich (14 papers)
  3. Kevin Murphy (87 papers)
  4. Van Dang (2 papers)
  5. Wilko Horn (2 papers)
  6. Camillo Lugaresi (3 papers)
  7. Shaohua Sun (3 papers)
  8. Wei Zhang (1489 papers)
Citations (217)