Truth Finding on the Deep Web: Is the Problem Solved? (1503.00303v1)

Published 1 Mar 2015 in cs.DB and cs.IR

Abstract: The amount of useful information available on the Web has been growing at a dramatic pace in recent years, and people rely more and more on the Web to fulfill their information needs. In this paper, we study the truthfulness of Deep Web data in two domains where we believed data are fairly clean and data quality is important to people's lives: *Stock* and *Flight*. To our surprise, we observed a large amount of inconsistency on data from different sources and also some sources with quite low accuracy. We further applied on these two data sets state-of-the-art *data fusion* methods that aim at resolving conflicts and finding the truth, analyzed their strengths and limitations, and suggested promising research directions. We hope our study can increase awareness of the seriousness of conflicting data on the Web and in turn inspire more research in our community to tackle this problem.

Analysis of Truth Finding on the Deep Web: Is the Problem Solved?

The paper "Truth Finding on the Deep Web: Is the Problem Solved?" by Li et al. examines the veracity of data retrieved from Deep Web sources, focusing on domains where data accuracy is crucial such as stock market information and flight times. The paper's motivation stems from the growing reliance on web-sourced data which, unlike traditional media, lacks consistent quality verification. Despite preconceived notions about the reliability of these domains, the authors identify significant inconsistencies across different data sources, revealing weaknesses in current data integration and truth discovery techniques.

Key Findings

The authors specifically analyze the accuracy and consistency of Deep Web data in the stock and flight domains. Notably, they discover that a substantial portion of data items exhibit discrepancies, with multiple conflicting values provided for the same item; in roughly 70% of cases, more than one value is reported for a data item. Semantic ambiguity, out-of-date information, and outright errors are identified as the major causes.
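
To make the statistic concrete, here is a minimal sketch of how such an inconsistency rate can be computed from per-source observations. The triples, source names, and items below are illustrative, not taken from the paper's datasets.

```python
from collections import defaultdict

def conflict_rate(observations):
    """observations: iterable of (source, item, value) triples."""
    values_per_item = defaultdict(set)
    for _, item, value in observations:
        values_per_item[item].add(value)
    # A data item is "conflicted" if sources report more than one distinct value.
    conflicted = sum(1 for vals in values_per_item.values() if len(vals) > 1)
    return conflicted / len(values_per_item)

# Toy example: two sources disagree on one of two items.
obs = [
    ("src_a", ("GOOG", "open_price"), "528.0"),
    ("src_b", ("GOOG", "open_price"), "527.9"),
    ("src_a", ("GOOG", "volume"), "1.2M"),
    ("src_b", ("GOOG", "volume"), "1.2M"),
]
print(conflict_rate(obs))  # 0.5: one of the two items has conflicting values
```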

Another notable finding is the extent of data copying among sources. Such copying obscures the original provenance of information and exacerbates the challenge of truth finding, especially when low-quality data propagates across multiple platforms. The paper also observes that while some well-known sources demonstrate high accuracy, no single source can be deemed entirely authoritative, owing to incomplete coverage.
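
The sketch below captures the core intuition behind copy detection in this line of work: agreement on values that almost nobody else reports (and that are therefore often false) is much stronger evidence of copying than agreement on popular values. This is a simplified heuristic with illustrative names, not the full probabilistic copy-detection model the paper applies.

```python
from collections import Counter, defaultdict
from itertools import combinations

def copy_suspicion(observations, rarity_threshold=2):
    """Score source pairs by the number of rare (item, value) claims they share.

    observations: iterable of (source, item, value) triples.
    """
    claim_count = Counter((item, value) for _, item, value in observations)
    claims_by_source = defaultdict(set)
    for source, item, value in observations:
        claims_by_source[source].add((item, value))
    scores = {}
    for s1, s2 in combinations(sorted(claims_by_source), 2):
        shared = claims_by_source[s1] & claims_by_source[s2]
        # Agreement on widely reported values is weak evidence; agreement on
        # values few other sources report is far more suspicious.
        rare_shared = [c for c in shared if claim_count[c] <= rarity_threshold]
        scores[(s1, s2)] = len(rare_shared)
    return scores
```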

Evaluation of Data Fusion Techniques

The research assesses several state-of-the-art data fusion methods designed to resolve conflicting information and identify the most credible data. These include voting-based strategies and more sophisticated approaches that take into account source trustworthiness, data sharing between sources, and other heuristic considerations. The authors measure the effectiveness of these techniques based on their ability to improve over naive strategies such as trusting the majority of sources or relying on a single source perceived as reliable.
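
A deliberately simplified sketch of the two families of methods follows: with a uniform prior, the first round reduces to plain majority voting, while later rounds weight each vote by an estimate of its source's accuracy, in the spirit of accuracy-aware methods such as AccuVote. This is not any specific algorithm from the paper; the names and the prior are illustrative assumptions.

```python
from collections import defaultdict

def fuse(observations, iterations=5):
    """observations: (source, item, value) triples -> {item: fused value}."""
    sources = {s for s, _, _ in observations}
    accuracy = {s: 0.8 for s in sources}  # uniform prior: round 1 is majority voting
    truth = {}
    for _ in range(iterations):
        # Step 1: per item, pick the value with the highest accuracy-weighted support.
        support = defaultdict(lambda: defaultdict(float))
        for s, item, value in observations:
            support[item][value] += accuracy[s]
        truth = {item: max(votes, key=votes.get) for item, votes in support.items()}
        # Step 2: re-estimate each source's accuracy against the current truth.
        hits, total = defaultdict(int), defaultdict(int)
        for s, item, value in observations:
            total[s] += 1
            hits[s] += value == truth[item]
        accuracy = {s: hits[s] / total[s] for s in sources}
    return truth
```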

Although the best fusion methods achieve, on average, 96% accuracy in identifying correct data values, the paper observes that no single technique consistently outperforms the others across datasets or domains. This instability highlights the need for further refinement of these algorithms, particularly in addressing data dependencies and semantic contradictions.
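
The evaluation protocol implied here reduces to comparing fused values against a manually verified gold standard and reporting the fraction resolved correctly; a minimal sketch, with illustrative names, is:

```python
def fusion_accuracy(fused, gold):
    """fused, gold: {item: value} maps; score only items that have gold labels."""
    scored = [item for item in gold if item in fused]
    correct = sum(fused[item] == gold[item] for item in scored)
    return correct / len(scored)
```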

Implications and Future Directions

From a practical standpoint, the findings underscore both the opportunities and challenges in utilizing web-based data for critical applications. The variability in data quality and the observed data integration issues suggest that end-users and systems should approach web-derived data with caution and employ robust validation mechanisms.

Theoretically, the paper opens several avenues for future exploration. The authors advocate improved methods for gauging source trustworthiness that go beyond simple voting schemes. Moreover, integrating schema mapping, record linkage, and fusion strategies promises a more holistic solution to data integration challenges. Another promising direction is developing algorithms for more accurate copy detection and source selection, thereby enhancing the quality of integrated data without introducing unnecessary noise.

In conclusion, while significant progress has been made in the field of data fusion and truth discovery, this paper highlights fundamental problems that remain unsolved. The emphasis on rigorous methodology and comprehensive evaluation provides a robust framework for addressing these complex issues, offering valuable insights for researchers aiming to advance the state-of-the-art in this crucial domain.

Authors (5)
  1. Xian Li
  2. Xin Luna Dong
  3. Kenneth Lyons
  4. Weiyi Meng
  5. Divesh Srivastava
Citations (298)