Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
169 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
45 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Online Topic-Aware Entity Resolution Over Incomplete Data Streams (Technical Report) (2103.08720v1)

Published 15 Mar 2021 in cs.DB

Abstract: In many real applications such as the data integration, social network analysis, and the Semantic Web, the entity resolution (ER) is an important and fundamental problem, which identifies and links the same real-world entities from various data sources. While prior works usually consider ER over static and complete data, in practice, application data are usually collected in a streaming fashion, and often incur missing attributes (due to the inaccuracy of data extraction techniques). Therefore, in this paper, we will formulate and tackle a novel problem, topic-aware entity resolution over incomplete data streams (TER-iDS), which online imputes incomplete tuples and detects pairs of topic-related matching entities from incomplete data streams. In order to effectively and efficiently tackle the TER-iDS problem, we propose an effective imputation strategy, carefully design effective pruning strategies, as well as indexes/synopsis, and develop an efficient TER-iDS algorithm via index joins. Extensive experiments have been conducted to evaluate the effectiveness and efficiency of our proposed TER-iDS approach over real data sets.

Citations (12)

Summary

We haven't generated a summary for this paper yet.