
ZeroER: Entity Resolution using Zero Labeled Examples (1908.06049v2)

Published 16 Aug 2019 in cs.DB and cs.LG

Abstract: Entity resolution (ER) refers to the problem of matching records in one or more relations that refer to the same real-world entity. While supervised ML approaches achieve state-of-the-art results, they require a large number of labeled examples that are expensive and often infeasible to obtain. We investigate an important problem that vexes practitioners: is it possible to design an effective algorithm for ER that requires zero labeled examples, yet achieves performance comparable to supervised approaches? In this paper, we answer in the affirmative through our proposed approach, dubbed ZeroER. Our approach is based on a simple observation: the similarity vectors for matches should look different from those for unmatches. Operationalizing this insight requires a number of technical innovations. First, we propose a simple yet powerful generative model based on Gaussian Mixture Models for learning the match and unmatch distributions. Second, we propose an adaptive regularization technique customized for ER that ameliorates the issue of feature overfitting. Finally, we incorporate the transitivity property into the generative model in a novel way, resulting in improved accuracy. On five benchmark ER datasets, we show that ZeroER greatly outperforms existing unsupervised approaches and achieves performance comparable to supervised approaches.
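
The core observation in the abstract, that match and unmatch pairs have differently distributed similarities, can be illustrated with a small sketch. This is not the authors' implementation: it assumes each candidate pair is summarized by a single similarity score in [0, 1] (the paper works with similarity vectors), fits a plain two-component Gaussian mixture with EM, and omits the adaptive regularization and transitivity steps entirely. The function names `gaussian_pdf` and `fit_match_mixture`, the initialization, and the `min_var` variance floor are all hypothetical choices made for this example.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_match_mixture(scores, iterations=200, min_var=1e-4):
    """Fit a two-component 1-D Gaussian mixture to similarity scores with EM.

    Component 0 is treated as "unmatch" and component 1 as "match".
    Returns the posterior match probability of every score.
    Assumption: min_var is only a numerical guard against collapsing
    variances; it is NOT the adaptive regularization from the paper.
    """
    n = len(scores)
    mean_all = sum(scores) / n
    var_all = max(sum((s - mean_all) ** 2 for s in scores) / n, min_var)

    # Assumed initialization: unmatches near the lowest score, matches near the highest.
    means = [min(scores), max(scores)]
    variances = [var_all, var_all]
    weights = [0.5, 0.5]
    post = [0.5] * n

    for _ in range(iterations):
        # E-step: posterior probability that each pair is a match.
        post = []
        for s in scores:
            p0 = weights[0] * gaussian_pdf(s, means[0], variances[0])
            p1 = weights[1] * gaussian_pdf(s, means[1], variances[1])
            total = p0 + p1
            if total > 0.0:
                post.append(p1 / total)
            else:
                # Both densities underflowed: fall back to the nearer mean.
                post.append(1.0 if abs(s - means[1]) < abs(s - means[0]) else 0.0)

        # M-step: re-estimate weights, means, and variances from the posteriors.
        n1 = sum(post)
        n0 = n - n1
        if n0 < 1e-9 or n1 < 1e-9:
            break  # one component is empty; keep the last estimates
        means = [
            sum((1.0 - p) * s for p, s in zip(post, scores)) / n0,
            sum(p * s for p, s in zip(post, scores)) / n1,
        ]
        variances = [
            max(sum((1.0 - p) * (s - means[0]) ** 2 for p, s in zip(post, scores)) / n0, min_var),
            max(sum(p * (s - means[1]) ** 2 for p, s in zip(post, scores)) / n1, min_var),
        ]
        weights = [n0 / n, n1 / n]

    return post


if __name__ == "__main__":
    # Made-up similarity scores for nine candidate record pairs.
    example_scores = [0.05, 0.10, 0.12, 0.08, 0.15, 0.20, 0.90, 0.95, 0.88]
    for score, p in zip(example_scores, fit_match_mixture(example_scores)):
        print(f"similarity={score:.2f}  P(match)={p:.3f}")
```

No labels are used anywhere in the fit: the two clusters emerge from the scores alone, and a pair would be declared a match when its posterior exceeds 0.5. A scalar score keeps the sketch short; extending it to similarity vectors would require multivariate Gaussians, which is where overfitting concerns of the kind the abstract mentions become relevant.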

Authors (5)
  1. Renzhi Wu (11 papers)
  2. Sanya Chaba (3 papers)
  3. Saurabh Sawlani (14 papers)
  4. Xu Chu (66 papers)
  5. Saravanan Thirumuruganathan (25 papers)
Citations (82)
