
Active Deep Learning on Entity Resolution by Risk Sampling (2012.12960v1)

Published 23 Dec 2020 in cs.LG

Abstract: While state-of-the-art performance on entity resolution (ER) has been achieved by deep learning, its effectiveness depends on large quantities of accurately labeled training data. To alleviate the data labeling burden, Active Learning (AL) presents itself as a feasible solution that focuses on data deemed useful for model training. Building upon recent advances in risk analysis for ER, which can provide a more refined estimate of label misprediction risk than the simpler classifier outputs, we propose a novel AL approach of risk sampling for ER. Risk sampling leverages misprediction risk estimation for active instance selection. Based on the core-set characterization for AL, we theoretically derive an optimization model which aims to minimize core-set loss with non-uniform Lipschitz continuity. Since the defined weighted K-medoids problem is NP-hard, we then present an efficient heuristic algorithm. Finally, we empirically verify the efficacy of the proposed approach on real data through a comparative study. Our extensive experiments show that it outperforms the existing alternatives by considerable margins. Using ER as a test case, we demonstrate that risk sampling is a promising approach potentially applicable to other challenging classification tasks.
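
To make the weighted K-medoids selection mentioned in the abstract more concrete, here is a minimal sketch of one possible greedy heuristic. It is not the paper's algorithm. It assumes each unlabeled instance already has a feature vector and a misprediction-risk score used as its weight, and it picks k instances to reduce the risk-weighted distance from every instance to its nearest selected one. The function name `greedy_weighted_k_medoids` and the toy data are hypothetical.

```python
import math


def greedy_weighted_k_medoids(points, weights, k):
    """Greedily pick k indices that reduce
    sum_i weights[i] * min_{j in chosen} dist(points[i], points[j]).

    Illustrative heuristic only; `weights` stands in for risk scores.
    """
    n = len(points)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and len(points)")

    # Pairwise Euclidean distances (quadratic in n; fine for a small sketch).
    dist = [[math.dist(p, q) for q in points] for p in points]

    nearest = [math.inf] * n  # distance from each point to its closest medoid
    chosen = []
    for _ in range(k):
        best_j, best_cost = None, math.inf
        for j in range(n):
            if j in chosen:
                continue
            # Weighted cost if candidate j were added to the chosen set.
            cost = sum(
                w * min(d, dist[i][j])
                for i, (w, d) in enumerate(zip(weights, nearest))
            )
            if cost < best_cost:
                best_j, best_cost = j, cost
        chosen.append(best_j)
        nearest = [min(d, dist[i][best_j]) for i, d in enumerate(nearest)]
    return chosen


# Toy pool: two clusters; the last instance carries the highest risk weight.
pool = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
risk = [1.0, 2.0, 1.0, 10.0]
selected = greedy_weighted_k_medoids(pool, risk, k=2)
print(selected)
```

In this toy pool the high-risk instance is selected first and the second pick covers the other cluster, which is the intuition behind weighting the core-set objective by risk rather than treating all instances uniformly.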

Authors (7)
  1. Youcef Nafa (5 papers)
  2. Qun Chen (28 papers)
  3. Zhaoqiang Chen (7 papers)
  4. Xingyu Lu (29 papers)
  5. Haiyang He (9 papers)
  6. Tianyi Duan (4 papers)
  7. Zhanhuai Li (9 papers)
Citations (15)
