
View-Driven Deduplication with Active Learning (1606.05708v1)

Published 17 Jun 2016 in cs.DB

Abstract: Visual analytics systems such as Tableau are increasingly popular for interactive data exploration. These tools, however, do not currently assist users with detecting or resolving potential data quality problems including the well-known deduplication problem. Recent approaches for deduplication focus on cleaning entire datasets and commonly require hundreds to thousands of user labels. In this paper, we address the problem of deduplication in the context of visual data analytics. We present a new approach for record deduplication that strives to produce the cleanest view possible with a limited budget for data labeling. The key idea behind our approach is to consider the impact that individual tuples have on a visualization and to monitor how the view changes during cleaning. With experiments on nine different visualizations for two real-world datasets, we show that our approach produces significantly cleaner views for small labeling budgets than state-of-the-art alternatives and that it also stops the cleaning process after requesting fewer labels.
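The key idea in the abstract, ranking candidate duplicates by their impact on the view and watching how the view changes during cleaning, can be illustrated with a minimal sketch. This is not the paper's algorithm: the bar-chart view (SUM of a value per category), the L1 view distance, the greedy ordering, the `patience` stopping rule, and all function names are assumptions made for illustration, and the `oracle` callable stands in for a human labeler.

```python
from collections import defaultdict

def bar_view(records):
    """Aggregate SUM(value) per category, i.e. the data behind a bar chart."""
    totals = defaultdict(float)
    for rec in records.values():
        totals[rec["category"]] += rec["value"]
    return dict(totals)

def view_distance(a, b):
    """L1 distance between two bar-chart views (assumed metric)."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)

def impact(records, pair):
    """How far the view would move if `pair` were a true duplicate and merged."""
    _, drop = pair
    merged = {rid: rec for rid, rec in records.items() if rid != drop}
    return view_distance(bar_view(records), bar_view(merged))

def clean_view(records, candidate_pairs, oracle, budget, patience=2, eps=1e-9):
    """Greedily spend a label budget on the pairs that matter most to the view.

    Stops early once `patience` consecutive labels leave the view unchanged.
    Returns the cleaned view and the number of labels requested.
    """
    records = dict(records)          # work on a copy
    pending = list(candidate_pairs)
    labels_used = 0
    unchanged = 0
    while pending and labels_used < budget and unchanged < patience:
        # Discard pairs whose records were already merged away.
        pending = [p for p in pending if p[0] in records and p[1] in records]
        if not pending:
            break
        best = max(pending, key=lambda p: impact(records, p))
        pending.remove(best)
        before = bar_view(records)
        labels_used += 1
        if oracle(best):             # labeler confirms the pair is a duplicate
            del records[best[1]]
        change = view_distance(before, bar_view(records))
        unchanged = unchanged + 1 if change <= eps else 0
    return bar_view(records), labels_used
```

With a small budget, the sketch spends its labels on the pairs that would shift the bars the most and ignores low-impact pairs, which is the intuition behind cleaning for a specific view rather than for the entire dataset.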

Authors (4)
  1. Kristi Morton (1 paper)
  2. Hannaneh Hajishirzi (176 papers)
  3. Magdalena Balazinska (25 papers)
  4. Dan Grossman (7 papers)
Citations (1)
