
Active label cleaning for improved dataset quality under resource constraints (2109.00574v2)

Published 1 Sep 2021 in cs.CV and cs.LG

Abstract: Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have an often-overlooked confounding effect on the assessment of model performance. Nevertheless, employing experts to remove label noise by fully re-annotating large datasets is infeasible in resource-constrained settings, such as healthcare. This work advocates for a data-driven approach to prioritising samples for re-annotation - which we term "active label cleaning". We propose to rank instances according to estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy. Our experiments on natural images and on a new medical imaging benchmark show that cleaning noisy labels mitigates their negative impact on model training, evaluation, and selection. Crucially, the proposed active label cleaning enables correcting labels up to 4 times more effectively than typical random selection in realistic conditions, making better use of experts' valuable time for improving dataset quality.
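The abstract's core idea, ranking samples by estimated label correctness and labelling difficulty, can be sketched with a simple scoring rule: samples whose noisy label strongly disagrees with a trained model's posterior are prioritised, while highly ambiguous samples (high predictive entropy) are deprioritised. The scoring below (cross-entropy of the noisy label minus predictive entropy) is an illustrative choice consistent with this description, not necessarily the paper's exact formula; the function name and toy data are assumptions.

```python
import numpy as np

def cleaning_priority(probs, noisy_labels):
    """Score samples for re-annotation: high score = model strongly
    disagrees with the current label AND the sample is not ambiguous.

    probs        : (n_samples, n_classes) model posteriors
    noisy_labels : (n_samples,) current (possibly wrong) integer labels

    Illustrative scoring: surprise of the noisy label under the model,
    penalised by predictive entropy (a proxy for labelling difficulty).
    """
    n = len(noisy_labels)
    p_label = probs[np.arange(n), noisy_labels]
    label_surprise = -np.log(p_label + 1e-12)                  # disagreement with label
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)   # labelling difficulty
    return label_surprise - entropy

# Toy example: the first sample's label contradicts a confident model,
# so it is ranked first for expert re-annotation.
probs = np.array([[0.9, 0.1],
                  [0.5, 0.5],
                  [0.1, 0.9]])
labels = np.array([1, 0, 1])
order = np.argsort(-cleaning_priority(probs, labels))
```

Under this scoring, a budget-limited annotation round simply relabels the top-k samples in `order`, which is how the paper's simulation framework compares selection strategies against random relabelling.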

Authors (12)
  1. Melanie Bernhardt
  2. Daniel C. Castro
  3. Ryutaro Tanno
  4. Anton Schwaighofer
  5. Kerem C. Tezcan
  6. Miguel Monteiro
  7. Shruthi Bannur
  8. Matthew Lungren
  9. Aditya Nori
  10. Ben Glocker
  11. Javier Alvarez-Valle
  12. Ozan Oktay
Citations (65)
