Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
184 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
45 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Generating the Ground Truth: Synthetic Data for Soft Label and Label Noise Research (2309.04318v2)

Published 8 Sep 2023 in cs.LG

Abstract: In many real-world classification tasks, label noise is an unavoidable issue that adversely affects the generalization error of machine learning models. Additionally, evaluating how methods handle such noise is complicated, as the effect label noise has on their performance cannot be accurately quantified without clean labels. Existing research on label noise typically relies on either noisy or oversimplified simulated data as a baseline, into which additional noise with known properties is injected. In this paper, we introduce SYNLABEL, a framework designed to address these limitations by creating noiseless datasets informed by real-world data. SYNLABEL supports defining a pre-specified or learned function as the ground truth function, which can then be used for generating new clean labels. Furthermore, by repeatedly resampling values for selected features within the domain of the function, evaluating the function and aggregating the resulting labels, each data point can be assigned a soft label or label distribution. These distributions capture the inherent uncertainty present in many real-world datasets and enable the direct injection and quantification of label noise. The generated datasets serve as a clean baseline of adjustable complexity, into which various types of noise can be introduced. Additionally, they facilitate research into soft label learning and related applications. We demonstrate the application of SYNLABEL, showcasing its ability to precisely quantify label noise and its improvement over existing methodologies.

Citations (1)

Summary

We haven't generated a summary for this paper yet.