Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
184 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
45 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Exploring Dataset-Scale Indicators of Data Quality (2311.04016v1)

Published 7 Nov 2023 in cs.CV and cs.LG

Abstract: Modern computer vision foundation models are trained on massive amounts of data, incurring large economic and environmental costs. Recent research has suggested that improving data quality can significantly reduce the need for data quantity. But what constitutes data quality in computer vision? We posit that the quality of a given dataset can be decomposed into distinct sample-level and dataset-level constituents, and that the former have been more extensively studied than the latter. We ablate the effects of two important dataset-level constituents: label set design, and class balance. By monitoring these constituents using key indicators we provide, researchers and practitioners can better anticipate model performance, measured in terms of its accuracy and robustness to distribution shifts.

Citations (1)

Summary

We haven't generated a summary for this paper yet.