- The paper introduces REDUCR, a novel method that uses class-priority reweighting to optimize data selection for efficient training.
- It maintains per-class priority weights, updated by an online learning algorithm to emphasize classes on which the model currently underperforms.
- Empirical evaluations on datasets like Clothing1M show up to a 15% improvement in worst-class accuracy over state-of-the-art methods.
Modern machine learning applications often involve processing huge volumes of data, which incurs significant computational cost. Real-world image and text classification pose particular challenges due to class imbalance and distributional shift, both of which can degrade the effectiveness and efficiency of model training. The paper addresses these challenges with a robust data downsampling algorithm called REDUCR (Robust Data Downsampling using Class Priority Reweighting).
REDUCR selects the most informative and relevant datapoints during training. It does so by assigning priority weights to datapoints in a class-aware manner, using an online learning algorithm that dynamically updates per-class weights to emphasize classes on which the model is currently underperforming. Training therefore becomes more efficient by using less data, while the model's performance on the worst-performing classes, which typically suffer most from class imbalance or label noise, is retained or even improved.
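The class-aware selection loop described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function names, the multiplicative-weights (Hedge-style) update rule, and the loss-times-priority selection score are all assumptions chosen to mirror the idea of boosting the priority of underperforming classes.

```python
# Hypothetical sketch of class-priority reweighting for data selection.
# The update rule and scoring are assumptions, not REDUCR's exact method.
import numpy as np

def update_class_weights(weights, class_losses, eta=0.1):
    """Multiplicative-weights update: classes with higher loss gain priority."""
    new_w = weights * np.exp(eta * class_losses)
    return new_w / new_w.sum()  # renormalize to a distribution over classes

def select_batch(losses, labels, class_weights, budget):
    """Score each candidate point by its loss weighted by its class priority,
    then keep the `budget` highest-scoring points for training."""
    scores = losses * class_weights[labels]
    return np.argsort(scores)[-budget:]

# Toy usage: 3 classes, class 2 currently has the highest loss.
w = np.ones(3) / 3
w = update_class_weights(w, np.array([0.2, 0.3, 1.5]))
losses = np.array([0.5, 0.4, 0.9, 0.1, 0.8, 0.2])
labels = np.array([0, 1, 2, 0, 2, 1])
chosen = select_batch(losses, labels, w, budget=2)  # favors class-2 points
```

Because class 2 has the largest loss, its weight grows after the update, and the two selected points are the high-loss class-2 candidates.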
REDUCR's performance is evaluated empirically on several vision and text classification datasets. Results indicate that REDUCR significantly outperforms state-of-the-art methods in maintaining high accuracy, particularly on the worst-performing classes, which are often the most problematic under class imbalance. On the notably challenging Clothing1M dataset, which was obtained through web scraping and contains imbalanced and noisy class labels, REDUCR improves worst-class test accuracy by up to 15% over competing methods.
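The headline metric in this evaluation, worst-class test accuracy, is simply the minimum of the per-class accuracies. A minimal sketch (the function name is my own, not from the paper):

```python
# Worst-class accuracy: per-class accuracy, then take the minimum.
import numpy as np

def worst_class_accuracy(y_true, y_pred):
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return min(per_class)

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])
# per-class accuracies: class 0 = 2/3, class 1 = 2/2, class 2 = 2/3
# worst-class accuracy = 2/3
```

Unlike average accuracy, this metric cannot be inflated by doing well on the majority classes, which is why it is the right target under class imbalance.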
The core contributions of this work are threefold. First, it formalizes the problem of robust data selection: choosing training subsets that preserve worst-class performance. Second, it proposes REDUCR with a robust selection rule that changes how datapoints are evaluated for selection, focusing on their impact on per-class performance. Third, through comprehensive evaluations against benchmarks, REDUCR is shown to achieve strong worst-class accuracy while often outperforming competitors in average accuracy as well.
The approach introduced by REDUCR could have significant implications for machine learning, especially where model performance matters across all classes, not just on average. While the method has demonstrated strong potential, future research could further reduce its computational cost or integrate its selection strategy with other model architectures and learning paradigms, broadening its applicability and improving its efficiency.