- The paper introduces REDUCR, a novel method that uses class-priority reweighting to optimize data selection for efficient training.
- It maintains per-class priority weights, updated by an online learning algorithm to emphasize classes on which the model currently underperforms.
- Empirical evaluations on datasets like Clothing1M show up to a 15% improvement in worst-class accuracy over state-of-the-art methods.
Modern machine learning applications often involve processing huge volumes of data, which incurs significant computational cost. Real-world image and text classification pose particular challenges due to class imbalance and distributional shift, both of which can degrade the effectiveness and efficiency of model training. The paper addresses these challenges with a robust data downsampling algorithm called REDUCR (Robust Data Downsampling using Class Priority Reweighting).
REDUCR selects the most informative and relevant datapoints during training. It does so by assigning priority weights to datapoints in a class-aware manner, using an online learning algorithm that dynamically updates per-class weights to emphasize classes on which the model is currently underperforming. Training therefore becomes more efficient by using less data, while the model's performance on the worst-performing classes, which typically suffer most from class imbalance or label noise, is retained or even improved.
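The class-aware selection loop described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function names, the multiplicative-weights (Hedge-style) update rule, and the loss-times-priority selection score are all assumptions chosen to mirror the idea of boosting the priority of underperforming classes.

```python
# Hypothetical sketch of class-priority reweighting for data selection.
# The update rule and scoring are assumptions, not REDUCR's exact method.
import numpy as np

def update_class_weights(weights, class_losses, eta=0.1):
    """Multiplicative-weights update: classes with higher loss gain priority."""
    new_w = weights * np.exp(eta * class_losses)
    return new_w / new_w.sum()  # renormalize to a distribution over classes

def select_batch(losses, labels, class_weights, budget):
    """Score each candidate point by its loss weighted by its class priority,
    then keep the `budget` highest-scoring points for training."""
    scores = losses * class_weights[labels]
    return np.argsort(scores)[-budget:]

# Toy usage: 3 classes, class 2 currently has the highest loss.
w = np.ones(3) / 3
w = update_class_weights(w, np.array([0.2, 0.3, 1.5]))
losses = np.array([0.5, 0.4, 0.9, 0.1, 0.8, 0.2])
labels = np.array([0, 1, 2, 0, 2, 1])
chosen = select_batch(losses, labels, w, budget=2)  # favors class-2 points
```

Because class 2 has the largest loss, its weight grows after the update, and the two selected points are the high-loss class-2 candidates.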
REDUCR's performance is evaluated empirically on several vision and text classification datasets. Results indicate that REDUCR significantly outperforms state-of-the-art methods in maintaining high accuracy, particularly on the worst-performing classes, which are often the most problematic under class imbalance. On the notably challenging Clothing1M dataset, which was obtained through web scraping and contains imbalanced and noisy class labels, REDUCR improves worst-class test accuracy by up to 15% over competing methods.
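The headline metric in this evaluation, worst-class test accuracy, is simply the minimum of the per-class accuracies. A minimal sketch (the function name is my own, not from the paper):

```python
# Worst-class accuracy: per-class accuracy, then take the minimum.
import numpy as np

def worst_class_accuracy(y_true, y_pred):
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return min(per_class)

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])
# per-class accuracies: class 0 = 2/3, class 1 = 2/2, class 2 = 2/3
# worst-class accuracy = 2/3
```

Unlike average accuracy, this metric cannot be inflated by doing well on the majority classes, which is why it is the right target under class imbalance.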
The core contributions of this work are threefold. First, it formalizes the problem of robust data selection: choosing training subsets that preserve worst-class performance. Second, it proposes REDUCR with a robust selection rule that changes how datapoints are evaluated for selection, focusing on their impact on per-class performance. Third, through comprehensive evaluations against benchmarks, REDUCR is shown to achieve strong worst-class accuracy while often outperforming competitors in average accuracy as well.
The approach introduced by REDUCR could have significant implications for machine learning, especially where model performance matters across all classes, not just on average. While the method has demonstrated strong potential, future research could further reduce its computational cost or integrate its selection strategy with other model architectures and learning paradigms, broadening its applicability and improving its efficiency.