Simple data balancing achieves competitive worst-group-accuracy (2110.14503v2)

Published 27 Oct 2021 in cs.LG, cs.AI, and cs.CR

Abstract: We study the problem of learning classifiers that perform well across (known or unknown) groups of data. After observing that common worst-group-accuracy datasets suffer from substantial imbalances, we set out to compare state-of-the-art methods to simple balancing of classes and groups by either subsampling or reweighting data. Our results show that these data balancing baselines achieve state-of-the-art accuracy, while being faster to train and requiring no additional hyper-parameters. In addition, we highlight that access to group information is most critical for model selection purposes, and not so much during training. All in all, our findings beg closer examination of benchmarks and methods for research in worst-group-accuracy optimization.

Citations (149)

Summary

  • The paper demonstrates that simple data balancing via subsampling and reweighting achieves competitive worst-group accuracy.
  • It compares these basic methods against complex approaches like gDRO and JTT, revealing significant reductions in training time and system complexity.
  • The study implies that efficient data balancing can enhance fairness in classifiers by mitigating spurious correlations across diverse data groups.

Simple Data Balancing for Competitive Worst-Group-Accuracy

The paper "Simple data balancing achieves competitive worst-group-accuracy" presents an incisive examination of the challenges encountered in training classifiers to achieve optimal performance across diverse data groups. The core issue pertains to the imbalances present in worst-group-accuracy datasets and the potential solutions that can be applied to rectify such biases. The paper meticulously compares state-of-the-art techniques against simpler methods of balancing classes and groups via subsampling or reweighting, showcasing that these basic approaches can rival in effectiveness with less computational overhead and no need for hyper-parameter tuning.

Problem Definition and Importance

Worst-group accuracy evaluates a classifier by its performance on the group where it does worst, rather than by its average performance. This metric matters because it exposes the susceptibility of machine learning models to spurious correlations, patterns that hold only within specific subgroups and do not generalize. The authors emphasize its importance for building fair and robust classifiers, especially in societal applications where different user groups may be represented unequally in the data.
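
To make the metric concrete, here is a minimal sketch of how worst-group accuracy can be computed; the array names and grouping convention are illustrative rather than taken from the paper's code:

```python
import numpy as np

def worst_group_accuracy(preds, labels, groups):
    """Accuracy of the worst-performing group.

    preds, labels, groups: 1-D integer arrays of equal length, where
    `groups` assigns each example to one of the annotated groups
    (typically a class x spurious-attribute combination).
    """
    accs = []
    for g in np.unique(groups):
        mask = groups == g
        accs.append((preds[mask] == labels[mask]).mean())
    return min(accs)
```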

Methodological Approach

The authors explore balancing the data through simple subsampling or reweighting strategies, contrasting these with sophisticated methods such as Group Distributionally Robust Optimization (gDRO) and Just Train Twice (JTT). The experiments reveal that calibrating the data balance is enough to achieve state-of-the-art worst-group accuracy with minimal added system complexity.
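
Both balancing strategies are simple to implement. The following sketch, assuming NumPy arrays of group labels, shows one plausible version of each; the function names and interface are illustrative, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def subsample_groups(groups):
    """Indices of a group-balanced subset: every group is randomly
    subsampled to the size of the smallest group."""
    ids = np.arange(len(groups))
    n_min = min((groups == g).sum() for g in np.unique(groups))
    keep = [rng.choice(ids[groups == g], n_min, replace=False)
            for g in np.unique(groups)]
    return np.concatenate(keep)

def group_weights(groups):
    """Per-example sampling weights inversely proportional to group
    frequency, so each group is drawn equally often on average."""
    _, inverse, counts = np.unique(groups, return_inverse=True,
                                   return_counts=True)
    return 1.0 / counts[inverse]
```

The weights from `group_weights` can be fed to a weighted sampler (for example, PyTorch's `torch.utils.data.WeightedRandomSampler`) so that minority groups are oversampled during training; class balancing is the same construction with class labels in place of group labels.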

Key experiments on datasets such as CelebA, Waterbirds, MultiNLI, and CivilComments highlight the effect of severe class and group imbalances. Results indicate that simple subsampling and reweighting match benchmark performance while significantly reducing training time. Notably, subsampling remains robust over extended training, whereas reweighting requires careful early stopping to prevent performance degradation.
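
The observation that group information matters most for model selection can also be made concrete: checkpoints are compared by validation worst-group accuracy rather than by average accuracy. A hedged sketch, reusing `worst_group_accuracy` from above and assuming a PyTorch-style model plus caller-supplied training and evaluation callables (none of these helpers come from the paper):

```python
import copy

def select_by_worst_group(model, train_one_epoch, evaluate, num_epochs):
    """Keep the checkpoint with the best validation worst-group accuracy.

    train_one_epoch(model): runs one epoch of training (assumed helper).
    evaluate(model): returns (preds, labels, groups) arrays on the
    validation set (assumed helper).
    """
    best_acc, best_state = -1.0, None
    for _ in range(num_epochs):
        train_one_epoch(model)
        preds, labels, groups = evaluate(model)
        acc = worst_group_accuracy(preds, labels, groups)  # sketch above
        if acc > best_acc:
            best_acc = acc
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return best_acc
```

This is the early-stopping discipline the reweighting results call for: without group-annotated validation data, the best checkpoint by average accuracy can be a poor one by worst-group accuracy.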

Implications and Future Research

This research suggests that current benchmarks may oversimplify the challenge of eliminating spurious correlations and may not comprehensively capture complex real-world scenarios. The findings also show that smaller, balanced subsets of the data can yield strong discriminative performance, challenging the assumption that more data invariably leads to better results.

Future work should explore more nuanced datasets that capture real-world complexity, where spurious correlations are subtler and less dominant. Developing hyper-parameter tuning methodologies that do not rely on group-level information would also broaden the applicability of these techniques to weakly supervised settings.

Overall, the authors advocate for a critical reevaluation of prevailing benchmarks and methods, encouraging a focus on more pragmatic and efficient practices that extend beyond simply aggregating more data. This perspective is crucial for enhancing the robustness and fairness of machine learning classifiers in diverse applications.
