DCAST: Diverse Class-Aware Self-Training Mitigates Selection Bias for Fairer Learning (2409.20126v2)
Abstract: Fairness in machine learning seeks to mitigate model bias against individuals based on sensitive features such as sex or age, often caused by an uneven representation of the population in the training data due to selection bias. Notably, bias not ascribed to sensitive features is challenging to identify and typically goes undiagnosed, despite its prominence in complex high-dimensional data from fields like computer vision and molecular biomedicine. Strategies to mitigate unidentified bias, and to evaluate mitigation methods, are urgently needed yet remain underexplored. We introduce: (i) Diverse Class-Aware Self-Training (DCAST), a model-agnostic mitigation strategy aware of class-specific bias, which promotes sample diversity to counter the confirmation bias of conventional self-training while leveraging unlabeled samples for an improved representation of the underlying population; (ii) hierarchy bias, a multivariate, class-aware technique for inducing bias without prior knowledge. Models learned with DCAST showed improved robustness to hierarchy and other biases across eleven datasets, compared with conventional self-training and six prominent domain adaptation techniques. The advantage was largest on multi-class classification, highlighting DCAST as a promising strategy for fairer learning in different contexts.
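The abstract describes DCAST only at a high level; the sketch below illustrates the general idea of diversity-promoting, class-aware self-training under stated assumptions. The specific choices here (per-class confidence thresholding and greedy farthest-point sampling as the diversity heuristic, a logistic-regression base learner) are illustrative assumptions, not the paper's actual procedure:

```python
# Illustrative sketch of diverse class-aware self-training.
# Assumption: diversity is approximated by greedy farthest-point
# sampling among confidently pseudo-labeled samples of each class,
# instead of taking only the most confident ones (which conventional
# self-training does, reinforcing confirmation bias).
import numpy as np
from sklearn.linear_model import LogisticRegression

def farthest_point_subset(X, k):
    """Greedily pick up to k mutually distant rows of X (diversity heuristic)."""
    idx = [0]
    d = np.linalg.norm(X - X[0], axis=1)
    while len(idx) < min(k, len(X)):
        nxt = int(np.argmax(d))
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(idx)

def diverse_self_train(X_lab, y_lab, X_unlab, rounds=3, per_class=10, conf=0.8):
    """Self-training that adds a diverse, class-balanced batch of
    pseudo-labeled samples each round, rather than only the top-confidence ones."""
    model = LogisticRegression(max_iter=1000)
    X, y, pool = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(rounds):
        model.fit(X, y)
        if len(pool) == 0:
            break
        proba = model.predict_proba(pool)
        pred, pmax = proba.argmax(axis=1), proba.max(axis=1)
        take = []
        for c in np.unique(y_lab):
            # Confident candidates for class c, then a diverse subset of them.
            cand = np.where((pred == c) & (pmax >= conf))[0]
            if len(cand):
                take.extend(cand[farthest_point_subset(pool[cand], per_class)])
        if not take:
            break
        take = np.array(take)
        X = np.vstack([X, pool[take]])
        y = np.concatenate([y, pred[take]])
        pool = np.delete(pool, take, axis=0)
    model.fit(X, y)
    return model
```

Selecting a spread-out subset per class, rather than only the highest-confidence samples, is one way to counter the tendency of self-training to repeatedly confirm the classifier's existing decision boundary; the class-aware loop keeps minority classes represented in each pseudo-labeled batch.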