
CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? (2403.04547v1)

Published 7 Mar 2024 in cs.LG and cs.AI

Abstract: We study the effectiveness of data-balancing for mitigating biases in contrastive language-image pretraining (CLIP), identifying areas of strength and limitation. First, we reaffirm prior conclusions that CLIP models can inadvertently absorb societal stereotypes. To counter this, we present a novel algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both representation and association biases (i.e. in first- and second-order statistics) in multimodal data. We use M4 to conduct an in-depth analysis taking into account various factors, such as the model, representation, and data size. Our study also explores the dynamic nature of how CLIP learns and unlearns biases. In particular, we find that fine-tuning is effective in countering representation biases, though its impact diminishes for association biases. Also, data balancing has a mixed impact on quality: it tends to improve classification but can hurt retrieval. Interestingly, data and architectural improvements seem to mitigate the negative impact of data balancing on performance; e.g. applying M4 to SigLIP-B/16 with data quality filters improves COCO image-to-text retrieval @5 from 86% (without data balancing) to 87% and ImageNet 0-shot classification from 77% to 77.5%! Finally, we conclude with recommendations for improving the efficacy of data balancing in multimodal systems.

Exploring the Impact of Data Balancing in Multimodal Learning Systems

Introduction

Multimodal systems that align embeddings across modalities, such as CLIP (Contrastive Language-Image Pretraining), have advanced rapidly in recent years. However, these systems are not without flaws: they can inadvertently encode and amplify societal stereotypes and biases, leading to potential harms. Data balancing, the process of adjusting the training dataset to reduce such biases, is one potential mitigation strategy. This blog post examines how effective data balancing is at reducing biases in CLIP models, which are widely used across a range of applications.

Data Balancing Algorithm: Multi-Modal Moment Matching (M4)

We introduce the Multi-Modal Moment Matching (M4) algorithm, which aims to reduce both representation and association biases in multimodal data. Representation bias concerns the overall prevalence of sensitive-attribute categories, while association bias concerns the correlation between those attributes and other attributes in the dataset. M4 reweights training examples so that their distribution matches a target distribution, satisfying both types of bias constraints simultaneously. The approach is flexible, handling an arbitrary number of overlapping groups and attributes, and serves as a strong baseline for bias mitigation in overparameterized models.
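To make the idea concrete, here is a minimal moment-matching sketch in Python. It is an illustration of the general technique, not the paper's M4 implementation (which solves the problem at scale with stochastic optimization): it finds per-example weights whose first-order statistics of the sensitive attributes match a target (representation) and whose second-order statistics with other attributes are pushed toward independence (association), while staying close to uniform weights. The function name `m4_style_weights`, the SciPy SLSQP solver, and the toy data are our own assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def m4_style_weights(S, Y, target_s):
    """Moment-matching reweighting sketch (illustrative, not the paper's M4 solver).

    Finds per-example weights w >= 0 with sum(w) = 1 such that:
      * first order (representation): the weighted marginal of the sensitive
        attributes S matches target_s;
      * second order (association): the weighted covariance between S and the
        other attributes Y is driven toward zero (independence under w).
    S: (n, k) binary indicators of sensitive-attribute categories.
    Y: (n, m) binary indicators of the attributes S should be decorrelated from.
    """
    n = S.shape[0]
    w0 = np.full(n, 1.0 / n)  # start from uniform weights

    def dev_from_uniform(w):
        # Objective: stay as close to uniform weighting as the constraints allow.
        return np.sum((w - 1.0 / n) ** 2)

    constraints = [
        {"type": "eq", "fun": lambda w: w.sum() - 1.0},              # valid distribution
        {"type": "eq", "fun": lambda w: w @ S - target_s},           # representation
        {"type": "eq", "fun": lambda w: ((w[:, None] * S).T @ Y      # association
                                         - np.outer(w @ S, w @ Y)).ravel()},
    ]
    result = minimize(dev_from_uniform, w0, method="SLSQP",
                      bounds=[(0.0, None)] * n, constraints=constraints)
    return result.x

# Toy usage: 100 examples, 2 sensitive groups, 3 co-occurring attributes.
rng = np.random.default_rng(0)
S = np.eye(2)[rng.integers(0, 2, size=100)]
Y = (rng.random((100, 3)) < 0.3).astype(float)
weights = m4_style_weights(S, Y, target_s=np.array([0.5, 0.5]))
```

The resulting weights can then be used to resample or reweight the training data; the paper's M4 additionally handles overlapping groups and relaxes the constraints when no exact solution exists.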

Key Findings

Our empirical study, which involved training over 150 models, yields several insights:

  • Impact on representation bias: Including proxies (attributes not directly related to the sensitive attributes but potentially acting as indirect links) substantially mitigates representation bias, making the model less likely to favor certain subgroups in unrelated contexts.
  • Effect on association bias: While data balancing generally aids in reducing association bias, the addition of proxies might adversely affect this endeavor due to competing constraints during the balancing process.
  • Effectiveness of fine-tuning: Fine-tuning on balanced data proves effective in mitigating representation bias, showcasing the model's sensitivity to the data distribution it last encountered (see the weighted-loss sketch after this list).
  • Association bias dynamics: Unlike representation bias, the change in association bias is more gradual and depends on how long the model is trained on balanced data.
  • Model quality concerns: Balancing the data impacts the model's performance in nuanced ways. It tends to enhance classification performance but may degrade retrieval metrics, possibly due to shifts in the distribution of human and non-human examples in the training data.
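To connect this to the fine-tuning finding above, the sketch below illustrates one way per-example balancing weights could enter a CLIP-style symmetric contrastive loss, so that fine-tuning on balanced data amounts to reweighting each image-text pair's loss term. This is a minimal NumPy sketch under our own assumptions (the function name, the softmax contrastive loss rather than SigLIP's sigmoid loss, and the temperature value are illustrative), not the authors' training code.

```python
import numpy as np

def weighted_clip_loss(img_emb, txt_emb, weights, temperature=0.07):
    """Symmetric CLIP-style contrastive loss with per-example balancing weights
    (illustrative sketch; assumes matching image/text pairs share a row index)."""
    # L2-normalise embeddings so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) similarity matrix

    def xent_diag(l):
        # Cross-entropy with the matching pair (the diagonal) as the target.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_probs)

    per_pair = 0.5 * (xent_diag(logits) + xent_diag(logits.T))
    w = weights / weights.sum()  # normalise the batch's balancing weights
    return float(np.sum(w * per_pair))

# Toy usage: a batch of 8 random image/text embeddings with M4-style weights.
rng = np.random.default_rng(0)
loss = weighted_clip_loss(rng.normal(size=(8, 16)),
                          rng.normal(size=(8, 16)),
                          rng.random(8))
```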

Further Observations and Recommendations

Our in-depth analysis reveals mixed results on the efficacy of data balancing for addressing biases in CLIP models. While it offers a way to tackle representation and association biases, it is not a panacea and should be treated as one element of a broader strategy that may also include in-processing and post-processing interventions. Given the nuanced effects on model quality, we recommend balancing data from the onset of training and assessing impact across both human-related and non-human-related metrics. Our findings also suggest that improvements in data quality and model architecture can offset the negative impact of data balancing on performance.

Conclusion

The exploration into data balancing as a mitigation strategy for biases in CLIP models has unearthed nuanced impacts—both positive and negative—on biases and model performance. This complex landscape underscores the need for comprehensive strategies that go beyond data balancing to effectively tackle bias in multimodal learning systems. Future work may explore additional interventions, including data augmentation techniques, to further refine the efficacy of these systems in a bias-conscious manner.

Authors (6)
  1. Ibrahim Alabdulmohsin
  2. Xiao Wang
  3. Andreas Steiner
  4. Priya Goyal
  5. Alexander D'Amour
  6. Xiaohua Zhai