
Consensus and Subjectivity of Skin Tone Annotation for ML Fairness (2305.09073v3)

Published 16 May 2023 in cs.CV and cs.CY

Abstract: Understanding different human attributes and how they affect model behavior may become a standard need for all model creation and usage, from traditional computer vision tasks to the newest multimodal generative AI systems. In computer vision specifically, we have relied on datasets augmented with perceived attribute signals (e.g., gender presentation, skin tone, and age) and benchmarks enabled by these datasets. Typically labels for these tasks come from human annotators. However, annotating attribute signals, especially skin tone, is a difficult and subjective task. Perceived skin tone is affected by technical factors, like lighting conditions, and social factors that shape an annotator's lived experience. This paper examines the subjectivity of skin tone annotation through a series of annotation experiments using the Monk Skin Tone (MST) scale, a small pool of professional photographers, and a much larger pool of trained crowdsourced annotators. Along with this study we release the Monk Skin Tone Examples (MST-E) dataset, containing 1515 images and 31 videos spread across the full MST scale. MST-E is designed to help train human annotators to annotate MST effectively. Our study shows that annotators can reliably annotate skin tone in a way that aligns with an expert in the MST scale, even under challenging environmental conditions. We also find evidence that annotators from different geographic regions rely on different mental models of MST categories resulting in annotations that systematically vary across regions. Given this, we advise practitioners to use a diverse set of annotators and a higher replication count for each image when annotating skin tone for fairness research.
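The abstract's recommendation to use a higher replication count and then aggregate labels across annotators can be sketched as follows. This is an illustrative assumption, not the paper's exact protocol: the median aggregator, the 1-10 MST range, and the +/-1 agreement tolerance are all choices made here for demonstration.

```python
from statistics import median

def aggregate_mst(annotations, tolerance=1):
    """Aggregate replicated Monk Skin Tone (MST, 1-10) annotations for one image.

    Returns the median label and the fraction of annotators whose label falls
    within `tolerance` points of that median (a simple agreement proxy; the
    median and the +/-1 tolerance are illustrative assumptions, not the
    paper's published methodology).
    """
    agg = median(annotations)
    agreement = sum(abs(a - agg) <= tolerance for a in annotations) / len(annotations)
    return agg, agreement

# Example: five replicated annotations for one image
labels = [5, 6, 5, 7, 5]
agg, agreement = aggregate_mst(labels)
# agg == 5; agreement == 0.8 (four of five annotators within +/-1 of the median)
```

With more replications per image, a low agreement score can flag images whose perceived skin tone is ambiguous (e.g., due to lighting) or where annotators' regional mental models of the MST categories diverge, which is the failure mode the paper warns about.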

References (86)
Authors (6)
  1. Candice Schumann (10 papers)
  2. Gbolahan O. Olanubi (2 papers)
  3. Auriel Wright (2 papers)
  4. Ellis Monk Jr. (1 paper)
  5. Courtney Heldreth (5 papers)
  6. Susanna Ricco (10 papers)
Citations (16)
