Fairness Feedback Loops: Training on Synthetic Data Amplifies Bias (2403.07857v1)

Published 12 Mar 2024 in cs.LG

Abstract: Model-induced distribution shifts (MIDS) occur as previous model outputs pollute new model training sets over generations of models. This is known as model collapse in the case of generative models, and performative prediction or unfairness feedback loops for supervised models. When a model induces a distribution shift, it also encodes its mistakes, biases, and unfairnesses into the ground truth of its data ecosystem. We introduce a framework that allows us to track multiple MIDS over many generations, finding that they can lead to loss in performance, fairness, and minoritized group representation, even in initially unbiased datasets. Despite these negative consequences, we identify how models might be used for positive, intentional interventions in their data ecosystems, providing redress for historical discrimination through a framework called algorithmic reparation (AR). We simulate AR interventions by curating representative training batches for stochastic gradient descent to demonstrate how AR can improve upon the unfairnesses of models and data ecosystems subject to other MIDS. Our work takes an important step towards identifying, mitigating, and taking accountability for the unfair feedback loops enabled by the idea that ML systems are inherently neutral and objective.
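
The loop the abstract describes can be prototyped in a few lines. The sketch below is a hypothetical illustration, not the authors' code or experimental setup: each generation of a classifier is trained on the previous generation's predicted labels (the sequential MIDS), and an optional AR-style intervention resamples the training data so a minoritized group is equally represented, a coarse stand-in for the paper's curated SGD batches. The population model, group fractions, generation count, and classifier choice are all illustrative assumptions.

```python
# Hypothetical sketch of a model-induced distribution shift (MIDS) loop with an
# algorithmic-reparation (AR) style intervention. Not the paper's setup: the
# synthetic population, group fraction, and LogisticRegression are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def make_population(rng, n=4000, minority_frac=0.15):
    """Two groups whose feature distributions are slightly shifted."""
    group = (rng.random(n) < minority_frac).astype(int)
    x = rng.normal(loc=group[:, None] * 0.75, scale=1.0, size=(n, 2))
    y = (x[:, 0] + 0.3 * rng.normal(size=n) > 0.4).astype(int)
    return x, y, group

def curate_balanced(rng, x, labels, group, per_group=500):
    """AR-style curation: resample so both groups are equally represented."""
    idx = np.concatenate(
        [rng.choice(np.where(group == k)[0], per_group) for k in (0, 1)]
    )
    return x[idx], labels[idx]

def run(generations=5, reparative=False, seed=0):
    rng = np.random.default_rng(seed)
    x, y_true, group = make_population(rng)
    labels = y_true.copy()  # generation 0 trains on real labels
    for _ in range(generations):
        if reparative:
            x_train, y_train = curate_balanced(rng, x, labels, group)
        else:
            x_train, y_train = x, labels
        model = LogisticRegression().fit(x_train, y_train)
        # MIDS: the next generation's "ground truth" is this model's output,
        # so its mistakes and biases accumulate in the data ecosystem.
        labels = model.predict(x)
    # Per-group accuracy of the final generation's labels against the originals.
    return [(labels[group == k] == y_true[group == k]).mean() for k in (0, 1)]

for reparative in (False, True):
    maj, mino = run(reparative=reparative)
    mode = "AR-curated batches" if reparative else "no intervention"
    print(f"{mode:>18}: majority acc {maj:.3f}, minority acc {mino:.3f}")
```

Note that this resamples the whole training set per generation rather than curating individual SGD minibatches as the paper does; it is only meant to show how a reparative sampling rule slots into the feedback loop.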
