Investigating the Frequency Distortion of Word Embeddings and Its Impact on Bias Metrics (2211.08203v2)
Abstract: Recent research has shown that static word embeddings can encode word frequency information. However, little is known about this phenomenon and its effects on downstream tasks. In the present work, we systematically study the association between frequency and semantic similarity in several static word embeddings. We find that Skip-gram, GloVe and FastText embeddings tend to produce higher semantic similarity between high-frequency words than between other frequency combinations. We show that the association between frequency and similarity persists even when the words in the corpus are randomly shuffled. This proves that the patterns found are not due to real semantic associations present in the texts, but are an artifact produced by the word embeddings. Finally, we provide an example of how word frequency can strongly impact the measurement of gender bias with embedding-based metrics. In particular, we carry out a controlled experiment showing that, by manipulating word frequencies, biases can even change sign or reverse their ranking.
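The embedding-based gender bias metrics discussed in the abstract typically score a target word by comparing its cosine similarity to two sets of gendered attribute words. A minimal sketch of that scoring scheme (the vectors and word lists below are illustrative placeholders, not the embeddings or lexicons used in the paper):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def gender_bias(word_vec, male_vecs, female_vecs):
    """Mean cosine similarity to the male attribute set minus the mean
    cosine similarity to the female attribute set; positive values
    indicate a male-leaning association for the target word."""
    male_sim = np.mean([cosine(word_vec, m) for m in male_vecs])
    female_sim = np.mean([cosine(word_vec, f) for f in female_vecs])
    return male_sim - female_sim

# Toy 3-dimensional vectors, purely for illustration.
he = np.array([1.0, 0.0, 0.0])
she = np.array([0.0, 1.0, 0.0])
engineer = np.array([0.9, 0.1, 0.1])  # closer to "he" in this toy space

print(gender_bias(engineer, [he], [she]) > 0)
```

Because such scores depend only on cosine similarities, any systematic distortion of similarities by word frequency (the effect studied in the paper) propagates directly into the bias measurement.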
- Francisco Valentini
- Juan Cruz Sosa
- Diego Fernandez Slezak
- Edgar Altszyler