- The paper reveals that cosine similarity underestimates semantic similarity for high frequency words in BERT embeddings.
- It demonstrates that high frequency words occupy larger geometric spaces, causing misalignment with human similarity judgments.
- The findings prompt a reevaluation of cosine similarity in NLP tasks and suggest model training adjustments to mitigate bias.
Analysis of Cosine Similarity in Contextual Word Embeddings
The paper "Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words" provides a comprehensive examination of the reliability and accuracy of cosine similarity measures when applied to high frequency words in contextual word embeddings, specifically BERT embeddings. The authors argue that, contrary to human judgment, cosine similarity underestimates the semantic similarity of high frequency words, attributing this discrepancy to differences in representational geometry based on frequency.
Key Findings
- Frequency and Geometry: High frequency words occupy a considerably larger region of embedding space. The radius of the minimum bounding hypersphere of a word's sibling embeddings (its contextual embeddings across different sentences) is positively correlated with word frequency, implying substantially greater spatial variability for frequent words.
- Effect on Human Judgments: Through regression analyses on the Word-in-Context (WiC) and Stanford Contextual Word Similarities (SCWS) datasets, the authors show that, even after controlling for polysemy and other factors, word frequency negatively predicts cosine similarity relative to human similarity ratings. In other words, the representations of high frequency words are more dispersed, so cosine similarity underestimates their semantic similarity compared with human evaluations.
- Theoretical Intuition: The paper offers a two-dimensional geometric model that explains how representational geometry depresses cosine similarity: when sibling embeddings are spread over a larger bounding region, their expected pairwise cosine similarity drops. The distribution of sibling embeddings within that region, together with the anisotropy of the embedding space, further contributes to the observed underestimation. A minimal simulation after this list illustrates the effect.
- Implications for NLP Tasks: Because cosine similarity underpins numerous NLP applications and metrics (e.g., BERTScore), the findings call for a reevaluation of these measures, especially for tasks involving high frequency words. The underestimation has broad implications for applications such as machine translation, information retrieval, and question answering.
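The following simulation is a minimal sketch of the two-dimensional intuition above (my own illustration, not the authors' code): sibling embeddings for a "frequent" word are drawn from a wider disk around a fixed anchor point than those of a "rare" word, and the wider disk yields both a larger bounding radius and a lower mean pairwise cosine similarity. The anchor point, radii, and sample sizes are arbitrary.

```python
# Sketch of the 2-D intuition: a wider bounding region around the same anchor
# lowers the average pairwise cosine similarity of the sampled "siblings".
import numpy as np

rng = np.random.default_rng(0)

def simulate_siblings(radius: float, n: int = 200) -> np.ndarray:
    """Sample n 2-D sibling embeddings uniformly inside a disk of the given
    radius centred on a fixed anchor point."""
    center = np.array([5.0, 5.0])
    angles = rng.uniform(0, 2 * np.pi, n)
    radii = radius * np.sqrt(rng.uniform(0, 1, n))  # uniform over the disk
    offsets = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
    return center + offsets

def mean_pairwise_cosine(x: np.ndarray) -> float:
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit @ unit.T
    upper = np.triu_indices(len(x), k=1)
    return sims[upper].mean()

for label, radius in [("low-frequency (tight)", 0.5), ("high-frequency (wide)", 4.0)]:
    siblings = simulate_siblings(radius)
    print(f"{label}: bounding radius ~{radius}, "
          f"mean pairwise cosine = {mean_pairwise_cosine(siblings):.3f}")
```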
Implications and Future Directions
The paper underscores the need for techniques that account for frequency-based distortions in contextual embeddings. Researchers should consider interventions during model training that correct for these biases, potentially improving the alignment between cosine similarity and human intuition. The work also suggests that documenting dataset construction, including word frequency distributions, is critical for diagnosing and reducing the frequency-related inequities and biases inherent in LLMs.
In theoretical terms, further investigation of the geometric properties of contextual embeddings could yield additional insight into semantic similarity measures. Post-processing techniques that realign similarity estimates with human judgments could also be explored; a hypothetical frequency-aware recalibration is sketched below. As LLMs grow in scale, understanding and mitigating frequency-related effects becomes increasingly important.
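As one illustration of the post-processing idea, the sketch below (hypothetical, not a method from the paper) recalibrates raw cosine scores against human similarity ratings while letting the correction depend on the target word's log frequency. The variable names and synthetic data are placeholders; in practice the ratings would come from a dataset such as SCWS paired with corpus frequencies.

```python
# Hypothetical post-processing sketch: fit a linear recalibration of cosine
# scores with log frequency as a covariate, then apply it to new scores.
import numpy as np

def fit_calibration(cosine_scores, log_freqs, human_ratings):
    """Least-squares fit: human ~ a*cosine + b*log_freq + c."""
    X = np.column_stack([cosine_scores, log_freqs, np.ones_like(cosine_scores)])
    coefs, *_ = np.linalg.lstsq(X, human_ratings, rcond=None)
    return coefs

def calibrated_similarity(cosine_score, log_freq, coefs):
    a, b, c = coefs
    return a * cosine_score + b * log_freq + c

# Toy illustration with synthetic data standing in for real annotations.
rng = np.random.default_rng(1)
log_freqs = rng.uniform(2, 12, 500)
human = rng.uniform(0, 1, 500)
cosine = human - 0.03 * log_freqs + rng.normal(0, 0.05, 500)  # frequency-biased
coefs = fit_calibration(cosine, log_freqs, human)
print("fitted coefficients (cosine, log_freq, intercept):", np.round(coefs, 3))
```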
Overall, this paper provides a critical analysis of the limitations of cosine similarity as a measure of semantic similarity, setting the stage for future research aimed at overcoming these challenges and refining our approach to NLP tools and models.