
The Hidden Pitfalls of the Cosine Similarity Loss (2406.16468v1)

Published 24 Jun 2024 in cs.LG and stat.ML

Abstract: We show that the gradient of the cosine similarity between two points goes to zero in two under-explored settings: (1) if a point has large magnitude or (2) if the points are on opposite ends of the latent space. Counterintuitively, we prove that optimizing the cosine similarity between points forces them to grow in magnitude. Thus, (1) is unavoidable in practice. We then observe that these derivations are extremely general -- they hold across deep learning architectures and for many of the standard self-supervised learning (SSL) loss functions. This leads us to propose cut-initialization: a simple change to network initialization that helps all studied SSL methods converge faster.

Summary

  • The paper demonstrates how the cosine similarity loss causes gradient stagnation by inflating embedding norms, resulting in a quadratic slowdown in convergence.
  • It shows that large-magnitude embeddings and points on opposite ends of the latent space hinder gradient flow, critically impacting SSL methods such as SimCLR and MoCov2 (see the short numerical check after this list).
  • Cut-initialization combined with weight decay effectively mitigates these issues, accelerating convergence and improving k-NN classifier accuracy in experiments.
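
As a quick numerical check of the second summary point, the snippet below evaluates the gradient of the cosine similarity with PyTorch autograd. It is an illustrative sketch of ours, not code from the paper; the helper cosine_grad_norm and the example vectors are our own choices.

```python
import torch
import torch.nn.functional as F

def cosine_grad_norm(z1, z2):
    """Norm of the gradient of cos(z1, z2) with respect to z1, computed via autograd."""
    z1 = z1.clone().detach().requires_grad_(True)
    F.cosine_similarity(z1, z2, dim=0).backward()
    return z1.grad.norm().item()

z1 = torch.tensor([1.0, 0.5])
z2 = torch.tensor([0.5, 1.0])

print(cosine_grad_norm(z1, z2))          # moderate gradient at small magnitude (~0.54)
print(cosine_grad_norm(100.0 * z1, z2))  # roughly 100x smaller: the large-magnitude regime
print(cosine_grad_norm(z1, -z1))         # ~0: points on opposite ends of the latent space
```

Both vanishing-gradient regimes show up directly: scaling one point's magnitude shrinks its gradient proportionally, and an exactly opposed pair receives essentially no gradient at all.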

Overview of "The Hidden Pitfalls of the Cosine Similarity Loss"

Cosine similarity is a widely used loss function in self-supervised learning (SSL), yet its implications for optimization have been largely overlooked. The paper "The Hidden Pitfalls of the Cosine Similarity Loss" provides a rigorous examination of the behavior of cosine-similarity gradients in SSL, identifying conditions under which the gradient approaches zero and convergence consequently stalls. The work highlights two scenarios that produce this phenomenon: (1) when a point in the latent space has a large magnitude, and (2) when the points lie on opposite sides of the latent space.
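
Both failure modes can be read directly off the gradient of the cosine similarity with respect to one of the points. The short derivation below is our own illustrative sketch, in our notation rather than the paper's:

$$\cos(z_1, z_2) = \frac{z_1^{\top} z_2}{\|z_1\|\,\|z_2\|}, \qquad \nabla_{z_1} \cos(z_1, z_2) = \frac{1}{\|z_1\|} \left( \frac{z_2}{\|z_2\|} - \cos(z_1, z_2)\, \frac{z_1}{\|z_1\|} \right),$$

$$\bigl\| \nabla_{z_1} \cos(z_1, z_2) \bigr\| = \frac{\sqrt{1 - \cos^2(z_1, z_2)}}{\|z_1\|}.$$

The $1/\|z_1\|$ prefactor drives the gradient to zero as the embedding's magnitude grows (scenario 1), while the $\sqrt{1 - \cos^2(z_1, z_2)}$ factor drives it to zero as the two points approach exact opposition, i.e. $\cos(z_1, z_2) \to -1$ (scenario 2).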

Theoretical Insights

The authors demonstrate through mathematical derivations that optimizing the cosine similarity itself increases the magnitudes of the embeddings, creating a catch-22: models train best with small embeddings, yet the optimization process inherently enlarges them. They quantify this adverse effect with a bound showing that large embedding norms impose a quadratic slowdown on the convergence of gradient descent. The opposite-halves effect, though less significant, can also slow convergence.
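
The norm-growth claim can be made intuitive with a one-line argument (ours, under the simplifying assumption of gradient ascent applied directly to the embedding rather than through the network weights): the cosine-similarity gradient is orthogonal to $z_1$, so every ascent step can only lengthen the embedding,

$$z_1^{\top} \nabla_{z_1} \cos(z_1, z_2) = 0 \quad\Longrightarrow\quad \bigl\| z_1 + \eta\, \nabla_{z_1} \cos(z_1, z_2) \bigr\|^2 = \|z_1\|^2 + \eta^2 \bigl\| \nabla_{z_1} \cos(z_1, z_2) \bigr\|^2 \;\ge\; \|z_1\|^2.$$

Since the gradient magnitude also scales as $1/\|z_1\|$, a larger embedding both receives smaller updates and must travel proportionally farther to change its angle by a fixed amount, which is consistent with the quadratic slowdown bound stated in the paper.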

Implications and Experimental Verification

The implications of these findings are noteworthy across architectures and objectives optimized through cosine similarity, including the InfoNCE loss, which is prevalent in contrastive SSL methods such as SimCLR and MoCov2. The paper extends its theoretical analyses with empirical evaluations, demonstrating that large embedding magnitudes significantly slow down SSL model convergence in practice. Experiments confirm that reducing embedding norms, through weight decay or normalization, improves SSL training outcomes, as evidenced by higher k-NN classifier accuracy.

Proposed Solution: Cut-Initialization

In response to the pitfalls identified, the authors propose a novel technique, cut-initialization, designed to mitigate the growth of embeddings in the initial training phase. This method adjusts the weight initialization by a factor of $c > 1$, effectively curtailing embedding magnitudes from the outset and thereby expediting convergence across the studied settings. The experiments reveal that, when combined with weight decay, cut-initialization significantly accelerates SSL training, achieving faster convergence and higher performance across a variety of datasets and SSL architectures.
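
As a concrete illustration (not the authors' reference implementation), the PyTorch sketch below scales down a network's initial weights by a constant $c > 1$ and pairs it with weight decay in the optimizer. The function name cut_init, the choice to rescale every linear and convolutional layer, and the values of $c$ and the weight decay are our own illustrative assumptions; the paper's exact scheme may differ.

```python
import torch
import torch.nn as nn

def cut_init(model: nn.Module, c: float = 2.0) -> nn.Module:
    """Shrink the standard initialization by a factor c > 1 (illustrative sketch).

    Smaller initial weights mean smaller initial embedding norms, which keeps the
    cosine-similarity gradients from vanishing early in training.
    """
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, (nn.Linear, nn.Conv2d)):
                module.weight.div_(c)  # assumed: rescale every linear/conv layer
    return model

# Usage sketch: a toy projection head, cut-initialized and trained with weight decay.
encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
encoder = cut_init(encoder, c=2.0)
optimizer = torch.optim.SGD(encoder.parameters(), lr=0.1, weight_decay=1e-4)
```

Weight decay complements the initialization by counteracting norm growth during training, which matches the combination the experiments report as effective.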

Conclusion and Future Directions

In summary, the research delineates the inherent challenges posed by using cosine similarity as a loss function, particularly in the SSL setting. By identifying the circumstances under which cosine-similarity gradients hinder convergence, it offers cut-initialization as a practical initialization strategy to improve training efficiency. Moving forward, researchers might explore further refinements of initialization strategies, investigate alternative loss functions, or develop complementary methods that jointly address vanishing cosine-similarity gradients and embedding-norm inflation. Expanding the scope to more diverse architectures and datasets may yield more generalized strategies adaptable to a wider range of machine learning contexts.
