
Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks (1702.05870v5)

Published 20 Feb 2017 in cs.LG, cs.AI, and stat.ML

Abstract: Traditionally, multi-layer neural networks use the dot product between the output vector of the previous layer and the incoming weight vector as the input to the activation function. The result of the dot product is unbounded, which increases the risk of large variance. Large variance in a neuron makes the model sensitive to changes in the input distribution, resulting in poor generalization, and aggravates internal covariate shift, which slows down training. To bound the dot product and decrease the variance, we propose to use cosine similarity or centered cosine similarity (the Pearson correlation coefficient) instead of the dot product in neural networks, which we call cosine normalization. We compare cosine normalization with batch, weight and layer normalization in fully-connected as well as convolutional neural networks on the MNIST, 20NEWS GROUP, CIFAR-10/100 and SVHN datasets. Experiments show that cosine normalization achieves better performance than the other normalization techniques.

Citations (192)

Summary

  • The paper proposes replacing the traditional dot product with cosine similarity (particularly centered cosine similarity) in neural networks to constrain activations and reduce variance.
  • Experimental results show cosine normalization consistently achieves lower error rates and greater stability across various datasets compared to other normalization methods like batch normalization.
  • This method improves training efficiency by allowing higher learning rates, and enhances generalizability and robustness to scaling and shifting of the inputs.

Analysis of Cosine Normalization in Neural Networks

In "Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks," the authors propose cosine normalization as a replacement for the traditional dot product operation. The method is designed to tackle the large variance of neural activations and the resulting sensitivity to changes in the input distribution, which lead to poor generalization and to the training slowdown associated with internal covariate shift.

Motivation and Approach

The primary issue addressed is the unbounded nature of the dot product, which induces a risk of large variance in neural units. The authors propose cosine similarity, and in particular centered cosine similarity (the Pearson correlation coefficient), as a bounded alternative. This replacement constrains pre-activation values to the range [-1, 1], permitting higher learning rates and making the network less sensitive to variance in its inputs.

Cosine normalization modifies the traditional architecture by computing the cosine similarity between the input and weight vectors rather than their dot product, $\cos \theta = \frac{\vec{w} \cdot \vec{x}}{\left\|\vec{w}\right\| \left\|\vec{x}\right\|}$, which keeps the pre-activation stable across varied input magnitudes.
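To make the operation concrete, here is a minimal sketch of such a layer in PyTorch. The class name `CosineLinear`, the `centered` flag, and the initialization are illustrative choices under the description above, not the authors' reference implementation.

```python
# A minimal sketch of a cosine-normalized linear layer (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineLinear(nn.Module):
    """Linear layer whose pre-activation is cos(theta) between weight and input."""

    def __init__(self, in_features: int, out_features: int,
                 centered: bool = False, eps: float = 1e-8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.centered = centered
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        if self.centered:
            # Centered variant: subtract per-vector means, giving the
            # Pearson correlation coefficient instead of plain cosine similarity.
            x = x - x.mean(dim=1, keepdim=True)
            w = w - w.mean(dim=1, keepdim=True)
        # Normalize inputs and weights to unit length, then take the dot
        # product; the result is bounded in [-1, 1].
        x_hat = F.normalize(x, dim=1, eps=self.eps)
        w_hat = F.normalize(w, dim=1, eps=self.eps)
        return x_hat @ w_hat.t()
```

The bounded output can then be fed to the usual nonlinearity in place of an ordinary linear layer's pre-activation.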

Experimental Evaluation

The effectiveness of cosine normalization was tested against established normalization techniques—batch, weight, and layer normalization—across several datasets (MNIST, CIFAR-10/100, SVHN, 20NEWS GROUP) and network types (fully-connected and convolutional).

Key findings from the experiments are as follows:

  • Cosine normalization consistently achieved lower error rates across all datasets compared to conventional normalization methods.
  • For example, on CIFAR-10, centered cosine normalization reduced the test error to 6.39%, compared with 8.08% for batch normalization.
  • The variance of test error was significantly lower for cosine normalization, indicating more stable learning and generalization during training.
  • In particular, centered cosine normalization yielded the largest improvements on high-dimensional classification tasks, as demonstrated by its performance on the 20NEWS GROUP dataset.

Implications and Future Directions

The paper's outcomes suggest that cosine normalization provides a robust framework for reducing excessive neuron variance while enhancing model stability and performance across diverse tasks. The implications extend to theoretical neural network optimization and hint towards potentially novel architectures that utilize similarity measures as foundational operations.

Practically, this approach facilitates the design of networks with improved training efficiency, since higher learning rates can be used without causing the activation variance to blow up. With its inherent robustness to scaling and shifting of the inputs, cosine normalization enables more generalizable models that perform well in both standard and more challenging settings.
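As a quick illustration of that robustness, using the hypothetical `CosineLinear` sketch from above: scaling the input by a positive constant leaves the cosine-normalized output unchanged, and the centered variant additionally cancels constant shifts.

```python
# Illustrative check of scale and shift invariance (assumes the CosineLinear sketch above).
x = torch.randn(4, 16)

layer = CosineLinear(16, 8)
print(torch.allclose(layer(x), layer(10.0 * x), atol=1e-6))   # True: invariant to input scaling

layer_c = CosineLinear(16, 8, centered=True)
print(torch.allclose(layer_c(x), layer_c(x + 3.0), atol=1e-6))  # True: centering also removes constant shifts
```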

Looking forward, future avenues may explore broader adaptations of similarity metrics beyond cosine, assess impacts on deeper networks, and integrate cosine normalization into architectures featuring complex operations or unique activation functions. The exploration of synergistic integration with other advanced normalization techniques could further leverage the complementary benefits observed in this paper. It is conceivable that cosine normalization could reshape elements of model design, paving the way for more stable and high-performing machine learning applications.