
Denoising Cosine Similarity (dCS)

Updated 22 April 2026
  • dCS is a mathematically principled method that corrects bias and reduces noise in cosine similarity estimates using eigenvalue clipping and explicit mean correction.
  • It leverages random matrix theory and statistical estimators to isolate the true signal from noise, substantially improving performance in k-nearest neighbor systems.
  • dCS enhances self-supervised representation learning by offering bias-corrected loss functions that lead to more stable and accurate downstream performance.

Denoising Cosine Similarity (dCS) refers to a suite of mathematically principled methods that correct the bias and reduce the noise intrinsic to empirical cosine similarity estimates, particularly when data are contaminated by sampling artifacts or stochastic noise. dCS approaches have recently been formalized both as spectral-cleaning operators for collaborative filtering and as self-supervised loss functions for robust representation learning. These procedures employ random matrix theory, eigenvalue shrinkage, explicit mean-correction, and statistical estimators to isolate true signal from noise-driven spurious similarity, substantially improving downstream performance in k-nearest neighbor (k-NN) systems and deep autoencoder frameworks (Khawar et al., 2019, Nakagawa et al., 2023).

1. Foundations of Cosine Similarity and Its Limitations

Cosine similarity is widely adopted for quantifying alignment between vectors in memory-based recommender systems and as an objective in representation learning. Given two vectors $u, v \in \mathbb{R}^D$, the cosine similarity in its negative (loss) form is

$$\ell_{CS}(u, v) = - \frac{ \langle u, v \rangle }{ \| u \|_2 \| v \|_2 }.$$

In collaborative filtering, the empirical cosine similarity matrix between $m$ items from an $n \times m$ user-item matrix $X$ (with $x_{ij} \in \mathbb{R}$) is constructed as $S_{cos} = D^{-1/2} X^\top X D^{-1/2}$, where $D = \mathrm{diag}( \|x_{:1}\|^2, \ldots, \|x_{:m}\|^2 )$ collects the squared column norms. However, when $X$ is noisy or $n$ and $m$ are comparable in size, empirical estimates can exhibit strong noise-induced eigenvalue spread and a systematic overestimation of the leading eigenvalues, especially due to nonzero mean effects (Khawar et al., 2019, Nakagawa et al., 2023).
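The construction of $S_{cos}$ can be sketched in a few lines of NumPy. This is a minimal illustration rather than code from either cited paper; the function names `cosine_similarity_matrix` and `cs_loss` are hypothetical, and treating all-zero columns as having unit norm is an assumption made only to avoid division by zero.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Item-item cosine similarity S_cos = D^{-1/2} X^T X D^{-1/2}."""
    X = np.asarray(X, dtype=float)
    col_norms = np.linalg.norm(X, axis=0)
    # Assumption: an all-zero column is given norm 1, so its similarities are 0.
    col_norms = np.where(col_norms > 0, col_norms, 1.0)
    X_tilde = X / col_norms          # column-normalized data, X D^{-1/2}
    return X_tilde.T @ X_tilde

def cs_loss(u, v):
    """Negative cosine similarity between two vectors."""
    return -float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))         # toy n x m user-item matrix
S = cosine_similarity_matrix(X)
```

Each off-diagonal entry of `S` is the negative of `cs_loss` applied to the corresponding pair of columns, and the diagonal is 1 for every nonzero column.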

2. Spectral Properties and Random Matrix Theory Analysis

Random Matrix Theory (RMT) provides the statistical underpinning for dCS corrections. For noise-only matrices ($X$ with i.i.d. zero-mean, unit-variance entries), the empirical eigenvalue spectrum of $\frac{1}{n} X^\top X$ follows the Marčenko–Pastur law: $\rho(\lambda) = \frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2 \pi q \lambda}$ for $\lambda \in [\lambda_-, \lambda_+]$, with $q = m/n$, $\lambda_+ = (1 + \sqrt{q})^2$, $\lambda_- = (1 - \sqrt{q})^2$. This implies that in practical regimes (where $m$ and $n$ are comparable, so $q$ is not negligible), most empirical eigenvalues arise from noise (the 'noise bulk'), and only those above $\lambda_+$ are potentially signal (Khawar et al., 2019). Notably, normalization by the column norms in $S_{cos}$ induces an inherent eigenvalue shrinkage relative to the Pearson estimator, but does not fully correct for mean-induced inflation of the largest eigenvalue.
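A quick numerical check makes the noise bulk concrete. The sketch below assumes the standard Marčenko–Pastur edges $(1 \pm \sqrt{q})^2$ with $q = m/n$ for unit-variance noise; the helper name `mp_edges`, the matrix sizes, and the tolerance are illustrative choices, not values from the cited work.

```python
import numpy as np

def mp_edges(n, m):
    """Lower and upper Marchenko-Pastur edges for an n x m unit-variance noise matrix."""
    q = m / n
    return (1.0 - np.sqrt(q)) ** 2, (1.0 + np.sqrt(q)) ** 2

rng = np.random.default_rng(0)
n, m = 2000, 500                              # q = 0.25
X = rng.normal(size=(n, m))                   # pure noise: i.i.d., zero mean, unit variance
eigvals = np.linalg.eigvalsh(X.T @ X / n)     # spectrum of (1/n) X^T X

lam_minus, lam_plus = mp_edges(n, m)
tol = 0.1                                     # slack for finite-size fluctuations at the edges
frac_inside = np.mean((eigvals >= lam_minus - tol) & (eigvals <= lam_plus + tol))
```

For pure noise, essentially the whole spectrum falls between the two edges, which is why eigenvalues well above the upper edge are the only candidates for signal.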

3. Denoising Procedures: Eigenvalue Shrinkage, Clipping, and Bias Correction

dCS employs a two-step denoising procedure for similarity estimation:

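  1. Eigenvalue Clipping (Noise Bulk Removal):
    • Compute the SVD of the column-normalized data $\tilde{X} = X D^{-1/2}$ to obtain $\tilde{X} = U \Sigma V^\top$; the eigenvalues of $S_{cos}$ are the squared singular values $\lambda_i = \sigma_i^2$.
    • All empirical eigenvalues $\lambda_i$ below a threshold $\lambda_+$ (the upper edge of the Marčenko–Pastur law) are set to zero. This eliminates noise-dominated modes.
  2. Explicit Mean Correction (Top Eigenvalue Shrinkage):
    • The largest empirical eigenvalue $\lambda_1$ is corrected by subtracting a rank-one bias term $b$, which is determined by the column mean $\mu$ of the data and its norm $\|\mu\|$.
    • Define the cleaned eigenvalues as:

    $$\hat{\lambda}_i = \begin{cases} \lambda_1 - b, & i = 1, \\ \lambda_i, & i > 1 \text{ and } \lambda_i \geq \lambda_+, \\ 0, & \text{otherwise.} \end{cases}$$

The denoised similarity matrix is $\hat{S} = V \hat{\Lambda} V^\top$, where $\hat{\Lambda} = \mathrm{diag}(\hat{\lambda}_1, \ldots, \hat{\lambda}_m)$ holds the cleaned eigenmodes.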

This process yields "cleaned cosine similarity" (Editor’s term: dCS), which better reflects true correlation structure and avoids overfitting to noise (Khawar et al., 2019).
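The two steps can be sketched as follows, under stated assumptions. The function name `clean_eigenvalues` is hypothetical, and `top_bias` is a placeholder argument for the mean-induced bias term: its exact expression is not reproduced here, so it defaults to zero and must be supplied by the caller. Flooring the corrected top eigenvalue at zero is also an assumption.

```python
import numpy as np

def clean_eigenvalues(eigvals, lam_plus, top_bias=0.0):
    """Clip the noise bulk and shrink the top eigenvalue.

    eigvals  : eigenvalues of S_cos, sorted in descending order
    lam_plus : clipping threshold (upper Marchenko-Pastur edge)
    top_bias : hypothetical placeholder for the rank-one mean bias term
    """
    lam = np.asarray(eigvals, dtype=float)
    cleaned = np.where(lam >= lam_plus, lam, 0.0)     # step 1: clipping
    if cleaned.size > 0:
        cleaned[0] = max(lam[0] - top_bias, 0.0)      # step 2: top-eigenvalue correction
    return cleaned

# Toy data: ten items share one latent factor, the other ninety are pure noise.
rng = np.random.default_rng(1)
n, m = 400, 100
X = rng.normal(size=(n, m))
X[:, :10] += 2.0 * rng.normal(size=(n, 1))

X_tilde = X / np.linalg.norm(X, axis=0)               # column-normalized data
_, sing, Vt = np.linalg.svd(X_tilde, full_matrices=False)
eigvals = sing ** 2                                   # eigenvalues of S_cos, descending
lam_plus = (1.0 + np.sqrt(m / n)) ** 2

cleaned = clean_eigenvalues(eigvals, lam_plus)
S_clean = (Vt.T * cleaned) @ Vt                       # V diag(cleaned) V^T
```

On this toy data the planted factor produces an eigenvalue far above the threshold, while the bulk of the spectrum is zeroed out, so `S_clean` is a low-rank matrix.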

4. dCS for Robust Representation Learning

dCS theory has been extended to self-supervised representation learning under noise (Nakagawa et al., 2023). Here, the problem is to learn representations $f_\theta(x)$ from noisy samples $x = s + \epsilon$, where $s$ is the latent clean signal and $\epsilon$ is isotropic, zero-mean noise. The dCS loss is a masked cosine-similarity loss rescaled by a correction weight: the cosine similarity is evaluated under a random mask $M$ and divided by a weight $w$ that corrects the bias introduced by normalization in the presence of noise. Under the specified noise model, dividing the masked cosine similarity by $w$ yields an unbiased surrogate for the alignment of the underlying signals (Theorem 1 in Nakagawa et al., 2023).

Practical estimation of the signal-to-noise ratio uses two independent noisy views $x^{(1)}, x^{(2)}$ of the same $s$, from which plug-in estimators are formed. For large feature dimension $D$, the correction factor $w$ can be directly approximated.
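The need for a correction can be seen in a small simulation. The sketch below is not the estimator or the weight from Nakagawa et al. (2023), and it omits masking entirely; it only illustrates, under an additive isotropic Gaussian noise assumption, that the cosine similarity between two noisy views of the same signal concentrates near $\mathrm{SNR}/(1+\mathrm{SNR})$ rather than 1, and that a signal-to-noise ratio can be estimated from the two views alone. All variable names and the simple moment-based estimator are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, sigma = 16384, 1.0

s = rng.normal(size=D)                      # latent clean signal
x1 = s + sigma * rng.normal(size=D)         # first noisy view
x2 = s + sigma * rng.normal(size=D)         # second, independent noisy view

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

raw = cos(x1, x2)                           # true alignment of the signals is cos(s, s) = 1

# Ground-truth SNR (uses the clean signal, unavailable in practice).
snr_true = float(s @ s) / (D * sigma ** 2)
expected_raw = snr_true / (1.0 + snr_true)  # large-D value of the raw cosine similarity

# Moment-based plug-in estimate from the two views only (illustrative assumption).
signal_power = float(x1 @ x2) / D                   # noise terms average out across views
total_power = 0.5 * float(x1 @ x1 + x2 @ x2) / D    # signal power plus noise power
snr_hat = signal_power / (total_power - signal_power)
```

With the signal and noise at equal power, `raw` lands near 0.5 even though the two views share exactly the same underlying signal, which is the kind of systematic shrinkage a bias-corrected loss is meant to undo.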

5. Implementation Guidelines and Computational Considerations

For memory-based recommenders, the denoised similarity estimator is implemented via the following steps (Khawar et al., 2019):

  1. Compute the column norms and means ($\|x_{:j}\|$, $\mu_j$).
  2. Form the column-normalized matrix $\tilde{X} = X D^{-1/2}$.
  3. Compute a truncated SVD of $\tilde{X}$ to obtain the largest singular values and the corresponding right singular vectors.
  4. Compute the largest eigenvalue $\lambda_1$ and correct it by subtracting the mean-induced bias term.
  5. Clip all eigenvalues below $\lambda_+$, the upper Marčenko–Pastur edge.
  6. Construct $\hat{\Lambda}$ and $\hat{V}$ by retaining only nonzero modes.
  7. The k-NN embedding $Z = \hat{V} \hat{\Lambda}^{1/2}$ allows fast similarity queries via dot products, since $Z Z^\top$ equals the denoised similarity matrix.
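The last two steps, building the embedding and querying it, can be sketched as below. This is an illustration under assumptions, not the authors' implementation: `dcs_item_embedding` and `top_k_neighbors` are hypothetical names, `top_bias` is again a placeholder for the bias term (zero by default), and a full `numpy.linalg.svd` stands in for a truncated solver, which would be the natural choice at scale (for example `scipy.sparse.linalg.svds`).

```python
import numpy as np

def dcs_item_embedding(X, top_bias=0.0):
    """Return Z (m x r) such that Z @ Z.T is the cleaned similarity matrix."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    norms = np.linalg.norm(X, axis=0)
    norms = np.where(norms > 0, norms, 1.0)            # assumption: guard zero columns
    _, sing, Vt = np.linalg.svd(X / norms, full_matrices=False)
    lam = sing ** 2                                    # eigenvalues, descending
    lam_plus = (1.0 + np.sqrt(m / n)) ** 2             # upper Marchenko-Pastur edge
    cleaned = np.where(lam >= lam_plus, lam, 0.0)      # clip the noise bulk
    cleaned[0] = max(lam[0] - top_bias, 0.0)           # placeholder top-eigenvalue correction
    keep = cleaned > 0                                 # retain only nonzero modes
    return Vt[keep].T * np.sqrt(cleaned[keep])

def top_k_neighbors(Z, item, k):
    """Indices of the k items most similar to `item` under the cleaned similarity."""
    scores = Z @ Z[item]                               # one row of Z Z^T via dot products
    scores[item] = -np.inf                             # exclude the query item itself
    return np.argsort(scores)[::-1][:k]

# Toy data: items 0..9 share one latent factor, the rest are pure noise.
rng = np.random.default_rng(1)
n, m = 400, 100
X = rng.normal(size=(n, m))
X[:, :10] += 2.0 * rng.normal(size=(n, 1))

Z = dcs_item_embedding(X)
neighbors = top_k_neighbors(Z, item=0, k=5)
```

Because only a few modes survive the clipping, `Z` has far fewer columns than there are items, so each query costs a single thin matrix-vector product instead of a lookup in the dense $m \times m$ similarity matrix.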

For the dCS loss in deep networks, the practical algorithm samples masks, applies view-specific corruption (e.g., Blind-Spot Masking), and computes the unbiased loss using either a Monte Carlo or a large-$D$ approximation for the weight $w$. Empirically, moderate mask rates (up to 0.3) are recommended; Nakagawa et al. (2023) also give a recommended range of Monte Carlo sample sizes.

6. Empirical Results and Performance Assessment

In k-NN recommendation, dCS produces similarity matrices whose spectra are closer (in spectral norm) to low-rank ground truth than both raw cosine and Pearson matrices, leading to higher accuracy and increased recommendation diversity (Khawar et al., 2019).

In representation learning, dCS-regularized objectives have demonstrated superior or more stable downstream performance compared to standard CS, MSE, and recent self-supervised denoising baselines across vision and speech benchmarks. For example, in noisy MNIST, dCS yielded higher linear evaluation accuracy and degraded gracefully under increasing noise. On real-world datasets (MNIST, USPS, Pendigits, Fashion-MNIST), dCS matched or surpassed baselines. When used as a regularizer in SimSiam on CIFAR-100 and Tiny-ImageNet, dCS led to consistent gains over both CS-based and Noise2Void baselines, and in speech (ESC-50) with a Vision Transformer AE, dCS substantially outperformed other loss formulations (Nakagawa et al., 2023).

7. Theoretical Guarantees, Limitations, and Future Directions

dCS is supported by theoretical guarantees on spectral denoising, bias correction, and statistical concentration. The estimator for the weight $w$ is accurate up to an error term that depends on the feature dimension under mild tail assumptions, and the surrogate loss tightly bounds the true signal alignment. However, dCS presumes isotropic, zero-mean noise and relies on masking schemes (e.g., Bernoulli masking); non-isotropic or structured noise remains an open area for extension. Potential avenues include adapting the correction for more general noise models, learning the masking distribution, and integrating dCS into end-to-end architectures, such as masked autoencoders in vision or speech (Nakagawa et al., 2023).


Key References:

  • Khawar et al., 2019. "Cleaned Similarity for Better Memory-Based Recommenders". Main contribution: spectral theory and practical estimator for dCS.
  • Nakagawa et al., 2023. "Denoising Cosine Similarity: A Theory-Driven Approach for Efficient Representation Learning". Main contribution: bias-corrected loss for robust representation learning.

These methodologies collectively define the modern denoising cosine similarity framework for robust similarity estimation and noise-aware representation learning.
