Denoising Cosine Similarity (dCS)

Updated 22 April 2026

dCS is a mathematically principled method that corrects bias and reduces noise in cosine similarity estimates using eigenvalue clipping and explicit mean correction.
It leverages random matrix theory and statistical estimators to isolate the true signal from noise, substantially improving performance in k-nearest neighbor systems.
dCS enhances self-supervised representation learning by offering bias-corrected loss functions that lead to more stable and accurate downstream performance.

Denoising Cosine Similarity (dCS) refers to a suite of mathematically principled methods that correct the bias and reduce the noise intrinsic to empirical cosine similarity estimates, particularly when data are contaminated by sampling artifacts or stochastic noise. dCS approaches have recently been formalized both as spectral-cleaning operators for collaborative filtering and as self-supervised loss functions for robust representation learning. These procedures employ random matrix theory, eigenvalue shrinkage, explicit mean-correction, and statistical estimators to isolate true signal from noise-driven spurious similarity, substantially improving downstream performance in k-nearest neighbor (k-NN) systems and deep autoencoder frameworks (Khawar et al., 2019, Nakagawa et al., 2023).

1. Foundations of Cosine Similarity and Its Limitations

Cosine similarity is widely adopted for quantifying alignment between vectors in memory-based recommender systems and as an objective in representation learning. Given two vectors $u, v \in \mathbb{R}^D$, the standard cosine similarity is $\mathrm{CS}(u, v) = \langle u, v \rangle / ( \| u \|_2 \| v \|_2 )$, and its negative serves as the loss

$\ell_{CS}(u, v) = - \frac{ \langle u, v \rangle }{ \| u \|_2 \| v \|_2 }.$

In collaborative filtering, the empirical cosine similarity matrix between $m$ items from an $n \times m$ user-item matrix $X$ (with $x_{ij} \in \mathbb{R}$) is constructed as $S_{cos} = D^{-1/2} X^\top X D^{-1/2}$, where $D = \mathrm{diag}( \|x_{:1}\|^2, ..., \|x_{:m}\|^2 )$ holds the squared column norms. However, when $X$ is noisy or $n$ and $m$ are comparable in size, empirical estimates can exhibit strong noise-induced eigenvalue spread and a systematic overestimation of the leading eigenvalues, especially due to nonzero mean effects (Khawar et al., 2019, Nakagawa et al., 2023).
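As a concrete illustration of the construction above, the following minimal NumPy sketch builds the item-item cosine similarity matrix from a small rating matrix. The function name `cosine_similarity_matrix` and the toy ratings are assumptions made for this example, not part of either cited paper.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Item-item cosine similarity S = D^{-1/2} X^T X D^{-1/2} for an n x m matrix X."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=0)            # diagonal of D: squared column norms
    inv_norms = np.zeros_like(sq_norms)
    nonzero = sq_norms > 0
    inv_norms[nonzero] = 1.0 / np.sqrt(sq_norms[nonzero])  # guard all-zero columns
    Xn = X * inv_norms                           # column-normalized data X D^{-1/2}
    return Xn.T @ Xn

# Toy 4-user x 3-item rating matrix (made-up values).
X = np.array([[5.0, 4.0, 0.0],
              [4.0, 5.0, 1.0],
              [0.0, 1.0, 5.0],
              [1.0, 0.0, 4.0]])
S = cosine_similarity_matrix(X)
print(np.round(S, 3))
```

Each diagonal entry equals 1 because every column is scaled to unit norm before the inner products are taken.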

2. Spectral Properties and Random Matrix Theory Analysis

Random Matrix Theory (RMT) provides the statistical underpinning for dCS corrections. For noise-only matrices ($X \in \mathbb{R}^{n \times m}$ with i.i.d. zero-mean, unit-variance entries), the empirical eigenvalue spectrum of $\frac{1}{n} X^\top X$ follows the Marčenko–Pastur law: $\rho_{MP}(\lambda) = \frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2 \pi q \lambda}$ with $q = m/n$, $\lambda_\pm = (1 \pm \sqrt{q})^2$, $\lambda \in [\lambda_-, \lambda_+]$. This implies that in practical regimes ($n$ and $m$ of comparable size), most empirical eigenvalues arise from noise ('noise bulk'), and only those above $\lambda_+$ are potentially signal (Khawar et al., 2019). Notably, normalization by the column norms in $S_{cos}$ induces an inherent eigenvalue shrinkage relative to the Pearson estimator, but does not fully correct for mean-induced inflation of the largest eigenvalue.
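The noise-bulk picture can be checked numerically. The sketch below assumes the standard Marčenko–Pastur edges $(1 \pm \sqrt{m/n})^2$ for unit-variance noise and compares them with the spectrum of a purely random matrix. The helper name `mp_edges` and the matrix sizes are choices made for this illustration.

```python
import numpy as np

def mp_edges(n, m):
    """Lower and upper Marchenko-Pastur edges for an n x m unit-variance noise matrix."""
    q = m / n
    return (1.0 - np.sqrt(q)) ** 2, (1.0 + np.sqrt(q)) ** 2

rng = np.random.default_rng(0)
n, m = 1000, 200
X = rng.standard_normal((n, m))                # noise only: no signal present
eigvals = np.linalg.eigvalsh(X.T @ X / n)      # spectrum of the sample covariance

lam_minus, lam_plus = mp_edges(n, m)
print("MP edges:          ", lam_minus, lam_plus)
print("empirical min, max:", eigvals.min(), eigvals.max())
```

For finite $n$ and $m$ the extreme eigenvalues sit close to, but not exactly on, the theoretical edges, which is why a threshold at the upper edge is a heuristic cut rather than an exact separator.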

3. Denoising Procedures: Eigenvalue Shrinkage, Clipping, and Bias Correction

dCS employs a two-step denoising procedure for similarity estimation:

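Eigenvalue Clipping (Noise Bulk Removal):
- Compute the SVD of the column-normalized data $\tilde{X} = X D^{-1/2}$ to obtain the eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots$ and eigenvectors $v_i$ of $S_{cos} = \tilde{X}^\top \tilde{X}$.
- All empirical eigenvalues $\lambda_i$ below a threshold $\lambda_+$ (the upper edge of the noise bulk, from the MP law) are set to zero. This eliminates noise-dominated modes.

Explicit Mean Correction (Top Eigenvalue Shrinkage):
- The largest empirical eigenvalue $\lambda_1$ is corrected by subtracting a rank-one bias term $b$, which is computed from the vector of column means of the data and its norm (the exact expression is given in Khawar et al., 2019).
- Define the cleaned eigenvalues as:

$\tilde{\lambda}_i = \begin{cases} \lambda_1 - b, & i = 1, \\ \lambda_i, & i > 1 \text{ and } \lambda_i \ge \lambda_+, \\ 0, & \text{otherwise.} \end{cases}$

The denoised similarity matrix is $\hat{S} = \sum_i \tilde{\lambda}_i v_i v_i^\top$, built from the cleaned eigenmodes.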

This process yields "cleaned cosine similarity" (Editor’s term: dCS), which better reflects true correlation structure and avoids overfitting to noise (Khawar et al., 2019).
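The two-step rule can be sketched on a plain list of eigenvalues. This is a minimal illustration, not the reference implementation from Khawar et al., 2019: the function name `clean_eigenvalues` is made up, the threshold is assumed to be the Marčenko–Pastur upper edge $(1 + \sqrt{m/n})^2$, and `top_bias` is a hypothetical stand-in for the mean-induced bias term, whose exact form is not reproduced here.

```python
import numpy as np

def clean_eigenvalues(eigvals, n, m, top_bias=0.0):
    """Clip eigenvalues below the MP upper edge and shrink the largest one.

    `top_bias` is a hypothetical placeholder for the mean-correction term.
    """
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # descending order
    lam_plus = (1.0 + np.sqrt(m / n)) ** 2                   # assumed noise threshold
    cleaned = np.where(lam >= lam_plus, lam, 0.0)            # zero out the noise bulk
    cleaned[0] = lam[0] - top_bias                           # correct the top eigenvalue
    return cleaned

# With n=100 and m=25 the threshold is (1 + 0.5)^2 = 2.25.
cleaned = clean_eigenvalues([0.5, 5.0, 1.0, 3.0], n=100, m=25, top_bias=1.0)
print(cleaned)
```

In this toy call the two eigenvalues below 2.25 are removed and the largest is reduced by the supplied bias, leaving two retained modes.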

4. dCS for Robust Representation Learning

dCS theory has been extended to self-supervised representation learning under noise (Nakagawa et al., 2023). Here, the problem is to learn representations from noisy samples $\tilde{x} = x + \varepsilon$, where $x$ is the latent clean signal and $\varepsilon$ is isotropic, zero-mean noise. The dCS loss is built from a masked cosine similarity, defined with a random mask, and a weight that corrects the bias introduced by normalization in the presence of noise. Under the specified noise model, dividing the masked cosine similarity by this weight yields an unbiased surrogate for the alignment of the underlying signals (Theorem 1 in Nakagawa et al., 2023).
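A small numerical sketch can motivate why a correction weight is needed. Under isotropic zero-mean noise with per-coordinate standard deviation $\sigma$, the cosine similarity between a noisy sample and its own clean signal concentrates near $\|x\| / \sqrt{\|x\|^2 + D\sigma^2}$ instead of 1 when the dimension $D$ is large. The values of `D` and `sigma` below are arbitrary, and the attenuation factor shown is a generic consequence of the noise model, not the specific weight defined in Nakagawa et al., 2023.

```python
import numpy as np

def cosine(u, v):
    """Plain cosine similarity between two 1-D arrays."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
D, sigma = 10_000, 0.02                        # illustrative dimension and noise level
x = rng.standard_normal(D)
x /= np.linalg.norm(x)                         # clean signal with unit norm
noisy = x + sigma * rng.standard_normal(D)     # isotropic, zero-mean noise

observed = cosine(noisy, x)
predicted = 1.0 / np.sqrt(1.0 + D * sigma ** 2)   # ||x|| / sqrt(||x||^2 + D sigma^2)
print("observed cosine:", observed)
print("predicted factor:", predicted)
```

Because the true alignment of `x` with itself is 1, the observed value is essentially the attenuation factor, and dividing by a suitable weight is what removes it.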

Practical estimation of the signal-to-noise ratio uses two independent noisy views $\tilde{x}^{(1)}, \tilde{x}^{(2)}$ of the same clean signal $x$, with plug-in estimators computed from the pair. For large feature dimension $D$, the correction factor can be approximated directly by an explicit expression (Nakagawa et al., 2023).
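One simple way to form plug-in estimates from two views relies on two identities that hold under independent zero-mean noise: the expected inner product of the views equals $\|x\|^2$, and the expected squared distance between them equals twice the total noise energy $D\sigma^2$. The sketch below uses these identities. The function name, the energy-ratio definition of the signal-to-noise ratio, and all numeric values are assumptions for illustration and are not necessarily the estimators used in Nakagawa et al., 2023.

```python
import numpy as np

def estimate_signal_noise(view1, view2):
    """Plug-in estimates of signal energy ||x||^2 and total noise energy D*sigma^2."""
    signal_energy = float(view1 @ view2)                       # E<x+e1, x+e2> = ||x||^2
    noise_energy = 0.5 * float(np.sum((view1 - view2) ** 2))   # E||e1-e2||^2 = 2*D*sigma^2
    return signal_energy, noise_energy

rng = np.random.default_rng(1)
D, sigma = 10_000, 0.02
x = rng.standard_normal(D)
x *= 2.0 / np.linalg.norm(x)                   # clean signal with ||x||^2 = 4
v1 = x + sigma * rng.standard_normal(D)        # first noisy view
v2 = x + sigma * rng.standard_normal(D)        # second, independent noisy view

sig, noi = estimate_signal_noise(v1, v2)
print("signal energy estimate:", sig, "(true 4.0)")
print("noise energy estimate: ", noi, "(true", D * sigma ** 2, ")")
print("estimated SNR:         ", sig / noi)
```

Neither estimate requires access to the clean signal, which is the point of using two views.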

5. Implementation Guidelines and Computational Considerations

For memory-based recommenders, the denoised similarity estimator is implemented via the following steps (Khawar et al., 2019):

- Compute the column norms $\|x_{:j}\|$ and column means of $X$.
- Form the column-normalized matrix $\tilde{X} = X D^{-1/2}$.
- Compute a truncated SVD of $\tilde{X}$ to obtain the leading singular values and the corresponding right singular vectors.
- Compute the largest eigenvalue and correct it with the mean-induced bias term.
- Clip all eigenvalues below the Marčenko–Pastur threshold.
- Construct the cleaned eigenvalue matrix and the matching eigenvector matrix by retaining only the nonzero modes.
- Form the k-NN embedding from the retained modes, which allows fast similarity queries via dot-products.
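The steps above can be sketched end to end in NumPy. This is a simplified illustration under stated assumptions: it uses a full thin SVD instead of a truncated solver, takes the threshold to be $(1 + \sqrt{m/n})^2$, and exposes the mean correction as a hypothetical `top_bias` argument because the exact bias formula is not reproduced here. The function name and the synthetic data are also made up for the example.

```python
import numpy as np

def cleaned_item_embedding(X, top_bias=0.0):
    """Return an m x r matrix E such that E @ E.T is the cleaned similarity matrix.

    `top_bias` is a hypothetical stand-in for the mean-correction term.
    """
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0                          # leave all-zero columns unchanged
    Xn = X / norms                                   # column-normalized matrix
    # Thin SVD for simplicity; a truncated solver would be preferable at scale.
    _, s, Vt = np.linalg.svd(Xn, full_matrices=False)
    lam = s ** 2                                     # eigenvalues of Xn^T Xn, descending
    lam_plus = (1.0 + np.sqrt(m / n)) ** 2           # assumed Marchenko-Pastur threshold
    cleaned = np.where(lam >= lam_plus, lam, 0.0)    # clip the noise bulk
    cleaned[0] = max(lam[0] - top_bias, 0.0)         # floor at zero so sqrt is defined
    keep = cleaned > 0
    return Vt[keep].T * np.sqrt(cleaned[keep])       # scale each retained eigenvector

# Synthetic data: a rank-3 signal plus unit-variance noise (illustrative only).
rng = np.random.default_rng(0)
n, m, rank = 400, 100, 3
signal = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, m))
X = 3.0 * signal + rng.standard_normal((n, m))

E = cleaned_item_embedding(X)
S_clean = E @ E.T                                    # cleaned item-item similarities
neighbors = np.argsort(-S_clean[0])[:5]              # five most similar items to item 0
print(E.shape, neighbors)
```

Storing `E` instead of the full similarity matrix keeps memory proportional to the number of retained modes, and any pairwise similarity is recovered with a single dot-product between two rows.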

For the dCS loss in deep networks, the practical algorithm samples masks, applies view-specific corruption (e.g., Blind-Spot Masking), and computes the unbiased loss using either a Monte Carlo estimate or the large-dimension approximation of the correction weight. Empirically, moderate mask rates (a range ending at 0.3) are recommended, along with a specific range of Monte Carlo sample sizes (Nakagawa et al., 2023).
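The Monte Carlo part of this procedure can be sketched as follows. The example assumes Bernoulli masks, assumes the cosine similarity is evaluated on the masked coordinates (the text does not specify this), omits the network forward pass entirely, and takes the correction weight as an input rather than computing it. The function name, the default of 8 masks, and the argument names are illustrative choices, not values from Nakagawa et al., 2023.

```python
import numpy as np

def masked_cosine_loss(pred, target, mask_rate=0.3, num_masks=8, weight=1.0, rng=None):
    """Monte Carlo average of the negative cosine similarity on masked coordinates,
    divided by an externally supplied correction weight (assumed known here)."""
    rng = np.random.default_rng() if rng is None else rng
    losses = []
    for _ in range(num_masks):
        mask = rng.random(pred.shape[0]) < mask_rate     # Bernoulli mask over coordinates
        if not mask.any():
            continue                                     # skip empty masks
        u, v = pred[mask], target[mask]
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        if denom == 0.0:
            continue                                     # skip degenerate draws
        losses.append(-(u @ v) / denom)
    if not losses:
        raise ValueError("no usable mask was sampled")
    return float(np.mean(losses)) / weight

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
loss_same = masked_cosine_loss(x, x, rng=rng)            # identical inputs
print(loss_same)
```

With identical inputs every masked cosine similarity is exactly 1, so the loss equals minus one over the weight, which gives a quick sanity check before plugging in real predictions.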

6. Empirical Results and Performance Assessment

In k-NN recommendation, dCS produces similarity matrices whose spectra are closer (in spectral norm) to low-rank ground truth than both raw cosine and Pearson matrices, leading to higher accuracy and increased recommendation diversity (Khawar et al., 2019).

In representation learning, dCS-regularized objectives have demonstrated superior or more stable downstream performance compared to standard CS, MSE, and recent self-supervised denoising baselines across vision and audio benchmarks. For example, on noisy MNIST, dCS yielded higher linear evaluation accuracy and degraded gracefully under increasing noise. On real-world datasets (MNIST, USPS, Pendigits, Fashion-MNIST), dCS matched or surpassed baselines. When used as a regularizer in SimSiam on CIFAR-100 and Tiny-ImageNet, dCS led to consistent gains over both CS-based and Noise2Void baselines. On the ESC-50 environmental sound dataset with a Vision Transformer AE, dCS substantially outperformed other loss formulations (Nakagawa et al., 2023).

7. Theoretical Guarantees, Limitations, and Future Directions

dCS is supported by theoretical guarantees on spectral denoising, bias correction, and statistical concentration. The estimator for the correction weight is accurate up to an error term that depends on the feature dimension, under mild tail assumptions, and the surrogate loss tightly bounds the true signal alignment. However, dCS presumes isotropic, zero-mean noise and relies on masking schemes (e.g., Bernoulli masking); non-isotropic or structured noise remains an open area for extension. Potential avenues include adapting the correction for more general noise models, learning the masking distribution, and integrating dCS into end-to-end architectures, such as masked autoencoders in vision or speech (Nakagawa et al., 2023).

Key References:

Title	Citation	Main Contribution
Cleaned Similarity for Better Memory-Based Recommenders	(Khawar et al., 2019)	Spectral theory and practical estimator for dCS
Denoising Cosine Similarity: A Theory-Driven Approach for Efficient Representation Learning	(Nakagawa et al., 2023)	Bias-corrected loss for robust representation

These methodologies collectively define the modern denoising cosine similarity framework for robust similarity estimation and noise-aware representation learning.

References

Khawar et al. (2019). Cleaned Similarity for Better Memory-Based Recommenders.

Nakagawa et al. (2023). Denoising Cosine Similarity: A Theory-Driven Approach for Efficient Representation Learning.