
Cosine-based Affinity Matrices

Updated 23 March 2026
  • Cosine-based affinity matrices are defined as normalized inner products between data points, providing a robust measure of pairwise similarities used in clustering and recommendation systems.
  • They employ L2-normalization, noise correction, and eigenvalue shrinkage to ensure accurate spectral properties and enhance performance in diverse applications.
  • Generalizations using Bregman-angle affinity and isotropic preprocessing further refine these matrices for precise embeddings and reliable recovery of signal structure.

Cosine-based affinity matrices are fundamental objects in data analysis, machine learning, and information retrieval, encoding pairwise similarity between data points via normalized inner products. These affinity matrices underpin memory-based recommendation, spectral clustering, biological association mining, and a wide range of embedding-based retrieval pipelines. Their construction, spectral properties, normalization strategies, and mathematical pitfalls have been investigated across multiple domains using advanced statistical and matrix analysis tools.

1. Construction and Definitions

Given $n$ data points $x_1,\dots,x_n$ in $\mathbb{R}^d$, the canonical cosine similarity between $x_i$ and $x_j$ is defined as

$$\cos(x_i, x_j) = \frac{x_i^T x_j}{\|x_i\|\,\|x_j\|}.$$

The resulting affinity matrix $A \in \mathbb{R}^{n \times n}$ has entries $A_{ij} = \cos(x_i, x_j)$. In collaborative filtering, given a user–item interaction matrix $X \in \mathbb{R}^{n \times m}$, the empirical cosine similarity matrix between items is

$$S_{\mathrm{cos}} = D^{-1/2} (X^T X) D^{-1/2},$$

where $D = \mathrm{diag}(X^T X)$ is the diagonal matrix of squared column norms, with $D_{jj} = \|X_{:,j}\|^2$ (Khawar et al., 2019). Equivalently, forming the column-normalized matrix $\tilde{X} = X D^{-1/2}$, one has $S_{\mathrm{cos}} = \tilde{X}^T \tilde{X}$.
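The two equivalent constructions can be checked numerically. The following is a minimal NumPy sketch, assuming `D` holds the squared column norms of `X`; the function names and the toy matrix are illustrative and not taken from the cited work.

```python
import numpy as np

def cosine_affinity(X, eps=1e-12):
    """Item-item cosine matrix D^{-1/2} (X^T X) D^{-1/2} for an n x m matrix X."""
    X = np.asarray(X, dtype=float)
    gram = X.T @ X                                   # m x m raw inner products
    d = np.diag(gram)                                # squared column norms (diagonal of D)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(d, eps))     # guard against all-zero columns
    return gram * inv_sqrt[:, None] * inv_sqrt[None, :]

def cosine_affinity_normalized(X, eps=1e-12):
    """Same matrix, obtained by normalizing columns first and then taking the Gram matrix."""
    X = np.asarray(X, dtype=float)
    norms = np.maximum(np.linalg.norm(X, axis=0), eps)
    X_tilde = X / norms                              # each column now has unit L2 norm
    return X_tilde.T @ X_tilde

# Toy user-item matrix: 4 users (rows), 3 items (columns); values are made up.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [4.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
S1 = cosine_affinity(X)
S2 = cosine_affinity_normalized(X)
print(np.allclose(S1, S2))
```

Normalizing first avoids forming the unnormalized Gram matrix, which can matter when column norms differ by many orders of magnitude.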

For alternative applications, cosine-based affinity matrices may arise from more general similarity measures, such as the "Bregman-angle" cosine between surface normals of a convex cost function $f$, $\cos\theta_{ij} = \hat{n}_i^T \hat{n}_j$, with $\hat{n}_i$ the unit normal to the graph of $f$ at $x_i$ (Gunay et al., 2014).

In bibliometrics, one constructs the item–item (author–author, word–word) co-occurrence matrix $C = O^T O$ from an occurrence matrix $O$. The correct recovery of cosine affinities in this setting employs the Ochiai coefficient (Zhou et al., 2015): $\mathrm{Ochiai}_{ij} = C_{ij} / \sqrt{C_{ii}\,C_{jj}}$, which yields the same values as computing the cosine directly from the columns of $O$.
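A small numerical check makes the equivalence concrete. This sketch assumes a binary occurrence matrix `O` (documents in rows, items in columns) and a co-occurrence matrix formed as `O.T @ O`; the data and names are made up for illustration.

```python
import numpy as np

def ochiai_from_cooccurrence(C):
    """Ochiai coefficient C_ij / sqrt(C_ii * C_jj) from a co-occurrence matrix."""
    C = np.asarray(C, dtype=float)
    diag = np.diag(C)                      # C_ii = number of occurrences of item i
    return C / np.sqrt(np.outer(diag, diag))

def column_cosine(M):
    """Cosine similarity between the columns of M."""
    M = np.asarray(M, dtype=float)
    norms = np.linalg.norm(M, axis=0)
    return (M.T @ M) / np.outer(norms, norms)

# Hypothetical occurrence matrix: 4 documents x 3 items.
O = np.array([[1, 1, 0],
              [1, 0, 1],
              [1, 1, 1],
              [0, 1, 0]], dtype=float)
C = O.T @ O                                # item-item co-occurrence counts

ochiai = ochiai_from_cooccurrence(C)
cos_on_O = column_cosine(O)                # cosine on the occurrence matrix
cos_on_C = column_cosine(C)                # cosine applied to C itself (double normalization)

print(np.allclose(ochiai, cos_on_O))
print(np.round(cos_on_C - ochiai, 3))
```

In this toy case the cosine applied to `C` is larger than the Ochiai value for every off-diagonal pair, which is the kind of inflation the double-normalization warning refers to.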

2. Spectral Properties and Noise Effects

Random matrix theory describes the spectral behavior of empirical cosine affinity matrices, particularly under noisy input data. For $X$ a random $n \times m$ matrix with i.i.d. zero-mean unit-variance entries, the eigenvalue spectrum of $\frac{1}{n} X^T X$ converges to the Marčenko–Pastur law as $n, m \to \infty$ with $m/n \to q$ (Khawar et al., 2019). Key properties include:

  • All nontrivial eigenvalues are confined to the interval $[\lambda_-, \lambda_+]$, $\lambda_- = (1 - \sqrt{q})^2$, $\lambda_+ = (1 + \sqrt{q})^2$, with $q = m/n$.
  • The spectrum of the cosine affinity matrix $S_{\mathrm{cos}}$ exhibits natural shrinkage of the non-top eigenvalues relative to those of the (demeaned) Pearson correlation estimator, meaning vanilla cosine induces shrinkage "for free".
  • Non-centered data leads to a top-eigenvalue overestimation: decomposing $X = X_c + \mathbf{1}\mu^T$ (centered + mean matrix, with $\mu_j$ the mean of column $j$), one extra rank-1 term $n\,\mu\mu^T$ in $X^T X$ lifts one eigenvalue by an amount that grows with $n\|\mu\|^2$, necessitating explicit correction for accurate downstream use.
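The noise bulk and the effect of a nonzero column mean can be illustrated with a short simulation. This is a sketch under stated assumptions: the Gram matrix is scaled by `1/n`, entries are standard Gaussian, the column mean of 0.5 is arbitrary, and the helper name `marchenko_pastur_edges` is hypothetical.

```python
import numpy as np

def marchenko_pastur_edges(n_rows, n_cols):
    """Bulk edges (1 -/+ sqrt(q))^2 with q = n_cols / n_rows, for unit-variance entries."""
    q = n_cols / n_rows
    return (1.0 - np.sqrt(q)) ** 2, (1.0 + np.sqrt(q)) ** 2

rng = np.random.default_rng(0)
n, m = 2000, 200
X = rng.standard_normal((n, m)) + 0.5      # pure noise plus a constant column mean
mu = X.mean(axis=0)                        # empirical column means
X_c = X - mu                               # centered data

lam_minus, lam_plus = marchenko_pastur_edges(n, m)
eig_centered = np.linalg.eigvalsh(X_c.T @ X_c / n)
eig_raw = np.linalg.eigvalsh(X.T @ X / n)

# Cross terms vanish because the columns of X_c sum to zero,
# so the uncentered Gram matrix differs by exactly one rank-1 term.
gram_identity_holds = np.allclose(X.T @ X, X_c.T @ X_c + n * np.outer(mu, mu))

print(lam_minus, lam_plus)
print(eig_centered.min(), eig_centered.max(), eig_raw.max())
```

With centered noise the eigenvalues stay close to the predicted interval, while the uncentered matrix shows a single eigenvalue far above the upper edge, driven by the mean term rather than by any item-item structure.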

3. Normalization, Gauge Ambiguity, and Proper Construction

Cosine similarity’s meaning depends crucially on normalization. For embeddings learned with general dot-product objectives (e.g., matrix factorization), the similarity is arbitrary up to an invertible diagonal "gauge" matrix $G$, since $U V^T = (U G)(V G^{-1})^T$ (Bouhsine, 23 Feb 2026). The cosine between rows $u_i, u_j$ in the rescaled factor $U G$ is $\frac{u_i^T G^2 u_j}{\|G u_i\|\,\|G u_j\|}$, which can be rendered arbitrary by adjusting $G$.
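The gauge ambiguity is easy to reproduce. The sketch below assumes a two-factor dot-product model with hypothetical factor matrices `U` and `V` and an arbitrary diagonal rescaling `G`; it is an illustration, not the construction from the cited work.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
U = rng.standard_normal((4, 3))            # e.g. user factors (hypothetical)
V = rng.standard_normal((5, 3))            # e.g. item factors (hypothetical)

G = np.diag([1.0, 10.0, 0.1])              # arbitrary invertible diagonal rescaling
U_g = U @ G
V_g = V @ np.linalg.inv(G)

# The dot-product predictions are identical under the rescaling ...
same_predictions = np.allclose(U @ V.T, U_g @ V_g.T)

# ... but the cosine between two rows of U is not.
cos_before = cosine(U[0], U[1])
cos_after = cosine(U_g[0], U_g[1])
print(same_predictions, cos_before, cos_after)
```

Because a dot-product objective cannot distinguish `(U, V)` from `(U_g, V_g)`, any cosine computed on the raw factors inherits an arbitrary choice of scale per latent dimension.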

Enforcing L2-normalization of each embedding vector, restricting $x_i$ to the unit sphere $S^{d-1}$, eliminates this gauge freedom entirely:

  • For unit-normed embeddings, cosine similarity reduces to the plain dot product.
  • On the sphere, cosine and squared Euclidean distance are linearly related:

$$\|x_i - x_j\|^2 = 2 - 2\cos(x_i, x_j).$$

Thus, neighbor rankings under cosine and Euclidean distance become identical.
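For unit vectors, expanding the squared distance gives $\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2a^Tb = 2 - 2\cos(a, b)$. A brief NumPy sketch of this relation and of the matching neighbor rankings follows; the random embeddings and the helper `l2_normalize` are illustrative assumptions.

```python
import numpy as np

def l2_normalize(X, eps=1e-12):
    """Project each row of X onto the unit sphere."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

rng = np.random.default_rng(2)
E = l2_normalize(rng.standard_normal((6, 4)))     # 6 unit-norm embeddings in R^4

cos = E @ E.T                                     # cosine equals dot product on the sphere
diff = E[:, None, :] - E[None, :, :]
sq_dist = (diff ** 2).sum(axis=-1)                # pairwise squared Euclidean distances

query = 0
rank_by_cos = np.argsort(-cos[query])             # most similar first
rank_by_dist = np.argsort(sq_dist[query])         # closest first
print(np.allclose(sq_dist, 2.0 - 2.0 * cos))
print(rank_by_cos, rank_by_dist)
```

One practical consequence is that a nearest-neighbor index built for Euclidean distance can serve cosine queries once the vectors are normalized.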

For co-occurrence matrices, applying standard cosine normalization or Pearson $r$ to the co-occurrence matrix $C$ induces double normalization, overestimating affinities and distorting downstream clustering (Zhou et al., 2015). The correct approach is to use the Ochiai coefficient on $C$ or the cosine directly on the original occurrence matrix $O$.

4. Variance of Cosine Similarity and the Isotropic Principle

The statistical properties of entries in cosine-based affinity matrices depend on the input data's covariance structure. For zero-mean data with covariance $\Sigma$ (eigenvalues $\lambda_1,\dots,\lambda_d$), asymptotic analysis shows (Smith et al., 2023): $\mathrm{Var}[\cos(x_i, x_j)] \approx \frac{\sum_k \lambda_k^2}{\left(\sum_k \lambda_k\right)^2}$. The variance of cosine similarity is minimized when the covariance is isotropic ($\lambda_k$ constant over $k$): this is the "isotropic principle". Preprocessing data via whitening ($x \mapsto \Sigma^{-1/2} x$) or more general isotropic linear maps ensures the affinity matrix's null distribution is as sharp as possible, improving discriminative power for clustering and retrieval tasks.
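The effect of whitening on the predicted variance can be sketched in a few lines. The code assumes the leading-order approximation $\mathrm{tr}(\Sigma^2)/\mathrm{tr}(\Sigma)^2$ for the variance of the cosine between independent zero-mean samples; the function names, the symmetric (ZCA-style) whitening, and the synthetic scales are illustrative choices.

```python
import numpy as np

def predicted_cosine_variance(cov):
    """Leading-order approximation sum(lambda^2) / sum(lambda)^2 from covariance eigenvalues."""
    eig = np.linalg.eigvalsh(cov)
    return float((eig ** 2).sum() / eig.sum() ** 2)

def whiten(X):
    """Symmetric whitening: center, then multiply by the inverse square root of the covariance."""
    X_c = X - X.mean(axis=0)
    cov = np.cov(X_c, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    W = eigvec @ np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T
    return X_c @ W

rng = np.random.default_rng(3)
d = 8
scales = np.linspace(0.5, 4.0, d)                   # anisotropic per-feature scales (made up)
X = rng.standard_normal((5000, d)) * scales

var_aniso = predicted_cosine_variance(np.cov(X, rowvar=False))
var_white = predicted_cosine_variance(np.cov(whiten(X), rowvar=False))
print(var_aniso, var_white, 1.0 / d)
```

After whitening all eigenvalues are equal, so the predicted variance reaches its lower bound of $1/d$; the anisotropic data sits above that bound.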

Modern practice extends this by optimizing over a parameterized family of transformations, maximizing downstream objectives (e.g., spectral cutting, recall) in an end-to-end manner by backpropagating gradients through the cosine similarities.

5. Generalizations: Convex Cost Functions and Bregman-Angle Matrices

Cosine-based affinity can be generalized via the angular structure of surface normals to convex cost functions ("Bregman-angle" similarity) (Gunay et al., 2014). Let $f: \mathbb{R}^d \to \mathbb{R}$ be a strictly convex (possibly non-differentiable) function. For each $x_i$, define the (possibly sub-)gradient $g_i \in \partial f(x_i)$ and lifted normal $n_i = (g_i, -1) \in \mathbb{R}^{d+1}$, then normalize $\hat{n}_i = n_i / \|n_i\|$. The affinity between $x_i$ and $x_j$ can be formulated as $A_{ij} = \hat{n}_i^T \hat{n}_j$ or via the angle $\theta_{ij} = \arccos(\hat{n}_i^T \hat{n}_j)$. Choice of $f$ can encode domain structure, such as negative entropy for distributions, or total variation for signals. Using Gaussian kernels of these angles, e.g. $K_{ij} = \exp(-\theta_{ij}^2 / (2\sigma^2))$, often yields positive-definite affinity matrices suitable for clustering and spectral analysis.

This construction yields true angle metrics on the manifold of surface normals. Bregman-angle affinity may provide robustness to global shifts, is more faithful for structured signals, and relates to (but is distinct from) Bregman divergence, which measures tangential, rather than angular, differences.
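As an illustration of the construction, the sketch below uses negative entropy $f(p) = \sum_k p_k \log p_k$ on strictly positive probability vectors, whose gradient is $\log p + 1$. The lifted normal $(g, -1)$, the Gaussian kernel form, the bandwidth `sigma`, and all function names are assumptions made for this example rather than details confirmed by the cited paper.

```python
import numpy as np

def neg_entropy_grad(P):
    """Gradient of f(p) = sum_k p_k log p_k, valid for strictly positive entries."""
    return np.log(P) + 1.0

def bregman_angle_affinity(P, grad_fn, sigma=0.5):
    """Cosines, angles, and a Gaussian angle kernel between unit normals to the graph of f."""
    G = grad_fn(P)                                        # one gradient per row
    N = np.hstack([G, -np.ones((G.shape[0], 1))])         # lifted normals (g, -1)
    N = N / np.linalg.norm(N, axis=1, keepdims=True)      # unit normals
    cos = np.clip(N @ N.T, -1.0, 1.0)                     # clip guards arccos against round-off
    theta = np.arccos(cos)
    kernel = np.exp(-theta ** 2 / (2.0 * sigma ** 2))     # assumed kernel form
    return cos, theta, kernel

# Three made-up distributions: the first two are similar, the third is not.
P = np.array([[0.7, 0.2, 0.1],
              [0.6, 0.3, 0.1],
              [0.1, 0.2, 0.7]])
cos, theta, K = bregman_angle_affinity(P, neg_entropy_grad)
print(np.round(cos, 3))
print(np.round(K, 3))
```

Swapping `grad_fn` for the (sub)gradient of another convex function changes the geometry without touching the rest of the pipeline.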

6. Practical Algorithms and Applications

Algorithmic recipes for large-scale cosine-based affinity computation proceed via:

  • For standard use: L2-normalize each row (or column) vector, then compute the Gram matrix of dot products (Bouhsine, 23 Feb 2026).
  • In collaborative filtering, scale the columns of $X$ to unit norm, compute the top singular vectors of the resulting column-normalized matrix, correct the top singular value by subtracting the estimated rank-1 mean-overestimate, and reconstruct a low-rank affinity approximation (Clean-KNN) (Khawar et al., 2019).
  • In situations with only a co-occurrence matrix, use the Ochiai normalization, not raw cosine, to recover proper affinity values (Zhou et al., 2015).
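The normalize-then-truncate part of these recipes can be sketched as follows. This is not the Clean-KNN procedure itself: the top-singular-value correction and the noise-bulk clipping described in the cited work are deliberately omitted, and the function name and toy data are hypothetical.

```python
import numpy as np

def low_rank_cosine(X, k, eps=1e-12):
    """Rank-k approximation of the item-item cosine matrix via SVD of column-normalized X."""
    X = np.asarray(X, dtype=float)
    norms = np.maximum(np.linalg.norm(X, axis=0), eps)
    X_tilde = X / norms                                   # unit-norm columns
    _, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
    V_k = Vt[:k].T                                        # top-k right singular vectors
    return V_k @ np.diag(s[:k] ** 2) @ V_k.T              # eigenvalues of the Gram matrix are s^2

# Toy user-item matrix: 4 users x 3 items; values are made up.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [4.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
S_full = low_rank_cosine(X, k=3)      # full rank: reproduces the exact cosine matrix
S_rank1 = low_rank_cosine(X, k=1)     # keeps only the dominant direction
print(np.round(S_full, 3))
print(np.round(S_rank1, 3))
```

Working from the SVD of the normalized data matrix rather than from the affinity matrix avoids materializing the full item-item matrix until the final reconstruction.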

Empirical results in recommendation tasks demonstrate that cleaning cosine-based affinity matrices (by bias removal and noise-bulk eigenvalue clipping) yields substantial improvements in recall, NDCG, AUC, and catalog diversity compared to uncorrected cosine or SVD-type baselines (Khawar et al., 2019).

In clustering, isotropy preprocessing maximizes statistical power for detecting structure and allows affinity thresholds to be calibrated against the null distribution.

In bibliometrics, improper normalization of co-occurrence matrices distorts downstream multidimensional scaling and clustering outputs, often erasing subfield distinctions; Ochiai-normalization corrects this artefact (Zhou et al., 2015).

7. Domain-Specific Considerations and Recommendations

Proper construction and interpretation of cosine-based affinity matrices depend strongly on context:

  • Embedding-based retrieval: Always enforce L2-normalization before affinity computation. Dot-product objectives alone do not ensure meaningful cosine geometry. Post-processing by projection to the unit sphere suffices (Bouhsine, 23 Feb 2026).
  • Recommender systems: Centering and correcting empirical cosine matrices for mean-bias and noise improves accuracy and diversity, outperforming vanilla nearest neighbor and SVD-based models (Khawar et al., 2019).
  • Statistical power in biology or clustering: Whitening and isotropic scaling of data sharpen the null distribution and maximize sensitivity. Optimize data transformations to approach isotropy in the feature space (Smith et al., 2023).
  • Bibliometric mapping: Never apply cosine similarity or Pearson correlation directly to a co-occurrence matrix $C$; instead, use the Ochiai coefficient, which is mathematically equivalent to the cosine similarity on the underlying occurrence matrix $O$ and avoids double normalization (Zhou et al., 2015).
  • Structured/signal data: Where linear geometry is insufficient, generalize similarity by constructing affinity matrices via convex-cost surface-normals or Bregman-angles, choosing the convex function to match domain assumptions (Gunay et al., 2014).

Meticulous normalization and a careful understanding of the statistical and algebraic subtleties are essential for the reliable use of cosine-based affinity matrices in both foundational research and applied machine learning.
