Cosine-based Affinity Matrices

Updated 23 March 2026

Cosine-based affinity matrices are defined as normalized inner products between data points, providing a robust measure of pairwise similarities used in clustering and recommendation systems.
They employ L2-normalization, noise correction, and eigenvalue shrinkage to ensure accurate spectral properties and enhance performance in diverse applications.
Generalizations using Bregman-angle affinity and isotropic preprocessing further refine these matrices for precise embeddings and reliable recovery of signal structure.

Cosine-based affinity matrices are fundamental objects in data analysis, machine learning, and information retrieval, encoding pairwise similarity between data points via normalized inner products. These affinity matrices underpin memory-based recommendation, spectral clustering, biological association mining, and a wide range of embedding-based retrieval pipelines. Their construction, spectral properties, normalization strategies, and mathematical pitfalls have been investigated across multiple domains using advanced statistical and matrix analysis tools.

1. Construction and Definitions

Given $n$ data points $x_1,\dots,x_n$ in $\mathbb{R}^d$, the canonical cosine similarity between $x_i$ and $x_j$ is defined as

$\cos(x_i, x_j) = \frac{x_i^T x_j}{\|x_i\|\|x_j\|}.$

The resulting affinity matrix $A \in \mathbb{R}^{n \times n}$ has entries $A_{ij} = \cos(x_i, x_j)$. In collaborative filtering, given a user–item interaction matrix $X \in \mathbb{R}^{n \times m}$, the empirical cosine similarity matrix between items is

$S_{\mathrm{cos}} = D^{-1/2} (X^T X) D^{-1/2},$

where $D = \mathrm{diag}(d_1,\dots,d_m)$ with $d_j = \sum_{i} X_{ij}^2$ (Khawar et al., 2019). Equivalently, forming the column-normalized matrix $\tilde{X} = X D^{-1/2}$, $S_{\mathrm{cos}} = \tilde{X}^T \tilde{X}$.
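As a minimal NumPy sketch of the two constructions above, the following computes the row-wise affinity matrix $A$ and the item–item matrix $S_{\mathrm{cos}}$. It assumes $D$ is the diagonal matrix of squared column norms of $X$; the function names and the `eps` guard against zero-norm vectors are illustrative choices, not part of any cited method.

```python
import numpy as np

def cosine_affinity(X, eps=1e-12):
    """Row-wise cosine affinity: A[i, j] = cos(x_i, x_j)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, eps)  # all-zero rows stay zero instead of dividing by 0
    return Xn @ Xn.T

def item_cosine(X, eps=1e-12):
    """Item-item cosine S = D^{-1/2} X^T X D^{-1/2}, D = diag(squared column norms)."""
    X = np.asarray(X, dtype=float)
    d = np.sum(X * X, axis=0)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(d, eps))
    return inv_sqrt[:, None] * (X.T @ X) * inv_sqrt[None, :]

rng = np.random.default_rng(0)
X = rng.random((6, 4))   # hypothetical 6 users x 4 items
S = item_cosine(X)       # 4 x 4 item-item affinity
```

Computing `cosine_affinity(X.T)` gives the same matrix as `item_cosine(X)`, which is the "equivalently" statement in matrix form.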

For alternative applications, cosine-based affinity matrices may arise from more general similarity measures, such as the "Bregman-angle" cosine between surface normals of a convex cost function $f$, $A_{ij} = \frac{n_i^T n_j}{\|n_i\|\|n_j\|}$, with $n_i = (\nabla f(x_i), -1)$ (Gunay et al., 2014).

In bibliometrics, one constructs the item–item (author–author, word–word) co-occurrence matrix $C = O^T O$ from an occurrence matrix $O$. The correct recovery of cosine affinities in this setting employs the Ochiai coefficient (Zhou et al., 2015): $S_{ij} = \frac{C_{ij}}{\sqrt{C_{ii}\,C_{jj}}},$ which yields the same values as computing the cosine directly from the columns of $O$.
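The equivalence can be checked numerically. The sketch below assumes a binary occurrence matrix `O` (rows are documents, columns are items) and a co-occurrence matrix formed as `O.T @ O`; the data are synthetic and the helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
O = (rng.random((200, 5)) < 0.4).astype(float)  # synthetic binary occurrences
C = O.T @ O                                     # C[i, i] = number of occurrences of item i

def ochiai(C):
    """Ochiai coefficient: C_ij / sqrt(C_ii * C_jj)."""
    d = np.sqrt(np.diag(C))
    return C / np.outer(d, d)

def cosine_columns(M):
    """Cosine similarity between the columns of M."""
    norms = np.linalg.norm(M, axis=0)
    return (M.T @ M) / np.outer(norms, norms)

S_ochiai = ochiai(C)          # computed from co-occurrences only
S_direct = cosine_columns(O)  # computed from the original occurrences
```

The two results agree because the diagonal of `C` already holds the squared column norms of `O`.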

2. Spectral Properties and Noise Effects

Random matrix theory describes the spectral behavior of empirical cosine affinity matrices, particularly under noisy input data. For $X$ a random $n \times m$ matrix with i.i.d. zero-mean unit-variance entries, the eigenvalue spectrum of $\frac{1}{n} X^T X$ converges to the Marčenko–Pastur law as $n, m \to \infty$ with $m/n$ held fixed (Khawar et al., 2019). Key properties include:

All nontrivial eigenvalues are confined to the interval $[\lambda_-, \lambda_+]$, $\lambda_- = (1 - \sqrt{q})^2$, $\lambda_+ = (1 + \sqrt{q})^2$, with $q = m/n$.
The spectrum of the cosine affinity matrix $S_{\mathrm{cos}}$ exhibits natural shrinkage of non-top eigenvalues compared to that of the (demeaned) Pearson correlation estimator: for $k \ge 2$, $\lambda_k(S_{\mathrm{cos}}) \le \lambda_k(S_{\mathrm{Pearson}})$, meaning vanilla cosine induces shrinkage "for free".
Non-centered data leads to a top-eigenvalue overestimation: decomposing $X = X_c + \mathbf{1}\mu^T$ (centered + mean matrix), one extra rank-1 term $\mu\mu^T$ in $\frac{1}{n} X^T X$ lifts one eigenvalue by roughly $\|\mu\|^2 = \sum_j \mu_j^2$ (with $\mu_j$ the mean of column $j$), necessitating explicit correction for accurate downstream use.
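A small simulation can illustrate both the noise bulk and the mean-induced outlier. This is a sketch under stated assumptions: the Gram matrix is scaled as $\frac{1}{n} X^T X$, the aspect ratio is $q = m/n$, the noise is Gaussian, and the common column mean of 0.5 is a made-up value chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 2000, 400
q = m / n
lam_minus = (1 - np.sqrt(q)) ** 2   # lower Marchenko-Pastur edge
lam_plus = (1 + np.sqrt(q)) ** 2    # upper Marchenko-Pastur edge

Z = rng.standard_normal((n, m))     # centered pure-noise data
eig_noise = np.linalg.eigvalsh(Z.T @ Z / n)      # ascending order

mu = np.full(m, 0.5)                # hypothetical column means
X = Z + mu                          # non-centered data: X = Z + 1 mu^T
eig_shifted = np.linalg.eigvalsh(X.T @ X / n)

print("MP edges:           ", lam_minus, lam_plus)
print("noise spectrum:     ", eig_noise[0], eig_noise[-1])
print("top two (shifted):  ", eig_shifted[-1], eig_shifted[-2])
print("||mu||^2:           ", mu @ mu)
```

In this setup the noise eigenvalues stay close to the Marčenko–Pastur interval, while the non-centered data produce a single eigenvalue on the order of $\|\mu\|^2$ that sits far outside it; the remaining eigenvalues stay within the noise range.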

3. Normalization, Gauge Ambiguity, and Proper Construction

Cosine similarity’s meaning depends crucially on normalization. For embeddings learned with general dot-product objectives (e.g., matrix factorization) with factor matrices $U$ and $V$, the similarity is arbitrary up to an invertible diagonal "gauge" matrix $D$, since $U V^T = (U D)(V D^{-1})^T$ (Bouhsine, 23 Feb 2026). The cosine between rows $u_i, u_j$ in $U D$ is: $\cos_D(u_i, u_j) = \frac{u_i^T D^2 u_j}{\|D u_i\|\,\|D u_j\|},$ which can be rendered arbitrary by adjusting $D$.

Enforcing L2-normalization of each embedding vector, restricting each embedding $u_i$ to the unit sphere $S^{d-1}$, eliminates this gauge freedom entirely:

For unit-normed embeddings, cosine similarity reduces to the plain dot product.
On the sphere, cosine and squared Euclidean distance are linearly related:

$\|u_i - u_j\|^2 = 2 - 2\,u_i^T u_j = 2 - 2\cos(u_i, u_j).$

Thus, neighbor rankings under cosine and Euclidean distance become identical.
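Both points can be illustrated with a short NumPy sketch. The factor matrices `U` and `V` and the diagonal matrix `D` below are arbitrary synthetic values, not taken from any cited experiment: rescaling the latent dimensions leaves every dot-product prediction unchanged but alters the cosine similarities, while on the unit sphere squared distance is an affine function of cosine.

```python
import numpy as np

rng = np.random.default_rng(3)
U = rng.standard_normal((5, 3))   # hypothetical user factors
V = rng.standard_normal((7, 3))   # hypothetical item factors
D = np.diag([10.0, 1.0, 0.1])     # arbitrary invertible diagonal gauge

def cosine_rows(M):
    """Cosine similarity between the rows of M."""
    Mn = M / np.linalg.norm(M, axis=1, keepdims=True)
    return Mn @ Mn.T

# Gauge transform: predictions U V^T are untouched, cosines are not.
U2, V2 = U @ D, V @ np.linalg.inv(D)
same_predictions = np.allclose(U @ V.T, U2 @ V2.T)
cos_before, cos_after = cosine_rows(V), cosine_rows(V2)

# After projecting onto the unit sphere: ||a - b||^2 = 2 - 2 cos(a, b).
Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
sq_dist = np.sum((Vn[:, None, :] - Vn[None, :, :]) ** 2, axis=-1)
```

Because `sq_dist` equals `2 - 2 * cosine_rows(V)` entry by entry, sorting neighbors by increasing distance or by decreasing cosine gives the same order.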

For co-occurrence matrices, applying standard cosine normalization or Pearson's $r$ to the co-occurrence matrix $C$ induces double normalization, overestimating affinities and distorting downstream clustering (Zhou et al., 2015). The correct approach is to use the Ochiai coefficient on $C$ or the cosine directly on the original occurrence matrix $O$.
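The inflation can be seen on synthetic data. The sketch below assumes independent binary occurrences with density 0.4, so the properly normalized affinities should be moderate; it compares the Ochiai coefficient with cosine similarity applied to the co-occurrence matrix itself. The numbers are specific to this toy setup and are not a general bound.

```python
import numpy as np

rng = np.random.default_rng(5)
O = (rng.random((500, 6)) < 0.4).astype(float)  # synthetic binary occurrences
C = O.T @ O                                     # co-occurrence counts

def cosine_columns(M):
    """Cosine similarity between the columns of M."""
    norms = np.linalg.norm(M, axis=0)
    return (M.T @ M) / np.outer(norms, norms)

ochiai = C / np.sqrt(np.outer(np.diag(C), np.diag(C)))  # single normalization
double_norm = cosine_columns(C)                          # cosine applied to C itself

off = ~np.eye(C.shape[0], dtype=bool)                    # off-diagonal mask
print("mean off-diagonal, Ochiai:     ", ochiai[off].mean())
print("mean off-diagonal, cosine on C:", double_norm[off].mean())
```

In this example the doubly normalized values come out markedly higher than the Ochiai values even though the items are statistically independent, which is the kind of distortion the paragraph above warns about.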

4. Variance of Cosine Similarity and the Isotropic Principle

The statistical properties of entries in cosine-based affinity matrices depend on the input data's covariance structure. For zero-mean data with covariance $\Sigma$, asymptotic analysis shows (Smith et al., 2023): $\mathrm{Var}[\cos(x, y)] \approx \frac{\sum_k \lambda_k^2}{\left(\sum_k \lambda_k\right)^2},$ where $\lambda_k$ are the eigenvalues of $\Sigma$ and $x$, $y$ are independent draws. The variance of cosine similarity is minimized when the covariance is isotropic ($\lambda_k$ constant over $k$): this is the "isotropic principle". Preprocessing data via whitening ($x \mapsto \Sigma^{-1/2} x$) or more general isotropic linear maps ensures the affinity matrix's null distribution is as sharp as possible, improving discriminative power for clustering and retrieval tasks.
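A Monte Carlo sketch can make the isotropic principle concrete. It assumes independent zero-mean Gaussian pairs with a diagonal covariance, and it compares the simulated variance with the high-dimensional approximation $\sum_k \lambda_k^2 / (\sum_k \lambda_k)^2$, which is used here as a reference value under that assumption. The dimension, eigenvalue profile, and sample size are arbitrary illustrative choices.

```python
import numpy as np

def cosine_null_variance(eigvals, n_pairs=20000, seed=0):
    """Empirical variance of cos(x, y) for independent x, y ~ N(0, diag(eigvals))."""
    rng = np.random.default_rng(seed)
    scale = np.sqrt(np.asarray(eigvals, dtype=float))
    x = rng.standard_normal((n_pairs, scale.size)) * scale
    y = rng.standard_normal((n_pairs, scale.size)) * scale
    cos = np.sum(x * y, axis=1) / (
        np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    )
    return cos.var()

def predicted_variance(eigvals):
    """Approximation sum(lambda^2) / sum(lambda)^2."""
    lam = np.asarray(eigvals, dtype=float)
    return np.sum(lam ** 2) / np.sum(lam) ** 2

d = 50
iso = np.ones(d)                     # isotropic covariance
aniso = np.geomspace(1.0, 100.0, d)  # strongly anisotropic covariance

var_iso = cosine_null_variance(iso)
var_aniso = cosine_null_variance(aniso)
print("isotropic:   simulated", var_iso, "approx", predicted_variance(iso))
print("anisotropic: simulated", var_aniso, "approx", predicted_variance(aniso))
```

For the isotropic case the variance is $1/d$; the anisotropic profile yields a visibly wider null distribution, so a fixed similarity threshold is less discriminative there.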

Modern practice extends this by optimizing over a parameterized family of transformations $T_\theta$, maximizing downstream objectives (e.g., spectral cutting, recall) in an end-to-end manner, backpropagating gradients through the cosine similarities.

5. Generalizations: Convex Cost Functions and Bregman-Angle Matrices

Cosine-based affinity can be generalized via the angular structure of surface normals to convex cost functions ("Bregman-angle" similarity) (Gunay et al., 2014). Let $f: \mathbb{R}^d \to \mathbb{R}$ be a strictly convex (possibly non-differentiable) function. For each $x_i$, define the (possibly sub-)gradient $g_i \in \partial f(x_i)$ and lifted normal $n_i = (g_i, -1)$, then normalize $\hat{n}_i = n_i / \|n_i\|$. The affinity between $x_i$ and $x_j$ can be formulated as: $A_{ij} = \hat{n}_i^T \hat{n}_j,$ or via the angle $\theta_{ij} = \arccos(\hat{n}_i^T \hat{n}_j)$. Choice of $f$ can encode domain structure, such as negative entropy for distributions, or total variation for signals. Using Gaussian kernels of these angles, $K_{ij} = \exp(-\theta_{ij}^2 / (2\sigma^2))$, often yields positive-definite affinity matrices suitable for clustering and spectral analysis.

This construction yields true angle metrics on the manifold of surface normals. Bregman-angle affinity may provide robustness to global shifts, is more faithful for structured signals, and relates to (but is distinct from) Bregman divergence, which measures tangential, rather than angular, differences.
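As an illustrative sketch of this construction, the code below uses negative entropy $f(p) = \sum_k p_k \log p_k$ on strictly positive probability vectors, whose gradient is $\log p + 1$. It assumes the lifted normal takes the form $(\nabla f(x), -1)$, the usual normal to the graph of $f$; the function names, the bandwidth `sigma`, and the Dirichlet test data are hypothetical choices rather than details from Gunay et al.

```python
import numpy as np

def bregman_angle_affinity(P, grad_f, sigma=0.5):
    """Cosine and Gaussian-angle affinities between lifted surface normals."""
    G = np.array([grad_f(p) for p in P])            # (sub)gradients, one row per point
    N = np.hstack([G, -np.ones((len(G), 1))])       # lifted normals (grad f(x), -1)
    N = N / np.linalg.norm(N, axis=1, keepdims=True)
    cos = np.clip(N @ N.T, -1.0, 1.0)               # guard arccos against round-off
    theta = np.arccos(cos)                          # angles in [0, pi]
    return cos, np.exp(-theta ** 2 / (2 * sigma ** 2))

def neg_entropy_grad(p):
    """Gradient of f(p) = sum_k p_k log p_k (requires p > 0)."""
    return np.log(p) + 1.0

rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(4), size=6)   # six synthetic probability vectors
cos_aff, gauss_aff = bregman_angle_affinity(P, neg_entropy_grad)
```

Swapping `neg_entropy_grad` for the (sub)gradient of another convex function changes the geometry without touching the rest of the pipeline.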

6. Practical Algorithms and Applications

Algorithmic recipes for large-scale cosine-based affinity computation proceed via:

For standard use: L2-normalize each row (or column) vector, then compute the Gram matrix of dot products (Bouhsine, 23 Feb 2026).
In collaborative filtering, scale the columns of $X$, compute the top singular vectors of the column-normalized matrix $\tilde{X}$, correct the top singular value by subtracting the estimated rank-1 mean-overestimate, and reconstruct a low-rank affinity approximation (Clean-KNN) (Khawar et al., 2019).
In situations with only a co-occurrence matrix, use the Ochiai normalization, not raw cosine, to recover proper affinity values (Zhou et al., 2015).
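The collaborative-filtering recipe can be outlined in NumPy as follows. This is a simplified sketch and not the Clean-KNN implementation of Khawar et al.: the function name is hypothetical, and `top_correction` is a placeholder for the estimated mean-induced overestimate of the top eigenvalue, which the caller must supply because its estimator is not specified here.

```python
import numpy as np

def low_rank_cosine(X, k, top_correction=0.0, eps=1e-12):
    """Rank-k item-item cosine affinity from a column-normalized matrix.

    top_correction is a hypothetical, caller-supplied amount subtracted
    from the top eigenvalue to offset the mean-induced overestimate.
    """
    X = np.asarray(X, dtype=float)
    col_norms = np.linalg.norm(X, axis=0)
    Xn = X / np.maximum(col_norms, eps)              # unit-norm columns
    _, s, Vt = np.linalg.svd(Xn, full_matrices=False)
    eig = s[:k] ** 2                                 # top-k eigenvalues of Xn^T Xn
    eig[0] = max(eig[0] - top_correction, 0.0)       # correct the top eigenvalue
    return (Vt[:k].T * eig) @ Vt[:k]                 # V_k diag(eig) V_k^T

rng = np.random.default_rng(6)
X = rng.random((30, 8))                  # hypothetical 30 users x 8 items
S_full = low_rank_cosine(X, k=8)         # no truncation: exact cosine matrix
S_low = low_rank_cosine(X, k=3)          # rank-3 approximation
```

With `k` equal to the number of items and no correction, the result reproduces the exact item–item cosine matrix; smaller `k` discards the low-variance directions.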

Empirical results in recommendation tasks demonstrate that cleaning cosine-based affinity matrices (by bias removal and noise-bulk eigenvalue clipping) yields substantial improvements in recall, NDCG, AUC, and catalog diversity compared to uncorrected cosine or SVD-type baselines (Khawar et al., 2019).

In clustering, isotropic preprocessing improves statistical power for detecting structure and allows affinity thresholds to be calibrated against the null distribution.

In bibliometrics, improper normalization of co-occurrence matrices distorts downstream multidimensional scaling and clustering outputs, often erasing subfield distinctions; Ochiai-normalization corrects this artefact (Zhou et al., 2015).

7. Domain-Specific Considerations and Recommendations

Proper construction and interpretation of cosine-based affinity matrices depend strongly on context:

Embedding-based retrieval: Always enforce L2-normalization before affinity computation. Dot-product objectives alone do not ensure meaningful cosine geometry. Post-processing by projection to the unit sphere suffices (Bouhsine, 23 Feb 2026).
Recommender systems: Centering and correcting empirical cosine matrices for mean-bias and noise improves accuracy and diversity, outperforming vanilla nearest neighbor and SVD-based models (Khawar et al., 2019).
Statistical power in biology or clustering: Whitening and isotropic scaling of data sharpen the null distribution and maximize sensitivity. Optimize data transformations to approach isotropy in the feature space (Smith et al., 2023).
Bibliometric mapping: Never apply cosine similarity or Pearson correlation directly to a co-occurrence matrix $C$; instead, use the Ochiai coefficient, which is mathematically equivalent to the cosine similarity on the underlying occurrence matrix $O$ and avoids double normalization (Zhou et al., 2015).
Structured/signal data: Where linear geometry is insufficient, generalize similarity by constructing affinity matrices via convex-cost surface-normals or Bregman-angles, choosing the convex function to match domain assumptions (Gunay et al., 2014).

Meticulous normalization and a careful understanding of the statistical and algebraic subtleties are essential for the reliable use of cosine-based affinity matrices in both foundational research and applied machine learning.