Similarity & Dissimilarity Distributions

Updated 20 February 2026
  • Similarity and dissimilarity distributions are the empirical distributions of metrics and pseudo-metrics that quantify how 'close' or 'far' objects are in datasets, enabling insights into cluster separability and data complexity.
  • Matrix, graph, and geometric approaches are employed to construct these distributions, with methods like learnable weighting and robust estimators enhancing interpretability and scalability.
  • Practical applications in unsupervised clustering, representation learning, and hypothesis testing showcase how these distributions provide actionable insights into high-dimensional and structured data.

Similarity and dissimilarity distributions characterize how “close” or “far apart” objects, groups, or probability distributions are, either in a pointwise sense, over matrices, or in abstract spaces of functions or measures. These concepts are foundational across unsupervised learning, hypothesis testing, pattern recognition, matrix and network analysis, and representation learning. Both similarity (high closeness) and dissimilarity (high difference) are encoded via explicit metrics, pseudo-metrics, or more general discrepancy functionals, whose empirical distribution over a dataset reveals global structure, cluster separability, data complexity, and "outlierness". Recent research unifies these tools with robust geometry, learnable weighting, invariance properties, and scalable algorithms, yielding powerful frameworks for both theoretical study and large-scale inference.

1. Definitions and Taxonomy of Similarity and Dissimilarity

Explicit dissimilarity measures include classical metrics (Euclidean, Manhattan, Minkowski) as well as problem-specific indices such as Bray–Curtis, Jaccard, and Dice. More recent constructions decompose dissimilarity via moments and sparsity, mapping each object or group to a vector of components: mean shift, spread (variance), and proportion of zeros (“sparsity dissimilarity”), particularly informative in high-dimensional or structured domains (Tuobang, 2024). In pairwise contexts, similarity matrices (S) and dissimilarity matrices (D) may be learned, regularized, or constructed via weighted k-NN graphs (Lyu et al., 2024), geometric overlaps (Zimmermann, 2022), or via kernel- or transport-based distances on distributions (Rakotomamonjy et al., 2018, Harel et al., 2012).
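The moment–sparsity decomposition described above can be sketched numerically. The function name and the plain (non-robust) estimators below are illustrative choices, not the referenced paper's implementation:

```python
import numpy as np

def moment_sparsity_dissimilarity(x, y):
    """Decompose the dissimilarity between two 1-D groups into three
    interpretable components: mean shift, spread (variance) difference,
    and the difference in the proportion of zeros ("sparsity")."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mean_shift = abs(x.mean() - y.mean())
    spread_diff = abs(x.std(ddof=1) - y.std(ddof=1))
    sparsity_diff = abs(np.mean(x == 0) - np.mean(y == 0))
    return mean_shift, spread_diff, sparsity_diff

# Two toy "expression" groups with equal means but different sparsity:
g1 = [0.0, 0.0, 1.0, 3.0]
g2 = [1.0, 1.0, 1.0, 1.0]
mu, sd, sp = moment_sparsity_dissimilarity(g1, g2)
```

Here the mean-shift component is zero while the sparsity component is 0.5, illustrating how the decomposition separates structure that a single scalar distance would conflate.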

Distributions of similarity or dissimilarity scores capture the spectrum from “most similar” (low D) to “most dissimilar” (high D) among all pairs, triples, or groupings of objects, and support empirical diagnostics such as density plots, moment statistics, and downstream clustering (Zimmermann, 2022, Wald et al., 2023).

2. Methods for Constructing and Interpreting (Dis)similarity Distributions

Matrix- and Graph-Based Approaches

In matrix analysis, dissimilarity distributions emerge via row-wise or blockwise statistics. For m × n matrices (e.g., gene expression, survey scores), group-to-group dissimilarities are decomposed componentwise—mean (μΔ), standard deviation (σΔ), and sparsity (sΔ)—with robust estimation (Hodges–Lehmann) ensuring stability against outliers (Tuobang, 2024). When required, a composite pseudo-metric D = αμΔ + βσΔ + γsΔ is formed, with normalization or standardization to ensure comparability of components.
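A minimal sketch of the composite pseudo-metric D = αμΔ + βσΔ + γsΔ with a one-sample Hodges–Lehmann location estimate (median of Walsh averages); the default weights and helper names are illustrative assumptions:

```python
import numpy as np
from itertools import combinations

def hodges_lehmann(x):
    """Hodges–Lehmann location estimate: median of all Walsh averages
    (pairwise means plus the points themselves). Robust to outliers."""
    x = list(map(float, x))
    walsh = [(a + b) / 2.0 for a, b in combinations(x, 2)] + x
    return float(np.median(walsh))

def composite_dissimilarity(x, y, alpha=1.0, beta=1.0, gamma=1.0):
    """D = alpha*mu_delta + beta*sigma_delta + gamma*s_delta."""
    mu_d = abs(hodges_lehmann(x) - hodges_lehmann(y))
    sigma_d = abs(np.std(x, ddof=1) - np.std(y, ddof=1))
    s_d = abs(np.mean(np.asarray(x) == 0) - np.mean(np.asarray(y) == 0))
    return alpha * mu_d + beta * sigma_d + gamma * s_d

clean = [1.0, 2.0, 3.0, 4.0, 5.0]
dirty = [1.0, 2.0, 3.0, 4.0, 500.0]
```

On these toy data the H–L estimate stays at 3.0 for both groups while the sample mean of `dirty` jumps to 102, illustrating the stability against outliers noted above.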

For graph-based data, similarity matrices are often constructed by aggregating k-th nearest neighbor (NN) shells. Classical k-NN graphs may misrepresent the true latent clusters; learning a nonnegative weight vector w over all shells produces a similarity matrix S(w) parameterized on a low-dimensional simplex, while a dual vector p constructs a dissimilarity matrix D(p); jointly, these govern the optimization and sharply separate reliable ("trusted neighbor") relations from unreliable ones (Lyu et al., 2024).
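The shell construction can be sketched as follows; the weights here are fixed by hand rather than learned, and all function names are illustrative:

```python
import numpy as np

def knn_shells(X, K):
    """Binary shell matrices: shells[k][i, j] = 1 iff j is the (k+1)-th
    nearest neighbour of point i (self excluded)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                # exclude self-neighbours
    order = np.argsort(D, axis=1)              # per-row neighbour ranking
    n = len(X)
    shells = np.zeros((K, n, n))
    for i in range(n):
        for k in range(K):
            shells[k, i, order[i, k]] = 1.0
    return shells

def similarity_from_shells(shells, w):
    """S(w) = sum_k w_k * shell_k, symmetrized; w lies on the simplex."""
    S = np.tensordot(w, shells, axes=1)
    return (S + S.T) / 2.0

X = np.array([[0.0], [1.0], [10.0]])
S = similarity_from_shells(knn_shells(X, 2), np.array([0.7, 0.3]))
```

With K shells, w has only K free parameters, which is the low-dimensional simplex parameterization the text refers to.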

Distributional and Geometric Approaches

For multivariate distributions or non-vector data (images, time series), dissimilarity may be measured via optimal transport (Wasserstein distance), RKHS-based metrics (MMD), or newly proposed invariants such as Diffeomorphism-Invariant Dissimilarity (DID), which minimizes a base discrepancy over a regularized class of coordinate warps, yielding invariance to large classes of transformations (Cantelobre et al., 2022). The DID empirical distribution, when plotted across random pairs, warped pairs, and same-class pairs, visually quantifies cluster tightness and transformation invariance.
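As one concrete instance of an RKHS-based discrepancy, the (biased) squared MMD between two samples under a Gaussian kernel can be computed directly; the bandwidth and kernel choice below are assumptions for illustration:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between sample sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between samples X and Y (rows)."""
    return (gaussian_kernel(X, X, sigma).mean()
            - 2 * gaussian_kernel(X, Y, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean())

X = np.array([[0.0], [1.0]])
same = mmd2_biased(X, X)        # identical samples -> 0
shifted = mmd2_biased(X, X + 5.0)
```

Plotting such scores over many random pairs of samples yields exactly the kind of empirical dissimilarity distribution the section discusses.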

Explicit Dissimilarity Distributions for Grouped/Combinatorial Data

The enumeration of “identity states” in aligned multi-object draws gives rise to a distribution over match/mismatch configurations, whose expectations are analytically tractable in population genetics and combinatorial settings (Ahsan et al., 2024). The mean expected dissimilarity reduces (universally, for unordered size-K draws from distributions p and q) to 1 – ⟨p, q⟩, independent of ploidy and alphabet size.
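The 1 − ⟨p, q⟩ identity can be checked for single draws by simulation; the distributions, seed, and sample size below are arbitrary choices:

```python
import numpy as np

def expected_dissimilarity(p, q):
    """Mean expected dissimilarity between one draw from p and one from q:
    P(mismatch) = 1 - <p, q>."""
    return 1.0 - float(np.dot(p, q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
analytic = expected_dissimilarity(p, q)      # 1 - 0.29 = 0.71

rng = np.random.default_rng(0)
draws = 200_000
x = rng.choice(3, size=draws, p=p)
y = rng.choice(3, size=draws, p=q)
empirical = float(np.mean(x != y))           # Monte Carlo estimate
```

The Monte Carlo mismatch rate agrees with the analytic value to within sampling error, and, as the text notes, neither depends on the alphabet size beyond the inner product of the frequencies.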

3. Learnable and Adaptive (Dis)similarity: Weighting and Regularization

Recent advances blend the classical graph-based approaches with adaptive, learnable mechanisms. Lyu & Jia (Lyu et al., 2024) show that learning nonnegative simplex-constrained weights w and p over k-NN shells allows for data-adaptive selection of informative versus noisy neighborhoods. This low-dimensional parameterization reduces overfitting and preserves interpretability. Further, a dual dissimilarity branch (D(p)) is paired, with a strict orthogonality constraint (wᵀp = 0) and cross-penalization in the overall clustering objective. These ingredients, along with a geometrically-motivated simplex-volume orthogonality regularizer on latent cluster assignments, are simultaneously optimized in a convergent alternating scheme, yielding substantial improvements over state-of-the-art clustering baselines. Both ablation and statistical analysis confirm that learning similarity and dissimilarity jointly is critical for top empirical performance.

4. Statistical Properties: Invariance, Robustness, and Hypothesis Testing

A variety of (dis)similarity measures come equipped with invariance, robustness, and statistical testing guarantees:

  • RKHS-based metrics (including MMD, LinCKA, and DID) are invariant to classes of data transformations (e.g., smooth diffeomorphisms (Cantelobre et al., 2022)), enabling discriminative yet transformation-stable pairwise comparisons.
  • Theoretical results for kernel-based multi-sample dissimilarities (KMD) (Huang et al., 2022), and perturbed variation (PV) (Harel et al., 2012), provide sharp bounds: PV interpolates between total variation and Wasserstein distances, allowing relaxation of exact matching via a perturbation parameter ε. Both admit computable, permutation-based hypothesis tests with finite-sample error controls and central-limit asymptotics for multi-group testing. PV in particular admits efficient sample-based estimation via bipartite graph matchings, with explicit bias and concentration theorems.
  • Moment–sparsity decompositions are robust through estimator choice (Hodges–Lehmann location), interpretable (mean, sd, and sparsity each reflect a distinct, nameable shift), and admit standardization for high-dimensional settings (Tuobang, 2024).
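The PV sample estimate via maximum bipartite matching can be sketched in a few lines; the ε-threshold graph follows the description above, but the augmenting-path helper (Kuhn's algorithm) and all names are our own:

```python
import numpy as np

def _max_matching(adj, n_right):
    """Maximum bipartite matching via Kuhn's augmenting-path algorithm."""
    match_r = [-1] * n_right
    def augment(u, seen):
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                if match_r[v] == -1 or augment(match_r[v], seen):
                    match_r[v] = u
                    return True
        return False
    return sum(augment(u, [False] * n_right) for u in range(len(adj)))

def perturbed_variation(X, Y, eps):
    """Sample estimate of the perturbed variation: average fraction of
    sample mass left unmatched when points may only be paired within eps."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    adj = [list(np.nonzero(D[i] <= eps)[0]) for i in range(len(X))]
    matched = _max_matching(adj, len(Y))
    return 1.0 - matched * (1.0 / len(X) + 1.0 / len(Y)) / 2.0

X = np.array([[0.0], [1.0], [2.0]])
Y = np.array([[0.05], [1.05], [10.0]])
pv = perturbed_variation(X, Y, eps=0.1)   # one point per side unmatched
```

Raising ε relaxes the matching toward total agreement (PV → 0), while ε → 0 recovers an exact-match criterion, which is the interpolation behaviour described above.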

5. Practical Applications: Clustering, Representation Learning, Hypothesis Testing

Similarity and dissimilarity distributions underpin numerous practical pipelines:

  • Unsupervised Clustering: Adaptive, learnable similarity/dissimilarity matrices enable more precise, cluster-aligned SymNMF, outperforming fixed-graph or naive adaptive methods on omics, image, and document clustering tasks (Lyu et al., 2024). Empirical (dis)similarity distributions guide the choice of k, regularization, and reveal cluster tightness.
  • Representation Learning and Ensembles: Enforcing dissimilarity at intermediate network layers (e.g., using CKA, L2Corr, ExpVar metrics) provably increases diversity among model outputs; the empirical distributions of pairwise dissimilarity at each layer shift significantly under regularization, correlating with increased ensemble robustness and higher accuracy (Wald et al., 2023).
  • Distribution Testing: The KMD framework provides fast, scalable, nonparametric tests for equality across arbitrary numbers of groups, with sample complexity, permutation-invariance, and divergence-like properties (Huang et al., 2022). PV and other thresholds allow formal similarity (equivalence) reasoning, rather than strict difference testing (Harel et al., 2012).
  • Geometric Analysis: Overlap-based dissimilarity matrices among 2D shapes enable systematic geometric analysis, MDS or SVD-based visualization, and robust clustering, with the empirical D-matrix revealing both cluster block structure and intrinsic data dimensionality (Zimmermann, 2022).
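For the CKA-based layer comparisons mentioned above, linear CKA between two representation matrices can be sketched as follows (1 − CKA serving as the dissimilarity); the synthetic data and sizes are illustrative:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity between two representation matrices
    (rows = examples, columns = features). Invariant to orthogonal
    transforms and isotropic scaling of either representation."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro")
                   * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
same = linear_cka(X, X)           # identical representations
scaled = linear_cka(X, 3.0 * X)   # scale-invariant
other = linear_cka(X, rng.standard_normal((50, 8)))
```

Collecting 1 − CKA over all layer pairs of an ensemble yields the per-layer dissimilarity distributions whose regularization-induced shifts the text describes.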

Applications in “omics”, microbe profiling, social science surveys, and image domains consistently show that examining the empirical distribution of dissimilarity components across pairs or groups provides actionable, interpretable insight into differential structure, robustness to transformation, and effective data embedding for downstream learning (Tuobang, 2024, Huang et al., 2022, Cantelobre et al., 2022).

6. Empirical and Theoretical Insights: Typical Distributions, Separability, and Limitations

Typical dissimilarity distributions, when plotted as histograms of off-diagonal matrix entries or pairwise scores, reveal:

  • Spread and skew indicating the presence of tight clusters (peaked low, with heavy right tail) or diffuse populations (broad, high mean) (Zimmermann, 2022, Tuobang, 2024).
  • Clear separation between “similar” pairs (within-class, or diffeomorphic-warps) and “random” pairs, used for threshold setting in clustering or outlier detection (Cantelobre et al., 2022, Wald et al., 2023).
  • Limitations on metricity: many pseudo-metric (dis)similarity functions lack triangle inequality (e.g. moment–sparsity or composite scores), thus best interpreted as embeddings or features for downstream analysis, rather than strict distances (Tuobang, 2024, Harel et al., 2012).
  • High-dimensionality effects: sparsity dissimilarity often dominates in “omics” or survey data when m ≫ n, while convergence of empirical scores may suffer from the curse of dimensionality, partially alleviated via robust estimation and standardization (Tuobang, 2024, Harel et al., 2012).
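These histogram diagnostics can be computed directly from a dissimilarity matrix; the two-cluster toy data and summary statistics below are illustrative:

```python
import numpy as np

def offdiag_summary(D):
    """Mean, spread, and moment skewness of the off-diagonal entries
    of a dissimilarity matrix."""
    v = D[~np.eye(len(D), dtype=bool)]
    mu, sd = v.mean(), v.std()
    skew = ((v - mu) ** 3).mean() / sd ** 3
    return mu, sd, skew

# Two tight, well-separated clusters in the plane.
rng = np.random.default_rng(1)
A = rng.normal(0.0, 0.1, size=(20, 2))
B = rng.normal(5.0, 0.1, size=(20, 2))
X = np.vstack([A, B])
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)

mu, sd, skew = offdiag_summary(D)
within = D[:20, :20][~np.eye(20, dtype=bool)]
between = D[:20, 20:]
gap = between.min() - within.max()   # positive gap -> separable clusters
```

A clearly positive gap between the within-cluster and between-cluster modes is exactly the separation used for threshold setting in clustering or outlier detection.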

A notable theoretical result for labeled draws from discrete distributions is that mean dissimilarity depends only on the inner product of the underlying frequencies, not on sample size or alphabet cardinality, and that under certain conditions, draws from the same law may appear more dissimilar (in expectation) than draws from different laws (Ahsan et al., 2024).
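A toy numerical illustration of the last phenomenon, with distributions chosen by us rather than taken from the paper: for p = (0.6, 0.4) and q = (1, 0), the expected within-p mismatch 1 − ⟨p, p⟩ exceeds the cross mismatch 1 − ⟨p, q⟩:

```python
import numpy as np

p = np.array([0.6, 0.4])
q = np.array([1.0, 0.0])

within_p = 1.0 - float(np.dot(p, p))   # two draws from p: 0.48
cross = 1.0 - float(np.dot(p, q))      # one draw each from p and q: 0.40
```

Since ⟨p, q⟩ can exceed ⟨p, p⟩ whenever q is more concentrated than p, two draws from the same law here mismatch more often, in expectation, than draws from different laws.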

7. Comparative Summary and Research Directions

The table below summarizes representative families of similarity and dissimilarity distributions, illustrating their key properties:

| Class of Method | Core Property | Reference |
|---|---|---|
| Learnable similarity/dissimilarity graphs | Simplex-constrained, adaptive, dual regularization | (Lyu et al., 2024) |
| Moment/sparsity-decomposed matrix distances | Interpretable, robust, pseudo-metric | (Tuobang, 2024) |
| Perturbed variation | ε-tolerance, relaxes TV, finite-sample guarantees | (Harel et al., 2012) |
| Kernel Multi-sample Dissimilarity (KMD) | Graph-based, scalable, divergence-like | (Huang et al., 2022) |
| Diffeomorphism-Invariant Functional Dissim. | RKHS-based, geometric invariance | (Cantelobre et al., 2022) |
| Explicit combinatorial (identity-state) | Analytical expectation, population genetics | (Ahsan et al., 2024) |
| Geometric overlap for shapes | Fully spatial, MDS visualization, clustering | (Zimmermann, 2022) |
| Representational (dis)similarity in DNNs | Per-layer structural shifts, ensemble diversity | (Wald et al., 2023) |

Current research continues to develop settings with better transformation invariance, interpretable decompositions (mean/sd/sparsity), learnable low-dimensional parameterizations, and scalable statistical testing methods, particularly for high-dimensional, structured, or non-vectorial data. The systematic analysis of similarity and dissimilarity distributions, both as descriptive statistics and within learning objectives, remains central to advances in unsupervised clustering, hypothesis testing, ensemble learning, and geometric data analysis.