Vendi Score: Kernel-Based Diversity Metric

Updated 14 May 2026
  • Vendi Score is a kernel-based diversity metric that quantifies effective diversity by computing the entropy of eigenvalues from a trace-normalized similarity matrix.
  • It offers tunable sensitivity through the parameter q, allowing focus on rare types or dominant clusters across domains such as machine learning, ecology, and biology.
  • Its algorithm involves constructing a kernel matrix, normalizing it, and performing spectral decomposition, with scalable approximations like the Nyström method enhancing efficiency.

The Vendi Score is a general, kernel-based diversity metric designed for quantifying the effective number of distinct elements in a finite sample, with deep connections to ecological diversity indices and quantum statistics. Its foundation is the entropy of the spectrum of a similarity matrix constructed via a user-specified positive semi-definite kernel. Unlike domain-specific or label-dependent metrics, the Vendi Score allows for tunable sensitivity to rare versus abundant types, admits theoretical generalizations, and has become a standard tool for measuring and optimizing diversity across machine learning, computational biology, ecology, and experimental design.

1. Formal Definition and Mathematical Foundations

Let $\mathcal{X} = \{x_1, \dots, x_n\}$ be a collection of $n$ items (e.g., images, sequences, trajectories) and $k:\mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}_+$ a user-chosen positive semi-definite kernel with the normalization $k(x_i,x_i) = 1$. Construct the $n \times n$ similarity matrix $K$, where $K_{ij} = k(x_i, x_j)$. For the trace-normalized matrix $\tilde K = K / \mathrm{tr}(K)$, let $\lambda_1,\ldots,\lambda_n$ denote its eigenvalues (so $\sum_{i=1}^n \lambda_i = 1$).

The Vendi Score of order $q$ (with $q \geq 0$, $q \neq 1$) is defined as:

$$\mathrm{VS}_q(K) = \exp\left(\frac{1}{1-q} \log \sum_{i=1}^n \lambda_i^q\right) = \left(\sum_{i=1}^n \lambda_i^q\right)^{1/(1-q)}$$

The $q = 1$ case, taken as the limit $q \to 1$, recovers the exponential of the Shannon/von Neumann entropy of the spectrum (also equivalent to the Hill number of order 1 in ecology):

$$\mathrm{VS}_1(K) = \exp\left(-\sum_{i=1}^n \lambda_i \log \lambda_i\right), \qquad 0 \log 0 := 0$$

This effective-number interpretation means that $\mathrm{VS}_q$ lies in $[1, n]$, achieving $n$ if all items are mutually orthogonal and $1$ if all are identical (Pasarkar et al., 2023, Friedman et al., 2022).

The kernel function $k$ is central: by selecting $k$, the user defines which differences are considered meaningful.
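As a small worked case, consider two items whose similarity is a single number $s \in [0, 1]$ (the symbol $s$ is introduced here only for illustration). The spectrum is available in closed form, which makes the two extremes of the effective-number interpretation easy to check by hand:

```latex
K = \begin{pmatrix} 1 & s \\ s & 1 \end{pmatrix}, \qquad
\tilde K = \tfrac{1}{2} K, \qquad
\lambda_{1,2} = \frac{1 \pm s}{2}

\mathrm{VS}_2 = \left(\lambda_1^2 + \lambda_2^2\right)^{-1} = \frac{2}{1 + s^2}, \qquad
\mathrm{VS}_1 = \exp\left(-\lambda_1 \log \lambda_1 - \lambda_2 \log \lambda_2\right)
```

At $s = 0$ (orthogonal items) both expressions equal 2, and at $s = 1$ (identical items) both equal 1, using the convention $0 \log 0 = 0$.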

2. Algorithmic Computation and Scalability

The computation of the Vendi Score proceeds as follows:

  1. Compute the $n \times n$ kernel matrix: $K_{ij} = k(x_i, x_j)$.
  2. Normalize: $\tilde K = K / \mathrm{tr}(K)$.
  3. Spectral decomposition: Obtain eigenvalues $\lambda_1, \ldots, \lambda_n$ of $\tilde K$.
  4. Aggregate: Evaluate the relevant $q$-order function as above to compute $\mathrm{VS}_q$.

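Pseudocode: K ← [k(x_i, x_j)] for i, j = 1..n; K̃ ← K / tr(K); λ ← eigenvalues(K̃); return (Σ_i λ_i^q)^(1/(1−q)) if q ≠ 1, else exp(−Σ_i λ_i log λ_i).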
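The four steps can be sketched in Python with NumPy. This is a minimal illustration rather than a reference implementation: the function names `cosine_kernel` and `vendi_score`, the choice of a cosine kernel, and the tolerance `tol` used to discard numerically zero eigenvalues are all assumptions made for the example.

```python
import numpy as np

def cosine_kernel(X):
    """Step 1: n x n cosine-similarity matrix, so that k(x_i, x_i) = 1."""
    X = np.asarray(X, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X @ X.T

def vendi_score(K, q=1.0, tol=1e-12):
    """Steps 2-4: trace-normalize K, take its spectrum, aggregate at order q."""
    K = np.asarray(K, dtype=float)
    lam = np.linalg.eigvalsh(K / np.trace(K))   # eigenvalues of a symmetric matrix
    lam = lam[lam > tol]                        # discard numerical zeros
    if np.isinf(q):
        return float(1.0 / lam.max())
    if q == 1:
        return float(np.exp(-np.sum(lam * np.log(lam))))
    return float(np.sum(lam ** q) ** (1.0 / (1.0 - q)))

# Four items along three orthogonal directions (the first direction is repeated).
X = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
K = cosine_kernel(X)
print(vendi_score(K, q=1))   # about 2.83, i.e. 2 ** 1.5
print(vendi_score(K, q=0))   # 3.0, the number of nonzero eigenvalues
```

The repeated item lowers the order-1 score below 3 because the spectrum becomes uneven (0.5, 0.25, 0.25), while the order-0 score still counts three distinct directions.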

Complexity: Computing the kernel matrix takes $O(n^2)$ kernel evaluations; eigendecomposition is $O(n^3)$ in general. If explicit embeddings are available (dimension $d \ll n$) and the kernel is their inner product, one can leverage the low-rank structure for $O(d^2 n)$ complexity (Friedman et al., 2022, Lintunen, 3 Sep 2025). For large-scale problems, efficient approximations via the Nyström method or random projections enable sub-cubic scaling (Ospanov et al., 2024).
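The low-rank shortcut rests on a standard linear-algebra fact: for an $n \times d$ embedding matrix $X$, the matrices $X X^\top$ ($n \times n$) and $X^\top X$ ($d \times d$) share the same nonzero eigenvalues. The sketch below assumes an inner-product kernel on unit-norm embeddings; the function names are hypothetical.

```python
import numpy as np

def _order_q(lam, q, tol=1e-12):
    """Aggregate a spectrum into the order-q effective number (finite q)."""
    lam = lam[lam > tol]
    if q == 1:
        return float(np.exp(-np.sum(lam * np.log(lam))))
    return float(np.sum(lam ** q) ** (1.0 / (1.0 - q)))

def vendi_from_kernel(X, q=1.0):
    """Reference route: n x n Gram matrix, cubic in n."""
    K = X @ X.T
    return _order_q(np.linalg.eigvalsh(K / np.trace(K)), q)

def vendi_from_embeddings(X, q=1.0):
    """Low-rank route: d x d matrix with the same nonzero eigenvalues."""
    G = X.T @ X
    return _order_q(np.linalg.eigvalsh(G / np.trace(G)), q)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit rows, so k(x_i, x_i) = 1

print(vendi_from_kernel(X), vendi_from_embeddings(X))   # agree up to round-off
```

Only the $d \times d$ eigenproblem is solved in the second route, so the cost is dominated by forming $X^\top X$.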

3. Theoretical Properties and Parameter Interpretation

Bounds, Invariance, and Interpretability

  • Range: $1 \leq \mathrm{VS}_q \leq n$.
  • Duplication invariance: Replicating every item the same number of times leaves VS unchanged, and a collection of identical items scores 1; redundancy is not counted as diversity.
  • Similarity sensitivity: The score interpolates between "species richness" ($q = 0$, counts modes) and "dominant-mode" ($q = \infty$, counts major clusters).
  • Label-free: No need for class labels or type frequencies; purely uses sample similarities (Pasarkar et al., 2023, Nielsen et al., 26 Sep 2025).
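The range and the behavior under replication can be checked numerically. The sketch below assumes a cosine kernel on random embeddings and duplicates every item once by taking a Kronecker product with a 2 x 2 block of ones; `vendi_score` is a hypothetical helper written for this example.

```python
import numpy as np

def vendi_score(K, q=1.0, tol=1e-12):
    """Order-q Vendi Score (finite q) of a PSD similarity matrix K."""
    lam = np.linalg.eigvalsh(np.asarray(K, dtype=float) / np.trace(K))
    lam = lam[lam > tol]
    if q == 1:
        return float(np.exp(-np.sum(lam * np.log(lam))))
    return float(np.sum(lam ** q) ** (1.0 / (1.0 - q)))

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)
K = X @ X.T                               # 6 items, cosine similarity

# Kernel matrix of the collection x1, x1, x2, x2, ..., x6, x6.
K_rep = np.kron(K, np.ones((2, 2)))

for q in (0.5, 1.0, 2.0):
    print(q, vendi_score(K, q), vendi_score(K_rep, q))   # the two columns match
```

Replication multiplies every nonzero eigenvalue and the trace by the same factor, so the normalized spectrum, and hence the score, is unchanged.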

Tuning via $q$

  • $q < 1$: Sensitive to rare clusters or outliers; emphasizes counting "distinct modes."
  • $q = 1$: Balances rare and common types (Shannon entropy analog).
  • $q > 1$: Emphasizes dominant groups; insensitive to rare types.
  • $q = 0$: Counts the number of nonzero modes (matrix rank).
  • $q = \infty$: Returns $1/\lambda_{\max}$, the reciprocal of the spectral share held by the largest cluster.
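To see how the order changes the reading of the same data, the sketch below evaluates several orders on a deliberately imbalanced toy collection: one cluster of eight identical items plus two singletons, all mutually orthogonal. The collection and the helper `vendi_score` are assumptions made for illustration.

```python
import numpy as np

def vendi_score(K, q=1.0, tol=1e-12):
    """Order-q Vendi Score, including the q = 0 and q = inf limits."""
    lam = np.linalg.eigvalsh(np.asarray(K, dtype=float) / np.trace(K))
    lam = lam[lam > tol]
    if np.isinf(q):
        return float(1.0 / lam.max())
    if q == 1:
        return float(np.exp(-np.sum(lam * np.log(lam))))
    return float(np.sum(lam ** q) ** (1.0 / (1.0 - q)))

# Block-diagonal similarity: 8 identical items, then 2 isolated items.
K = np.zeros((10, 10))
K[:8, :8] = 1.0
K[8, 8] = K[9, 9] = 1.0

orders = [0, 0.5, 1, 2, np.inf]
scores = [vendi_score(K, q) for q in orders]
for q, v in zip(orders, scores):
    print(f"q={q}: {v:.3f}")
```

The normalized spectrum here is (0.8, 0.1, 0.1), so the score falls from 3 at $q = 0$ (all three modes counted equally) to 1.25 at $q = \infty$ (only the dominant cluster matters).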

This allows targeted sensitivity in applications, such as rare variant detection in genomics ($q < 1$) or memorization in deep generative modeling ($q > 1$) (Pasarkar et al., 2023, Nielsen et al., 26 Sep 2025).

4. Practical Application Domains

Machine Learning

Experimental Design and Discovery

  • Quality-weighted VS: In scientific discovery and experimental design (e.g., active search, Bayesian optimization), VS is extended with a quality multiplier, yielding $\mathrm{qVS} = \left(\frac{1}{n}\sum_{i=1}^n s(x_i)\right) \cdot \mathrm{VS}_q$, where $s(x_i)$ is the quality score of item $x_i$. Such criteria flexibly balance exploitation (high quality score) and exploration (diversity), resulting in 70–170% increases in effective discoveries (Nguyen et al., 2024).
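A minimal sketch of a quality-weighted score follows, assuming the quality multiplier is the mean of per-item quality scores (the usual reading of Nguyen et al., 2024). The names `quality_weighted_vs` and `scores` are hypothetical.

```python
import numpy as np

def vendi_score(K, q=1.0, tol=1e-12):
    """Order-q Vendi Score (finite q) of a PSD similarity matrix K."""
    lam = np.linalg.eigvalsh(np.asarray(K, dtype=float) / np.trace(K))
    lam = lam[lam > tol]
    if q == 1:
        return float(np.exp(-np.sum(lam * np.log(lam))))
    return float(np.sum(lam ** q) ** (1.0 / (1.0 - q)))

def quality_weighted_vs(K, scores, q=1.0):
    """Assumed form: (mean quality) x (Vendi Score of the same items)."""
    return float(np.mean(scores)) * vendi_score(K, q)

# Three mutually dissimilar candidates with unequal quality.
K = np.eye(3)
scores = np.array([0.2, 0.4, 0.6])
print(quality_weighted_vs(K, scores))   # mean quality 0.4 times VS = 3
```

Under this form a batch scores well only if it is both diverse and of high average quality; either factor alone cannot compensate for the other being near its minimum.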

Computational Biology and Genomics

  • Epidemiology: VS quantifies the diversity of viral populations in time-resolved sequence data and detects emerging low-diversity clusters indicative of new variants. It is particularly effective for unsupervised, reference-free tracking in large-scale surveillance (Nielsen et al., 26 Sep 2025).
  • Protein/materials universe analysis: The Vendiscope applies VS with learned weighting to entire scientific datasets, quantifying rarity and identifying near-duplicate and high-diversity instances at scale (Pasarkar et al., 15 Feb 2025).

Generative Model Evaluation

  • Conditional and Information-Vendi: For generative models conditioned on prompts, VS has been extended to decompose observed diversity into model-induced vs. prompt-induced components, enabling precise analysis of text-to-image, image-to-text, and video generators (Jalali et al., 2024).

OOD Detection

  • Vendi Novelty Score (VNS): Measures the increase in VS when a test sample is added to the in-distribution set. VNS achieves state-of-the-art OOD detection using only samples and similarities, avoiding density estimation (Pasarkar et al., 10 Feb 2026).
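The idea of scoring novelty by the change in VS can be sketched as follows. This is an illustration of the description above only: it takes the plain difference of order-1 scores under a cosine kernel, which may differ from the exact definition used by Pasarkar et al.; `vendi_novelty` and the toy data are assumptions.

```python
import numpy as np

def cosine_kernel(X):
    X = np.asarray(X, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X @ X.T

def vendi_score(K, tol=1e-12):
    """Order-1 Vendi Score of a PSD similarity matrix K."""
    lam = np.linalg.eigvalsh(K / np.trace(K))
    lam = lam[lam > tol]
    return float(np.exp(-np.sum(lam * np.log(lam))))

def vendi_novelty(x, X_ref):
    """Assumed form: VS(reference set plus x) minus VS(reference set)."""
    augmented = np.vstack([X_ref, x])
    return vendi_score(cosine_kernel(augmented)) - vendi_score(cosine_kernel(X_ref))

# In-distribution reference set spanning two directions.
X_ref = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]], dtype=float)
x_in = np.array([1.0, 0.0, 0.0])    # duplicates a reference direction
x_out = np.array([0.0, 0.0, 1.0])   # a direction the reference set never covers

print(vendi_novelty(x_in, X_ref), vendi_novelty(x_out, X_ref))
```

The out-of-distribution point adds a new eigen-direction and raises the score, while the in-distribution duplicate only unbalances the existing spectrum.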

5. Approximation Methods and Convergence

Truncated and Approximated Versions

  • For large $n$, the full spectrum is expensive to compute, and the score may not converge quickly (especially under infinite-dimensional kernels such as RBF/Gaussian).
  • The $t$-truncated Vendi Score uses just the top $t$ eigenvalues, requiring only $O(t)$ samples for convergence. Efficient approximations via Nyström and FKEA random-feature methods concentrate tightly around the truncated statistic, with precise finite-sample error bounds (Ospanov et al., 2024).
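A truncated score can be sketched as below. The renormalization is an assumption: the spectral mass outside the top $t$ eigenvalues is spread uniformly over the retained ones so that they again sum to 1, and other conventions are possible. The function names are hypothetical.

```python
import numpy as np

def _shannon_vendi(lam, tol=1e-12):
    lam = lam[lam > tol]
    return float(np.exp(-np.sum(lam * np.log(lam))))

def vendi_score(K):
    """Full order-1 Vendi Score, for comparison."""
    return _shannon_vendi(np.linalg.eigvalsh(K / np.trace(K)))

def truncated_vendi_score(K, t):
    """Order-1 score from the t largest eigenvalues of K / tr(K)."""
    lam = np.linalg.eigvalsh(K / np.trace(K))
    top = np.clip(np.sort(lam)[::-1][:t], 0.0, None)   # t largest, non-negative
    top = top + (1.0 - top.sum()) / t                  # assumed: spread leftover mass evenly
    return _shannon_vendi(top)

rng = np.random.default_rng(2)
X = rng.normal(size=(12, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)
K = X @ X.T

print(truncated_vendi_score(K, t=3), vendi_score(K))
```

By construction the truncated score never exceeds $t$, and with $t = n$ it coincides with the full score.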

Empirical findings

  • On finite-dimensional kernels, VS converges rapidly with $n = O(d)$, where $d$ is feature dimension.
  • On infinite-dimensional kernels, convergence requires truncation and approximation. Nyström and random features provide accurate, scalable solutions.
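For a Gaussian kernel, one random-feature route replaces the $n \times n$ kernel matrix with the covariance of random Fourier features, in the spirit of the random-feature approximations mentioned above. The sketch is an illustration under stated assumptions, not the cited algorithm: the bandwidth `sigma`, the feature count `r`, and both function names are chosen here for the example.

```python
import numpy as np

def _shannon_vendi(lam, tol=1e-12):
    lam = lam[lam > tol]
    return float(np.exp(-np.sum(lam * np.log(lam))))

def exact_gaussian_vendi(X, sigma=1.0):
    """Exact order-1 score with k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    return _shannon_vendi(np.linalg.eigvalsh(K / np.trace(K)))

def rff_gaussian_vendi(X, sigma=1.0, r=128, seed=0):
    """Approximate score from a 2r x 2r random-feature covariance matrix."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(d, r))          # Gaussian-kernel frequencies
    Z = X @ W
    Phi = np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(r)    # each row has unit norm
    C = Phi.T @ Phi / n                                     # trace(C) = 1
    return _shannon_vendi(np.linalg.eigvalsh(C))

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
exact, approx = exact_gaussian_vendi(X), rff_gaussian_vendi(X)
print(exact, approx)
```

The eigenproblem solved is $2r \times 2r$ regardless of $n$, so the cost grows linearly in the number of samples; the quality of the approximation depends on `r` and on the bandwidth.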

6. Pitfalls, Limitations, and Implementation Guidance

  • Kernel dependence: The selected similarity function fundamentally determines what diversity is measured.
  • Computational scaling: Exact VS is $O(n^3)$; scalable SVD/approximation methods are advised for large $n$.
  • Reference-freeness: VS measures internal diversity; it must be paired with a quality or precision metric to avoid high-diversity but low-quality (e.g., random noise) artifacts (Friedman et al., 2022).
  • Sensitivity parameter tuning: The $q$ parameter must be selected according to application needs; $q = 1$ is generally robust, while $q < 1$ suits rare species and $q > 1$ suits memorization/duplication detection (Pasarkar et al., 2023, Nguyen et al., 2024).
  • Sparse or imbalanced data: Imbalanced prevalence affects sensitivity; in extreme cases, the probability-weighted form of VS is recommended (Pasarkar et al., 15 Feb 2025).
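A probability-weighted variant can be sketched by replacing $K/n$ with $\mathrm{diag}(\sqrt{p})\, K\, \mathrm{diag}(\sqrt{p})$ for a probability vector $p$ over items. This is assumed here to be the form the text refers to; the function name and the toy data are hypothetical.

```python
import numpy as np

def weighted_vendi_score(K, p, tol=1e-12):
    """Order-1 score of diag(sqrt(p)) K diag(sqrt(p)); assumes k(x_i, x_i) = 1."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                                   # make sure p sums to 1
    s = np.sqrt(p)
    K_p = s[:, None] * np.asarray(K, dtype=float) * s[None, :]
    lam = np.linalg.eigvalsh(K_p)
    lam = lam[lam > tol]
    return float(np.exp(-np.sum(lam * np.log(lam))))

# Items A, A, B with uniform weights ...
K_aab = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
v_sample = weighted_vendi_score(K_aab, [1 / 3, 1 / 3, 1 / 3])

# ... versus the distinct items A, B carrying prevalences 2/3 and 1/3.
K_ab = np.eye(2)
v_types = weighted_vendi_score(K_ab, [2 / 3, 1 / 3])

print(v_sample, v_types)   # the two descriptions give the same score
```

With uniform $p$ the weighted matrix reduces to $K/n$, so the standard score is recovered; non-uniform $p$ lets known prevalences enter without physically replicating items.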

7. Extensions and Theoretical Innovations


Key References
