Determinantal Point Processes
- Determinantal Point Processes are mathematical models that use determinants of submatrices of a positive semi-definite kernel to encode global negative dependence and repulsion.
- They enable efficient algorithms for marginalization, conditioning, and sampling through techniques like eigendecomposition and low-rank approximations.
- DPPs are applied in machine learning subset selection, randomized numerical linear algebra, and statistical physics, offering diversity and tractable inference.
A determinantal point process (DPP) is a mathematically rigorous model of a random subset of a ground set—finite, countable, or continuous—whose probabilities are governed by determinants of submatrices of a positive semi-definite kernel. DPPs are central objects in probability, statistical physics, and machine learning due to their ability to encode global negative dependence, thus promoting diversity and repulsion among sampled points. The determinant structure yields analytic tractability, enabling efficient algorithms for marginalization, conditioning, exact or approximate sampling, and learning. DPP kernels admit flexible geometric interpretations, and the theory connects to diverse domains ranging from random matrix spectra to randomized numerical linear algebra and stochastic spatial models.
1. Mathematical Foundations and Definitions
Let $\mathcal{Y} = \{1, 2, \dots, N\}$ be a finite ground set. A DPP on $\mathcal{Y}$ is specified by a positive semi-definite (PSD) matrix $K \in \mathbb{R}^{N \times N}$ with eigenvalues in $[0, 1]$, called the marginal kernel. The process is defined by
$$\mathbb{P}(A \subseteq \mathbf{Y}) = \det(K_A)$$
for any $A \subseteq \mathcal{Y}$, where $K_A = [K_{ij}]_{i,j \in A}$ denotes the principal submatrix of $K$ indexed by $A$ (Kulesza et al., 2012).
Alternatively, a DPP can be parameterized as an L-ensemble: for a PSD matrix $L$, define
$$\mathbb{P}(\mathbf{Y} = A) = \frac{\det(L_A)}{\det(L + I)}$$
with $L_A$ the principal submatrix for $A \subseteq \mathcal{Y}$. The kernel and L-ensemble parametrizations are related by $K = L(L + I)^{-1} = I - (L + I)^{-1}$; conversely, $L = K(I - K)^{-1}$ for $K$ with eigenvalues in $[0, 1)$ (Kulesza et al., 2012).
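The two parametrizations above can be checked numerically by brute-force enumeration on a small ground set. The following is a minimal sketch (assuming NumPy; not code from the cited papers): it verifies that the L-ensemble probabilities normalize, that marginals match $\det(K_A)$, and that pairwise negative dependence holds.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N = 4
M = rng.normal(size=(N, N))
L = M @ M.T                             # any PSD matrix defines an L-ensemble
K = L @ np.linalg.inv(L + np.eye(N))    # marginal kernel K = L(L + I)^{-1}
Z = np.linalg.det(L + np.eye(N))        # normalizer det(L + I)

def p_exact(A):
    """P(Y = A) = det(L_A) / det(L + I); the empty minor has determinant 1."""
    idx = list(A)
    return np.linalg.det(L[np.ix_(idx, idx)]) / Z

subsets = [frozenset(s) for r in range(N + 1)
           for s in itertools.combinations(range(N), r)]
assert np.isclose(sum(p_exact(A) for A in subsets), 1.0)

# Marginal P(A ⊆ Y) = det(K_A), checked against brute-force enumeration.
A = frozenset({0, 2})
marg_enum = sum(p_exact(S) for S in subsets if A <= S)
marg_det = np.linalg.det(K[np.ix_(list(A), list(A))])
assert np.isclose(marg_enum, marg_det)

# Negative dependence: P({i, j} ⊆ Y) ≤ P(i ∈ Y) P(j ∈ Y).
assert marg_det <= K[0, 0] * K[2, 2] + 1e-12
```

Enumeration costs $2^N$ and is only a sanity check; the point of the determinantal structure is that the marginal on the left of the last assertion never requires it.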
On continuous spaces (e.g., $\mathbb{R}^d$), a DPP is a simple point process whose joint intensities
$$\rho_n(x_1, \dots, x_n) = \det\bigl[K(x_i, x_j)\bigr]_{i,j=1}^{n}$$
are specified by a Hermitian, locally trace-class integral kernel $K$ (Katori, 2020, Katori et al., 2019).
A hallmark property is that DPPs model global negative dependence: for $i \neq j$,
$$\mathbb{P}(\{i, j\} \subseteq \mathbf{Y}) = K_{ii} K_{jj} - K_{ij} K_{ji} \leq \mathbb{P}(i \in \mathbf{Y})\, \mathbb{P}(j \in \mathbf{Y}),$$
ensuring “repulsion” (Kulesza et al., 2012).
2. Kernel Structure, Generalizations, and Extended Models
Quality-Similarity Decomposition
For applications, $L$ is often written as
$$L_{ij} = q_i \, \phi_i^\top \phi_j \, q_j,$$
where $q_i > 0$ is a “quality” score and $\phi_i \in \mathbb{R}^D$, $\|\phi_i\| = 1$, is a normalized feature embedding encoding similarity, so that $L = B^\top B$ with columns $B_i = q_i \phi_i$ (Kulesza et al., 2012).
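A short sketch of this construction (NumPy; the variable names are illustrative): building $L = B^\top B$ from quality scores and unit-norm features guarantees positive semi-definiteness, and the diagonal reduces to squared qualities.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 6, 3
phi = rng.normal(size=(D, N))
phi /= np.linalg.norm(phi, axis=0)     # normalize similarity features: ||φ_i|| = 1
q = rng.uniform(0.5, 2.0, size=N)      # per-item quality scores q_i > 0

B = phi * q                            # columns B_i = q_i φ_i
L = B.T @ B                            # L_ij = q_i φ_i^T φ_j q_j, PSD by construction

assert np.all(np.linalg.eigvalsh(L) > -1e-10)   # PSD
assert np.allclose(np.diag(L), q**2)            # L_ii = q_i^2 since ||φ_i|| = 1
```

Under this decomposition, $\det(L_A)$ is the squared volume spanned by the vectors $\{q_i \phi_i\}_{i \in A}$, which is how quality trades off against diversity.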
Fixed-Size and Projection DPPs
Conditioning a DPP on $|\mathbf{Y}| = k$ yields a $k$-DPP: $\mathbb{P}^k_L(A) = \det(L_A) / \sum_{|B| = k} \det(L_B)$ for $|A| = k$. When the marginal kernel $K$ is an orthogonal projector of rank $k$, the process is a “projection” DPP; all samples almost surely have size $k$ (Kulesza et al., 2012).
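The $k$-DPP normalizer $\sum_{|B| = k} \det(L_B)$ equals the elementary symmetric polynomial $e_k(\lambda_1, \dots, \lambda_N)$ of the eigenvalues of $L$, computable by a simple recursion. A sketch (NumPy; enumeration is included only to verify the identity):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
N, k = 5, 3
M = rng.normal(size=(N, N))
L = M @ M.T
lam = np.linalg.eigvalsh(L)

def esp(lam, k):
    """Elementary symmetric polynomials e_0..e_k via the O(Nk) recursion
    E[j] += λ_n E[j-1], updating j in descending order."""
    E = np.zeros(k + 1)
    E[0] = 1.0
    for l in lam:
        for j in range(k, 0, -1):
            E[j] += l * E[j - 1]
    return E

Zk = esp(lam, k)[k]
brute = sum(
    np.linalg.det(L[np.ix_(list(A), list(A))])
    for A in itertools.combinations(range(N), k)
)
assert np.isclose(Zk, brute)   # e_k(λ) = sum of k x k principal minors
```

The same recursion underlies the fixed-cardinality eigenvector-selection step in $k$-DPP sampling discussed below in Section 3.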
Non-Symmetric Kernels
The standard theory assumes $K$ and $L$ symmetric, but DPPs extend to nonsymmetric kernels whose principal minors are all non-negative ($P_0$-matrices). Probabilities and determinantal inclusion formulas extend following the theory of $P_0$-matrices, with necessary and sufficient conditions for a valid nonsymmetric $K$ provided in terms of generalized principal minor inequalities (Arnaud, 2024).
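A brute-force $P_0$ check on toy kernels, as a sketch of the defining condition (illustrative only; it scales exponentially and is not how such conditions are verified in the cited work):

```python
import itertools
import numpy as np

def is_p0(K, tol=1e-10):
    """Check that every principal minor of K is non-negative (P0-matrix)."""
    n = K.shape[0]
    return all(
        np.linalg.det(K[np.ix_(list(idx), list(idx))]) >= -tol
        for r in range(1, n + 1)
        for idx in itertools.combinations(range(n), r)
    )

# Nonsymmetric but valid: the skew-symmetric part only increases the 2x2 minor.
K_ok = np.array([[0.5,  0.3],
                 [-0.3, 0.5]])
# Symmetric but invalid: det = 1 - 9 < 0, so pairs would get negative "probability".
K_bad = np.array([[1.0, 3.0],
                  [3.0, 1.0]])
assert is_p0(K_ok) and not is_p0(K_bad)
```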
Extended L-Ensembles and Partial-Projection DPPs
Extended L-ensembles generalize classical L-ensembles, enabling DPPs whose kernels have eigenvalues at $1$, corresponding to “fixed” points always included in the sample. Partial-projection DPPs, defined in this way, appear as flat or singular limits of L-ensembles and play a key role in universality phenomena for DPPs (Barthelmé et al., 2020).
3. Algorithms: Inference, Sampling, and Learning
Marginals, Conditioning, and MAP Inference
Marginal inclusion probabilities are diagonal entries: $\mathbb{P}(i \in \mathbf{Y}) = K_{ii}$; for pairs, $\mathbb{P}(\{i, j\} \subseteq \mathbf{Y}) = K_{ii} K_{jj} - K_{ij} K_{ji}$. Efficient computation is enabled by eigendecomposition of $L$ or by dual kernel techniques. Conditioning, restriction, and complement processes admit analytic transformations of $K$ or $L$, preserving the DPP structure (Kulesza et al., 2012).
MAP (maximum a posteriori) subset selection is NP-hard, but greedy approximations enable near-optimal solutions, leveraging the submodularity of $\log \det(L_A)$ (Kulesza et al., 2012).
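A minimal sketch of such a greedy scheme (NumPy; an illustrative implementation, not the papers' code): repeatedly add the item with the largest positive gain in $\log \det(L_A)$, stopping when no item improves the objective.

```python
import numpy as np

def greedy_map(L):
    """Greedy maximization of log det(L_A); stops at the first non-positive gain."""
    N = L.shape[0]
    selected, cur_logdet = [], 0.0       # log det of the empty minor is 0
    while True:
        best_gain, best_i, best_logdet = 0.0, None, None
        for i in range(N):
            if i in selected:
                continue
            idx = selected + [i]
            sign, ld = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign <= 0:                # singular minor: item adds no volume
                continue
            gain = ld - cur_logdet
            if gain > best_gain:
                best_gain, best_i, best_logdet = gain, i, ld
        if best_i is None:
            return selected
        selected.append(best_i)
        cur_logdet = best_logdet

# Items 0 and 1 are identical; greedy keeps only one of them plus item 2.
B = np.array([[2.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])
sel = greedy_map(B.T @ B)
assert sorted(sel) == [0, 2]
```

Each step here recomputes a determinant from scratch for clarity; practical implementations update a Cholesky factor incrementally instead.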
Exact and Approximate Sampling
Sampling from a DPP proceeds by (a) eigendecomposition: select each eigenvector $v_n$ of $L$ independently with probability $\lambda_n / (\lambda_n + 1)$, then (b) sequentially sample items according to squared volumes/projections (Kulesza et al., 2012, Li et al., 2015). For $k$-DPPs, eigenvector selection is conditioned on fixed cardinality, using elementary symmetric polynomial recursions (cost $O(Nk)$). For scalability, dual sampling (projecting onto low-rank subspaces) or coreset-based approximate samplers achieve near-linear time at small total-variation error, as in CoreDpp (Li et al., 2015).
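The two-phase spectral sampler can be sketched as follows (NumPy; an illustrative implementation of the standard scheme, with an empirical check of one marginal against $K_{ii}$):

```python
import numpy as np

def sample_dpp(L, rng):
    """Exact DPP sample: eigenvector selection, then sequential projection sampling."""
    lam, V = np.linalg.eigh(L)
    # Phase (a): include eigenvector n independently with probability λ_n / (λ_n + 1).
    keep = rng.random(lam.shape[0]) < lam / (lam + 1.0)
    Vk = V[:, keep]                          # orthonormal basis, N x k
    sample = []
    while Vk.shape[1] > 0:
        # Phase (b): P(pick i) ∝ squared norm of row i of the current basis.
        p = np.sum(Vk**2, axis=1)
        i = int(rng.choice(len(p), p=p / p.sum()))
        sample.append(i)
        # Eliminate coordinate i: remove a column with nonzero i-th entry,
        # zero out coordinate i in the rest, then re-orthonormalize.
        j = int(np.argmax(np.abs(Vk[i, :])))
        v = Vk[:, j].copy()
        Vk = np.delete(Vk, j, axis=1)
        if Vk.shape[1] == 0:
            break
        Vk = Vk - np.outer(v, Vk[i, :] / v[i])
        Vk, _ = np.linalg.qr(Vk)
    return sorted(sample)

rng = np.random.default_rng(42)
N = 4
M = rng.normal(size=(N, N))
L = M @ M.T
K = L @ np.linalg.inv(L + np.eye(N))

freq = np.mean([0 in sample_dpp(L, rng) for _ in range(4000)])
assert abs(freq - K[0, 0]) < 0.05        # empirical marginal ≈ K_00
```

The eigendecomposition is a one-time $O(N^3)$ cost; each sample then costs $O(N k^2)$-type work in the projection phase.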
For continuous spaces, the dual DPP sampler, low-rank approximations (Nyström, random Fourier features), and Gibbs sampling (for $k$-DPPs) enable tractable inference (Affandi et al., 2013).
Parameter Learning
Learning $K$ or $L$ is nonconvex and computationally demanding. For feature-parametric models, the log-likelihood is concave in the parameters of a log-linear quality model $q_i$, enabling efficient convex optimization (Kulesza et al., 2012). For unconstrained $L$, EM approaches in the eigenbasis (updating eigenvalues and vectors) guarantee monotonicity and yield consistent estimators; method-of-moments estimators also provide strongly consistent and asymptotically normal initializations (Gillenwater et al., 2014, Gouriéroux et al., 2025).
Nonparametric learning for continuous DPPs is feasible via representer theorems in RKHS, reducing the infinite-dimensional MLE to a finite-dimensional convex program, solved via fixed-point/Picard-type iteration and trace-penalization for regularization (Fanuel et al., 2021).
4. Applications and Model Extensions
Machine Learning and Subset Selection
DPPs are natural models for subset selection tasks requiring diversity—e.g., extractive summarization, diverse recommendations, active learning, sensor placement, and multiclass image analysis. The DPP framework offers exact inference, unbiased estimators, and negative correlation guarantees, outperforming or complementing i.i.d. or submodular randomization approaches (Kulesza et al., 2012, Dereziński et al., 2020).
Statistical Modelling and Bayesian Priors
DPPs serve as repulsive Bayesian priors in mixture and feature allocation models, promoting interpretable, parsimonious, and non-redundant latent structure (e.g., for MRI segmentation, protein/gene clustering). Posterior inference is conducted via reversible-jump MCMC, exploiting DPP priors for latent feature repulsion (Xu et al., 2015).
Structured, Sequential, and Markov DPPs
Structured DPPs (SDPPs) apply to combinatorial structures (timelines, pose configurations), exploiting dual representations for tractable inference in high dimensions (Kulesza et al., 2012). Markov DPPs (M-DPPs) enable sequential modeling of diverse subsets over time, maintaining DPP marginals and tractable union properties, with sampling and online learning algorithms for dynamic subset selection (Affandi et al., 2012).
Deep and Low-Rank DPPs
Scalability to large-scale discrete domains is enabled by low-rank kernel factorizations ($L = B^\top B$ with $B \in \mathbb{R}^{D \times N}$, $D \ll N$), mixtures of low-rank DPPs, and tensor/Kronecker decompositions. Deep DPPs parameterize $L$ via deep neural networks, enabling nonlinear, metadata-aware kernels and end-to-end learning, with empirical superiority over shallow DPPs in recommender tasks (Gartrell et al., 2018, Gartrell et al., 2016, Mariet et al., 2016).
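The computational payoff of the low-rank factorization follows from the Sylvester determinant identity $\det(I_N + B^\top B) = \det(I_D + B B^\top)$: the normalizer is computed in the small $D \times D$ dual space. A minimal check (NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)
D, N = 3, 50                       # rank D much smaller than ground set size N
B = rng.normal(size=(D, N))
L = B.T @ B                        # N x N low-rank L-ensemble kernel

# det(I_N + B^T B) = det(I_D + B B^T): O(N D^2) instead of O(N^3).
logZ_full = np.linalg.slogdet(np.eye(N) + L)[1]
logZ_dual = np.linalg.slogdet(np.eye(D) + B @ B.T)[1]
assert np.isclose(logZ_full, logZ_dual)
```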
Universality, Random Matrix Theory, and Physics
DPPs underlie classical ensembles in random matrix theory (e.g., Hermite, Laguerre, Ginibre), with explicit construction via orthogonal polynomials and robust scaling limits (sine, Airy, Bessel, Euclidean, Heisenberg kernels) (Katori, 2020, Katori et al., 2019). Universal limiting behavior under kernel flattening or perturbative limits yields projection or partial-projection DPPs, governed solely by kernel smoothness order, with implications for universality classes of spatial statistics and random geometry (Barthelmé et al., 2020).
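For concreteness, two of these universal scaling-limit kernels have classical closed forms (bulk and soft edge of the Hermite/GUE ensemble, respectively):
$$K_{\sin}(x, y) = \frac{\sin \pi (x - y)}{\pi (x - y)}, \qquad K_{\mathrm{Ai}}(x, y) = \frac{\mathrm{Ai}(x)\,\mathrm{Ai}'(y) - \mathrm{Ai}'(x)\,\mathrm{Ai}(y)}{x - y},$$
where $\mathrm{Ai}$ denotes the Airy function.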
Randomized Numerical Linear Algebra
In randomized linear algebra, DPPs yield unbiased and minimum-variance estimators for regression, near-optimal Nyström approximations, and sketching methods, outperforming i.i.d. leverage-score sampling in sample efficiency and statistical variance, at moderate additional computational cost (Dereziński et al., 2020).
5. Diversity, Rate-Distortion Theory, and Sampling Regimes
DPPs maximize diversity in subset selection, formalized via squared volume under the kernel. Recent developments elucidate a connection to rate-distortion theory: the rate-distortion function of a data matrix is governed by a log-determinant of its Gram matrix, and the log-sum of DPP diversity scores underlies the rate-distortion objective (Chen et al., 2023).
Empirically, DPP-based diversity gains saturate at a “phase transition”—once the sample cardinality reaches the rank of the data, further samples yield diminishing incremental diversity. This insight motivates bi-modal approaches (e.g., RD-DPP): initial samples are DPP-maximal for diversity, while beyond the phase transition, uncertainty-driven or task-specific selection yields further performance gains. RD-DPP achieves superior generalization and test accuracy in low-budget regimes and cross-task generalization compared to pure DPP, random, uncertainty, or coreset baselines (Chen et al., 2023).
6. Operator Theory, Duality, and Advanced Structures
The operator-theoretic perspective connects DPPs to partial isometries between Hilbert spaces. Given a locally Hilbert–Schmidt partial isometry $W : H_1 \to H_2$, a pair of dual DPPs arises on the respective base spaces via the kernels $W^* W$ and $W W^*$. Fredholm determinant duality implies equality of probabilities for matching measurable sets under each DPP (Katori et al., 2019, Katori, 2020).
This construction unifies finite and infinite DPPs, enables duality relations between determinantal ensembles (continuous-discrete, e.g., Hermite/Laguerre vs dual ensembles), and underpins scaling limit results yielding universal DPP kernels (sine, Airy, Bessel, Ginibre families) in one and higher dimensions.
7. Open Problems and Future Directions
Key directions include scalable nonparametric DPP learning for large/continuous domains (Fanuel et al., 2021), further development of efficient approximate samplers and randomized algorithms for high-dimensional and structured tasks, characterizing universality beyond isotropic settings (Barthelmé et al., 2020), exploring attractive and coupling structures in nonsymmetric DPPs (Arnaud, 2024), and integrating DPP priors in hierarchical Bayesian and deep learning pipelines. The intersection with spatial statistics, optimal experimental design, and robust approximation of stochastic processes continues to generate active theoretical and applied research.