
Pitman-Yor Process

Updated 30 November 2025
  • Pitman-Yor process is a two-parameter stochastic model that generalizes the Dirichlet process by modeling heavy-tailed, power-law cluster sizes.
  • It offers multiple representations—including stick-breaking, Chinese restaurant process, and EPPF—facilitating practical Bayesian nonparametric analysis.
  • Its flexibility makes it well suited to applications such as density estimation and language modeling, where cluster-size or frequency distributions empirically follow power laws.

The Pitman-Yor process (PYP) is a fundamental two-parameter family of discrete random probability measures that generalizes the Dirichlet process by inducing power-law tails in the sizes of partition blocks. It arises in Bayesian nonparametrics, combinatorial stochastic processes, genetics, and statistical mechanics. The PYP admits multiple representations: stick-breaking construction, exchangeable partition probability functions (EPPFs), Chinese restaurant process (CRP), and subordinator embedding. Its flexibility in modeling heavy-tailed distributions makes it a canonical prior for applications where the cluster-size distribution is empirically found to follow power laws.

1. Mathematical Definition and Core Representations

Let $H$ be a nonatomic base measure on a space $\Theta$, $d \in [0,1)$ a discount parameter, and $\alpha > -d$ a concentration parameter. The Pitman-Yor process is the law of a random discrete probability measure $G \sim \mathrm{PY}(d, \alpha, H)$ whose ranked weights follow the two-parameter Poisson-Dirichlet distribution $\operatorname{PD}(d, \alpha)$ and whose atoms are drawn i.i.d. from $H$ (Chatzis et al., 2012, Lim et al., 2016).

The canonical stick-breaking representation is
$$V_k \sim \mathrm{Beta}(1-d,\ \alpha + k d), \qquad \pi_k = V_k \prod_{l=1}^{k-1} (1 - V_l), \qquad \theta_k \overset{\text{iid}}{\sim} H,$$

$$G = \sum_{k=1}^{\infty} \pi_k\,\delta_{\theta_k}.$$

If $d=0$, $\mathrm{PY}(0, \alpha, H)$ coincides with the Dirichlet process $\mathrm{DP}(\alpha, H)$ (Scricciolo, 2012, Rigon et al., 2020).
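
The construction above translates directly into a truncated sampler. Below is a minimal sketch (not taken from the cited papers) that approximates a draw of $G \sim \mathrm{PY}(d, \alpha, H)$ by cutting the stick-breaking sequence at a fixed level; the truncation level K and the standard-normal base measure are illustrative choices.

```python
import numpy as np

def sample_py_stick_breaking(d, alpha, K=1000, base_sampler=np.random.standard_normal, rng=None):
    """Truncated stick-breaking draw from PY(d, alpha, H).

    d            : discount parameter, 0 <= d < 1
    alpha        : concentration parameter, alpha > -d
    K            : truncation level (finite approximation of the infinite sum)
    base_sampler : function returning K i.i.d. draws from the base measure H
    """
    rng = np.random.default_rng(rng)
    k = np.arange(1, K + 1)
    v = rng.beta(1.0 - d, alpha + k * d)          # V_k ~ Beta(1 - d, alpha + k d)
    log_rem = np.concatenate(([0.0], np.cumsum(np.log1p(-v[:-1]))))
    weights = v * np.exp(log_rem)                 # pi_k = V_k * prod_{l<k} (1 - V_l)
    atoms = base_sampler(K)                       # theta_k ~ H
    return weights, atoms

# Example: a larger discount d makes the weights decay more slowly (heavier tail).
w, theta = sample_py_stick_breaking(d=0.5, alpha=1.0, K=2000, rng=0)
print(w[:5], w.sum())  # leading weights and the total mass captured by the truncation
```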

2. Partition Structure and Predictive Rules

A sample $\theta_1, \ldots, \theta_N \mid G \overset{\text{iid}}{\sim} G$ induces a random partition with the following predictive probabilities for the $M$-th draw:

  • If current clusters are labeled $\{\phi_c\}$ with counts $n_c$:
$$P(\theta_{M} = \phi_{c} \mid \theta_1, \ldots, \theta_{M-1}, d, \alpha) = \frac{n_c - d}{(M-1)+\alpha},$$

$$P(\theta_{M}\ \text{new} \mid \theta_1, \ldots, \theta_{M-1}, d, \alpha) = \frac{\alpha + d\, t}{(M-1)+\alpha}.$$

Here, $t$ is the current number of clusters (Chatzis et al., 2012, Lawless et al., 2018, Lim et al., 2016).
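
These predictive rules can be applied sequentially to simulate a partition directly, without instantiating $G$. A minimal sketch of this Pitman-Yor Chinese restaurant process, not taken from the cited papers:

```python
import numpy as np

def sample_py_crp_partition(n, d, alpha, rng=None):
    """Simulate cluster labels for n customers under the PY(d, alpha) Chinese restaurant process."""
    rng = np.random.default_rng(rng)
    labels = np.empty(n, dtype=int)
    counts = []                                   # counts[c] = number of customers at table c
    for m in range(n):
        t = len(counts)                           # current number of occupied tables
        # Unnormalized weights: existing table c gets (n_c - d), a new table gets (alpha + d * t).
        probs = np.array([nc - d for nc in counts] + [alpha + d * t])
        c = rng.choice(t + 1, p=probs / probs.sum())
        if c == t:
            counts.append(1)
        else:
            counts[c] += 1
        labels[m] = c
    return labels

labels = sample_py_crp_partition(n=1000, d=0.5, alpha=1.0, rng=0)
print("number of clusters:", labels.max() + 1)    # grows roughly like n**d when d > 0
```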

The EPPF is
$$p(n_1, \ldots, n_k) = \frac{\prod_{i=1}^{k-1} (\alpha + i d)}{(\alpha + 1)_{n-1}} \prod_{j=1}^{k} (1-d)_{n_j-1},$$
where $(x)_n = x(x+1)\cdots(x+n-1)$ is the rising factorial and $n = \sum_j n_j$ (Feng et al., 2016, Roy, 2014).
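
The EPPF is straightforward to evaluate on the log scale with log-gamma functions. A minimal sketch (illustrative helper names, not from the cited references):

```python
import numpy as np
from scipy.special import gammaln

def log_rising_factorial(x, m):
    """log (x)_m = log Gamma(x + m) - log Gamma(x)."""
    return gammaln(x + m) - gammaln(x)

def py_log_eppf(block_sizes, d, alpha):
    """Log EPPF of a partition with the given block sizes under PY(d, alpha)."""
    block_sizes = np.asarray(block_sizes, dtype=float)
    k, n = len(block_sizes), block_sizes.sum()
    log_num = np.sum(np.log(alpha + d * np.arange(1, k)))                   # prod_{i<k} (alpha + i d)
    log_den = log_rising_factorial(alpha + 1.0, n - 1.0)                    # (alpha + 1)_{n-1}
    log_blocks = np.sum(log_rising_factorial(1.0 - d, block_sizes - 1.0))   # prod_j (1 - d)_{n_j - 1}
    return log_num - log_den + log_blocks

print(py_log_eppf([3, 1, 1], d=0.5, alpha=1.0))
```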

For $d=0$ the number of clusters grows logarithmically in $n$ (Dirichlet process); for $d>0$ the expected number of clusters grows as $n^d$ (power law) (Feng et al., 2016, Franssen et al., 2021, Franssen et al., 2022).

3. Extensions: Kernel Pitman-Yor and Hierarchical Constructions

Kernel Pitman-Yor Process (KPYP)

To model spatial or temporal dependencies, the KPYP modifies the stick-breaking construction by letting each stick's discount parameter depend on an external predictor $x$: $d_k(x) = k(x, \varphi_k)$, where $k(\cdot, \cdot)$ is a kernel (e.g., Gaussian/RBF) and $\{\varphi_k\}$ are latent cluster locations. The local mixing measure at $x$ is $G_x = \sum_{k=1}^{\infty} \pi_k(x)\,\delta_{\theta_k}$, where

$$V_k(x) \sim \mathrm{Beta}\bigl(1 - d_k(x),\ \alpha + k\, d_k(x)\bigr), \qquad \pi_k(x) = V_k(x) \prod_{l<k}\bigl(1 - V_l(x)\bigr).$$

This results in predictor-dependent random probability measures, enabling further spatial/temporal adaptivity (Chatzis et al., 2012).
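
As a rough illustration of how predictor-dependent discounts reshape the weights, here is a minimal sketch with an assumed Gaussian/RBF kernel and illustrative hyperparameters; it is a truncated toy construction, not the authors' implementation.

```python
import numpy as np

def kpyp_weights(x, phi, alpha, lengthscale=1.0, rng=None):
    """Truncated KPYP stick weights pi_k(x) with an RBF-kernel discount d_k(x).

    x           : predictor value at which the local mixing measure is evaluated
    phi         : latent cluster locations, one per stick (length = truncation level)
    alpha       : concentration parameter
    lengthscale : RBF kernel lengthscale (illustrative choice)
    """
    rng = np.random.default_rng(rng)
    K = len(phi)
    # Kernel-dependent discount d_k(x) = exp(-(x - phi_k)^2 / (2 * lengthscale^2)) in (0, 1].
    d = np.exp(-0.5 * ((x - phi) / lengthscale) ** 2)
    d = np.clip(d, 0.0, 0.999)                    # keep the Beta(1 - d_k, ...) parameters valid
    k = np.arange(1, K + 1)
    v = rng.beta(1.0 - d, alpha + k * d)
    log_rem = np.concatenate(([0.0], np.cumsum(np.log1p(-v[:-1]))))
    return v * np.exp(log_rem)

phi = np.random.default_rng(0).uniform(-3, 3, size=200)
print(kpyp_weights(x=0.5, phi=phi, alpha=1.0)[:5])
```

The discount seen by stick $k$ thus varies with $x$, so the weight sequence, and hence the induced clustering, differs across predictor values.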

Hierarchical Pitman-Yor Processes

Deep hierarchies are formed by recursively drawing child measures from parent PYPs:
$$G_0 \sim \operatorname{PY}(d_0, \theta_0, H), \qquad G_j \mid G_0 \sim \operatorname{PY}(d_1, \theta_1, G_0).$$

This is central to hierarchical topic models and language models with power-law frequency behavior, and is the foundation for the Chinese restaurant franchise representation used in collapsed Gibbs-sampling algorithms (Lim et al., 2016, Roy, 2014).
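
A simple way to simulate such a hierarchy is to truncate the stick-breaking construction at each level and let each child draw its atoms from the (discrete) parent measure, so that atoms are shared across children. This is a minimal sketch under those assumptions, not an implementation of the Chinese restaurant franchise from the cited papers.

```python
import numpy as np

def py_sticks(d, alpha, K, rng):
    """Truncated PY stick-breaking weights (length K, summing to slightly less than 1)."""
    v = rng.beta(1.0 - d, alpha + d * np.arange(1, K + 1))
    log_rem = np.concatenate(([0.0], np.cumsum(np.log1p(-v[:-1]))))
    return v * np.exp(log_rem)

def hierarchical_py(d0, theta0, d1, theta1, K=500, n_children=3, rng=None):
    """Draw a root measure G0 ~ PY(d0, theta0, H) and children Gj ~ PY(d1, theta1, G0).

    The base measure H is taken to be a standard normal purely for illustration.
    """
    rng = np.random.default_rng(rng)
    root_w = py_sticks(d0, theta0, K, rng)
    root_atoms = rng.standard_normal(K)                       # atoms of G0, i.i.d. from H
    children = []
    for _ in range(n_children):
        w = py_sticks(d1, theta1, K, rng)
        idx = rng.choice(K, size=K, p=root_w / root_w.sum())  # child atoms drawn from the parent G0
        children.append((w, root_atoms[idx]))
    return (root_w, root_atoms), children

(root_w, root_atoms), children = hierarchical_py(0.3, 1.0, 0.5, 1.0, rng=0)
print(len(set(children[0][1])), "distinct parent atoms reused by the first child")
```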

4. Limit Theorems and Asymptotics

The law of large numbers holds: conditionally on $G$, the empirical measures $\mu_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$ of a sample from a PYP converge almost surely to $G$, whose mean measure is the base measure $H$.

For $d \to 0$, the Pitman-Yor process approaches the Dirichlet process.

Large deviations as $d \to 1$ exhibit phase transitions and non-Gaussian fluctuations. The first weight $P_1$ has limiting concentration for $\alpha \to 1$, and as $\alpha \to 0$, all mass concentrates at one atom (Feng et al., 2016).

Species sampling: For $\sigma > 0$ (the discount parameter, written $d$ above), a sample of size $n$ contains $K_n = O(n^\sigma)$ unique values, fundamentally changing cluster richness compared to the logarithmic behavior of the Dirichlet process (Franssen et al., 2021, Franssen et al., 2022).
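
For $d > 0$ the expected number of clusters has a closed form implied by the predictive rule of Section 2, $\mathbb{E}[K_n] = \frac{(\alpha+d)_n}{d\,(\alpha+1)_{n-1}} - \frac{\alpha}{d}$. The following sketch (not taken from the cited papers) evaluates it with log-gamma functions and shows the $n^\sigma$ growth numerically for $\sigma = d = 0.5$.

```python
import numpy as np
from scipy.special import gammaln

def expected_num_clusters(n, d, alpha):
    """E[K_n] = (alpha + d)_n / (d * (alpha + 1)_{n-1}) - alpha / d for PY(d, alpha), d > 0.

    (x)_m denotes the rising factorial, computed via log-gamma for numerical stability.
    """
    log_num = gammaln(alpha + d + n) - gammaln(alpha + d)      # log (alpha + d)_n
    log_den = gammaln(alpha + n) - gammaln(alpha + 1.0)        # log (alpha + 1)_{n-1}
    return np.exp(log_num - log_den) / d - alpha / d

for n in (10**2, 10**4, 10**6):
    en = expected_num_clusters(n, d=0.5, alpha=1.0)
    print(n, round(en, 1), round(en / n**0.5, 3))              # E[K_n] / n^sigma approaches a constant
```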

5. Bayesian Inference, Estimation, and Algorithms

Posterior inference adapts standard DP machinery:

  • Gibbs/CRP sampler for cluster assignment:
    • Assign each observation to an existing cluster or open a new one with the probabilities from Section 2 (see the sketch after this list).
  • Stick-breaking truncated samplers can control the total variation approximation error via random truncation, with sampling cost scaling in the truncation point $\tau(\epsilon)$, whose asymptotics are governed by polynomially tilted stable laws (Arbel et al., 2018).
  • Variational Bayesian approximation is tractable for the KPYP using coordinate ascent on the truncated stick-breaking representation; explicit update formulas are available for location-specific sticks, cluster indicators, atoms, and the concentration parameter (Chatzis et al., 2012).
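
A minimal sketch of one collapsed Gibbs sweep for a Pitman-Yor mixture, under assumptions not made in the text: a univariate Gaussian likelihood with known variance and a conjugate normal base measure, so each cluster's marginal predictive density is available in closed form.

```python
import numpy as np
from scipy.stats import norm

def gibbs_sweep(x, z, d, alpha, sigma2=1.0, mu0=0.0, tau0_2=10.0, rng=None):
    """One collapsed Gibbs sweep over cluster assignments z for a PY mixture of Gaussians.

    Assumes x_i | mu_c ~ N(mu_c, sigma2) with mu_c ~ N(mu0, tau0_2) under the base measure.
    New clusters receive fresh integer labels, so labels may become non-contiguous.
    """
    if rng is None:
        rng = np.random.default_rng()
    z = z.copy()
    for i in range(len(x)):
        z[i] = -1                                                  # remove point i from its cluster
        labels, counts = np.unique(z[z >= 0], return_counts=True)
        t = len(labels)
        log_p = np.empty(t + 1)
        for j, (lab, nc) in enumerate(zip(labels, counts)):
            members = x[z == lab]
            prec = 1.0 / tau0_2 + len(members) / sigma2            # posterior precision of mu_c
            mean = (mu0 / tau0_2 + members.sum() / sigma2) / prec
            pred_sd = np.sqrt(1.0 / prec + sigma2)                 # posterior predictive std. dev.
            log_p[j] = np.log(nc - d) + norm.logpdf(x[i], mean, pred_sd)
        # New cluster: predictive under the base measure, weighted by (alpha + d * t).
        log_p[t] = np.log(alpha + d * t) + norm.logpdf(x[i], mu0, np.sqrt(tau0_2 + sigma2))
        p = np.exp(log_p - log_p.max())
        choice = rng.choice(t + 1, p=p / p.sum())
        z[i] = labels[choice] if choice < t else (labels.max() + 1 if t > 0 else 0)
    return z

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 50), rng.normal(3, 1, 50)])
z = np.zeros(len(x), dtype=int)
for _ in range(20):
    z = gibbs_sweep(x, z, d=0.3, alpha=1.0, rng=rng)
print("clusters found:", len(np.unique(z)))
```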

Empirical Bayes and full Bayes inference on the type parameter $\sigma$ achieves asymptotic normality at rate $1/\sqrt{n^{\sigma_0}}$, where $\sigma_0$ is the true power-law exponent (Franssen et al., 2022), and Bernstein-von Mises theorems provide posterior Gaussianity after bias correction for discrete data (Franssen et al., 2021).

Estimation of additive functionals (e.g., Shannon entropy) is analytically tractable for PYP priors, with closed-form posterior mean and variance expressions, and the Pitman-Yor Mixture (PYM) estimator achieves frequentist consistency for a broad class of models (Archer et al., 2013).

6. Applications and Generalizations

The Pitman-Yor process is widely used in:

  • Nonparametric density estimation: as a kernel mixture prior, the PYP achieves nearly parametric posterior contraction rates over adaptive smoothness classes (Scricciolo, 2012).
  • Bayesian entropy estimation for infinite discrete spaces (Archer et al., 2013).
  • Power-law topic and language modeling (Hierarchical PYP topic model, n-gram models with heavy tails) (Lim et al., 2016).
  • Functional and spatial clustering, speaker identification, and spatio-temporal point processes (via KPYP) (Chatzis et al., 2012).
  • Classification via species sampling frameworks, including explicit representation of diversity (James, 2019).
  • Product space modeling: Enriched Pitman-Yor processes (EPY) provide nested product models with independent clustering in each space, admitting “square-breaking” stick representations and unifying mixture-of-mixtures and spike-and-slab Bayesian priors (Rigon et al., 2020).

The PYP is interrelated with:

  • Poisson-Dirichlet processes (the ranked size sequence underlying the PYP weights).
  • Stable subordinators and generalized gamma processes: Stick-breaking constructions, bridge representations, and their conditioning extend to $\mathrm{PG}(\alpha, \zeta)$ and $\mathrm{EPG}(\alpha, \zeta)$ classes, covering all $\mathrm{PD}(\alpha, \theta)$ via appropriate mixing.
  • Indian Buffet Process (IBP): Two-parameter PYP is the combinatorial engine behind the power-law three-parameter IBP for exchangeable feature allocation (Roy, 2014).
  • Coagulation-fragmentation chains: Markov structure induced by successively deleting or merging atoms in $\mathrm{PD}(\alpha, \theta)$ partitions (James, 2013, James, 2019).

Theoretical tools from subordinator calculus, large deviations, and regenerative composition structures underpin much of the rigorous analysis (Feng et al., 2016, James, 2013, James, 2019).


The Pitman-Yor process and its extensions serve as archetypal priors for nonparametric Bayesian inference in models that require more flexible, heavy-tailed clustering distributions than the Dirichlet process admits. Their analytic tractability, rich asymptotic theory, and broad applicability justify their central role in modern statistical learning (Chatzis et al., 2012, Lim et al., 2016, Arbel et al., 2018, Feng et al., 2016, Scricciolo, 2012, Franssen et al., 2021, James, 2019, Lawless et al., 2018, Archer et al., 2013, James, 2013, Roy, 2014, Franssen et al., 2022, Rigon et al., 2020).
