Pitman-Yor Chinese Restaurant Process
- The Pitman-Yor Chinese Restaurant Process is a Bayesian nonparametric model that extends the Dirichlet Process by incorporating a discount parameter to induce power-law behavior in cluster sizes.
- It employs a stick-breaking construction and an exchangeable partition probability function to efficiently model random partitions and facilitate scalable inference.
- The model is widely applied in hierarchical topic modeling and species-sampling, where its power-law properties lead to improved performance over traditional Dirichlet-based methods.
The Pitman–Yor Chinese Restaurant Process (PYCRP) is a cornerstone model in Bayesian nonparametric statistics for constructing random partitions and discrete random probability measures exhibiting power-law behavior. It generalizes the Dirichlet Process by introducing a second parameter to control the clustering structure and the frequency distribution of clusters. The PYCRP underpins a range of hierarchical models, especially in topic modeling and species-sampling problems, and supports efficient inference algorithms through its exchangeable partition structure and stick-breaking construction (Lim et al., 2016, Franssen et al., 2022, Lawless et al., 2018, Pereira et al., 2018, Arbel et al., 2018, Canale et al., 2019).
1. Two-Parameter Pitman–Yor Process and CRP Representation
The Pitman–Yor process, denoted $\mathrm{PY}(d, \theta; H)$ for discount parameter $d \in [0, 1)$ and concentration parameter $\theta > -d$, defines an almost surely discrete random probability measure on a space $\mathbb{X}$ with base distribution $H$. Its constructive stick-breaking representation involves i.i.d. atoms $\phi_k \sim H$ and associated weights

$$V_k \sim \mathrm{Beta}(1 - d,\ \theta + k d), \qquad w_k = V_k \prod_{j=1}^{k-1} (1 - V_j),$$

so that $P = \sum_{k \ge 1} w_k\, \delta_{\phi_k}$ has the law $\mathrm{PY}(d, \theta; H)$ (Lawless et al., 2018, Canale et al., 2019).
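As a concrete illustration, the truncated stick-breaking weights can be sampled with a few lines of standard-library Python (a minimal sketch; the function name and truncation level are illustrative):

```python
import random

def stick_breaking_weights(d, theta, K, rng=random):
    """Sample the first K Pitman-Yor weights: V_k ~ Beta(1 - d, theta + k d),
    w_k = V_k * prod_{j<k} (1 - V_j). Requires 0 <= d < 1 and theta > -d.
    """
    weights, remaining = [], 1.0
    for k in range(1, K + 1):
        v = rng.betavariate(1.0 - d, theta + k * d)
        weights.append(v * remaining)   # w_k = V_k * leftover stick length
        remaining *= 1.0 - v            # shorten the stick
    return weights
```

The truncated weights sum to less than one; the leftover mass corresponds to the infinitely many atoms beyond the truncation level.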
The Chinese Restaurant Process (CRP) analogy interprets sample draws from $P$ as customers entering a restaurant: existing tables represent unique observed values ("clusters"). For $n$ customers seated at $K$ tables of sizes $n_1, \dots, n_K$, the next customer sits:
- At table $k$ with probability $\frac{n_k - d}{n + \theta}$,
- At a new table with probability $\frac{\theta + K d}{n + \theta}$.
As $d \to 0$, the PYCRP reduces to the Dirichlet Process CRP, whose seating probabilities are proportional to current table sizes alone (with mass $\theta$ for a new table) (Lim et al., 2016, Lawless et al., 2018).
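The seating rule above translates directly into a forward simulator (a minimal sketch; the function name and interface are illustrative):

```python
import random

def pycrp_sample(n, d, theta, rng=random):
    """Sequentially seat n customers by the Pitman-Yor CRP rule.

    Returns (assignments, sizes): assignments[i] is customer i's table,
    sizes[k] the final occupancy of table k.
    """
    sizes, assignments = [], []
    for i in range(n):
        u = rng.random() * (i + theta)   # total unnormalized mass is i + theta
        table = len(sizes)               # default: open a new table
        acc = 0.0
        for k, nk in enumerate(sizes):
            acc += nk - d                # existing table k carries mass n_k - d
            if u < acc:
                table = k
                break
        if table == len(sizes):
            sizes.append(1)
        else:
            sizes[table] += 1
        assignments.append(table)
    return assignments, sizes
```

Setting `d = 0` recovers the ordinary Dirichlet Process CRP simulator.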
2. Partition Distribution and EPPF
The PYCRP induces exchangeable random partitions of $[n] = \{1, \dots, n\}$ characterized by the Exchangeable Partition Probability Function (EPPF). For a partition into $K$ blocks of sizes $n_1, \dots, n_K$:

$$p(n_1, \dots, n_K) = \frac{\prod_{i=1}^{K-1} (\theta + i d)}{(\theta + 1)_{n-1}} \prod_{k=1}^{K} (1 - d)_{n_k - 1},$$

with the rising factorials $(x)_m = x (x + 1) \cdots (x + m - 1)$ and $(x)_0 = 1$. For $d = 0$, this recovers the Dirichlet process EPPF $\theta^{K-1} \prod_{k} (n_k - 1)! \,/\, (\theta + 1)_{n-1}$ (Lim et al., 2016, Franssen et al., 2022, Lawless et al., 2018, Canale et al., 2019).
This EPPF encapsulates the full probabilistic law over compositions of a sample into clusters, facilitating marginalization in inference tasks and providing direct access to species-sampling properties.
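For concreteness, the EPPF is conveniently evaluated in log-space, with rising factorials computed via the log-gamma function (a small sketch; function names are illustrative):

```python
from math import lgamma, log

def log_rising(x, m):
    """log of the rising factorial (x)_m = x (x + 1) ... (x + m - 1), (x)_0 = 1."""
    return lgamma(x + m) - lgamma(x)

def log_eppf(block_sizes, d, theta):
    """Log EPPF of a PY(d, theta) partition with the given block sizes."""
    n, K = sum(block_sizes), len(block_sizes)
    return (sum(log(theta + i * d) for i in range(1, K))
            + sum(log_rising(1.0 - d, nk - 1) for nk in block_sizes)
            - log_rising(theta + 1.0, n - 1))
```

A quick sanity check: for $n = 2$ the two partitions have probabilities $(1 - d)/(1 + \theta)$ (together) and $(\theta + d)/(1 + \theta)$ (apart), which sum to one.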
3. Power-Law and Partition Growth Properties
A defining property of the PYCRP is its control over cluster frequency distributions:
- The expected number of clusters $K_n$ in a sample of size $n$ grows as $\mathbb{E}[K_n] \sim \frac{\Gamma(\theta + 1)}{d\, \Gamma(\theta + d)}\, n^{d}$ for large $n$, displaying polynomial (power-law) growth for $d > 0$ (Pereira et al., 2018, Franssen et al., 2022, Canale et al., 2019).
- The distribution of cluster sizes exhibits Zipf's law: larger clusters are rarer, and the number of distinct clusters of a given size decays polynomially in that size.
Table counts and individual cluster-size counts concentrate sharply around deterministic curves, with fluctuation scales and convergence rates made explicit in nonasymptotic results (Pereira et al., 2018). The power-law regime matches empirical data in applications (e.g., word frequencies in language, species abundance, network structure) more accurately than the Dirichlet Process model (Lim et al., 2016).
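The $n^d$ growth of the cluster count is easy to check by simulation: seating customers by the PYCRP rule and counting occupied tables should track $\frac{\Gamma(\theta + 1)}{d\, \Gamma(\theta + d)}\, n^{d}$ on average (a rough numerical sketch under assumed parameter values; helper name illustrative):

```python
import random
from math import gamma

def num_tables(n, d, theta, rng=random):
    """Occupied-table count after seating n customers by the PY(d, theta) rule."""
    sizes = []
    for i in range(n):
        u = rng.random() * (i + theta)
        if u >= i - d * len(sizes):          # residual mass: open a new table
            sizes.append(1)
            continue
        acc = 0.0
        for k, nk in enumerate(sizes):       # existing table k has mass n_k - d
            acc += nk - d
            if u < acc:
                sizes[k] += 1
                break
    return len(sizes)

# Average over independent runs and compare with the asymptotic prediction.
trials = [num_tables(2000, 0.5, 1.0, random.Random(s)) for s in range(20)]
predicted = gamma(2.0) / (0.5 * gamma(1.5)) * 2000 ** 0.5
```

Note that $K_n / n^d$ converges to a random limit, so individual runs scatter around the curve; only the average across runs matches the constant above.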
4. Hierarchical and Franchise Extensions
In multi-level models such as nonparametric Bayesian topic models, multiple Pitman–Yor processes are hierarchically coupled—an arrangement known as the Chinese Restaurant Franchise (CRF). For instance, documents draw their own topic distributions from document-level PYPs, which themselves share statistical strength via a corpus-level PYP. Words are drawn from topic distributions, which may also be PYPs over vocabulary.
In this construction, restaurant table counts at children nodes serve as customers at parent nodes. Marginalizing over the random probabilities yields a tractable Markovian structure on the hierarchy of counts and tables, making inference scalable (Lim et al., 2016).
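The "tables become customers upstairs" coupling can be sketched for a two-level franchise. The snippet below is a simplified illustration, not the full CRF of the cited models: each newly opened child table sends exactly one customer to a shared parent restaurant, whose table index plays the role of the child table's dish/topic.

```python
import random

def crf_two_level(group_sizes, d0, theta0, d1, theta1, seed=0):
    """Two-level Chinese Restaurant Franchise sketch: each group runs its own
    PY(d1, theta1) restaurant; every newly opened child table enters a shared
    parent PY(d0, theta0) restaurant, which assigns its "dish" (topic).
    """
    rng = random.Random(seed)

    def seat(sizes, d, theta):
        """Seat one customer by the PYCRP rule; return the table index."""
        n = sum(sizes)
        u = rng.random() * (n + theta)
        if u >= n - d * len(sizes):        # residual mass: open a new table
            sizes.append(1)
            return len(sizes) - 1
        acc = 0.0
        for k, nk in enumerate(sizes):
            acc += nk - d
            if u < acc:
                sizes[k] += 1
                return k

    parent_sizes = []                      # parent-level table (dish) counts
    dishes = []                            # dishes[g][t] = parent table of child table t
    for m in group_sizes:
        child_sizes, child_dishes = [], []
        for _ in range(m):
            t = seat(child_sizes, d1, theta1)
            if t == len(child_dishes):     # new child table -> customer upstairs
                child_dishes.append(seat(parent_sizes, d0, theta0))
        dishes.append(child_dishes)
    return dishes, parent_sizes
```

Because the parent restaurant is shared, popular dishes recur across groups, which is exactly the statistical-strength sharing exploited by hierarchical topic models.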
5. Inference Algorithms and Posterior Computation
PYCRP-based models support collapsed Gibbs sampling by exploiting the exchangeable nature of the process. For each data point:
- Remove its seating assignment, decrementing counts (and deleting any table left empty).
- Compute conditional predictive probabilities for assigning clusters (tables), accounting for both old and new tables as per the PYCRP rules.
- Update the counts recursively along the franchise, recomputing likelihood ratios using modularized functions and Stirling-number calculations.
Efficient implementation can exploit Stirling-number caching, gamma–beta augmentation for marginalizing over the concentration parameter $\theta$, and blocked or auxiliary-variable sampling for hierarchical structures (Lim et al., 2016).
The Importance Conditional Sampling (ICS) algorithm is specifically tailored to the PYCRP structure, using posterior Dirichlet decomposition for cluster probabilities and importance-resampled auxiliary draws from the PY predictive for proposing new clusters. ICS achieves stable mixing and bounded cost per iteration, in contrast to slice or truncation-based samplers, which degrade with large discount parameter (Canale et al., 2019).
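To make the collapsed-Gibbs recipe concrete, here is one sweep for a flat (non-hierarchical) PY mixture under an assumed conjugate toy model, Normal observations with the cluster means marginalized out. This is an illustrative sketch, not the ICS algorithm or the franchise sampler of the cited papers:

```python
import math
import random

def gibbs_sweep(x, z, d, theta, sigma2=1.0, tau2=1.0, rng=random):
    """One collapsed-Gibbs sweep over labels z for data x: PYCRP prior times
    a conjugate N(mu_k, sigma2) likelihood with mu_k ~ N(0, tau2) integrated out.
    """
    def log_pred(xi, s, m):
        # Marginal predictive for a cluster with m members summing to s
        # (m = 0 gives the prior predictive for a new cluster).
        prec = 1.0 / tau2 + m / sigma2
        mean = (s / sigma2) / prec
        var = sigma2 + 1.0 / prec
        return -0.5 * (math.log(2 * math.pi * var) + (xi - mean) ** 2 / var)

    n = len(x)
    for i in range(n):
        z[i] = -1                                   # unseat customer i
        labels = sorted(set(z) - {-1})              # drop empty tables, repack
        relabel = {old: new for new, old in enumerate(labels)}
        z = [relabel[c] if c != -1 else -1 for c in z]
        counts = [0] * len(labels)
        sums = [0.0] * len(labels)
        for j in range(n):
            if z[j] != -1:
                counts[z[j]] += 1
                sums[z[j]] += x[j]
        # PYCRP prior weight times conjugate predictive, per table + new table
        logp = [math.log(counts[k] - d) + log_pred(x[i], sums[k], counts[k])
                for k in range(len(counts))]
        logp.append(math.log(theta + d * len(counts)) + log_pred(x[i], 0.0, 0))
        mx = max(logp)                              # log-sum-exp stabilization
        w = [math.exp(v - mx) for v in logp]
        u = rng.random() * sum(w)
        acc = 0.0
        for k, wk in enumerate(w):
            acc += wk
            if u < acc:
                z[i] = k
                break
    return z
```

In practice the sufficient statistics would be updated incrementally rather than recomputed, and hierarchical models replace the simple predictive with the Stirling-number machinery described above.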
6. Parameter Estimation and Asymptotic Theory
The primary parameters $d$ and $\theta$ (the discount–concentration pair, sometimes written $(\sigma, \theta)$) govern power-law behavior and clustering. Empirical-Bayes and full-Bayes procedures for parameter estimation are developed via maximization of, or Bayesian integration over, the partition likelihood (EPPF):
- The marginal MLE (empirical Bayes) for $d$ solves the likelihood-maximization problem derived from sample partitions, with explicit formulas for the log-likelihood and its derivatives (Franssen et al., 2022).
- Posterior contraction for $d$ around its estimator occurs at rate $n^{-d/2}$ (up to slowly varying factors); a Bernstein–von Mises limit theorem shows asymptotic normality of the estimator of $d$.
- The precision parameter $\theta$ is treated via profile likelihood or prior augmentation; in forensic contexts, plug-in and Bayes estimates for match probabilities admit normal fluctuation limits under the PYCRP model (Franssen et al., 2022).
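A crude version of the empirical-Bayes idea can be sketched as a grid search over $(d, \theta)$ maximizing the log-EPPF of an observed partition. The grid bounds and function names below are assumptions for illustration; the cited work instead uses explicit likelihood formulas and derivatives:

```python
from math import lgamma, log

def log_eppf(sizes, d, theta):
    """PY(d, theta) log partition likelihood (EPPF) for observed block sizes."""
    n, K = sum(sizes), len(sizes)
    lr = lambda x, m: lgamma(x + m) - lgamma(x)      # log rising factorial (x)_m
    return (sum(log(theta + i * d) for i in range(1, K))
            + sum(lr(1.0 - d, nk - 1) for nk in sizes)
            - lr(theta + 1.0, n - 1))

def eb_fit(sizes, grid=50):
    """Crude empirical-Bayes estimate of (d, theta) by grid-maximizing the EPPF."""
    best = None
    for i in range(grid):
        d = i / grid                                  # d ranges over [0, 1)
        for j in range(1, grid + 1):
            theta = 10.0 * j / grid                   # theta over (0, 10], assumed bound
            ll = log_eppf(sizes, d, theta)
            if best is None or ll > best[0]:
                best = (ll, d, theta)
    return best[1], best[2]
```

Partitions with many small blocks and a few large ones push the estimate of $d$ upward, reflecting the power-law regime.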
7. Applications and Empirical Performance
Hierarchical PYCRP models are extensively deployed in latent variable modeling for natural language (e.g., topic modeling for text corpora) and mixture modeling in clustering. In empirical studies on large-scale text data (e.g., Twitter), hierarchical Pitman–Yor models outperform Dirichlet-based baselines across multiple metrics:
- Lower held-out perplexity for text
- More accurate modeling of words, hashtags, and author networks
- Improved performance in downstream clustering (measured by purity and NMI) and topic labeling
These gains are attributed to the accurate power-law modeling of rare types and robust handling of heavy-tailed observed data (Lim et al., 2016). Inclusion of shared-vocabulary PYPs (e.g., for hashtags) and network-level priors (e.g., via Gaussian processes) further enhances modeling capacity and empirical fit.
Summary Table: Core Aspects of the Pitman–Yor Chinese Restaurant Process
| Feature | Mathematical Characterization | Key Papers |
|---|---|---|
| Predictive rule | table $k$: $\frac{n_k - d}{n + \theta}$; new table: $\frac{\theta + K d}{n + \theta}$ | (Lim et al., 2016, Lawless et al., 2018, Canale et al., 2019) |
| Partition distribution (EPPF) | $\frac{\prod_{i=1}^{K-1} (\theta + i d)}{(\theta + 1)_{n-1}} \prod_{k=1}^{K} (1 - d)_{n_k - 1}$ | (Lim et al., 2016, Lawless et al., 2018, Canale et al., 2019) |
| Expected # clusters | $\mathbb{E}[K_n] \sim \frac{\Gamma(\theta + 1)}{d\, \Gamma(\theta + d)}\, n^{d}$ for $d > 0$ | (Pereira et al., 2018, Franssen et al., 2022, Canale et al., 2019) |
| Inference schemes | Collapsed Gibbs, ICS, stick-breaking truncation | (Lim et al., 2016, Canale et al., 2019, Arbel et al., 2018) |
| Application highlight | Topic models, power-law species/words, networked data | (Lim et al., 2016, Franssen et al., 2022) |
The PYCRP is thus established as a flexible and analytically tractable extension of the Dirichlet process CRP, enforcing power-law cluster growth and supporting efficient, exact inference and parameter estimation in a broad array of nonparametric Bayesian models.