Empirical Bayes via Predictive Recursion

Updated 26 February 2026

The paper introduces a predictive recursion algorithm to nonparametrically estimate mixing distributions, yielding fast and consistent empirical Bayes rules.
It outlines a semiparametric framework where marginal likelihood maximization efficiently estimates structural parameters without requiring computationally intensive MCMC sampling.
Applications include large-scale multiple testing in hierarchical models, achieving improved tail-fitting, enhanced power, and realistic control over false discovery rates.

Nonparametric Empirical Bayes via Predictive Recursion refers to a methodology for empirical Bayes inference in mixture models in which the mixing distribution is estimated nonparametrically using the predictive recursion (PR) algorithm. This approach enables semiparametric or fully nonparametric estimation of the prior in hierarchical models, yielding fast and consistent empirical Bayes rules for large-scale inference, such as multiple testing, without resorting to computationally intensive Markov chain Monte Carlo. The core innovation is the PR algorithm's ability to efficiently learn an unknown mixing distribution, often in concert with maximum marginal likelihood estimation of parametric or structural components.

1. Hierarchical Models and the Empirical Bayes Problem

In empirical Bayes (EB) settings, the data $\{X_i\}_{i=1}^n$ are modeled as conditionally independent given latent parameters $\{\theta_i\}$ , each drawn i.i.d. from an unknown prior $F$ . The observation-level model is $X_i \sim K(\cdot|\theta_i, \vartheta)$ , with $K$ a known kernel and $\vartheta$ a finite-dimensional nuisance or structural parameter. The marginal likelihood is a mixture:

$f(x; \vartheta) = \int K(x | \theta, \vartheta) \, dF(\theta)$

The empirical Bayes objective is to estimate $F$ (nonparametrically) and $\vartheta$ (typically by marginal likelihood), then use the fitted model to form data-driven decisions or estimates (e.g., via plug-in Bayes rules) (Martin et al., 2011, Martin, 2012).

A canonical application is the two-groups model for large-scale multiple testing, where $Z_i$ are test statistics assumed arise from a mixture:

$f(z) = \pi f_0(z) + (1-\pi) f_1(z)$

with $f_0$ representing the null distribution (often an empirical null with unknown mean/variance), and $f_1$ a nonparametric alternative, typically modeled as a continuous mixture of the null kernel shifted or stretched (Martin et al., 2011).

2. The Predictive Recursion Algorithm

Predictive recursion (PR), introduced for nonparametric mixture estimation, produces a sequence of estimates for the unknown mixing density $p(u)$ (or, more generally, mixing measure $\Psi$ ) by recursively updating its previous value in light of each new data point. For observations $X_1, ..., X_n$ and structural parameter $\theta$ fixed, the PR update is:

$\begin{aligned} m_{i-1,\theta}(x) &= \int K(x|u,\theta) \, p_{i-1}(u)\, du \ p_i(u) &= (1-w_i) p_{i-1}(u) + w_i \frac{K(X_i|u,\theta) p_{i-1}(u)}{m_{i-1,\theta}(X_i)} \end{aligned}$

with an initial guess $p_0(u)$ and a weight sequence $\{w_i\}$ satisfying $\sum w_i = \infty$ , $\sum w_i^2 < \infty$ (e.g., $w_i = (i+1)^{-\gamma}, \gamma \in (2/3, 1]$ ). After $n$ steps, $p_n(u)$ is the final estimate (Martin et al., 2011, Martin, 2012, Martin et al., 2011).

Key features include:

Online recursion: One pass through the data, updating with each $X_i$ .
No MCMC required: Orders of magnitude faster than nonparametric Bayes via Dirichlet process mixtures.
Dependence on order: PR is not permutation invariant, so averaging over a small number of random permutations is advisable.

3. Predictive Recursion Marginal Likelihood and Parameter Estimation

In semiparametric settings, PR forms the basis for estimating finite-dimensional structural parameters by constructing a marginal likelihood. For parameter $\theta$ and data $\{X_i\}$ , the PR-marginal likelihood is

$L_{\mathrm{PR}}(\theta) = \prod_{i=1}^n m_{i-1,\theta}(X_i)$

where $m_{i-1,\theta}(X_i)$ is the current mixture density at $X_i$ given $\theta$ and past estimates $p_{i-1}$ .

Rather than a plug-in profile likelihood (which can be unstable), maximizing $L_{\mathrm{PR}}$ provides a computationally efficient, consistent estimator for $\theta$ . Regularization is typically introduced by adding a (weakly informative) prior on $\theta$ , penalizing implausible parameter values. Standard optimization routines (e.g., BFGS, quasi-Newton) are used for $\arg\max_\theta L_{\mathrm{PR}}(\theta)$ (Martin et al., 2011, Martin et al., 2011).

The algorithm is summarized as:

For each candidate $\theta$ , run PR over $\{X_i\}$ to obtain $L_{\mathrm{PR}}(\theta)$ .
Maximize $L_{\mathrm{PR}}(\theta)$ (plus log-prior, if regularized) to obtain $\hat\theta$ .
Run PR again at $\hat\theta$ to obtain the final mixing density estimate.

4. Application to Large-Scale Multiple Testing: The PRtest Procedure

In large-scale multiple testing, PR enables a fully nonparametric EB version of the two-groups model with empirical null. The null density $f_0(z)$ is typically $N(z|\mu,\sigma^2)$ with both mean and variance estimated from data; $f_1(z)$ is constructed as a nonparametric location mixture:

$f_1(z) = \int_{u\in[-1,1]} N\big(z \mid \mu+\tau\sigma u,\, \sigma^2\big) \psi(u) du$

where $\psi$ is an unknown density and $\tau \ge 1$ is a stretching parameter for the alternative (Martin et al., 2011).

The specific steps for PRtest are:

Fit the mixture model via PR and marginal likelihood maximization, simultaneously estimating $(\mu, \sigma, \tau, \pi_0)$ and $\psi(u)$ , using weakly-informative priors for regularization.
Compute the estimated marginal density $\hat f(z) = \hat\pi N(z|\hat\mu, \hat\sigma^2) + (1-\hat\pi) \hat f_1(z)$ .
Estimate the local false discovery rate (lFDR):

$\widehat{\mathrm{lFDR}}(z) = \frac{\hat\pi N(z|\hat\mu, \hat\sigma^2)}{\hat f(z)}$

Apply the decision rule: declare $Z_i$ non-null if $\widehat{\mathrm{lFDR}}(Z_i)<r$ for a threshold $r$ (typically $0.1$).

This procedure provides both better null estimation and improved tail-fit for $f_1$ , enabling higher power and more realistic control over empirical FDR, validated in simulation and on microarray data (Martin et al., 2011).

5. Theoretical Properties: Consistency, Identifiability, and Asymptotic Optimality

PR admits strong theoretical guarantees:

Consistency: Under mild conditions (weak compactness of candidate prior family, continuity of kernels, weight decay), the estimated mixing distribution $F_n$ converges weakly to the oracle $F$ as $n\to\infty$ , and the mixture density converges in Kullback-Leibler divergence (Martin, 2012, Martin et al., 2011).
Identifiability: For the two-groups model with $\tau > 1$ and smooth kernels, the mapping $(\mu, \sigma, \pi, \tau, \psi) \mapsto \pi f_0 + (1-\pi) f_1$ is injective (Martin et al., 2011).
Asymptotic optimality of EB rules: Data-driven plug-in EB rules using $F_n$ (for hypothesis testing, estimation under square loss, etc.) are asymptotically Bayes optimal, i.e., their frequentist risk converges to the Bayes oracle risk under the true prior (Martin, 2012). Sufficient conditions for this optimality include boundedness and dominated convergence for losses and convergence of $F_n$ .

6. Computational Complexity and Practical Implementation

The PR algorithm is computationally efficient:

Complexity: One pass of PR over $n$ samples for a mixing density represented on an $m$ -point grid requires $O(n m)$ operations.
No MCMC: Unlike Dirichlet process mixture approaches, there is no need to store or sample cluster allocations. The marginal likelihood is available as a byproduct of the main recursion.
Order dependence: While PR is order dependent, permutation averaging over a modest number of random orderings nearly eliminates this effect.
Parameter tuning: Decay rate $\gamma$ (for $w_i$ ) and initial guess for the mixing density should be chosen judiciously; in practice, $\gamma\approx 2/3$ or $1/(n+1)$ for weights provides reliable convergence. Grid choice in $u$ is a tradeoff between computational cost and resolution.

7. Empirical Evidence and Comparative Performance

Extensive simulation and real-data analysis indicate that PR-based nonparametric EB methods (e.g., PRtest) achieve:

Accurate estimation of the mixing proportion $\pi$ over a wide range of sparsity levels, often outperforming both parametric EB and alternative nonparametric methods (e.g., Jin–Cai, mixfdr) in estimation and tail fit.
Empirical FDR and Bayes risk curves closely matched to Bayes oracle procedures across realistic data-generating scenarios.
Competitive power and false non-discovery rates: In the "golden spike" microarray benchmark, PRtest closely matched the oracle in number of true discoveries, unlike methods that grossly under- or overestimate signal counts.
Improved tail-fitting and signal detection in real microarray datasets, correctly flagging significant genes missed by approaches that overshrink the estimated null component.

These results demonstrate that predictive recursion, when coupled with marginal likelihood optimization and nonparametric mixing, delivers empirical Bayes procedures for large-scale inference that are both theoretically grounded and practically competitive (Martin et al., 2011, Martin, 2012, Martin et al., 2011).