Gaussian Sieve Priors

Updated 18 November 2025
  • Gaussian sieve priors are hierarchical Bayesian priors that express infinite-dimensional functions using truncated orthonormal basis expansions.
  • They enable adaptive nonparametric inference by selecting a variable truncation level, achieving near-minimax global L2 contraction rates.
  • However, they exhibit suboptimal performance for pointwise and semi-parametric loss functions due to insufficient regularization of intermediate-frequency components.

Gaussian sieve priors are hierarchical Bayesian priors designed for adaptive nonparametric inference, especially in settings where the underlying signal or function admits a sparse or truncated orthonormal expansion. In such models, the key feature is to express the infinite-dimensional parameter (such as a function or spectral density) in a suitable basis, and then encode prior information via a variable truncation level and, conditionally, independent Gaussian priors on the expansion coefficients. The formulation enables dimension adaptation and allows contraction rates that (up to logarithmic factors) closely track minimax optimality for certain global loss functions. Gaussian sieve priors have been rigorously analyzed for models including the Gaussian white noise model and semi-parametric Gaussian time series, revealing both their strengths in global $L_2$ adaptation and their limitations under pointwise or semi-parametric loss functions (Arbel et al., 2012, Kruijer et al., 2012).

1. Construction of Gaussian Sieve Priors

In the Gaussian white noise model

$$dX^n(t) = f_0(t)\,dt + n^{-1/2}\,dW(t), \quad t\in[0,1],$$

where $f_0$ is the unknown function and $W$ is standard Brownian motion, the function is expressed in an orthonormal basis $\{\varphi_j\}_{j\geq 1}$ as $f_0(t)=\sum_{j=1}^\infty \theta_{0j}\varphi_j(t)$. The observations in the basis are

$$X_j^n = \int_0^1 \varphi_j(t)\,dX^n(t) = \theta_{0j} + n^{-1/2}\xi_j, \quad \xi_j\sim N(0,1).$$
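
For concreteness, the sequence-space form of the model is straightforward to simulate. The sketch below is illustrative only (not taken from the cited papers); the true coefficients $\theta_{0j}=j^{-(\beta+1)}$ and the truncation at $J$ terms are arbitrary choices made for the example.

```python
import numpy as np

def simulate_sequence_model(theta0, n, rng=None):
    """Draw X_j^n = theta_{0j} + n^{-1/2} * xi_j with xi_j ~ N(0, 1), j = 1, ..., len(theta0)."""
    rng = np.random.default_rng(rng)
    return theta0 + rng.standard_normal(len(theta0)) / np.sqrt(n)

# Hypothetical smooth truth: coefficients decaying fast enough for a Sobolev ball of smoothness beta.
beta, J, n = 1.0, 500, 10_000
j = np.arange(1, J + 1)
theta0 = j ** (-(beta + 1.0))
X = simulate_sequence_model(theta0, n, rng=0)
```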

The Gaussian sieve prior places a hierarchical structure on the coefficients $\theta=(\theta_j)_{j\geq 1}$:
$$\Pi(d\theta) = \sum_{k=1}^\infty \pi(k)\,\Pi_k(d\theta),$$
where

  • $\pi(k)$ is a prior over the truncation level $k$,
  • Conditionally on $k$, $\theta_j\sim N(0,\tau_j^2)$ for $j=1,\ldots,k$, and $\theta_j\equiv 0$ for $j>k$.

Canonical choices are $\pi(k)$ a Poisson distribution with parameter $\lambda$, and $\tau_j^2 = \tau_0 j^{-2q}$ for $1/2 < q \leq 1$, $\tau_0>0$ (Arbel et al., 2012). The prior is thus a random mixture over finite-dimensional Gaussians, which induces dimension reduction and penalizes complexity via the decay of $\pi(k)$ (typically $e^{-ak\log k} \leq \pi(k) \leq e^{-bk\log k}$ for constants $a,b>0$).
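
A draw from this prior follows the hierarchical description directly. The sampler below is a minimal sketch assuming the canonical Poisson/$\tau_0 j^{-2q}$ choices; the function name, the zero-padding to a fixed dimension $J$, and the truncation to $k\geq 1$ are conveniences for the example, not prescriptions from the cited papers.

```python
import numpy as np

def sample_sieve_prior(J, lam=5.0, tau0=1.0, q=1.0, rng=None):
    """One draw (k, theta) from the Gaussian sieve prior, zero-padded to dimension J.

    k ~ Poisson(lam), truncated here to k >= 1; theta_j | k ~ N(0, tau0 * j^{-2q})
    for j <= k and theta_j = 0 for j > k.
    """
    rng = np.random.default_rng(rng)
    k = 0
    while k < 1:                              # crude truncation to k >= 1
        k = rng.poisson(lam)
    j = np.arange(1, J + 1)
    tau2 = tau0 * j ** (-2.0 * q)
    theta = np.where(j <= k, rng.standard_normal(J) * np.sqrt(tau2), 0.0)
    return k, theta

k, theta = sample_sieve_prior(J=500, rng=0)
```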

In semi-parametric time series models, e.g., the FEXP model for Gaussian long-memory series,

$$f_{d,k,\theta}(\lambda) = (2\pi)^{-1}\,|1-e^{-i\lambda}|^{-2d}\exp\Bigl\{\sum_{j=0}^k \theta_j \cos(j\lambda)\Bigr\},$$

a sieve prior is formulated by independently assigning (a short numerical sketch of the spectral density follows the list):

  • $d$ a (fixed) prior density supported in $(-1,1)$,
  • $k$ either fixed at a deterministic rate in $n$ or random with a Poisson/geometric prior,
  • $\theta\mid k$ a distribution supported on a Sobolev ball of smoothness $\beta>1$ (Kruijer et al., 2012).
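
As a quick numerical illustration of the FEXP form above, the spectral density can be evaluated directly; the parameter values used here are arbitrary and purely illustrative.

```python
import numpy as np

def fexp_spectral_density(lam, d, theta):
    """FEXP spectral density f_{d,k,theta}(lambda) for theta = (theta_0, ..., theta_k)."""
    lam = np.asarray(lam, dtype=float)
    long_memory = np.abs(1.0 - np.exp(-1j * lam)) ** (-2.0 * d)   # |1 - e^{-i lambda}|^{-2d}
    short_memory = np.exp(sum(t * np.cos(i * lam) for i, t in enumerate(theta)))
    return long_memory * short_memory / (2.0 * np.pi)

# Evaluate on a grid avoiding lambda = 0, where the density diverges for d > 0.
lam = np.linspace(0.01, np.pi, 200)
f = fexp_spectral_density(lam, d=0.3, theta=[0.5, -0.2, 0.1])
```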

2. Posterior Contraction Rates: $L_2$ and Global Loss

Under mild regularity assumptions, Gaussian sieve priors yield adaptive, minimax-optimal posterior contraction rates (up to logarithmic factors) for the global $L_2$ (or $\ell_2$) loss over appropriately defined Sobolev-type parameter spaces.

Specifically, for the Sobolev ball

$$\Theta_\beta(L_0) = \Bigl\{\theta:\ \sum_{j=1}^\infty j^{2\beta}\theta_j^2 \leq L_0\Bigr\},$$

the minimax $L_2$-estimation rate is $n^{-\beta/(2\beta+1)}$. The Gaussian sieve prior achieves

$$\varepsilon_n(\beta) = C\left(\frac{\log n}{n}\right)^{\beta/(2\beta+1)},$$

in the sense that, for any $\theta_0\in \Theta_\beta(L_0)$ and $M$ sufficiently large,

$$E_{\theta_0}\Pi\Bigl(\theta:\ \|\theta-\theta_0\|_2^2 \geq M\,\log n\,\varepsilon_n^2 \,\Big|\, X^n\Bigr)\to 0,$$

as $n\to\infty$ [(Arbel et al., 2012), Theorem 3.4, Proposition 4.1, Section 5.1]. The Bayes risk associated with the posterior mean also achieves $O(\varepsilon_n^2)$.
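
Because the model is conjugate given $k$, the posterior over the truncation level and the posterior mean of $\theta$ are available in closed form. The sketch below assumes the sequence model and the canonical prior above, retains only the first $J=\mathrm{len}(X)$ coefficients, and restricts $k$ to $1\leq k\leq J$; these are practical conveniences for the example, not part of the theory.

```python
import numpy as np
from scipy.stats import norm, poisson

def sieve_posterior(X, n, lam=5.0, tau0=1.0, q=1.0):
    """Exact posterior over k and posterior mean of theta in the Gaussian sequence model."""
    J = len(X)
    j = np.arange(1, J + 1)
    sigma2 = 1.0 / n
    tau2 = tau0 * j ** (-2.0 * q)
    # Marginal log-density of X_j when coefficient j is active (j <= k) or zero (j > k).
    log_active = norm.logpdf(X, scale=np.sqrt(tau2 + sigma2))
    log_inactive = norm.logpdf(X, scale=np.sqrt(sigma2))
    # Log marginal likelihood of each k = 1..J via cumulative sums.
    ks = np.arange(1, J + 1)
    log_ml = np.cumsum(log_active) + (log_inactive.sum() - np.cumsum(log_inactive))
    log_post_k = poisson.logpmf(ks, lam) + log_ml
    post_k = np.exp(log_post_k - log_post_k.max())
    post_k /= post_k.sum()
    # Posterior mean: Gaussian shrinkage applied whenever j <= k, averaged over the posterior of k.
    prob_active = 1.0 - np.concatenate(([0.0], np.cumsum(post_k)[:-1]))  # P(k >= j | X)
    theta_mean = prob_active * tau2 / (tau2 + sigma2) * X
    return post_k, theta_mean
```

Heuristically, the posterior on $k$ concentrates on dimensions of order $n^{1/(2\beta+1)}$ (up to logarithmic factors), which is what drives the adaptive rate just stated.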

The key mechanism underlying these results is a balance between approximation error (controlled by the truncation level $k$ and the prior scales $\tau_j$) and stochastic error (arising from the noise level and the effective dimension of the truncated model). The proof uses prior-mass bounds over Kullback–Leibler neighborhoods, metric entropy estimates, and non-asymptotic testing inequalities.
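
Concretely, on $\Theta_\beta(L_0)$ the two error terms admit the following standard bounds (a heuristic sketch rather than a statement quoted from the cited papers):

$$\sum_{j>k}\theta_{0j}^2 \;\leq\; k^{-2\beta}\sum_{j>k} j^{2\beta}\theta_{0j}^2 \;\leq\; L_0\,k^{-2\beta} \quad\text{(approximation error)}, \qquad \frac{k}{n} \quad\text{(stochastic error)}.$$

Balancing $k^{-2\beta}\asymp k/n$ gives $k_n\asymp n^{1/(2\beta+1)}$ and a squared error of order $n^{-2\beta/(2\beta+1)}$, i.e., the rate $\varepsilon_n(\beta)$ above up to logarithmic factors.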

3. Adaptation and Suboptimality for Other Losses

While Gaussian sieve priors provide sharp global $L_2$ adaptation, their behavior under other loss functions can be markedly different. For the pointwise (local) risk

$$R_n^{\mathrm{loc}}(\theta_0,t) = E_{\theta_0}\Bigl[\bigl(\hat{f}_n(t)-f_0(t)\bigr)^2\Bigr],$$

where $\hat{f}_n(t)=\sum_j \varphi_j(t)\,\hat{\theta}_{nj}$, the minimax rate over $\Theta_\beta$ is $n^{-(2\beta-1)/(2\beta)}$. Under the Gaussian sieve prior, the attained pointwise risk decays only as

$$R_n^{\mathrm{loc}}(\theta_0,t) \gtrsim \frac{n^{-(2\beta-1)/(2\beta+1)}}{(\log n)^2},$$

which is slower than the minimax rate by a polynomial factor in $n$ [(Arbel et al., 2012), Proposition 5.3]. The dominant error arises from intermediate-frequency coefficients that are insufficiently regularized by the sieve prior, causing excess local variance.
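
To gauge the size of this gap, take $\beta=1$ as an illustration:

$$n^{-(2\beta-1)/(2\beta)} = n^{-1/2} \quad\text{(minimax)} \qquad\text{versus}\qquad n^{-(2\beta-1)/(2\beta+1)} = n^{-1/3} \quad\text{(sieve prior, up to logs)},$$

a polynomial loss of order $n^{1/6}$.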

A similar phenomenon appears in semi-parametric estimation of long-memory parameters in time series. With random truncation priors (Poisson or geometric) on the expansion length $k$, the contraction rate for the long-memory parameter $d$ is

$$(n/\log n)^{-(2\beta-1)/(4\beta+2)},$$

whereas the minimax rate is $(n/\log n)^{-(2\beta-1)/(4\beta)}$. Only when $k$ is tuned deterministically to scale as $n^{1/(2\beta)}$ (rather than assigned a prior) does the sieve prior attain the nearly optimal rate [(Kruijer et al., 2012), Theorems 3.2–3.4].
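
Again taking $\beta=1$ for illustration, the exponents work out to

$$(n/\log n)^{-1/6} \quad\text{(random $k$)} \qquad\text{versus}\qquad (n/\log n)^{-1/4} \quad\text{(minimax, attained with deterministic } k_n\asymp n^{1/2}\text{)}.$$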

4. Key Technical Conditions and Proof Structure

Contraction theorems for sieve priors rest on several analytic conditions (made explicit for the white noise model after the list):

  • KL-approximation (A1): Existence of low-dimensional truncations that approximate the true model in Kullback–Leibler divergence.
  • Reverse KL–$\ell_2$ control (A2): Uniform control of the KL divergence in terms of the $\ell_2$-distance around the truncation.
  • Metric entropy and covering (A3): Ability to cover sieved parameter sets in $d_n$-distance by $\ell_2$-balls, which facilitates the construction of exponentially powerful tests.
  • Testing (A4): Construction of tests with exponentially decaying type I and type II errors for hypotheses separated in $d_n$.
  • Prior tails and scales (A5): Appropriate decay of $\pi(k)$, sufficient mass for the scales $\tau_j$, and tail regularity of the conditional prior densities.
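
In the white noise model these conditions take an explicit form, since for the Gaussian sequence experiment the Kullback–Leibler divergence is proportional to the squared $\ell_2$-distance (a standard computation, sketched here rather than quoted from the papers):

$$\mathrm{KL}\bigl(P_{\theta_0}^n, P_\theta^n\bigr) = \frac{n}{2}\,\|\theta_0-\theta\|_2^2,$$

so KL neighborhoods are $\ell_2$-balls of radius of order $\varepsilon_n$, and (A1) reduces to the truncation bound $\sum_{j>k}\theta_{0j}^2 \leq L_0\,k^{-2\beta}$ on $\Theta_\beta(L_0)$.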

The proof proceeds via a testing–prior-mass approach. The numerator of the posterior probability for "bad" sets (where the contraction fails) is controlled by a union bound over tests, while the denominator is lower-bounded by the prior mass of suitable KL neighborhoods. Contributions from large and small $k$ are handled via tail bounds on $\pi(k)$ (Arbel et al., 2012).

5. Implications for Adaptive Bayesian Estimation

Gaussian sieve priors illustrate both the strengths and limitations of hierarchical Bayesian adaptivity. For global loss functions such as integrated squared error, the prior achieves adaptive minimax rates over a large class of smoothness spaces (e.g., Sobolev balls). The adaptation occurs via the random selection (or deterministic choice) of the truncation level $k_n$, which balances bias and variance automatically.

However, for more localized or semi-parametric functionals (e.g., pointwise function estimation, long-memory parameter estimation), full Bayesian adaptation via a sieve prior leads to a trade-off in contraction rate. The failure to attain optimal local rates is due to a fundamental mismatch between the global regularization induced by the sieve structure and the localized risk structure of the problem (Arbel et al., 2012, Kruijer et al., 2012).

This behavior underscores the necessity of careful prior design or tuning for objectives that go beyond global estimation: for instance, fixing the sieve truncation to match the minimax-optimal dimension achieves nearly optimal rates for long-memory parameters in time series, whereas data-driven or fully random truncation does not.

6. Comparison with Frequentist and Other Bayesian Approaches

Analysis reveals that the convergence properties of Gaussian sieve priors often parallel those of frequentist sieve estimators, particularly in nonparametric and semi-parametric models. For example, periodogram-based estimators for long-memory time series achieve the minimax rate $n^{-(2\beta-1)/(4\beta)}$, which matches the contraction rate of the sieve prior with deterministic $k_n$ (up to log-factors) (Kruijer et al., 2012).

In contrast to fully Bayesian methods that randomize $k$ with heavy-tailed priors (facilitating automatic adaptation over the entire parameter space), deterministic or empirically tuned sieves can deliver sharper convergence rates for specific functionals. This reflects an inherent trade-off: full Bayesian adaptation excels for function estimation in global metrics, but sacrifices efficiency for certain semi-parametric objectives.

7. Summary Table: Sieve Prior Contraction Rates

| Problem/Class | Sieve Prior Type | Achieved Rate (up to logs) | Minimax/Optimal Rate |
|---|---|---|---|
| Global $L_2$ (Sobolev, white noise) | Poisson $\pi(k)$, $\tau_j^2=\tau_0 j^{-2q}$ | $n^{-\beta/(2\beta+1)}$ | $n^{-\beta/(2\beta+1)}$ |
| Pointwise (Sobolev, white noise) | Same | $n^{-(2\beta-1)/(2\beta+1)}$ | $n^{-(2\beta-1)/(2\beta)}$ |
| Long-memory $d$, FEXP (semi-parametric) | $k_n\sim n^{1/(2\beta)}$ (deterministic) | $(n/\log n)^{-(2\beta-1)/(4\beta)}$ | $(n/\log n)^{-(2\beta-1)/(4\beta)}$ |
| Long-memory $d$, FEXP | Poisson/geometric $\pi(k)$ | $(n/\log n)^{-(2\beta-1)/(4\beta+2)}$ | $(n/\log n)^{-(2\beta-1)/(4\beta)}$ |

The contraction properties confirm that Gaussian sieve priors are robust tools for adaptive nonparametric Bayes inference in high-dimensional and infinite-dimensional settings, but their performance must be evaluated in light of the specific estimation criterion of interest.

References: (Arbel et al., 2012, Kruijer et al., 2012).
