Gaussian Sieve Priors

Updated 18 November 2025
  • Gaussian sieve priors are hierarchical Bayesian priors that express infinite-dimensional functions using truncated orthonormal basis expansions.
  • They enable adaptive nonparametric inference by selecting a variable truncation level, achieving near-minimax global L2 contraction rates.
  • However, they exhibit suboptimal performance for pointwise and semi-parametric loss functions due to insufficient regularization of intermediate-frequency components.

Gaussian sieve priors are hierarchical Bayesian priors designed for adaptive nonparametric inference, especially in settings where the underlying signal or function admits a sparse or truncated orthonormal expansion. In such models, the key feature is to express the infinite-dimensional parameter (such as a function or spectral density) in a suitable basis, and then encode prior information via a variable truncation level and, conditionally, independent Gaussian priors on the expansion coefficients. The formulation enables dimension adaptation and allows contraction rates that (up to logarithmic factors) closely track minimax optimality for certain global loss functions. Gaussian sieve priors have been rigorously analyzed for models including the Gaussian white noise model and semi-parametric Gaussian time series, revealing both their strengths in global $L_2$ adaptation and their limitations under pointwise or semi-parametric loss functions (Arbel et al., 2012, Kruijer et al., 2012).

1. Construction of Gaussian Sieve Priors

In the Gaussian white noise model

$$dX^n(t) = f_0(t)\,dt + n^{-1/2}\,dW(t), \quad t\in[0,1],$$

where $f_0$ is the unknown function and $W$ is standard Brownian motion, the function is expressed in an orthonormal basis $\{\varphi_j\}_{j\geq 1}$ as $f_0(t)=\sum_{j=1}^\infty \theta_{0j}\varphi_j(t)$. The observations in the basis are

$$X_j^n = \int_0^1 \varphi_j(t)\,dX^n(t) = \theta_{0j} + n^{-1/2}\xi_j, \quad \xi_j\sim N(0,1).$$
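
For concreteness, the sequence-space form of the model is straightforward to simulate. The sketch below is illustrative only (not taken from the cited papers); the true coefficients $\theta_{0j}=j^{-(\beta+1)}$ and the truncation at $J$ terms are arbitrary choices made for the example.

```python
import numpy as np

def simulate_sequence_model(theta0, n, rng=None):
    """Draw X_j^n = theta_{0j} + n^{-1/2} * xi_j with xi_j ~ N(0, 1), j = 1, ..., len(theta0)."""
    rng = np.random.default_rng(rng)
    return theta0 + rng.standard_normal(len(theta0)) / np.sqrt(n)

# Hypothetical smooth truth: coefficients decaying fast enough for a Sobolev ball of smoothness beta.
beta, J, n = 1.0, 500, 10_000
j = np.arange(1, J + 1)
theta0 = j ** (-(beta + 1.0))
X = simulate_sequence_model(theta0, n, rng=0)
```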

The Gaussian sieve prior places a hierarchical structure on the coefficients $\theta=(\theta_j)_{j\geq 1}$:
$$\Pi(d\theta) = \sum_{k=1}^\infty \pi(k)\,\Pi_k(d\theta),$$
where

  • $\pi(k)$ is a prior over the truncation level $k$,
  • Conditionally on $k$, $\theta_j\sim N(0,\tau_j^2)$ for $j=1,\ldots,k$, and $\theta_j\equiv 0$ for $j>k$.

Canonical choices are $\pi(k)$ a Poisson distribution with parameter $\lambda$, and $\tau_j^2 = \tau_0 j^{-2q}$ for $1/2 < q \leq 1$, $\tau_0>0$ (Arbel et al., 2012). The prior is thus a random mixture over finite-dimensional Gaussians, which induces dimension reduction and penalizes complexity via the decay of $\pi(k)$ (typically $e^{-ak\log k} \leq \pi(k) \leq e^{-bk\log k}$ for constants $a,b>0$).
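
A draw from this prior follows the hierarchical description directly. The sampler below is a minimal sketch assuming the canonical Poisson/$\tau_0 j^{-2q}$ choices; the function name, the zero-padding to a fixed dimension $J$, and the truncation to $k\geq 1$ are conveniences for the example, not prescriptions from the cited papers.

```python
import numpy as np

def sample_sieve_prior(J, lam=5.0, tau0=1.0, q=1.0, rng=None):
    """One draw (k, theta) from the Gaussian sieve prior, zero-padded to dimension J.

    k ~ Poisson(lam), truncated here to k >= 1; theta_j | k ~ N(0, tau0 * j^{-2q})
    for j <= k and theta_j = 0 for j > k.
    """
    rng = np.random.default_rng(rng)
    k = 0
    while k < 1:                              # crude truncation to k >= 1
        k = rng.poisson(lam)
    j = np.arange(1, J + 1)
    tau2 = tau0 * j ** (-2.0 * q)
    theta = np.where(j <= k, rng.standard_normal(J) * np.sqrt(tau2), 0.0)
    return k, theta

k, theta = sample_sieve_prior(J=500, rng=0)
```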

In semi-parametric time series models, e.g., the FEXP model for Gaussian long-memory series,

$$f_{d,k,\theta}(\lambda) = (2\pi)^{-1}\,|1-e^{-i\lambda}|^{-2d}\exp\Bigl\{\sum_{j=0}^k \theta_j \cos(j\lambda)\Bigr\},$$

a sieve prior is formulated by independently assigning (a short numerical sketch of the spectral density follows the list):

  • $d$ a (fixed) prior density supported in $(-1,1)$,
  • $k$ either fixed at a deterministic rate in $n$ or random with a Poisson/geometric prior,
  • $\theta\mid k$ a distribution supported on a Sobolev ball of smoothness $\beta>1$ (Kruijer et al., 2012).
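
As a quick numerical illustration of the FEXP form above, the spectral density can be evaluated directly; the parameter values used here are arbitrary and purely illustrative.

```python
import numpy as np

def fexp_spectral_density(lam, d, theta):
    """FEXP spectral density f_{d,k,theta}(lambda) for theta = (theta_0, ..., theta_k)."""
    lam = np.asarray(lam, dtype=float)
    long_memory = np.abs(1.0 - np.exp(-1j * lam)) ** (-2.0 * d)   # |1 - e^{-i lambda}|^{-2d}
    short_memory = np.exp(sum(t * np.cos(i * lam) for i, t in enumerate(theta)))
    return long_memory * short_memory / (2.0 * np.pi)

# Evaluate on a grid avoiding lambda = 0, where the density diverges for d > 0.
lam = np.linspace(0.01, np.pi, 200)
f = fexp_spectral_density(lam, d=0.3, theta=[0.5, -0.2, 0.1])
```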

2. Posterior Contraction Rates: $L_2$ and Global Loss

Under mild regularity assumptions, Gaussian sieve priors yield adaptive, minimax-optimal posterior contraction rates (up to logarithmic factors) for the global $L_2$ (or $\ell_2$) loss over appropriately defined Sobolev-type parameter spaces.

Specifically, for the Sobolev ball

$$\Theta_\beta(L_0) = \Bigl\{\theta:\ \sum_{j=1}^\infty j^{2\beta}\theta_j^2 \leq L_0\Bigr\},$$

the minimax $L_2$-estimation rate is $n^{-\beta/(2\beta+1)}$. The Gaussian sieve prior achieves

$$\varepsilon_n(\beta) = C\left(\frac{\log n}{n}\right)^{\beta/(2\beta+1)},$$

in the sense that, for any $\theta_0\in \Theta_\beta(L_0)$ and $M$ sufficiently large,

$$E_{\theta_0}\Pi\Bigl(\theta:\ \|\theta-\theta_0\|_2^2 \geq M\,\log n\,\varepsilon_n^2 \,\Big|\, X^n\Bigr)\to 0,$$

as $n\to\infty$ [(Arbel et al., 2012), Theorem 3.4, Proposition 4.1, Section 5.1]. The Bayes risk associated with the posterior mean also achieves $O(\varepsilon_n^2)$.
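
Because the model is conjugate given $k$, the posterior over the truncation level and the posterior mean of $\theta$ are available in closed form. The sketch below assumes the sequence model and the canonical prior above, retains only the first $J=\mathrm{len}(X)$ coefficients, and restricts $k$ to $1\leq k\leq J$; these are practical conveniences for the example, not part of the theory.

```python
import numpy as np
from scipy.stats import norm, poisson

def sieve_posterior(X, n, lam=5.0, tau0=1.0, q=1.0):
    """Exact posterior over k and posterior mean of theta in the Gaussian sequence model."""
    J = len(X)
    j = np.arange(1, J + 1)
    sigma2 = 1.0 / n
    tau2 = tau0 * j ** (-2.0 * q)
    # Marginal log-density of X_j when coefficient j is active (j <= k) or zero (j > k).
    log_active = norm.logpdf(X, scale=np.sqrt(tau2 + sigma2))
    log_inactive = norm.logpdf(X, scale=np.sqrt(sigma2))
    # Log marginal likelihood of each k = 1..J via cumulative sums.
    ks = np.arange(1, J + 1)
    log_ml = np.cumsum(log_active) + (log_inactive.sum() - np.cumsum(log_inactive))
    log_post_k = poisson.logpmf(ks, lam) + log_ml
    post_k = np.exp(log_post_k - log_post_k.max())
    post_k /= post_k.sum()
    # Posterior mean: Gaussian shrinkage applied whenever j <= k, averaged over the posterior of k.
    prob_active = 1.0 - np.concatenate(([0.0], np.cumsum(post_k)[:-1]))  # P(k >= j | X)
    theta_mean = prob_active * tau2 / (tau2 + sigma2) * X
    return post_k, theta_mean
```

Heuristically, the posterior on $k$ concentrates on dimensions of order $n^{1/(2\beta+1)}$ (up to logarithmic factors), which is what drives the adaptive rate just stated.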

The key mechanism underlying these results is a balance between approximation error (controlled by the truncation level $k$ and the prior scales $\tau_j$) and stochastic error (arising from the noise level and the effective dimension of the truncated model). The proof uses prior-mass bounds over Kullback–Leibler neighborhoods, metric entropy estimates, and non-asymptotic testing inequalities.
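
Concretely, on $\Theta_\beta(L_0)$ the two error terms admit the following standard bounds (a heuristic sketch rather than a statement quoted from the cited papers):

$$\sum_{j>k}\theta_{0j}^2 \;\leq\; k^{-2\beta}\sum_{j>k} j^{2\beta}\theta_{0j}^2 \;\leq\; L_0\,k^{-2\beta} \quad\text{(approximation error)}, \qquad \frac{k}{n} \quad\text{(stochastic error)}.$$

Balancing $k^{-2\beta}\asymp k/n$ gives $k_n\asymp n^{1/(2\beta+1)}$ and a squared error of order $n^{-2\beta/(2\beta+1)}$, i.e., the rate $\varepsilon_n(\beta)$ above up to logarithmic factors.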

3. Adaptation and Suboptimality for Other Losses

While Gaussian sieve priors provide sharp global $L_2$ adaptation, their behavior under other loss functions can be markedly different. For the pointwise (local) risk

$$R_n^{\mathrm{loc}}(\theta_0,t) = E_{\theta_0}\Bigl[\bigl(\hat{f}_n(t)-f_0(t)\bigr)^2\Bigr],$$

where $\hat{f}_n(t)=\sum_j \varphi_j(t)\,\hat{\theta}_{nj}$, the minimax rate over $\Theta_\beta$ is $n^{-(2\beta-1)/(2\beta)}$. Under the Gaussian sieve prior, the attained pointwise risk decays only as

$$R_n^{\mathrm{loc}}(\theta_0,t) \gtrsim \frac{n^{-(2\beta-1)/(2\beta+1)}}{(\log n)^2},$$

which is slower than the minimax rate by a polynomial factor in $n$ [(Arbel et al., 2012), Proposition 5.3]. The dominant error arises from intermediate-frequency coefficients that are insufficiently regularized by the sieve prior, causing excess local variance.
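
To gauge the size of this gap, take $\beta=1$ as an illustration:

$$n^{-(2\beta-1)/(2\beta)} = n^{-1/2} \quad\text{(minimax)} \qquad\text{versus}\qquad n^{-(2\beta-1)/(2\beta+1)} = n^{-1/3} \quad\text{(sieve prior, up to logs)},$$

a polynomial loss of order $n^{1/6}$.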

A similar phenomenon appears in semi-parametric estimation of long-memory parameters in time series. With random truncation priors (Poisson or geometric) on the expansion length $k$, the contraction rate for the long-memory parameter $d$ is

$$(n/\log n)^{-(2\beta-1)/(4\beta+2)},$$

whereas the minimax rate is $(n/\log n)^{-(2\beta-1)/(4\beta)}$. Only when $k$ is tuned deterministically to scale as $n^{1/(2\beta)}$ (rather than assigned a prior) does the sieve prior attain the nearly optimal rate [(Kruijer et al., 2012), Theorems 3.2–3.4].
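
Again taking $\beta=1$ for illustration, the exponents work out to

$$(n/\log n)^{-1/6} \quad\text{(random $k$)} \qquad\text{versus}\qquad (n/\log n)^{-1/4} \quad\text{(minimax, attained with deterministic } k_n\asymp n^{1/2}\text{)}.$$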

4. Key Technical Conditions and Proof Structure

Contraction theorems for sieve priors rest on several analytic conditions (made explicit for the white noise model after the list):

  • KL-approximation (A1): Existence of low-dimensional truncations that approximate the true model in Kullback–Leibler divergence.
  • Reverse KL–$\ell_2$ control (A2): Uniform control of the KL divergence in terms of the $\ell_2$-distance around the truncation.
  • Metric entropy and covering (A3): Ability to cover sieved parameter sets in $d_n$-distance by $\ell_2$-balls, which facilitates the construction of exponentially powerful tests.
  • Testing (A4): Construction of tests with exponentially decaying type I and type II errors for hypotheses separated in $d_n$.
  • Prior tails and scales (A5): Appropriate decay of $\pi(k)$, sufficient mass for the scales $\tau_j$, and tail regularity of the conditional prior densities.
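
In the white noise model these conditions take an explicit form, since for the Gaussian sequence experiment the Kullback–Leibler divergence is proportional to the squared $\ell_2$-distance (a standard computation, sketched here rather than quoted from the papers):

$$\mathrm{KL}\bigl(P_{\theta_0}^n, P_\theta^n\bigr) = \frac{n}{2}\,\|\theta_0-\theta\|_2^2,$$

so KL neighborhoods are $\ell_2$-balls of radius of order $\varepsilon_n$, and (A1) reduces to the truncation bound $\sum_{j>k}\theta_{0j}^2 \leq L_0\,k^{-2\beta}$ on $\Theta_\beta(L_0)$.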

The proof proceeds via a testing–prior-mass approach. The numerator of the posterior probability for "bad" sets (where the contraction fails) is controlled by a union bound over tests, while the denominator is lower-bounded by the prior mass of suitable KL neighborhoods. Contributions from large and small $k$ are handled via tail bounds on $\pi(k)$ (Arbel et al., 2012).

5. Implications for Adaptive Bayesian Estimation

Gaussian sieve priors illustrate both the strengths and limitations of hierarchical Bayesian adaptivity. For global loss functions such as integrated squared error, the prior achieves adaptive minimax rates over a large class of smoothness spaces (e.g., Sobolev balls). The adaptation occurs via the random selection (or deterministic choice) of the truncation level $k_n$, which balances bias and variance automatically.

However, for more localized or semi-parametric functionals (e.g., pointwise function estimation, long-memory parameter estimation), full Bayesian adaptation via a sieve prior leads to a trade-off in contraction rate. The failure to attain optimal local rates is due to a fundamental mismatch between the global regularization induced by the sieve structure and the localized risk structure of the problem (Arbel et al., 2012, Kruijer et al., 2012).

This behavior underscores the necessity of careful prior design or tuning for objectives that go beyond global estimation: for instance, fixing the sieve truncation to match the minimax-optimal dimension achieves nearly optimal rates for long-memory parameters in time series, whereas data-driven or fully random truncation does not.

6. Comparison with Frequentist and Other Bayesian Approaches

Analysis reveals that the convergence properties of Gaussian sieve priors often parallel those of frequentist sieve estimators, particularly in nonparametric and semi-parametric models. For example, periodogram-based estimators for long-memory time series achieve the minimax rate $n^{-(2\beta-1)/(4\beta)}$, which matches the contraction rate of the sieve prior with deterministic $k_n$ (up to log-factors) (Kruijer et al., 2012).

In contrast to fully Bayesian methods that randomize $k$ with heavy-tailed priors (facilitating automatic adaptation over the entire parameter space), deterministic or empirically tuned sieves can deliver sharper convergence rates for specific functionals. This reflects an inherent trade-off: full Bayesian adaptation excels for function estimation in global metrics, but sacrifices efficiency for certain semi-parametric objectives.

7. Summary Table: Sieve Prior Contraction Rates

| Problem/Class | Sieve Prior Type | Achieved Rate (up to logs) | Minimax/Optimal Rate |
|---|---|---|---|
| Global $L_2$ (Sobolev, white noise) | Poisson $\pi(k)$, $\tau_j^2=\tau_0 j^{-2q}$ | $n^{-\beta/(2\beta+1)}$ | $n^{-\beta/(2\beta+1)}$ |
| Pointwise (Sobolev, white noise) | Same | $n^{-(2\beta-1)/(2\beta+1)}$ | $n^{-(2\beta-1)/(2\beta)}$ |
| Long-memory $d$, FEXP (semi-parametric) | $k_n\sim n^{1/(2\beta)}$ (deterministic) | $(n/\log n)^{-(2\beta-1)/(4\beta)}$ | $(n/\log n)^{-(2\beta-1)/(4\beta)}$ |
| Long-memory $d$, FEXP | Poisson/geometric $\pi(k)$ | $(n/\log n)^{-(2\beta-1)/(4\beta+2)}$ | $(n/\log n)^{-(2\beta-1)/(4\beta)}$ |

The contraction properties confirm that Gaussian sieve priors are robust tools for adaptive nonparametric Bayes inference in high-dimensional and infinite-dimensional settings, but their performance must be evaluated in light of the specific estimation criterion of interest.

References: (Arbel et al., 2012, Kruijer et al., 2012).
