Sieve-SGD: Scalable Online Regression

Updated 18 February 2026
  • Sieve-SGD is a scalable online learning method that combines nonparametric sieve estimation with stochastic gradient descent to achieve rate-optimal mean squared error.
  • It restricts the hypothesis space to a sequence of growing finite-dimensional subspaces, balancing approximation and stochastic errors while significantly reducing computation.
  • Empirical results and theoretical guarantees demonstrate that Sieve-SGD outperforms classical kernel methods in online nonparametric regression and particle-based simulation in law-dependent SDE applications.

Sieve Stochastic Gradient Descent (Sieve-SGD) is a methodology that combines nonparametric sieve estimation with stochastic approximation, yielding scalable online learning algorithms for high-dimensional function estimation. The principal innovation is restricting the infinite-dimensional hypothesis space to a growing sequence of finite-dimensional subspaces (sieves), using SGD within each subspace, and carefully matching the rate at which the sieve dimension grows to the decay of approximation and stochastic error. This paradigm enables rate-optimal mean squared error (MSE) in online regression and efficient approximation for certain classes of stochastic differential equations, while dramatically reducing computational and memory requirements compared to classical kernel methods (Zhang et al., 2021; Agarwal et al., 2023).

1. Statistical Setting and Sieve Construction

Consider the classic nonparametric regression framework with sequential data: at each time $t$, one observes $(X_t, Y_t)$, drawn i.i.d. from a distribution $\rho$ on $\mathcal{X} \times \mathbb{R}$. The objective is to construct a sequence of estimators $\hat{f}_t$ for the regression function $f_\rho(x) = \mathbb{E}[Y \mid X = x]$, minimizing the expected $L^2$-error $\mathbb{E}\|\hat{f}_T - f_\rho\|^2$.

The underlying hypothesis space is typically a Sobolev ellipsoid
$$W(s,Q) = \left\{ f = \sum_{j=1}^\infty \theta_j \psi_j \;:\; \sum_{j=1}^\infty (j^s \theta_j)^2 \le Q^2 \right\},$$
for $s > 1/2$ and a fixed orthonormal basis $\{\psi_j\}_{j \ge 1}$ in $L^2(\nu)$. Functions $f \in W(s,Q)$ have expansion coefficients $\theta_j$ that decay on the order of $j^{-s-1/2}$.

To balance approximation and stochastic error, Sieve-SGD employs a sequence of truncated basis spaces (sieves). At time $t$, estimation is restricted to the subspace $V_{J_t} = \operatorname{span}\{\psi_1, \ldots, \psi_{J_t}\}$ with $J_t = \lfloor t^\alpha \rfloor$. The exponent $\alpha = 1/(2s+1)$ is chosen so that the approximation error $O(J_t^{-s})$ matches the estimation error; this recovers the minimax bias–variance tradeoff for the given smoothness.
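For intuition, the sieve dimension grows very slowly; a quick check with an assumed smoothness $s = 2$ (so $\alpha = 0.2$):

```python
import numpy as np

# Sieve dimension J_t = floor(t^alpha) with alpha = 1/(2s+1),
# illustrated for an assumed smoothness s = 2 (so alpha = 0.2).
s = 2.0
alpha = 1.0 / (2 * s + 1)

def sieve_dim(t):
    return int(np.floor(t ** alpha))

dims = {t: sieve_dim(t) for t in (10, 100, 1000, 10_000)}
```

Even after ten thousand observations, only a handful of basis functions are active, which is the source of the method's memory savings.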

2. Algorithmic Implementation

The Sieve-SGD update rule maintains an estimator $f_t(x) = \sum_{j=1}^{J_t} \theta_{t,j} \psi_j(x)$. At iteration $t$, after receiving $(X_t, Y_t)$, the algorithm performs:

  1. Compute $f_{t-1}(X_t) = \sum_{j=1}^{J_{t-1}} \theta_{t-1,j} \psi_j(X_t)$.
  2. Calculate the residual $r_t = Y_t - f_{t-1}(X_t)$.
  3. For $j \le J_t$, update the coefficients:

$$\theta_{t,j} = \theta_{t-1,j} + \gamma_t \, r_t \, j^{-2s} \psi_j(X_t)$$

where $\gamma_t = \gamma_0 t^{-1/(2s+1)}$ and $\gamma_0 \le 1/(2M^2 \zeta(2s))$ is a suitably small step size (Zhang et al., 2021).

  4. Optionally, apply Polyak averaging: $\bar{f}_T = \frac{1}{T} \sum_{t=1}^T f_t$, which empirically stabilizes estimation and achieves optimal risk.
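The update loop above can be sketched as follows; the cosine basis on $[0,1]$ and the values of $s$ and $\gamma_0$ are illustrative choices, not prescribed by the method:

```python
import numpy as np

# Sketch of Sieve-SGD for online regression on [0, 1].
# Assumed basis: psi_1 = 1, psi_j(x) = sqrt(2) cos((j-1) pi x);
# s and gamma0 are illustrative values.
def psi(j, x):
    if j == 1:
        return np.ones_like(x, dtype=float)
    return np.sqrt(2.0) * np.cos((j - 1) * np.pi * x)

def sieve_sgd(X, Y, s=2.0, gamma0=0.2):
    T = len(X)
    alpha = 1.0 / (2 * s + 1)
    J_max = max(1, int(np.floor(T ** alpha)))
    theta = np.zeros(J_max)       # current coefficient iterate
    theta_bar = np.zeros(J_max)   # Polyak average of the iterates
    for t in range(1, T + 1):
        J_t = max(1, int(np.floor(t ** alpha)))     # current sieve dimension
        x, y = X[t - 1], Y[t - 1]
        basis = np.array([psi(j, x) for j in range(1, J_t + 1)], dtype=float)
        resid = y - theta[:J_t] @ basis             # r_t = Y_t - f_{t-1}(X_t)
        gamma_t = gamma0 * t ** (-alpha)            # decaying step size
        weights = np.arange(1, J_t + 1, dtype=float) ** (-2 * s)  # j^{-2s}
        theta[:J_t] += gamma_t * resid * weights * basis
        theta_bar += (theta - theta_bar) / t        # running Polyak average
    return theta_bar

# usage: recover f(x) = cos(pi x) (= psi_2 / sqrt(2)) from noisy samples
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, 5000)
Y = np.cos(np.pi * X) + 0.1 * rng.standard_normal(5000)
theta_hat = sieve_sgd(X, Y)
```

Note that only the first $J_t$ coefficients are touched at step $t$, so the per-iteration cost tracks the (slowly growing) sieve dimension.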

The functional update can also be rendered as a projected move in function space:
$$f_t = P_{V_{J_t}}\big(f_{t-1} + \gamma_t r_t K_{X_t, J_t}\big), \qquad K_{X_t, J_t}(\cdot) = \sum_{j=1}^{J_t} j^{-2s} \psi_j(X_t) \psi_j(\cdot).$$
This structure leverages the spectrum of the Sobolev ellipsoid for efficient decomposition and computation.

3. Theoretical Guarantees

Under the conditions:

  • (A1) $(X_t, Y_t)$ are i.i.d.;
  • (A2) the density $p_X$ of $X$ is bounded: $\ell \le p_X \le u$;
  • (A3) $f_\rho \in W(s,Q)$ and the basis functions $\psi_j$ are bounded;
  • (A4) the regression noise $\varepsilon_t = Y_t - f_\rho(X_t)$ is bounded or has $\mathrm{Var}(\varepsilon) < \infty$;

the Sieve-SGD estimator satisfies

$$\mathbb{E}\|\bar{f}_T - f_\rho\|_{L^2(\rho_X)}^2 = O\!\left(T^{-2s/(2s+1)}\right),$$

matching the minimax lower bound for $W(s,Q)$. The analysis decomposes the estimation error into a deterministic bias component and a stochastic noise component, leveraging spectral properties (eigenvalue decay $\lambda_j \approx j^{-2s}$) and metric-entropy tools (Carl's inequality) to propagate contraction rates. This guarantees rate-optimal learning under minimal assumptions (Zhang et al., 2021).
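The exponent can be read off from a standard bias–variance balance: with $J$ active basis functions after $T$ samples, the squared bias scales like $J^{-2s}$ and the stochastic error like $J/T$; equating the two recovers both the sieve growth rate and the minimax MSE:

```latex
J^{-2s} \asymp \frac{J}{T}
\;\Longrightarrow\; J \asymp T^{1/(2s+1)}
\;\Longrightarrow\; \mathbb{E}\|\bar{f}_T - f_\rho\|^2
\asymp \frac{J}{T} \asymp T^{-2s/(2s+1)}.
```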

4. Computational Complexity and Comparisons

For each iteration:

  • Evaluating $f_{t-1}(X_t)$ costs $O(J_{t-1})$ operations.
  • Updating the coefficients is $O(J_t)$.
  • With $J_t = O(t^{1/(2s+1)})$, the $t$-th step is $O(t^{1/(2s+1)})$.

Aggregating over $n$ samples, the total time is $O(n^{1+1/(2s+1)})$. Memory usage is $O(J_n) = O(n^{1/(2s+1)})$, corresponding directly to the current sieve dimension.

In contrast, kernel-SGD methods (using $n$ basis functions at time $n$) have $O(n)$ memory and $O(n^2)$ total time complexity. Thus, Sieve-SGD achieves near-minimal polynomial scaling in both time and space for rate-optimal nonparametric online regression (Zhang et al., 2021).

Algorithm   | Memory Complexity   | Time Complexity
------------|---------------------|----------------------
Sieve-SGD   | $O(n^{1/(2s+1)})$   | $O(n^{1+1/(2s+1)})$
Kernel-SGD  | $O(n)$              | $O(n^2)$
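A quick numerical sanity check of this accounting (toy operation counting, not a benchmark; $s = 2$ is an assumed smoothness):

```python
import numpy as np

# Per-step cost of Sieve-SGD is proportional to J_t = floor(t^{1/(2s+1)}),
# so the cumulative cost over n steps should scale like n^{1 + 1/(2s+1)}.
s = 2.0
alpha = 1.0 / (2 * s + 1)

def total_ops(n):
    t = np.arange(1, n + 1, dtype=float)
    return float(np.sum(np.floor(t ** alpha)))

# The ratio to n^{1+alpha} should stabilize near a constant as n grows.
ratios = [total_ops(n) / n ** (1 + alpha) for n in (10_000, 100_000)]
```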

5. Extensions to McKean–Vlasov SDEs

A related Sieve-SGD methodology is employed in the context of McKean–Vlasov stochastic differential equations (MV-SDEs), where the solution is a law-dependent SDE with separable measure dependence in drift and diffusion. Here, the approach reformulates existence of the solution as a minimization problem over continuous functions $\gamma \in C([0,T]; \mathbb{R}^K)$:
$$L[\gamma] = \|\Psi(\gamma) - \gamma\|_{L^2([0,T]; \mathbb{R}^K)}^2,$$
where $\Psi(\gamma)(t) = \mathbb{E}[\varphi(Z_t^\gamma)]$ and $Z_t^\gamma$ solves a Markovian SDE parameterized by $\gamma$.

The infinite-dimensional optimization is approximated in a finite-dimensional sieve $S_n = \{\gamma(t) = \sum_{i=0}^n a_i g_i(t) : a_i \in \mathbb{R}^K\}$ for scalar basis functions $g_0, \ldots, g_n$. SGD is applied to minimize the induced empirical loss $G(a)$ over $a \in \mathbb{R}^{(n+1)K}$. Convergence of Sieve-SGD in this setting is ensured by establishing uniform moment bounds, differentiability of the loss, and unbiasedness of the stochastic gradients under classical step-size schedules $\eta_m$.
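A toy illustration of this scheme, under strong simplifying assumptions not taken from the paper: a one-dimensional MV-SDE $dZ_t = -(Z_t - \gamma(t))\,dt + \sigma\,dW_t$ with $\varphi(z) = z$ (so the fixed point is $\gamma^*(t) = \mathbb{E}[Z_t]$, here the constant $z_0$), a monomial sieve basis, mini-batch Monte-Carlo estimates of $\Psi(\gamma)$, and finite-difference gradients with common random numbers standing in for the tangent-process estimator:

```python
import numpy as np

# Toy sieve-SGD for a McKean-Vlasov fixed point. All modeling choices are
# illustrative:  dZ_t = -(Z_t - gamma(t)) dt + sigma dW_t,  phi(z) = z,
# so Psi(gamma)(t) = E[Z_t^gamma] and the fixed point is gamma*(t) = z0.
rng = np.random.default_rng(0)
T, n_steps, sigma, z0 = 0.5, 50, 0.1, 1.0
h = T / n_steps
grid = np.linspace(0.0, T, n_steps + 1)
n = 3  # sieve (polynomial) degree

def gamma_of(a, t):
    """Sieve element gamma(t) = sum_i a_i t^i (monomial basis for simplicity)."""
    return np.polynomial.polynomial.polyval(t, a)

def psi_hat(a, dW):
    """Monte-Carlo estimate of Psi(gamma)(t) = E[Z_t^gamma] via Euler-Maruyama."""
    Z = np.full(dW.shape[1], z0)
    means = [z0]
    for k in range(n_steps):
        Z = Z + h * (-(Z - gamma_of(a, grid[k]))) + sigma * dW[k]
        means.append(Z.mean())
    return np.array(means)

def loss(a, dW):
    """Empirical version of L[gamma] = ||Psi(gamma) - gamma||^2 on the grid."""
    return h * np.sum((psi_hat(a, dW) - gamma_of(a, grid)) ** 2)

# Mini-batch SGD; finite differences with common random numbers approximate
# the gradient, in place of the paper's unbiased tangent-process estimator.
a, M, eps = np.zeros(n + 1), 200, 1e-4
for m in range(300):
    dW = np.sqrt(h) * rng.standard_normal((n_steps, M))
    base = loss(a, dW)
    grad = np.zeros_like(a)
    for i in range(n + 1):
        a_pert = a.copy()
        a_pert[i] += eps
        grad[i] = (loss(a_pert, dW) - base) / eps
    a -= 0.5 / (m + 1) ** 0.7 * grad   # eta_m = r0 / (m+1)^rho
```

After a few hundred iterations the recovered $\gamma$ is close to the known constant fixed point $z_0$ on $[0, T]$.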

Numerical experiments (e.g., for the Kuramoto–Sakaguchi model, a quadratic-drift MV-SDE, and a convolution-type MV-SDE) show that modest sieve dimensions $n$ and moderate mini-batch sizes $M$ allow fast and precise recovery of law-dependent solution trajectories. Empirical results confirm significant computational advantages over interacting particle system (IPS) simulations, which require large populations for comparable accuracy (Agarwal et al., 2023).

6. Algorithmic Variants and Empirical Observations

Sieve-SGD supports online and mini-batch variants. The classical update iterates

$$a_{m+1} = a_m - \eta_m \big( v(a_m; \xi_{m+1}, W_{m+1}; \tilde{\xi}_{m+1}, \tilde{W}_{m+1}) + \nabla_a H(a_m) \big),$$

where $v(\cdot)$ is an unbiased estimator of the functional gradient (often involving tangent processes), and $H$ is an optional regularization penalty. A mini-batch version averages this estimator over $M$ replicates for variance reduction.

Key empirical features include:

  • Use of Lagrange polynomials at Chebyshev nodes as sieve basis functions $g_j(t)$.
  • Step-size decay $\eta_m = r_0/(m+1)^\rho$, $\rho \in (0.5, 1]$.
  • In practice, no penalty term ($H \equiv 0$), and the initialization $a_0$ matches empirical means.
  • Stopping criteria can be tied to a relative $L^2$-error threshold.
  • Euler–Maruyama discretization for the underlying SDEs.
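The Chebyshev–Lagrange basis from the first bullet can be sketched directly; the exact node convention on $[0, T]$ used here is an assumption:

```python
import numpy as np

# Lagrange basis at Chebyshev nodes on [0, T] (an assumed node convention):
# x_k = T/2 (1 + cos((2k+1) pi / (2(n+1)))), k = 0..n, and
# g_j(t) = prod_{k != j} (t - x_k) / (x_j - x_k), so g_j(x_k) = delta_{jk}.
def chebyshev_nodes(n, T):
    k = np.arange(n + 1)
    return 0.5 * T * (1.0 + np.cos((2 * k + 1) * np.pi / (2 * (n + 1))))

def lagrange_basis(nodes):
    def g(j, t):
        num = np.prod([t - nodes[k] for k in range(len(nodes)) if k != j], axis=0)
        den = np.prod([nodes[j] - nodes[k] for k in range(len(nodes)) if k != j])
        return num / den
    return g

nodes = chebyshev_nodes(3, 0.5)   # degree n = 3 on [0, 0.5], as in Section 6
g = lagrange_basis(nodes)
```

With this basis, the sieve coefficients $a_i$ are exactly the values of $\gamma$ at the nodes, which makes the initialization from empirical means straightforward.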

Numerical benchmarks indicate accelerated convergence with growing mini-batch size. For example, for $T = 0.5$ and polynomial degree $n = 3$, the average number of iterations decreases from $\sim 278$ (mini-batch size $M = 1$) to $\sim 2$ ($M = 1000$), with commensurate CPU-time savings (Agarwal et al., 2023).

7. Significance and Outlook

Sieve-SGD synthesizes the statistical optimality of sieve methods with the computational advantages of SGD, addressing key scalability bottlenecks in online nonparametric regression and law-dependent SDEs. By adaptively growing the effective model complexity and matching learning rates to the smoothness class of the target function, Sieve-SGD attains minimax risk rates with provably minimal computational overhead.

A plausible implication is that Sieve-SGD provides a unified template for efficient stochastic function approximation in both regression and high-dimensional dynamical systems, substantially broadening the accessible scale for online and streaming data applications without resorting to kernel-based or particle-based methods (Zhang et al., 2021, Agarwal et al., 2023).
