Sieve-SGD: Scalable Online Regression

Updated 18 February 2026
  • Sieve-SGD is a scalable online learning method that combines nonparametric sieve estimation with stochastic gradient descent to achieve rate-optimal mean squared error.
  • It restricts the hypothesis space to a sequence of growing finite-dimensional subspaces, balancing approximation and stochastic errors while significantly reducing computation.
  • Empirical results and theoretical guarantees demonstrate that Sieve-SGD outperforms classical kernel methods in online nonparametric regression and particle-based simulation in law-dependent SDE applications.

Sieve Stochastic Gradient Descent (Sieve-SGD) is a methodology that combines nonparametric sieve estimation with stochastic approximation, yielding scalable online learning algorithms for high-dimensional function estimation. The principal innovation is restricting the infinite-dimensional hypothesis space to a growing sequence of finite-dimensional subspaces (sieves), using SGD within each subspace, and carefully matching the rate at which the sieve dimension grows to the decay of approximation and stochastic error. This paradigm enables rate-optimal mean squared error (MSE) in online regression and efficient approximation for certain classes of stochastic differential equations, while dramatically reducing computational and memory requirements compared to classical kernel methods (Zhang et al., 2021; Agarwal et al., 2023).

1. Statistical Setting and Sieve Construction

Consider the classic nonparametric regression framework with sequential data: at each time $t$, one observes $(X_t, Y_t)$, drawn i.i.d. from a distribution $\rho$ on $\mathcal{X} \times \mathbb{R}$. The objective is to construct a sequence of estimators $\hat{f}_t$ for the regression function $f_\rho(x) = \mathbb{E}[Y \mid X = x]$, minimizing the expected $L^2$-error $\mathbb{E}\|\hat{f}_T - f_\rho\|^2$.

The underlying hypothesis space is typically a Sobolev ellipsoid
$$W(s,Q) = \left\{ f = \sum_{j=1}^\infty \theta_j \psi_j \;:\; \sum_{j=1}^\infty (j^s \theta_j)^2 \le Q^2 \right\},$$
for $s > 1/2$ and a fixed orthonormal basis $\{\psi_j\}_{j \ge 1}$ in $L^2(\nu)$. Functions $f \in W(s,Q)$ have expansion coefficients $\theta_j$ that decay on the order of $j^{-s-1/2}$.

To balance approximation and stochastic error, Sieve-SGD employs a sequence of truncated basis spaces (sieves). At time $t$, estimation is restricted to the subspace $V_{J_t} = \operatorname{span}\{\psi_1, \ldots, \psi_{J_t}\}$ with $J_t = \lfloor t^\alpha \rfloor$. The exponent $\alpha = 1/(2s+1)$ is chosen so that the approximation error $O(J_t^{-s})$ matches the estimation error; this recovers the minimax bias–variance tradeoff for the given smoothness.
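For intuition, the sieve dimension grows very slowly; a quick check with an assumed smoothness $s = 2$ (so $\alpha = 0.2$):

```python
import numpy as np

# Sieve dimension J_t = floor(t^alpha) with alpha = 1/(2s+1),
# illustrated for an assumed smoothness s = 2 (so alpha = 0.2).
s = 2.0
alpha = 1.0 / (2 * s + 1)

def sieve_dim(t):
    return int(np.floor(t ** alpha))

dims = {t: sieve_dim(t) for t in (10, 100, 1000, 10_000)}
```

Even after ten thousand observations, only a handful of basis functions are active, which is the source of the method's memory savings.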

2. Algorithmic Implementation

The Sieve-SGD update rule maintains an estimator $f_t(x) = \sum_{j=1}^{J_t} \theta_{t,j} \psi_j(x)$. At iteration $t$, after receiving $(X_t, Y_t)$, the algorithm performs:

  1. Compute $f_{t-1}(X_t) = \sum_{j=1}^{J_{t-1}} \theta_{t-1,j} \psi_j(X_t)$.
  2. Calculate the residual $r_t = Y_t - f_{t-1}(X_t)$.
  3. For $j \le J_t$, update the coefficients:

$$\theta_{t,j} = \theta_{t-1,j} + \gamma_t \, r_t \, j^{-2s} \psi_j(X_t)$$

where $\gamma_t = \gamma_0 t^{-1/(2s+1)}$ and $\gamma_0 \le 1/(2M^2 \zeta(2s))$ is a suitably small step size (Zhang et al., 2021).

  4. Optionally, apply Polyak averaging: $\bar{f}_T = \frac{1}{T} \sum_{t=1}^T f_t$, which empirically stabilizes estimation and achieves optimal risk.
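The update loop above can be sketched as follows; the cosine basis on $[0,1]$ and the values of $s$ and $\gamma_0$ are illustrative choices, not prescribed by the method:

```python
import numpy as np

# Sketch of Sieve-SGD for online regression on [0, 1].
# Assumed basis: psi_1 = 1, psi_j(x) = sqrt(2) cos((j-1) pi x);
# s and gamma0 are illustrative values.
def psi(j, x):
    if j == 1:
        return np.ones_like(x, dtype=float)
    return np.sqrt(2.0) * np.cos((j - 1) * np.pi * x)

def sieve_sgd(X, Y, s=2.0, gamma0=0.2):
    T = len(X)
    alpha = 1.0 / (2 * s + 1)
    J_max = max(1, int(np.floor(T ** alpha)))
    theta = np.zeros(J_max)       # current coefficient iterate
    theta_bar = np.zeros(J_max)   # Polyak average of the iterates
    for t in range(1, T + 1):
        J_t = max(1, int(np.floor(t ** alpha)))     # current sieve dimension
        x, y = X[t - 1], Y[t - 1]
        basis = np.array([psi(j, x) for j in range(1, J_t + 1)], dtype=float)
        resid = y - theta[:J_t] @ basis             # r_t = Y_t - f_{t-1}(X_t)
        gamma_t = gamma0 * t ** (-alpha)            # decaying step size
        weights = np.arange(1, J_t + 1, dtype=float) ** (-2 * s)  # j^{-2s}
        theta[:J_t] += gamma_t * resid * weights * basis
        theta_bar += (theta - theta_bar) / t        # running Polyak average
    return theta_bar

# usage: recover f(x) = cos(pi x) (= psi_2 / sqrt(2)) from noisy samples
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, 5000)
Y = np.cos(np.pi * X) + 0.1 * rng.standard_normal(5000)
theta_hat = sieve_sgd(X, Y)
```

Note that only the first $J_t$ coefficients are touched at step $t$, so the per-iteration cost tracks the (slowly growing) sieve dimension.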

The functional update can also be rendered as a projected move in function space:
$$f_t = P_{V_{J_t}}\big(f_{t-1} + \gamma_t r_t K_{X_t, J_t}\big), \qquad K_{X_t, J_t}(\cdot) = \sum_{j=1}^{J_t} j^{-2s} \psi_j(X_t) \psi_j(\cdot).$$
This structure leverages the spectrum of the Sobolev ellipsoid for efficient decomposition and computation.

3. Theoretical Guarantees

Under the conditions:

  • (A1) $(X_t, Y_t)$ are i.i.d.;
  • (A2) the density $p_X$ of $X$ is bounded: $\ell \le p_X \le u$;
  • (A3) $f_\rho \in W(s,Q)$ and the basis functions $\psi_j$ are bounded;
  • (A4) the regression noise $\varepsilon_t = Y_t - f_\rho(X_t)$ is bounded or has $\mathrm{Var}(\varepsilon) < \infty$;

the Sieve-SGD estimator satisfies

$$\mathbb{E}\|\bar{f}_T - f_\rho\|_{L^2(\rho_X)}^2 = O\!\left(T^{-2s/(2s+1)}\right),$$

matching the minimax lower bound for $W(s,Q)$. The analysis decomposes the estimation error into a deterministic bias component and a stochastic noise component, leveraging spectral properties (eigenvalue decay $\lambda_j \approx j^{-2s}$) and metric-entropy tools (Carl's inequality) to propagate contraction rates. This guarantees rate-optimal learning under minimal assumptions (Zhang et al., 2021).
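The exponent can be read off from a standard bias–variance balance: with $J$ active basis functions after $T$ samples, the squared bias scales like $J^{-2s}$ and the stochastic error like $J/T$; equating the two recovers both the sieve growth rate and the minimax MSE:

```latex
J^{-2s} \asymp \frac{J}{T}
\;\Longrightarrow\; J \asymp T^{1/(2s+1)}
\;\Longrightarrow\; \mathbb{E}\|\bar{f}_T - f_\rho\|^2
\asymp \frac{J}{T} \asymp T^{-2s/(2s+1)}.
```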

4. Computational Complexity and Comparisons

For each iteration:

  • Evaluating $f_{t-1}(X_t)$ costs $O(J_{t-1})$ operations.
  • Updating the coefficients is $O(J_t)$.
  • With $J_t = O(t^{1/(2s+1)})$, the $t$-th step is $O(t^{1/(2s+1)})$.

Aggregating over $n$ samples, the total time is $O(n^{1+1/(2s+1)})$. Memory usage is $O(J_n) = O(n^{1/(2s+1)})$, corresponding directly to the current sieve dimension.

In contrast, kernel-SGD methods (using $n$ basis functions at time $n$) have $O(n)$ memory and $O(n^2)$ total time complexity. Thus, Sieve-SGD achieves near-minimal polynomial scaling in both time and space for rate-optimal nonparametric online regression (Zhang et al., 2021).

Algorithm   | Memory Complexity   | Time Complexity
------------|---------------------|----------------------
Sieve-SGD   | $O(n^{1/(2s+1)})$   | $O(n^{1+1/(2s+1)})$
Kernel-SGD  | $O(n)$              | $O(n^2)$
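A quick numerical sanity check of this accounting (toy operation counting, not a benchmark; $s = 2$ is an assumed smoothness):

```python
import numpy as np

# Per-step cost of Sieve-SGD is proportional to J_t = floor(t^{1/(2s+1)}),
# so the cumulative cost over n steps should scale like n^{1 + 1/(2s+1)}.
s = 2.0
alpha = 1.0 / (2 * s + 1)

def total_ops(n):
    t = np.arange(1, n + 1, dtype=float)
    return float(np.sum(np.floor(t ** alpha)))

# The ratio to n^{1+alpha} should stabilize near a constant as n grows.
ratios = [total_ops(n) / n ** (1 + alpha) for n in (10_000, 100_000)]
```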

5. Extensions to McKean–Vlasov SDEs

A related Sieve-SGD methodology is employed in the context of McKean–Vlasov stochastic differential equations (MV-SDEs), where the solution is a law-dependent SDE with separable measure dependence in drift and diffusion. Here, the approach reformulates existence of the solution as a minimization problem over continuous functions $\gamma \in C([0,T]; \mathbb{R}^K)$:
$$L[\gamma] = \|\Psi(\gamma) - \gamma\|_{L^2([0,T]; \mathbb{R}^K)}^2,$$
where $\Psi(\gamma)(t) = \mathbb{E}[\varphi(Z_t^\gamma)]$ and $Z_t^\gamma$ solves a Markovian SDE parameterized by $\gamma$.

The infinite-dimensional optimization is approximated in a finite-dimensional sieve $S_n = \{\gamma(t) = \sum_{i=0}^n a_i g_i(t) : a_i \in \mathbb{R}^K\}$ for scalar basis functions $g_0, \ldots, g_n$. SGD is applied to minimize the induced empirical loss $G(a)$ over $a \in \mathbb{R}^{(n+1)K}$. Convergence of Sieve-SGD in this setting is ensured by establishing uniform moment bounds, differentiability of the loss, and unbiasedness of the stochastic gradients under classical step-size schedules $\eta_m$.
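A toy illustration of this scheme, under strong simplifying assumptions not taken from the paper: a one-dimensional MV-SDE $dZ_t = -(Z_t - \gamma(t))\,dt + \sigma\,dW_t$ with $\varphi(z) = z$ (so the fixed point is $\gamma^*(t) = \mathbb{E}[Z_t]$, here the constant $z_0$), a monomial sieve basis, mini-batch Monte-Carlo estimates of $\Psi(\gamma)$, and finite-difference gradients with common random numbers standing in for the tangent-process estimator:

```python
import numpy as np

# Toy sieve-SGD for a McKean-Vlasov fixed point. All modeling choices are
# illustrative:  dZ_t = -(Z_t - gamma(t)) dt + sigma dW_t,  phi(z) = z,
# so Psi(gamma)(t) = E[Z_t^gamma] and the fixed point is gamma*(t) = z0.
rng = np.random.default_rng(0)
T, n_steps, sigma, z0 = 0.5, 50, 0.1, 1.0
h = T / n_steps
grid = np.linspace(0.0, T, n_steps + 1)
n = 3  # sieve (polynomial) degree

def gamma_of(a, t):
    """Sieve element gamma(t) = sum_i a_i t^i (monomial basis for simplicity)."""
    return np.polynomial.polynomial.polyval(t, a)

def psi_hat(a, dW):
    """Monte-Carlo estimate of Psi(gamma)(t) = E[Z_t^gamma] via Euler-Maruyama."""
    Z = np.full(dW.shape[1], z0)
    means = [z0]
    for k in range(n_steps):
        Z = Z + h * (-(Z - gamma_of(a, grid[k]))) + sigma * dW[k]
        means.append(Z.mean())
    return np.array(means)

def loss(a, dW):
    """Empirical version of L[gamma] = ||Psi(gamma) - gamma||^2 on the grid."""
    return h * np.sum((psi_hat(a, dW) - gamma_of(a, grid)) ** 2)

# Mini-batch SGD; finite differences with common random numbers approximate
# the gradient, in place of the paper's unbiased tangent-process estimator.
a, M, eps = np.zeros(n + 1), 200, 1e-4
for m in range(300):
    dW = np.sqrt(h) * rng.standard_normal((n_steps, M))
    base = loss(a, dW)
    grad = np.zeros_like(a)
    for i in range(n + 1):
        a_pert = a.copy()
        a_pert[i] += eps
        grad[i] = (loss(a_pert, dW) - base) / eps
    a -= 0.5 / (m + 1) ** 0.7 * grad   # eta_m = r0 / (m+1)^rho
```

After a few hundred iterations the recovered $\gamma$ is close to the known constant fixed point $z_0$ on $[0, T]$.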

Numerical experiments (e.g., for the Kuramoto–Sakaguchi model, a quadratic-drift MV-SDE, and a convolution-type MV-SDE) show that modest sieve dimensions $n$ and moderate mini-batch sizes $M$ allow fast and precise recovery of law-dependent solution trajectories. Empirical results confirm significant computational advantages over interacting particle system (IPS) simulations, which require large populations for comparable accuracy (Agarwal et al., 2023).

6. Algorithmic Variants and Empirical Observations

Sieve-SGD supports online and mini-batch variants. The classical update iterates

$$a_{m+1} = a_m - \eta_m \big( v(a_m; \xi_{m+1}, W_{m+1}; \tilde{\xi}_{m+1}, \tilde{W}_{m+1}) + \nabla_a H(a_m) \big),$$

where $v(\cdot)$ is an unbiased estimator of the functional gradient (often involving tangent processes), and $H$ is an optional regularization penalty. A mini-batch version averages this estimator over $M$ replicates for variance reduction.

Key empirical features include:

  • Use of Lagrange polynomials at Chebyshev nodes as sieve basis functions $g_j(t)$.
  • Step-size decay $\eta_m = r_0/(m+1)^\rho$, $\rho \in (0.5, 1]$.
  • In practice, no penalty term ($H \equiv 0$), and the initialization $a_0$ matches empirical means.
  • Stopping criteria can be tied to a relative $L^2$-error threshold.
  • Euler–Maruyama discretization for the underlying SDEs.
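The Chebyshev–Lagrange basis from the first bullet can be sketched directly; the exact node convention on $[0, T]$ used here is an assumption:

```python
import numpy as np

# Lagrange basis at Chebyshev nodes on [0, T] (an assumed node convention):
# x_k = T/2 (1 + cos((2k+1) pi / (2(n+1)))), k = 0..n, and
# g_j(t) = prod_{k != j} (t - x_k) / (x_j - x_k), so g_j(x_k) = delta_{jk}.
def chebyshev_nodes(n, T):
    k = np.arange(n + 1)
    return 0.5 * T * (1.0 + np.cos((2 * k + 1) * np.pi / (2 * (n + 1))))

def lagrange_basis(nodes):
    def g(j, t):
        num = np.prod([t - nodes[k] for k in range(len(nodes)) if k != j], axis=0)
        den = np.prod([nodes[j] - nodes[k] for k in range(len(nodes)) if k != j])
        return num / den
    return g

nodes = chebyshev_nodes(3, 0.5)   # degree n = 3 on [0, 0.5], as in Section 6
g = lagrange_basis(nodes)
```

With this basis, the sieve coefficients $a_i$ are exactly the values of $\gamma$ at the nodes, which makes the initialization from empirical means straightforward.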

Numerical benchmarks indicate accelerated convergence with growing mini-batch size. For example, for $T = 0.5$ and polynomial degree $n = 3$, the average number of iterations decreases from $\sim 278$ (mini-batch size $M = 1$) to $\sim 2$ ($M = 1000$), with commensurate CPU-time savings (Agarwal et al., 2023).

7. Significance and Outlook

Sieve-SGD synthesizes the statistical optimality of sieve methods with the computational advantages of SGD, addressing key scalability bottlenecks in online nonparametric regression and law-dependent SDEs. By adaptively growing the effective model complexity and matching learning rates to the smoothness class of the target function, Sieve-SGD attains minimax risk rates with provably minimal computational overhead.

A plausible implication is that Sieve-SGD provides a unified template for efficient stochastic function approximation in both regression and high-dimensional dynamical systems, substantially broadening the accessible scale for online and streaming data applications without resorting to kernel-based or particle-based methods (Zhang et al., 2021, Agarwal et al., 2023).
