Accelerated Gradient Stacking Methods

Updated 27 February 2026

Stacking past gradients or model increments, through stacking initialization or Anderson Acceleration, mimics Nesterov’s accelerated gradient descent (AGD) and yields significant empirical speedups.
Stagewise training protocols with momentum-inspired initialization improve convergence in convex and certain nonconvex settings.
With suitable hyperparameter choices and restarting strategies, practical implementations achieve convergence rates comparable to AGD.

Accelerated gradient perspectives on stacking formalize a longstanding intuition: stacking past gradients or model increments, whether via Anderson Acceleration (AA) with restarting in convex optimization or layer-wise stacking in deep architectures, is closely connected to accelerated first-order optimization, notably Nesterov’s accelerated gradient descent (AGD). This link explains the empirical speedups that stacking heuristics deliver in modern training workflows, especially in stagewise and ensemble learning, and supplies convergence guarantees in convex and certain nonconvex settings (Ouyang et al., 2022, Agarwal et al., 2024).

1. Stacking and Stagewise Training Protocols

Stacking refers to building a model sequentially by combining “simple” functions (ensemble elements or layers), such as $F_T = \sum_{t=1}^T f_t$ for additive ensembles (boosting), or $F_T = (I+f_T)\circ \dots \circ (I+f_1)$ for deep residual networks. At each stage $t+1$, the new component $f_{t+1}$ is initialized at a particular starting point and then optimized, often via a small number of gradient steps, while the previous components are held fixed.
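As a minimal sketch of the two constructions, assume each component is a linear map $f_t(x) = W_t x$ with small random weights. The names `fs`, `F_add`, `F_res` and the scale `eps` are illustrative and do not come from the cited papers. The additive ensemble sums the maps, while the residual network composes the factors $(I + W_t)$; for small increments the two differ only by second-order cross terms.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(0)
d, T, eps = 4, 3, 1e-3  # hypothetical width, number of stages, increment scale

# Each stage is a linear map f_t(x) = W_t x, stored as a d x d matrix.
fs = [eps * rng.standard_normal((d, d)) for _ in range(T)]
I = np.eye(d)

# Additive ensemble: F_T = f_1 + ... + f_T.
F_add = sum(fs)

# Residual composition: F_T = (I + f_T) o ... o (I + f_1).
# reduce applies f_1 first, so later stages multiply from the left.
F_res = reduce(lambda acc, W: (I + W) @ acc, fs, I)

# The two agree to first order; the gap collects products W_j W_i.
gap = np.linalg.norm(F_res - (I + F_add))
print(f"second-order gap: {gap:.2e}")
```

The gap scales like `eps**2`, so it shrinks quickly as the increments get smaller.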

A critical practical distinction arises in the initialization of $f_{t+1}$:

Zero-initialization ($f_{t+1}^0 \equiv 0$) corresponds to classical GD steps.
Random-initialization emulates stochastic smoothing of the loss landscape.
Stacking-initialization ($f_{t+1}^0 = \beta f_t$ for $\beta\in[0,1)$) reuses the previous increment as a momentum term, mirroring the extrapolation step of momentum-based optimization (Agarwal et al., 2024).

The stacking heuristic, particularly with $\beta$ close to one followed by early stopping, sets up an iterative scheme that closely mimics the AGD updates in function space, which is the basis for the theoretical account of its empirical acceleration.

2. Anderson Acceleration as Multistep Gradient Stacking

Anderson Acceleration (AA), originally devised to speed up fixed-point iterations, accelerates the standard gradient descent update $x^{k+1}=x^k-\alpha\nabla f(x^k)$ by forming a linear combination of the $m$ most recent gradient steps to produce the next iterate. With memory parameter $m$, and with $\hat{m} \leq m$ counting the past iterates currently held in memory, the AA update at step $k$ is:

$x^{k+1} = g(x^{k-\hat{m}}) + \sum_{i=1}^{\hat{m}} \beta_i^k [g(x^{k-\hat{m}+i}) - g(x^{k-\hat{m}})],$

where $g(x)=x-\alpha\nabla f(x)$ and coefficients $\beta^k$ are obtained by minimizing the fixed-point residual in the affine span of the recent steps (Ouyang et al., 2022).

This linear combination “stacks” past iterates or gradients into a broad search direction, closely emulating the behavior of Conjugate Gradient (CG) in the local quadratic approximation of $f$. In functional boosting and deep learning, layer-wise stacking with initialization from prior stages is thus formally linked to the action of AA on the optimization trajectory.
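The AA update can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the authors' code: the helper names `aa_step` and `aa_gd`, the diagonal test quadratic, and the choice of `np.linalg.lstsq` for the coefficient solve are all hypothetical. The fixed-point residual is taken to be $h(x) = g(x) - x$, and the memory is cleared once it holds more than $m+1$ iterates.

```python
import numpy as np

def aa_step(xs, g):
    """One AA update from stored iterates xs = [x^{k-m_hat}, ..., x^k]."""
    X = np.array(xs)
    G = np.array([g(x) for x in xs])   # gradient-step images g(x^i)
    H = G - X                          # fixed-point residuals h = g(x) - x
    if len(xs) == 1:
        return G[0]                    # empty history: plain gradient step
    dH = (H[1:] - H[0]).T              # columns h(x^{k-m_hat+i}) - h(x^{k-m_hat})
    dG = (G[1:] - G[0]).T              # columns g(x^{k-m_hat+i}) - g(x^{k-m_hat})
    # beta minimizes || h(x^{k-m_hat}) + dH @ beta ||_2
    beta, *_ = np.linalg.lstsq(dH, -H[0], rcond=None)
    return G[0] + dG @ beta

def aa_gd(grad, x0, alpha, m, iters):
    """Gradient descent with AA and a simple periodic restart."""
    g = lambda x: x - alpha * grad(x)
    x = np.asarray(x0, dtype=float)
    xs = [x]
    for _ in range(iters):
        x = aa_step(xs, g)
        xs.append(x)
        if len(xs) > m + 1:            # restart: drop the whole memory
            xs = [x]
    return x

# Hypothetical test problem: f(x) = 0.5 x^T A x - b^T x with L = 100.
A = np.diag([1.0, 10.0, 100.0])
b = np.ones(3)
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)

x_aa = aa_gd(grad, np.zeros(3), alpha=1 / 100.0, m=3, iters=4)
x_gd = aa_gd(grad, np.zeros(3), alpha=1 / 100.0, m=0, iters=4)  # m=0 is plain GD
print(np.linalg.norm(x_aa - x_star), np.linalg.norm(x_gd - x_star))
```

On a quadratic in three dimensions, four stored iterates already span the whole space, so the residual minimization recovers the minimizer up to rounding, while plain GD has barely moved after the same number of steps.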

3. Equivalence to Accelerated Gradient Methods

In the additive stacking regime, initializing stage $t+1$ at $f_{t+1}^0 = \beta f_t$ and performing a quadratic approximation gives the update:

$F_{t+1} = F_t + \beta(F_t - F_{t-1}) - \frac{1}{L} \nabla \mathcal{L}(F_t + \beta(F_t - F_{t-1})),$

which exactly mirrors the Nesterov AGD iteration:

$y_t = x_t + \beta (x_t - x_{t-1}), \quad x_{t+1} = y_t - \frac{1}{L} \nabla \mathcal{L}(y_t).$

In the residual composition setting relevant for deep networks, $f_{t+1}^0 = \beta f_t$ yields equivalent accelerated updates up to commutator errors, which diminish as one nears the optimum and the increments shrink (Agarwal et al., 2024).
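The additive case can be checked numerically. The sketch below assumes each stage performs a single gradient step of size $1/L$ on the new component after stacking initialization, and it represents functions as plain vectors with a quadratic loss; the names `stacking` and `nesterov` and the random test problem are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)              # hypothetical quadratic loss 0.5 F^T A F - b^T F
b = rng.standard_normal(d)
L = np.linalg.eigvalsh(A).max()      # smoothness constant
grad = lambda F: A @ F - b
beta = 0.9

def stacking(T):
    """Additive stacking: init f_{t+1} = beta * f_t, then one gradient step."""
    F, f_prev, out = np.zeros(d), np.zeros(d), []
    for _ in range(T):
        f_new = beta * f_prev                  # stacking initialization
        f_new = f_new - grad(F + f_new) / L    # one step on the new component
        F = F + f_new                          # earlier components stay fixed
        f_prev = f_new
        out.append(F.copy())
    return np.array(out)

def nesterov(T):
    """Nesterov AGD with constant momentum beta and step 1/L."""
    x_prev, x, out = np.zeros(d), np.zeros(d), []
    for _ in range(T):
        y = x + beta * (x - x_prev)
        x_prev, x = x, y - grad(y) / L
        out.append(x.copy())
    return np.array(out)

F_stack, F_nest = stacking(30), nesterov(30)
print(np.max(np.abs(F_stack - F_nest)))
```

Because $f_t = F_t - F_{t-1}$, the two loops generate the same sequence up to floating-point rounding.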

AA with restarting implements a related multistep stacking, combining $m$ previous residuals and iterates to approximate CG in the local quadratic regime, further confirming the correspondence between these stacking strategies and accelerated gradient descent (Ouyang et al., 2022).

4. Descent Guarantees and Acceleration Factors

AA with restarting satisfies a local descent theorem: there exists a neighborhood $U$ around a minimizer $x^*$ and a constant $c>0$ such that

$f(x^{k+1}) \leq f(x^k) - c \|\nabla f(x^k)\|^2 + O(\|\nabla f(x^{k-\hat{m}})\|^3),$

ensuring local decrease of the objective at least as rapid as basic gradient descent, with possible acceleration when the stacked directions are nearly conjugate. The convergence rate satisfies

$\|h(x^{k+1})\| \leq (1 - 1/\kappa)\|h(x^k)\| + O(\|h(x^k)\|\sum_i \|h(x^i)\|),$

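so the iteration retains the GD contraction factor $1-1/\kappa$ for small residuals and improves on it as the combination more fully explores the underlying curvature (Ouyang et al., 2022). Here $h(x) = g(x) - x = -\alpha\nabla f(x)$ is the fixed-point residual and $\kappa$ is the condition number of $f$ near $x^*$.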

For stacking in deep linear residual networks with square loss and well-conditioned minimizers, if the layers are initialized within $O(\frac{\mu^5}{d\,L^4} \sigma_{\min}(W^*)^2)$ of optimum and $\beta = (\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$ with $\kappa = L/\mu$, stacking achieves the rate

$\mathcal{L}(W_T) - \mathcal{L}(W^*) \leq \exp(-\Omega(T/\sqrt{\kappa})),$

matching that of AGD. Accelerated rates are robust to small momentum errors, provided they are $O(\kappa^{-2})$-small relative to the stepsize (Agarwal et al., 2024).
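To get a feel for the momentum choice $\beta = (\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$, the sketch below compares GD ($\beta = 0$) with the momentum iteration on a strongly convex diagonal quadratic. The quadratic is a stand-in chosen for simplicity, not the deep linear network setting of the theorem, and the names `run`, `gd_gap`, and `agd_gap` are hypothetical.

```python
import numpy as np

mu, L = 1.0, 100.0
kappa = L / mu
beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # equals 9/11 for kappa = 100

lam = np.linspace(mu, L, 10)            # eigenvalues of a diagonal quadratic
loss = lambda x: 0.5 * np.sum(lam * x**2)   # minimum value 0 at x = 0
grad = lambda x: lam * x

def run(beta, T):
    """Momentum iteration with step 1/L; beta = 0 recovers plain GD."""
    x_prev = x = np.ones_like(lam)
    for _ in range(T):
        y = x + beta * (x - x_prev)
        x_prev, x = x, y - grad(y) / L
    return loss(x)

gd_gap = run(0.0, 100)
agd_gap = run(beta, 100)
print(f"GD gap: {gd_gap:.2e}, momentum gap: {agd_gap:.2e}")
```

After 100 steps the slowest GD mode has contracted by roughly $(1-1/\kappa)^{100}$, whereas the momentum iteration contracts at roughly $(1-1/\sqrt{\kappa})^{100}$, which is several orders of magnitude smaller.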

5. Globalization, Restarting, and Nonconvex Extensions

To ensure global convergence outside locally strongly convex regimes, AA incorporates a function-value-based acceptance criterion: at each step, the AA update is accepted only if it yields a sufficient descent relative to the last iterate, i.e.,

$f(x_{AA}^k) \leq f(x^k) - \gamma \|\nabla f(x^k)\|^2 + \min\{ c_1 \|\nabla f(x^{k-\hat{m}})\|^\nu, c_2 \|\nabla f(x^{k-\hat{m}})\|^2, c_3 \}$

with $0<\gamma<1/(2L)$, $\nu\in(2,3)$, and suitably chosen $c_1, c_2, c_3$. Otherwise, the method falls back to the plain gradient step. This rule guarantees that, under $L$-smoothness and boundedness below, the gradient norm converges to zero without requiring global convexity.
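The acceptance test translates directly into code. The following is a sketch under assumptions: the function name `guarded_step`, its argument layout, and the toy objective are hypothetical, and the constants are picked only to satisfy the stated ranges for $L = 1$.

```python
import numpy as np

def guarded_step(f, grad, x_k, x_aa, x_old, alpha, gamma, nu, c1, c2, c3):
    """Accept the AA candidate x_aa only if it passes the descent test.

    x_old plays the role of x^{k - m_hat}, the oldest iterate in memory.
    Returns the next iterate and a flag saying whether AA was accepted.
    """
    g_k = np.linalg.norm(grad(x_k))
    g_old = np.linalg.norm(grad(x_old))
    slack = min(c1 * g_old**nu, c2 * g_old**2, c3)
    if f(x_aa) <= f(x_k) - gamma * g_k**2 + slack:
        return x_aa, True
    return x_k - alpha * grad(x_k), False      # fallback: plain gradient step

# Toy objective f(x) = 0.5 ||x||^2, so L = 1 and gamma must lie in (0, 0.5).
f = lambda x: 0.5 * float(x @ x)
grad = lambda x: x
x_k, x_old = np.array([1.0, 0.0]), np.array([2.0, 0.0])
params = dict(alpha=0.5, gamma=0.005, nu=2.5, c1=1e-3, c2=1e-3, c3=1e-3)

good, ok_good = guarded_step(f, grad, x_k, np.zeros(2), x_old, **params)
bad, ok_bad = guarded_step(f, grad, x_k, np.array([10.0, 0.0]), x_old, **params)
print(ok_good, good, ok_bad, bad)
```

A candidate that lowers the objective enough is kept, while a candidate that overshoots is discarded in favor of the gradient step from $x^k$.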

Locally, once iterates enter a region of strong convexity and smooth Hessian, the function-value criterion is always satisfied and the pure accelerated regime resumes (Ouyang et al., 2022). A similar guarantee is available for stacking in additive and residual ensembles, where a brief period of zero-initialized “warm-up” brings the iterate into the region where AGD-like acceleration applies globally (Agarwal et al., 2024).

6. Empirical Validation and Comparison

Empirical studies confirm the theoretical acceleration from stacking and AA. On synthetic deep linear networks with square loss and on nonconvex classification tasks (CIFAR10, STL10, Student’s-t loss, and nonlinear least-squares with $\ell_2$ regularization), AA with restarting and stacking initialization consistently outperformed vanilla GD, L-BFGS, and random-initialized approaches, yielding a 2×–4× speedup in convergence for memory sizes $m \geq 10$ (Ouyang et al., 2022).

In BERT-Base pretraining, stacking initialization reduced loss more rapidly at each stage; combining stacking with a trainable $\beta$ did not degrade, and sometimes improved, final performance. Comparisons on layer initialization strategies (random vs stacking vs exact Nesterov) indicate stacking nearly matches the optimal scheme implied by AGD, while random-initialization can yield slower or unstable convergence (Agarwal et al., 2024).

7. Practical Implementation and Hyperparameter Choices

Guidelines for effective deployment of stacking-based acceleration and AA with restarting include:

Choosing memory parameter $m$ in the range 5–30, with $m \approx 10$–$20$ offering favorable trade-offs between computational cost and acceleration.
Setting step size $\alpha = 1/L$, or adapting via backtracking line search.
Restarting the AA memory every $m+1$ iterations or on ill-conditioning of the local least-squares system.
Using a function-value descent rule with small $\gamma$ (e.g., $\gamma \approx 0.01/(2L)$), $\nu \in (2,3)$, and $c_2$ near $1/(2mL)$.
Solving the $m\times m$ least-squares subproblems robustly, e.g., via QR or stable iterative methods with tight tolerance (Ouyang et al., 2022).
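One way to combine the QR-based solve with the restart-on-ill-conditioning guideline is sketched below. The function `solve_aa_coeffs`, the threshold `cond_max`, and the use of the diagonal of $R$ as a cheap conditioning proxy are assumptions made for illustration; the cited paper may use a different test. The sketch also assumes the residual-difference matrix has at least as many rows as columns.

```python
import numpy as np

def solve_aa_coeffs(dH, h0, cond_max=1e8):
    """Solve min_beta || h0 + dH @ beta ||_2 via a reduced QR factorization.

    dH holds the residual differences as columns (shape d x m_hat, d >= m_hat).
    Returns None when the system looks ill-conditioned, which signals the
    caller to restart the AA memory and take a plain gradient step.
    """
    Q, R = np.linalg.qr(dH)                  # dH = Q R, with R upper triangular
    diag = np.abs(np.diag(R))
    # Heuristic: max|r_ii| / min|r_ii| is a lower bound on cond(dH).
    if diag.min() <= diag.max() / cond_max:
        return None
    return np.linalg.solve(R, -(Q.T @ h0))

# Usage with hypothetical data.
h0 = np.array([1.0, 2.0, 3.0])
dH_ok = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
dH_bad = np.array([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])   # collinear columns

print(solve_aa_coeffs(dH_ok, h0))    # coefficients for the well-posed system
print(solve_aa_coeffs(dH_bad, h0))   # None: restart recommended
```

Working with the QR factors avoids forming the normal equations, whose condition number is the square of that of the difference matrix.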

For ensemble models and boosting, stacking initialization by copying previous model components or classifiers and performing a controlled number of optimization steps effectively implements momentum, matching AGD rates both in theory and in empirical results (Agarwal et al., 2024).

References:

Descent Properties of an Anderson Accelerated Gradient Method With Restarting (Ouyang et al., 2022)
Stacking as Accelerated Gradient Descent (Agarwal et al., 2024)
