
Randomly Weighted Gradient Descent

Updated 26 January 2026
  • Randomly Weighted Gradient Descent is an optimization method that applies stochastic reweighting to standard gradient updates for improved variance control.
  • Adaptive sampling in RWGD dynamically adjusts data weights, yielding empirical convergence speedups of up to 5×–10×.
  • The technique finds broad applications in linear regression, neural network training, and reinforcement learning, balancing rapid optimization with robust statistical performance.

Randomly Weighted Gradient Descent (RWGD) refers to a set of optimization methodologies in which the standard gradient update is modified by introducing random weights—either by importance sampling, data-dependent reweighting, or explicit randomization in the loss contributions—at each iteration. The principle is applied across convex and non-convex problems, including linear regression, neural network training, matrix factorization, and reinforcement learning. These approaches create a broad algorithmic landscape that unifies variants such as weighted stochastic gradient descent (SGD), adaptive weighted SGD (AW-SGD), and multi-restart neural network regression. RWGD fundamentally connects optimization dynamics, variance reduction, implicit regularization, and statistical performance.

1. Foundational Mechanisms of Randomly Weighted Gradient Descent

RWGD arises when the standard SGD update,

w_{t+1} = w_t - \rho_t \nabla_w f(x_t; w_t),

is replaced by an update of the form

w_{t+1} = w_t - \rho_t \frac{\nabla_w f(x_t; w_t)}{q(x_t)},

where x_t is drawn from a sampling distribution Q with density q(x), and the weight 1/q(x_t) corrects for sampling bias. This importance-weighted form preserves unbiasedness of the gradient estimator under mild conditions and forms the archetype of RWGD in stochastic optimization (Bouchard et al., 2015). Randomness thus enters both via the sampled data point and via the multiplicative reweighting by 1/q(x_t), a paradigm that generalizes to more elaborate schemes in which the weight sequence is itself stochastic or adaptively learned.
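As a concrete sketch, the importance-weighted update can be written in a few lines of NumPy for a toy least-squares problem. This is an illustrative example, not code from the cited paper; the data-generating setup, the fixed sampling law `q`, and all names (`rwgd`, `w_hat`, etc.) are invented here. Since uniform sampling places mass 1/n on each point, the unbiased correction for sampling index i from q is 1/(n q_i):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: f(x_i; w) = 0.5 * (x_i^T w - y_i)^2
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=n)

# Fixed non-uniform sampling distribution q over the n data points
q = np.abs(rng.normal(size=n)) + 0.1
q /= q.sum()

def rwgd(steps=30000, rho=0.005):
    w = np.zeros(d)
    for _ in range(steps):
        i = rng.choice(n, p=q)
        grad = (X[i] @ w - y[i]) * X[i]   # per-example gradient
        w -= rho * grad / (n * q[i])      # importance weight 1/(n q_i)
    return w

w_hat = rwgd()
print(np.round(w_hat, 2))
```

The correction keeps the estimator unbiased: E[grad_i / (n q_i)] equals the full-batch average gradient (1/n) Σ_i grad_i for any q with full support.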

In the context of linear regression with n data points (x_i, y_i), RWGD is instantiated by introducing random weights \omega_i^{(t)} into the squared loss at iteration t:

\ell_t(w) = \sum_{i=1}^n \omega_i^{(t)} (x_i^T w - y_i)^2.

This model encompasses both traditional reweighting (fixed or adaptive deterministic weights) and stochastic or heavy-tailed weighting distributions (Clara et al., 11 Dec 2025).
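A minimal NumPy sketch of this randomly weighted loss, assuming i.i.d. unit-mean weights redrawn at every iteration (the exponential weight distribution, step size, and all names are illustrative choices, not prescriptions from the cited paper):

```python
import numpy as np

rng = np.random.default_rng(1)

n, d = 100, 2
X = rng.normal(size=(n, d))
w_star = np.array([2.0, -1.0])
y = X @ w_star + 0.05 * rng.normal(size=n)

def rw_grad(w, omega):
    # gradient of ell_t(w) = sum_i omega_i * (x_i^T w - y_i)^2
    return 2 * X.T @ (omega * (X @ w - y))

eta = 0.001
w = np.zeros(d)
for _ in range(5000):
    omega = rng.exponential(scale=1.0, size=n)  # random weights, E[omega_i] = 1
    w -= eta * rw_grad(w, omega)
print(np.round(w, 2))
```

Because E[\omega_i] = 1, the expected update coincides with plain full-batch gradient descent; the weight randomness perturbs only the trajectory and the stationary variance.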

2. Variance Reduction and Adaptive Sampling

Variance of the stochastic gradient estimate is a fundamental bottleneck in SGD convergence. The AW-SGD algorithm (Bouchard et al., 2015) directly addresses this by jointly optimizing both the model parameters w and the sampling distribution parameters \tau:

  • The model parameters w are updated via an importance-weighted gradient.
  • The sampling distribution parameters \tau are updated via stochastic gradient descent on the gradient-variance objective:

J(w, \tau) = \mathbb{E}_{x \sim Q_\tau} \left\| \frac{\nabla_w f(x; w)}{q(x; \tau)} \right\|^2.

The stochastic gradient with respect to \tau leverages the log-derivative trick.

The iteration,

\begin{cases} w_{t+1} = w_t - \rho_t \dfrac{\nabla_w f(x_t; w_t)}{q(x_t; \tau_t)} \\ \tau_{t+1} = \tau_t + \eta_t \left\| \dfrac{\nabla_w f(x_t; w_t)}{q(x_t; \tau_t)} \right\|^2 \nabla_\tau \log q(x_t; \tau_t) \end{cases}

dynamically steers the sampler to concentrate on data points where the reweighted gradient norm is large, reducing estimator variance and empirically yielding constant-factor speedups (commonly 5×–10×) in convergence (Bouchard et al., 2015).
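A compact sketch of this coupled iteration for a discrete data index sampled from a softmax law (a hypothetical instantiation: the softmax sampler, step sizes, and noise-free data are illustrative assumptions, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(2)

n, d = 50, 2
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, 1.0])            # noise-free targets

def softmax(tau):
    z = np.exp(tau - tau.max())
    return z / z.sum()

w, tau = np.zeros(d), np.zeros(n)       # model and sampler parameters
rho, eta = 0.005, 1e-4
for _ in range(20000):
    q = softmax(tau)
    i = rng.choice(n, p=q)
    g = (X[i] @ w - y[i]) * X[i] / (n * q[i])   # importance-weighted gradient
    w -= rho * g
    # log-derivative trick for softmax: grad_tau log q(i; tau) = e_i - q
    score = -q.copy()
    score[i] += 1.0
    tau += eta * (g @ g) * score        # descend the variance objective J
print(np.round(w, 2))
```

The \tau-update pushes sampling mass toward indices whose reweighted gradient norm is large, which is exactly the steering behavior described above.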

3. Statistical and Optimization Properties under Random Weighting

In linear regression, the expected loss under random weights is characterized by the “mean-weight” matrix M_2 = \operatorname{diag}(\nu_1, \dots, \nu_n), where \nu_i = \mathbb{E}[(\omega_i^{(t)})^2]. For sufficiently small constant \eta, the first and second moment convergence rates are governed by the minimal singular value of X^T M_2 X and the noise properties of the weights:

  • The bias decays as

\|\mathbb{E}[\Delta_t]\|_2 \leq \exp(-\eta\,\sigma_{\min}(M)\,t)\,\|\Delta_0\|_2.

  • The variance achieves a steady state determined by the affine operator \mathcal{S}_\eta involving both M_2 and the covariance of the weights \Sigma_\omega (Clara et al., 11 Dec 2025).
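The geometric decay of the bias can be checked numerically by averaging the iterate error over many independent runs (a schematic Monte Carlo experiment with invented constants; noise-free targets are used so that the variance contribution vanishes at the optimum and the bias term is isolated):

```python
import numpy as np

rng = np.random.default_rng(3)

n, d = 50, 2
X = rng.normal(size=(n, d))
w_star = np.array([1.0, -1.0])
y = X @ w_star                      # noise-free, so Delta_t -> 0
eta, R, T = 0.004, 400, 200

mean_delta = np.zeros((T + 1, d))   # Monte Carlo estimate of E[Delta_t]
for _ in range(R):
    w = np.zeros(d)
    for t in range(T):
        mean_delta[t] += w - w_star
        omega = rng.uniform(0.5, 1.5, size=n)   # bounded weights, E[omega] = 1
        w = w - eta * 2 * X.T @ (omega * (X @ w - y))
    mean_delta[T] += w - w_star
mean_delta /= R

norms = np.linalg.norm(mean_delta, axis=1)
print(norms[0], norms[T])
```

The averaged error norm shrinks geometrically in t, consistent with the exponential bias bound above.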

In the context of nonparametric regression using neural networks with random restarts, the estimator

  • Repeats GD from I_n randomly initialized starting points,
  • Selects the final model that minimizes the penalized empirical L_2 risk,
  • Achieves (up to logarithmic factors) the optimal minimax rate n^{-2p/(2p+1)} under projection-pursuit model assumptions, independent of the ambient dimension d (Braun et al., 2019).
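The restart-and-select principle can be sketched with a tiny one-hidden-layer network in NumPy (a schematic example: the target function, architecture, and hyperparameters are invented, and for brevity the selection uses the plain empirical risk rather than the penalized risk of the cited estimator):

```python
import numpy as np

rng = np.random.default_rng(4)

# 1-D regression data from a smooth target
n = 80
x = rng.uniform(-1, 1, size=n)
y = np.sin(2 * x) + 0.05 * rng.normal(size=n)

def fit_once(seed, hidden=8, steps=5000, lr=0.1):
    # Full-batch GD from one random initialization; returns empirical risk
    r = np.random.default_rng(seed)
    W1, b1 = r.normal(size=hidden), r.normal(size=hidden)
    W2 = 0.1 * r.normal(size=hidden)
    for _ in range(steps):
        h = np.tanh(np.outer(x, W1) + b1)       # (n, hidden) features
        err = h @ W2 - y
        gW2 = h.T @ err / n
        gh = np.outer(err, W2) * (1 - h ** 2)   # backprop through tanh
        W1 -= lr * (x @ gh) / n
        b1 -= lr * gh.sum(axis=0) / n
        W2 -= lr * gW2
    pred = np.tanh(np.outer(x, W1) + b1) @ W2
    return np.mean((pred - y) ** 2)

# I_n random restarts; keep the fit with the smallest empirical L_2 risk
risks = [fit_once(s) for s in range(5)]
print(min(risks))
```

Taking the minimum over restarts guards against individual runs stuck in poor local minima, which is the role the random restarts play in the estimator above.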

4. Parameterization, Weighting Schemes, and Trade-offs

Discrete settings often employ a softmax parameterization for the sampling law:

q(i; \tau) = \frac{\exp(\tau_i)}{\sum_{j=1}^n \exp(\tau_j)}.

For binary-class imbalance, a sigmoid parameterization is used, and for matrix factorization, marginal row and column distributions are factorized by softmax (Bouchard et al., 2015). In linear models, example weighting can be uniform, importance-based (e.g., proportional to \|x_i\|^2), or drawn from continuous or heavy-tailed distributions. These choices induce different optimization dynamics:

  • Importance sampling can improve conditioning and speed up convergence.
  • Heavy-tailed weights maintain the same M_2 but inflate the stationary variance.
  • Bias-variance trade-offs arise: aggressive down-weighting of noisy data can deteriorate statistical error, even as optimization converges rapidly (Clara et al., 11 Dec 2025).
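The softmax parameterization pairs naturally with the log-derivative trick from Section 2, since \nabla_\tau \log q(i; \tau) = e_i - q has a closed form. A quick NumPy check of this identity against a finite difference (illustrative only):

```python
import numpy as np

def softmax(tau):
    z = np.exp(tau - np.max(tau))   # shift for numerical stability
    return z / z.sum()

tau = np.array([0.0, 1.0, -0.5, 2.0])
q = softmax(tau)

# Closed-form score: grad_tau log q(i; tau) = e_i - q
i = 3
score = -q.copy()
score[i] += 1.0

# Finite-difference check of coordinate 0
eps = 1e-6
tau_pert = tau.copy()
tau_pert[0] += eps
fd = (np.log(softmax(tau_pert)[i]) - np.log(q[i])) / eps
print(score[0], fd)
```

The closed-form score and the finite difference agree, which is what makes the \tau-update in AW-SGD cheap to compute for the softmax family.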

5. Applications and Empirical Evidence

RWGD and its variants address a range of modern large-scale learning problems:

  • Image classification: Hard-negative mining via adaptive sampling on large datasets, exploiting label-dependence (Bouchard et al., 2015).
  • Matrix factorization: Non-uniform sampling strategies for rows and columns, yielding computational savings (Bouchard et al., 2015).
  • Reinforcement learning: Off-policy policy-gradient methods, with joint learning of exploration and exploitation policies (Bouchard et al., 2015).
  • Neural network regression: Projection pursuit models are fit via repeated random restarts and model selection, achieving sharp minimax rates and outperforming thin-plate splines and k-nearest neighbors in simulated benchmarks (Braun et al., 2019).

Empirical results across these domains demonstrate substantially reduced training error and improved efficiency compared to uniform sampling or traditional single-path GD.

6. Convergence, Stationary Behavior, and Long-term Dynamics

RWGD algorithms enjoy non-asymptotic error bounds under reasonable step-size regimes. Geometric moment contraction analysis shows that, under bounded support of the weights and constant step-size, there exists a unique stationary law for the iterate distribution, independent of initialization (Clara et al., 11 Dec 2025):

\mathcal{W}_q(\operatorname{Law}(w_t), \pi_\infty) \leq C \exp\left(-\frac{\eta\,(2 - \eta\tau^2 \|X^T X\|)\,\sigma_{\min}(M)}{q}\, t\right),

with explicit formulas for stationary covariance in the linear case. In practice, the stationary error is determined by the combination of sampling-induced bias, noise covariance, and the variance structure of the weight distribution.
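The contraction can be visualized with a synchronous-coupling experiment: two runs started far apart but driven by the same random weight sequence collapse onto the same trajectory, while the spread across independent replicates reflects the nondegenerate stationary law (a schematic NumPy experiment; all constants are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

n, d = 40, 2
X = rng.normal(size=(n, d))
y = X @ np.ones(d) + 0.1 * rng.normal(size=n)
eta, T, R = 0.005, 500, 200

def final_iterates(w0):
    # Fixed inner seed: replicate r sees the same weight sequence for
    # both starting points (synchronous coupling across the two calls)
    r2 = np.random.default_rng(6)
    out = np.zeros((R, d))
    for r in range(R):
        w = w0.copy()
        for _ in range(T):
            omega = r2.uniform(0.5, 1.5, size=n)  # bounded random weights
            w = w - eta * 2 * X.T @ (omega * (X @ w - y))
        out[r] = w
    return out

A = final_iterates(np.zeros(d))
B = final_iterates(10.0 * np.ones(d))
gap = float(np.abs(A.mean(axis=0) - B.mean(axis=0)).max())
print(gap, A.std(axis=0))
```

The gap between the two empirical means is negligible despite the very different initializations, while the per-coordinate spread across replicates stays strictly positive: the iterate distribution forgets its starting point and settles into a unique stationary law.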

7. Limitations and Open Directions

While RWGD algorithms offer significant variance reduction and improved conditioning, the statistical risk may be adversely affected by skewed weight distributions or poor adaptation. The optimal balance between fast optimization and robust generalization depends on both the spectrum of M_2 and its interaction with data noise. Extensions to control variates, hardware-aware sampling, and more general model classes remain active areas of research (Bouchard et al., 2015). Moreover, rigorously characterizing the trade-off between fast convergence and asymptotic prediction accuracy in overparameterized settings is an open problem (Clara et al., 11 Dec 2025).
