Memoryless Gradient Descent

Updated 2 August 2025
  • Memoryless gradient descent is a class of optimization algorithms that update parameters solely using the current gradient without relying on past iterates or momentum.
  • Adaptive learning rate techniques and smoothing methods, such as Laplacian smoothing, enhance convergence speed and improve noise robustness.
  • Extensions to decentralized and geometric settings demonstrate the scalability and practical efficiency of memoryless methods in modern high-dimensional optimization.

Memoryless gradient descent refers to a class of optimization algorithms in which each update step relies exclusively on the current iterate and the value of the gradient or an immediate surrogate. These methods do not exploit explicit momentum, iterate averaging, accumulated history, or any auxiliary state except possibly static coefficients or schedules. As such, the search direction and step size are determined strictly by information available at the current point, rendering the method Markovian. Numerous theoretical and practical developments leverage this paradigm to reduce gradient variance, adapt step sizes, improve noise robustness, and accommodate geometric generalizations. This article surveys the core concepts, algorithmic formulations, theoretical guarantees, and representative applications of memoryless gradient descent.

1. Algorithmic Principles and Markovian Update Structure

Memoryless gradient descent (MGD), in its canonical and most general form, generates the next parameter vector $w^{(t+1)}$ according to an update rule that depends only on the current parameters and a function of the gradient at that point:

$$w^{(t+1)} = w^{(t)} - \eta^{(t)}\, D^{(t)}\, \nabla L(w^{(t)})$$

where:

  • $L: \mathbb{R}^d \to \mathbb{R}$ is the objective (loss) function,
  • $\eta^{(t)}$ is a (possibly adaptive or scheduled) step size,
  • $D^{(t)}$ is a symmetric positive-definite matrix or a scalar (possibly the identity), which may be fixed or vary by iteration but must be determined without using the history of gradients or iterates beyond $w^{(t)}$.

The “memoryless” property is formalized by the Markovian mapping

$$\mathbb{P}(w^{(t+1)} \mid w^{(t)}, w^{(t-1)}, \ldots, w^{(0)}) = \mathbb{P}(w^{(t+1)} \mid w^{(t)}).$$

This structure is preserved under various modifications such as deterministic schedules for $\eta^{(t)}$ or $D^{(t)}$, per-iteration adaptive choices based on current gradient magnitudes, and randomized perturbations that are independent across steps. In particular, methods incorporating momentum, Nesterov acceleration, or moving averages are explicitly excluded from this category.
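
The Markovian constraint translates directly into code: the only state passed between iterations is the current iterate. The minimal sketch below (plain Python with NumPy; function names and constants are illustrative, not from any cited paper) makes this explicit, with an optional fixed preconditioner playing the role of $D^{(t)}$.

```python
import numpy as np

def mgd_step(w, grad_fn, eta, D=None):
    """One memoryless update: only the current iterate w and its gradient are used.

    D plays the role of D^(t): a symmetric positive-definite matrix (or None for
    the identity), chosen without reference to past gradients or iterates.
    """
    g = grad_fn(w)
    return w - eta * (g if D is None else D @ g)

# Toy usage on L(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([3.0, -2.0])
for t in range(100):
    w = mgd_step(w, grad_fn=lambda v: v, eta=0.1)  # no history carried between steps
```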

2. Adaptive Learning Rate Strategies

Conventional memoryless gradient descent employs a fixed learning rate, but several enhancements aim to optimize the step size per iteration by adapting to the local geometry or observed progress:

  • Online Learning Rate Search (Ravaut et al., 2018): The update law augments the standard memoryless update with an inner optimization of $\eta^{(t)}$ at every iteration. Two principal strategies are used:

    • First-order update (gradient descent on the learning rate):

    $$\eta^{(t+1)} = \eta^{(t)} - \alpha\, \frac{\partial f}{\partial \eta}(\eta^{(t)})$$

    where $f(\eta) = L(w^{(t)} - \eta \nabla L(w^{(t)}))$ and $\alpha$ is a meta-learning rate. The critical derivative at each step satisfies

    $$f'(\eta^{(t)}) = -\nabla L(w^{(t)})^T \nabla L(w^{(t+1)})$$

    • Second-order update (Newton’s method on the learning rate):

    $$\eta^{(t+1)} = \eta^{(t)} - \frac{f'(\eta^{(t)})}{f''(\eta^{(t)})}$$

    The second derivative $f''(\eta)$ involves the Hessian, which is approximated by finite differences. For example,

    $$f'(\eta) \approx \frac{f(\eta+\epsilon) - f(\eta-\epsilon)}{2\epsilon}, \qquad f''(\eta) \approx \frac{f(\eta + 2\epsilon) + f(\eta - 2\epsilon) - 2f(\eta)}{4\epsilon^2}$$

    Five forward passes per iteration are required in the second-order approach.

Compared to a fixed learning rate, these procedures accelerate early convergence and adapt to local variations in the loss landscape, often attaining slightly improved accuracy, at the cost of per-iteration overhead.
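
As an illustration, the first-order variant fits in a few lines. The sketch below (NumPy, with illustrative constants; the small positivity clip on $\eta$ is our own safeguard, not part of the cited procedure) takes one gradient step on the loss and one gradient step on the learning rate per iteration.

```python
import numpy as np

def gd_with_online_lr(w, grad_fn, eta=0.05, alpha=1e-3, steps=200):
    """Memoryless GD with a first-order online learning-rate update.

    Each iteration takes the usual step w_next = w - eta * grad L(w), then updates
    eta by gradient descent on f(eta) = L(w - eta * grad L(w)), using
    f'(eta) = -grad L(w)^T grad L(w_next).
    """
    for _ in range(steps):
        g = grad_fn(w)
        w_next = w - eta * g
        f_prime = -(g @ grad_fn(w_next))          # derivative of the lookahead loss in eta
        eta = max(eta - alpha * f_prime, 1e-8)    # illustrative safeguard: keep eta > 0
        w = w_next
    return w, eta

# Example on the quadratic L(w) = 0.5 * w^T A w, whose gradient is A w.
A = np.diag([1.0, 5.0])
w_final, eta_final = gd_with_online_lr(np.array([2.0, 1.0]), lambda v: A @ v)
```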

3. Smoothing, Preconditioning, and Variance Reduction

A major challenge for memoryless methods in stochastic settings is the variance in the stochastic gradients. Structural modifications can mitigate this:

  • Laplacian Smoothing Gradient Descent (LSGD) (Osher et al., 2018): Rather than using the raw gradient, LSGD pre-multiplies it by the inverse of a circulant, discrete Laplacian-based matrix $A_\sigma$:

$$w^{(k+1)} = w^{(k)} - \eta\, A_\sigma^{-1} \nabla f(w^{(k)})$$

Here $A_\sigma = I + 2\sigma I - \sigma S - \sigma S^{\top}$, with $S$ a shift operator under periodic boundary conditions. This operator acts as a low-pass filter, suppressing high-frequency noise. The smoothing preserves the update’s memoryless nature, and in the case of stochastic or minibatch gradients, it provably and empirically reduces variance, enabling step sizes larger than those typically stable for standard memoryless SGD.
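
Because $A_\sigma$ is circulant, the solve $A_\sigma^{-1} \nabla f$ costs $O(d \log d)$ via the FFT rather than a dense linear solve. Below is a minimal NumPy sketch of this step, assuming the periodic one-dimensional form of $A_\sigma$ given above (function names are illustrative).

```python
import numpy as np

def laplacian_smooth(g, sigma):
    """Apply A_sigma^{-1} to a gradient vector g (assumes g.size >= 3), where
    A_sigma = (1 + 2*sigma) I - sigma S - sigma S^T under periodic boundaries.

    A_sigma is circulant, so its eigenvalues are the DFT of its first column and
    the solve reduces to an elementwise division in frequency space.
    """
    d = g.size
    col = np.zeros(d)
    col[0] = 1.0 + 2.0 * sigma
    col[1] = -sigma
    col[-1] = -sigma
    eig = np.fft.fft(col)                 # eigenvalues are all >= 1, so division is safe
    return np.real(np.fft.ifft(np.fft.fft(g) / eig))

def lsgd_step(w, grad_fn, eta, sigma=1.0):
    # Still memoryless: only the current iterate and its smoothed gradient appear.
    return w - eta * laplacian_smooth(grad_fn(w), sigma)
```

With $\sigma = 0$ the smoothing matrix reduces to the identity and the step is ordinary memoryless (S)GD.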

  • Theoretical Insights via PDEs: The implicit update can be interpreted as a discretization of gradient flow on a “smoothed” or “more convex” surrogate, as specified by a Hamilton–Jacobi PDE:

$$u_t + \frac{1}{2} \langle \nabla_w u,\ A_\sigma^{-1} \nabla_w u \rangle = 0, \qquad u(w,0) = f(w)$$

The viscosity solution $u(w,t)$ suppresses sharp minima and reduces the effective Lipschitz constant, further improving stability and generalization in practice.

4. Effects of Noise and Stochasticity

Memoryless gradient descent remains Markovian even when random perturbations are introduced at each step (Cooper, 2018):

$$p_{t+1} = p_t - \tau\, \nabla L(p_t) - \epsilon_t$$

where $\epsilon_t$ is a zero-mean, i.i.d. Gaussian perturbation. Studies in one and higher dimensions show that:

  • Noise can bias the terminal point distribution heavily toward wider or deeper minima, as shallow or narrow basins are more easily escaped.
  • Occupation probabilities of the different minima exhibit sharp transitions as the noise magnitude and step size vary, demonstrating the criticality of these parameters.
  • In higher codimension or multidimensional settings, moderate noise biases the outcome robustly toward deep minima, and even facilitates escape from saddle points.
  • Despite the stochasticity, the process remains memoryless in the sense that only the current state (and fresh noise sample) determines the next move.
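
The basin-selection effect is easy to reproduce in a toy one-dimensional experiment in the spirit of these studies; the tilted double-well objective, the constants, and the basin threshold below are illustrative choices, not taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tilted double well L(x) = (x^2 - 1)^2 + 0.3 x: a deeper minimum near x = -1,
# a shallower one near x = +1, separated by a barrier near x = 0.
def grad_L(x):
    return 4.0 * x * (x ** 2 - 1.0) + 0.3

def noisy_gd(x0, tau=0.01, noise=0.07, steps=5000):
    x = x0
    for _ in range(steps):
        x = x - tau * grad_L(x) - noise * rng.normal()   # fresh i.i.d. noise each step
    return x

# Fraction of random starts that terminate in the deeper (left) basin.
finals = np.array([noisy_gd(rng.uniform(-2.0, 2.0)) for _ in range(200)])
print("deep-basin occupation:", np.mean(finals < 0.0))
```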

5. Geometric Extensions and Memoryless Quasi-Newton Methods

Memoryless concepts extend naturally to constrained or geometric optimization:

  • Modified Memoryless Spectral-Scaling Broyden Family on Riemannian Manifolds (Sakai et al., 2023): In the Riemannian context, memoryless quasi-Newton methods avoid storing or updating Hessian approximations, instead using search directions of the form

$$\eta_k = -\gamma_{k-1}\, g_k + \text{(curvature corrections)}$$

where $g_k$ is the Riemannian gradient, $\gamma_{k-1}$ a scaling parameter, and the extra terms involve only transported gradient and displacement information from the previous step, without accumulating history.

    • When the corrections are disabled (e.g., setting correction parameters to zero), the method reduces to pure memoryless gradient descent on the manifold, with descent/convergence properties established under Riemannian Wolfe conditions.
    • Empirically, including lightweight corrections while maintaining the memoryless spirit (i.e., using only the most recent information) achieves performance superior to naive gradient descent, especially in ill-conditioned or constrained geometries; a Euclidean sketch of such a direction appears after this list.

  • Decentralized Memoryless BFGS (Wang et al., 11 Sep 2024): For distributed strongly convex optimization, “memoryless” BFGS techniques approximate curvature using only local, instantaneous differences between tracked gradients and iterates. No history is retained except the immediately preceding variables, and only current neighbors’ information is required in the distributed setting. The method attains linear convergence analogous to centralized first-order schemes and achieves reduced communication and storage overhead relative to full-matrix or limited-memory alternatives.
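
For intuition, the sketch below writes out the Euclidean analogue of a memoryless (spectral-scaled) BFGS direction: the BFGS update of a scaled identity is applied directly to the current gradient using only the most recent displacement and gradient-difference pair, with no stored matrix. This is a simplified stand-in for the Riemannian and decentralized methods above, which additionally require retractions, vector transport, or gradient tracking; the fallback rule and scaling choice here are illustrative.

```python
import numpy as np

def memoryless_bfgs_direction(g, s, y, gamma=None):
    """Euclidean memoryless BFGS search direction.

    g: current gradient; s = x_k - x_{k-1}; y = g_k - g_{k-1}.
    The BFGS update of gamma*I is applied directly to g, so no matrix is stored.
    If the curvature pair is unusable, the direction degenerates to plain
    steepest descent, i.e., memoryless gradient descent.
    """
    sy = s @ y
    if sy <= 1e-12:
        return -g                                  # fall back to gradient descent
    if gamma is None:
        gamma = sy / (y @ y)                       # spectral scaling of the identity
    rho = 1.0 / sy
    # H = gamma*I - gamma*rho*(s y^T + y s^T) + (rho + gamma*rho^2 * y^T y) s s^T
    Hg = (gamma * g
          - gamma * rho * (s * (y @ g) + y * (s @ g))
          + (rho + gamma * rho ** 2 * (y @ y)) * (s @ g) * s)
    return -Hg
```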

6. Statistical Inference and Memory in High Dimensions

A foundational question is how the memoryless character manifests in the dynamics and inference properties of gradient descent, especially in high-dimensional, mean-field regimes (Han et al., 12 Dec 2024).

  • In the mean-field setting (samples proportional to dimension), the full sequence of GD iterates exhibits intricate non-Markovian dependencies. However, by employing a non-asymptotic state evolution framework that tracks “Onsager correction matrices,” it is possible to debias the iterates so that the resulting objects are effectively memoryless, i.e., approximately independent and Gaussian in law. This decoupling facilitates principled confidence estimation and generalization error prediction for the output of memoryless gradient descent.

7. Extensions, Variants, and Implications

Several associated directions and implications arise within the framework of memoryless gradient descent:

  • Blind Descent (Gupta et al., 2020): An extreme memoryless paradigm eliminates gradients entirely. Proposed updates are random, accepted only if they immediately reduce the loss. While less efficient, such methods are provably memoryless and demonstrate the existence of viable (albeit less optimal) memoryless search strategies that bypass even gradient computation.
  • Occam Gradient Descent and Model Compression (Kausik, 30 May 2024): While not devised explicitly as memoryless, Occam Gradient Descent’s alternating gradient update and weight pruning can be viewed as an adaptive mechanism where the “memory” of now-unnecessary parameters is eliminated, yielding progressively simpler models without historical state tracking. This approach outperforms traditional gradient descent with or without post-hoc pruning in empirical studies.
  • Diminishing Step-Size Variants Under Local Smoothness (Patel et al., 2022): Pre-scheduled, diminishing step sizes (rather than adaptive or history-dependent ones) preserve the memoryless property while improving theoretical convergence guarantees under merely local (not global) Lipschitz continuity of the gradient. This is crucial for applications to deep learning and nonconvex loss functions, where standard assumptions do not hold.
  • Zeroth-Order Methods (Bai et al., 2019): In the absence of oracle gradients, memoryless behavior can be preserved by constructing stochastic gradient estimates from function evaluations alone (e.g., via Gaussian smoothing) and using only the current iterate's information to update. Perturbations are introduced only when certain local conditions are met (e.g., proximity to saddle points), with no accumulation of additional memory; a sketch of such an estimator follows this list.
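
Below is a minimal sketch of such a zeroth-order, memoryless step, using the standard two-point Gaussian-smoothing estimator (constants and sample counts are illustrative; the saddle-escape perturbation logic of the cited method is omitted).

```python
import numpy as np

rng = np.random.default_rng(0)

def zo_gradient(f, x, mu=1e-4, num_samples=10):
    """Two-point Gaussian-smoothing gradient estimate from function values only:
    average over u ~ N(0, I) of (f(x + mu*u) - f(x)) / mu * u.
    Only the current point x is used; nothing is accumulated across calls."""
    fx = f(x)
    g = np.zeros_like(x)
    for _ in range(num_samples):
        u = rng.standard_normal(x.size)
        g += (f(x + mu * u) - fx) / mu * u
    return g / num_samples

def zo_gd(f, x0, eta=0.1, steps=200):
    x = x0
    for _ in range(steps):
        x = x - eta * zo_gradient(f, x)   # memoryless: each step uses a fresh estimate
    return x

# Example: minimize a quadratic without ever computing its gradient analytically.
x_min = zo_gd(lambda v: 0.5 * np.sum(v ** 2), np.array([2.0, -1.0, 0.5]))
```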

Summary Table: Algorithmic Variants of Memoryless Gradient Descent

| Class/Variant | Key Mechanism | Preserves Memorylessness? |
|---|---|---|
| Classic (Fixed LR) | $w^{(t+1)} = w^{(t)} - \eta \nabla L$ | Yes |
| Adaptive LR (Ravaut et al., 2018) | Online step-size, immediate lookahead | Yes |
| Laplacian Smoothed GD (Osher et al., 2018) | $A_\sigma^{-1}$ preconditioning | Yes |
| Noisy/Randomized (Cooper, 2018) | I.i.d. noise or perturbations | Yes |
| Blind Descent (Gupta et al., 2020) | Random, local descent-only acceptance | Yes (no gradient) |
| Decentralized Memoryless BFGS (Wang et al., 11 Sep 2024) | Local, immediate gradient tracking | Yes |
| Diminishing Step-size (Patel et al., 2022) | Pre-scheduled, spatially uniform decay | Yes |
| L-BFGS, Momentum, Accumulated Grad | Uses history of iterates/gradients | No |

Concluding Remarks

Memoryless gradient descent and its variants occupy a foundational position in optimization, balancing computational simplicity, theoretical analyzability, and practical robustness. Methods in this class are widely used in high-dimensional machine learning, distributed optimization, and geometric settings, and continue to be refined by advances in step adaptation, variance reduction, statistical inference, and architectural compression. The memoryless principle ensures that algorithms remain lightweight, parallelizable, and interpretable while retaining competitive convergence and generalization properties across a range of challenging regimes.