Polyak Stepsize (AdaSPS) Optimization
- Polyak Stepsize (AdaSPS) is a parameter-free adaptation method that automatically tunes the learning rate using local suboptimality and gradient estimates.
- It employs strategies like twin-model and dynamic level adjustment to estimate unknown optima and stabilize convergence in noisy or nonconvex regimes.
- The method extends to reinforcement learning by adapting step-sizes in policy gradient methods, ensuring robust performance with minimal hyperparameter tuning.
The Polyak Stepsize (often referred to as SPS or AdaSPS in adaptive variants) is a class of parameter-free or nearly parameter-free step-size adaptation schemes for first-order stochastic optimization, designed to automatically tune the learning rate based on the local suboptimality and gradient information. Its modern adaptive instantiations, particularly those under the "AdaSPS" label, provide robust step-size schedules for stochastic (and sometimes non-convex or non-interpolating) regimes and have recently been extended to policy gradient methods in reinforcement learning and other complex optimization landscapes.
1. Formal Derivation and Algorithmic Structure
The classical Polyak step-size, originally introduced for deterministic convex minimization, selects the step at iteration $k$ as
$$\gamma_k = \frac{f(x_k) - f^*}{\|\nabla f(x_k)\|^2},$$
where $f^*$ is the optimal value. In stochastic optimization, only noisy or sample-based approximations are accessible, and $f^*$ is typically unknown. The extension to stochastic settings replaces $f$ and $\nabla f$ with their sample (or mini-batch) counterparts, yielding the core step
$$\gamma_k = \frac{f_{i_k}(x_k) - f_{i_k}^*}{\|\nabla f_{i_k}(x_k)\|^2},$$
where $f_{i_k}^* = \min_x f_{i_k}(x)$, and $i_k$ is a randomly sampled data index or trajectory batch.
However, the unknown $f^*$ (or the per-sample $f_{i_k}^*$) must be replaced, estimated, or lower-bounded, and various capping and regularization strategies are employed to avoid numerical instability or divergence. Modern AdaSPS schemes introduce "slack" variables, twin-model pessimistic bounds, or accumulated surrogate values to address this challenge.
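Under the interpolation assumption (each per-sample optimum is zero), the capped stochastic Polyak step can be sketched as follows. This is a minimal sketch; the names, the cap `gamma_max`, and the `eps` guard are illustrative choices rather than values fixed by the method:

```python
import numpy as np

def sps_step(x, f_i, grad_i, f_i_star=0.0, gamma_max=1.0, eps=1e-12):
    """One capped stochastic Polyak step on a sampled component f_i.

    f_i_star is the per-sample optimal value (0 under interpolation);
    gamma_max caps the step when the gradient is small, and eps guards
    against division by zero.
    """
    g = grad_i(x)
    gamma = min(gamma_max, (f_i(x) - f_i_star) / (float(g @ g) + eps))
    return x - gamma * g

# Toy least-squares component f_i(x) = 0.5 * (a.x - b)^2 with optimal value 0.
a, b = np.array([1.0, 2.0]), 3.0
f_i = lambda x: 0.5 * (a @ x - b) ** 2
grad_i = lambda x: (a @ x - b) * a

x = np.zeros(2)
for _ in range(50):
    x = sps_step(x, f_i, grad_i)
```

On this single component the residual contracts geometrically; in the stochastic setting a fresh index $i_k$ would be sampled at every iteration.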
In policy gradient reinforcement learning, the objective is maximization,
$$\max_\theta J(\theta), \qquad J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \gamma^t r_t\Big] + \lambda\,\mathcal{H}(\pi_\theta),$$
where $J(\theta)$ is the expected total reward, $\theta$ parametrizes the policy, and $\mathcal{H}$ is an entropy regularizer with weight $\lambda$. The adapted Polyak-AdaSPS step in this RL setting becomes
$$\eta_k = \min\left(\eta_{\max},\; c\,\frac{V_k^* - J(\theta_k)}{\|\nabla J(\theta_k)\|^2}\right),$$
where $V_k^*$ is a pessimistic estimate of the optimal value at iteration $k$, $c$ is a scaling constant, and $\eta_{\max}$ is a hard upper cap on the step-size (Li et al., 11 Apr 2024).
2. Practical Estimation of the Optimum: Twin-Model and Level Adjustment Strategies
A central difficulty in Polyak-based methods is the unknown optimum value $f^*$ (or $V^*$ in the RL setting). Multiple strategies have emerged:
- Twin-Model Approach: Two models are maintained with parameters $\theta_1, \theta_2$. At each iteration, fresh trajectory batches are collected for both, their empirical objective values $\hat J(\theta_1), \hat J(\theta_2)$ are computed, and $V^*$ is set to the higher of the two. Only the "worse" model is updated, ensuring that the $V^*$ used always lies above the current value of the model being updated. This pessimistic, leapfrogging update prevents aggressive steps and supports stable convergence, even with noisy objective estimates (Li et al., 11 Apr 2024).
- Dynamic Level Adjustment: In classical convex settings, level-value sequences are updated via decision-guided tests (e.g., the Polyak Stepsize Violation Detector, PSVD). This ensures that the step-size computation never uses a level value exceeding the true optimum, offering theoretically justified convergence properties. These ideas can be extended into the stochastic Polyak/AdaSPS framework by periodically testing for violation of descent conditions and raising the level as needed (Liu et al., 2023).
- Accumulated Gap/Slack Methods: Scalar slack terms or running averages of per-iteration suboptimality can modulate the numerator or adjust the stepsize adaptively, bridging the gap between aggressive updates when far from optimality and conservative steps as the iterates approach stationarity (Gower et al., 2022, Horváth et al., 2022).
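The twin-model rule above can be sketched in a few lines. This is a minimal deterministic sketch with illustrative names; real implementations use noisy Monte Carlo returns, and it is precisely that noise which lets the twins leapfrog — with exact objective values, as below, the worse twin only closes the gap to the better one:

```python
import numpy as np

def twin_polyak_step(theta, J, gradJ, eta_max=1.0, eps=1e-12):
    """One twin-model Polyak ascent step (sketch).

    theta has shape (2, d), one row per twin. The higher objective value
    serves as the pessimistic surrogate V*; only the worse twin is updated.
    """
    j = np.array([J(theta[0]), J(theta[1])])
    v_star, worse = j.max(), int(j.argmin())
    g = gradJ(theta[worse])
    eta = min(eta_max, (v_star - j[worse]) / (float(g @ g) + eps))
    theta[worse] = theta[worse] + eta * g  # ascent on the worse twin only
    return theta

# Toy concave objective J(x) = -||x - 1||^2, maximized at x = 1.
J = lambda x: -float(np.sum((x - 1.0) ** 2))
gradJ = lambda x: -2.0 * (x - 1.0)
theta = np.array([[0.0, 0.0], [0.5, 0.5]])
for _ in range(50):
    theta = twin_polyak_step(theta, J, gradJ)
```

As the gap between the twins closes, the Polyak step shrinks toward zero — the "freezing" behavior described later in the empirical findings.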
3. Convergence Properties and Regimes
Convergence guarantees for Polyak/AdaSPS rules depend critically on convexity, interpolation, regularity, and the method of handling $f^*$. In classical convex or strongly convex and smooth settings with exact $f^*$ (or correct batch-level $f_{i_k}^*$), SPS and AdaSPS enjoy linear convergence rates of the form $\mathbb{E}\|x^k - x^*\|^2 \leq (1 - \mu \gamma_{\min})^k \|x^0 - x^*\|^2$ with $\gamma_{\min} \geq 1/L$, where $\mu$ is the strong-convexity constant and $L$ the smoothness constant (Horváth et al., 2022). If only a lower bound on $f^*$ is available, the method converges to a neighborhood of $x^*$ whose radius encodes the bias in the lower bound (Horváth et al., 2022, Jiang et al., 2023).
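The linear rate can be illustrated numerically (an illustration, not a proof): with the exact optimal value $f^* = 0$ available, the Polyak step on a strongly convex quadratic contracts the distance to the minimizer geometrically. The problem instance below is an arbitrary choice:

```python
import numpy as np

A = np.diag([1.0, 4.0])               # mu = 1 (strong convexity), L = 4 (smoothness)
f = lambda x: 0.5 * float(x @ A @ x)  # minimized at x* = 0 with f* = 0
grad = lambda x: A @ x

x = np.array([5.0, -3.0])
dists = [np.linalg.norm(x)]           # track ||x_k - x*||
for _ in range(100):
    g = grad(x)
    gamma = f(x) / float(g @ g)       # exact Polyak step (f* = 0 is known here)
    x = x - gamma * g
    dists.append(np.linalg.norm(x))
```

The recorded distances decrease monotonically at a geometric rate, consistent with the $(1 - \mu \gamma_{\min})^k$ contraction.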
In the stochastic and non-interpolating regime, as in deep learning or RL, the step-size adaptation mechanisms (cumulative gap normalization, capping, pessimistic level selection) are essential for maintaining sublinear convergence of the objective gap or of stationarity measures (e.g., the gradient norm), as established for AdaSPS on both convex and nonconvex problems (Wu, 25 Nov 2025).
The RL Polyak-AdaSPS framework (Li et al., 11 Apr 2024) transfers these principles but relies on empirical regularization rather than theoretical global convergence, as standard RL objectives are non-concave with unbounded variance.
4. Practical Considerations, Pseudocode, and Algorithmic Safeguards
The following table gives a high-level pseudocode outline for Polyak-AdaSPS as in policy gradient RL (Li et al., 11 Apr 2024):
| Step | Description |
|---|---|
| Sample trajectories | Draw independent batches for each twin parameter $\theta_1$, $\theta_2$ |
| Evaluate Monte Carlo objectives | Compute $\hat J(\theta_1)$, $\hat J(\theta_2)$ on the respective batches |
| Select $V^*$ and update candidate | Set $V^* = \max(\hat J(\theta_1), \hat J(\theta_2))$; mark the worse parameter for update |
| Compute GPOMDP gradient | Estimate $\hat g \approx \nabla J$ on the worse policy and its batch |
| Step-size selection | $\eta = \min\big(\eta_{\max},\; c\,(V^* - \hat J)/\|\hat g\|^2\big)$ |
| Parameter update | $\theta \leftarrow \theta + \eta\,\hat g$ |
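The last two table rows reduce to a few lines of code. The helper below is a hypothetical sketch (the cap and gap values are illustrative) showing how the $\eta_{\max}$ cap engages when the gradient is small but the estimated gap is large:

```python
import numpy as np

def capped_polyak_update(theta, g, v_star, j_theta, eta_max=1.0, eps=1e-12):
    """Capped Polyak step-size selection and gradient-ascent update (sketch)."""
    eta = min(eta_max, (v_star - j_theta) / (float(g @ g) + eps))
    return theta + eta * g, eta

theta = np.zeros(3)
g = np.array([0.1, 0.0, 0.0])  # small gradient...
theta_new, eta = capped_polyak_update(theta, g, v_star=5.0, j_theta=0.0)
# ...with a large estimated gap: the raw Polyak step would be ~500,
# so the hard cap eta_max = 1.0 takes over instead.
```

Without the cap, a near-stationary policy paired with an optimistic $V^*$ would produce an arbitrarily large step.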
Practical elements:
- Upper capping ($\eta_{\max}$) precludes divergence when gradients are small or the estimated gap is unexpectedly large.
- Entropy regularization ensures that suboptimal policies do not produce vanishing gradients, maintaining meaningful updates.
- Initialization of twin models with small distance mitigates rapid dominance of one model over the other, supporting stable leapfrogging.
Conservative update rules that use the better model's reward as $V^*$ avoid overestimation and overly aggressive steps when stochastic noise leads to transiently optimistic value estimates. Incorporation of momentum, per-coordinate preconditioning, and variance reduction techniques has been proposed in related AdaSPS literature but is not yet standard in deep RL implementations (Wang et al., 2023, Abdukhakimov et al., 2023).
5. Variants, Extensions, and Relationships to Other Adaptive Schemes
AdaSPS and related Polyak-style adaptive strategies form a broader suite of adaptive, no-tune optimization schemes:
- Classical SPS/SPSₘₐₓ: Rely on exact or capped Polyak step-sizes and suffer from bias if $f^*$ is underestimated.
- Slack-based AdaSPS: Incorporate a dynamic slack variable to buffer aggressive updates and allow adaptation to a poorly estimated $f^*$ (Gower et al., 2022).
- Twin/Mini-batch AdaSPS: Use running-best function values or batch-minima to construct surrogates for step-size adaptation (Abdukhakimov et al., 24 Aug 2025).
- Level-adjusted/PSVD AdaSPS: Utilize periodic descent-violation checks and convex combinations to update levels, ensuring the step-size adaptation reacts to evidence of over-optimism (Liu et al., 2023).
- Preconditioned and Momentum AdaSPS: Combine Polyak denominator adaptation with Adam, AdaGrad, and momentum, improving robustness to scale and curvature (Wang et al., 2023, Abdukhakimov et al., 2023, Oikonomou et al., 6 Jun 2024).
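As a toy illustration of the momentum-coupled family, the sketch below combines a Polyak step-size with a heavy-ball buffer; the particular coupling of `beta` and `gamma` is illustrative, not the specific joint schedule derived in the cited works:

```python
import numpy as np

def polyak_momentum_step(x, m, f_x, g, f_star, beta=0.5, gamma_max=10.0):
    """Polyak step-size combined with heavy-ball momentum (sketch)."""
    gamma = min(gamma_max, (f_x - f_star) / (float(g @ g) + 1e-12))
    m = beta * m + gamma * g   # momentum buffer accumulates scaled gradients
    return x - m, m

# Demo on f(x) = 0.5 ||x||^2 with known f* = 0 (the gradient is x itself).
f = lambda x: 0.5 * float(x @ x)
x, m = np.array([3.0, 4.0]), np.zeros(2)
start = np.linalg.norm(x)
for _ in range(100):
    x, m = polyak_momentum_step(x, m, f(x), x, f_star=0.0)
```

On this quadratic the iterates spiral into the minimizer at a geometric rate; the Polyak ratio stays scale-invariant, so no manual learning-rate decay is needed.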
For RL, the Polyak-AdaSPS approach (Li et al., 11 Apr 2024) is the first demonstration of such leapfrogging, pessimistic estimation in policy-gradient methods, requiring only one additional forward/backward pass per iteration.
6. Empirical Performance and Comparative Findings
Extensive empirical results demonstrate that Polyak-AdaSPS and its extensions:
- Outperform fixed learning rate baselines (e.g., Adam with static or decayed step-sizes) in both speed and stability, especially in discrete-action RL benchmarks such as Acrobot, CartPole, and LunarLander.
- Achieve robust convergence trajectories that are insensitive to reasonable variations of the hyperparameters (the scaling constant $c$ and the step-size cap $\eta_{\max}$).
- Exhibit automatic and monotonic decay of the step-size as the policy approaches optimality, resulting in the policy naturally "freezing" rather than exhibiting post-convergence oscillation or divergence.
- Deliver tighter stability envelopes, with less variance across random seeds than Adam.
- Produce reward learning curves that plateau earlier and more smoothly, with the step-size decaying rapidly toward zero as optimality is approached.
Empirical results in batch supervised learning and deep nets (AdaSPS, MomAdaSPS) reinforce these RL findings, showing parameter-free adaptivity and competitive or superior convergence compared to hand-tuned optimizers across a range of problem classes (Horváth et al., 2022, Abdukhakimov et al., 24 Aug 2025, Oikonomou et al., 6 Jun 2024).
7. Limitations and Ongoing Research Directions
Challenges persist for Polyak-AdaSPS methods, notably:
- Lack of universal global convergence guarantees in non-concave settings (notably RL); existing theory is largely limited to convex or strongly convex cases or relies on strong interpolation assumptions.
- Sensitivity to poor initial guesses or high-variance environments, though the twin-model and slack strategies help moderate these issues.
- For stochastic, non-interpolating, or non-smooth regimes, the effective step-size may be forced excessively small, or the method may converge only to neighborhoods determined by estimation error or suboptimal level values (Liu et al., 2023, Orabona et al., 26 May 2025).
- The per-step computational cost is increased by the duplicate forward/backward passes, but this is typically offset by faster or more reliable convergence.
Future research is focusing on more principled online estimation or updating of optimal-level surrogates, integration of advanced variance reduction or curvature estimation, direct extensions to off-policy RL and nonconvex/structured domains, and rigorous bounds for non-concave scenarios (Wu, 25 Nov 2025).
References:
- (Li et al., 11 Apr 2024): Enhancing Policy Gradient with the Polyak Step-Size Adaption
- (Horváth et al., 2022): Adaptive Learning Rates for Faster Stochastic Gradient Methods
- (Liu et al., 2023): Accelerating Level-Value Adjustment for the Polyak Stepsize
- (Abdukhakimov et al., 24 Aug 2025): Polyak Stepsize: Estimating Optimal Functional Values Without Parameters or Prior Knowledge
- (Orabona et al., 26 May 2025): New Perspectives on the Polyak Stepsize: Surrogate Functions and Negative Results
- (Wang et al., 2023): Generalized Polyak Step Size for First Order Optimization with Momentum
- (Oikonomou et al., 6 Jun 2024): Stochastic Polyak Step-sizes and Momentum: Convergence Guarantees and Practical Performance.