
Adaptive Optimizer

Updated 18 March 2026
  • Adaptive optimizers are algorithms that dynamically adjust learning rates and hyperparameters based on gradient feedback and loss landscape characteristics.
  • They employ techniques such as per-parameter moment tracking, memory-efficient statistics, and meta-adaptation to balance computational cost with convergence speed.
  • These optimizers are crucial in large-scale and heterogeneous settings, enabling robust performance in federated learning, neural architecture search, and other advanced applications.

An adaptive optimizer in machine learning is an algorithm or framework that adjusts its own update rules, learning rates, or internal hyperparameters on the fly, conditioned on feedback from the optimization dynamics, loss surface, gradient signals, or meta-learned adaptation modules. Adaptive optimizers encompass several lines of research: per-parameter step size adaptation (e.g., Adam, AdaBelief), memory-efficient adaptations (e.g., SM3, Adafactor), programmably or meta-learned update rules, deep learnable optimizer frameworks, and domain-specific adaptation strategies for federated learning, neural architecture search, black-box optimization, or automated hyperparameter search.

1. Fundamental Principles and Taxonomy

Adaptive optimizers achieve curvature-aware or data-geometry-aware parameter updates by leveraging mechanisms that go beyond static learning schedules. Canonical instances of this class—such as Adam, RMSprop, and AdaGrad—maintain per-parameter running statistics (first and/or second moments) for robust preconditioning of the stochastic gradient, yielding scale-invariant or noise-adaptive step sizes (Anil et al., 2019).

The adaptive optimizer notion subsumes:

  • Elementwise Preconditioning: Algorithms estimate and apply dynamic diagonal preconditioners, scaling individual coordinates of the step by local gradient statistics.
  • Dynamic Memory: Higher-order optimizers may store and adapt multiple moment vectors (beyond Adam’s k=2), using projection-based or retrospective correction schemes (Szegedy et al., 2024).
  • Meta-adaptation: Optimizer rules, or their coefficients, may themselves be meta-learned or evolved, as seen in frameworks that produce new update laws through evolution or online adaptation (Carvalho et al., 2021).
  • Task-Specific Adaptivity: Optimizers may incorporate domain knowledge or meta-gradient learning to adapt to federated, distributed, or black-box optimization problems, replacing or learning operator heuristics (Wang et al., 29 Jan 2026, Hanzely, 2023).
  • Hybrid Schemes: Recent work emphasizes blending multiple optimizer behaviors (e.g., RMSProp with AdamW; SOAP with MUON) based on schedule, loss stage, or eigensubspace importance for improved performance-memory tradeoffs (Ubaidullah et al., 5 Jan 2026, Liu et al., 24 Feb 2025).
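Elementwise preconditioning, the first item above, can be made concrete with a minimal AdaGrad-style sketch (illustrative only; hyperparameters and variable names are our own, not from any cited paper). Each coordinate's effective step shrinks with its own accumulated gradient history, which also makes the update invariant to per-coordinate gradient scale:

```python
import numpy as np

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """One elementwise-preconditioned (AdaGrad-style) update.

    accum holds the running sum of squared gradients per parameter;
    each coordinate's effective step is lr * g / sqrt(sum of g^2).
    """
    accum += grads ** 2
    params -= lr * grads / (np.sqrt(accum) + eps)
    return params, accum

params = np.array([1.0, 1.0])
accum = np.zeros(2)
for _ in range(10):
    # One steep and one shallow coordinate (100x gradient scale gap):
    grads = np.array([10.0, 0.1])
    params, accum = adagrad_step(params, grads, accum)
# Despite the scale gap, both coordinates take (nearly) identical steps.
```

The final values of the two coordinates agree to high precision, illustrating the scale-invariance property the section attributes to this optimizer family.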

2. Algorithmic Archetypes and Key Methodologies

Several principled techniques instantiate adaptive optimizers:

  • First/Second Moment Tracking: Adam and its derivatives use $\{m_t, v_t\}$ to estimate bias-corrected means and variances per parameter, enabling local learning-rate adaptation.
  • Memory-Efficient Statistics: SM3 (Anil et al., 2019) and Adafactor (Glentis et al., 20 Jun 2025) reduce the per-parameter memory cost by maintaining statistics only over strategic covers (e.g., rows/columns in matrices), facilitating LLM-scale training.
  • Meta-Evolutionary Approaches: In AutoLR (Carvalho et al., 2021), grammars encode and evolve update rules (e.g., the ADES optimizer). Grammatical evolution allows for learned optimizer variants not accessible via analytic hand-engineering.
  • Learnable/Parametric Operators: ABOM’s evolutionary modules (Wang et al., 29 Jan 2026) comprise attention-MLP-based selection, recombination, and mutation operators whose parameters are continuously updated, adapting the optimizer mechanics directly to the problem population.
  • Hybrid and Blended Schedulers: COSMOS splits update computation: SOAP’s full-matrix adaptation is applied to the most significant eigensubspace, while lighter methods handle orthogonal complements, yielding a flexible memory-performance tradeoff (Liu et al., 24 Feb 2025). AWDR interpolates between RMSProp (variance smoothing, beneficial early) and AdamW (momentum and weight decay, stabilizing later) using time-scheduled convex combination (Ubaidullah et al., 5 Jan 2026).
  • Directional or Geometric Adaptation: HGM measures update-gradient directional alignment (cosine similarity) and uses it to adaptively accelerate or decelerate the learning rate in real time (Sarkar, 22 Jun 2025).
  • Parallel Meta-Optimization: Certain adaptive frameworks, especially for hyperparameter search, dynamically select among a portfolio of base optimizers using reward signals, population genetics, or ensemble learning (Sun, 2022).
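The first archetype above, first/second moment tracking, can be sketched with a bare-bones Adam step (a minimal NumPy illustration of the standard published update rule; the toy quadratic objective and hyperparameter values are ours):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first/second moment estimates."""
    m = b1 * m + (1 - b1) * grad           # first moment (mean) EMA
    v = b2 * v + (1 - b2) * grad ** 2      # second moment (uncentered variance) EMA
    m_hat = m / (1 - b1 ** t)              # bias corrections for zero init
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([2.0, -3.0])
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 1001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
```

Because the update divides the bias-corrected mean by the root of the bias-corrected variance, the effective per-coordinate step is roughly `lr` regardless of the raw gradient magnitude, which is the local learning-rate adaptation described above.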

3. Theoretical Guarantees and Stability Properties

Convergence analysis of adaptive optimizers is domain-dependent:

  • Online Convex Optimization: Most per-parameter adaptive methods (Adam, AdaBelief, SM3) maintain $O(\sqrt{T})$ regret bounds under standard convexity and bounded-gradient assumptions. For example, the foundational SM3 paper provides a time-varying diagonal-regularizer regret bound that relates the optimization error to the pathwise sum of coordinatewise maxima (Anil et al., 2019).
  • Nonconvex Optimization: Recent unified analyses, e.g., of Admeta and NOVAK, show that with decaying or properly rectified stepsize schedules, the mean norm of the gradient achieves $O(T^{-1/2})$ rates for smooth losses (Chen et al., 2023, Kavun, 11 Jan 2026).
  • Task-Free and Meta-Learning Settings: In ABOM, adaptation is entirely online from population statistics, and the operator parameters are trained with a self-supervised loss, giving a closed-loop adaptation scheme; generalization is supported empirically through observed zero-shot transfer rather than by formal guarantees (Wang et al., 29 Jan 2026).
  • Federated and Personalized Optimization: Personalized-loss formulations for federated learning permit accelerated, globally linear rates, with complexity scaling linearly with client count (Hanzely, 2023). Adaptive methods mitigate issues such as client-drift and convergence ruggedness in heterogeneous environments (Sun et al., 2023).
  • Stability via Suppressed Stepsize Range: Aida (Zhang et al., 2022) tightens the tails of the adaptive step-size distribution, yielding an optimizer with stability close to SGD-momentum but faster convergence, as shown via mutual projection arguments and empirical curves.
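As a concrete reference point for the $O(\sqrt{T})$ regret claims above, the standard online statement for diagonal (AdaGrad-style) adaptive methods takes the following form; the exact leading constant varies by method and analysis, so this should be read as the generic shape of the bound rather than any one paper's theorem:

```latex
R_T \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x)
\;\le\; O\!\left( D_\infty \sum_{i=1}^{d} \big\| g_{1:T,\,i} \big\|_2 \right)
\;=\; O\!\left(\sqrt{T}\right),
```

where $g_{1:T,i}$ collects the $i$-th gradient coordinate over all $T$ rounds and $D_\infty$ bounds the domain diameter; with bounded per-coordinate gradients, each term $\|g_{1:T,i}\|_2$ is itself $O(\sqrt{T})$, which yields the advertised rate.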

4. Memory, Computation, and Scalability Tradeoffs

Modern adaptive optimizers confront the tension between preconditioning sophistication and resource limitations:

| Optimizer | State memory per matrix | Adaptivity mechanism | Typical use-case |
|---|---|---|---|
| Adam/AdaGrad | $O(d^2)$ (for a $d \times d$ block) | Per-parameter 1st/2nd moments | General deep learning |
| Adafactor/SM3 | $O(d)$ | Row/column accumulators | LLMs, constrained RAM |
| SOAP | $O(d^2)$ (per-matrix full stats) | Full-matrix 2nd moments | Small/mid-size nets |
| Muon | $O(d^2)$ | Orthogonalized matrix updates | Geometry-aware updates |
| COSMOS | $O(dk)$ (for $k \ll d$) | Subspace-hybrid SOAP+Muon | LLMs, efficient preconditioning |

SCALE (Glentis et al., 20 Jun 2025) demonstrates that, for LLMs, applying column-normalized gradients and restricting momentum to the output layer permits Adam-level performance at roughly 35–45% of Adam's memory cost, outperforming other state-compressed adaptive methods and yielding state-of-the-art perplexity in large-scale settings.
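The row/column-accumulator idea in the table can be sketched in a few lines (an illustrative Adafactor-style factorization; the function name and the exact EMA details are ours, not the published algorithm verbatim). Instead of storing a full second-moment matrix $V$ in $O(d^2)$, one keeps exponential moving averages of the row sums and column sums of the squared gradient and reconstructs $V$ as a rank-1 outer product:

```python
import numpy as np

def factored_second_moment(R, C, grad, beta2=0.999, eps=1e-30):
    """Adafactor-style rank-1 reconstruction of the second-moment matrix.

    R and C are EMA accumulators of the row sums and column sums of grad**2
    (O(rows + cols) memory). The full matrix is reconstructed on the fly as
    V_hat ~= outer(R, C) / sum(R), which is exact when grad**2 is rank-1.
    """
    sq = grad ** 2
    R = beta2 * R + (1 - beta2) * sq.sum(axis=1)   # per-row accumulator
    C = beta2 * C + (1 - beta2) * sq.sum(axis=0)   # per-column accumulator
    V_hat = np.outer(R, C) / (R.sum() + eps)       # rank-1 approximation
    return R, C, V_hat

# A 3x2 gradient needs only 3 + 2 accumulator entries instead of 6:
grad = np.outer(np.array([1.0, 2.0, 3.0]), np.array([0.5, 4.0]))
R, C, V_hat = factored_second_moment(np.zeros(3), np.zeros(2), grad)
```

The savings in the toy case are modest, but for a $d \times d$ weight matrix the same scheme replaces $d^2$ accumulator entries with $2d$, which is the $O(d^2) \to O(d)$ row in the table above.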

5. Applications and Domain-Specific Adaptations

Adaptive optimizers are deployed across a variety of challenging machine learning and optimization domains:

  • Federated and Distributed Learning: Federated Local Adaptive Amended Optimizer (FedLADA) mitigates both rugged convergence and local overfitting in federated scenarios, utilizing local-global offset correction and achieving linear speedup under partial participation (Sun et al., 2023).
  • Black-Box Optimization: ABOM’s attention-based evolutionary operator adaptation enables zero-shot transfer to high-dimensional path planning and synthetic function minimization (Wang et al., 29 Jan 2026).
  • Database Query Optimization: AQORA integrates adaptive, reinforcement-learned query plan optimization with stage-level feedback in Spark SQL, yielding substantial end-to-end speedups compared to conventional LQO and AQP baselines (He et al., 12 Oct 2025).
  • Quantum-Classical Algorithms: iCANS allocates quantum measurement shots per-gradient component proportionally to gain per-shot, enabling significant measurement frugality and robustness under hardware noise for variational eigensolver tasks (Kübler et al., 2019).
  • Meta-Learning and Hyperparameter Search: Meta-evolved optimizers (AutoLR, genetic/Bayesian-ensemble schedulers) and portfolio-based hyperparameter search frameworks dynamically select or learn optimizer configurations for specific tasks or architectures (Sun, 2022, Carvalho et al., 2021).

6. Empirical Results and Benchmark Comparisons

Adaptive optimizers are consistently compared against classical baselines on benchmarks such as CIFAR-10/100, ImageNet, LLaMA LLM pretraining, WMT translation, and Penn Treebank language modeling. Select empirical observations:

  • SCALE matches Adam on LLaMA models while reducing total memory by over 50% relative to Adam and 15–20% compared to the best projection-based methods (Glentis et al., 20 Jun 2025).
  • COSMOS matches or surpasses the per-token efficiency of full-matrix SOAP while incurring only about 20% of its memory load (Liu et al., 24 Feb 2025).
  • NOVAK achieves up to +19.98 percentage points top-1 accuracy gain over Adam on CIFAR-100 (ResNet-50) and demonstrates unique robustness on plain CNNs, outperforming 14 leading optimizers (Kavun, 11 Jan 2026).
  • Aida yields up to a 1.55% accuracy gain over AdamW and AdaBelief on challenging image and NLP tasks while suppressing instability due to extreme adaptive steps (Zhang et al., 2022).
  • In practical database workloads, AQORA yields up to 90% end-to-end reduction in query execution time compared to learned enumeration-based optimizers (He et al., 12 Oct 2025).
  • Domain-specific hybrids, such as AWDR and Admeta, achieve faster stabilization and higher final accuracy by blending phase-specific behaviors, as evidenced in early epidemic-detection pipelines and general computer vision tasks (Ubaidullah et al., 5 Jan 2026, Chen et al., 2023).

7. Future Directions and Open Challenges

The design of adaptive optimizers is being extended by:

  • Higher-Order and Nonlinear Memory: RLLC demonstrates the power of dynamically re-weighted multiple memory units; further theoretical and practical exploration is warranted (Szegedy et al., 2024).
  • Meta-Learned and Task-Free Adaptation: Models like ABOM suggest that end-to-end, online-adapted optimizer architectures can generalize broadly, but explicit generalization guarantees are a major open question.
  • Direct Geometry Adaptation: Optimizers that combine structured geometric constraints (e.g., Muon/AdaMuon’s polar orthogonalization) with fine-grained, variance-aware adaptation are likely to continue supplanting purely diagonal adaptives in deep learning (Si et al., 15 Jul 2025, Liu et al., 24 Feb 2025).
  • Plug-and-Play Integration: A focus on optimizer drop-in compatibility, minimal extra tuning requirements, and hybrid allocation of adaptive states promises scalability for next-generation large models (Glentis et al., 20 Jun 2025, Si et al., 15 Jul 2025).
  • Theoretical Analysis of Stability: Understanding the effects of adaptivity on generalization, catastrophic forgetting, or sharp minima is ongoing, with sharpened analyses in stochastic, time-varying, and federated contexts (Hanzely, 2023, Kim et al., 2024).

References: (Anil et al., 2019, Carvalho et al., 2021, Zhang et al., 2022, Chen et al., 2023, Szegedy et al., 2024, Liu et al., 24 Feb 2025, Glentis et al., 20 Jun 2025, Sarkar, 22 Jun 2025, Si et al., 15 Jul 2025, He et al., 12 Oct 2025, Zhang et al., 25 Nov 2025, Ubaidullah et al., 5 Jan 2026, Kavun, 11 Jan 2026, Wang et al., 29 Jan 2026, Hanzely, 2023, Kübler et al., 2019, Sun, 2022)
