A Closer Look at Learned Optimization: Stability, Robustness, and Inductive Biases (2209.11208v1)

Published 22 Sep 2022 in cs.LG, math.OC, and stat.ML

Abstract: Learned optimizers -- neural networks that are trained to act as optimizers -- have the potential to dramatically accelerate training of machine learning models. However, even when meta-trained across thousands of tasks at huge computational expense, blackbox learned optimizers often struggle with stability and generalization when applied to tasks unlike those in their meta-training set. In this paper, we use tools from dynamical systems to investigate the inductive biases and stability properties of optimization algorithms, and apply the resulting insights to designing inductive biases for blackbox optimizers. Our investigation begins with a noisy quadratic model, where we characterize conditions in which optimization is stable, in terms of eigenvalues of the training dynamics. We then introduce simple modifications to a learned optimizer's architecture and meta-training procedure which lead to improved stability, and improve the optimizer's inductive bias. We apply the resulting learned optimizer to a variety of neural network training tasks, where it outperforms the current state of the art learned optimizer -- at matched optimizer computational overhead -- with regard to optimization performance and meta-training speed, and is capable of generalization to tasks far different from those it was meta-trained on.

Citations (20)

Summary

  • The paper introduces a dynamical systems analysis of learned optimizers using a noisy quadratic model to assess stability and meta-training challenges.
  • It proposes specific modifications—incorporating a nominal optimizer, heavy weight decay, and output preconditioning—to improve stability and robustness.
  • Experiments with the STAR optimizer demonstrate faster meta-training, better final performance, and strong generalization across diverse models and tasks.

This paper investigates the stability and generalization problems often encountered with learned optimizers (LOs), which are neural networks trained to perform optimization (A Closer Look at Learned Optimization: Stability, Robustness, and Inductive Biases, 2022). While LOs can potentially accelerate machine learning model training, they frequently become unstable or perform poorly when applied to unfamiliar tasks or run for longer training horizons than those seen during meta-training.

The authors use tools from dynamical systems theory, specifically analyzing the optimization process in a noisy quadratic model (NQM), to understand the stability properties of LOs. The NQM involves minimizing a quadratic loss function $L(\phi_t) = \frac{1}{2} (\phi_t - \mu_t)^\top H (\phi_t - \mu_t)$, where the minimum $\mu_t$ is drawn i.i.d. from a distribution (e.g., $N(0, \Sigma_{\mu})$) at each step $t$. This setup models stochastic optimization with minibatches.

In the NQM, the parameter update is modeled as $\phi_{t+1} = \phi_t - (\alpha \nabla_t + P \nabla_t)$, where $\alpha \nabla_t$ is a "nominal" hand-designed optimizer step (like scaled gradient descent) and $P \nabla_t$ is the output of a linear learned optimizer represented by the matrix $P$. This leads to linear dynamics: $\phi_{t+1} = (I - (\alpha I + P) H)\, \phi_t + (\alpha I + P) H\, \mu_t$. Stability is determined by the eigenvalues of the dynamics matrix $A = I - (\alpha I + P) H$: the system is stable if the spectral radius $\rho(A) = \max_i |\lambda_i(A)| < 1$. Instability ($\rho(A) \geq 1$) leads to diverging losses and unstable meta-gradients, hindering meta-training.
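To make the stability criterion concrete, here is a minimal NumPy sketch (not the paper's code; the dimensions, step size, and random $P$ are illustrative) that builds the dynamics matrix $A$, checks its spectral radius, and simulates the noisy quadratic:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
H = np.diag(rng.uniform(0.1, 2.0, size=d))  # quadratic curvature (Hessian)
alpha = 0.1                                 # nominal step size
P = 0.05 * rng.standard_normal((d, d))      # linear stand-in for the learned optimizer

# Dynamics matrix A = I - (alpha*I + P) H; the system is stable iff rho(A) < 1.
A = np.eye(d) - (alpha * np.eye(d) + P) @ H
rho = np.max(np.abs(np.linalg.eigvals(A)))
print(f"spectral radius = {rho:.3f} -> {'stable' if rho < 1 else 'unstable'}")

# Simulate phi_{t+1} = A phi_t + (alpha*I + P) H mu_t with mu_t ~ N(0, I),
# i.e., a fresh minimum is drawn each step to mimic minibatch noise.
B = (alpha * np.eye(d) + P) @ H
phi = rng.standard_normal(d)
for t in range(1000):
    mu = rng.standard_normal(d)
    phi = A @ phi + B @ mu
print(f"loss at the mean minimum: {0.5 * phi @ H @ phi:.3f}")
```

When $\rho(A) < 1$ the iterates settle into a noise floor set by $\Sigma_{\mu}$; when $\rho(A) \geq 1$ the simulated loss blows up, mirroring the divergence the paper reports during meta-training.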

Based on this analysis, the paper proposes several modifications to improve LO stability and inductive bias (a combined code sketch follows the list):

  1. Nominal Optimizer Term: Incorporating a hand-designed optimizer component (like Adam or AggMo) ensures a baseline descent direction, improving stability, especially early in meta-training. An additional learned magnitude controller modulates this nominal term.
    • Implementation: Add $f_g(z_t) = \beta_1 \exp(\beta_2 m_g(z_t))\, g(z_t)$ to the update, where $g(z_t)$ is the nominal update (e.g., an Adam step) and $m_g(z_t)$ is a learned magnitude output.
  2. Heavy Weight Decay: Applying strong $L_2$ regularization to the LO's parameters during meta-training discourages large outputs from the learned component, pulling eigenvalues towards stability.
    • Implementation: Use an AdamW meta-optimizer with a non-zero weight decay hyperparameter on the LO's network weights.
  3. Output Preconditioning: Normalizing the output of the learned component using an adaptive preconditioner (similar to Adam's normalization) makes the update magnitude less dependent on the problem's Hessian and improves robustness.
    • Implementation: Modify the blackbox term to $f_b(z_t) = \beta_3 \frac{d(z_t)}{v(z_t)} \exp(\beta_4 m_b(z_t))$, where $v(z_t)$ is a preconditioner term (e.g., an RMS of recent gradients, as in Adam).
  4. Stable Hidden States: Using stable update rules (like EMA) for the LO's internal state prevents internal dynamics from causing instability.
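Putting the four pieces together, the sketch below shows how such an update could look for a single parameter tensor. It is an illustration, not the paper's implementation: `m_g`, `m_b`, and `d_out` stand in for outputs of the learned per-parameter MLP, the constants `beta1`–`beta4` mirror the formulas above with illustrative magnitudes, and a bias-corrected RMS-of-gradients EMA serves as both the preconditioner $v(z_t)$ and the stable hidden state:

```python
import numpy as np

class StarLikeUpdate:
    """Illustrative combination of the four modifications above."""

    def __init__(self, beta1=0.05, beta2=1.0, beta3=1e-3, beta4=1.0,
                 ema_decay=0.99, eps=1e-8):
        self.beta1, self.beta2 = beta1, beta2
        self.beta3, self.beta4 = beta3, beta4
        self.ema_decay, self.eps = ema_decay, eps
        self.v = None  # EMA of squared gradients: a stable hidden state (item 4)
        self.t = 0

    def step(self, params, grad, m_g, d_out, m_b):
        # Item 4: stable (EMA) update rule for the internal state,
        # with Adam-style bias correction.
        if self.v is None:
            self.v = np.zeros_like(grad)
        self.t += 1
        self.v = self.ema_decay * self.v + (1 - self.ema_decay) * grad**2
        rms = np.sqrt(self.v / (1 - self.ema_decay**self.t)) + self.eps

        # Item 1: nominal term g(z_t) -- here an RMS-normalized gradient standing
        # in for Adam/AggMo -- modulated by the learned magnitude m_g(z_t).
        nominal = self.beta1 * np.exp(self.beta2 * m_g) * (grad / rms)

        # Item 3: blackbox direction d(z_t), preconditioned by v(z_t) and scaled
        # by its own learned magnitude m_b(z_t).
        blackbox = self.beta3 * (d_out / rms) * np.exp(self.beta4 * m_b)

        # Item 2 (heavy weight decay) acts on the LO's *own* weights during
        # meta-training (e.g., AdamW on the MLP), so it does not appear here.
        return params - (nominal + blackbox)

# Toy usage: descend on L(w) = |w|^2 with placeholder MLP outputs.
opt = StarLikeUpdate()
w = np.ones(3)
for _ in range(100):
    g = 2.0 * w                       # gradient of |w|^2
    w = opt.step(w, g, m_g=0.0, d_out=g, m_b=0.0)
print(w)                              # near zero, dithering at the step scale
```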

These modifications are incorporated into an existing efficient, elementwise MLP-based learned optimizer (small_fc_lopt), resulting in the "Stabilized Through Ample Regularization" (STAR) optimizer. STAR adds only a few parameters (for the nominal magnitude controller) compared to the baseline.

Experiments show that STAR:

  • Meta-trains faster and achieves better final performance on meta-training tasks (an MLP on Fashion MNIST, a CNN on CIFAR-10) compared to the purely blackbox baseline and a hyperparameter-controller variant.
  • Remains stable and performs well when run for significantly more steps (e.g., 10k steps) than used during meta-training (2k steps), unlike the baseline blackbox LO which often diverges.
  • Generalizes remarkably well to diverse, unseen tasks (different architectures like ResNet, LSTM, Transformer; different datasets like ImageNet, LM1B) even when only meta-trained on a small MLP/Fashion MNIST task. It often matches or outperforms tuned Adam on these tasks, while the baseline blackbox LO diverges.

The key takeaway is that explicitly incorporating stability-promoting inductive biases, guided by dynamical systems analysis, significantly improves the robustness, performance, and generalization capabilities of learned optimizers, making them more practical for real-world applications.