
Meta-Learned Optimizers

Updated 2 October 2025
  • Meta-learned optimizers are learning-to-learn algorithms that replace handcrafted update rules with functions trained from data, improving convergence speed and task-specific performance.
  • They utilize architectures such as LSTMs, hierarchical RNNs, MLPs, and attention mechanisms to achieve scalable, adaptive parameter updates.
  • Empirical evaluations reveal faster convergence, improved meta-generalization, and robustness across diverse neural network tasks and optimization challenges.

Meta-learned optimizers, also known as learned optimizers or “learning-to-learn” algorithms, are optimization algorithms whose update rules are not hand-designed but instead are learned from data using meta-learning techniques. The fundamental objective is to construct optimizers that can exploit structure in optimization tasks for improved convergence, adaptability, or task-specific performance. This approach reframes the design of optimization algorithms as a learning problem, typically employing recurrent architectures or other parameterized functions that update the parameters of a target optimizee based on observed gradients, loss signals, and other contextual features.

1. Meta-Learned Optimization: Problem Formulation and Key Principles

In conventional optimization, algorithms such as stochastic gradient descent (SGD) or Adam perform parameter updates according to fixed, handcrafted rules, e.g., $\theta_{t+1} = \theta_t - \alpha \nabla f(\theta_t)$. Meta-learned optimizers replace the explicit update rule with a trainable function $g_t$ parameterized by meta-parameters $\phi$, giving updates of the form $\theta_{t+1} = \theta_t + g_t(\nabla f(\theta_t), \phi)$. The update function typically operates coordinate-wise or per-tensor; many architectures share weights across parameters for memory and compute efficiency.
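As a minimal illustration, here is a coordinatewise learned update in PyTorch (a sketch, not any paper's reference implementation; the network size and the gradient-only feature set are placeholder choices):

```python
import torch
import torch.nn as nn

class CoordinatewiseUpdate(nn.Module):
    """Learned update rule g(grad; phi): maps each scalar gradient
    coordinate to a scalar update. The weights (phi) are shared across
    all coordinates, so the rule scales to any parameter count."""
    def __init__(self, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, grad):
        g = grad.reshape(-1, 1)          # one row per parameter coordinate
        return self.net(g).reshape(grad.shape)

# One step: theta_{t+1} = theta_t + g(grad f(theta_t); phi)
update_fn = CoordinatewiseUpdate()
theta = torch.randn(10, requires_grad=True)
loss = (theta ** 2).sum()                # toy quadratic optimizee
grad, = torch.autograd.grad(loss, theta)
with torch.no_grad():
    theta += update_fn(grad)
```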

The meta-learning objective is defined over task trajectories:

$$\mathcal{L}(\phi) = \mathbb{E}_f \left[ \sum_{t=1}^T w_t f(\theta_t) \right]$$

subject to the recursive update equations:

$$\theta_{t+1} = \theta_t + g_t, \qquad [g_t, h_{t+1}] = m(\nabla_t, h_t, \phi)$$

where $g_t$ is the meta-learned update, $h_t$ is the recurrent hidden state, and $m$ implements the optimizer (e.g., via an LSTM or MLP) (Andrychowicz et al., 2016).

The meta-learning loop is typically bi-level: an inner loop applies the learned optimizer to train a target network on an optimization task, and an outer loop updates $\phi$ to minimize the overall meta-objective, either with gradient-based methods or with evolution strategies (ES) (Andrychowicz et al., 2016, Metz et al., 2022).
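A minimal PyTorch sketch of this bi-level loop (the quadratic task family, unroll length, and uniform loss weights $w_t = 1$ are illustrative choices, not from any cited paper):

```python
import torch
import torch.nn as nn

# Learned optimizer: tiny coordinatewise MLP with meta-parameters phi.
opt_net = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
meta_opt = torch.optim.Adam(opt_net.parameters(), lr=1e-3)

def inner_unroll(T=20):
    # Sample a task: minimize f(theta) = ||A theta - b||^2.
    A, b = torch.randn(5, 5), torch.randn(5)
    theta = torch.zeros(5, requires_grad=True)
    meta_loss = 0.0
    for _ in range(T):
        f = ((A @ theta - b) ** 2).sum()
        grad, = torch.autograd.grad(f, theta, create_graph=True)
        # Functional (out-of-place) update keeps the graph intact so the
        # meta-loss can backpropagate through the whole trajectory into phi.
        theta = theta + opt_net(grad.unsqueeze(-1)).squeeze(-1)
        meta_loss = meta_loss + f
    return meta_loss / T

for step in range(100):           # outer loop over phi
    meta_opt.zero_grad()
    inner_unroll().backward()     # meta-gradient via backprop through unroll
    meta_opt.step()
```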

2. Meta-Learned Optimizer Architectures and Design Features

Meta-learned optimizers use a range of neural architectures:

  • Coordinatewise/LSTM-based: Early work adopted per-parameter LSTM optimizers, with weights shared across coordinates and an independent hidden state per coordinate. For example, Andrychowicz et al. (2016) used a 2-layer LSTM with 20 hidden units, coordinatewise preprocessing of gradients, and rescaling of outputs (see the sketch after this list).
  • Hierarchical RNNs: To improve scalability and capture inter-parameter dependencies, hierarchical designs (Parameter RNN → Tensor RNN → Global RNN) have been proposed, where higher-level RNNs aggregate and distribute context to lower-level update rules (Wichrowska et al., 2017):

| Architecture Layer | Role | Sharing |
|---|---|---|
| Parameter RNN | Processes local information per parameter | Weights shared across coordinates |
| Tensor RNN | Aggregates information per tensor (e.g., layer) | One per tensor group |
| Global RNN | Coordinates across the entire network | Single instance |
  • MLP and Attention-Based: Later methods employ multi-layer perceptrons (MLPs) operating on rich per-parameter features (e.g., gradients, momenta at multiple timescales, norm statistics) (Metz et al., 2019, Metz et al., 2022). Population-based designs add attention mechanisms (feature-level and sample-level) for swarm-based optimization (Cao et al., 2019).
  • Hybrid and Scheduler Architectures: Some methods decouple a global scheduler (e.g., LSTM-based) from per-parameter MLP update rules, providing dynamic step-size control in addition to local parameter updates (Moudgil et al., 22 Jan 2025).
  • Context-Aware/Hypernetwork-Based: Context-aware optimizers modify their update rules dynamically via input-conditioned hypernetworks, e.g., via SVD-based weight factorization and HyperNetworks that generate eigenvalues as a function of the input data (Kuo et al., 2020).
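The coordinatewise LSTM design in the first bullet, made concrete as a hedged PyTorch sketch (sizes follow the 2-layer, 20-unit description; Andrychowicz et al.'s log-scale gradient preprocessing and output rescaling are omitted for brevity):

```python
import torch
import torch.nn as nn

class LSTMOptimizer(nn.Module):
    """Coordinatewise LSTM optimizer in the style of Andrychowicz et al.
    (2016): one 2-layer, 20-unit LSTM whose weights are shared across all
    parameter coordinates, each coordinate carrying its own hidden state
    (coordinates are mapped onto the LSTM's batch dimension)."""
    def __init__(self, hidden=20, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden,
                            num_layers=layers)
        self.head = nn.Linear(hidden, 1)

    def forward(self, grad, state=None):
        x = grad.reshape(1, -1, 1)   # (seq=1, batch=n coords, features=1)
        out, state = self.lstm(x, state)
        return self.head(out).reshape(grad.shape), state

# Usage: carry `state` across steps; theta_{t+1} = theta_t + g_t.
opt = LSTMOptimizer()
theta = torch.randn(8, requires_grad=True)
state = None
for _ in range(3):
    loss = (theta ** 2).sum()
    grad, = torch.autograd.grad(loss, theta)
    with torch.no_grad():
        g, state = opt(grad, state)
        theta += g
```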

The update equations often generalize or extend classical optimizers:

$$\theta_{t+1} = \theta_t + g_t = \theta_t + \text{MLP}(\text{features}(\nabla f, \theta_t, \text{state}); \phi)$$

Rescalings (e.g., via $\exp$ terms, or norm adjustments) enable stability and meta-generalization (Metz et al., 2022, Moudgil et al., 22 Jan 2025, Thérien et al., 31 May 2024).
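A sketch of such an MLP update with exponential output rescaling, loosely following the small_fc_lopt style of Metz et al. (2022); the three-feature input and the values of $\lambda_1, \lambda_2$ are placeholders:

```python
import torch
import torch.nn as nn

class MLPUpdate(nn.Module):
    """Per-parameter MLP update with exponential rescaling: the network
    emits a direction d and a log-magnitude m per coordinate, combined
    as lambda1 * d * exp(lambda2 * m)."""
    def __init__(self, n_features=3, hidden=32,
                 lambda1=1e-3, lambda2=1e-3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 2))
        self.lambda1, self.lambda2 = lambda1, lambda2

    def forward(self, grad, m_fast, m_slow):
        # Illustrative features: raw gradient plus two momentum timescales.
        feats = torch.stack([grad, m_fast, m_slow], dim=-1)
        d, m = self.net(feats).unbind(-1)
        # Small lambdas keep initial updates near zero, which stabilizes
        # early meta-training.
        return self.lambda1 * d * torch.exp(self.lambda2 * m)
```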

3. Meta-Training Objectives and Methodologies

Optimizers are meta-trained using a bi-level setup:

  • Inner Loop: Applies the candidate optimizer to optimize neural network parameters on a sampled task (e.g., image classification, scientific computing, black-box functions); trajectories are typically unrolled for $T$ steps.
  • Outer Loop: Adjusts the optimizer’s meta-parameters by minimizing a meta-objective, such as the sum of losses over the inner trajectory, or final loss (Andrychowicz et al., 2016, Metz et al., 2022).

Meta-objective example:

$$\mathcal{L}_{\text{meta}}(\phi) = \mathbb{E}_{\text{task}} \left[ \frac{1}{T} \sum_{t=1}^T f(\theta_t) \;\middle|\; \theta_{t+1} = \theta_t + g_t(\cdot, \phi) \right]$$

Gradient estimation is handled through backpropagation through the unrolled trajectory (often truncated to bound memory and tame exploding or vanishing meta-gradients) or through evolution strategies (ES), which avoid differentiating through long unrolls entirely (Andrychowicz et al., 2016, Metz et al., 2022).
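A minimal antithetic ES estimator for the meta-gradient (a sketch: `phi` is assumed to be a flat parameter tensor, and `meta_loss_fn` is a hypothetical callable that runs one full inner-loop unroll with the perturbed optimizer parameters and returns a scalar loss):

```python
import torch

def es_meta_grad(phi, meta_loss_fn, sigma=0.01, n_pairs=8):
    """Antithetic evolution-strategies estimate of
    d/dphi E[meta_loss(phi + sigma * eps)], eps ~ N(0, I)."""
    grad = torch.zeros_like(phi)
    for _ in range(n_pairs):
        eps = torch.randn_like(phi)
        l_pos = meta_loss_fn(phi + sigma * eps)
        l_neg = meta_loss_fn(phi - sigma * eps)
        # Paired +/- perturbations cancel noise and halve variance.
        grad += (l_pos - l_neg) / (2 * sigma) * eps
    return grad / n_pairs
```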

The candidate optimizer is trained on a wide or procedurally generated distribution of tasks to encourage meta-generalization and robustness (Metz et al., 2022, Moudgil et al., 22 Jan 2025).

4. Performance Evaluation and Practical Metrics

Meta-learned optimizers are evaluated using:

  • Training Loss Curves: Relative to baseline optimizers (SGD, Adam, RMSProp), meta-learned optimizers can achieve faster convergence, lower final loss, and improved performance on tasks for which they were meta-trained (Andrychowicz et al., 2016, Metz et al., 2022, Moudgil et al., 22 Jan 2025).
  • Meta-Generalization: Performance is measured both on in-distribution and diverse out-of-distribution tasks, including unseen datasets, larger/deeper architectures, and longer optimization horizons (Metz et al., 2022, Thérien et al., 31 May 2024, Moudgil et al., 22 Jan 2025).
  • Speedup and IQM: Metrics such as "normalized speedup" (the ratio of a baseline optimizer's iteration count to the learned optimizer's step count for equivalent performance) and the interquartile mean (IQM) of normalized final loss or speedup benchmark optimizer efficacy across testbeds (Metz et al., 2022, Moudgil et al., 22 Jan 2025); see the sketch after this list.
  • Robustness: Evaluations can also target robustness to input corruption, including image noise or domain shifts, where learned optimizers demonstrate enhanced transfer (Metz et al., 2019).
  • Resource Trade-offs: Studies quantitatively analyze trade-offs between memory overhead, computational cost, and achieved loss, establishing Pareto frontiers over a variety of optimizer architectures (Metz et al., 2022, Moudgil et al., 22 Jan 2025).
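A sketch of the two aggregate metrics referenced above (exact definitions vary across papers; these are plausible NumPy implementations, not the benchmarks' reference code):

```python
import numpy as np

def steps_to_target(losses, target):
    """First step at which a loss curve reaches `target` (inf if never)."""
    hits = np.nonzero(np.asarray(losses) <= target)[0]
    return hits[0] + 1 if hits.size else np.inf

def normalized_speedup(baseline_losses, learned_losses, target):
    # > 1 means the learned optimizer reached the target in fewer steps.
    return (steps_to_target(baseline_losses, target)
            / steps_to_target(learned_losses, target))

def iqm(values):
    """Interquartile mean: average of the middle 50% of values,
    a robust aggregate across benchmark tasks."""
    v = np.sort(np.asarray(values, dtype=float))
    lo, hi = int(np.floor(0.25 * len(v))), int(np.ceil(0.75 * len(v)))
    return v[lo:hi].mean()
```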

5. Meta-Generalization: Challenges and Advances

A central challenge is meta-generalization—the ability to transfer learned optimization behavior to new task distributions and model scales:

  • Width and Depth Scaling: Standard parameterization schemes produce learned optimizers that generalize poorly to wider or deeper models due to discrepancies in activation and gradient statistics. Maximal Update Parametrization ($\mu$P) addresses this by scaling initializations, activations, and updates such that both small and large models remain matched in distribution throughout training (Thérien et al., 31 May 2024).

Update rule for hidden layers under $\mu$P:

$$w^{(t+1)}_i = w^{(t)}_i - \frac{1}{\text{fan-in}} \left[ \lambda_1 d_\phi \exp(\lambda_2 m_\phi) \right]$$
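As code, this $\mu$P-scaled hidden-layer update might look like the following sketch (the constants and the assumption that `w` is stored as `(fan_out, fan_in)` are illustrative):

```python
import torch

def mup_hidden_update(w, d, m, lambda1=1e-3, lambda2=1e-3):
    """Hidden-layer update under muP-style scaling: the learned
    direction/magnitude outputs (d, m) are divided by fan-in so that
    update statistics stay matched across model widths. lambda1 and
    lambda2 are fixed rescaling constants (placeholder values here)."""
    fan_in = w.shape[1]                  # w assumed (fan_out, fan_in)
    return w - (lambda1 * d * torch.exp(lambda2 * m)) / fan_in
```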

  • Task Distribution and Curriculum: The diversity and representativeness of the meta-training task distribution critically influence generalization (Metz et al., 2022, Moudgil et al., 22 Jan 2025).
  • Scheduler Decoupling and Task Augmentation: Separating step-size scheduling (global LSTM) from per-parameter updates (MLPs), and augmenting tasks by parameter rescaling, results in improved generalization and robustness to reparametrization (Moudgil et al., 22 Jan 2025).
  • Out-of-Distribution Robustness: In domains such as physics-informed neural networks (PINNs) for PDEs, learned optimizers transfer across different equations, outperforming Adam on unseen physics tasks (Bihlo, 2023).

6. Scalability, Implementation, and Open Source Infrastructure

Recent work focuses on making meta-learned optimizers scalable and accessible:

  • Scalability: Hierarchical decompositions, coordinatewise operations, and hypernetwork-based parameterizations enable efficient scaling to large models (e.g., tens of millions of parameters), with performance maintained through normalization of inputs and statistics plus update rescaling (Wichrowska et al., 2017, Metz et al., 2022).
  • Practical Integration: Libraries such as PyLO supply optimized, CUDA-accelerated learned optimizer implementations (e.g., small_fc_lopt), with support for HuggingFace Hub model weight sharing, seamless integration with learning rate schedules, weight decay, and standard trainer APIs (Janson et al., 12 Jun 2025).
| Implementation Module | Functionality |
|---|---|
| Optimization Module | State management, optimizer forward pass |
| Meta-Model Architectures | Encapsulate learned optimizer parameters and computation |
| CUDA Acceleration | Kernel-level speedup for large parameter counts |

7. Practical Impact and Research Trajectories

Applications of meta-learned optimizers include:

  • Efficient Neural Network Training: Accelerating training convergence, reducing the number of required training steps, and facilitating hyperparameter-free training (Andrychowicz et al., 2016, Metz et al., 2022, Moudgil et al., 22 Jan 2025).
  • Robustness to Distribution Shift and Noise: Training models robust to data corruptions or domain shifts (Metz et al., 2019).
  • Black-box and Population-based Optimization: Learning optimizers for derivative-free tasks; population-based optimizer meta-learning for hyperparameter tuning and non-differentiable search (TV et al., 2019, Gomes et al., 2021, Cao et al., 2019).
  • Physics-informed Neural Networks: Enhanced training of PINNs for scientific computing (Bihlo, 2023).
  • Large-Scale Distributed and Communication-Efficient Learning: Meta-learned aggregation functions for federated/local SGD in distributed settings (Joseph et al., 2023).
  • Reinforcement Learning: RL-specific meta-learned optimizers address nonstationarity, plasticity, and exploration (Goldie et al., 9 Jul 2024).

Open questions and research directions involve the extension of meta-learned optimizers to broader task domains, improving meta-generalization for highly overparameterized or specialized tasks, balancing stability and flexibility (“symmetry breaking” in optimizer updates (Sobotka et al., 2023)), integrating hybrid and interpretable update rules, and further reducing computational barriers for meta-training at scale.


In summary, meta-learned optimizers adapt the process of designing optimization algorithms to a data-driven meta-learning paradigm. Through recurrent and hierarchical architectures, careful meta-training on diverse tasks, and explicit parameterization strategies such as $\mu$P, learned optimizers now demonstrate strong empirical performance, improved generalization to new settings, and increasing practicality as drop-in alternatives to classical optimizers. Their ongoing development, evaluation, and deployment are central to advancing both the theoretical understanding and real-world efficiency of machine learning optimization.
