
Meta-SGD: A Meta-Learning Algorithm

Updated 19 March 2026
  • Meta-SGD is a meta-learning algorithm that jointly learns initialization parameters and elementwise step-sizes to enable swift adaptation in few-shot scenarios.
  • It extends the MAML framework with learned per-parameter learning rates, adding flexibility and improving performance across regression, classification, and control tasks.
  • The method employs inner-loop adaptations and meta-gradient backpropagation, demonstrating effective generalization in both supervised and reinforcement learning settings.

Meta-SGD is a meta-learning algorithm that enables rapid adaptation to new tasks by learning a parameter initialization and a per-parameter update rule within an SGD-like framework. Designed for few-shot learning, Meta-SGD combines the computational simplicity of model-agnostic meta-learning (MAML) with the flexibility of learning both step-sizes and update directions. It supports both supervised and reinforcement learning scenarios, demonstrating efficacy across regression, classification, and control tasks (Li et al., 2017).

1. Meta-Learning Objective and Formalism

Meta-SGD operates under a meta-learning paradigm that assumes a distribution over tasks, $p(\mathcal{T})$. Each task $\mathcal{T}$ provides a small training set, $\mathrm{train}(\mathcal{T})$, and a corresponding test set, $\mathrm{test}(\mathcal{T})$. The empirical losses are defined as

$$\mathcal{L}_{\mathrm{train}(\mathcal{T})}(\theta) = \frac{1}{|\mathrm{train}(\mathcal{T})|}\sum_{(x, y)\in\mathrm{train}(\mathcal{T})}\ell(f_\theta(x), y),$$

$$\mathcal{L}_{\mathrm{test}(\mathcal{T})}(\theta) = \frac{1}{|\mathrm{test}(\mathcal{T})|}\sum_{(x, y)\in\mathrm{test}(\mathcal{T})}\ell(f_\theta(x), y).$$

Meta-SGD introduces two meta-parameters: an initialization vector $\theta \in \mathbb{R}^D$ and an element-wise step-size vector $\alpha \in \mathbb{R}^D$. The adaptation for a sampled task is

$$\theta' = \theta - \alpha \odot \nabla_\theta \mathcal{L}_{\mathrm{train}(\mathcal{T})}(\theta),$$

where $\odot$ denotes element-wise multiplication. The meta-objective is to minimize the expected test loss after this adaptation:

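$$\min_{\theta, \alpha}\ \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})}\left[\mathcal{L}_{\mathrm{test}(\mathcal{T})}(\theta')\right] = \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})}\left[\mathcal{L}_{\mathrm{test}(\mathcal{T})}\left(\theta - \alpha \odot \nabla_\theta \mathcal{L}_{\mathrm{train}(\mathcal{T})}(\theta)\right)\right].$$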
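As a rough illustration of the adaptation rule, the sketch below applies one element-wise step to a two-parameter linear model with a mean-squared-error loss. The model, the data, and the names `mse_loss`, `mse_grad`, and `adapt` are assumptions chosen for illustration; they do not come from the original paper.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Data = Sequence[Tuple[float, float]]  # (x, y) pairs


def mse_loss(theta: Vector, data: Data) -> float:
    """Mean squared error of the toy model f(x) = theta[0] * x + theta[1]."""
    return sum((theta[0] * x + theta[1] - y) ** 2 for x, y in data) / len(data)


def mse_grad(theta: Vector, data: Data) -> Vector:
    """Analytic gradient of mse_loss with respect to theta."""
    n = len(data)
    residuals = [theta[0] * x + theta[1] - y for x, y in data]
    return [
        sum(2.0 * r * x for r, (x, _) in zip(residuals, data)) / n,
        sum(2.0 * r for r in residuals) / n,
    ]


def adapt(theta: Vector, alpha: Vector, train: Data) -> Vector:
    """One inner step: theta' = theta - alpha (elementwise) * grad L_train(theta)."""
    grad = mse_grad(theta, train)
    return [t - a * g for t, a, g in zip(theta, alpha, grad)]


# Hypothetical task: points on the line y = 2x + 1.
theta = [0.0, 0.0]                 # initialization (a meta-parameter)
alpha = [0.1, 0.5]                 # per-parameter step sizes (a meta-parameter)
train = [(1.0, 3.0), (2.0, 5.0)]   # train(T)
test = [(3.0, 7.0)]                # test(T)

theta_prime = adapt(theta, alpha, train)
print(theta_prime)                                        # approximately [1.3, 4.0]
print(mse_loss(theta, test), mse_loss(theta_prime, test)) # test loss before and after
```

Here the two coordinates use different step sizes, which a single scalar learning rate could not express.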

2. Meta-SGD Update Derivation

a. Inner-Loop Adaptation

For each task, the adaptation step is computed using the current meta-parameters:

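$$g_i = \nabla_\theta \mathcal{L}_{\mathrm{train}(\mathcal{T}_i)}(\theta),$$

$$\theta_i' = \theta - \alpha \odot g_i.$$

The vector $\alpha$ is learned to encode both the direction and the magnitude of the update: the absolute value of each component sets the step length for that coordinate, and its sign determines whether the coordinate moves against or along the gradient.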

b. Outer-Loop Meta-Update

Performance is evaluated on the test set for each task after the inner-loop adaptation. Aggregating over a batch of tasks, the meta-parameters are updated as follows:

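$$(\theta, \alpha) \leftarrow (\theta, \alpha) - \beta\, \nabla_{(\theta, \alpha)} \sum_{\mathcal{T}_i} \mathcal{L}_{\mathrm{test}(\mathcal{T}_i)}(\theta_i'),$$

where $\theta_i'$ denotes the adapted parameters for task $\mathcal{T}_i$ and $\beta$ is the meta-step size.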

c. Meta-Gradient Through Adaptation

Gradients are backpropagated through the inner-loop adaptation using the chain rule, yielding:

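$$\nabla_\theta \mathcal{L}_{\mathrm{test}(\mathcal{T}_i)}(\theta_i') = \left(I - \nabla_\theta^2 \mathcal{L}_{\mathrm{train}(\mathcal{T}_i)}(\theta)\,\mathrm{diag}(\alpha)\right)\nabla_{\theta_i'} \mathcal{L}_{\mathrm{test}(\mathcal{T}_i)}(\theta_i'),$$

$$\nabla_\alpha \mathcal{L}_{\mathrm{test}(\mathcal{T}_i)}(\theta_i') = -\nabla_\theta \mathcal{L}_{\mathrm{train}(\mathcal{T}_i)}(\theta) \odot \nabla_{\theta_i'} \mathcal{L}_{\mathrm{test}(\mathcal{T}_i)}(\theta_i').$$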

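In practice, automatic differentiation frameworks compute these meta-gradients, including the second-order term that arises from differentiating through the inner-loop gradient, without forming the Hessian explicitly.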
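The chain-rule result can be checked numerically on a small case. The sketch below assumes a two-parameter linear model with a mean-squared-error loss, for which the Hessian is constant and easy to write down, and compares hand-derived meta-gradients against central finite differences. All function names and data are hypothetical; a real implementation would rely on automatic differentiation rather than a hand-coded Hessian.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Data = Sequence[Tuple[float, float]]  # (x, y) pairs


def loss(theta: Vector, data: Data) -> float:
    return sum((theta[0] * x + theta[1] - y) ** 2 for x, y in data) / len(data)


def grad(theta: Vector, data: Data) -> Vector:
    n = len(data)
    res = [theta[0] * x + theta[1] - y for x, y in data]
    return [sum(2.0 * r * x for r, (x, _) in zip(res, data)) / n,
            sum(2.0 * r for r in res) / n]


def hessian(data: Data) -> List[Vector]:
    """Hessian of the MSE loss; constant in theta for a linear model."""
    n = len(data)
    sxx = sum(x * x for x, _ in data) / n
    sx = sum(x for x, _ in data) / n
    return [[2.0 * sxx, 2.0 * sx], [2.0 * sx, 2.0]]


def adapt(theta: Vector, alpha: Vector, train: Data) -> Vector:
    g = grad(theta, train)
    return [t - a * gi for t, a, gi in zip(theta, alpha, g)]


def meta_loss(theta: Vector, alpha: Vector, train: Data, test: Data) -> float:
    """Test loss evaluated at the adapted parameters."""
    return loss(adapt(theta, alpha, train), test)


def meta_grads(theta: Vector, alpha: Vector, train: Data, test: Data):
    """Gradients of meta_loss with respect to theta and alpha via the chain rule."""
    g_train = grad(theta, train)
    g_test = grad(adapt(theta, alpha, train), test)
    h = hessian(train)
    scaled = [a * g for a, g in zip(alpha, g_test)]            # diag(alpha) g_test
    h_term = [sum(h[i][j] * scaled[j] for j in range(2)) for i in range(2)]
    d_theta = [g - ht for g, ht in zip(g_test, h_term)]        # (I - H diag(alpha)) g_test
    d_alpha = [-gt * gs for gt, gs in zip(g_train, g_test)]    # -g_train (elementwise) g_test
    return d_theta, d_alpha


def finite_diff(f: Callable[[Vector], float], v: Vector, eps: float = 1e-5) -> Vector:
    """Central finite-difference approximation of the gradient of f at v."""
    out = []
    for i in range(len(v)):
        up, down = list(v), list(v)
        up[i] += eps
        down[i] -= eps
        out.append((f(up) - f(down)) / (2.0 * eps))
    return out


theta, alpha = [0.5, -0.2], [0.1, 0.3]
train = [(1.0, 3.0), (2.0, 5.0), (-1.0, -1.0)]
test = [(3.0, 7.0), (0.5, 2.0)]

d_theta, d_alpha = meta_grads(theta, alpha, train, test)
fd_theta = finite_diff(lambda t: meta_loss(t, alpha, train, test), theta)
fd_alpha = finite_diff(lambda a: meta_loss(theta, a, train, test), alpha)
print(d_theta, fd_theta)
print(d_alpha, fd_alpha)
```

The analytic and finite-difference values should agree to several decimal places; a mismatch usually points to a dropped Hessian term or a sign error in the step-size gradient.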

3. Comparisons to Alternative Meta-Learners

The distinguishing characteristics of Meta-SGD, MAML, and Meta-LSTM are summarized as follows:

| Algorithm | Learned Parameters | Update Dynamics | Computational Complexity |
|---|---|---|---|
| MAML | $\theta$ | Fixed scalar $\alpha$ (hyperparameter) | Simple, end-to-end, global rate |
| Meta-LSTM | Via RNN: $\theta$, update rule | RNN-parameterized | Flexible, but high cost |
| Meta-SGD | $\theta$, elementwise $\alpha$ | SGD-like, per-coordinate rate & direction | Simple, efficient, per-coordinate |

  • MAML learns only the initialization $\theta$, with a global, hand-chosen step-size hyperparameter $\alpha$. While broadly applicable, its capacity is constrained by the fixed update structure.
  • Meta-LSTM employs an LSTM to generate parameter updates, which gives the optimizer high flexibility at the cost of significant computational overhead, scalability challenges, and complicated training.
  • Meta-SGD simultaneously learns the initialization and a vector of step-sizes (including possible sign changes), providing higher capacity than MAML while retaining its implementation simplicity and scalability.
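The relationship between the two SGD-like update rules can be made concrete with a few lines of Python. In this sketch (function names are illustrative, not from the paper), the MAML inner step is the special case of the Meta-SGD step in which every component of the step-size vector equals the same scalar; a mixed vector shows per-coordinate rates and a sign change.

```python
from typing import List

Vector = List[float]


def maml_step(theta: Vector, lr: float, grad: Vector) -> Vector:
    """MAML-style inner step with one global scalar learning rate."""
    return [t - lr * g for t, g in zip(theta, grad)]


def meta_sgd_step(theta: Vector, alpha: Vector, grad: Vector) -> Vector:
    """Meta-SGD-style inner step with one learned step size per parameter."""
    return [t - a * g for t, a, g in zip(theta, alpha, grad)]


theta = [1.0, 1.0, 1.0]
grad = [2.0, 2.0, 2.0]

maml = maml_step(theta, 0.5, grad)
same = meta_sgd_step(theta, [0.5, 0.5, 0.5], grad)     # constant vector reproduces MAML
mixed = meta_sgd_step(theta, [0.5, 0.0, -0.25], grad)  # descend, freeze, ascend

print(maml)   # [0.0, 0.0, 0.0]
print(same)   # [0.0, 0.0, 0.0]
print(mixed)  # [0.0, 1.0, 1.5]
```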

4. Algorithmic Implementation

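The meta-learning loop for supervised few-shot learning is as follows (pseudocode in LaTeX notation):

Input: task distribution $p(\mathcal{T})$, meta-step size $\beta$
Output: $\theta$, $\alpha$
Initialize $\theta$ and $\alpha$
while not converged:
    sample a batch of tasks $\mathcal{T}_i \sim p(\mathcal{T})$
    for each $\mathcal{T}_i$:
        $\theta_i' \leftarrow \theta - \alpha \odot \nabla_\theta \mathcal{L}_{\mathrm{train}(\mathcal{T}_i)}(\theta)$
    $(\theta, \alpha) \leftarrow (\theta, \alpha) - \beta\, \nabla_{(\theta, \alpha)} \sum_{\mathcal{T}_i} \mathcal{L}_{\mathrm{test}(\mathcal{T}_i)}(\theta_i')$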

At meta-test time, adaptation on a new task involves one application of the learned update.
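A minimal end-to-end sketch of this loop, using only the Python standard library, is given below. It assumes a toy family of linear-regression tasks (random slope and intercept), a two-parameter linear model, and hand-derived meta-gradients in place of automatic differentiation; it also averages rather than sums the per-task meta-gradients. The task family, hyperparameters, and function names are all illustrative assumptions, not the setup of the original paper.

```python
import random
from typing import List, Tuple

Vector = List[float]
Data = List[Tuple[float, float]]  # (x, y) pairs


def loss(theta: Vector, data: Data) -> float:
    return sum((theta[0] * x + theta[1] - y) ** 2 for x, y in data) / len(data)


def grad(theta: Vector, data: Data) -> Vector:
    n = len(data)
    res = [theta[0] * x + theta[1] - y for x, y in data]
    return [sum(2.0 * r * x for r, (x, _) in zip(res, data)) / n,
            sum(2.0 * r for r in res) / n]


def hessian_vec(data: Data, v: Vector) -> Vector:
    """Product of the (constant) MSE Hessian of the linear model with v."""
    n = len(data)
    sxx = sum(x * x for x, _ in data) / n
    sx = sum(x for x, _ in data) / n
    return [2.0 * (sxx * v[0] + sx * v[1]), 2.0 * (sx * v[0] + v[1])]


def adapt(theta: Vector, alpha: Vector, train: Data) -> Vector:
    """Inner loop: one element-wise step on the task's training set."""
    return [t - a * g for t, a, g in zip(theta, alpha, grad(theta, train))]


def meta_grads(theta: Vector, alpha: Vector, train: Data, test: Data):
    """Gradients of the post-adaptation test loss w.r.t. theta and alpha."""
    g_train = grad(theta, train)
    g_test = grad(adapt(theta, alpha, train), test)
    h_term = hessian_vec(train, [a * g for a, g in zip(alpha, g_test)])
    d_theta = [g - h for g, h in zip(g_test, h_term)]
    d_alpha = [-gt * gs for gt, gs in zip(g_train, g_test)]
    return d_theta, d_alpha


def sample_task(rng: random.Random, k: int = 5) -> Tuple[Data, Data]:
    """Hypothetical task: y = a*x + b with random slope a and intercept b."""
    a, b = rng.uniform(1.0, 3.0), rng.uniform(-1.0, 1.0)

    def draw() -> Data:
        return [(x, a * x + b) for x in (rng.uniform(-1.0, 1.0) for _ in range(k))]

    return draw(), draw()


def evaluate(theta: Vector, alpha: Vector, tasks) -> float:
    """Mean test loss after one adaptation step, averaged over tasks."""
    return sum(loss(adapt(theta, alpha, tr), te) for tr, te in tasks) / len(tasks)


def meta_train(steps: int = 300, batch: int = 4, beta: float = 0.05, seed: int = 0):
    rng = random.Random(seed)
    theta, alpha = [0.0, 0.0], [0.01, 0.01]
    for _ in range(steps):
        sum_t, sum_a = [0.0, 0.0], [0.0, 0.0]
        for _ in range(batch):
            train, test = sample_task(rng)
            d_t, d_a = meta_grads(theta, alpha, train, test)
            sum_t = [s + d for s, d in zip(sum_t, d_t)]
            sum_a = [s + d for s, d in zip(sum_a, d_a)]
        # Outer loop: joint update of initialization and step sizes.
        theta = [t - beta * s / batch for t, s in zip(theta, sum_t)]
        alpha = [a - beta * s / batch for a, s in zip(alpha, sum_a)]
    return theta, alpha


eval_rng = random.Random(1)
eval_tasks = [sample_task(eval_rng) for _ in range(50)]
before = evaluate([0.0, 0.0], [0.01, 0.01], eval_tasks)
theta, alpha = meta_train()
after = evaluate(theta, alpha, eval_tasks)
print(before, after)  # post-adaptation test loss before and after meta-training
```

At meta-test time only `adapt` is called: the learned `theta` and `alpha` are applied once to the new task's training set.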

5. Experimental Protocols and Performance

Experiments on regression, classification, and reinforcement learning compare Meta-SGD with MAML and other meta-learners in few-shot regimes. In every setting reported below, Meta-SGD attains the best result among the listed methods.

Regression: Sine-Wave Fitting

  • Task: Fit $y(x) = A\sin(\omega x + b)$ for amplitude $A$, frequency $\omega$, and phase $b$ sampled uniformly.
  • Network: 1–40–40–1 (ReLU).
  • Inner adaptation: a single update step.
  • Meta-SGD (element-wise $\alpha$) outperforms MAML (fixed $\alpha$) in MSE.

| Meta-train | Model | 5-shot test (MSE) | 20-shot test (MSE) |
|---|---|---|---|
| 5-shot | MAML | 1.13 ± 0.18 | 0.71 ± 0.12 |
| 5-shot | Meta-SGD | 0.90 ± 0.16 | 0.50 ± 0.10 |
| 20-shot | MAML | 1.29 ± 0.20 | 0.48 ± 0.08 |
| 20-shot | Meta-SGD | 1.01 ± 0.17 | 0.31 ± 0.05 |
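A sine-wave task of this kind could be generated roughly as follows. The sampling ranges for amplitude, frequency, phase, and inputs are placeholder defaults chosen for illustration and are not stated in the text above; `sample_sine_task` is likewise a hypothetical helper name.

```python
import math
import random
from typing import List, Tuple

Data = List[Tuple[float, float]]  # (x, y) pairs


def sample_sine_task(
    rng: random.Random,
    k_shot: int,
    amp_range: Tuple[float, float] = (0.1, 5.0),       # assumed range
    freq_range: Tuple[float, float] = (0.8, 1.2),      # assumed range
    phase_range: Tuple[float, float] = (0.0, math.pi), # assumed range
    x_range: Tuple[float, float] = (-5.0, 5.0),        # assumed range
):
    """Sample one regression task y = A * sin(w * x + b) with K train and K test points."""
    amp = rng.uniform(*amp_range)
    freq = rng.uniform(*freq_range)
    phase = rng.uniform(*phase_range)

    def draw(n: int) -> Data:
        xs = [rng.uniform(*x_range) for _ in range(n)]
        return [(x, amp * math.sin(freq * x + phase)) for x in xs]

    return draw(k_shot), draw(k_shot), (amp, freq, phase)


rng = random.Random(0)
train, test, (amp, freq, phase) = sample_sine_task(rng, k_shot=5)
print(len(train), len(test))  # 5 5
```

Each call yields a fresh task, so a 5-shot meta-training batch is a list of such samples with `k_shot=5`.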

Classification: Omniglot and MiniImageNet

  • Encoder: four convolutional blocks (3×3 convolution, batch normalization, ReLU, pooling).
  • Settings: 1-shot and 5-shot, 5-way and 20-way; values below are classification accuracy.

| Model | Omniglot 5-way 1-shot | Omniglot 20-way 5-shot | MiniImageNet 5-way 1-shot | MiniImageNet 20-way 5-shot |
|---|---|---|---|---|
| Matching | 98.1% | 98.5% | 43.6% | 22.7% |
| MAML | 98.7% | 98.9% | 48.7% | 19.3% |
| Meta-LSTM | n/a | n/a | 43.4% | 26.1% |
| Meta-SGD | 99.53% | 98.97% | 50.5% | 28.9% |
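For readers unfamiliar with the N-way K-shot terminology, the sketch below builds one classification episode from a class-indexed dataset: N classes are drawn, then K support examples and a few query examples per class. The toy dataset of string identifiers and the helper name `sample_episode` are assumptions for illustration only.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, int]  # (item identifier, episode-local label)


def sample_episode(
    dataset: Dict[str, List[str]],
    n_way: int,
    k_shot: int,
    n_query: int,
    rng: random.Random,
) -> Tuple[List[Example], List[Example]]:
    """Draw an N-way K-shot episode: a support (train) set and a query (test) set."""
    classes = rng.sample(sorted(dataset), n_way)
    support: List[Example] = []
    query: List[Example] = []
    for label, cls in enumerate(classes):
        # Sample without replacement so support and query items do not overlap.
        items = rng.sample(dataset[cls], k_shot + n_query)
        support += [(item, label) for item in items[:k_shot]]
        query += [(item, label) for item in items[k_shot:]]
    return support, query


# Hypothetical dataset: 30 classes with 20 item identifiers each.
toy = {f"class_{c}": [f"class_{c}_img_{i}" for i in range(20)] for c in range(30)}

rng = random.Random(0)
support, query = sample_episode(toy, n_way=20, k_shot=5, n_query=5, rng=rng)
print(len(support), len(query))  # 100 100
```

The support set plays the role of the task's training set in the inner loop, and the query set plays the role of its test set in the outer loop.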

Reinforcement Learning: 2D Navigation

  • Policy: Gaussian action output.
  • Task: Navigation in the 2D plane, with fixed and varying start positions.
  • Outer-loop: TRPO.

| Model | Fixed start | Varying start |
|---|---|---|
| MAML | −9.12 ± 0.66 | −10.71 ± 0.76 |
| Meta-SGD | −8.64 ± 0.68 | −10.15 ± 0.62 |

6. Limitations and Open Problems

  • Computational Expense: Meta-training entails numerous simulated inner-loop updates across tasks, which is computationally intensive for large models or extended unrolled adaptation.
  • Generalization Beyond $p(\mathcal{T})$: Performance may degrade on test tasks that differ substantially from the training distribution, suggesting limited extrapolation capability.
  • Scalability to Many-Shot Regimes: Single-step adaptation may underfit when tasks have rich support; multi-step or hierarchical meta-learning approaches may be required.
  • Task Heterogeneity: Effective handling of highly diverse families of tasks (e.g., spanning multiple modalities) with a single meta-learner is an unresolved challenge.

These issues remain significant avenues for the development of more scalable and robust meta-learning algorithms (Li et al., 2017).
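To illustrate the many-shot limitation noted above, the sketch below repeats the learned element-wise update several times on a simple separable quadratic loss and compares the result with a single step. Reusing the same step-size vector at every step is an assumption made for this illustration, not something the original method prescribes, and the loss, values, and function names are hypothetical.

```python
from typing import List

Vector = List[float]


def quad_loss(theta: Vector, target: Vector, curv: Vector) -> float:
    """Separable quadratic loss 0.5 * sum_i c_i * (theta_i - target_i)^2."""
    return 0.5 * sum(c * (t - m) ** 2 for t, m, c in zip(theta, target, curv))


def quad_grad(theta: Vector, target: Vector, curv: Vector) -> Vector:
    return [c * (t - m) for t, m, c in zip(theta, target, curv)]


def adapt_k(theta: Vector, alpha: Vector, target: Vector, curv: Vector, steps: int) -> Vector:
    """Apply the element-wise update `steps` times with the same alpha."""
    for _ in range(steps):
        g = quad_grad(theta, target, curv)
        theta = [t - a * gi for t, a, gi in zip(theta, alpha, g)]
    return theta


theta0 = [0.0, 0.0]
alpha = [0.5, 0.1]
target, curv = [1.0, -2.0], [1.0, 4.0]

loss0 = quad_loss(theta0, target, curv)
loss1 = quad_loss(adapt_k(theta0, alpha, target, curv, steps=1), target, curv)
loss3 = quad_loss(adapt_k(theta0, alpha, target, curv, steps=3), target, curv)
print(loss0, loss1, loss3)  # loss shrinks as more steps are taken
```

Meta-training through several such steps requires backpropagating through the whole unrolled chain, which is where the computational expense listed above comes from.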
