Meta-SGD: A Meta-Learning Algorithm

Updated 19 March 2026

Meta-SGD is a meta-learning algorithm that jointly learns initialization parameters and elementwise step-sizes to enable swift adaptation in few-shot scenarios.
It extends the MAML framework by learning a separate step-size for each parameter, which adds flexibility and improves performance across regression, classification, and control tasks.
The method employs inner-loop adaptations and meta-gradient backpropagation, demonstrating effective generalization in both supervised and reinforcement learning settings.

Meta-SGD is a meta-learning algorithm that enables rapid adaptation to new tasks by learning a parameter initialization and a per-parameter update rule within an SGD-like framework. Designed for few-shot learning, Meta-SGD combines the computational simplicity of model-agnostic meta-learning (MAML) with the flexibility of learning both step-sizes and update directions. It supports both supervised and reinforcement learning scenarios, demonstrating efficacy across regression, classification, and control tasks (Li et al., 2017).

1. Meta-Learning Objective and Formalism

Meta-SGD operates under a meta-learning paradigm where a distribution over tasks, $p(\mathcal{T})$, is assumed. Each task $\mathcal{T}$ provides a small training set, $\mathrm{train}(\mathcal{T})$, and a corresponding test set, $\mathrm{test}(\mathcal{T})$. The empirical losses are defined as

$\mathcal{L}_{\mathrm{train}(\mathcal{T})}(\theta) = \frac{1}{|\mathrm{train}(\mathcal{T})|}\sum_{(x, y)\in\mathrm{train}(\mathcal{T})}\ell(f_\theta(x), y),$

$\mathcal{L}_{\mathrm{test}(\mathcal{T})}(\theta) = \frac{1}{|\mathrm{test}(\mathcal{T})|}\sum_{(x, y)\in\mathrm{test}(\mathcal{T})}\ell(f_\theta(x), y).$
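As a minimal sketch of the two empirical losses, the snippet below averages a per-example loss over a task's training and test sets. The linear model, the squared-error loss, and the toy data are illustrative assumptions and are not taken from the source.

```python
def empirical_loss(theta, dataset, model, loss):
    """Mean of loss(model(theta, x), y) over all (x, y) pairs in dataset."""
    return sum(loss(model(theta, x), y) for x, y in dataset) / len(dataset)

def linear_model(theta, x):
    # Hypothetical model f_theta(x) = w * x + b, chosen only for illustration.
    w, b = theta
    return w * x + b

def squared_error(pred, y):
    return (pred - y) ** 2

# Toy task: a small training set and a test set (made-up numbers).
train_set = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
test_set = [(3.0, 7.0)]

theta = [2.0, 0.0]
train_loss = empirical_loss(theta, train_set, linear_model, squared_error)
test_loss = empirical_loss(theta, test_set, linear_model, squared_error)
print(train_loss, test_loss)
```

Every prediction here is off by exactly 1, so both losses equal 1.0.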

Meta-SGD introduces two meta-parameters: an initialization vector $\theta \in \mathbb{R}^D$ and an element-wise step-size vector $\alpha \in \mathbb{R}^D$. The adaptation for a sampled task is

$\theta' = \theta - \alpha \odot \nabla_\theta \mathcal{L}_{\mathrm{train}(\mathcal{T})}(\theta),$

where $\odot$ denotes element-wise multiplication. The meta-objective is to minimize the expected test loss after this adaptation:

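$\min_{\theta, \alpha}\; \mathbb{E}_{\mathcal{T}\sim p(\mathcal{T})}\left[\mathcal{L}_{\mathrm{test}(\mathcal{T})}(\theta')\right] = \mathbb{E}_{\mathcal{T}\sim p(\mathcal{T})}\left[\mathcal{L}_{\mathrm{test}(\mathcal{T})}\left(\theta - \alpha \odot \nabla_\theta \mathcal{L}_{\mathrm{train}(\mathcal{T})}(\theta)\right)\right].$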
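The adaptation step $\theta' = \theta - \alpha \odot \nabla_\theta \mathcal{L}_{\mathrm{train}(\mathcal{T})}(\theta)$ and the quantity the meta-objective averages can be sketched for a single toy task as follows. The two-parameter linear model, the hand-picked values of `theta` and `alpha`, and the data are assumptions made for illustration; in Meta-SGD both vectors would be learned.

```python
def mse(theta, data):
    w, b = theta
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def grad_mse(theta, data):
    # Gradient of the mean squared error with respect to (w, b).
    w, b = theta
    n = len(data)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
    return [grad_w, grad_b]

def adapt(theta, alpha, data):
    # theta' = theta - alpha (element-wise) * gradient of the training loss
    grad = grad_mse(theta, data)
    return [t - a * g for t, a, g in zip(theta, alpha, grad)]

theta = [0.0, 0.0]      # initialization (hand-picked here)
alpha = [0.1, 0.5]      # one step-size per parameter (hand-picked here)
train_set = [(1.0, 2.0), (2.0, 4.0)]
test_set = [(3.0, 6.0)]

theta_prime = adapt(theta, alpha, train_set)
loss_before = mse(theta, test_set)
loss_after = mse(theta_prime, test_set)   # the term inside the expectation
print(theta_prime, loss_before, loss_after)
```

With these numbers the training gradient is [-10, -6], so the adapted parameters are [1.0, 3.0] and the test loss drops from 36 to 0. A single shared step-size could not move the two coordinates by these different amounts.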

2. Meta-SGD Update Derivation

a. Inner-Loop Adaptation

For each task, the adaptation step is computed using the current meta-parameters:

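$\mathcal{L}_{\mathrm{train}(\mathcal{T}_i)}(\theta) = \frac{1}{|\mathrm{train}(\mathcal{T}_i)|}\sum_{(x, y)\in\mathrm{train}(\mathcal{T}_i)}\ell(f_\theta(x), y),$

$\theta_i' = \theta - \alpha \odot \nabla_\theta \mathcal{L}_{\mathrm{train}(\mathcal{T}_i)}(\theta).$

The vector $\alpha$ is learned to encode both the direction and the magnitude of the update: its entries are unconstrained, so $\alpha \odot \nabla_\theta \mathcal{L}_{\mathrm{train}(\mathcal{T}_i)}(\theta)$ need not point along the gradient.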

b. Outer-Loop Meta-Update

Performance is evaluated on the test set for each task after the inner-loop adaptation. Aggregating over a batch of tasks, the meta-parameters are updated as follows:

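$(\theta, \alpha) \leftarrow (\theta, \alpha) - \beta\, \nabla_{(\theta, \alpha)} \sum_{\mathcal{T}_i} \mathcal{L}_{\mathrm{test}(\mathcal{T}_i)}(\theta_i'),$

where $\beta$ is the meta-step size.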
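A minimal sketch of the outer-loop step, assuming the per-task meta-gradients with respect to the initialization and the step-size vector have already been computed. The function name `meta_update` and the numeric values are hypothetical. The gradients are summed over the batch and both meta-parameters move by the same meta-step size.

```python
def meta_update(theta, alpha, per_task_grads, beta):
    """One outer-loop step.

    per_task_grads is a list of (grad_theta, grad_alpha) pairs, one per task,
    each holding the gradient of that task's post-adaptation test loss.
    """
    d = len(theta)
    sum_g_theta = [sum(g_t[i] for g_t, _ in per_task_grads) for i in range(d)]
    sum_g_alpha = [sum(g_a[i] for _, g_a in per_task_grads) for i in range(d)]
    new_theta = [t - beta * g for t, g in zip(theta, sum_g_theta)]
    new_alpha = [a - beta * g for a, g in zip(alpha, sum_g_alpha)]
    return new_theta, new_alpha

# Made-up meta-gradients for a batch of two tasks.
theta = [1.0, 2.0]
alpha = [0.1, 0.1]
per_task_grads = [
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 2.0], [-0.5, 0.5]),
]
new_theta, new_alpha = meta_update(theta, alpha, per_task_grads, beta=0.1)
print(new_theta, new_alpha)
```

Here the summed gradients are [2, 2] for the initialization and [0, 1] for the step-sizes, so the update yields approximately [0.8, 1.8] and [0.1, 0.0].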

c. Meta-Gradient Through Adaptation

Gradients are backpropagated through the inner-loop adaptation using the chain rule, yielding:

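$\nabla_\theta \mathcal{L}_{\mathrm{test}(\mathcal{T}_i)}(\theta_i') = \left(I - \nabla_\theta^2 \mathcal{L}_{\mathrm{train}(\mathcal{T}_i)}(\theta)\,\mathrm{diag}(\alpha)\right)\nabla_{\theta_i'} \mathcal{L}_{\mathrm{test}(\mathcal{T}_i)}(\theta_i'),$

$\nabla_\alpha \mathcal{L}_{\mathrm{test}(\mathcal{T}_i)}(\theta_i') = -\nabla_\theta \mathcal{L}_{\mathrm{train}(\mathcal{T}_i)}(\theta) \odot \nabla_{\theta_i'} \mathcal{L}_{\mathrm{test}(\mathcal{T}_i)}(\theta_i').$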

This approach leverages automatic differentiation frameworks for efficient meta-gradient computation.
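To make the chain rule concrete, the sketch below differentiates through one adaptation step $\theta' = \theta - \alpha \odot g$, where $g$ is the training gradient at $\theta$ and $H$ is the training Hessian. The meta-gradient with respect to the initialization is then $(I - H\,\mathrm{diag}(\alpha))\,v$ and the one with respect to the step-sizes is $-g \odot v$, with $v$ the test gradient at $\theta'$. The code checks both against central finite differences. The linear least-squares model and the data are assumptions chosen so that the Hessian has a closed form; a real implementation would normally rely on automatic differentiation instead.

```python
def mse(theta, data):
    w, b = theta
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def grad_mse(theta, data):
    w, b = theta
    n = len(data)
    return [sum(2 * (w * x + b - y) * x for x, y in data) / n,
            sum(2 * (w * x + b - y) for x, y in data) / n]

def hessian_mse(data):
    # Hessian of the MSE of a linear model; it does not depend on theta.
    n = len(data)
    hxx = sum(2 * x * x for x, _ in data) / n
    hx = sum(2 * x for x, _ in data) / n
    return [[hxx, hx], [hx, 2.0]]

def adapt(theta, alpha, train):
    g = grad_mse(theta, train)
    return [t - a * gi for t, a, gi in zip(theta, alpha, g)]

def meta_objective(theta, alpha, train, test):
    return mse(adapt(theta, alpha, train), test)

def analytic_meta_grads(theta, alpha, train, test):
    g = grad_mse(theta, train)                       # training gradient at theta
    H = hessian_mse(train)                           # training Hessian
    v = grad_mse(adapt(theta, alpha, train), test)   # test gradient at theta'
    d = len(theta)
    g_theta = [v[i] - sum(H[i][j] * alpha[j] * v[j] for j in range(d))
               for i in range(d)]
    g_alpha = [-g[i] * v[i] for i in range(d)]
    return g_theta, g_alpha

def numeric_grad(f, x, eps=1e-5):
    grads = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grads.append((f(hi) - f(lo)) / (2 * eps))
    return grads

theta, alpha = [0.3, -0.2], [0.05, 0.1]
train_set = [(-1.0, 0.5), (0.5, 1.0), (2.0, 3.5)]
test_set = [(1.0, 2.0), (-0.5, 0.0)]

g_theta, g_alpha = analytic_meta_grads(theta, alpha, train_set, test_set)
num_theta = numeric_grad(lambda t: meta_objective(t, alpha, train_set, test_set), theta)
num_alpha = numeric_grad(lambda a: meta_objective(theta, a, train_set, test_set), alpha)
max_err = max(abs(a - n) for a, n in zip(g_theta + g_alpha, num_theta + num_alpha))
print(max_err)
```

The Hessian term is what makes the meta-gradient second-order. The gradient with respect to the step-sizes needs only first-order quantities.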

3. Comparisons to Alternative Meta-Learners

The distinguishing characteristics of Meta-SGD, MAML, and Meta-LSTM are summarized as follows:

Algorithm	Learned Parameters	Update Dynamics	Computational Complexity
MAML	$\theta$	Fixed scalar $\alpha$ (hyperparameter)	Simple, end-to-end, global rate
Meta-LSTM	Via RNN: $\theta$, update	RNN-parameterized	Flexible, but high cost
Meta-SGD	$\theta$, elementwise $\alpha$	SGD-like, per-coordinate rate & direction	Simple, efficient, per-coordinate

MAML learns only the initialization $\theta$, with a global, hand-chosen step-size hyperparameter $\alpha$. While broadly applicable, its capacity is constrained by the fixed update structure.
Meta-LSTM employs an RNN or LSTM to generate parameter updates, enabling high optimizer flexibility at the cost of significant computational overhead, scalability challenges, and complicated training.
Meta-SGD simultaneously learns the initialization and a vector of step-sizes (including possible sign changes), providing higher capacity than MAML while retaining MAML's implementation simplicity and scalability.
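The difference between a global scalar step and a per-coordinate vector can be sketched with one update rule that accepts either. The function name `apply_update` and the numbers are illustrative assumptions. A scalar step reproduces the MAML-style inner update, while a vector step with a negative entry moves that coordinate against plain gradient descent.

```python
def apply_update(theta, step, grad):
    # A scalar step is broadcast to every coordinate (MAML-style);
    # a list supplies one step-size per coordinate (Meta-SGD-style).
    if isinstance(step, (int, float)):
        step = [step] * len(theta)
    return [t - s * g for t, s, g in zip(theta, step, grad)]

theta = [1.0, 1.0]
grad = [2.0, -4.0]

scalar_result = apply_update(theta, 0.5, grad)            # same rate everywhere
vector_result = apply_update(theta, [0.5, -0.25], grad)   # per-coordinate rate and sign
print(scalar_result, vector_result)
```

The scalar step gives [0.0, 3.0]. The vector step gives [0.0, 0.0], because the negative second entry reverses the direction in which that coordinate moves.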

4. Algorithmic Implementation

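The meta-learning loop for supervised few-shot learning is as follows (using LaTeX pseudocode):

$\begin{array}{ll}
\textbf{Input:} & \text{task distribution } p(\mathcal{T}),\ \text{meta-step size } \beta \\
\textbf{Output:} & \theta,\ \alpha \\
1: & \text{Initialize } \theta,\ \alpha \\
2: & \textbf{while } \text{not done } \textbf{do} \\
3: & \quad \text{Sample a batch of tasks } \mathcal{T}_i \sim p(\mathcal{T}) \\
4: & \quad \textbf{for all } \mathcal{T}_i \textbf{ do} \\
5: & \quad\quad \theta_i' \leftarrow \theta - \alpha \odot \nabla_\theta \mathcal{L}_{\mathrm{train}(\mathcal{T}_i)}(\theta) \\
6: & \quad\quad \text{Evaluate } \mathcal{L}_{\mathrm{test}(\mathcal{T}_i)}(\theta_i') \\
7: & \quad \textbf{end for} \\
8: & \quad (\theta, \alpha) \leftarrow (\theta, \alpha) - \beta\, \nabla_{(\theta, \alpha)} \sum_{\mathcal{T}_i} \mathcal{L}_{\mathrm{test}(\mathcal{T}_i)}(\theta_i') \\
9: & \textbf{end while}
\end{array}$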

At meta-test time, adaptation on a new task involves one application of the learned update.
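The loop can be sketched end to end on a toy problem. Everything specific below is an assumption made for illustration: the task family (random 1D lines), the linear model, the hyperparameters, and the use of closed-form meta-gradients in place of automatic differentiation. The batch gradient is averaged rather than summed, which only rescales the meta-step size.

```python
import random

def predict(theta, x):
    return theta[0] * x + theta[1]

def mse(theta, data):
    return sum((predict(theta, x) - y) ** 2 for x, y in data) / len(data)

def grad_mse(theta, data):
    n = len(data)
    res = [(predict(theta, x) - y, x) for x, y in data]
    return [sum(2 * e * x for e, x in res) / n, sum(2 * e for e, _ in res) / n]

def hessian_mse(data):
    n = len(data)
    hxx = sum(2 * x * x for x, _ in data) / n
    hx = sum(2 * x for x, _ in data) / n
    return [[hxx, hx], [hx, 2.0]]

def sample_task(rng, k):
    # Hypothetical task family: y = a * x + c with random slope and offset.
    a, c = rng.uniform(1.0, 3.0), rng.uniform(-1.0, 1.0)
    def draw():
        return [(x, a * x + c) for x in (rng.uniform(-1.0, 1.0) for _ in range(k))]
    return draw(), draw()  # (train set, test set)

def adapt(theta, alpha, train):
    g = grad_mse(theta, train)
    return [t - a * gi for t, a, gi in zip(theta, alpha, g)]

def meta_grads(theta, alpha, train, test):
    g, H = grad_mse(theta, train), hessian_mse(train)
    v = grad_mse(adapt(theta, alpha, train), test)
    d = len(theta)
    g_theta = [v[i] - sum(H[i][j] * alpha[j] * v[j] for j in range(d)) for i in range(d)]
    g_alpha = [-g[i] * v[i] for i in range(d)]
    return g_theta, g_alpha

def meta_train(steps=500, batch=4, k=10, beta=0.01, seed=0):
    rng = random.Random(seed)
    theta, alpha = [0.0, 0.0], [0.01, 0.01]
    for _ in range(steps):
        sum_t, sum_a = [0.0, 0.0], [0.0, 0.0]
        for _ in range(batch):
            train, test = sample_task(rng, k)
            g_t, g_a = meta_grads(theta, alpha, train, test)
            sum_t = [s + g / batch for s, g in zip(sum_t, g_t)]
            sum_a = [s + g / batch for s, g in zip(sum_a, g_a)]
        theta = [t - beta * g for t, g in zip(theta, sum_t)]
        alpha = [a - beta * g for a, g in zip(alpha, sum_a)]
    return theta, alpha

def evaluate(theta, alpha, tasks):
    # Average test loss after one adaptation step on each task.
    return sum(mse(adapt(theta, alpha, tr), te) for tr, te in tasks) / len(tasks)

eval_rng = random.Random(123)
eval_tasks = [sample_task(eval_rng, 10) for _ in range(200)]
loss_before = evaluate([0.0, 0.0], [0.01, 0.01], eval_tasks)
theta, alpha = meta_train()
loss_after = evaluate(theta, alpha, eval_tasks)
print(loss_before, loss_after, theta, alpha)
```

The final `evaluate` call mirrors meta-test time: each held-out task gets exactly one application of the learned update before its test loss is measured.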

5. Experimental Protocols and Performance

In experiments on regression, classification, and reinforcement learning, Meta-SGD outperforms the compared baselines in every few-shot setting reported below.

Regression: Sine-Wave Fitting

Task: Fit $y = A\sin(\omega x + b)$ for $A, \omega, b$ sampled uniformly.
Network: 1–40–40–1 (ReLU).
Inner adaptation: a single update step.
Meta-SGD (element-wise $\alpha$) achieves lower MSE than MAML (fixed $\alpha$).

Meta-train	Model	5-shot test	20-shot test
5-shot	MAML	1.13 ± 0.18	0.71 ± 0.12
	Meta-SGD	0.90 ± 0.16	0.50 ± 0.10
20-shot	MAML	1.29 ± 0.20	0.48 ± 0.08
	Meta-SGD	1.01 ± 0.17	0.31 ± 0.05
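A sine-wave regression task of this kind could be generated as sketched below. The functional form (amplitude, frequency, phase), the sampling ranges, and the input interval are assumptions chosen for illustration; the text above does not state them, so they should be replaced with the values of the actual protocol.

```python
import math
import random

def sample_sine_task(rng,
                     amp_range=(0.1, 5.0),          # assumed range
                     freq_range=(0.8, 1.2),         # assumed range
                     phase_range=(0.0, math.pi)):   # assumed range
    """Draw one task: a sine function with uniformly sampled parameters."""
    amp = rng.uniform(*amp_range)
    freq = rng.uniform(*freq_range)
    phase = rng.uniform(*phase_range)
    def target(x):
        return amp * math.sin(freq * x + phase)
    return target, (amp, freq, phase)

def sample_points(rng, target, k, x_range=(-5.0, 5.0)):  # assumed input interval
    """Draw k (x, y) pairs from the task; k is the number of shots."""
    xs = [rng.uniform(*x_range) for _ in range(k)]
    return [(x, target(x)) for x in xs]

rng = random.Random(0)
target, (amp, freq, phase) = sample_sine_task(rng)
train_5shot = sample_points(rng, target, k=5)
test_20shot = sample_points(rng, target, k=20)
print(len(train_5shot), len(test_20shot))
```

Drawing the training and test points from the same sampled function is what makes each (train, test) pair a single task. The 5-shot and 20-shot columns in the table correspond to different values of `k`.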

Classification: Omniglot and MiniImageNet

Encoder: 4-layer (conv3×3–BN–ReLU–pool).
Experiments cover 1-shot and 5-shot settings for both 5-way and 20-way classification.

Model	Omniglot 5-way 1-shot	Omniglot 20-way 5-shot	MiniImageNet 5-way 1-shot	MiniImageNet 20-way 5-shot
Matching	98.1%	98.5%	43.6%	22.7%
MAML	98.7%	98.9%	48.7%	19.3%
Meta-LSTM	—	—	43.4%	26.1%
Meta-SGD	99.53%	98.97%	50.5%	28.9%
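An N-way K-shot episode, as used in these benchmarks, can be assembled as in the following sketch. The function name `sample_episode`, the query-set size, and the toy data are hypothetical; real pipelines would load Omniglot or MiniImageNet images in place of the placeholder strings.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Build one classification task.

    Picks n_way classes, then k_shot support and n_query query examples per
    class, relabelling the chosen classes as 0..n_way-1 for this episode.
    """
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        examples = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

rng = random.Random(0)
# Placeholder dataset: 30 classes with 20 fake examples each.
toy_data = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(30)}

support, query = sample_episode(toy_data, n_way=5, k_shot=1, n_query=15, rng=rng)
print(len(support), len(query))
```

The support set plays the role of $\mathrm{train}(\mathcal{T})$ and the query set the role of $\mathrm{test}(\mathcal{T})$. Sampling without replacement within a class keeps the two disjoint.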

Reinforcement Learning: 2D Navigation

Policy: Gaussian action output.
Task: Navigation in 2D, fixed and varying start.
Outer-loop: TRPO.

Model	Fixed start	Varying start
MAML	−9.12 ± 0.66	−10.71 ± 0.76
Meta-SGD	−8.64 ± 0.68	−10.15 ± 0.62
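A Gaussian action output can be sketched as sampling from, and scoring under, a diagonal Gaussian. In the setting above the mean would come from the policy network; here it is a fixed hypothetical vector, and the state-independent log standard deviation is likewise an assumption. The log-probability is the quantity a policy-gradient method differentiates.

```python
import math
import random

def sample_action(mean, log_std, rng):
    # a_i = mean_i + exp(log_std_i) * eps_i, with eps_i ~ N(0, 1)
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mean, log_std)]

def log_prob(action, mean, log_std):
    # Log-density of a diagonal Gaussian, summed over action dimensions.
    total = 0.0
    for a, m, s in zip(action, mean, log_std):
        z = (a - m) / math.exp(s)
        total += -0.5 * z * z - s - 0.5 * math.log(2 * math.pi)
    return total

rng = random.Random(0)
mean = [0.1, -0.2]       # hypothetical network output for a 2D action
log_std = [0.0, 0.0]     # unit standard deviation in each dimension

action = sample_action(mean, log_std, rng)
lp_action = log_prob(action, mean, log_std)
lp_at_mean = log_prob(mean, mean, log_std)
print(action, lp_action, lp_at_mean)
```

With unit standard deviation in two dimensions, the log-density at the mean equals $-\log(2\pi)$, which gives a quick sanity check on the implementation.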

6. Limitations and Open Problems

Computational Expense: Meta-training entails numerous simulated inner-loop updates across tasks, which is computationally intensive for large models or extended unrolled adaptation.
Generalization Beyond $p(\mathcal{T})$: Performance may degrade on test tasks that differ substantially from the training distribution, suggesting limited extrapolation capability.
Scalability to Many-Shot Regimes: Single-step adaptation may underfit when tasks provide large support sets; multi-step or hierarchical meta-learning approaches may be required.
Task Heterogeneity: Effective handling of highly diverse families of tasks (e.g., spanning multiple modalities) with a single meta-learner is an unresolved challenge.

These issues remain significant avenues for the development of more scalable and robust meta-learning algorithms (Li et al., 2017).


References

Li et al. (2017). Meta-SGD: Learning to Learn Quickly for Few-Shot Learning.
