Meta-SGD: A Meta-Learning Algorithm
- Meta-SGD is a meta-learning algorithm that jointly learns initialization parameters and elementwise step-sizes to enable swift adaptation in few-shot scenarios.
- It extends the MAML framework by incorporating per-parameter learning, enhancing flexibility and performance across regression, classification, and control tasks.
- The method employs inner-loop adaptations and meta-gradient backpropagation, demonstrating effective generalization in both supervised and reinforcement learning settings.
Meta-SGD is a meta-learning algorithm that enables rapid adaptation to new tasks by learning a parameter initialization and a per-parameter update rule within an SGD-like framework. Designed for few-shot learning, Meta-SGD combines the computational simplicity of model-agnostic meta-learning (MAML) with the flexibility of learning both step-sizes and update directions. It supports both supervised and reinforcement learning scenarios, demonstrating efficacy across regression, classification, and control tasks (Li et al., 2017).
1. Meta-Learning Objective and Formalism
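Meta-SGD operates under a meta-learning paradigm where a distribution over tasks, $p(\mathcal{T})$, is assumed. Each task $\mathcal{T}$ provides a small training set, $\text{train}(\mathcal{T})$, and a corresponding test set, $\text{test}(\mathcal{T})$. For a model $f_\theta$ with per-example loss $\ell$, the empirical losses are defined as
$$\mathcal{L}_{\text{train}(\mathcal{T})}(\theta) = \frac{1}{|\text{train}(\mathcal{T})|} \sum_{(x, y) \in \text{train}(\mathcal{T})} \ell(f_\theta(x), y), \qquad \mathcal{L}_{\text{test}(\mathcal{T})}(\theta) = \frac{1}{|\text{test}(\mathcal{T})|} \sum_{(x, y) \in \text{test}(\mathcal{T})} \ell(f_\theta(x), y).$$
Meta-SGD introduces two meta-parameters: an initialization vector $\theta$ and an element-wise step-size vector $\alpha$ of the same dimension. The adaptation for a sampled task is
$$\theta' = \theta - \alpha \circ \nabla_\theta \mathcal{L}_{\text{train}(\mathcal{T})}(\theta),$$
where $\circ$ denotes element-wise multiplication. The meta-objective is to minimize the expected test loss after this adaptation:
$$\min_{\theta, \alpha} \; \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})} \left[ \mathcal{L}_{\text{test}(\mathcal{T})}(\theta') \right] = \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})} \left[ \mathcal{L}_{\text{test}(\mathcal{T})}\left(\theta - \alpha \circ \nabla_\theta \mathcal{L}_{\text{train}(\mathcal{T})}(\theta)\right) \right].$$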
2. Meta-SGD Update Derivation
a. Inner-Loop Adaptation
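For each task $\mathcal{T}_i$, the adaptation step is computed using the current meta-parameters:
$$g_i = \nabla_\theta \mathcal{L}_{\text{train}(\mathcal{T}_i)}(\theta)$$
$$\theta_i' = \theta - \alpha \circ g_i$$
The vector $\alpha$ is learned to encode both the direction and the magnitude of the update.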
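As a minimal sketch of this inner-loop step, the snippet below applies $\theta' = \theta - \alpha \circ \nabla_\theta \mathcal{L}_{\text{train}}(\theta)$ to a toy linear model with squared-error loss. The model, the data, and the names `loss`, `grad`, and `adapt` are assumptions made for illustration; they are not taken from Li et al. (2017).

```python
# Toy setting (an assumption for illustration): f_theta(x) = theta[0] * x + theta[1]
# trained with mean squared error. Pure Python, no external libraries.

def loss(theta, data):
    """Mean squared error of the linear model on a list of (x, y) pairs."""
    return sum((theta[0] * x + theta[1] - y) ** 2 for x, y in data) / len(data)

def grad(theta, data):
    """Analytic gradient of `loss` with respect to theta."""
    n = len(data)
    g0 = sum(2.0 * (theta[0] * x + theta[1] - y) * x for x, y in data) / n
    g1 = sum(2.0 * (theta[0] * x + theta[1] - y) for x, y in data) / n
    return [g0, g1]

def adapt(theta, alpha, train):
    """One Meta-SGD inner step: each coordinate uses its own step size."""
    g = grad(theta, train)
    return [t - a * gi for t, a, gi in zip(theta, alpha, g)]

theta = [0.0, 0.0]                 # initialization (meta-parameter)
alpha = [0.1, 0.5]                 # per-coordinate step sizes (meta-parameter)
train = [(1.0, 3.0), (2.0, 5.0)]   # hypothetical task: y = 2x + 1
test = [(3.0, 7.0)]

theta_prime = adapt(theta, alpha, train)   # approximately [1.3, 4.0]
loss_before = loss(theta, test)            # 49.0
loss_after = loss(theta_prime, test)       # about 0.81
```

In this example a single step with hand-picked step sizes already reduces the test loss substantially; meta-training would tune both `theta` and `alpha` so that this holds on average across tasks.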
b. Outer-Loop Meta-Update
Performance is evaluated on the test set for each task after the inner-loop adaptation. Aggregating over a batch of tasks, the meta-parameters are updated as follows:
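$$(\theta, \alpha) \leftarrow (\theta, \alpha) - \beta \, \nabla_{(\theta, \alpha)} \sum_{\mathcal{T}_i} \mathcal{L}_{\text{test}(\mathcal{T}_i)}(\theta_i'),$$
where $\theta_i'$ is the adapted parameter vector for task $\mathcal{T}_i$ and $\beta$ is the meta-step size.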
c. Meta-Gradient Through Adaptation
Gradients are backpropagated through the inner-loop adaptation using the chain rule, yielding:
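$$\nabla_\theta \mathcal{L}_{\text{test}(\mathcal{T}_i)}(\theta_i') = \left( I - \nabla_\theta^2 \mathcal{L}_{\text{train}(\mathcal{T}_i)}(\theta) \, \operatorname{diag}(\alpha) \right) \nabla_{\theta_i'} \mathcal{L}_{\text{test}(\mathcal{T}_i)}(\theta_i')$$
$$\nabla_\alpha \mathcal{L}_{\text{test}(\mathcal{T}_i)}(\theta_i') = - \nabla_\theta \mathcal{L}_{\text{train}(\mathcal{T}_i)}(\theta) \circ \nabla_{\theta_i'} \mathcal{L}_{\text{test}(\mathcal{T}_i)}(\theta_i')$$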
This approach leverages automatic differentiation frameworks for efficient meta-gradient computation.
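The chain-rule expressions can be checked numerically. The sketch below assumes a toy linear model with squared-error loss, so the training Hessian $H$ is available in closed form, and compares the analytic meta-gradients ($v - H(\alpha \circ v)$ for $\theta$ and $-g \circ v$ for $\alpha$, with $g$ the training gradient at $\theta$ and $v$ the test gradient at $\theta'$) against central finite differences. All function names and data are hypothetical.

```python
# Toy linear model f_theta(x) = theta[0] * x + theta[1] with mean squared error
# (an assumption for illustration; a real model would rely on autodiff).

def loss(theta, data):
    return sum((theta[0] * x + theta[1] - y) ** 2 for x, y in data) / len(data)

def grad(theta, data):
    n = len(data)
    g0 = sum(2.0 * (theta[0] * x + theta[1] - y) * x for x, y in data) / n
    g1 = sum(2.0 * (theta[0] * x + theta[1] - y) for x, y in data) / n
    return [g0, g1]

def hessian(data):
    """Hessian of the MSE loss; constant in theta for a linear model."""
    n = len(data)
    sxx = sum(2.0 * x * x for x, _ in data) / n
    sx = sum(2.0 * x for x, _ in data) / n
    return [[sxx, sx], [sx, 2.0]]

def adapt(theta, alpha, train):
    g = grad(theta, train)
    return [t - a * gi for t, a, gi in zip(theta, alpha, g)]

def meta_grads(theta, alpha, train, test):
    """Analytic gradients of L_test(theta') with respect to theta and alpha."""
    g = grad(theta, train)                    # training gradient at theta
    v = grad(adapt(theta, alpha, train), test)  # test gradient at theta'
    H = hessian(train)
    av = [a * vi for a, vi in zip(alpha, v)]
    d_theta = [v[i] - sum(H[i][j] * av[j] for j in range(2)) for i in range(2)]
    d_alpha = [-gi * vi for gi, vi in zip(g, v)]
    return d_theta, d_alpha

def finite_diff(f, p, eps=1e-6):
    """Central finite-difference gradient of scalar function f at point p."""
    out = []
    for i in range(len(p)):
        hi, lo = list(p), list(p)
        hi[i] += eps
        lo[i] -= eps
        out.append((f(hi) - f(lo)) / (2.0 * eps))
    return out

theta, alpha = [0.0, 0.0], [0.1, 0.5]
train, test = [(1.0, 3.0), (2.0, 5.0)], [(3.0, 7.0)]

d_theta, d_alpha = meta_grads(theta, alpha, train, test)
num_theta = finite_diff(lambda t: loss(adapt(t, alpha, train), test), theta)
num_alpha = finite_diff(lambda a: loss(adapt(theta, a, train), test), alpha)
```

Here both entries of `d_alpha` are positive, which indicates that the hand-picked step sizes overshoot on this task and that a meta-update would shrink them.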
3. Comparisons to Alternative Meta-Learners
The distinguishing characteristics of Meta-SGD, MAML, and Meta-LSTM are summarized as follows:
| Algorithm | Learned Parameters | Update Dynamics | Cost and Characteristics |
|---|---|---|---|
| MAML | $\theta$ | Fixed scalar $\alpha$ (hyperparameter) | Simple, end-to-end, global rate |
| Meta-LSTM | $\theta$ and the update rule, via an RNN | RNN-parameterized | Flexible, but high cost |
| Meta-SGD | $\theta$, element-wise $\alpha$ | SGD-like, per-coordinate rate and direction | Simple, efficient, per-coordinate |
- MAML learns only the initialization $\theta$, with a global, hand-chosen step-size hyperparameter $\alpha$. While broadly applicable, its capacity is constrained by the fixed update structure.
- Meta-LSTM employs an RNN or LSTM to generate parameter updates, enabling high optimizer flexibility at the cost of significant computational overhead, scalability challenges, and complicated training.
- Meta-SGD simultaneously learns the initialization and a vector of step-sizes (including possible sign changes), providing higher capacity than MAML while retaining MAML's implementation simplicity and scalability.
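The difference between the two SGD-like update rules can be made concrete with a small sketch. The function names `maml_step` and `meta_sgd_step` and the numbers are hypothetical; the point is only that a scalar step size is the special case of a constant vector, and that per-coordinate values may be zero or negative.

```python
def maml_step(theta, g, lr):
    """MAML-style inner step: one scalar learning rate for every coordinate."""
    return [t - lr * gi for t, gi in zip(theta, g)]

def meta_sgd_step(theta, g, alpha):
    """Meta-SGD-style inner step: a separate (learned) rate per coordinate."""
    return [t - a * gi for t, a, gi in zip(theta, alpha, g)]

theta = [1.0, 1.0, 1.0]
g = [0.5, -2.0, 4.0]   # hypothetical task gradient

# A constant alpha vector reproduces the scalar-rate update exactly.
same = meta_sgd_step(theta, g, [0.1, 0.1, 0.1])

# Mixed alpha: coordinate 0 descends, coordinate 1 is frozen (alpha = 0),
# and coordinate 2 moves up its gradient (alpha < 0).
mixed = meta_sgd_step(theta, g, [0.1, 0.0, -0.05])
```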
4. Algorithmic Implementation
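The meta-learning loop for supervised few-shot learning is as follows (using LaTeX pseudocode):
$$
\begin{array}{l}
\textbf{Input: } \text{task distribution } p(\mathcal{T}), \text{ meta-step size } \beta \\
\textbf{Output: } \theta, \alpha \\
\text{Initialize } \theta, \alpha \\
\textbf{while } \text{not done } \textbf{do} \\
\quad \text{Sample a batch of tasks } \mathcal{T}_i \sim p(\mathcal{T}) \\
\quad \textbf{for all } \mathcal{T}_i \textbf{ do} \\
\quad\quad \theta_i' \leftarrow \theta - \alpha \circ \nabla_\theta \mathcal{L}_{\text{train}(\mathcal{T}_i)}(\theta) \\
\quad\quad \text{Evaluate } \mathcal{L}_{\text{test}(\mathcal{T}_i)}(\theta_i') \\
\quad \textbf{end for} \\
\quad (\theta, \alpha) \leftarrow (\theta, \alpha) - \beta \, \nabla_{(\theta, \alpha)} \sum_{\mathcal{T}_i} \mathcal{L}_{\text{test}(\mathcal{T}_i)}(\theta_i') \\
\textbf{end while}
\end{array}
$$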
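At meta-test time, adaptation to a new task $\mathcal{T}$ is a single application of the learned update, $\theta' = \theta - \alpha \circ \nabla_\theta \mathcal{L}_{\text{train}(\mathcal{T})}(\theta)$, computed on that task's training set.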
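A runnable sketch of this loop is given below under strong simplifying assumptions: tasks are random lines $y = ax + b$, the model is linear with squared-error loss, and meta-gradients are written out analytically instead of being obtained by automatic differentiation. The task family, the hyperparameters, and every function name are hypothetical, and the meta-gradient is averaged over the batch (a constant factor that can be absorbed into $\beta$).

```python
import random

def mse(theta, data):
    return sum((theta[0] * x + theta[1] - y) ** 2 for x, y in data) / len(data)

def grad(theta, data):
    n = len(data)
    g0 = sum(2.0 * (theta[0] * x + theta[1] - y) * x for x, y in data) / n
    g1 = sum(2.0 * (theta[0] * x + theta[1] - y) for x, y in data) / n
    return [g0, g1]

def hessian(data):
    n = len(data)
    sxx = sum(2.0 * x * x for x, _ in data) / n
    sx = sum(2.0 * x for x, _ in data) / n
    return [[sxx, sx], [sx, 2.0]]

def sample_task(rng, k=5):
    """Hypothetical task family: y = a*x + b, with K train and K test points."""
    a, b = rng.uniform(1.0, 3.0), rng.uniform(-1.0, 1.0)
    def draw():
        return [(x, a * x + b) for x in (rng.uniform(-1.0, 1.0) for _ in range(k))]
    return draw(), draw()

def adapt(theta, alpha, train):
    g = grad(theta, train)
    return [t - a * gi for t, a, gi in zip(theta, alpha, g)]

def meta_train(theta, alpha, beta=0.01, iters=500, batch=4, seed=0):
    rng = random.Random(seed)
    theta, alpha = list(theta), list(alpha)
    for _ in range(iters):
        d_theta, d_alpha = [0.0, 0.0], [0.0, 0.0]
        for _ in range(batch):
            train, test = sample_task(rng)
            g = grad(theta, train)                      # inner-loop gradient
            v = grad(adapt(theta, alpha, train), test)  # test gradient at theta'
            H = hessian(train)
            av = [a * vi for a, vi in zip(alpha, v)]
            for i in range(2):
                # Meta-gradients through the adaptation step (chain rule).
                d_theta[i] += (v[i] - H[i][0] * av[0] - H[i][1] * av[1]) / batch
                d_alpha[i] += -g[i] * v[i] / batch
        theta = [t - beta * d for t, d in zip(theta, d_theta)]
        alpha = [a - beta * d for a, d in zip(alpha, d_alpha)]
    return theta, alpha

def evaluate(theta, alpha, n_tasks=200, seed=123):
    """Average post-adaptation test loss on freshly sampled tasks."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_tasks):
        train, test = sample_task(rng)
        total += mse(adapt(theta, alpha, train), test)
    return total / n_tasks

theta0, alpha0 = [0.0, 0.0], [0.01, 0.01]
before = evaluate(theta0, alpha0)
theta, alpha = meta_train(theta0, alpha0)
after = evaluate(theta, alpha)
```

Comparing `before` and `after` gives the average one-step test loss on held-out tasks with the initial and the meta-trained `(theta, alpha)`. With a neural network, the hand-written Hessian term would be replaced by differentiating through the inner step with an autodiff framework.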
5. Experimental Protocols and Performance
Experiments on regression, classification, and reinforcement learning compare Meta-SGD with MAML and other baselines in few-shot regimes. In the results reported below, Meta-SGD matches or exceeds the baselines in every setting.
Regression: Sine-Wave Fitting
- Task: Fit $y = A \sin(\omega x + b)$, with amplitude $A$, frequency $\omega$, and phase $b$ sampled uniformly.
- Network: 1–40–40–1 (ReLU).
- Inner adaptation: one step.
- Meta-SGD (element-wise $\alpha$) achieves lower MSE than MAML (fixed $\alpha$).
| Meta-training | Model | 5-shot test MSE | 20-shot test MSE |
|---|---|---|---|
| 5-shot | MAML | 1.13 ± 0.18 | 0.71 ± 0.12 |
| 5-shot | Meta-SGD | 0.90 ± 0.16 | 0.50 ± 0.10 |
| 20-shot | MAML | 1.29 ± 0.20 | 0.48 ± 0.08 |
| 20-shot | Meta-SGD | 1.01 ± 0.17 | 0.31 ± 0.05 |
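For readers who want to reproduce the shape of this benchmark, a task sampler might look like the sketch below. The sampling ranges for $A$, $\omega$, and $b$, the input interval, and the function name `sample_sine_task` are illustrative assumptions of this sketch; the text above does not specify them.

```python
import math
import random

def sample_sine_task(rng, k_shot, amp=(0.1, 5.0), freq=(0.8, 1.2),
                     phase=(0.0, math.pi), x_range=(-5.0, 5.0)):
    """Draw one task y = A * sin(omega * x + b) with K train and K test points.

    All default ranges are assumptions for illustration.
    """
    A = rng.uniform(*amp)
    omega = rng.uniform(*freq)
    b = rng.uniform(*phase)

    def draw(n):
        xs = [rng.uniform(*x_range) for _ in range(n)]
        return [(x, A * math.sin(omega * x + b)) for x in xs]

    return {"A": A, "omega": omega, "b": b,
            "train": draw(k_shot), "test": draw(k_shot)}

rng = random.Random(0)             # fixed seed for reproducibility
task = sample_sine_task(rng, k_shot=5)
```

A K-shot regression episode then consists of adapting on `task["train"]` and measuring MSE on `task["test"]`.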
Classification: Omniglot and MiniImageNet
- Encoder: 4-layer (conv3×3–BN–ReLU–pool).
- Settings: 1-shot and 5-shot classification, each with 5 and 20 classes (ways); the table reports test accuracy for a subset of these.
| Model | Omniglot 5-way 1-shot | Omniglot 20-way 5-shot | MiniImageNet 5-way 1-shot | MiniImageNet 20-way 5-shot |
|---|---|---|---|---|
| Matching Nets | 98.1% | 98.5% | 43.6% | 22.7% |
| MAML | 98.7% | 98.9% | 48.7% | 19.3% |
| Meta-LSTM | — | — | 43.4% | 26.1% |
| Meta-SGD | 99.53% | 98.97% | 50.5% | 28.9% |
Reinforcement Learning: 2D Navigation
- Policy: Gaussian action output.
- Task: Point navigation in a 2D plane, with fixed and varying start positions.
- Outer-loop: TRPO.
| Model | Fixed start | Varying start |
|---|---|---|
| MAML | −9.12 ± 0.66 | −10.71 ± 0.76 |
| Meta-SGD | −8.64 ± 0.68 | −10.15 ± 0.62 |
6. Limitations and Open Problems
- Computational Expense: Meta-training entails numerous simulated inner-loop updates across tasks, which is computationally intensive for large models or extended unrolled adaptation.
- Generalization Beyond $p(\mathcal{T})$: Performance may degrade on test tasks that differ substantially from the training distribution, suggesting limited extrapolation capability.
- Scalability to Many-Shot Regimes: Single-step adaptation may underfit when tasks have rich support; multi-step or hierarchical meta-learning approaches may be required.
- Task Heterogeneity: Effective handling of highly diverse families of tasks (e.g., spanning multiple modalities) with a single meta-learner is an unresolved challenge.
These issues remain significant avenues for the development of more scalable and robust meta-learning algorithms (Li et al., 2017).