
FOMAML: Efficient First-Order Meta-Learning

Updated 7 February 2026
  • FOMAML is a meta-learning algorithm that approximates full MAML by removing second-order derivative computations, thus reducing computational overhead.
  • It achieves near-equivalent few-shot performance to MAML while substantially lowering memory and compute requirements, making it practical for large-scale models.
  • Recent variants such as Sign-MAML and FO-B-MAML reduce its gradient bias and strengthen convergence guarantees, improving performance and broadening its applicability in meta-learning.

First-Order Model-Agnostic Meta-Learning (FOMAML) is a computationally efficient algorithm for meta-learning that extends the Model-Agnostic Meta-Learning (MAML) framework by eliminating all second-order derivative computations during meta-optimization. FOMAML targets fast adaptation to new tasks by learning an initialization which can be fine-tuned using a small number of gradient steps, but unlike full MAML, it only requires first-order gradients for both inner- and outer-loop updates. This design tradeoff preserves most of MAML’s empirical performance in few-shot settings while substantially reducing memory and compute requirements for modern deep learning frameworks (Nichol et al., 2018).

1. Meta-Learning Objective and Formalism

FOMAML is defined over a distribution of tasks, each with a task-specific loss $L_\tau$. Given a task $\tau \sim p(\tau)$ with training ($A$) and validation ($B$) subsets, the meta-objective is

\min_\theta \; \mathbb{E}_{\tau \sim p(\tau)}\big[L_{\tau,B}\big(U^k_{\tau,A}(\theta)\big)\big]

where $U^k_{\tau,A}$ denotes applying $k$ inner-loop gradient steps (with step size $\alpha$) on the training split. If $L_{\tau,A}$ is twice continuously differentiable, a Taylor argument shows that first-order methods capture the leading terms of the meta-gradient when $\alpha$ is small (Nichol et al., 2018). The meta-learner thus optimizes for an initialization $\theta$ such that a small adaptation via first-order SGD yields high validation performance.
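To make the operator $U^k_{\tau,A}$ concrete, the following sketch applies $k$ first-order SGD steps to a toy quadratic task loss; the quadratic and its optimum $c$ are illustrative assumptions, not from the cited papers.

```python
import numpy as np

# Toy rendering of the inner-loop operator U^k_{tau,A}: k first-order
# SGD steps on a task's training loss, starting from meta-parameters theta.
# The quadratic L(theta) = 0.5*||theta - c||^2 stands in for a real task
# loss; c plays the role of the task optimum.

def inner_update(theta, c, alpha=0.1, k=5):
    """Apply k gradient steps on L(theta) = 0.5 * ||theta - c||^2."""
    for _ in range(k):
        grad = theta - c              # exact gradient of the quadratic
        theta = theta - alpha * grad
    return theta

theta0 = np.array([1.0, -2.0])
adapted = inner_update(theta0, c=np.zeros(2))
# Each step shrinks (theta - c) by (1 - alpha), so after k steps the
# distance to the task optimum is (1 - alpha)^k times the original.
```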

2. Algorithmic Structure and Update Rules

Standard MAML vs. FOMAML

The core difference between MAML and FOMAML lies in how the gradient with respect to the meta-parameters $\theta$ is computed through the inner-loop adaptation. In standard (second-order) MAML, the update is:

  • Inner: $\theta' = \theta - \alpha \nabla_\theta L_{\tau,A}(\theta)$
  • Outer: $\theta \leftarrow \theta - \beta \nabla_\theta L_{\tau,B}(\theta')$

Due to the chain rule, $\nabla_\theta L_{\tau,B}(\theta')$ involves Hessian-vector products:

\nabla_\theta L_{\tau,B}(\theta') = \left(I - \alpha \nabla_\theta^2 L_{\tau,A}(\theta)\right) \nabla_{\theta'} L_{\tau,B}(\theta')

FOMAML omits these second-order terms:

  • Inner: $\theta' = \theta - \alpha \nabla_\theta L_{\tau,A}(\theta)$
  • Outer: $g := \nabla_{\theta'} L_{\tau,B}(\theta')$
  • Meta-update: $\theta \leftarrow \theta - \beta g$

No Hessians are computed or backpropagated (Nichol et al., 2018). This basic form extends directly to minibatch training and to both supervised and reinforcement learning instantiations (Wang et al., 2020).
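The gap between the two meta-gradients can be checked numerically on a 1-D quadratic, where both forms have closed-form expressions; the curvature $a$, optima $c_A$, $c_B$, and step size below are illustrative choices.

```python
import numpy as np

# Numerical check of the chain rule above on a 1-D quadratic task.
# Inner loss L_A and outer loss L_B share curvature a but have
# different optima (an assumption made purely for illustration).

a, c_A, c_B, alpha = 2.0, 1.0, -1.0, 0.05

L_B = lambda t: 0.5 * a * (t - c_B) ** 2
grad_A = lambda t: a * (t - c_A)
grad_B = lambda t: a * (t - c_B)

def meta_loss(theta):
    """L_B evaluated after one inner step on L_A."""
    return L_B(theta - alpha * grad_A(theta))

theta = 0.3
theta_prime = theta - alpha * grad_A(theta)

fomaml_grad = grad_B(theta_prime)                    # drops the Hessian term
maml_grad = (1 - alpha * a) * grad_B(theta_prime)    # (I - alpha*H) correction

# A finite-difference gradient of the true meta-loss matches the MAML
# form, not the FOMAML one (up to O(eps) error).
eps = 1e-6
fd_grad = (meta_loss(theta + eps) - meta_loss(theta - eps)) / (2 * eps)
```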

Pseudocode

A typical meta-iteration in FOMAML is as follows (Nichol et al., 2018):

initialize θ
repeat:
    sample batch of tasks {τ_1, ..., τ_B}
    for each τᵢ in batch:
        θ'_i = θ - α ∇_θ L_{τᵢ,Aᵢ}(θ)       # Inner adaptation
        g_i = ∇_{θ'_i} L_{τᵢ,Bᵢ}(θ'_i)      # First-order meta-gradient
    g = (1/B) Σ_i g_i
    θ ← θ - β g                             # Meta-update
until convergence

The data splits $A$ (inner loop) and $B$ (outer loop) are kept disjoint when possible to avoid biasing the meta-gradient, although practical variants may sample them with replacement (Hendryx et al., 2019). Unlike Reptile, which moves $\theta$ toward the adapted parameters $\theta'$ directly, FOMAML always computes an explicit validation-set gradient at $\theta'$.
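The pseudocode above can be made runnable on a toy task family. The sketch below meta-trains a scalar model on 1-D linear-regression tasks $y = w_\tau x$; the task distribution and all hyperparameters are illustrative assumptions, and gradients are written out by hand so no autodiff library is needed.

```python
import numpy as np

# Runnable rendering of the FOMAML meta-iteration on toy tasks
# y = w_tau * x, where w_tau ~ N(2.0, 0.5^2). The model is a single
# scalar theta fit by squared error.

rng = np.random.default_rng(0)
alpha, beta, n_meta_iters, batch_tasks = 0.05, 0.1, 500, 8

def sample_task():
    """Sample a task's true slope plus disjoint A/B data splits."""
    w = rng.normal(loc=2.0, scale=0.5)
    x_A, x_B = rng.normal(size=10), rng.normal(size=10)
    return w, x_A, x_B

def grad(theta, w, x):
    """Gradient of the MSE loss 0.5 * mean((theta*x - w*x)^2)."""
    return np.mean((theta - w) * x ** 2)

theta = 0.0
for _ in range(n_meta_iters):
    meta_grads = []
    for _ in range(batch_tasks):
        w, x_A, x_B = sample_task()
        theta_i = theta - alpha * grad(theta, w, x_A)   # inner adaptation
        meta_grads.append(grad(theta_i, w, x_B))        # first-order meta-grad
    theta -= beta * np.mean(meta_grads)                 # meta-update

# Since the tasks are symmetric around slope 2.0, theta should converge
# to a neighborhood of 2.0.
```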

3. Theoretical Analysis: Approximation, Global Optimality, and Limitations

The approximation in FOMAML consists of ignoring the Hessian terms in the meta-gradient. For sufficiently small $\alpha$, Nichol et al. show via Taylor expansion that FOMAML retains core desiderata of MAML: (a) minimizing the joint task loss and (b) encouraging “gradient agreement” between different batches of the same task (Nichol et al., 2018). The meta-gradient in FOMAML is

G_{\text{FOMAML}} = \text{AvgGrad} - \alpha\,\text{AvgGradInner} + O(\alpha^2)

where the dropped $O(\alpha^2)$ corrections are minor for small $\alpha$.

However, there is a fundamental accuracy floor: FOMAML, unlike full MAML, cannot converge (in expectation) to an $\epsilon$-approximate first-order stationary point for arbitrarily small $\epsilon$. Precisely, for task heterogeneity measured by the task-gradient variance $\sigma^2$, the bound

\liminf_{k \to \infty} \mathbb{E}\bigl[\|\nabla F(w_k)\|\bigr] \ge C \alpha \sigma

holds for FOMAML, where $C$ is a constant and $\alpha$ is the inner-loop step size. Thus, even with infinite meta-batch and data-batch sizes, FOMAML cannot find an $\epsilon$-FOSP for any $\epsilon < C\alpha\sigma$ (1908.10400). This bias stems from dropping the Hessian, which leaves a persistent discrepancy in the descent direction that cannot be reduced below the $\alpha\sigma$ threshold.

By contrast, full MAML and Hessian-Free MAML (using finite-difference Hessian-vector products) achieve convergence to any desired $\epsilon$ at the expense of higher per-step computational cost (1908.10400).
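This bias floor can be seen in closed form on two quadratic tasks with different curvatures; all constants below are illustrative. Solving for FOMAML's fixed point and evaluating the exact meta-gradient there shows the true meta-objective is not stationary at that point.

```python
import numpy as np

# Two quadratic tasks L_i(t) = 0.5 * a_i * (t - c_i)^2 with one inner
# step of size alpha. Both meta-gradients then have closed forms.

a = np.array([1.0, 4.0])      # task curvatures (heterogeneous)
c = np.array([-1.0, 1.0])     # task optima
alpha = 0.1

def fomaml_grad(t):
    # g_i = dL_i(t')/dt' with t' = t - alpha * a_i * (t - c_i)
    return np.mean(a * (1 - alpha * a) * (t - c))

def maml_grad(t):
    # exact gradient of F(t) = mean_i L_i(t'_i), with the Hessian factor
    return np.mean(a * (1 - alpha * a) ** 2 * (t - c))

# FOMAML's fixed point solves a linear equation in t.
w = a * (1 - alpha * a)
t_fomaml = np.sum(w * c) / np.sum(w)

residual = abs(maml_grad(t_fomaml))
# residual > 0: the true meta-objective is not stationary at FOMAML's
# fixed point; the gap shrinks as alpha -> 0, matching the alpha*sigma floor.
```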

There is complementary positive theory: provided the model class has sufficient expressivity and task similarity (e.g., with overparameterized neural nets or linear feature spaces), approximate global optimality of FOMAML stationary points can be guaranteed. That is, the global gap decomposes into (a) stationarity error, (b) representation power (ability of the feature span to approximate “task-ratio” functions), and (c) linearization error (vanishes for large enough width) (Wang et al., 2020). In favorable regimes and especially for low-data adaptation, FOMAML is globally near-optimal up to small, quantifiable error terms.

4. Empirical Performance and Case Studies

FOMAML achieves performance nearly equivalent to full MAML on various few-shot image classification and segmentation benchmarks, with accuracy gaps typically within 1–2 percentage points (Nichol et al., 2018). For example, on Mini-ImageNet (5-way), MAML achieves 63.11 ± 0.92% (5-shot with transduction), while FOMAML reaches 63.15 ± 0.91%; for 1-shot, 48.70 ± 1.84% (MAML) vs. 48.07 ± 1.75% (FOMAML).

In image segmentation, FOMAML (with an EfficientNet-based architecture and regularization) produces state-of-the-art mean IoU scores on FSS-1000: 75.87 ± 1.10% for 1-shot and 79.89 ± 0.98% for 5-shot, improving to 81.36 ± 0.80% after meta-test-time hyperparameter optimization (Hendryx et al., 2019). Similar patterns are observed in the FP-k benchmark, where meta-learned FOMAML initializations offer substantial gains in the low-$k$ regime ($k \le 5$), but the advantage over transfer learning vanishes as $k \gg 10$.

On a range of datasets—Omniglot, Mini-ImageNet, FS-CIFAR100—FOMAML consistently matches or slightly trails full MAML, only diverging when high-precision meta-stationarity or performance on highly heterogeneous task distributions is required.

5. Implementation Details, Hyperparameter Sensitivity, and Practical Issues

FOMAML is valued for computational simplicity: its meta-gradient requires only ordinary (first-order) backpropagation, with no need for higher-order autodiff or extra memory for Hessians. This enables scaling to high-capacity architectures (e.g., EfficientLab for segmentation) and complex domains while maintaining the speed and simplicity of vanilla deep learning pipelines (Nichol et al., 2018; Hendryx et al., 2019).

Notable implementation considerations include:

  • Data splits: Overlap between the inner-loop and meta-validation batches can degrade generalization by underestimating distribution shift. Empirically, sampling $A$ and $B$ disjointly rather than with replacement changes performance by several points of IoU in segmentation (Hendryx et al., 2019).
  • Inner-loop optimizer: Using momentum (e.g., Adam with $\beta_1 > 0$) in the inner loop reduces “gradient agreement” and degrades meta-level learning. Zeroing inner-loop momentum is recommended (Nichol et al., 2018).
  • Step sizes and adaptation: The inner-loop learning rate $\alpha$ controls both adaptation strength and the meta-bias floor: smaller $\alpha$ reduces asymptotic bias but may impair adaptation. Meta-batch and data-batch sizes should scale with task-gradient variance to control gradient noise (1908.10400).
  • Hyperparameter optimization: Optimal step size, number of adaptation steps, dropout, and regularization are highly architecture- and task-dependent. Test-time adaptation hyperparameters can be profitably tuned via Bayesian optimization (Hendryx et al., 2019).
  • Practical limitations: FOMAML cannot achieve arbitrarily high accuracy on highly diverse task distributions without an extremely small $\alpha$, which can undermine adaptation speed. Its gradient-agreement effect is weaker than in full MAML.
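As a minimal sketch of the first point, support ($A$) and query ($B$) indices for one task can be drawn without overlap; the split sizes below are illustrative.

```python
import numpy as np

# Sampling disjoint inner-loop (A) and meta-validation (B) splits from
# one task's example indices, so the outer-loop loss is evaluated on
# data the inner loop has not seen.

def disjoint_splits(n_examples, n_support, n_query, rng):
    """Sample non-overlapping support (A) and query (B) index sets."""
    perm = rng.permutation(n_examples)
    return perm[:n_support], perm[n_support:n_support + n_query]

rng = np.random.default_rng(0)
A, B = disjoint_splits(n_examples=100, n_support=5, n_query=15, rng=rng)
# A and B never share an index by construction.
```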

6. Modern First-Order Variants and Convergence Guarantees

Several recent works seek to address FOMAML’s bias and convergence limitations:

  • Sign-MAML replaces standard SGD with SignSGD in the inner loop, enforcing a vanishing meta-Jacobian and eliminating Hessian involvement “by construction.” Sign-MAML outperforms FOMAML in accuracy (often by 1–8%) and matches or exceeds full MAML at similar or better computational cost. For instance, on FS-CIFAR100 (5-way 1-shot): MAML 35.8 ± 1.4% (0.058 s), FOMAML 32.7 ± 1.3% (0.032 s), Sign-MAML 37.5 ± 1.4% (0.032 s) (Fan et al., 2021).
  • FO-B-MAML leverages finite-difference approximations of the meta-gradient, yielding an unbiased estimator of the MAML gradient (up to solve precision). FO-B-MAML does not require any Hessian-vector products and, with normalized or clipped gradient descent, achieves convergence to stationary points of the MAML objective itself. Explicitly, the outer-loop update uses

g_i = -\lambda \frac{\theta_{i,\nu}^*(\theta) - \theta_{i,0}^*(\theta)}{\nu}

and

\theta \leftarrow \theta - \eta \frac{\widehat{\nabla}F(\theta)}{\beta + \|\widehat{\nabla}F(\theta)\|}

with theoretical guarantees matching full MAML under generalized gradient-dependent smoothness (Chayti et al., 2024).

These advances provide more robust convergence analysis, demonstrate that gradient clipping or normalization is theoretically justified for the bilevel meta-objective, and empirically show bias reduction compared to FOMAML.
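To illustrate the inner-loop substitution that Sign-MAML makes, the sketch below contrasts plain SGD with SignSGD on a toy quadratic; this is a loose schematic under assumed constants, not a reimplementation of Fan et al.

```python
import numpy as np

# Plain SGD (as in MAML/FOMAML) vs. SignSGD (as in Sign-MAML) inner
# loops on the quadratic loss 0.5*(t - 3)^2. Only the update rule differs.

def sgd_inner(theta, grad_fn, alpha, k):
    for _ in range(k):
        theta = theta - alpha * grad_fn(theta)
    return theta

def signsgd_inner(theta, grad_fn, alpha, k):
    for _ in range(k):
        theta = theta - alpha * np.sign(grad_fn(theta))
    return theta

grad_fn = lambda t: t - 3.0          # gradient of 0.5*(t - 3)^2
theta_sgd = sgd_inner(0.0, grad_fn, alpha=0.1, k=10)
theta_sign = signsgd_inner(0.0, grad_fn, alpha=0.1, k=10)
# SignSGD takes fixed-size steps toward the optimum, so the adapted
# parameters' dependence on theta no longer involves the loss Hessian.
```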

7. Summary Table: Comparison of First-Order Meta-Learning Algorithms

| Algorithm | Hessian Use | Theoretical Bias | Key Empirical Result |
|-----------|-------------|------------------|----------------------|
| MAML | Required | Unbiased | Highest accuracy if feasible |
| FOMAML | None | Bias $O(\alpha\sigma)$ | Matches MAML within 1–2 pts |
| Sign-MAML | None (via SignSGD) | Unbiased (special case) | Outperforms FOMAML by 1–8% |
| FO-B-MAML | None (finite diff.) | Vanishing w/ solve precision | Converges to MAML stationary pt |

FOMAML is a computationally efficient and empirically effective meta-learning algorithm, with substantial supporting theory characterizing both its advantages and inherent limitations. It remains widely used in applied settings where second-order derivatives are prohibitively costly, but recent variants further refine its convergence properties and scope of applicability (Nichol et al., 2018, 1908.10400, Fan et al., 2021, Chayti et al., 2024).
