Reptile: First-Order Meta-Learning
- Reptile is a gradient-based meta-learning algorithm that efficiently finds a shared initialization for rapid adaptation across tasks.
- It uses a first-order approach by updating the initialization towards task-specific parameters, balancing zero-shot and post-adaptation performance.
- Variants like Batch Reptile and Eigen-Reptile enhance stability and robustness, making the method effective in noisy, low-data regimes across various applications.
The Reptile algorithm is a first-order, gradient-based meta-learning method that seeks a shared initialization for rapid adaptation to new tasks sampled from a distribution. Designed for few-shot learning, Reptile operates across task families, efficiently leveraging task similarities while eschewing the need for higher-order meta-gradients. It is widely used in low-data regimes for supervised and reinforcement learning, and its variants have proven effective in adversarial and noisy environments, as well as in neuroscience and scientific computing applications (Nichol et al., 2018, Huisman et al., 2023, Jain, 2023, Chen et al., 2022, Ribeiro et al., 2021, Liu et al., 2021, Berdyshev et al., 2024, Saunshi et al., 2020).
1. Mathematical Formulation and Algorithmic Structure
Reptile aims to obtain an initialization $\theta$ such that a learner, after a small number of gradient steps on a new task $\mathcal{T} \sim p(\mathcal{T})$, achieves low loss for that task. The meta-objective is

$$\min_\theta \; \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})} \left[ \mathcal{L}_{\mathcal{T}}\!\left(U_{\mathcal{T}}^{T}(\theta)\right) \right],$$

where $U_{\mathcal{T}}^{T}(\theta)$ denotes the result of the inner loop, consisting of $T$ steps of stochastic gradient descent:

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \, \nabla_{\theta^{(t)}} \mathcal{L}_{\mathcal{T}}\!\left(\theta^{(t)}\right), \qquad \theta^{(0)} = \theta.$$

The meta-update moves $\theta$ towards the task-adapted parameters $\theta_j^{(T)}$ after $T$ steps:

$$\theta \leftarrow \theta + \epsilon \left( \theta_j^{(T)} - \theta \right).$$

This mechanism generalizes to batch and multi-task settings as

$$\theta \leftarrow \theta + \frac{\epsilon}{n} \sum_{j=1}^{n} \left( \theta_j^{(T)} - \theta \right)$$

for $n$ tasks per meta-step (Huisman et al., 2023, Nichol et al., 2018).
Canonical pseudocode:
```
Initialize θ
while not converged:
    Sample task T_j ∼ p(T)
    θ_j^{(0)} ← θ
    for t in 0…T−1:
        Sample mini-batch (x, y) ∼ p_j(x, y)
        θ_j^{(t+1)} ← θ_j^{(t)} − α ∇_{θ_j^{(t)}} 𝓛(x, y; θ_j^{(t)})
    Meta-update: θ ← θ + ε (θ_j^{(T)} − θ)
return θ
```
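The pseudocode above can be sketched as a runnable toy. The following is a minimal illustration on 1-D linear regression tasks whose optimal slopes are drawn around 2.0; the task distribution, model, and hyperparameters are arbitrary choices for demonstration, not from the paper. The meta-initialization should drift toward the mean of the task optima:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task():
    # Each task is 1-D linear regression y = w_true * x, w_true ~ N(2, 0.5^2).
    w_true = 2.0 + 0.5 * rng.standard_normal()
    x = rng.standard_normal(20)
    return x, w_true * x

def inner_sgd(w, x, y, alpha=0.05, steps=5):
    # T steps of SGD on the squared loss of a single task.
    for _ in range(steps):
        grad = 2 * np.mean((w * x - y) * x)
        w = w - alpha * grad
    return w

def reptile(meta_steps=2000, eps=0.1):
    w = 0.0  # meta-initialization
    for _ in range(meta_steps):
        x, y = make_task()
        w_adapted = inner_sgd(w, x, y)
        w = w + eps * (w_adapted - w)  # Reptile meta-update
    return w

w_meta = reptile()
print(w_meta)  # drifts toward ~2.0, the mean of the task optima
```

Starting inner adaptation from `w_meta` then requires far fewer steps than starting from the zero initialization, which is the practical payoff of the meta-objective above.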
2. Theoretical Properties and Underlying Mechanics
Reptile acts as a stochastic optimization of the squared distance to each task’s solution manifold, moving the initialization toward regions with minimal adaptation distance across tasks (Nichol et al., 2018). A first-order Taylor expansion reveals that, like MAML, Reptile implicitly promotes fast adaptation by encouraging alignment of gradients between task minibatches. Specifically, for two inner steps,

$$g_{\text{Reptile}} = g_1 + g_2 = \bar{g}_1 + \bar{g}_2 - \alpha \bar{H}_2 \bar{g}_1 + O(\alpha^2),$$

where $\bar{g}_i$ and $\bar{H}_i$ are the gradients and Hessians of the minibatch losses evaluated at the initialization. In expectation over minibatch sampling, $-\alpha \bar{H}_2 \bar{g}_1$ is a negative multiple of the gradient of the cross-batch inner product, $-\tfrac{\alpha}{2}\,\partial_\theta\,\mathbb{E}[\bar{g}_1 \cdot \bar{g}_2]$, which promotes learning directions that generalize within a task (Nichol et al., 2018).
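For quadratic losses the Hessians are constant, so the two-step expansion holds exactly and can be checked numerically. The sketch below uses two synthetic quadratic minibatch losses (all matrices and vectors are made up for the check):

```python
import numpy as np

rng = np.random.default_rng(1)
d, alpha = 3, 0.01
theta = rng.standard_normal(d)

# Two quadratic "minibatch" losses L_i(θ) = ½ θᵀA_iθ − b_iᵀθ,
# with gradient A_iθ − b_i and constant Hessian A_i.
A1, A2 = [M @ M.T + np.eye(d) for M in rng.standard_normal((2, d, d))]
b1, b2 = rng.standard_normal((2, d))

g1 = A1 @ theta - b1            # gradient of batch 1 at θ
theta1 = theta - alpha * g1     # first inner step
g2 = A2 @ theta1 - b2           # gradient of batch 2 at θ^(1)
theta2 = theta1 - alpha * g2    # second inner step

g_reptile = (theta - theta2) / alpha                    # Reptile direction = g1 + g2
g_expanded = g1 + (A2 @ theta - b2) - alpha * A2 @ g1   # ḡ1 + ḡ2 − α H̄2 ḡ1

print(np.allclose(g_reptile, g_expanded))  # True: expansion is exact here
```

For general losses the two sides differ by the $O(\alpha^2)$ remainder.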
In certain regimes, notably non-convex (overparameterized) models, Reptile achieves per-task sample complexity $O(1)$, independent of the input dimension, outperforming any convex initialization-based meta-learner (which requires $\Omega(d)$ samples for input dimension $d$) due to its ability to align optimization trajectories to the correct task subspace (Saunshi et al., 2020).
3. Contrasts with MAML and Related Algorithms
Reptile and MAML both perform inner-loop gradient-based adaptation and outer-loop meta-update of initialization, but differ fundamentally:
- Inner loop: Both use T-step SGD/Adam on each task.
- Outer loop: MAML minimizes the final loss on post-adaptation parameters, requiring (in full generality) second-order derivatives, e.g., the meta-gradient $\nabla_\theta \mathcal{L}\!\left(\theta^{(T)}\right) = \left[\prod_{t=0}^{T-1} \left(I - \alpha \nabla^2 \mathcal{L}\!\left(\theta^{(t)}\right)\right)\right] \nabla \mathcal{L}\!\left(\theta^{(T)}\right)$. Reptile, by contrast, simply shifts the initialization toward the adapted point (first-order only, no second derivatives).
- Loss perspective: MAML focuses solely on post-adaptation performance, whereas Reptile averages adaptation performance across all steps, implicitly regularizing for both zero-shot and post-adaptation performance (Huisman et al., 2023, Nichol et al., 2018).
- Implementation: Reptile is notably easier to implement and scale, especially when higher-order differentiation is impractical (Nichol et al., 2018, Huisman et al., 2023).
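The outer-loop contrast above can be made concrete: after the same inner trajectory, the Reptile direction is the sum of all inner-loop gradients, while first-order MAML keeps only the last one. A small check on a toy quadratic task (the matrix `A` and vector `b` are arbitrary illustrative choices):

```python
import numpy as np

alpha, steps = 0.1, 4
theta = np.array([0.5, -0.5])
A = np.diag([1.0, 3.0])          # toy quadratic task: L(w) = ½ wᵀAw − bᵀw
b = np.array([1.0, -1.0])

grads, w = [], theta.copy()
for _ in range(steps):
    g = A @ w - b                # exact gradient of the toy loss
    grads.append(g)
    w = w - alpha * g            # inner-loop SGD step

reptile_dir = (theta - w) / alpha  # Reptile: displacement / step size
fomaml_dir = grads[-1]             # first-order MAML: last gradient only

print(np.allclose(reptile_dir, np.sum(grads, axis=0)))  # True
```

Because the Reptile direction averages over the whole trajectory, it reflects adaptation performance at every inner step, matching the "loss perspective" contrast above.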
4. Adaptation Dynamics and Specialization to Low-Data Settings
Reptile is particularly effective when per-task data is scarce, driven by two mechanisms:
- Output-layer parameterization: The algorithm meta-learns all parameters (including output layer), resulting in larger initial adaptation gradients and faster inner-loop progress, as opposed to random output-layer initialization in standard finetuning. Experiments on miniImageNet and CUB show steeper learning curves and higher few-shot accuracy when starting from a meta-learned output layer (Huisman et al., 2023).
- Noisy training regime: Training inner-loops on extremely small support sets leads Reptile to produce robust initializations that tolerate stochasticity induced by data scarcity. If per-episode sample size (shots) is increased during meta-training, test-time few-shot performance drops, indicating an implicit regularization for data-efficient rapid adaptation (Huisman et al., 2023).
This effect has been exploited in RL domains—for example, in Super Mario Bros, meta-training on limited episode samples enables a trained agent (RAMario) to achieve higher test performance after few updates compared to standard RL algorithms (PPO, DQN) (Jain, 2023).
5. Empirical Performance, Representation Analysis, and Applications
Empirical studies validate the efficacy of Reptile in classic few-shot learning (Omniglot, miniImageNet), reinforcement learning, neural dialogue systems, physics-informed neural networks, and brain–computer interface tasks:
| Application Domain | Reptile Outperforms/Matches | Scenario | Details |
|---|---|---|---|
| Few-shot classification | FO-MAML, MAML | 5-way N-shot, Conv-4 | Transductive Reptile: 49.97% (1-shot), 65.99% (5-shot) |
| Reinforcement Learning | PPO, DQN | Super Mario Bros, 1M episodes | Distance: RAMario 2300 vs DQN 1840 vs PPO 1732 |
| Dialogue Systems | Transfer learning (DiKTNet baseline) | MultiWOZ, 1‑10% target data | Reptile at 3% data exceeds the BLEU/F1 of DiKTNet trained on 10% |
| PINNs | Standard/random init | Poisson, Burgers, Schrödinger eqs | NRPINN converges 5–50× faster, lower MAE by ×10–100 |
| BCI (EEG) | Standard transfer | EEGNet/FBCNet/EEG-Inception on BCI IV | Improves both zero/few-shot across multiple architectures |
(Nichol et al., 2018, Huisman et al., 2023, Jain, 2023, Ribeiro et al., 2021, Liu et al., 2021, Berdyshev et al., 2024).
Representation analysis using "joint classification accuracy" indicates that meta-learned features under Reptile are more specialized and less diverse than those obtained via standard pretraining/finetuning; strong few-shot performance can come at the cost of generalization to new domains (Huisman et al., 2023).
6. Algorithmic Extensions and Robustness
Significant variants and enhancements include:
- Batch Reptile: Simultaneous updates over multiple tasks via averaging, improving stability and sample efficiency (Nichol et al., 2018, Berdyshev et al., 2024).
- Eigen-Reptile: Instead of simple vector averaging, Eigen-Reptile updates parameters along the primary principal direction spanned by the entire trajectory of inner-loop iterates. Using the principal eigenvector of the scatter matrix of intermediate weights, this method enhances robustness to both sampling and label noise, theoretically retaining invariance to isotropic noise and empirically outperforming Reptile in few-shot and noisy-label regimes (Chen et al., 2022).
- NRPINN: Application of Reptile-inspired initialization to physics-informed neural networks, incorporating supervised, unsupervised (PDE residual), and semi-supervised tasks, thereby accelerating convergence and boosting accuracy for scientific computing problems (Liu et al., 2021).
- EEG-Reptile: Domain adaptation to per-subject BCI tasks, meta-learning over subjects via automated hyperparameter tuning, layer-wise meta-rates, and robust initialization via outlier removal, achieving superior generalization in both zero- and few-shot settings (Berdyshev et al., 2024).
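The Eigen-Reptile idea from the list above can be sketched in a few lines: take the principal eigenvector of the scatter matrix of the inner-loop iterates instead of the raw endpoint displacement. This mirrors the idea rather than the paper's exact estimator; the sign-fix and rescaling conventions here are assumptions:

```python
import numpy as np

def eigen_reptile_direction(trajectory):
    """Principal direction of inner-loop iterates (Eigen-Reptile sketch).

    trajectory: array of shape (T+1, d) of intermediate weights.
    Returns the top eigenvector of the iterates' scatter matrix, sign-aligned
    with the plain Reptile displacement and rescaled to its length.
    """
    W = np.asarray(trajectory)
    centered = W - W.mean(axis=0)
    _, eigvecs = np.linalg.eigh(centered.T @ centered)
    v = eigvecs[:, -1]                       # top principal direction
    displacement = W[-1] - W[0]              # plain Reptile direction
    if v @ displacement < 0:                 # resolve eigenvector sign ambiguity
        v = -v
    return v * np.linalg.norm(displacement)  # keep Reptile's step magnitude

# A mostly straight trajectory with small per-step noise: the principal
# direction recovers the clean displacement despite the perturbations.
rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 11)[:, None]
traj = t * np.array([1.0, 2.0]) + 0.01 * rng.standard_normal((11, 2))
d = eigen_reptile_direction(traj)
print(d / np.linalg.norm(d))  # close to [1, 2] / ‖[1, 2]‖
```

Averaging out per-step perturbations in this way is what gives the variant its robustness to sampling and label noise.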
7. Limitations and Practical Considerations
Reptile's main advantage is computational simplicity: no need for higher-order derivatives, support for arbitrary model architectures, and natural extensibility to RL and scientific computing settings. Key limitations include:
- Domain specialization: Features learned by Reptile are highly tuned to the data distribution seen during meta-training, resulting in poor generalization to out-of-distribution tasks, as shown via joint accuracy probes and OOD benchmarks (Huisman et al., 2023).
- Sensitivity to inner-loop batch size: Meta-training batch/shot size must match the expected test regime; larger meta-batch sizes during training may degrade few-shot test performance (Huisman et al., 2023).
- No explicit separation of support/query: Unlike MAML, Reptile does not require separate train/test splits, simplifying implementation but potentially weakening the post-adaptation loss focus (Nichol et al., 2018).
Recommended practice includes careful tuning of inner- and outer-loop hyperparameters, explicit matching of meta- and test-time conditions, and, when facing noisy/meta-adversarial settings, potential adoption of Eigen-Reptile or related variants (Chen et al., 2022, Huisman et al., 2023, Berdyshev et al., 2024).
In summary, Reptile is a first-order, gradient-based meta-learning algorithm that efficiently finds a parameter initialization amenable to rapid adaptation using a few gradient steps on new tasks. Its tractability and generality underlie successful applications in classification, reinforcement learning, representation learning, scientific computing, and neuroscience. However, its tendency to overfit to the meta-training distribution imposes challenges for domain adaptation, advocating for alternatives such as straightforward pretraining and fine-tuning in high-capacity and out-of-distribution settings (Huisman et al., 2023, Nichol et al., 2018, Berdyshev et al., 2024).