Averaged Gradient Episodic Memory (A-GEM)
- A-GEM is a continual learning algorithm that mitigates catastrophic forgetting by using a single projection constraint based on the average past gradient.
- It achieves competitive accuracy, e.g., around 89.1% on Permuted MNIST, while drastically reducing computational and memory costs compared to GEM.
- The method employs reservoir sampling for episodic memory and a strict two-stream evaluation protocol to ensure robustness in single-pass lifelong learning.
Averaged Gradient Episodic Memory (A-GEM) is a continual learning algorithm designed to balance computational efficiency, memory economy, and resistance to catastrophic forgetting in single-pass lifelong learning scenarios. A-GEM is an advancement over Gradient Episodic Memory (GEM), offering similar or superior accuracy with dramatically lower computational and memory costs by introducing a novel projection constraint on the average past gradient (Chaudhry et al., 2018).
1. Lifelong Learning Setup and Evaluation Protocols
In lifelong learning (LLL), the objective is to learn a predictor $f_\theta : \mathcal{X} \times \mathcal{T} \to \mathcal{Y}$ (for example, a neural network parameterized by $\theta$) over a sequence of tasks. Each task $k$ is associated with a dataset $\mathcal{D}_k = \{(x_i, t_i, y_i)\}_{i=1}^{n_k}$, where $x_i \in \mathcal{X}$ is the input, $t_i \in \mathcal{T}$ is a task descriptor, and $y_i \in \mathcal{Y}$ is the label. The learner observes each instance exactly once, with all tasks presented in sequence.
To mitigate catastrophic forgetting, methods maintain a small per-task episodic memory $\mathcal{M}_k$, typically much smaller than the task dataset ($\lvert \mathcal{M}_k \rvert \ll \lvert \mathcal{D}_k \rvert$). The union of all past task memories before task $t$ is denoted $\mathcal{M} = \bigcup_{k < t} \mathcal{M}_k$.
A-GEM evaluations are conducted via a two-stream protocol:
- $\mathcal{D}^{CV}$: A held-out cross-validation stream for hyper-parameter optimization, allowing arbitrary replay.
- $\mathcal{D}^{EV}$: An evaluation stream processed in a single pass with fixed hyper-parameters.
This separation prevents information leakage from evaluation tasks during hyper-parameter search and enforces a strictly single-pass regime for reporting metrics.
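The two-stream protocol can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: `split_streams`, `tune_then_evaluate`, and the scalar-score `train_fn` interface are hypothetical names chosen for the sketch.

```python
def split_streams(tasks, n_cv):
    """D^CV: first n_cv tasks, used only for hyper-parameter search
    (replay allowed). D^EV: remaining tasks, seen in a single pass."""
    return tasks[:n_cv], tasks[n_cv:]

def tune_then_evaluate(tasks, n_cv, candidate_lrs, train_fn):
    """Pick hyper-parameters on D^CV alone, then report metrics from
    exactly one pass over D^EV, so evaluation tasks never leak into
    the hyper-parameter search."""
    d_cv, d_ev = split_streams(tasks, n_cv)
    # Multiple passes are permitted on the cross-validation stream.
    best_lr = max(candidate_lrs, key=lambda lr: train_fn(d_cv, lr, passes=3))
    # Reporting is strictly single-pass with the frozen setting.
    return train_fn(d_ev, best_lr, passes=1)
```
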
2. Evaluation Metrics
A-GEM is evaluated using several metrics that quantify accuracy and the dynamics of knowledge retention and acquisition:
(a) Final Average Accuracy $A_T$:

$$A_T = \frac{1}{T} \sum_{j=1}^{T} a_{T,j}$$

Here, $a_{k,j}$ is the test accuracy on task $j$ after training on all minibatches of task $k$; $A_T$ is the terminal metric.
(b) Forgetting $F_T$:

$$F_T = \frac{1}{T-1} \sum_{j=1}^{T-1} \left( \max_{l \in \{1, \dots, T-1\}} a_{l,j} - a_{T,j} \right)$$

This quantifies the deterioration in performance on previous tasks due to new learning.
(c) Learning Curve Area (LCA):

$$Z_b = \frac{1}{T} \sum_{k=1}^{T} a_{k,b}, \qquad \mathrm{LCA}_\beta = \frac{1}{\beta+1} \sum_{b=0}^{\beta} Z_b$$

where $a_{k,b}$ is the accuracy on task $k$ after observing $b$ mini-batches of that task. LCA evaluates both few-shot and progressive learning by averaging accuracy up to $\beta$ training steps.
3. From GEM to A-GEM: Mathematical Formulation
GEM constrains gradient updates to avoid loss increases on any previous task's memory, projecting the current gradient $g$ onto the intersection of half-spaces:

$$\min_{\tilde{g}} \; \frac{1}{2} \lVert g - \tilde{g} \rVert_2^2 \quad \text{s.t.} \quad \langle \tilde{g}, g_k \rangle \geq 0 \;\; \forall k < t,$$

with $g_k = \nabla_\theta \ell(f_\theta, \mathcal{M}_k)$. This requires solving a quadratic program with $t-1$ constraints and storing all $g_k$.
A-GEM simplifies the constraint to a single condition on the average past gradient $g_{\mathrm{ref}}$, computed from a mini-batch sampled from $\mathcal{M} = \bigcup_{k < t} \mathcal{M}_k$. If $\langle g, g_{\mathrm{ref}} \rangle \geq 0$, $g$ is left untouched; else, the projection has a closed form:

$$\tilde{g} = g - \frac{\langle g, g_{\mathrm{ref}} \rangle}{\langle g_{\mathrm{ref}}, g_{\mathrm{ref}} \rangle} \, g_{\mathrm{ref}}.$$

This single-constraint projection reduces computational complexity and storage, enabling scalability to longer task sequences and larger networks.
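The closed-form projection is a one-liner over flattened gradients. A minimal NumPy sketch; `agem_project` is an illustrative name, and the small `eps` guard against a zero reference gradient is an addition not present in the paper's formula:

```python
import numpy as np

def agem_project(g, g_ref, eps=1e-12):
    """A-GEM gradient correction: if g conflicts with the average past
    gradient g_ref (negative inner product), remove the conflicting
    component along g_ref; otherwise return g unchanged."""
    dot = float(np.dot(g, g_ref))
    if dot >= 0.0:
        return g
    return g - (dot / (float(np.dot(g_ref, g_ref)) + eps)) * g_ref
```

By construction, the corrected gradient satisfies $\langle \tilde{g}, g_{\mathrm{ref}} \rangle = 0$ whenever the constraint was violated, so the loss on the memory mini-batch is not increased to first order.
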
4. Algorithmic Implementation and Complexity
A-GEM maintains a global episodic memory $\mathcal{M}$ with reservoir sampling to ensure a uniform selection from all encountered data. At each training step, a mini-batch from $\mathcal{M}$ provides $g_{\mathrm{ref}}$. The update is as follows:
```
for (x, y) in D_t:
    if M ≠ ∅:
        sample (x_ref, y_ref) ~ M
        g_ref ← ∇_θ ℓ(f_θ(x_ref, t), y_ref)
    else:
        g_ref ← 0
    g ← ∇_θ ℓ(f_θ(x, t), y)
    if ⟨g, g_ref⟩ < 0:
        g ← g − (⟨g, g_ref⟩ / ⟨g_ref, g_ref⟩) · g_ref
    θ ← θ − η · g
    update M with reservoir sampling from D_t
```
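The reservoir-sampling update in the last line keeps $\mathcal{M}$ a uniform sample of the stream without knowing its length in advance. A minimal sketch of the classic algorithm; `reservoir_update` is an illustrative helper name, not an identifier from the A-GEM codebase:

```python
import random

def reservoir_update(memory, item, n_seen, capacity):
    """Classic reservoir sampling: after processing the (n_seen+1)-th
    stream item, every item seen so far is retained in `memory` with
    probability capacity / (n_seen + 1)."""
    if len(memory) < capacity:
        memory.append(item)
    else:
        j = random.randint(0, n_seen)  # uniform over [0, n_seen], inclusive
        if j < capacity:
            memory[j] = item  # replace a uniformly chosen slot
    return memory
```
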
Complexity Table
| Method | Time (per step) | Memory |
|---|---|---|
| Vanilla | $O(BP)$ | $O(P)$ |
| EWC | $O(BP)$ + diag-updates | $O(P)$ extra for the Fisher diagonal |
| GEM | $O(tBP)$ + QP with $t-1$ constraints | $O(tP + \lvert \mathcal{M} \rvert)$ |
| A-GEM | $O(BP)$ (≈ $2\times$ vanilla) | $O(P + \lvert \mathcal{M} \rvert)$ |
Here $P$ = number of parameters, $B$ = mini-batch size, $H$ = activation size, and $\lvert \mathcal{M} \rvert$ = episodic memory size. In practice, A-GEM is substantially faster and more memory efficient than GEM on MNIST/CIFAR.
5. Empirical Results and Benchmark Performance
Experiments evaluate A-GEM on Permuted MNIST, Split CIFAR-100, Split CUB, and Split AWA, using MLP and ResNet architectures. A-GEM's final accuracy ($A_T$) matches or slightly trails GEM (e.g., 89.1% vs. 89.5% on Permuted MNIST) while outperforming all regularization-based baselines (EWC, PI, MAS, RWalk) in the single-pass regime (e.g., EWC: 68%, A-GEM: 89% on Permuted MNIST). Forgetting remains lowest among methods with bounded memory.
Incorporating compositional task descriptors with a joint-embedding model (the "-je" variant) further improves $A_T$, zero-shot and few-shot performance (LCA), and learning speed for A-GEM and other methods.
Normalized summary (Permuted MNIST, Split CIFAR):
| Method | $A_T$ (%) ↑ | LCA ↑ | Time ↓ | Mem ↓ |
|---|---|---|---|---|
| Vanilla | 47.9 | 0.26 | 0.06 | 0.06 |
| EWC | 68.3 | 0.27 | 0.14 | 0.14 |
| GEM | 89.5 | 0.23 | 1.00 | 1.00 |
| A-GEM | 89.1 | 0.29 | 0.14 | 0.11 |
A-GEM is thus Pareto-optimal in the joint space of accuracy, forgetting, LCA, time, and memory.
6. Ablations, Sensitivity Analyses, and Algorithmic Variants
A-GEM projections are only required on a small fraction of steps, in contrast to GEM's frequent constraints as tasks accumulate. The "Stochastic GEM" (s-GEM) variant, which randomly samples past constraints, is still more costly and slightly less effective than A-GEM.
EWC's efficacy is highly sensitive to the number of epochs and model capacity; in single-pass and small-network settings, it only marginally outperforms vanilla SGD. Only with over-parameterized models and multiple passes does EWC approach A-GEM's performance.
Hyper-parameter search spaces and selected settings are detailed in the appendix of (Chaudhry et al., 2018).
7. Key Insights, Limitations, and Future Directions
A-GEM achieves the core objectives of lifelong learning—retaining prior knowledge and enabling forward transfer—while being computationally and memory efficient. The Learning Curve Area (LCA) metric, introduced alongside A-GEM, provides a finer quantification of few-shot learning dynamics.
Observed limitations include the residual gap between single-pass continual learning (even with A-GEM) and the multi-task upper bound (IID setting). LCA differences among advanced continual methods shrink once catastrophic forgetting is controlled; the field thus requires strategies for enhancing positive backward transfer.
Natural extensions include applying A-GEM to unsupervised, reinforcement, or streaming non-i.i.d. learning settings. The open-source codebase is provided at https://github.com/facebookresearch/agem (Chaudhry et al., 2018).