Averaged Gradient Episodic Memory (A-GEM)

Updated 12 February 2026
  • A-GEM is a continual learning algorithm that mitigates catastrophic forgetting by using a single projection constraint based on the average past gradient.
  • It achieves competitive accuracy, e.g., 89.1% on Permuted MNIST, while drastically reducing computational and memory costs compared to GEM.
  • The method employs reservoir sampling for episodic memory and a strict two-stream evaluation protocol to ensure robustness in single-pass lifelong learning.

Averaged Gradient Episodic Memory (A-GEM) is a continual learning algorithm designed to balance computational efficiency, memory economy, and resistance to catastrophic forgetting in single-pass lifelong learning scenarios. A-GEM is an advancement over Gradient Episodic Memory (GEM), offering similar or superior accuracy with dramatically lower computational and memory costs by introducing a novel projection constraint on the average past gradient (Chaudhry et al., 2018).

1. Lifelong Learning Setup and Evaluation Protocols

In lifelong learning (LLL), the objective is to learn a predictor $f_\theta : \mathcal{X} \times \mathcal{T} \to \mathcal{Y}$ (for example, a neural network parameterized by $\theta \in \mathbb{R}^P$) over a sequence of $T$ tasks. Each task $k$ is associated with a dataset $D_k = \{(x_i^k, t_i^k, y_i^k)\}_{i=1}^{n_k}$, where $x_i^k$ is the input, $t_i^k$ is a task descriptor, and $y_i^k$ is the label. The learner observes each instance exactly once, with all tasks presented in sequence.

To mitigate catastrophic forgetting, methods maintain a small episodic memory $M_k \subset D_k$, typically much smaller than the task dataset ($|M_k| \ll n_k$). The union of all past task memories before task $t$ is denoted $M = \bigcup_{k<t} M_k$.

A-GEM evaluations are conducted via a two-stream protocol:

  • $D^{CV}$: A held-out cross-validation stream for hyper-parameter optimization, allowing arbitrary replay.
  • $D^{EV}$: An evaluation stream processed in a single pass with fixed hyper-parameters.

This separation prevents information leakage from evaluation tasks during hyper-parameter search and enforces a strictly single-pass regime for reporting metrics.

2. Evaluation Metrics

A-GEM is evaluated using several metrics that quantify accuracy and the dynamics of knowledge retention and acquisition:

(a) Final Average Accuracy $A_T$:

$$A_k = \frac{1}{k} \sum_{j=1}^{k} a_{k, B_k, j}$$

Here, $a_{k, B_k, j}$ is the test accuracy on task $j$ after training on all $B_k$ minibatches of task $k$; $A_T$ is the terminal metric.

(b) Forgetting $F_T$:

$$F_k = \frac{1}{k-1} \sum_{j=1}^{k-1} \left[ \max_{l \leq k-1} a_{l, B_l, j} - a_{k, B_k, j} \right]$$

This quantifies the deterioration in performance on previous tasks due to new learning.

(c) Learning Curve Area ($\mathrm{LCA}_\beta$):

$$Z_b = \frac{1}{T} \sum_{k=1}^{T} a_{k,b,k} \qquad \mathrm{LCA}_\beta = \frac{1}{\beta+1}\sum_{b=0}^{\beta} Z_b$$

LCA evaluates both few-shot and progressive learning by averaging accuracy over the first $\beta$ training minibatches of each task.
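The three metrics above are straightforward to compute once per-task accuracies are recorded. A minimal sketch in NumPy, where the array layouts and function names are illustrative (not from the paper): `acc_end[k][j]` holds accuracy on task $j$ after finishing task $k$, and `acc_during[k][b]` holds accuracy on task $k$ after its first $b$ minibatches.

```python
import numpy as np

def final_average_accuracy(acc_end):
    """A_T: mean accuracy over all tasks after training on the last task.
    acc_end[k][j] = accuracy on task j after finishing task k (zero-indexed)."""
    T = acc_end.shape[0]
    return float(acc_end[T - 1].mean())

def forgetting(acc_end):
    """F_T: average drop from the best accuracy ever reached on each past task."""
    T = acc_end.shape[0]
    drops = [acc_end[:T - 1, j].max() - acc_end[T - 1, j] for j in range(T - 1)]
    return float(np.mean(drops))

def lca(acc_during, beta):
    """LCA_beta: average of Z_b for b = 0..beta, where
    acc_during[k][b] = accuracy on task k after b minibatches of task k."""
    Z = acc_during[:, : beta + 1].mean(axis=0)  # Z_b, averaged over tasks
    return float(Z.mean())
```

Note that `forgetting` takes the maximum over all earlier checkpoints, so a task whose accuracy recovers later still counts its worst drop relative to its historical best.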

3. From GEM to A-GEM: Mathematical Formulation

GEM constrains gradient updates to avoid loss increases on any previous task's memory, projecting the current gradient $g$ onto the intersection of half-spaces:

$$\tilde g = \arg\min_{z} \frac{1}{2}\|g-z\|^2 \quad \text{s.t.} \quad \langle z, g_k \rangle \geq 0 \quad \forall k < t,$$

with $g_k = \nabla_\theta \ell(f_\theta, M_k)$. This requires solving a quadratic program with $t-1$ constraints and storing all $g_k$.

A-GEM simplifies the constraint to a single condition on the average past gradient $g_\mathrm{ref}$, computed from a mini-batch sampled from $M$:

$$\tilde g = \arg\min_z \frac{1}{2}\|g - z\|^2 \quad \text{s.t.} \quad \langle z, g_\mathrm{ref} \rangle \geq 0.$$

If $\langle g, g_\mathrm{ref} \rangle \geq 0$, $g$ is left untouched; otherwise the projection has a closed form:

$$\tilde g = g - \frac{\langle g, g_\mathrm{ref} \rangle}{\langle g_\mathrm{ref}, g_\mathrm{ref} \rangle}\, g_\mathrm{ref}.$$

This single-constraint projection reduces computational complexity and storage, enabling scalability to longer task sequences and larger networks.
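The closed-form projection is a few lines of vector algebra. A minimal NumPy sketch (the small `eps` guard against a zero reference gradient is an added safeguard, not part of the paper's formulation):

```python
import numpy as np

def agem_project(g, g_ref, eps=1e-12):
    """A-GEM gradient projection.

    If g does not conflict with the reference gradient (non-negative dot
    product), return it unchanged; otherwise remove the component of g
    that points against g_ref, so the update no longer increases the
    average loss on the episodic memory (to first order)."""
    dot = g @ g_ref
    if dot >= 0:
        return g
    return g - (dot / (g_ref @ g_ref + eps)) * g_ref
```

After projection the constraint holds with equality: the returned vector is orthogonal to $g_\mathrm{ref}$ whenever the original gradient conflicted with it.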

4. Algorithmic Implementation and Complexity

A-GEM maintains a global episodic memory MM with reservoir sampling to ensure a uniform selection from all encountered data. At each training step, a mini-batch from MM provides grefg_\mathrm{ref}. The update is as follows:

for (x, y) in D_t:
    if M ≠ ∅:
        sample a reference mini-batch (x_ref, t_ref, y_ref) from M
        g_ref ← ∇_θ ℓ(f_θ(x_ref, t_ref), y_ref)
    else:
        g_ref ← 0
    g ← ∇_θ ℓ(f_θ(x, t), y)
    if ⟨g, g_ref⟩ < 0:
        g ← g − (⟨g, g_ref⟩ / ⟨g_ref, g_ref⟩) · g_ref
    θ ← θ − η · g
    update M with (x, y) via reservoir sampling
Inputs: training stream $D$, test sets $D^{test}$, learning rate $\eta$. Outputs: final parameters $\theta$ and accuracy matrix $A$.
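The memory update in the last step uses standard reservoir sampling, which keeps a uniform random sample of the whole stream without knowing its length in advance. A minimal sketch (function and variable names are illustrative):

```python
import random

def reservoir_update(memory, item, capacity, n_seen):
    """One reservoir-sampling step over a data stream.

    memory   - list of stored examples, at most `capacity` long
    item     - the new example from the stream, e.g. an (x, y) pair
    n_seen   - number of stream examples observed before this one
    Returns the updated count of observed examples.

    Each stream element ends up in memory with equal probability
    capacity / n_total, regardless of arrival order."""
    if len(memory) < capacity:
        memory.append(item)          # fill the reservoir first
    else:
        j = random.randint(0, n_seen)  # uniform over 0..n_seen inclusive
        if j < capacity:
            memory[j] = item           # replace a random slot
    return n_seen + 1
```

Calling this once per incoming example maintains the uniform-sample property A-GEM relies on when estimating $g_\mathrm{ref}$ from $M$.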

Complexity Table

| Method | Time (per step) | Memory |
|---|---|---|
| Vanilla | $O(P)$ | $O(P + BH)$ |
| EWC | $O(P)$ + diagonal updates | $O(4P + BH)$ |
| GEM | $O(Pt + \text{QP})$ | $O(Pt + (B+M)H)$ |
| A-GEM | $O(P + MH)$ (≈ $O(P)$) | $O(2P + (B+M)H)$ |

Here $P$ is the number of parameters, $B$ the mini-batch size, $H$ the activation size, and $M$ the episodic memory size. In practice, A-GEM is approximately $100\times$ faster and $10\times$ more memory-efficient than GEM on MNIST/CIFAR benchmarks.

5. Empirical Results and Benchmark Performance

Experiments evaluate A-GEM on Permuted MNIST, Split CIFAR-100, Split CUB, and Split AWA, using MLP and ResNet architectures. A-GEM's final accuracy ($A_T$) matches or slightly trails GEM (e.g., 89.1% vs. 89.5% on Permuted MNIST) while outperforming all regularization-based baselines (EWC, PI, MAS, RWalk) in the single-pass regime (e.g., EWC: 68% vs. A-GEM: 89% on Permuted MNIST). Forgetting $F_T$ remains lowest among methods with bounded memory.

Incorporating compositional task descriptors with a joint-embedding model (the "-je" variant) further improves $A_T$, zero-shot performance ($\mathrm{LCA}_0$), and learning speed for A-GEM and other methods.

Normalized summary (Permuted MNIST, Split CIFAR):

| Method | $A_T$ (%) ↑ | $\mathrm{LCA}_{10}$ | Time ↓ | Mem ↓ |
|---|---|---|---|---|
| Vanilla | 47.9 | 0.26 | 0.06 | 0.06 |
| EWC | 68.3 | 0.27 | 0.14 | 0.14 |
| GEM | 89.5 | 0.23 | 1.00 | 1.00 |
| A-GEM | 89.1 | 0.29 | 0.14 | 0.11 |

A-GEM is thus Pareto-optimal in the joint space of accuracy, forgetting, LCA, time, and memory.

6. Ablations, Sensitivity Analyses, and Algorithmic Variants

A-GEM projections are only required on a small fraction of steps, in contrast to GEM's frequent constraints as tasks accumulate. The "Stochastic GEM" (s-GEM) variant, which randomly samples past constraints, is still more costly and slightly less effective than A-GEM.

EWC's efficacy is highly sensitive to the number of epochs and model capacity; in single-pass and small-network settings, it only marginally outperforms vanilla SGD. Only with over-parameterized models and multiple passes does EWC approach A-GEM's performance.

Hyper-parameter search spaces and selected settings are detailed in the appendix of (Chaudhry et al., 2018).

7. Key Insights, Limitations, and Future Directions

A-GEM achieves the core objectives of lifelong learning—retaining prior knowledge and enabling forward transfer—while being computationally and memory efficient. The Learning Curve Area (LCA) metric, introduced alongside A-GEM, provides a finer quantification of few-shot learning dynamics.

Observed limitations include the residual gap between single-pass continual learning (even with A-GEM) and the multi-task upper bound (IID setting). Differences in LCA among advanced continual methods converge when catastrophic forgetting is controlled; the field thus requires strategies for enhancing positive backward transfer.

Natural extensions include applying A-GEM to unsupervised, reinforcement, or streaming non-i.i.d. learning settings. The open-source codebase is provided at https://github.com/facebookresearch/agem (Chaudhry et al., 2018).
