
Lifelong Imitation Learning

Updated 5 January 2026
  • Lifelong imitation learning is defined as the continuous adaptation of agents that acquire, retain, and expand skills from demonstrations without catastrophic forgetting.
  • State-of-the-art methods combine behavior cloning, tokenized skill scaling, and generative replay to facilitate efficient forward and backward knowledge transfer.
  • Practical applications span robotic manipulation, autonomous driving, and optimization, demonstrating the approach's scalability and safety in dynamic environments.

Lifelong imitation learning is the study and engineering of agents that acquire, retain, and continually expand diverse skills from demonstrations, enabling persistent adaptation across evolving tasks and domains without catastrophic forgetting. Unlike conventional imitation learning—which assumes static, batch training over all skills—lifelong frameworks integrate continual learning protocols with imitation to preserve prior knowledge and efficiently scale to novel skill acquisition, typically under strict data and compute constraints. Formulations span behavior cloning, inverse reinforcement learning, distillation, generative replay, and attention-based architectures, with applications in robotic manipulation, autonomous control, combinatorial optimization, and beyond.

1. Formal Problem Statement and Objectives

Lifelong imitation learning (LIL) generalizes classic imitation learning to a sequence of tasks $\{T^1, T^2, \dots, T^K\}$, each defined by states $S^k$, actions $A^k$, and expert demonstration sets $D^k = \{\tau_i^k\}$. The primary objective is to learn a single policy $\pi$ with persistent competence on all previously observed tasks, updating incrementally with new data $D^k$, while incurring minimal degradation on earlier skills ("catastrophic forgetting"). Formally, at step $k$, the policy $\pi^k$ must satisfy:

$$\forall j < k,\quad \mathbb{E}_{(s, a) \sim D^j} \left[ \mathcal{L}(\pi^k(a \mid s)) \right] \approx \mathbb{E}_{(s, a) \sim D^j} \left[ \mathcal{L}(\pi^j(a \mid s)) \right]$$

while assimilating new skills efficiently. Typical loss functions mix behavioral cloning:

$$\mathcal{L}_{BC} = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{(o_t, a_t) \sim D^k} \left\| a_t - \hat{a}_t(o_{\leq t}; \pi) \right\|^2$$

with regularization, knowledge distillation, or task-specific adaptations. Inverse reinforcement learning alternatives instead recover reward functions from demonstrations and enable forward and backward transfer via shared latent bases (Mendez et al., 2022).
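
For concreteness, the following is a minimal PyTorch-style sketch of this sequential behavior-cloning setup; the `policy` module, the per-task data loaders, and the forgetting-monitoring loop are hypothetical stand-ins rather than any specific paper's implementation.

import torch

def bc_loss(policy, batch):
    # Mean-squared behavior-cloning loss between expert and predicted actions.
    obs, expert_actions = batch            # tensors: (B, obs_dim), (B, act_dim)
    pred_actions = policy(obs)             # hypothetical policy: obs -> actions
    return ((expert_actions - pred_actions) ** 2).sum(dim=-1).mean()

def train_lifelong_bc(policy, task_loaders, epochs=10, lr=1e-3):
    """Sequentially fit `policy` on tasks T^1..T^K by behavior cloning.

    `task_loaders` is a list of iterables yielding (obs, action) batches,
    one per task, presented strictly in order (no joint training over all tasks).
    """
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for k, loader in enumerate(task_loaders):
        for _ in range(epochs):
            for batch in loader:
                opt.zero_grad()
                loss = bc_loss(policy, batch)
                loss.backward()
                opt.step()
        # Monitor forgetting: the loss on earlier demonstration sets should stay
        # close to its value right after those tasks were learned (the condition above).
        with torch.no_grad():
            for j, old_loader in enumerate(task_loaders[:k]):
                old_loss = sum(bc_loss(policy, b).item() for b in old_loader)
                print(f"after task {k}: loss on task {j} = {old_loss:.3f}")
    return policy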

2. Architectures and Lifelong Mechanisms

Modern LIL systems deploy structured neural architectures augmented for continual adaptation, including:

  • Tokenized Skill Scaling (T2S): Model parameters are decomposed into key- and value-token banks (KP, VP). Input observations query tokens via cross-attention, enabling scalable skill expansion: new skill acquisition requires appending new tokens, while shared tokens enable knowledge transfer. A language-guided token selection module connects natural-language instructions to relevant tokens, promoting parameter sharing and detachment for efficient training (Zhang et al., 2 Aug 2025).
  • Hierarchical Skill Codebooks (SPECI): Skills are stored as parameterized vectors in a dynamic codebook, accessed via attention-driven mechanisms that mix shared and task-specific codes. Mode approximation via tensor decompositions isolates task-level parameters while leveraging common factors, promoting scalable composition and bidirectional transfer (Xu et al., 22 Apr 2025).
  • Multi-Modal Distillation (M2Distill): Latent representations across modalities (vision, language, proprioception) are regularized for consistency between consecutive policy versions. Cross-modal losses (vision, language, joint/gripper, actions) inhibit latent drift, and GMM-based policy heads align action distributions. Small replay buffers anchor past skills (Roy et al., 2024).
  • Generative Replay (CRIL): Past task knowledge is replayed via synthetic trajectories generated by a GAN (first-frame) and a learned dynamics predictor, ensuring the policy never requires real data storage. This approach produces privacy-safe replay and avoids buffer explosion (Gao et al., 2021).
  • Elastic Weight Consolidation/Distillation (LiMIP): Critical parameters are regularized via Fisher diagonals (a generic penalty of this form is sketched after this list), and knowledge distillation enforces stability on archived states. Bipartite graph attention networks embed combinatorial domains for transfer and catastrophic forgetting mitigation (Manchanda et al., 2022).
  • Online Bayesian Imitation: Conservative Bayesian updating with a diminishing query rate to the demonstrator provides safe learning in non-resetting, non-episodic environments, with finite-time bounds on distributional divergence from the demonstrator (Cohen et al., 2021).
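
To make the regularization route concrete, below is a minimal sketch of a generic elastic-weight-consolidation penalty of the kind referenced in the LiMIP bullet above; the Fisher estimate, anchor parameters, and `lam` weight are standard EWC ingredients, not that paper's exact formulation.

import torch

def fisher_diagonal(policy, loader, loss_fn):
    """Estimate the diagonal of the Fisher information from squared gradients."""
    fisher = {n: torch.zeros_like(p) for n, p in policy.named_parameters()}
    n_batches = 0
    for batch in loader:
        policy.zero_grad()
        loss_fn(policy, batch).backward()
        for n, p in policy.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: f / max(n_batches, 1) for n, f in fisher.items()}

def ewc_penalty(policy, fisher, anchor_params, lam=100.0):
    """Quadratic penalty keeping parameters near their post-old-task values,
    weighted by how important each parameter was (Fisher diagonal)."""
    penalty = 0.0
    for n, p in policy.named_parameters():
        penalty = penalty + (fisher[n] * (p - anchor_params[n]) ** 2).sum()
    return lam * penalty

# Usage sketch: after finishing task k, store
#   fisher = fisher_diagonal(policy, loader_k, bc_loss)
#   anchor = {n: p.detach().clone() for n, p in policy.named_parameters()}
# and add ewc_penalty(policy, fisher, anchor) to the training loss on task k+1.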

3. Knowledge Transfer, Skill Expansion, and Catastrophic Forgetting Mitigation

Efficient LIL mandates mechanisms for both forward transfer (accelerated learning of new tasks) and backward transfer (retention or improvement of prior knowledge):

  • Token Banks and Masking: In T2S, shared tokens, selected by cosine similarity to task-specific embeddings, are masked and included in the forward pass with gradient detachment. This prevents overwriting and achieves near-zero negative backward transfer (NBT ≈ 1.0% over 30 tasks) (Zhang et al., 2 Aug 2025).
  • Attention-driven Skill Reuse: SPECI's codebook and transformer mechanisms allow relevant skills to be activated across tasks, supporting bidirectional transfer (negative NBT observed) and approaching multitask oracle performance (Xu et al., 22 Apr 2025).
  • Latent Regularization: M2Distill leverages multi-modal latent penalties, with ablations revealing vision distillation as critical (AUC drops ≥20pts if removed), followed by policy and language regularizers (Roy et al., 2024).
  • Generative Replay: CRIL enables skill retention by generating pseudo-trajectories for all past tasks at each step. Empirically, this approach closely matches rehearsal upper bounds (Ω_all ≈ 0.85, success ≈ 0.86) without raw data storage (Gao et al., 2021).
  • Distillation and EWC: For domains like MIP branching, combined knowledge distillation and EWC reliably preserve past performance, limiting forgetting to <5% and yielding up to 50% improvement over naive fine-tuning (Manchanda et al., 2022).
  • Gradient Projection: In LLPL, projection-based updates guarantee the new gradient does not harm old episodic memories (see the sketch after this list), producing monotonic improvement in control policies for autonomous driving (Gong et al., 2024).
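
A minimal sketch of the gradient-projection idea follows, written in the style of a single-reference (A-GEM-like) projection; it illustrates the principle that updates should not increase the loss on episodic memories to first order, and is not LLPL's exact algorithm.

import torch

def _grads(policy):
    # Collect per-parameter gradients, using zeros for unused parameters.
    return [p.grad.detach().clone() if p.grad is not None else torch.zeros_like(p)
            for p in policy.parameters()]

def project_gradient(new_grads, memory_grads):
    """If the current-task gradient conflicts with the episodic-memory gradient
    (negative dot product), project it so remembered tasks are not harmed to
    first order; otherwise leave it unchanged."""
    dot = sum((g * m).sum() for g, m in zip(new_grads, memory_grads))
    if dot >= 0:
        return new_grads
    mem_sq = sum((m * m).sum() for m in memory_grads) + 1e-12
    return [g - (dot / mem_sq) * m for g, m in zip(new_grads, memory_grads)]

def step_with_projection(policy, opt, loss_new, loss_memory):
    # Gradient of the loss on the episodic-memory batch (earlier tasks).
    opt.zero_grad()
    loss_memory.backward()
    mem_grads = _grads(policy)
    # Gradient of the current-task loss, projected before the optimizer step.
    opt.zero_grad()
    loss_new.backward()
    new_grads = _grads(policy)
    for p, g in zip(policy.parameters(), project_gradient(new_grads, mem_grads)):
        p.grad = g
    opt.step()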

4. Training Algorithms and Evaluation Protocols

Training algorithms in LIL frameworks synchronize skill growth with stability:

Initialize token-pool KP, VP; global mask M_G = 0
for k = 1 ... K tasks:
    Encode language l^k -> embedding e^k
    Compute cosine scores s_i = cos(e^k, KP_i)
    Select top-j tokens: M_P^k = Top-j(s, j)
    Split into shared (detached) and new (trainable) tokens
    Update mask M_G
    for each training epoch:
        Forward pass: Cross-attention using shared + new tokens
        Compute BC loss; backprop only on new tokens
    Evaluate success rates using corresponding masks
(Zhang et al., 2 Aug 2025)
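
The cosine-similarity token selection and gradient detachment at the core of this loop can be rendered concretely as follows; this is a simplified, hypothetical PyTorch sketch with toy dimensions, not the released T2S code.

import torch
import torch.nn.functional as F

def select_tokens(task_embedding, key_pool, j):
    """Rank key tokens by cosine similarity to the task's language embedding
    and return the indices of the top-j matches (the task mask M_P^k)."""
    scores = F.cosine_similarity(task_embedding.unsqueeze(0), key_pool, dim=-1)
    return torch.topk(scores, k=j).indices

def gather_tokens(indices, key_pool, value_pool, shared_mask):
    """Shared tokens (already used by earlier tasks) are detached so gradients
    cannot overwrite them; newly allocated tokens stay trainable."""
    keys, values = [], []
    for i in indices:
        k, v = key_pool[i], value_pool[i]
        if shared_mask[i]:
            k, v = k.detach(), v.detach()
        keys.append(k)
        values.append(v)
    return torch.stack(keys), torch.stack(values)

# Example with toy dimensions: 64 tokens of width 128, select 8 per task.
key_pool = torch.nn.Parameter(torch.randn(64, 128))
value_pool = torch.nn.Parameter(torch.randn(64, 128))
shared_mask = torch.zeros(64, dtype=torch.bool)     # updated as tasks arrive
task_emb = torch.randn(128)                         # language embedding e^k
idx = select_tokens(task_emb, key_pool, j=8)
K, V = gather_tokens(idx, key_pool, value_pool, shared_mask)
# K, V then serve as keys/values in the policy's cross-attention layers, and the
# BC loss backpropagates only into the non-shared (new) tokens.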

Standard metrics for lifelong evaluation include:

  • Forward Transfer (FWT): $\frac{1}{M} \sum_{m=1}^{M} r_{m,m}$, where $r_{m,m}$ denotes the success rate on task $m$ measured right after it is learned
  • Negative Backward Transfer (NBT): the average drop in success on earlier tasks after later tasks are learned (lower is better)
  • Area Under Curve (AUC): aggregate success rate over all tasks and learning stages; one way to compute all three metrics from a success-rate matrix is sketched after this list
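
All three can be computed from a matrix of success rates r[i, j] (performance on task j evaluated after training on task i). The NumPy sketch below uses one common convention; exact normalizations vary across papers, so treat it as illustrative.

import numpy as np

def lifelong_metrics(r):
    """r[i, j]: success rate on task j evaluated after training task i (M x M)."""
    M = r.shape[0]
    # Forward transfer: average performance on each task right after learning it.
    fwt = np.mean([r[m, m] for m in range(M)])
    # Negative backward transfer: average drop on earlier tasks after later training.
    drops = [r[m, m] - r[i, m] for m in range(M) for i in range(m + 1, M)]
    nbt = np.mean(drops) if drops else 0.0
    # Area under curve: average success over all (training stage, seen task) pairs.
    auc = np.mean([r[i, m] for i in range(M) for m in range(i + 1)])
    return fwt, nbt, auc

# Example: 3 tasks; rows = after training task i, columns = evaluated task.
r = np.array([[0.90, 0.00, 0.00],
              [0.80, 0.85, 0.00],
              [0.75, 0.80, 0.90]])
print(lifelong_metrics(r))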

Empirical benchmarks utilize manipulation suites (LIBERO-OBJECT, LIBERO-GOAL, LIBERO-SPATIAL), simulation (MetaWorld), and real robot deployments. For instance, T2S attains FWT = 77.7% and NBT = 1.0%, performing comparably to or better than the prior state of the art, M2Distill (Zhang et al., 2 Aug 2025, Roy et al., 2024). CRIL approaches the rehearsal method in both simulated and physical manipulation domains (Gao et al., 2021), and SPECI demonstrates negative NBT, i.e., improvement on old tasks during new-skill acquisition (Xu et al., 22 Apr 2025).

5. Applications, Benchmarks, and Domain Extensions

LIL has been validated across diverse domains:

  • Robot Manipulation: Hierarchical methods (SPECI, T2S, M2Distill) on LIBERO suites show consistent improvements in bidirectional transfer, skill retention, and overall performance, with architectures supporting multimodal perception and hierarchical execution (Xu et al., 22 Apr 2025, Zhang et al., 2 Aug 2025, Roy et al., 2024).
  • Autonomous Driving: LLPL combines inverse-dynamics IL pretraining with episodic gradient projection, achieving >40% RMSE reduction compared to single-task IL, and provably safe continuous improvement (Gong et al., 2024).
  • Mixed Integer Programming: LiMIP’s GAT encoder and distillation/EWC approach yield robust lifelong branching heuristics, outperforming fine-tuning and buffer-replay (Manchanda et al., 2022).
  • Online Adaptive Agents: Conservative Bayesian online imitation achieves finite-time safety and diminishing demonstrator queries across non-resetting environments (Cohen et al., 2021).
  • Generative Replay for Skill Retention: CRIL’s GAN-based replay mechanism attains near-rehearsal continual learning scores, supporting privacy-safe deployment (Gao et al., 2021).

6. Limitations, Open Questions, and Future Directions

Despite significant advances, open challenges persist:

  • Parameter Efficiency and Scaling: While tokenized architectures (T2S/SPECI) and low-rank mode approximation allow efficient scaling, extreme task volumes (>1000) and multi-modal expansion may require further sparsification and adaptive resource allocation.
  • Replay-free Continual Learning: Memory-free distillation (e.g., vision/action regularizers) remains under-explored for robustness against noisy or adversarial incremental data (Roy et al., 2024).
  • Active Learning and Demonstration Selection: Lifelong IRL and meta-imitation frameworks could benefit from active demonstration queries to maximally inform the latent skill basis (Mendez et al., 2022, Singh et al., 2020).
  • Guarantees in Function Approximation: Extending theoretical bounds from countable hypotheses to deep neural function spaces for safety and transferability remains unresolved (Cohen et al., 2021).
  • Generalization and Out-of-the-Box Adaptation: Attention-equipped imitator networks (e.g., DAAC) offer promising "one-demo" adaptation (Chen et al., 2023), but inference costs and behavior in unseen states pose scalability and reliability obstacles.

A plausible implication is that further hybridization of distillation, attention-based modularity, efficient replay, and dynamic task decomposition will increasingly enable LIL agents to approach lifelong open-world skill mastery under strict operational and resource constraints.
