Parameter-Efficient Continual Learning
- Parameter-efficient continual learning methods update only a small subset of parameters to acquire new tasks while preserving previously learned knowledge.
- Techniques like LoRA, CLoRA, and PIECE employ adaptive rank selection and selective updating to minimize computational and memory overhead in resource-constrained environments.
- Modular architectures in PECL, including shared and hierarchical PET modules, facilitate scalable deployment in on-device, multimodal, and federated learning applications.
Parameter-efficient continual learning (PECL) is a class of methods designed to support the sequential acquisition of new tasks in neural networks while mitigating catastrophic forgetting, focusing on minimizing the number of trainable parameters and reducing computational, memory, or energy overhead. PECL is motivated by both the limitations of full-model retraining—especially in deployed and resource-constrained environments—and the need to preserve knowledge from earlier tasks without expanding model capacity unboundedly. This paradigm plays an instrumental role in real-world applications involving large foundation models, scalable multimodal systems, on-device learning, federated scenarios, and specialized domains such as few-shot and language adaptation.
1. Problem Formulation and Motivating Principles
Continual learning (CL) requires training a model on a sequence of tasks $\mathcal{T}_1, \dots, \mathcal{T}_T$, where each $\mathcal{T}_t$ introduces a disjoint set of labels or a novel domain, and the system must maintain recognition or reasoning over the union $\bigcup_{t=1}^{T} \mathcal{T}_t$ without revisiting prior data. Catastrophic forgetting—loss of performance on old tasks due to overwriting model weights—is the principal challenge. In many practical settings (edge/robotics, foundation models), compute, memory, and energy resources are constrained, making full fine-tuning impractical. PECL approaches address these constraints by restricting updates to a small, efficiently-organized parameter subset per task (Muralidhara et al., 26 Jul 2025, Wang et al., 19 Nov 2025).
Parameter-efficient continual learning typically targets:
- Reducing trainable parameters per task (often 0.1–5% of total parameters).
- Avoiding duplication or linear growth of parameters with the number of tasks.
- Preserving consolidated (pre-trained) knowledge in the frozen backbone.
- Supporting plug-and-play reuse of pre-trained features for downstream adaptation in vision, language, and multimodal tasks.
2. Low-Rank Adaptation and LoRA-based PECL
Low-Rank Adaptation (LoRA) is foundational in many recent PECL methods (Muralidhara et al., 26 Jul 2025, Bhat et al., 17 May 2025). LoRA reparameterizes a frozen weight $W_0 \in \mathbb{R}^{d \times k}$ with a learnable low-rank update $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. Inference and training use $W = W_0 + BA$. LoRA modules are typically inserted into attention or projection layers of transformers and can be merged with $W_0$ for inference.
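In code, the reparameterization amounts to a rank-$r$ factor pair added to a frozen matrix; a minimal NumPy sketch (toy dimensions, no training loop):

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 64, 32, 4                 # output dim, input dim, LoRA rank (r << min(d, k))
W0 = rng.standard_normal((d, k))    # frozen pretrained weight
# Trainable low-rank factors; B starts at zero so W == W0 before any training.
A = rng.standard_normal((r, k)) * 0.01
B = np.zeros((d, r))

def lora_forward(x):
    """Forward pass with the frozen weight plus the low-rank update BA."""
    return (W0 + B @ A) @ x

# After training, BA can be merged into W0 for zero-overhead inference.
W_merged = W0 + B @ A
x = rng.standard_normal(k)
assert np.allclose(lora_forward(x), W_merged @ x)
```

The merge step is why LoRA adds no inference latency: the low-rank update folds back into the original weight matrix.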
CLoRA (Muralidhara et al., 26 Jul 2025) extends LoRA by sharing a single low-rank pair $(A, B)$ across all tasks, updating only these factors and the lightweight decoder/classifier while keeping the backbone $W_0$ frozen. This achieves extreme parameter efficiency (e.g., tuning only 1% of ViT encoder parameters for class-incremental semantic segmentation) and avoids the merging conflicts and task-ID inference overhead of task-specific adapters.
PEARL (Bhat et al., 17 May 2025) introduces dynamic rank allocation for the LoRA adapters: for each task and layer, the rank is chosen adaptively according to the proximity between task weights and reference weights. Singular value decomposition (SVD) on the parameter difference guides rank selection to guarantee sufficient plasticity for new tasks while minimizing parameter cost. Empirically, PEARL demonstrates significant gains, outperforming rehearsal-free and exemplar-based baselines on ImageNet-R, CIFAR-100, and TinyImageNet with only a few million parameters added per task.
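The rank-selection idea can be sketched with a spectral-energy criterion on the weight difference — a simplified stand-in for PEARL's proximity-based rule, with `energy` and `r_max` as illustrative hyperparameters:

```python
import numpy as np

def select_rank(delta_w, energy=0.9, r_max=16):
    """Pick the smallest rank whose leading singular values capture
    `energy` of the squared spectral mass of the weight difference."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    r = int(np.searchsorted(cum, energy) + 1)
    return min(r, r_max)

rng = np.random.default_rng(0)
# A weight difference that is approximately rank-3 plus small noise.
low_rank = rng.standard_normal((64, 3)) @ rng.standard_normal((3, 64))
delta = low_rank + 0.01 * rng.standard_normal((64, 64))
print(select_rank(delta))   # a small rank, close to the true rank of 3
```

Tasks whose weight difference from the reference is nearly low-rank thus receive small adapters, while harder shifts are granted more capacity.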
3. Parameter Importance and Selective Updating
Selective updating based on parameter importance is central to scalable foundation-model PECL methods. PIECE (Wang et al., 19 Nov 2025) adopts two principled importance estimators:
- PIECE-F: Uses Fisher Information to score the sensitivity of each parameter, updating only the top-$k$ most relevant parameters (e.g., the top 0.1% in large models).
- PIECE-S: Employs a second-order score combining gradient and curvature, enabling normalized importance-driven selection.
For each new task, PIECE masks and tunes only the highest-importance parameters, freezing the remainder. This approach minimizes destructive interference with general capabilities of foundation models, and helps retain prior knowledge such as code generation or cross-modal reasoning. Empirical evaluations on Gemma2-2B, Llama3-8B, Qwen3-14B, Qwen3-VL-4B, and LLaVA-1.5-7B establish state-of-the-art performance and significant reductions in forgetting without data replay or architecture growth (Wang et al., 19 Nov 2025).
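A minimal sketch of Fisher-based selection in the spirit of PIECE-F, using the common diagonal approximation (averaged squared per-example gradients; the array shapes are illustrative):

```python
import numpy as np

def fisher_top_mask(grads, frac=0.001):
    """Diagonal-Fisher importance: average squared gradient per parameter,
    then keep only the top `frac` fraction of parameters trainable."""
    fisher = np.mean(np.square(grads), axis=0)   # (n_params,)
    k = max(1, int(frac * fisher.size))
    thresh = np.partition(fisher, -k)[-k]
    return fisher >= thresh                      # boolean trainable mask

rng = np.random.default_rng(0)
grads = rng.standard_normal((32, 10_000))        # per-example gradients
mask = fisher_top_mask(grads, frac=0.001)
print(mask.sum())   # 10 parameters selected (0.1% of 10,000)
```

During task training, gradient updates would be multiplied by this mask so that all other parameters stay frozen.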
4. Shared and Hierarchical Modular Architectures
Modern PECL frameworks adopt modular architectures to support scalability and knowledge transfer. Key design themes:
- Shared Module Approaches: CLoRA uses a single LoRA block for all tasks; SAPT (Zhao et al., 2024) applies a shared attentive selection and learning mechanism where task-specific PET blocks are blended during training and routed via attention at inference, elegantly balancing catastrophic forgetting with knowledge reuse.
- Hierarchical Decomposition: HiDe-PET (Wang et al., 2024) formalizes the CL objective decomposed into within-task prediction (WTP), task-identity inference (TII), and task-adaptive prediction (TAP). HiDe-PET combines task-shared PET modules (adapted slowly/sparingly) with task-specific PETs, optimizing for each component. This dual strategy, especially with LoRA/adapters instead of prompts, achieves robust performance on fine-grained and distribution-shifted streams.
- General Frameworks: LAE (Gao et al., 2023) provides a unified framework for adapters, LoRA, and prefix-tuning, aligning adaptation speeds, accumulating knowledge via momentum, and ensembling online/offline PETs at inference. The approach is compatible with multiple PET methods and achieves strong results in split CIFAR-100 and ImageNet-R settings.
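The momentum-based knowledge accumulation in LAE-style frameworks reduces to an exponential moving average of PET parameters; a minimal sketch (the `momentum` value and parameter names are illustrative):

```python
import numpy as np

def ema_accumulate(offline, online, momentum=0.99):
    """Momentum (EMA) accumulation of the online PET parameters into the
    offline PET, as used for knowledge accumulation in LAE-style frameworks."""
    return {k: momentum * offline[k] + (1 - momentum) * online[k]
            for k in offline}

rng = np.random.default_rng(0)
offline = {"lora_A": np.zeros((4, 32))}                  # slowly updated copy
online = {"lora_A": rng.standard_normal((4, 32))}        # trained on the new task
offline = ema_accumulate(offline, online)
# At inference, LAE ensembles predictions from the online and offline PETs.
```

The high momentum keeps the offline PET stable across tasks while the online PET remains plastic.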
5. Regularization, Orthogonalization, and Gradient Control
PECL systems often integrate advanced regularization and gradient projection to balance plasticity and stability. Notable advances:
- NTK-CL (Liu et al., 2024) analyzes PEFT-CL via neural tangent kernel theory, demonstrating that generalization gaps (forgetting) are controlled by training sample size, task-feature orthogonality, and explicit regularization. The framework fuses prompt and LoRA-like adapters with cross-feature augmentation, orthogonality penalties, and adaptive knowledge retention, achieving superior results on continual learning benchmarks and providing theoretical guidelines for triple feature augmentation and SVD-based orthogonality enforcement.
- Gradient-projection frameworks (Qiao et al., 2024) formalize anti-forgetting conditions for prompt, prefix, adapter, and LoRA modules via orthogonal projection of the parameter gradients onto the null space of prior task features. This approach achieves zero forgetting for previously encountered inputs, supporting all PET paradigms with small overhead and unified anti-forgetting guarantees.
- SAFE (Zhao et al., 2024) separately trains "slow" PET modules in the first session to inherit generic PTM knowledge using transfer loss, and "fast" PET modules in incremental sessions for adaptable learning. A cross-classification loss with feature alignment and entropy-based aggregation at test time achieves state-of-the-art task-incremental results on seven benchmarks.
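The null-space projection underlying such gradient-projection frameworks can be sketched as follows (assuming prior-task features are stacked row-wise in a matrix `F`; shapes are illustrative):

```python
import numpy as np

def null_space_projector(features, eps=1e-10):
    """Projector onto the null space of the prior-task feature matrix F
    (n x d): P = I - V_r V_r^T, where V_r spans the row space of F."""
    _, s, vt = np.linalg.svd(features, full_matrices=False)
    vr = vt[s > eps * s.max()]          # row-space basis of F
    d = features.shape[1]
    return np.eye(d) - vr.T @ vr

rng = np.random.default_rng(0)
F = rng.standard_normal((20, 64))       # features from earlier tasks
P = null_space_projector(F)
g = rng.standard_normal(64)             # raw gradient for a PET parameter block
g_proj = P @ g                          # update direction orthogonal to F
# Old-task features are (numerically) unchanged by a step along g_proj.
assert np.allclose(F @ g_proj, 0.0, atol=1e-8)
```

Because the projected update leaves `F @ g_proj` at zero, outputs on previously encountered inputs are preserved exactly, which is the source of the anti-forgetting guarantee.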
6. Multimodal, Federated, and Sparse Extensions
PECL methods are increasingly extended to specialized settings:
- Multimodal CL: CPE-CLIP (D'Alessandro et al., 2023) modifies CLIP with lightweight multimodal prompts in both text and vision branches, plus cross-modal projections and session-aware gradient scaling, setting new few-shot class-incremental baselines with under 0.3% overhead.
- Federated CL: pMAE (He et al., 2024) applies masked autoencoder prompts for federated continual learning. Clients locally train reconstructive and discriminative prompts; server reconstructs prior-task images and globally fine-tunes prompts/classifiers, optimizing only 0.3% of parameters and substantially outperforming prompt-only baselines.
- Sparse CL: EsaCL (Ren et al., 2024) enables efficient continual learning for sparse models via sharpness-driven directional pruning, intelligent data selection, and one-shot pruning per task, minimizing retraining while keeping the model small at each CL phase.
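One-shot pruning per task can be illustrated with a plain magnitude criterion — a deliberate simplification of EsaCL's sharpness-driven rule, shown only to convey the single-pass structure:

```python
import numpy as np

def one_shot_prune(weights, sparsity=0.8):
    """One-shot pruning: zero out the smallest-magnitude weights in a single
    pass (a magnitude-based simplification of sharpness-driven pruning)."""
    w = weights.ravel()
    k = int(sparsity * w.size)
    thresh = np.partition(np.abs(w), k)[k]
    mask = np.abs(weights) >= thresh
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
w_sparse, mask = one_shot_prune(w, sparsity=0.8)
print(round(mask.mean(), 2))   # fraction of weights kept, ~0.2
```

The single pruning pass per task is what avoids the repeated prune-retrain cycles of iterative sparsification.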
7. Merging, Selection, and Model Composition
Recent research investigates compositional properties and scaling methods for PECL modules:
- Closed-form Merging: LoRM (Salami et al., 2024) develops alternating closed-form updates for LoRA factors in federated CL. It solves for a unique solution for merged factors by alternating optimization, ensuring output alignment across clients and tasks. LoRM achieves state-of-the-art accuracy on federated splits of CIFAR-100, ImageNet-R, and EuroSAT, and extends to other low-rank adaptation types.
- Selector and Mixture-of-Experts: Dynamic ConPET (2309.14763) maintains a set of PET modules per task and dynamically selects the optimal module via a lightweight selector during inference. The approach scales to arbitrarily large task streams while keeping per-step training and inference complexity bounded by the size of the selected PET modules rather than growing with the number of tasks.
- Parameter Mining and Freezing: Efficient parameter mining (Menezes et al., 2024) leverages layer-level importance statistics (mean, variance, entropy, median) to freeze high-utility network layers post-task, preserving features associated with old tasks and reducing catastrophic forgetting without extra parameter overhead.
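The layer-statistics idea can be sketched as follows; the statistics and the freezing rule below are illustrative, not the exact criterion of the cited work:

```python
import numpy as np

def layer_importance(weights):
    """Per-layer importance statistics (mean |w|, variance, and entropy
    of the normalized magnitude histogram), in the spirit of parameter mining."""
    w = np.abs(weights).ravel()
    hist, _ = np.histogram(w, bins=32)
    p = hist / max(hist.sum(), 1)
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
    return {"mean": w.mean(), "var": w.var(), "entropy": entropy}

rng = np.random.default_rng(0)
# Toy layers with increasing weight scale to make the ranking visible.
layers = {f"layer{i}": rng.standard_normal((64, 64)) * (i + 1) for i in range(4)}
stats = {name: layer_importance(w) for name, w in layers.items()}
# Freeze the layers with the highest mean-magnitude statistic after a task.
frozen = sorted(stats, key=lambda n: stats[n]["mean"], reverse=True)[:2]
print(frozen)   # the two highest-scoring layers
```

Frozen layers keep the features old tasks rely on, while the remaining layers stay trainable for the next task at no extra parameter cost.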
8. Evaluation Metrics, Empirical Results, and Limitations
PECL methods are evaluated using metrics such as final average accuracy, forgetting score, backward transfer (BWT), parameter count, and resource efficiency metrics (e.g., NetScore combining accuracy, model size, and compute cost (Muralidhara et al., 26 Jul 2025)). Key findings:
- CLoRA achieves 1% parameter updates per task with mIoU comparable to state-of-the-art baselines and NetScore improvements over naive fine-tuning (Muralidhara et al., 26 Jul 2025).
- PIECE obtains the highest overall performance (OP) and near-zero BWT across language and multimodal foundation models, outperforming replay and regularization methods (Wang et al., 19 Nov 2025).
- SAPT, HiDe-PET, LAE, and PEARL yield consistent accuracy gains over prompt-pool and adapter-pool CL methods, especially as the number of tasks scales (Zhao et al., 2024, Wang et al., 2024, Gao et al., 2023, Bhat et al., 17 May 2025).
- Dynamic rank adaptation and orthogonalized update directions are shown to preserve stability with minimal incremental parameter cost.
Limitations include static rank choices (CLoRA), per-task parameter growth (PEARL, ConPET), and overhead from importance computation (PIECE). Future directions include adaptive rank scheduling, quantization-aware modules (QLoRA), hybrid structured/unstructured selection, and online streaming variants for embedded on-device learning (Muralidhara et al., 26 Jul 2025, Wang et al., 19 Nov 2025).
References:
- (Muralidhara et al., 26 Jul 2025, Bhat et al., 17 May 2025, Wang et al., 19 Nov 2025, Zhao et al., 2024, D'Alessandro et al., 2023, He et al., 2024, Gao et al., 2023, Salami et al., 2024, Ren et al., 2024, Selvaraj et al., 2023, Menezes et al., 2024, Liu et al., 2024, Qiao et al., 2024, Zhao et al., 2024, Wang et al., 2024, 2309.14763, Palit et al., 2024, Coleman et al., 18 Apr 2025).