Continual Fine-Tuning Overview

Updated 4 July 2026
  • Continual fine-tuning is the sequential adaptation of pre-trained models to new tasks while preserving previously acquired capabilities despite the risks of catastrophic forgetting.
  • It leverages methods like parameter-efficient techniques (LoRA, adapters, prompts) and replay-based strategies to balance model plasticity and knowledge retention.
  • Detailed analyses show that mitigating gradient interference and representational drift—through strategies like freezing key layers—significantly reduces forgetting across diverse domains.

Continual fine-tuning is the sequential adaptation of a pre-trained model to a stream of tasks, domains, or data distributions while attempting to preserve performance on earlier capabilities that can no longer be fully revisited. In the standard formulation, a model is updated on $T_1, T_2, \dots$, or more generally $D_1, D_2, \dots, D_T$, and is re-evaluated after each stage on prior tasks; the central difficulty is catastrophic forgetting, namely the degradation of earlier behavior under later updates. The term is used across full-parameter and parameter-efficient adaptation, including LoRA, adapters, prompts, and replay-based or rehearsal-free procedures, and spans task-incremental, class-incremental, domain-incremental, and online continual settings (Imanov, 26 Jan 2026, Coleman et al., 18 Apr 2025).

1. Problem formulation and operational regimes

A common formalization writes sequential adaptation as $M_i = \text{Update}(M_{i-1}, D^{(i)})$, where stage $i$ fine-tunes the current model on a newly arrived dataset. In the strict rehearsal-free regime emphasized in mechanistic LLM work, the model is fine-tuned on task $T_1$, then $T_2$, and so on, without revisiting earlier task data; performance is re-evaluated on all earlier tasks after each step (Imanov, 26 Jan 2026). More generally, continual learning taxonomies distinguish task-incremental learning, class-incremental learning, domain-incremental learning, and online continual learning, all of which can be instantiated as continual fine-tuning when adaptation begins from a pre-trained backbone rather than from random initialization (Coleman et al., 18 Apr 2025).
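The update-then-re-evaluate protocol above can be sketched as a short loop. This is a minimal illustration, not any paper's implementation: `continual_fine_tune`, `toy_update`, and `toy_evaluate` are hypothetical names, and the toy "model" is a single number that fits the newest task exactly, which makes forgetting visible in the score matrix.

```python
def continual_fine_tune(model, datasets, update, evaluate):
    """Apply M_i = update(M_{i-1}, D_i) and re-evaluate on every task seen so far.

    Returns the final model and a lower-triangular list `acc`, where
    acc[i][j] is the score on task j after training through stage i (j <= i).
    """
    acc = []
    for i, data in enumerate(datasets):
        model = update(model, data)  # strict rehearsal-free step
        acc.append([evaluate(model, datasets[j]) for j in range(i + 1)])
    return model, acc


# Toy stand-ins (assumptions): each "dataset" is just a target value and the
# update overwrites the model with the newest target.
def toy_update(model, target):
    return target


def toy_evaluate(model, target):
    return 1.0 if model == target else 0.0


final_model, acc = continual_fine_tune(0.0, [1.0, 2.0], toy_update, toy_evaluate)
print(acc)  # [[1.0], [0.0, 1.0]]: task 0 is forgotten after stage 1
```

The matrix `acc` is the raw material for the forgetting and transfer metrics used when continual fine-tuning is evaluated.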

This setting is typically positioned between two extremes. In speech self-supervised learning, for example, the contrast is explicit: one may freeze the encoder and train only a task head, or fully fine-tune the whole encoder on the downstream loss. Continual fine-tuning is introduced there as a middle ground, with the “previous task” interpreted as the self-supervised pretraining objective whose structure should be retained during downstream adaptation (Zaiem et al., 2024). In multilingual LLM adaptation, a closely related two-phase formulation distinguishes an English-only fine-tuning phase that builds predominantly task ability from a multilingual phase that builds predominantly language ability; the second phase is then analyzed as continual fine-tuning of the first (Aggarwal et al., 2024).

The same conceptual move appears in vision. Deep Linear Continual Fine-Tuning begins from a strong pre-trained representation and treats incremental learning as continued adaptation of that representation, rather than sequential training from scratch (Shon et al., 2022). ConFiT likewise studies a practical regime in which a pre-trained model is fine-tuned sequentially on a task stream, with a shared feature extractor and task-specific heads in the multi-head setting (Jie et al., 2022). Across modalities, continual fine-tuning therefore denotes not a single algorithm but a deployment regime: a pre-trained model remains live, receives new data over time, and must balance plasticity against retention.

2. Forgetting mechanisms and internal dynamics

The most detailed mechanistic account currently available for transformer LLMs identifies three primary drivers of forgetting during sequential fine-tuning: gradient interference in attention weights, representational drift in intermediate layers, and loss landscape flattening around prior-task minima. In a study spanning six LLMs, twelve task sequences, and 24 NLP tasks across seven domains, forgetting severity correlated strongly with task similarity (Pearson $r = 0.87$, $p < 0.001$), while first-epoch gradient alignment correlated with final forgetting at $r = -0.79$. Freezing attention layers reduced forgetting by $64\%$, and the study reports separate figures for freezing feedforward and output layers; a subset of attention heads underwent severe disruption, concentrated in lower layers. In the most affected intermediate layers, CKA similarity dropped and the dominant principal components rotated. The maximum Hessian eigenvalue for the first task fell and the loss landscape linearity index rose, both consistent with flattening around the prior-task minimum. The same work further reports that forgetting rates are higher when gradient cosine similarity falls below a lower threshold than when it stays above an upper threshold (Imanov, 26 Jan 2026).

Other modalities expose structurally similar but not identical mechanisms. ConFiT argues that forgetting in continual fine-tuning is not limited to penultimate-layer feature drift: intermediate representational shift (IRS) also matters because it disrupts Batch Normalization. In that account, even task-specific stored BN moments become stale when the intermediate representation itself has shifted, so BN normalizes with inaccurate moments and amplifies forgetting (Jie et al., 2022). DOC, in turn, argues that regularization-based methods fail in long-term LLM continual learning because the functional directions used to protect old tasks are not stable: as the model moves through parameter space, the local geometry changes and historical directions drift, making constraints based on stale directions increasingly ineffective (Zhang et al., 28 Sep 2025).

These analyses align with application-specific evidence that fine-tuning can erase the structure induced by pretraining. In speech SSL, the strongest ASR generalization methods—especially LoRA, EWC, and LS-replay—are also the methods with the lowest self-supervised loss after downstream fine-tuning, which the paper interprets as direct evidence that better downstream generalization is linked to less forgetting of the SSL objective (Zaiem et al., 2024). In cross-lingual adaptation, naïve fine-tuning of mBERT on English tasks worsens MLM perplexity, reduces cross-lingual sentence retrieval, and degrades zero-shot transfer, consistent with the view that source-language supervision can overwrite multilingual structure (Liu et al., 2020).

3. Algorithmic strategies for mitigating forgetting

The method space is usually organized into regularization-based, replay-based, architecture-based, and optimization-based continual-learning families, overlaid with parameter-efficient fine-tuning mechanisms such as adapters, prompts, LoRA, masking, and bias-only tuning. In survey form, this intersection is termed Parameter-Efficient Continual Fine-Tuning (PECFT): the goal is to adapt large pre-trained models with small task-specific modifications while preserving earlier capabilities (Coleman et al., 18 Apr 2025).

Replay remains the most direct mitigation mechanism. Continual-T0 trains instruction-tuned LLMs sequentially on eight new generation tasks while replaying only a small sampled subset of previous tasks; with this rehearsal, T0 zero-shot performance is described as nearly stationary, and for the 11B model CT0pp stays close to the upper bound overall with only limited loss on any single task (Scialom et al., 2022). MSSR reframes replay as a scheduling problem: it maintains sample-level memory strength, stability, and adaptive replay probabilities, while replay intervals expand over time and the replay ratio decays. Across three backbone models and 11 sequential tasks, MSSR consistently outperforms fixed, loss-based, and accuracy-based replay baselines, with only modest wall-clock and peak-memory overhead relative to fixed replay (Lu et al., 10 Mar 2026). In multilingual fine-tuning, generative replay and English replay are used to make the second phase look more like the first, again exploiting replay as a task-alignment mechanism rather than only as memory restoration (Aggarwal et al., 2024).
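As a rough illustration of fixed-ratio rehearsal, the sketch below mixes a small sampled subset of earlier-task examples into the new-task training set. The function name `build_replay_mixture` and the `replay_ratio` knob are assumptions made for this example; adaptive schedulers such as MSSR replace the fixed ratio with per-sample replay probabilities, which are not modeled here.

```python
import random


def build_replay_mixture(new_examples, memory, replay_ratio, seed=0):
    """Return new-task examples plus a sampled rehearsal subset of old examples.

    replay_ratio is the number of rehearsal examples per new example
    (hypothetical knob: 0.1 adds one old example per ten new ones).
    """
    rng = random.Random(seed)
    n_replay = min(len(memory), round(replay_ratio * len(new_examples)))
    mixture = list(new_examples) + rng.sample(list(memory), n_replay)
    rng.shuffle(mixture)  # interleave old and new examples
    return mixture


new_task = [("new", i) for i in range(20)]
old_tasks = [("old", i) for i in range(50)]
mixture = build_replay_mixture(new_task, old_tasks, replay_ratio=0.1)
print(len(mixture))  # 22
```

Sampling without replacement keeps the rehearsal subset free of duplicates; a decaying ratio could be obtained by passing a smaller `replay_ratio` at later stages.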

Constraint-based methods target parameter drift more directly. GEM casts fine-tuning as a two-task continual-learning problem (preserve MLM, XSR, or both while learning a downstream task) and constrains updates so that loss on the prior capability does not increase. In zero-shot cross-lingual POS and NER, GEM w/ Both outperforms naïve fine-tuning and multitask fine-tuning baselines on average (Liu et al., 2020). DLCFT provides a stronger theoretical reinterpretation: after linearizing a pre-trained network around its initial parameters and replacing cross-entropy with MSE, the task objective becomes quadratic in parameters, and quadratic parameter regularization becomes the optimal continual-learning policy in that setting (Shon et al., 2022). ConFiT combines representational stabilization with architectural correction through cross-convolution batch normalization and hierarchical fine-tuning, explicitly aiming to reduce IRS and BN mismatch (Jie et al., 2022).
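With a single protected capability, the GEM constraint has a simple closed form: if the new-task gradient has a negative inner product with the reference gradient computed on the prior capability, the conflicting component is projected out. The sketch below shows only that single-constraint case on plain Python lists; `project_gradient` is a hypothetical helper name, and the general multi-constraint case requires a quadratic program that is not shown.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_gradient(g_new, g_ref):
    """Project g_new so that its inner product with g_ref is non-negative.

    g_new: gradient of the downstream loss.
    g_ref: gradient of the loss on the capability to preserve.
    A descent step along the returned vector does not increase the
    reference loss to first order.
    """
    inner = dot(g_new, g_ref)
    if inner >= 0.0:
        return list(g_new)  # no conflict, leave the gradient unchanged
    scale = inner / dot(g_ref, g_ref)
    return [a - scale * b for a, b in zip(g_new, g_ref)]


g_task = [1.0, -1.0]
g_prior = [0.0, 1.0]
g_safe = project_gradient(g_task, g_prior)
print(g_safe)  # [1.0, 0.0]
```

The projected gradient is orthogonal to `g_prior`, so the component that would have raised the prior loss is removed while the rest of the update is kept.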

Parameter-efficient variants occupy a large fraction of current practice. In speech SSL fine-tuning, freezing-based and replay-based continual-learning methods generally outperform full fine-tuning; LoRA, EWC, and LS-replay are the strongest methods overall, whereas adapters underperform and sometimes fail badly (Zaiem et al., 2024). CURLoRA modifies LoRA by replacing the usual low-rank update with a CUR-style factorization in which only the small middle matrix is trained and is zero-initialized, and columns and rows are chosen with inverted probabilities. On Mistral 7B, standard LoRA suffers severe forgetting and a WikiText-2 perplexity explosion, while CURLoRA keeps WikiText-2 perplexity unchanged throughout sequential fine-tuning (Fawi, 2024). DOC extends the orthogonality idea to continual LoRA by dynamically tracking historical function directions with online PCA and projecting new-task LoRA gradients orthogonally to those directions, improving average accuracy and reducing negative backward transfer relative to O-LoRA on LLaMA-7B, LLaMA-13B, and T5-Large (Zhang et al., 28 Sep 2025).
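The CUR-style update can be sketched on a tiny weight matrix. This is an illustrative reading of the description above, not the CURLoRA reference code: the helper names are hypothetical, and "inverted probabilities" is assumed here to mean sampling weights proportional to the reciprocal of each column's or row's squared norm. The point of the example is that a zero-initialized middle matrix leaves the adapted weight identical to the original at the start of training.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverted_probs(norms_sq):
    # Assumed interpretation: low-norm columns/rows are MORE likely to be picked.
    inv = [1.0 / (n + 1e-12) for n in norms_sq]
    total = sum(inv)
    return [v / total for v in inv]


def weighted_sample(rng, weights, k):
    """Draw k distinct indices with the given weights."""
    idx, w, chosen = list(range(len(weights))), list(weights), []
    for _ in range(k):
        pick = rng.choices(range(len(idx)), weights=w, k=1)[0]
        chosen.append(idx.pop(pick))
        w.pop(pick)
    return sorted(chosen)


def cur_init(W, rank, seed=0):
    rng = random.Random(seed)
    rows, cols = len(W), len(W[0])
    col_norms = [sum(W[i][j] ** 2 for i in range(rows)) for j in range(cols)]
    row_norms = [sum(v ** 2 for v in W[i]) for i in range(rows)]
    col_idx = weighted_sample(rng, inverted_probs(col_norms), rank)
    row_idx = weighted_sample(rng, inverted_probs(row_norms), rank)
    C = [[W[i][j] for j in col_idx] for i in range(rows)]  # frozen, rows x rank
    R = [list(W[i]) for i in row_idx]                      # frozen, rank x cols
    U = [[0.0] * rank for _ in range(rank)]                # trainable, zero-init
    return C, U, R


def adapted_weight(W, C, U, R):
    delta = matmul(matmul(C, U), R)
    return [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]


W = [[1.0, 2.0, 0.5, -1.0], [0.0, 1.5, -2.0, 3.0], [4.0, -0.5, 1.0, 0.25]]
C, U, R = cur_init(W, rank=2)
print(adapted_weight(W, C, U, R) == W)  # True: zero-initialized U adds nothing
```

Because `C` and `R` are fixed slices of the pre-trained weight, every later update is confined to the subspace they span, which is one way to read the claim that the update subspace is explicitly constrained.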

4. Modular, retrieval-based, and compositional paradigms

A distinct line of work attempts to avoid destructive overwriting by allocating or retrieving task-specific computation rather than repeatedly updating one shared parameter set. SEE exemplifies the expert-ensemble approach. It adds a new LoRA expert for each task, freezes old experts, and eliminates the need for a separate router by letting each expert emit one of two indicator outputs that signal whether it should handle the input. On SuperNI continual instruction tuning, SEE achieves near-zero or zero forgetting as measured by BWT; on ten tasks, SEE(10%) beats MTL on AR, and the paper also reports the share of MMLU instances that SEE(10%) routes to the base model (Wang et al., 9 Apr 2025).

TRACE treats continual fine-tuning as task-specific parameter discovery via adaptation-aware probing. A short warm-start fine-tune (best at 1 epoch) exposes an adaptation trace, after which parameters are ranked by L2 change plus Fisher information or by cosine-similarity specificity. Only the top-ranked fraction of parameters is updated for the active task. On LLaMA3-8B, TRACE-CS exceeds the best baseline on average; results are also reported for TRACE-LF and TRACE-CS on Qwen2.5-14B (Han et al., 29 May 2026).
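The probe-then-mask idea can be illustrated with a simplified ranking. The sketch below ranks parameters only by the magnitude of their change during the warm-start, which is a deliberate simplification: it omits the Fisher-information and cosine-similarity criteria described above, and `top_fraction_mask` and `masked_update` are hypothetical names.

```python
def top_fraction_mask(before, after, fraction):
    """Mark the parameters that moved the most during the warm-start probe."""
    change = [abs(a - b) for a, b in zip(after, before)]
    k = max(1, int(fraction * len(change)))
    ranked = sorted(range(len(change)), key=lambda i: change[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(change))]


def masked_update(params, grads, mask, lr):
    """Gradient step that touches only the selected parameters."""
    return [p - lr * g if m else p for p, g, m in zip(params, grads, mask)]


theta_before = [0.0, 0.0, 0.0, 0.0]
theta_probe = [0.1, -0.5, 0.0, 0.3]   # after a short warm-start fine-tune
mask = top_fraction_mask(theta_before, theta_probe, fraction=0.5)
print(mask)  # [False, True, False, True]

theta_next = masked_update(theta_probe, [1.0, 1.0, 1.0, 1.0], mask, lr=0.1)
```

Parameters outside the mask stay exactly at their current values, so whatever they encode for earlier tasks is untouched by the active task's updates.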

ProCL builds memory directly inside LoRA. It partitions the adapter into program slots, routes inputs to slot compositions by attention during training, combines them with a frozen stable adapter, and consolidates routed updates back into the persistent adapter. The routing machinery is discarded at inference, so the method incurs no additional inference cost. On the QA curriculum BoolQ $\to$ SQuAD $\to$ AdversarialQA, ProCL improves on DEAL in average score, and low forgetting on BoolQ is reported for Flan-T5-Base (Le et al., 13 May 2026).

Retrieval-centric approaches attempt to preserve adaptability without training a continually updated retriever. PROTEUS learns task-specific LoRA modules together with clustering-based representation signatures, modeled as multi-key Gaussian components, and retrieves the task/component pair that minimizes the negative log-likelihood score. Its theoretical result bounds retrieval error, linking accurate task retrieval to cluster separation in representation space. Empirically it reports final average accuracies on CIFAR-100, ImageNet-R, ImageNet-A, VTAB5T-Large, VTAB5T-Small, and VTAB-Sim50, together with the corresponding retrieval accuracies (Le et al., 28 Jan 2026).
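Retrieval by minimum negative log-likelihood can be sketched as follows. The example assumes diagonal-covariance Gaussian components and a plain dictionary of signatures; both are illustrative choices rather than details confirmed by the source, and `gaussian_nll` and `retrieve` are hypothetical names.

```python
import math


def gaussian_nll(x, mean, var):
    """Negative log-likelihood of x under a diagonal Gaussian (assumed form)."""
    return 0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def retrieve(x, signatures):
    """Return the (task_id, component_index) whose component best explains x.

    signatures maps task_id -> list of (mean, variance) components.
    """
    best = None
    for task_id, components in signatures.items():
        for k, (mean, var) in enumerate(components):
            score = gaussian_nll(x, mean, var)
            if best is None or score < best[0]:
                best = (score, task_id, k)
    return best[1], best[2]


signatures = {
    "task_a": [([0.0, 0.0], [1.0, 1.0]), ([5.0, 5.0], [1.0, 1.0])],
    "task_b": [([-4.0, 4.0], [1.0, 1.0])],
}
print(retrieve([4.6, 5.2], signatures))   # ('task_a', 1)
print(retrieve([-3.5, 4.1], signatures))  # ('task_b', 0)
```

The retrieved task identifier would then select which task-specific LoRA module to apply; retrieval is reliable only when components from different tasks are well separated in representation space.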

A more algebraic alternative is Tangent Model Composition. TMC linearizes a model around a shared pre-trained point and treats each task-specific adaptation as a tangent vector, so component models can be added, scaled, or subtracted exactly at inference time. This removes sequential bias because each task is learned independently from the same reference model, supports parallel and federated training, and remains replay-free. Across 13 continual-learning experiments on Caltech-256, MIT-67, and OxfordPets, TMC outperforms recently published continual fine-tuning methods almost uniformly, despite not using any replay buffer (Liu et al., 2023).
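The reason composition is exact in the tangent setting is that the linearized output is affine in the tangent vector. The toy sketch below uses a scalar output and a fixed Jacobian row for one input; the names `tangent_output` and `compose` and all numeric values are illustrative assumptions.

```python
def tangent_output(base_out, jac_row, tau):
    """Linearized prediction f(x; theta0) + J(x) . tau for a scalar output."""
    return base_out + sum(j * t for j, t in zip(jac_row, tau))


def compose(taus, weights):
    """Weighted sum of tangent vectors."""
    return [sum(w * tau[i] for w, tau in zip(weights, taus)) for i in range(len(taus[0]))]


base_out = 1.0                 # f(x; theta0) at the shared pre-trained point
jac_row = [2.0, -1.0, 0.5]     # J(x) at the same point
tau_a = [0.5, 0.0, 1.0]        # adaptation learned independently on task A
tau_b = [-1.0, 2.0, 0.0]       # adaptation learned independently on task B

composed = tangent_output(base_out, jac_row, compose([tau_a, tau_b], [0.5, 0.5]))
ensemble = 0.5 * tangent_output(base_out, jac_row, tau_a) \
    + 0.5 * tangent_output(base_out, jac_row, tau_b)
print(composed, ensemble)  # -0.25 -0.25
```

With weights that sum to one, averaging tangent vectors gives the same prediction as averaging the component models' outputs, but at the inference cost of a single model. Subtracting a tangent vector removes a task's contribution in the same way.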

5. Evaluation protocols, diagnostics, and empirical breadth

Continual fine-tuning is evaluated with a wider metric set than ordinary fine-tuning. One mechanistic definition uses forgetting magnitude as the absolute decrease in performance from the immediate post-training value to the current value on a previous task (Imanov, 26 Jan 2026). Many LLM studies also report backward transfer, commonly defined as

$$\mathrm{BWT} = \frac{1}{T-1} \sum_{i=1}^{T-1} \left( a_{T,i} - a_{i,i} \right),$$

where $a_{j,i}$ is performance on task $i$ after training through stage $j$, and negative BWT indicates forgetting (Wang et al., 9 Apr 2025). DOC additionally reports average accuracy and forward transfer, while vision continual fine-tuning often uses dataset-level accuracy and forgetting measures (Zhang et al., 28 Sep 2025, Jie et al., 2022). In self-supervised continual fine-tuning, Kaizen distinguishes Final Accuracy (FA), Continual Accuracy (CA), Forgetting (F), and Forward Transfer (FT) to capture both end-state performance and behavior throughout the continual stream (Tang et al., 2023).
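These quantities are all simple functions of a score matrix. The sketch below assumes the common convention that `acc[j][i]` is the score on task `i` after training through stage `j` (zero-based), with made-up numbers; the exact definitions in individual papers may differ, and the function names are hypothetical.

```python
def average_accuracy(acc):
    """Mean score over all tasks after the final stage."""
    T = len(acc)
    return sum(acc[T - 1][:T]) / T


def backward_transfer(acc):
    """Mean of (final score - score right after learning) over earlier tasks."""
    T = len(acc)
    return sum(acc[T - 1][i] - acc[i][i] for i in range(T - 1)) / (T - 1)


def forgetting_magnitude(acc, task, stage):
    """Drop on `task` from its immediate post-training score to `stage`."""
    return acc[task][task] - acc[stage][task]


# Illustrative numbers only: rows are stages, columns are tasks.
acc = [
    [0.90, 0.00, 0.00],
    [0.80, 0.85, 0.00],
    [0.70, 0.80, 0.95],
]
print(round(average_accuracy(acc), 4))            # 0.8167
print(round(backward_transfer(acc), 4))           # -0.125
print(round(forgetting_magnitude(acc, 0, 2), 4))  # 0.2
```

Entries above the diagonal are unused here; they would be needed for forward transfer, which compares performance on a task before it is trained against a reference model.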

The empirical regimes are correspondingly diverse. Transformer LLM studies have used twelve task sequences from 24 NLP tasks across sentiment analysis, QA, summarization, translation, code generation, factual knowledge, and reasoning (Imanov, 26 Jan 2026). Continual instruction tuning is evaluated on 5-task and 10-task SuperNI sequences (Wang et al., 9 Apr 2025). Multilingual CFT measures task ability with IFEval, Alpaca Eval, MMLU, and HellaSwag, and language ability with MLQA, XQuAD, and XLSUM (Aggarwal et al., 2024). Speech work evaluates English and Danish ASR under low-resource and out-of-domain conditions, with held-out metrics including error rates and probes on the original SSL loss (Zaiem et al., 2024).

Vision and biosignal settings further broaden the scope. ConFiT studies ImageNet-pretrained ResNet18 on CIFAR100, CUB200, Caltech101, and Flowers102 (Jie et al., 2022). FeTT evaluates class-incremental continual fine-tuning on CIFAR100, CUB200, ImageNet-A, ImageNet-R, ObjectNet, and VTAB, emphasizing average and last accuracy under biased class marginals (Qiang et al., 2024). Kaizen benchmarks continual self-supervised fine-tuning on split CIFAR-100 and ImageNet-100 (Tang et al., 2023). In longitudinal EEG motor-imagery decoding, causal continual fine-tuning is assessed on a 61-subject dataset with 7–11 sessions per subject, using trial-wise accuracy and cosine distance between task vectors as a stability measure (Wimpff et al., 5 Feb 2025). The breadth of these settings suggests that continual fine-tuning is better understood as a cross-domain systems problem than as a benchmark-specific variant of fine-tuning.

6. Limits, misconceptions, security risks, and open directions

A recurring misconception is that high task similarity is automatically safe. Mechanistic LLM analysis shows the opposite can occur: higher similarity can sometimes lead to worse forgetting because apparently aligned tasks still create conflicts in particular parameter subsets, especially in attention query/key pathways (Imanov, 26 Jan 2026). The two-phase multilingual study reaches a related but narrower conclusion: when Phase 1 and Phase 2 datasets encode similar tasks, Phase 2 can improve language ability without hurting task ability, whereas dissimilar phase-wise datasets often cause strong degradation in English task ability (Aggarwal et al., 2024). Together, these results suggest that “similarity” must be resolved at the level of parameter compatibility rather than only semantic relatedness.
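The distinction between global and parameter-level compatibility can be made concrete by computing gradient cosine similarity per parameter group. The gradients below are invented for illustration, and the group names `attention` and `ffn` are placeholders rather than real module paths.

```python
import math


def cosine_similarity(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den > 0.0 else 0.0


def flatten(grads):
    return [value for name in sorted(grads) for value in grads[name]]


def per_group_similarity(grads_a, grads_b):
    """Cosine similarity of two tasks' gradients, computed group by group."""
    return {name: cosine_similarity(grads_a[name], grads_b[name]) for name in grads_a}


# Hypothetical gradients of two "similar" tasks.
grads_task_a = {"attention": [1.0, 0.0], "ffn": [2.0, 2.0]}
grads_task_b = {"attention": [-1.0, 0.0], "ffn": [2.0, 2.0]}

overall = cosine_similarity(flatten(grads_task_a), flatten(grads_task_b))
by_group = per_group_similarity(grads_task_a, grads_task_b)
print(round(overall, 3))                # 0.778
print(round(by_group["attention"], 3))  # -1.0
```

Here the two tasks look aligned overall, yet their attention gradients point in opposite directions, which is the kind of localized conflict that a single global similarity score hides.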

Another misconception is that PEFT alone solves forgetting. In multilingual CFT, LoRA with rank 64 does not eliminate forgetting, and on LLaMA-3-8B + Instruct $\to$ MultiAlpaca it performs poorly enough to show severe forgetting (Aggarwal et al., 2024). CURLoRA’s comparison with standard LoRA makes the same point more sharply: standard low-rank adaptation can still destroy base-model perplexity under sequential updates unless the update subspace is explicitly constrained (Fawi, 2024). Speech experiments likewise show that frozen features perform worst overall, full fine-tuning is often not the best, and continual-learning-based fine-tuning is only a strong practical compromise when methods are tuned carefully (Zaiem et al., 2024).

Scalability remains unsettled. SEE notes two explicit scaling concerns: the number of parameters grows as more tasks arrive, and rehearsal data also grows proportionally with task count (Wang et al., 9 Apr 2025). DOC observes that if there are hundreds of tasks, the principal-component pool used to track functional directions may grow and increase computation (Zhang et al., 28 Sep 2025). ProCL depends on routing specialization and can face slot-capacity bottlenecks if tasks are too similar or task count grows too large (Le et al., 13 May 2026). TMC assumes tasks remain local to a pre-trained embedding so that first-order linearization is meaningful (Liu et al., 2023). PROTEUS is most naturally suited to class-incremental or task-sequential settings where task boundaries are known, and does not claim to resolve truly task-free continual adaptation (Le et al., 28 Jan 2026).

Continual fine-tuning also introduces a security problem: anti-forgetting mechanisms can preserve malicious behavior. In a post-deployment threat model, P-Trojan optimizes trigger tokens so that poisoned and clean gradients align on token embeddings from the final transformer layer. On Qwen2.5 and LLaMA3 families, the backdoor persists through cleanup fine-tuning and cross-task fine-tuning while clean-task accuracy is preserved; data replay and FREEZE, which are intended to preserve benign knowledge, also preserve the backdoor (Cui et al., 12 Dec 2025). This implies that retention mechanisms cannot be evaluated only for utility preservation.

Open directions therefore concern both capability and governance. The PECFT survey emphasizes model merging, multimodal continual adaptation, reasoning-heavy tasks beyond classification, and more realistic online or weakly supervised settings as central future research directions (Coleman et al., 18 Apr 2025). A plausible implication is that the field is moving from generic anti-forgetting heuristics toward mechanism-specific continual systems: systems that diagnose gradient conflict early, localize or route task-relevant computation, preserve curvature or representation geometry when necessary, and treat retrieval, security, and deployment efficiency as first-class components rather than afterthoughts.
