
K&G RFT: Retain & Adapt in Fine-Tuning

Updated 14 September 2025
  • K&G RFT is a fine-tuning strategy that balances task-specific adaptation with retention of pre-trained knowledge, mitigating catastrophic forgetting.
  • It employs techniques such as Knowledge Distillation and LoRA modules to interleave task adaptation losses with regularization for robust performance.
  • Empirical results show enhanced in-domain and out-of-domain metrics, with smoother loss landscapes and improved transferability across tasks.

Knowledge and Generalization Retention Fine-Tuning (K&G RFT) refers to a family of fine-tuning strategies designed to allow foundation models to acquire new, task-specific capabilities while simultaneously preserving—or even enhancing—the broad, general-purpose skills and knowledge acquired during pre-training. This paradigm is applicable to LLMs, vision-language models (VLMs), reinforcement learning (RL) agents, and medical imaging backbones, and addresses the central challenge of catastrophic forgetting: the tendency for fine-tuned models to overwrite or lose previously acquired abilities.

1. Core Principles and Mechanisms of K&G RFT

K&G RFT aims to balance two competing objectives: (i) adapting a model to new domains or tasks via exposure to limited, possibly domain-specific data, and (ii) retaining the valuable generalization capacity built during pre-training. The canonical K&G RFT protocols achieve this balance by interleaving task adaptation losses with regularization or knowledge preservation terms, and/or by using architectural modifications that constrain parameter drift.

Key mechanisms include:

  • Knowledge Distillation (KD): A "student" model is penalized for deviating, on selected data, from the outputs of a "teacher" model capturing the pre-trained (or previously fine-tuned) state. In sequential fine-tuning contexts, this is often operationalized via a loss such as L_kd = MSE(E_t(x), E_{t-1}(x)), where E_t and E_{t-1} are the current and previous encoders, respectively.
  • Low-Rank Adaptation (LoRA) with Selective Activation: Instead of fine-tuning all or even a fixed subset of model weights, lightweight LoRA modules are appended and activated only in a learnable and sparsity-controlled manner. Post-adaptation, these low-rank residuals can be merged back into the base model as W' = W + BA, preserving the bulk of pre-trained weights.
  • Buffered/Representative Data Selection: Algorithms such as Maximum Data Similarity (MDS) select downstream data samples maximally similar to the pre-training distribution as KD anchors, focusing knowledge retention where it is most likely to generalize (Ye et al., 7 Sep 2025).
  • Loss Interpolations and Weighted Objectives: Practical instantiations often employ an aggregate objective L_total = L_task + λ_kd · L_kd + λ_lora · L_lora, with careful tuning of the λ coefficients to navigate the tradeoff between adaptation and overfitting/generalization loss.
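The mechanisms above can be sketched as a single weighted objective. This is an illustrative numpy sketch, not the paper's implementation: the MSE-based KD term follows the formulation given earlier, while the λ values and the `lora_penalty` argument are placeholder assumptions.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two feature arrays."""
    return float(np.mean((a - b) ** 2))

def total_loss(task_loss, feats_current, feats_previous,
               lora_penalty=0.0, lam_kd=0.1, lam_lora=0.01):
    """Aggregate K&G RFT objective:
    L_total = L_task + lam_kd * L_kd + lam_lora * L_lora.
    feats_current / feats_previous are the encoder outputs
    E_t(x) and E_{t-1}(x) on buffered KD anchor samples."""
    l_kd = mse(feats_current, feats_previous)
    return task_loss + lam_kd * l_kd + lam_lora * lora_penalty

# Usage: identical encoder outputs incur no KD penalty.
x = np.ones((4, 16))
print(total_loss(0.5, x, x))  # -> 0.5
```

When the current encoder drifts from the previous one on the anchor set, the KD term grows and pulls the update back toward the retained representation.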

These approaches are often realized in a multi-stage or sequential protocol where adaptation to a new task is interleaved with distillation or constrained residual tuning.

2. Sequential and LoRA-based Knowledge Distillation in MedSeqFT

MedSeqFT is a representative sequential fine-tuning framework specifically designed for 3D medical image segmentation (Ye et al., 7 Sep 2025). Its K&G RFT implementation introduces a two-phase protocol:

  • Phase 1: Standard fine-tuning on a new segmentation task (with Dice and cross-entropy losses), but accompanied by a KD loss ensuring the output of the new encoder on buffered, representative data is aligned (in the MSE sense) to outputs of the previous encoder.
  • Phase 2: The base encoder is frozen, and LoRA modules are inserted in every linear layer. Only the LoRA parameters are optimized using another MSE-based refinement loss. Post-optimization, LoRA modules are merged back into the main backbone, ensuring only a low-rank residual is added per layer with minimal impact on the high-dimensional pre-trained weights.

The process can be formulated as W' = W + ΔW = W + BA, with A ∈ R^{r×k} and B ∈ R^{d×r} (low rank, r ≪ min(d, k)). This ensures the parameter update is "soft," affecting only the subspace necessary for the new task while minimizing loss of general representations.
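A minimal numerical illustration of this merge step (shapes and initialization scales are hypothetical, chosen only to show that the residual is rank-limited):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 32, 4                      # rank r << min(d, k)

W = rng.standard_normal((d, k))          # frozen pre-trained weight
B = rng.standard_normal((d, r)) * 0.01   # trainable LoRA factor
A = rng.standard_normal((r, k)) * 0.01   # trainable LoRA factor

delta_W = B @ A              # low-rank residual, rank <= r
W_merged = W + delta_W       # W' = W + BA, merged back after tuning

assert np.linalg.matrix_rank(delta_W) <= r
assert W_merged.shape == W.shape
```

Because ΔW = BA can have rank at most r, the merged update perturbs only a small subspace of the d × k weight matrix, which is the sense in which the adaptation is "soft."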

The KD loss is applied on a buffer of samples selected to maximize similarity to the pre-training distribution (via MDS), which directly targets generalization preservation.
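One plausible reading of MDS-style selection is to rank downstream samples by the similarity of their embeddings to a summary of the pre-training distribution. The sketch below uses cosine similarity to a pre-training feature centroid; the paper's exact similarity measure and reference statistic may differ.

```python
import numpy as np

def select_kd_buffer(embeddings, pretrain_centroid, buffer_size):
    """Return indices of the buffer_size samples whose embeddings
    are most cosine-similar to the pre-training feature centroid."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    c = pretrain_centroid / np.linalg.norm(pretrain_centroid)
    sims = emb @ c                       # cosine similarity per sample
    return np.argsort(-sims)[:buffer_size]

# Usage: from 100 candidate samples, keep the 10 most similar as KD anchors.
rng = np.random.default_rng(1)
idx = select_kd_buffer(rng.standard_normal((100, 8)),
                       rng.standard_normal(8), 10)
print(idx.shape)  # -> (10,)
```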

3. Empirical Performance, Transferability, and Robustness

K&G RFT, as instantiated in MedSeqFT and related protocols, achieves consistently higher in-domain and out-of-domain performance when compared with both:

  • Full Fine-Tuning (FFT): Where all weights are updated and catastrophic forgetting is frequent,
  • Parameter-Efficient Alternatives (e.g., standard LoRA/adapters): Where only a small parameter subset is tuned on task data but with lower task performance.

For instance, MedSeqFT achieved an average Dice score improvement of 3.0% and a reduction in 95th percentile Hausdorff Distance by 10mm over FFT in a 5-task CT segmentation benchmark (Ye et al., 7 Sep 2025). Qualitative improvements include clearer tumor delineation and improved alignment with expert-annotated ground truth.
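For reference, the Dice score quoted in these comparisons measures volumetric overlap between a predicted mask and the ground truth; a standard binary formulation (the smoothing constant is an illustrative convention) is:

```python
import numpy as np

def dice_score(pred, target, eps=1e-6):
    """Dice = 2 * |P ∩ G| / (|P| + |G|) for binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

p = np.array([1, 1, 0, 0])
g = np.array([1, 0, 0, 0])
print(round(dice_score(p, g), 3))  # -> 0.667
```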

Moreover, models refined via K&G RFT protocols exhibit superior transferability to unseen tasks. For example, after sequential adaptation to 10 different 3D segmentation tasks, evaluation on COVID-19-20 lung and KiTS kidney segmentation benchmarks showed higher Dice scores and more robust generalization versus direct fine-tuning from the original pre-trained model.

Loss landscape and parameter variation analyses further reveal that K&G RFT-trained models converge to smoother minima and require only minimal modifications in deeper network layers. This constrains excessive parameter drift—a hallmark cause of forgetting.

4. Comparison with Parallel, Multi-task, and Full-Tuning Approaches

Traditional parallel fine-tuning, where a separate "branch" is maintained per task, fails to capitalize on shared knowledge and scales poorly with the number of tasks. Joint multi-task fine-tuning leverages shared information but requires simultaneous access to all datasets, which is infeasible in continually evolving application settings and can cause optimization conflicts.

Full fine-tuning approaches, although straightforward, overwrite earlier model knowledge with features tailored to the current dataset, resulting in severe generalization loss, especially under distributional or domain shifts.

In contrast, K&G RFT mechanisms—especially when used with sequential pipelines—preserve both the flexibility to incorporate new tasks incrementally and the ability to retain and reuse general, reusable representations. These benefits are empirically substantiated by improved Dice/HD95 metrics and superior robustness across both familiar and unseen domains (Ye et al., 7 Sep 2025).

5. Architectural and Algorithmic Features Enabling Knowledge Retention

Several architectural and algorithmic features are critical to the success of K&G RFT:

  • LoRA Module Freezing and Residualization: Limiting adaptation to low-rank subspaces prevents widespread modification of core features.
  • Two-Stage KD Loss: Initial "soft" alignment via MSE encourages gentle adaptation, while "hard" LoRA refinement locks the extent of permissible weight change.
  • Buffered Knowledge Distillation: Only a subset of data, maximally similar to pre-training samples, is used for KD, ensuring stability and robustness.
  • Layer-Wise Analysis: Only a small subset of network parameters, typically in deeper layers, undergo significant adaptation, enabling both task specialization and generalized knowledge retention with minimal interference.
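The layer-wise analysis above amounts to comparing each layer's weights against the pre-trained checkpoint. A sketch of that diagnostic (layer names and the relative Frobenius-norm measure are illustrative assumptions):

```python
import numpy as np

def layer_drift(pretrained, finetuned):
    """Relative per-layer weight change ||W_t - W_0|| / ||W_0||.
    Large values flag layers that adapted significantly."""
    return {name: float(np.linalg.norm(finetuned[name] - w0)
                        / np.linalg.norm(w0))
            for name, w0 in pretrained.items()}

rng = np.random.default_rng(2)
w0 = {"enc.layer1": rng.standard_normal((8, 8)),
      "enc.layer4": rng.standard_normal((8, 8))}
wt = {"enc.layer1": w0["enc.layer1"] + 0.01,   # barely moved
      "enc.layer4": w0["enc.layer4"] * 1.5}    # adapted substantially
drift = layer_drift(w0, wt)
print(drift["enc.layer4"] > drift["enc.layer1"])  # -> True
```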

The effectiveness of these features is reflected in observed performance improvements, both in the metrics cited above and in qualitative robustness to loss landscape perturbations.

6. Practical Considerations and Limitations

K&G RFT protocols, while empirically effective, entail modest increases in training time due to the knowledge distillation and LoRA tuning phases (e.g., an additional ∼5.9 hours over FFT in MedSeqFT (Ye et al., 7 Sep 2025)). The computational and hyperparameter tuning overheads (including regularization coefficient selection and buffer design) are offset by gains in both segmentation performance and generalization, as well as the capacity for continual, incremental domain adaptation.

The approach is generally model-agnostic and compatible with other parameter-efficient fine-tuning methods but may require further optimization for large-scale multi-modal or real-time applications.

7. Future Directions and Broader Implications

The demonstrated success of K&G RFT indicates substantial promise for its use in domains where continual adaptation to new data or tasks is required. For clinical medical image segmentation, incremental domain updates are a routine necessity, and K&G RFT offers an effective strategy to maintain up-to-date models without full retraining. Future research may:

  • Develop more efficient selective refinement techniques to identify precisely which network layers and parameters require adaptation;
  • Extend the paradigm to multi-modal settings or broader classes of tasks beyond segmentation, including classification, detection, and retrieval;
  • Optimize selection of buffered data for knowledge distillation, potentially leveraging automated data similarity or representational drift monitoring.

K&G RFT represents an advance in fine-tuning methodology, achieving an effective tradeoff between adaptation and generalization, and enabling practical, robust deployment of continually evolving foundation models in complex application domains (Ye et al., 7 Sep 2025).
