FL2T: Forget Less by Learning Together
- FL2T is a continual learning framework that enables order-agnostic, concept-incremental learning, mitigating catastrophic forgetting in diffusion models and federated environments.
- It employs a set-invariant proxy module with transformer decoders to capture inter-concept interactions through prompt-conditioned attention and contrastive regularization.
- FL2T’s dual regularization and dynamic parameter partitioning preserve prior knowledge efficiently, as shown by improved image and text alignment metrics on benchmarks.
Forget Less by Learning Together (FL2T) is a continual learning framework addressing catastrophic forgetting in models tasked with incrementally acquiring new knowledge without compromising previously learned skills. Originally proposed for custom diffusion models (CDMs), FL2T facilitates concurrent, order-agnostic concept learning by leveraging inter-concept interactions and dynamic parameter partitioning, thereby maintaining knowledge retention in both centralized and federated settings (Kaushik et al., 5 Jan 2026, Guo et al., 17 Mar 2025). The FL2T paradigm is supported by a set-invariant module that guides feature selection using learned proxies, and is broadly applicable across multimodal tasks, including instruction tuning in federated learning.
1. Order-Agnostic Concept-Incremental Learning
FL2T is motivated by the limitations of standard continual learning protocols, which typically require a fixed, sequential ordering of tasks. FL2T, by contrast, explicitly operates in an order-agnostic and concept-incremental regime: given $T$ customization tasks, new concepts can be introduced in any permutation of $\{1, \dots, T\}$. For each task $t$, a dataset $\mathcal{D}_t = \{(x_i, p_i, c_t)\}_i$ is provided, where $x_i$ denotes sample images, $p_i$ are textual prompts, and $c_t$ are the associated concept tokens. Three critical constraints are imposed:
- Distinct concepts: $c_i \neq c_j$ for all $i \neq j$; cross-task concept duplication is disallowed.
- Order-agnosticity: Task order can be arbitrarily permuted during training.
- No replay: Past task data are not stored or retrievable in future iterations.
This formulation is particularly suitable for CDMs and federated learning scenarios where the arrival order of new tasks/concepts is typically uncontrolled and privacy considerations preclude data sharing (Kaushik et al., 5 Jan 2026, Guo et al., 17 Mar 2025).
2. Set-Invariant Inter-Concept Interaction via Proxies
Central to FL2T is a permutation-invariant inter-concept learning module integrating transformer decoders with proxy embeddings. For each previously learned concept, a stable embedding $e_j$ is maintained. Proxies $z_j$ (initialized from the concept embeddings, $z_j = e_j$) capture contextualized relevance conditioned on the current task. The process involves:
- Self-attention over proxies and cross-attention between proxies (query) and the target prompt's embedding (key/value) through stacked transformer decoder layers.
- Refinement: The output, denoted $\tilde{z}_j$, encodes prompt-conditioned inter-concept dependencies.
- Concept representation: Each refined proxy $\tilde{z}_j$, concatenated with its original concept embedding $e_j$, is passed through an MLP to yield a consolidated representation for each prior concept $j$.
- Contrastive regularization: To prevent over-coupling (“rank collapse”), a set-level contrastive loss is imposed over the refined proxies, built from cosine similarities scaled by a temperature hyperparameter $\tau$.
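The contrastive regularizer above can be sketched as an InfoNCE-style loss over the proxy set. This is a minimal NumPy sketch, assuming a diagonal pairing scheme (each refined proxy paired with its own original embedding); the function and variable names are illustrative, not the paper's formulation:

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def set_contrastive_loss(refined, original, tau=0.1):
    """InfoNCE-style loss: each refined proxy should stay close to its
    own original concept embedding and away from the others, discouraging
    the proxy set from collapsing onto one direction ("rank collapse")."""
    logits = cosine_sim(refined, original) / tau           # (n, n)
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                    # positives on diagonal

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))                              # 5 concept embeddings
# Well-separated refinement vs. a fully collapsed proxy set:
loss_aligned = set_contrastive_loss(emb + 0.01 * rng.normal(size=emb.shape), emb)
loss_collapsed = set_contrastive_loss(np.tile(emb.mean(0), (5, 1)), emb)
```

Collapse of the proxies onto a single direction drives the loss up, which is exactly the failure mode the regularizer penalizes.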
Proxies are thus used to learn prompt-dependent relevance weights $w_j$ for all prior concepts, determining which of them most effectively guide knowledge transfer and retention for the current learning episode (Kaushik et al., 5 Jan 2026).
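The decoder-based refinement and relevance weighting can be illustrated with a single-head, projection-free attention sketch in NumPy; the pooled-prompt scoring and all names are illustrative assumptions, not the paper's parameterization:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention (single head, no learned projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def proxy_decoder_layer(proxies, prompt_emb):
    """One decoder layer: self-attention over the proxy set, then
    cross-attention with proxies as queries and the prompt token
    embeddings as keys/values; residuals keep the update incremental."""
    h = proxies + attention(proxies, proxies, proxies)      # self-attention
    h = h + attention(h, prompt_emb, prompt_emb)            # cross-attention
    return h

def relevance_weights(refined, prompt_vec, tau=0.5):
    """Softmax-normalized cosine similarity between each refined proxy and
    a pooled prompt vector: a simple stand-in for learned weights w_j."""
    r = refined / np.linalg.norm(refined, axis=1, keepdims=True)
    p = prompt_vec / np.linalg.norm(prompt_vec)
    return softmax(r @ p / tau)

rng = np.random.default_rng(1)
proxies = rng.normal(size=(4, 16))        # one proxy per prior concept
prompt = rng.normal(size=(7, 16))         # 7 prompt token embeddings
refined = proxy_decoder_layer(proxies, prompt)
w = relevance_weights(refined, prompt.mean(axis=0))
```

Because no positional encodings are used, permuting the proxy set simply permutes the outputs, which is the set-invariance property the module relies on.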
3. Regularized Knowledge Consolidation and Parameter Partitioning
FL2T generalizes concept consolidation by combining task-specific and task-shared regularization:
- Task-Specific Knowledge (TSP): Orthogonality is enforced between the low-rank LoRA subspace of the new task and those of previous tasks, weighted by the learned relevance weights $w_j$.
- Task-Shared Knowledge (TSH): Each task's subspace update is encouraged to align with a global shared subspace via a projection operator.
- Full objective: The total loss for task $t$ combines the base adaptation loss with the TSP, TSH, and contrastive regularizers, with trade-off coefficients set to $0.1$ empirically.
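One plausible concretization of the two regularizers, assuming the TSP penalty is a relevance-weighted Frobenius overlap between LoRA subspaces and the TSH penalty a projection residual onto the shared subspace (both are sketches under those assumptions, not the paper's exact losses):

```python
import numpy as np

def tsp_orthogonality(A_new, prev_subspaces, weights):
    """Task-specific penalty: relevance-weighted squared Frobenius norm of
    the overlap between the new LoRA subspace and each previous one.
    Zero when the column spaces are mutually orthogonal."""
    return sum(w * np.linalg.norm(A_new.T @ A_prev) ** 2
               for w, A_prev in zip(weights, prev_subspaces))

def tsh_alignment(A_new, S_shared):
    """Task-shared penalty: residual of the new update after projecting
    onto the column space of the shared subspace S_shared."""
    P = S_shared @ np.linalg.pinv(S_shared)   # projector onto col(S_shared)
    return np.linalg.norm(A_new - P @ A_new) ** 2

rng = np.random.default_rng(2)
d, r = 32, 4
Q = np.linalg.qr(rng.normal(size=(d, 2 * r)))[0]
A_prev, A_new = Q[:, :r], Q[:, r:]            # orthonormal, mutually orthogonal
loss_orth = tsp_orthogonality(A_new, [A_prev], [1.0])   # ~0: no interference
loss_self = tsp_orthogonality(A_prev, [A_prev], [1.0])  # = r: full overlap
```

The TSP term therefore vanishes exactly when the new task's update avoids directions already claimed by prior tasks, while the TSH term vanishes when the update stays inside the shared subspace.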
This dual regularization realizes flexible knowledge partitioning. In federated continual instruction tuning, dynamic parameter partitioning is executed by maintaining a cache of low-rank LoRA subspaces indexed per task and gated at inference via subspace selective activation (SSA), using identity tokens and cosine relevance scores (Guo et al., 17 Mar 2025).
4. Training and Inference Procedure
FL2T training is performed in two main stages:
- Independent concept adaptation: For each concept, fine-tune a copy of the base model (e.g., UNet for diffusion; LMM for instruction tuning) with a dedicated LoRA adapter, then extract the stable concept embedding $e_j$.
- Order-agnostic aggregation: For any incoming task $t$:
- Initialize proxies $z_j$ from the stored concept embeddings $e_j$.
- Apply the transformer-based proxy module to the proxy set, conditioned on the current task's prompt embedding.
- Compute relevance weights $w_j$ and aggregate loss contributions according to proxy-guided regularization.
- Update the LoRA weights (and, if applicable, global subspaces and projections) via gradient descent.
Because the aggregation step is permutation-invariant, FL2T is robust to task order, and the modular design enables direct applicability to federated scenarios, where per-task knowledge is stored in disentangled subspaces and activated selectively per input (Guo et al., 17 Mar 2025).
At inference, SSA combines task-specific subspaces via relevance-weighted gating, with gate weights set by softmax-normalized cosine similarity between the test instruction and each task's identity token.
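The SSA gating step can be sketched as follows; the identity tokens, temperature, and soft mixing of cached LoRA deltas are assumed details of this illustration, not guaranteed to match the DISCO implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ssa_combine(instruction_emb, identity_tokens, lora_deltas, tau=0.1):
    """Subspace Selective Activation sketch: score the test instruction
    against each task's identity token by cosine similarity, softmax the
    scores into gates, and mix the cached per-task LoRA deltas by them."""
    q = instruction_emb / np.linalg.norm(instruction_emb)
    K = identity_tokens / np.linalg.norm(identity_tokens, axis=1, keepdims=True)
    gates = softmax((K @ q) / tau)
    delta = sum(g * d for g, d in zip(gates, lora_deltas))
    return delta, gates

rng = np.random.default_rng(3)
ids = rng.normal(size=(3, 16))                 # one identity token per task
deltas = [rng.normal(size=(8, 8)) for _ in range(3)]
# A test instruction close to task 1's identity token activates its subspace:
delta, gates = ssa_combine(ids[1] + 0.05 * rng.normal(size=16), ids, deltas)
```

With a low temperature, the gating is nearly hard: the delta of the most relevant task dominates, which is how per-task knowledge stays disentangled at inference.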
5. Empirical Results and Comparative Analysis
Empirical evaluations substantiate FL2T’s effectiveness at mitigating catastrophic forgetting and improving retention:
- On the CIFC benchmark (10 concepts), FL2T with proxy guidance improves CLIP Image Alignment (IA) scores by +1.8 points (78.0→79.8) and Text Alignment (TA) by +0.6 (74.8→75.4) over CIDM, which was the state-of-the-art sequential method at the time (Kaushik et al., 5 Jan 2026).
- On CelebA and ImageNet, similar or greater improvements are observed in both IA and TA metrics.
- In ablation studies, "proxy-guidance" consistently outperforms both "no guidance" and cosine-only alternatives, indicating the importance of contextually learned relevance. Fewer reference images and lower LoRA ranks are required for FL2T to match or exceed baseline performance, confirming parameter efficiency.
- In federated continual instruction tuning, the DISCO framework, an instantiation of FL2T with Dynamic Knowledge Organization (DKO) and SSA, achieves Last/Avg test scores of 55.47%/62.07% (Task-related Hom-FCIT, β=1.0), a +4.57%/+1.89% gain over the best baseline (O-LoRA). Benefits increase under higher data heterogeneity and larger numbers of tasks (Guo et al., 17 Mar 2025).
6. Theoretical Properties and Scalability
Theoretical analysis demonstrates that proxy-guided relevance weights can provably reduce one-step model drift (i.e., the aggregate shift in parameter space when acquiring new knowledge) relative to uniform aggregation schemes, thus better preserving previous task performance (Kaushik et al., 5 Jan 2026). The set-invariant mechanism ensures scalability as the number of concepts/tasks increases, with stable improvement in knowledge retention observed for up to 30 concurrent tasks. Careful selection of transformer module depth (two layers empirically optimal) mitigates risks of rank collapse and over-aggregation.
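The drift-reduction claim can be illustrated with a toy comparison: when only one prior concept is actually related to the new task, down-weighting the unrelated updates shrinks the one-step shift of the shared parameters. This is purely illustrative of the intuition, not the paper's formal argument:

```python
import numpy as np

rng = np.random.default_rng(4)
theta_old = rng.normal(size=16)                        # current shared parameters

# Candidate parameter updates: one related concept (small shift),
# two unrelated concepts (large shifts).
updates = [theta_old + 0.01 * rng.normal(size=16),
           theta_old + 1.00 * rng.normal(size=16),
           theta_old + 1.00 * rng.normal(size=16)]

def aggregate(weights):
    """Convex combination of the candidate updates."""
    return sum(w * u for w, u in zip(weights, updates))

uniform = aggregate([1 / 3] * 3)                       # order-/relevance-blind
weighted = aggregate([0.90, 0.05, 0.05])               # proxy-guided weights

drift_uniform = np.linalg.norm(uniform - theta_old)
drift_weighted = np.linalg.norm(weighted - theta_old)
```

Because both schemes use convex weights, each aggregate stays near the old parameters in proportion to how much mass lands on unrelated updates; the relevance-weighted scheme moves the model far less.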
7. Applications and Extensions
FL2T’s conceptual and algorithmic framework is directly applicable to:
- Customization of diffusion models, enabling order-agnostic, scalable incremental concept learning with minimal forgetting (Kaushik et al., 5 Jan 2026).
- Federated multimodal instruction tuning, where distributed clients must acquire new skills while respecting privacy and communication constraints (Guo et al., 17 Mar 2025).
- Any domain requiring parameter-efficient continual learning with protection against interference, including large language and vision models, provided that knowledge modularization and prompt-conditioning are feasible.
A plausible implication is that proxy-mediated, set-invariant parameter composition and selective subspace gating could generalize to other forms of continual and federated learning, beyond generative models.
References:
- "Forget Less by Learning Together through Concept Consolidation" (Kaushik et al., 5 Jan 2026)
- "Federated Continual Instruction Tuning" (Guo et al., 17 Mar 2025)