
FL2T: Forget Less by Learning Together

Updated 12 January 2026
  • FL2T is a continual learning framework that enables order-agnostic, concept-incremental learning, mitigating catastrophic forgetting in diffusion models and federated environments.
  • It employs a set-invariant proxy module with transformer decoders to capture inter-concept interactions through prompt-conditioned attention and contrastive regularization.
  • FL2T’s dual regularization and dynamic parameter partitioning preserve prior knowledge efficiently, as shown by improved image and text alignment metrics on benchmarks.

Forget Less by Learning Together (FL2T) is a continual learning framework addressing catastrophic forgetting in models tasked with incrementally acquiring new knowledge without compromising previously learned skills. Originally proposed for custom diffusion models (CDMs), FL2T facilitates concurrent, order-agnostic concept learning by leveraging inter-concept interactions and dynamic parameter partitioning, thereby maintaining knowledge retention in both centralized and federated settings (Kaushik et al., 5 Jan 2026, Guo et al., 17 Mar 2025). The FL2T paradigm is supported by a set-invariant module that guides feature selection using learned proxies, and is broadly applicable across multimodal tasks, including instruction tuning in federated learning.

1. Order-Agnostic Concept-Incremental Learning

FL2T is motivated by the limitations of standard continual learning protocols, which typically require a fixed, sequential ordering of tasks. FL2T, by contrast, explicitly operates in an order-agnostic and concept-incremental regime: given $G$ customization tasks $T = \{T_1,\ldots,T_G\}$, new concepts can be introduced in any permutation $\pi(T_1,\ldots,T_G)$. For each task $g$, a dataset $T_g = \{(x_g^k, p_g^k, y_g^k)\}$ is provided, where $x_g^k$ denotes sample images, $p_g^k$ are textual prompts, and $y_g^k$ are associated concept tokens. Three critical constraints are imposed:

  1. Distinct concepts: $Y_g \cap (\bigcup_{i<g} Y_i) = \emptyset$ for all $g$; cross-task concept duplication is disallowed.
  2. Order-agnosticity: Task order can be arbitrarily permuted during training.
  3. No replay: Past task data are not stored or retrievable in future iterations.

This formulation is particularly suitable for CDMs and federated learning scenarios where the arrival order of new tasks/concepts is typically uncontrolled and privacy considerations preclude data sharing (Kaushik et al., 5 Jan 2026, Guo et al., 17 Mar 2025).
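The three constraints can be made concrete with a small sketch (hypothetical task data; `check_distinct_concepts` and the concept tokens are illustrative, not part of FL2T's published interface):

```python
import random

# Toy illustration of the order-agnostic, concept-incremental protocol:
# disjoint concept-token sets, arbitrary task permutation, no replay.
tasks = {
    1: {"concepts": {"dog*"},    "images": ["dog_01.png"]},
    2: {"concepts": {"mug*"},    "images": ["mug_01.png"]},
    3: {"concepts": {"castle*"}, "images": ["castle_01.png"]},
}

def check_distinct_concepts(tasks):
    """Constraint 1: Y_g must not overlap any other task's concept set."""
    seen = set()
    for g in sorted(tasks):
        if tasks[g]["concepts"] & seen:
            return False
        seen |= tasks[g]["concepts"]
    return True

order = list(tasks)
random.shuffle(order)   # Constraint 2: any permutation is a valid arrival order
replay_buffer = []      # Constraint 3: never populated; past data is unavailable
```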

2. Set-Invariant Inter-Concept Interaction via Proxies

Central to FL2T is a permutation-invariant inter-concept learning module integrating transformer decoders with proxy embeddings. For each previously learned concept, a stable embedding $C_i$ is maintained. Proxies $P_i$ (initialized as $P_i \gets C_i$) capture contextualized relevance conditioned on the current task. The process involves:

  • Self-attention over proxies $\{P_i\}$ and cross-attention between proxies (query) and the target prompt's embedding $c_g$ (key/value) through stacked transformer decoder layers.
  • Refinement: The output, denoted $\{P_i'\}$, encodes prompt-conditioned inter-concept dependencies.
  • Concept representation: Each refined proxy, concatenated with its original concept embedding, is passed through an MLP $f$ to yield $S_i = f([C_i; P_i'])$ for $i \neq g$.
  • Contrastive regularization: To prevent over-coupling ("rank collapse"), a set-level contrastive loss is imposed:

$$R_3 = \frac{1}{G} \sum_{i=1}^G -\log\frac{\exp(\mathrm{sim}(S_i, S_i)/\tau)}{\sum_{j=1}^G \exp(\mathrm{sim}(S_i, S_j)/\tau)}$$

where $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity and $\tau$ is a temperature hyperparameter.

Proxies are used to learn prompt-dependent relevance weights $\lambda_i = C_g \cdot S_i^\top$ for all $i \neq g$, determining which prior concepts most effectively guide knowledge transfer and retention for the current learning episode (Kaushik et al., 5 Jan 2026).
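A minimal NumPy sketch of the proxy pipeline, under simplifying assumptions (single-head attention, one decoder layer, a random matrix standing in for the MLP $f$, and the prompt embedding $c_g$ standing in for $C_g$ in the relevance weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d, G = 8, 4                       # embedding dim, number of prior concepts (toy)

C = rng.normal(size=(G, d))       # stable concept embeddings C_i
c_g = rng.normal(size=(1, d))     # current prompt embedding

def attention(Q, K, V):
    """Scaled dot-product attention (single head, no masking)."""
    A = Q @ K.T / np.sqrt(Q.shape[1])
    A = np.exp(A - A.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

# One simplified decoder layer: self-attention over the proxy set,
# then cross-attention with the prompt embedding as key/value.
P = C.copy()                         # proxies initialized as P_i <- C_i
P = P + attention(P, P, P)           # self-attention (residual)
P_ref = P + attention(P, c_g, c_g)   # cross-attention (residual)

W = rng.normal(size=(2 * d, d)) / np.sqrt(2 * d)     # stand-in for the MLP f
S = np.tanh(np.concatenate([C, P_ref], axis=1) @ W)  # S_i = f([C_i; P_i'])

lam = (c_g @ S.T).ravel()            # relevance weights lambda_i

def contrastive_R3(S, tau=0.1):
    """Set-level contrastive loss discouraging proxy collapse."""
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    logits = np.exp(Sn @ Sn.T / tau)
    return float(np.mean(-np.log(np.diag(logits) / logits.sum(axis=1))))
```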

3. Regularized Knowledge Consolidation and Parameter Partitioning

FL2T generalizes concept consolidation by combining task-specific and task-shared regularization:

  • Task-Specific Knowledge (TSP): Orthogonality is enforced between the low-rank LoRA subspaces $A_g^l$ (new task) and $A_i^l$ (previous tasks), weighted by learned $\lambda_i$:

$$R'_1 = \sum_{i\ne g} \sum_{l=1}^L \lambda_i \,\mathrm{Tr}[A_i^l (A_g^l)^\top]$$

  • Task-Shared Knowledge (TSH): Each task's subspace update $\Delta W_i^l$ is encouraged to align with a global shared subspace $W_*^l$ via projection $H_i^l$:

$$R_2 = \sum_{i=1}^g \sum_{l=1}^L \|\Delta W_i^l - H_i^l W_*^l\|_F^2$$

  • Full objective: The total loss for task $g$ is:

$$\mathcal{L}(\theta'_g) = \mathbb{E}_{\cdots}\left[ \|\epsilon - \epsilon_{\theta'_g}(z_t \mid c_g^k, t)\|_2^2 \right] + R'_1 + \gamma_1 R_2 + \gamma_2 R_3$$

with $\gamma_1, \gamma_2$ as trade-off coefficients (both $0.1$ empirically).
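The two regularizers can be sketched numerically as follows (toy shapes; the absolute value inside `R1_prime` is our addition to keep the penalty non-negative, since the raw trace term is sign-indefinite):

```python
import numpy as np

rng = np.random.default_rng(1)
L_layers, r, d = 2, 4, 16           # toy sizes: layers, LoRA rank, width

# Low-rank LoRA "A" factors for two prior tasks and the incoming task g
A_prev = [[rng.normal(size=(r, d)) for _ in range(L_layers)] for _ in range(2)]
A_g = [rng.normal(size=(r, d)) for _ in range(L_layers)]
lam = np.array([0.7, 0.3])          # stand-in proxy relevance weights

def R1_prime(A_prev, A_g, lam):
    """Relevance-weighted cross-task orthogonality penalty (TSP)."""
    return sum(w * abs(np.trace(Ai @ Ag.T))
               for w, layers in zip(lam, A_prev)
               for Ai, Ag in zip(layers, A_g))

def R2(dW, H, W_star):
    """Alignment of each task's update with the shared subspace (TSH)."""
    return sum(np.linalg.norm(dW[i][l] - H[i][l] @ W_star[l], "fro") ** 2
               for i in range(len(dW))
               for l in range(len(W_star)))

W_star = [rng.normal(size=(d, d)) for _ in range(L_layers)]
H = [[np.eye(d) for _ in range(L_layers)] for _ in range(2)]
dW = [[rng.normal(size=(d, d)) for _ in range(L_layers)] for _ in range(2)]
```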

This dual regularization realizes flexible knowledge partitioning. In federated continual instruction tuning, dynamic parameter partitioning is executed by maintaining a cache of low-rank LoRA subspaces indexed per task and gated at inference via subspace selective activation (SSA), using identity tokens and cosine relevance scores (Guo et al., 17 Mar 2025).

4. Training and Inference Procedure

FL2T training is performed in two main stages:

  1. Independent concept adaptation: For each concept, fine-tune a copy of the base model (e.g., UNet for diffusion; LMM for instruction tuning) with a dedicated LoRA adapter, then extract the stable embedding $C_i$.
  2. Order-agnostic aggregation: For any incoming task $g$:
    • Initialize proxies $P_i \gets C_i$.
    • Apply the transformer-based proxy module over $\{P_i\}$ conditioned on $c_g$.
    • Compute relevance weights and aggregate loss contributions according to proxy-guided regularization.
    • Update the LoRA weights $\Delta\theta_g$ (and, if applicable, global subspaces and projections) via gradient descent.

Because the aggregation step is permutation-invariant, FL2T is robust to task order, and the modular design enables direct applicability to federated scenarios, where per-task knowledge is stored in disentangled subspaces and activated selectively per input (Guo et al., 17 Mar 2025).
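The claimed permutation invariance is easy to check on a toy relevance-weighted aggregation (hypothetical weights and contributions):

```python
import numpy as np

rng = np.random.default_rng(2)
lam = rng.random(5)                 # relevance weights lambda_i (toy)
contrib = rng.normal(size=(5, 8))   # per-concept loss contributions (toy)

def aggregate(lam, contrib):
    """Relevance-weighted sum over the concept set: a set operation,
    so reordering the concepts cannot change the result."""
    return (lam[:, None] * contrib).sum(axis=0)

perm = rng.permutation(5)
same = np.allclose(aggregate(lam, contrib),
                   aggregate(lam[perm], contrib[perm]))
```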

At inference, SSA is used to combine task-specific subspaces via relevance-weighted gating:

$$\Delta W_{\mathrm{SSA}} = \sum_{i=1}^T \alpha_i (B_i A_i)$$

with $\alpha_i$ set by softmax-normalized cosine similarity between the test instruction and each task's identity token.
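A minimal sketch of SSA gating, assuming random stand-ins for the LoRA factors, identity tokens, and instruction embedding:

```python
import numpy as np

rng = np.random.default_rng(3)
T, r, d = 3, 4, 16
B = [rng.normal(size=(d, r)) for _ in range(T)]   # LoRA up-projections B_i
A = [rng.normal(size=(r, d)) for _ in range(T)]   # LoRA down-projections A_i
ident = rng.normal(size=(T, d))                   # per-task identity tokens
q = rng.normal(size=d)                            # test-instruction embedding

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

scores = np.array([cos(q, t) for t in ident])
alpha = np.exp(scores) / np.exp(scores).sum()     # softmax-normalized relevance

# Relevance-weighted composition of task-specific subspaces
dW_ssa = sum(a * (Bi @ Ai) for a, Bi, Ai in zip(alpha, B, A))
```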

5. Empirical Results and Comparative Analysis

Empirical evaluations substantiate FL2T’s effectiveness at mitigating catastrophic forgetting and improving retention:

  • On the CIFC benchmark (10 concepts), FL2T with proxy guidance improves CLIP Image Alignment (IA) scores by +1.8 points (78.0→79.8) and Text Alignment (TA) by +0.6 (74.8→75.4) over CIDM, which was the state-of-the-art sequential method at the time (Kaushik et al., 5 Jan 2026).
  • On CelebA and ImageNet, similar or greater improvements are observed in both IA and TA metrics.
  • In ablation studies, "proxy-guidance" consistently outperforms both "no guidance" and cosine-only alternatives, indicating the importance of contextually learned relevance. Fewer reference images and lower LoRA ranks are required for FL2T to match or exceed baseline performance, confirming parameter efficiency.
  • In federated continual instruction tuning, the DISCO framework, an instantiation of FL2T with Dynamic Knowledge Organization (DKO) and SSA, achieves Last/Avg test scores of 55.47%/62.07% (Task-related Hom-FCIT, β=1.0), a +4.57%/+1.89% gain over the best baseline (O-LoRA). Benefits increase under higher data heterogeneity and larger numbers of tasks (Guo et al., 17 Mar 2025).

6. Theoretical Properties and Scalability

Theoretical analysis demonstrates that proxy-guided relevance weights $\lambda_i$ can provably reduce one-step model drift (i.e., the aggregate shift in parameter space when acquiring new knowledge) relative to uniform aggregation schemes, thus better preserving previous task performance (Kaushik et al., 5 Jan 2026). The set-invariant mechanism ensures scalability as the number of concepts/tasks increases, with stable improvement in knowledge retention observed for up to 30 concurrent tasks. Careful selection of transformer module depth (two layers empirically optimal) mitigates risks of rank collapse and over-aggregation.

7. Applications and Extensions

FL2T’s conceptual and algorithmic framework is directly applicable to:

  • Customization of diffusion models, enabling order-agnostic, scalable incremental concept learning with minimal forgetting (Kaushik et al., 5 Jan 2026).
  • Federated multimodal instruction tuning, where distributed clients must acquire new skills while respecting privacy and communication constraints (Guo et al., 17 Mar 2025).
  • Any domain requiring parameter-efficient continual learning with protection against interference, including large language and vision models, provided that knowledge modularization and prompt-conditioning are feasible.

A plausible implication is that proxy-mediated, set-invariant parameter composition and selective subspace gating could generalize to other forms of continual and federated learning, beyond generative models.

