Forget Less by Learning from Parents
- FLLP is a hierarchical learning framework preventing catastrophic forgetting in diffusion models by using parent-child relationships in hyperbolic spaces.
- Utilizes Lorentzian geometry to embed concept hierarchies, facilitating structured transfer and maintaining previously learned knowledge.
- Demonstrates consistent improvements in knowledge retention and generalization across synthetic and real-world datasets.
Forget Less by Learning from Parents (FLLP) is a hierarchical continual learning framework designed to address catastrophic forgetting in Custom Diffusion Models (CDMs) by leveraging parent–child relationships among learned concepts within a hyperbolic embedding space. FLLP mitigates destructive interference that arises when new concepts are learned sequentially by modeling positive inter-concept transfer, defining explicit entailment cones in Lorentzian geometry to regulate how novel “child” concept representations align with those of previously learned “parents.” The approach demonstrates consistent improvements in both knowledge retention and generalization across synthetic and real-world datasets (Kaushik et al., 5 Jan 2026).
1. Catastrophic Forgetting in Sequential Concept Learning
Catastrophic forgetting occurs in CDMs when the sequential introduction of concepts $c_1, c_2, \dots$, each with limited reference data, causes gradients from new concepts to overwrite parameterizations for previously acquired ones. Representations for text-to-image diffusion reside across the U-Net backbone, cross-attention mechanisms, and learned token embeddings, making these systems particularly vulnerable. Conventional continual learning (CL) strategies—Elastic Weight Consolidation (EWC), knowledge distillation, and related regularization—focus exclusively on suppressing interference, treating concepts as independent. Such methods fail to capitalize on meaningful conceptual relationships, particularly the compositional and hierarchical structure inherent in natural categories or human-generated labels. FLLP reframes the problem: previously learned concepts provide constructive supervision for adapting to new concepts, effectively serving as inductive biases that can be formally modeled.
2. Hyperbolic Embeddings via the Lorentz Model
FLLP utilizes a negatively curved (hyperbolic) space to encode concept hierarchies, specifically embedding image attention maps into the Lorentz (hyperboloid) model of hyperbolic geometry. This framework naturally accommodates tree-like data structures, as hyperbolic spaces can isometrically embed exponentially expanding graphs. In $(n+1)$-dimensional Lorentzian space of curvature $-c$ (with $c > 0$), the ambient representation is $x = (x_0, x_1, \dots, x_n) \in \mathbb{R}^{n+1}$, equipped with the Lorentzian inner product

$$\langle x, y \rangle_{\mathcal{L}} = -x_0 y_0 + \sum_{i=1}^{n} x_i y_i,$$

with the hyperboloid defined by

$$\mathbb{L}^n_c = \left\{ x \in \mathbb{R}^{n+1} : \langle x, x \rangle_{\mathcal{L}} = -1/c,\ x_0 > 0 \right\}.$$

Distance between embeddings $x$ and $y$ is given by the Lorentzian geodesic

$$d_{\mathcal{L}}(x, y) = \frac{1}{\sqrt{c}}\, \operatorname{arccosh}\!\left( -c\, \langle x, y \rangle_{\mathcal{L}} \right).$$

The exponential map at the origin $\mathbf{o} = (1/\sqrt{c}, 0, \dots, 0)$,

$$\exp_{\mathbf{o}}(v) = \cosh\!\left(\sqrt{c}\,\lVert v \rVert\right) \mathbf{o} + \frac{\sinh\!\left(\sqrt{c}\,\lVert v \rVert\right)}{\sqrt{c}\,\lVert v \rVert}\, v,$$

projects Euclidean tangent vectors into the hyperbolic manifold.
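These Lorentz operations can be sketched in a few lines (unit curvature $c = 1$ by default; the helper names are illustrative, not from the paper):

```python
import math

def linner(x, y):
    """Lorentzian inner product <x, y>_L = -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_dist(x, y, c=1.0):
    """Geodesic distance on the hyperboloid of curvature -c."""
    # Clamp the argument into arccosh's domain to absorb numerical error.
    arg = max(-c * linner(x, y), 1.0)
    return math.acosh(arg) / math.sqrt(c)

def exp0(v, c=1.0):
    """Exponential map at the origin o = (1/sqrt(c), 0, ..., 0):
    lifts a Euclidean tangent vector v onto the manifold."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm < 1e-12:
        return [1.0 / math.sqrt(c)] + [0.0] * len(v)
    s = math.sqrt(c) * norm
    return [math.cosh(s) / math.sqrt(c)] + [math.sinh(s) * a / s for a in v]
```

Note that $\exp_{\mathbf{o}}$ always lands on the hyperboloid ($\langle x, x \rangle_{\mathcal{L}} = -1/c$) and preserves distance from the origin, i.e. $d_{\mathcal{L}}(\mathbf{o}, \exp_{\mathbf{o}}(v)) = \lVert v \rVert$.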
The choice of Lorentzian geometry is thus central both to encoding entailment and to efficiently modeling the exponentially branching concept taxonomies inherent in continual concept learning.
3. Hierarchical Parent–Child Guidance in Concept Embedding
Within the hyperbolic space, each learned concept embedding $p$ defines an “entailment cone,” parameterized by its half-aperture

$$\omega(p) = \arcsin\!\left( \frac{2K}{\sqrt{c}\,\lVert p_{\text{space}} \rVert} \right),$$

where $p_{\text{space}} = (p_1, \dots, p_n)$ and $K$ is a constant governing the aperture near the origin. Given a child concept embedding $x$, the exterior angle from the parent's cone axis is computed as

$$\operatorname{ext}(p, x) = \arccos\!\left( \frac{x_0 + p_0\, c\, \langle p, x \rangle_{\mathcal{L}}}{\lVert p_{\text{space}} \rVert \sqrt{\left( c\, \langle p, x \rangle_{\mathcal{L}} \right)^2 - 1}} \right).$$

FLLP enforces that the child embedding lies within its parent's cone, up to a slack $\epsilon$:

$$\operatorname{ext}(p, x) \le \omega(p) + \epsilon.$$

This constraint formalizes the notion that a newly acquired concept should generalize from, but not drift excessively beyond, the scope defined by relevant parent concepts in the learned hierarchy.
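A minimal sketch of the cone test, using the standard hyperbolic half-aperture and exterior-angle quantities described above (unit curvature, and an illustrative boundary constant `K` — not a value from the paper):

```python
import math

K = 0.1  # cone-boundary constant (assumed value; controls aperture near the origin)

def linner(x, y):
    """Lorentzian inner product <x, y>_L."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def half_aperture(p, c=1.0):
    """Half-aperture of the entailment cone rooted at parent embedding p."""
    space_norm = math.sqrt(sum(a * a for a in p[1:]))
    return math.asin(min(1.0, 2.0 * K / (math.sqrt(c) * space_norm)))

def exterior_angle(p, x, c=1.0):
    """Angle at p between the cone axis (origin -> p) and the geodesic p -> x."""
    space_norm = math.sqrt(sum(a * a for a in p[1:]))
    l = c * linner(p, x)
    num = x[0] + p[0] * l
    den = space_norm * math.sqrt(max(l * l - 1.0, 1e-12))
    return math.acos(max(-1.0, min(1.0, num / den)))

def entailment_violation(p, x, eps=0.0, c=1.0):
    """Hinge penalty: positive only when child x lies outside parent p's cone."""
    return max(0.0, exterior_angle(p, x, c) - half_aperture(p, c) - eps)
```

A child placed deeper along the parent's radial direction incurs (near-)zero penalty, while a child in an orthogonal direction falls outside the cone and is penalized.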
4. Loss Formulation and Training Dynamics
The overall FLLP objective integrates three terms: the standard diffusion reconstruction loss, a parent entailment penalty over image-attention maps, and a consolidation loss on LoRA adapter parameters (as in CIDM):

$$\mathcal{L}_{\text{FLLP}} = \mathcal{L}_{\text{diff}} + \lambda_{\text{ent}}\, \mathcal{L}_{\text{entail}} + \lambda_{\text{cons}}\, \mathcal{L}_{\text{cons}}.$$
Training consists of projecting reference attention maps for each concept into the Lorentzian manifold, computing a parent chain for every novel concept (by iterative nearest-neighbor search in hyperbolic distance, discounting pathological self-loops), and then aggregating the entailment error along this chain. Gradients are jointly back-propagated through the U-Net’s LoRA-adapted layers, cross-attention mechanisms, and hyperbolic projections.
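The parent-chain step can be sketched as a greedy nearest-neighbor walk in hyperbolic distance; `max_depth` and the dictionary-of-embeddings interface are assumptions for illustration:

```python
import math

def lorentz_dist(x, y):
    """Geodesic distance on the unit-curvature hyperboloid."""
    inner = -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))
    return math.acosh(max(-inner, 1.0))

def parent_chain(child_id, embeddings, max_depth=3):
    """Greedy parent chain: hop repeatedly to the nearest other concept in
    hyperbolic distance, tracking visited ids to rule out self-loops/cycles."""
    chain, visited, current = [], {child_id}, child_id
    for _ in range(max_depth):
        candidates = [(lorentz_dist(embeddings[current], emb), cid)
                      for cid, emb in embeddings.items() if cid not in visited]
        if not candidates:
            break
        _, parent = min(candidates)
        chain.append(parent)
        visited.add(parent)
        current = parent
    return chain
```

The entailment error would then be summed (or averaged) over each (parent, child) pair along the returned chain before back-propagation.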
5. Architectural Design and Adaptations
FLLP extends a pretrained Stable Diffusion v1.5 U-Net backbone with LoRA adapters in each transformer layer for efficient personalization. Timestep-weighted cross-attention maps are extracted and summarized into a single aggregate attention representation per concept. These attention summaries are lifted into hyperbolic space for entailment-based regularization. Notably, no novel architectural components are introduced beyond the LoRA adapters, so the core U-Net structure remains unchanged apart from the injected personalization weights.
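As a rough illustration of the attention-summary step, a timestep-weighted average over sampled cross-attention maps might look like the following (the specific weighting scheme, emphasizing later timesteps, is an assumption for this sketch, not specified here):

```python
def summarize_attention(attn_maps, timesteps, T=1000):
    """Timestep-weighted average of flattened cross-attention maps for one
    concept. attn_maps[i] is the map extracted at diffusion step timesteps[i].
    Weighting later (noisier) timesteps more heavily is an assumed choice."""
    weights = [t / T for t in timesteps]
    total = sum(weights)
    dim = len(attn_maps[0])
    return [sum(w * m[i] for w, m in zip(weights, attn_maps)) / total
            for i in range(dim)]
```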
The methodology avoids substantial storage or computation overheads, as only the low-rank adapter parameters and attention summaries are maintained. Separate learning rates are used for the token embeddings and the U-Net; the curvature is learnable and initialized at 1; and the entailment slack is tuned per concept.
6. Experimental Protocols and Comparative Results
FLLP is benchmarked on three datasets—CIFC (synthetic concepts), CelebA (face identities), and an ImageNet subset—each featuring low-shot, sequential concept addition. Peer methods include direct fine-tuning, EWC, LwF, C-LoRA, L2DM, Textual Inversion (TI), and CIDM. Evaluation metrics are CLIP Image Alignment (IA) and Text Alignment (TA), aggregating statistics over 20 prompts and 50 generations per concept.
Key performance improvements over CIDM are observed:
| Dataset | Δ IA | Δ TA |
|---|---|---|
| CIFC | +2.0 | +1.3 |
| CelebA | +4.4 | +2.0 |
| ImageNet | +1.1 | +0.5 |
Across 10 concepts:
- CIFC: IA 78.0 → 80.0, TA 74.8 → 76.1
- CelebA: IA 73.3 → 77.7, TA 58.8 → 60.8
- ImageNet: IA 81.2 → 82.3, TA 78.5 → 79.0
Ablation analyses indicate that constraining image-attention maps (rather than directly regularizing LoRA weights) achieves superior retention/generalization trade-offs. FLLP remains effective when scaling to 35 concepts (as in CustomConcept101, +2.1 IA, +1.0 TA). Parameter drift, measured via LoRA Frobenius norm change, is reduced by 22% compared to CIDM.
7. Qualitative Observations and Knowledge Transfer
Qualitative experiments demonstrate that FLLP preserves previously learned identities and concept-specific features. For example, learning “Dog2” after “Cat1” and “Duck” produces generations where the new dog concept is structurally and texturally anchored by a hyperbolic parent chain, avoiding distortions and artifact propagation that afflict TI and CIDM generations. Attention maps remain both focused and interpretable, with minimal erasure or deformation of prior concepts.
The formalization of parent–child hyperbolic guidance turns catastrophic forgetting from a destructive phenomenon into an opportunity for structured, compositional positive transfer. The result is enhanced state-of-the-art performance in retention and adaptation for continual concept learning in diffusion models (Kaushik et al., 5 Jan 2026).