CLMU-Net: Modality-Agnostic 3D Brain Segmentation
- CLMU-Net is a continual learning architecture that adapts to sequential, heterogeneous MRI modalities without predefined channels, effectively mitigating catastrophic forgetting.
- It integrates domain-conditioned textual guidance by fusing frozen BioBERT embeddings with 3D U-Net bottleneck features, thereby enhancing segmentation with global context.
- A dual-criterion, lesion-aware experience replay buffer strategically selects both prototypical and challenging cases, yielding significant improvements in Dice scores.
CLMU-Net is a continual learning (CL) framework for 3D brain lesion segmentation designed to operate modality-agnostically across sequentially arriving, heterogeneous multi-modal MRI datasets. Built on a 3D U-Net backbone, CLMU-Net integrates three primary innovations: modality-flexible channel inflation, domain-conditioned textual guidance, and a compact, lesion-aware experience replay buffer. This architecture enables a single model to adapt to arbitrary and variable MRI modality inputs, injects explicit cohort-level global priors via frozen BioBERT textual embeddings at the bottleneck, and strategically rehearses both prototypical and challenging samples to mitigate catastrophic forgetting. Experiments with diverse brain lesion datasets demonstrate that CLMU-Net significantly outperforms conventional CL baselines, especially under stringent memory budgets and heterogeneous modality conditions (Sadegheih et al., 20 Jan 2026).
1. Network Structure and Data Flow
CLMU-Net employs a dynamically adaptable 3D U-Net architecture that ingests fixed-size 3D patches, with an input channel dimension matching the maximum number of modalities encountered during training up to the current episode. The input layer receives zero-filled channels for unavailable modalities, allowing seamless on-the-fly expansion. The encoder consists of four Conv3D-ReLU blocks with spatial stride 2, doubling the feature depth from 32 to 256. At the bottleneck, the latent feature tensor is reshaped for cross-attention with domain-conditioned textual embeddings. The decoder inverts this hierarchy with ConvTranspose3D blocks and skip connections, restoring full spatial resolution and outputting per-voxel lesion logits. Random Modality Drop (RMD) is applied during training by masking random subsets of the available modalities, enforcing robustness to missing sequences.
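The RMD step described above can be sketched as a simple channel-masking routine. The function name, drop probability, and the rule of always keeping at least one modality are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def random_modality_drop(x, available, p_drop=0.5, rng=None):
    """Zero out a random subset of the available modality channels.

    x         : array of shape (C, D, H, W) -- one multi-modal patch
    available : channel indices that actually contain data
    p_drop    : per-modality drop probability (assumed value)
    At least one available modality is always kept.
    """
    rng = rng or np.random.default_rng()
    x = x.copy()
    keep = [c for c in available if rng.random() >= p_drop]
    if not keep:  # never drop every modality
        keep = [available[int(rng.integers(len(available)))]]
    for c in available:
        if c not in keep:
            x[c] = 0.0  # masked modality looks like a missing sequence
    return x, keep
```

Because unavailable modalities are already zero-filled at the input layer, dropping a channel this way makes it indistinguishable from a truly missing sequence, which is what forces the network to tolerate arbitrary modality subsets.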
2. Modality-Flexible Channel Inflation
A central design principle is input layer inflation (ILI), which permits the network to accommodate a growing set of modalities without prior knowledge of the maximum. At each episode, the input channel set is expanded whenever new modalities are observed: pre-trained weights are copied for existing channels, and newly required channels are zero-initialized. Any subset of available modalities is then mapped onto the inflated channel set, with absent modalities zero-filled.
This approach ensures the model can ingest and fuse varying modality subsets in a unified tensor, directly addressing the limitations of fixed-modality or maximum-set methods (Sadegheih et al., 20 Jan 2026).
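A minimal sketch of the inflation step on the first convolution's weight tensor, assuming modality-to-channel assignment by name (the helper and its signature are illustrative):

```python
import numpy as np

def inflate_input_weights(w_old, old_mods, new_mods):
    """Expand a first-layer conv kernel to cover newly observed modalities.

    w_old    : (C_out, C_old, k, k, k) weights trained on `old_mods`
    old_mods : ordered modality names for the existing input channels
    new_mods : ordered modality union after the new episode (a superset)
    Existing channels keep their trained weights; new channels are
    zero-initialized, so the inflated layer is function-preserving on
    inputs that contain only the old modalities.
    """
    c_out, _, *k = w_old.shape
    w_new = np.zeros((c_out, len(new_mods), *k), dtype=w_old.dtype)
    for i_old, m in enumerate(old_mods):
        w_new[:, new_mods.index(m)] = w_old[:, i_old]
    return w_new
```

Zero-initializing the new channels means the first forward pass after inflation produces exactly the same features for previously seen modality subsets, so no old knowledge is perturbed at the moment of expansion.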
3. Domain-Conditioned Textual Guidance
To inject cohort- and case-level context at the feature bottleneck, CLMU-Net incorporates domain-conditioned textual guidance (DCTG). For each patch, a prompt string encodes the lesion type and the available modalities (e.g., "Lesion=[tumor], Modalities=[T1, FLAIR]"). This prompt is embedded with frozen BioBERT to obtain token representations, which are linearly projected into the bottleneck feature space. The bottleneck feature map is reshaped into a token sequence and fused with the textual features via multi-head cross-attention, integrating global domain priors with local image features. The fused representation is reprojected and reshaped for the decoder, providing an explicit mechanism for leveraging high-level semantic information about disease context and imaging protocol.
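The fusion step can be illustrated with a single-head stand-in for the multi-head cross-attention described above; the residual connection and the absence of learned query/key/value projections are simplifying assumptions:

```python
import numpy as np

def cross_attend(img_tokens, txt_tokens):
    """Single-head cross-attention: image tokens query text tokens.

    img_tokens : (N, d) flattened bottleneck features (queries)
    txt_tokens : (T, d) projected BioBERT token embeddings (keys/values)
    Returns (N, d) text-conditioned features with a residual connection.
    """
    d = img_tokens.shape[1]
    scores = img_tokens @ txt_tokens.T / np.sqrt(d)   # (N, T)
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)           # softmax over text tokens
    return img_tokens + attn @ txt_tokens             # residual fusion
```

Each image token attends over the handful of prompt tokens, so every spatial position of the bottleneck receives the same global priors (lesion type, modality set) weighted by its local relevance.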
4. Lesion-Aware Experience Replay Buffer
Catastrophic forgetting is mitigated via a compact replay buffer maintained per episode and globally. After training on each dataset, samples are ranked using two complementary criteria:
- Representative (R_rep): Combines normalized lesion size and mean lesion-region confidence, favoring large, confidently segmented lesions.
- Difficult (R_diff): Blends normalized boundary uncertainty and morphological complexity (number of lesion components per voxel), emphasizing challenging cases.
For each episode, the top-ranked samples under each criterion are selected for the buffer. The global buffer is updated by merging the current selections and evicting the lowest-ranked samples globally, maintaining a fixed total memory budget. This jointly balanced selection yields a measurable mean Dice improvement over single-criterion selection [Table 3, (Sadegheih et al., 20 Jan 2026)].
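The dual-criterion selection can be sketched as two rankings over per-sample statistics. The 50/50 blending weights and field names are illustrative assumptions; the paper's exact scoring coefficients are not reproduced here:

```python
def select_replay(samples, k_per_criterion):
    """Pick the top-k representative and top-k difficult samples of an episode.

    samples : list of dicts with precomputed, normalized statistics
              (lesion_size, confidence, boundary_uncertainty, complexity).
    """
    def r_rep(s):   # large, confidently segmented lesions
        return 0.5 * s["lesion_size"] + 0.5 * s["confidence"]

    def r_diff(s):  # uncertain boundaries, complex morphology
        return 0.5 * s["boundary_uncertainty"] + 0.5 * s["complexity"]

    rep = sorted(samples, key=r_rep, reverse=True)[:k_per_criterion]
    diff = sorted(samples, key=r_diff, reverse=True)[:k_per_criterion]
    # merge, de-duplicating samples selected by both criteria
    chosen = {id(s): s for s in rep + diff}
    return list(chosen.values())
```

Rehearsing only prototypical cases under-represents the decision boundary, while rehearsing only hard cases drifts the feature statistics; merging both rankings is what balances the buffer.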
5. Training Pipeline and Loss Formulation
The training process sequentially observes datasets with varying modalities and pathologies. Each episode includes:
- Input layer inflation if new modalities arise.
- 300 epochs over randomly sampled voxel patches (batch size 2).
- Adam optimizer.
- For each mini-batch: apply RMD, sample a replay-buffer batch if available, and compute a composite segmentation loss on the current batch; the total per-iteration loss adds the same loss on the replay batch with a weighting coefficient. Buffer selection and update are performed at the end of each episode.
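The per-iteration objective can be sketched as follows. The Dice-plus-cross-entropy form and the replay weight `lam` are assumptions standing in for the paper's exact (stripped) loss formulas:

```python
import numpy as np

def dice_ce_loss(probs, target, eps=1e-6):
    """Soft Dice + binary cross-entropy, a common composite segmentation loss
    (assumed form; the paper's exact formulation is not reproduced here)."""
    inter = (probs * target).sum()
    dice = 1.0 - (2 * inter + eps) / (probs.sum() + target.sum() + eps)
    ce = -np.mean(target * np.log(probs + eps)
                  + (1 - target) * np.log(1 - probs + eps))
    return dice + ce

def iteration_loss(cur_probs, cur_target,
                   buf_probs=None, buf_target=None, lam=1.0):
    """Total per-iteration loss: current batch plus weighted replay batch."""
    loss = dice_ce_loss(cur_probs, cur_target)
    if buf_probs is not None:
        loss += lam * dice_ce_loss(buf_probs, buf_target)
    return loss
```

The replay term applies the identical segmentation loss to rehearsed samples, so gradients from past episodes keep constraining the shared weights at every step.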
6. Empirical Evaluation and Results
CLMU-Net was evaluated on five heterogeneous 3D brain MRI cohorts: BRATS-Decathlon (tumors: T1, T1c, T2, FLAIR), ISLES (stroke: DWI), MSSEG (multiple sclerosis: T1, FLAIR, PD, T2), ATLAS (stroke lesions: T1), and WMH (white-matter hyperintensities: FLAIR). Experiments used two sequential dataset orders and standard intensity normalization and spatial resampling. Primary evaluation metrics were per-task Dice, AVG (final average Dice), ILM (mean Dice across all tasks), and BWT (backward transfer, with negative values indicating forgetting).
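These CL metrics can be computed from a lower-triangular matrix of per-task Dice scores. The definitions below follow common continual-learning conventions and may differ in detail from the paper's exact formulas:

```python
import numpy as np

def cl_metrics(dice):
    """Continual-learning metrics from a Dice matrix.

    dice[i, j] = Dice on task j after finishing training episode i
    (tasks and episodes share one order; only i >= j entries are used).
    AVG : mean final Dice over all tasks
    ILM : mean over episodes of the average Dice on tasks seen so far
    BWT : mean change between a task's Dice right after learning it and
          at the end of training (negative values indicate forgetting)
    """
    T = dice.shape[0]
    avg = dice[-1].mean()
    ilm = np.mean([dice[i, : i + 1].mean() for i in range(T)])
    bwt = np.mean([dice[-1, j] - dice[j, j] for j in range(T - 1)])
    return avg, ilm, bwt
```

Under these conventions a BWT of –9.23 means the final model scores, on average, about nine Dice points below each task's just-trained performance.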
- CLMU-Net + ILI + DCTG achieves mean AVG = 59.22%, ILM = 66.03%, BWT = –9.23 (at a fixed replay buffer size β).
- Best rehearsal baseline (ER): AVG ≈ 49.84%, ILM ≈ 58.61%, BWT ≈ –21.15.
- Buffer-free methods: AVG ≈ 43.62%, ILM ≈ 53.78%.
- Relative improvements over ER: +21.5% AVG (β=10), +14.3% (β=20).
- Dynamic ILI provides +3–7% AVG and ~40–50% BWT reduction over fixed-channel architectures; dual-criterion buffer yields ΔAVG ≈+2–5% (Sadegheih et al., 20 Jan 2026).
These results demonstrate robust performance gains and marked reductions in forgetting, especially under heterogeneous-modality conditions and memory constraints.
7. Contributions, Strengths, and Limitations
CLMU-Net establishes true modality-agnostic continual learning in medical segmentation by eliminating the need to predefine maximum modality sets and enabling seamless architectural expansion. Domain-conditioned textual guidance provides efficient integration of global semantic priors, and the targeted replay buffer, by jointly balancing prototypical and difficult cases, sharply curtails forgetting even under small memory budgets.
Notable limitations include reliance on fixed prototypical/difficult ratios, which may be sub-optimal for some tasks, and potential privacy concerns due to raw 3D sample storage. Extensions to generative or privacy-preserving replay, and generalization to new pathologies or non-MRI modalities, are suggested directions. The approach demonstrates that dynamic adaptation, explicit domain knowledge injection, and principled sample selection are critical for robust continual segmentation in clinical imaging (Sadegheih et al., 20 Jan 2026).