Unified Modality-Relax Segmentation (UniMRSeg)

Updated 4 July 2026

The paper introduces a unified approach that replaces modality-specific models with one network handling any available subset of imaging modalities.
It employs modality-dropout and hierarchical self-supervised compensation to fuse diverse modality inputs into a shared, robust representation.
Empirical evaluations demonstrate improved Dice scores and reduced performance fluctuations across various missing and unseen modality combinations.

Unified Modality-Relax Segmentation Network (UniMRSeg) denotes a unified segmentation paradigm for settings in which input modalities are variably available, partially missing, incomplete, corrupted, or newly encountered at inference. In one formulation, UniMRSeg is the Unified Representation Network (URN) augmented with a segmentation head and modality-dropout training for segmentation with missing input modalities (Lau et al., 2019). In another, UniMRSeg is the explicit name of a hierarchical self-supervised compensation framework that bridges complete and incomplete inputs at the input, feature, and output levels within a single shared encoder–decoder (Zhao et al., 19 Sep 2025). Across these formulations, the central objective is consistent: replace modality-subset-specific models with one model that can accept whatever subset of modalities is available at test time, without network modification, model switching, or retraining.

1. Problem Setting and Conceptual Scope

The canonical problem addressed by UniMRSeg is the train–test mismatch induced by missing or heterogeneous imaging modalities. The URN formulation studies segmentation with missing input modalities under the observation that many segmentation pipelines assume that the training and test distributions match, whereas practical clinical acquisition often violates that assumption (Lau et al., 2019). The later HSSC-based UniMRSeg formulation states the deployment problem more broadly: incomplete or corrupted modalities degrade performance, and specialized per-combination models introduce high deployment costs because they require exhaustive model subsets and model-modality matching (Zhao et al., 19 Sep 2025).

A closely related but distinct setting appears in work on modality-agnostic input channels for brain lesion segmentation. There, the target is not only previously seen modalities and missing subsets of them, but also sequences unavailable during training, with inference on heterogeneous combinations of seen and unseen modalities (Addison et al., 11 Sep 2025). This suggests a useful distinction within the UniMRSeg literature: some formulations aim at graceful degradation under missing modalities drawn from a known modality universe, while others explicitly target inference on previously unseen sequences.

Within this scope, “unified” refers to a single model handling multiple modality configurations, and “modality-relax” refers to relaxing the fixed-modality assumption that standard multimodal segmentation architectures usually impose. The unifying technical motif is to map variable modality inputs into either a common representation, a shared encoder–decoder pathway, or a fixed-size padded or conditioned input space.

2. Foundational URN-Based UniMRSeg

The URN-based formulation decomposes UniMRSeg into per-modality encoding, fusion into a unified representation, optional unsupervised pre-training via image synthesis, and a downstream segmentation head. Let $M$ be the total number of possible modalities. For each modality $m$, the encoder is

$E_m:\;\mathbb{R}^{H\times W}\;\longrightarrow\;\mathbb{R}^{H\times W\times C},$

implemented as a 2D U-net with one conv-block per resolution level, batch-normalization with fixed scaling, and leaky-ReLU activations. For input $x_m$, or zero if modality $m$ was dropped, the encoder produces

$\tilde z_m \;=\; E_m(x_m)\;\in\;\mathbb{R}^{H\times W\times C},\qquad C=16.$

Immediately after encoding, channel-wise standardization is applied so that across spatial positions each channel has zero mean and unit variance (Lau et al., 2019).
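The standardization step can be illustrated with a minimal NumPy sketch. It assumes a single $(H, W, C)$ feature map; the function name `standardize_channels` and the stabilizing constant `eps` are illustrative choices, not details taken from the URN paper.

```python
import numpy as np

def standardize_channels(z, eps=1e-5):
    """Standardize each channel of an (H, W, C) feature map over its spatial positions."""
    mean = z.mean(axis=(0, 1), keepdims=True)   # per-channel mean, shape (1, 1, C)
    std = z.std(axis=(0, 1), keepdims=True)     # per-channel std, shape (1, 1, C)
    return (z - mean) / (std + eps)             # eps (assumed) guards constant channels

rng = np.random.default_rng(0)
z_tilde = rng.normal(loc=3.0, scale=2.0, size=(8, 8, 16))  # stand-in for E_m(x_m), C = 16
z_std = standardize_channels(z_tilde)
```

After this step every channel of `z_std` has approximately zero mean and unit variance across spatial positions, which keeps features from different encoders on a comparable scale before fusion.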

Robustness to absent inputs is induced during training by modality-dropout. A random subset of the $M$ input modalities is erased, with the number of dropped modalities sampled from a truncated geometric distribution with parameter $\theta$. Typical settings are $\theta=0.5$ for segmentation training and $\theta=0.8$ for pre-training. The surviving modality set $\mathcal I\subset\{1,\dots,M\}$ is fused by arithmetic mean,

$z \;=\; \frac{1}{|\mathcal I|}\sum_{m\in\mathcal I}\tilde z_m.$

Because the fusion is a mean, the magnitude of $z$ is independent of how many modalities are present. The same formulation also states that the arithmetic mean can in principle be replaced by any invertible $f$-mean,

$z \;=\; f^{-1}\!\left(\frac{1}{|\mathcal I|}\sum_{m\in\mathcal I} f(\tilde z_m)\right).$
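Modality-dropout followed by mean fusion can be sketched as below. This is a rough illustration under stated assumptions: the truncated geometric distribution is parametrized here as $P(k)\propto\theta^{k}$ for $k\in\{0,\dots,M-1\}$ dropped modalities, which is a guess at the parametrization rather than the paper's definition, and all function names are hypothetical. The generalized fusion applies an invertible function, averages, and inverts.

```python
import numpy as np

def sample_num_dropped(M, theta, rng):
    """Draw k in {0, ..., M-1} with P(k) proportional to theta**k (assumed parametrization)."""
    weights = theta ** np.arange(M)
    return int(rng.choice(M, p=weights / weights.sum()))

def modality_dropout(M, theta, rng):
    """Return the sorted indices of the surviving modalities (never empty)."""
    k = sample_num_dropped(M, theta, rng)
    dropped = rng.choice(M, size=k, replace=False)
    return sorted(set(range(M)) - set(dropped.tolist()))

def fuse_mean(features, kept):
    """Arithmetic-mean fusion over the surviving modality features."""
    return np.mean([features[m] for m in kept], axis=0)

def fuse_f_mean(features, kept, f, f_inv):
    """Generalized mean: f_inv applied to the mean of f(z_m) over surviving modalities."""
    return f_inv(np.mean([f(features[m]) for m in kept], axis=0))

rng = np.random.default_rng(0)
M = 4
features = [rng.normal(size=(8, 8, 16)) for _ in range(M)]  # stand-ins for the encoded features
kept = modality_dropout(M, theta=0.5, rng=rng)
z = fuse_mean(features, kept)
z_identity = fuse_f_mean(features, kept, f=lambda t: t, f_inv=lambda t: t)
```

With the identity as `f`, the generalized mean reduces to the arithmetic mean, so `z` and `z_identity` coincide. Because at most $M-1$ modalities are dropped, the fused tensor is always defined.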

Optional unsupervised pre-training uses shallow modality-specific decoders $D_m$ trained with an image-reconstruction objective over a large scan pool such as BRATS+HCP.

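Once the encoders are trained, their weights are frozen when training the segmentation head. The segmentation head $S$ maps the fused representation $z$ to per-voxel class probabilities $p = S(z)$, and supervision uses standard cross-entropy against the one-hot labels $y$,

$\mathcal L_{\mathrm{CE}} \;=\; -\sum_{v}\sum_{k} y_{v,k}\,\log p_{v,k},$

where $v$ indexes voxels and $k$ indexes classes. A small variance regularizer encourages encoded modality features to occupy the same feature space,

$\mathcal L_{\mathrm{var}} \;=\; \frac{1}{HWC}\sum_{h,w,c}\operatorname{Var}_{m\in\mathcal I}\!\big[\tilde z_m(h,w,c)\big],$

and the supervised loss is

$\mathcal L \;=\; \mathcal L_{\mathrm{CE}} + \lambda\,\mathcal L_{\mathrm{var}},$

with $\lambda$ a small weight.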
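A supervised objective of this kind, cross-entropy plus a variance penalty across modality features, might be computed as in the following sketch. The weight `lam`, the use of a per-pixel mean instead of a sum, and the function names are assumptions made for illustration; the paper's exact normalization is not reproduced here.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Mean per-pixel cross-entropy; probs is (H, W, K), labels is (H, W) of class ids."""
    h, w = labels.shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    picked = probs[rows, cols, labels]           # probability assigned to the true class
    return float(-np.log(picked + eps).mean())

def variance_regularizer(features):
    """Variance across modality features, averaged over positions and channels."""
    stacked = np.stack(features, axis=0)         # (|I|, H, W, C)
    return float(stacked.var(axis=0).mean())

def supervised_loss(probs, labels, features, lam=0.01):
    """Cross-entropy plus lam times the variance penalty (lam is a hypothetical weight)."""
    return cross_entropy(probs, labels) + lam * variance_regularizer(features)

rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=(8, 8))
uniform_probs = np.full((8, 8, 4), 0.25)
shared = rng.normal(size=(8, 8, 16))
loss_aligned = supervised_loss(uniform_probs, labels, [shared, shared.copy()])
loss_spread = supervised_loss(uniform_probs, labels, [shared, shared + 1.0])
```

When the modality features agree exactly, the penalty vanishes and only the cross-entropy remains; any disagreement between encoders increases the loss. With a single surviving modality the variance is trivially zero, so the term carries no signal unless at least two modalities are present.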

The corresponding training procedure uses Adam with a fixed batch size, separate learning rates for segmentation training and for image-synthesis pre-training, a fixed number of epochs for segmentation, and no data augmentation beyond random modality-dropout. At test time, one simply encodes whatever subset $\mathcal I$ is available, fuses it, and forwards the fused tensor through the segmentation head; no network modification or retraining is needed (Lau et al., 2019).

3. Hierarchical Self-Supervised Compensation

The later method explicitly titled “UniMRSeg: Unified Modality-Relax Segmentation via Hierarchical Self-Supervised Compensation” recasts the problem around a single 3D U-Net-style encoder–decoder with a 3D-ASPP head whose weights are shared across all possible input-modality combinations, such as $2^4-1=15$ combinations for four MRI modalities and $2^2-1=3$ combinations for RGB-D/T inputs (Zhao et al., 19 Sep 2025). Its stated goal is to avoid training a separate model per modality subset and to bridge the representation gap between complete and missing modalities through a three-level compensation pipeline.
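The number of input configurations a shared model must cover is the number of non-empty subsets of the modality set, $2^{M}-1$. A short enumeration using only the standard library makes this concrete; the modality names are labels chosen for illustration.

```python
from itertools import combinations

def nonempty_subsets(modalities):
    """List every non-empty subset of the given modalities, smallest subsets first."""
    return [
        subset
        for size in range(1, len(modalities) + 1)
        for subset in combinations(modalities, size)
    ]

mri_subsets = nonempty_subsets(["FLAIR", "T1", "T1c", "T2"])
rgbd_subsets = nonempty_subsets(["RGB", "Depth"])
print(len(mri_subsets), len(rgbd_subsets))  # prints: 15 3
```

A per-combination approach would need one model for each entry of `mri_subsets`, which is the deployment cost a single shared network is meant to avoid.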

At the input level, Stage 1 applies hybrid shuffled-masking augmentation. Each fully-modal sample undergoes random per-modality dropout with at least one modality remaining, channel-wise modality shuffle to decouple positional priors, and spatial masking with random local and global masks. A lightweight 3D U-Net decoder reconstructs the full set of normalized slices and is trained with a reconstruction loss between the reconstructed and original inputs.
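The three input-level perturbations can be sketched as follows. This is a simplified illustration, not the paper's implementation: the dropout probability `p_drop`, the block size, the masking ratio, and the use of a single block mask shared across channels (in place of separate local and global masks) are all assumptions, and the spatial dimensions are assumed divisible by `block`.

```python
import numpy as np

def hybrid_shuffled_masking(x, rng, p_drop=0.5, block=4, mask_ratio=0.25):
    """Augment a complete (M, D, H, W) sample; all default values are hypothetical."""
    M = x.shape[0]
    out = x.copy()
    # 1) Modality dropout: zero whole channels, keeping at least one modality.
    keep = rng.random(M) >= p_drop
    if not keep.any():
        keep[rng.integers(M)] = True
    out[~keep] = 0.0
    # 2) Channel-wise shuffle: output channel j holds original modality perm[j].
    perm = rng.permutation(M)
    out = out[perm]
    # 3) Spatial masking: zero random blocks, shared across channels in this sketch.
    grid = tuple(s // block for s in x.shape[1:])
    mask = rng.random(grid) < mask_ratio
    for axis in range(3):
        mask = np.repeat(mask, block, axis=axis)
    out[:, mask] = 0.0
    return out, keep, perm, mask

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8, 8))            # stand-in for a four-modality volume
x_aug, keep, perm, mask = hybrid_shuffled_masking(x, rng)
```

The reconstruction target in this setup would be the untouched `x`, so the network has to recover both dropped modalities and masked regions from whatever remains in `x_aug`.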

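At the feature level, Stage 2 performs modality-invariant contrastive learning. For each complete sample $x^{c}$, a randomly incomplete version $x^{i}$ is created by modality dropout, and both are encoded to multi-level features $f^{c}_{l}$ and $f^{i}_{l}$ across the five encoder stages $l=1,\dots,5$. Positive pairs are same-sample complete/incomplete features; negative pairs are drawn from different samples or cross-pairings. The per-level NT-Xent loss is

$\mathcal L^{(l)}_{\mathrm{NT}} \;=\; -\log\frac{\exp\!\big(\operatorname{sim}(f^{c}_{l}, f^{i}_{l})/\tau\big)}{\sum_{k}\exp\!\big(\operatorname{sim}(f^{c}_{l}, f_{l,k})/\tau\big)},$

where $\operatorname{sim}$ is cosine similarity, $\tau$ is a temperature, and $k$ ranges over the positive and all negatives, with total contrastive loss

$\mathcal L_{\mathrm{con}} \;=\; \sum_{l=1}^{5}\mathcal L^{(l)}_{\mathrm{NT}}.$

A segmentation-guided Dice term is optimized in parallel,

$\mathcal L_{\mathrm{Dice}} \;=\; 1 - \frac{2\sum_{v} p_v\,y_v}{\sum_{v} p_v + \sum_{v} y_v},$

yielding

$\mathcal L_{\mathrm{S2}} \;=\; \mathcal L_{\mathrm{con}} + \mathcal L_{\mathrm{Dice}}.$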
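A contrastive term of this type, combined with a soft Dice term, can be sketched in NumPy. The version below is a simplified one-directional variant in which each complete-input feature is contrasted against the incomplete-input features of the whole batch; the temperature value, the pooling of features to vectors, and the function names are assumptions, and the full NT-Xent formulation with additional negative pairings is not reproduced.

```python
import numpy as np

def nt_xent(f_complete, f_incomplete, tau=0.1):
    """One-directional NT-Xent over (B, D) pooled features; row b of each input is a positive pair."""
    a = f_complete / np.linalg.norm(f_complete, axis=1, keepdims=True)
    b = f_incomplete / np.linalg.norm(f_incomplete, axis=1, keepdims=True)
    logits = a @ b.T / tau                                   # cosine similarity / temperature
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))                # positives sit on the diagonal

def dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss for foreground probabilities and a binary target of the same shape."""
    inter = (probs * target).sum()
    return float(1.0 - (2.0 * inter + eps) / (probs.sum() + target.sum() + eps))

def stage2_loss(levels_complete, levels_incomplete, probs, target, tau=0.1):
    """Sum the per-level contrastive losses and add the Dice term (unweighted, by assumption)."""
    contrastive = sum(nt_xent(c, i, tau) for c, i in zip(levels_complete, levels_incomplete))
    return contrastive + dice_loss(probs, target)

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 32))
aligned = nt_xent(feats, feats)            # incomplete features match their complete partner
misaligned = nt_xent(feats, feats[::-1])   # partners swapped across samples
```

The aligned case gives a much smaller loss than the misaligned one, which is the behaviour the contrastive stage relies on to pull incomplete-input features toward their complete-input counterparts.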

At the output level, Stage 3 freezes the encoder and trains small reverse-attention adapters, one per encoder stage, together with the decoder. For incomplete-input feature $f^{i}_{l}$ and adapter output $g_{l}$, a 3D Swin Transformer block computes a mutual attention map $A_{l}$, which is inverted to emphasize under-attended regions. The adapted feature $\hat f^{i}_{l}$ is then formed from $f^{i}_{l}$ and $g_{l}$ using the inverted map.

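Fine-tuning then uses a hybrid consistency constraint with two terms: a feature-consistency term $\mathcal L_{\mathrm{fc}}$, computed across the encoder stages, and a prediction-consistency term $\mathcal L_{\mathrm{pc}}$, computed on the segmentation outputs. The Stage 3 objective combines these consistency terms.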

Inference uses only the Stage 3 model. Any subset of modalities can be input, with missing channels zeroed or duplicated, and the shared network yields consistent segmentations without a modality classifier or model switching (Zhao et al., 19 Sep 2025).
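Preparing a fixed-size input from an arbitrary modality subset can be sketched as below. The channel order, the function name `assemble_input`, and the donor rule for duplication (copy the first available modality in the fixed order) are assumptions for illustration; the source only states that missing channels are zeroed or duplicated.

```python
import numpy as np

MODALITY_ORDER = ("FLAIR", "T1", "T1c", "T2")  # assumed fixed channel order

def assemble_input(available, order=MODALITY_ORDER, fill="zero"):
    """Stack the available volumes into a fixed (M, D, H, W) array.

    available: dict mapping modality name -> (D, H, W) array.
    fill: "zero" fills missing channels with zeros; "duplicate" copies the
          first available modality in `order` (an assumed donor rule).
    """
    unknown = set(available) - set(order)
    if unknown:
        raise ValueError(f"unknown modalities: {sorted(unknown)}")
    present = [name for name in order if name in available]
    if not present:
        raise ValueError("at least one modality is required")
    donor = np.asarray(available[present[0]], dtype=np.float32)
    channels = []
    for name in order:
        if name in available:
            channels.append(np.asarray(available[name], dtype=np.float32))
        elif fill == "zero":
            channels.append(np.zeros_like(donor))
        elif fill == "duplicate":
            channels.append(donor.copy())
        else:
            raise ValueError(f"unsupported fill mode: {fill}")
    return np.stack(channels, axis=0)

t1 = np.ones((2, 4, 4), dtype=np.float32)
t2 = np.full((2, 4, 4), 2.0, dtype=np.float32)
x_zero = assemble_input({"T1": t1, "T2": t2})
x_dup = assemble_input({"T1": t1, "T2": t2}, fill="duplicate")
```

Either array has the same shape as a complete four-modality input, so the same network weights can be applied without a modality classifier or model switching.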

4. Empirical Performance Under Missing and Unseen Modalities

In the original URN study, evaluation used BRATS 2018 with four MRI contrasts (FLAIR, T1, T1c, and T2) and labels for enhancing tumor (ET), tumor core (TC), and whole tumor (WT). All $2^4-1=15$ nonempty modality subsets were evaluated, and Dice score was computed using the official BRATS leaderboard. The baseline U-net without modality dropout breaks catastrophically as soon as FLAIR or T1c is missing, with WT Dice dropping sharply. Adding modality dropout restores robustness on most subsets. URN plus modality dropout, denoted UniMRSeg in that formulation, further improves WT for nearly all subsets, and even with all modalities present slightly outperforms the baseline with modality dropout. Pre-training on pooled BRATS+HCP yields marginal WT gains but substantial improvements on ET and TC in the hardest missing-modality regimes (Lau et al., 2019).
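For reference, the three BRATS evaluation regions can be derived from a label map and scored with Dice as in the sketch below. It assumes the usual BRATS label convention (1 for necrotic and non-enhancing core, 2 for edema, 4 for enhancing tumor); returning 1.0 when both masks are empty is a convention chosen here and may differ from the official leaderboard's handling.

```python
import numpy as np

# Region definitions under the assumed BRATS label convention.
REGIONS = {"WT": (1, 2, 4), "TC": (1, 4), "ET": (4,)}

def dice(pred, ref):
    """Dice overlap of two binary masks; defined as 1.0 when both are empty (assumption)."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        return 1.0
    return float(2.0 * np.logical_and(pred, ref).sum() / denom)

def region_dice(pred_labels, ref_labels):
    """Dice for whole tumor, tumor core, and enhancing tumor from integer label maps."""
    return {
        name: dice(np.isin(pred_labels, labels), np.isin(ref_labels, labels))
        for name, labels in REGIONS.items()
    }

ref = np.array([1, 2, 4, 0])
pred = np.array([1, 2, 0, 0])
scores = region_dice(pred, ref)
print(scores)  # WT = 0.8, TC = 2/3, ET = 0.0
```

In this toy case the prediction misses the single enhancing voxel, so ET drops to zero while WT stays high, which mirrors why ET and TC are typically the most sensitive regions when informative contrasts are missing.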

The HSSC-based UniMRSeg reports results on four tasks. On BraTS2020, it reports the average Dice across the 15 modality combinations and its standard deviation, in comparison with PASSION, together with WT, TC, and ET Dice for the full four-modality case. On RGB-D salient object segmentation for STERE, it reports results for RGB only, depth only, and RGB-D inputs, along with their average and standard deviation, in comparison with PopNet. On VT1000 RGB-T salient object segmentation, the corresponding single-modality, RGB-T, and average results are compared with CONTRINET. On SUN-RGBD semantic segmentation, average mIoU and its standard deviation are compared with CMXNeXt. Across these benchmarks, the method is credited with higher average accuracy and a smaller standard deviation across modality combinations (Zhao et al., 19 Sep 2025).

A different but closely related benchmark addresses sequences unavailable during training. Across eight 3D brain MRI datasets spanning five pathologies and eight pulse sequences, two hold-out protocols were used. In Setting 1, DWI and ADC are unseen at test time; in Setting 2, FLAIR is unseen at test time. For Setting 1, Dice is reported on ISLES 22 and ISLES 15 for the Standard, Agnostic-Channel, and Agnostic-Pathway models; for Setting 2, the same three models are compared on WMH and ISLES 15. On datasets with only seen modalities, such as TUMOUR2, neither agnostic approach degrades performance versus the Standard model (Addison et al., 11 Sep 2025).

5. Related Extensions

A notable extension of the unified-modality idea appears in unpaired CT/MR segmentation. “MulModSeg: Enhancing Unpaired Multi-Modal Medical Image Segmentation with Modality-Conditioned Text Embedding and Alternating Training” uses a single encoder–decoder backbone, either a 3D UNet or Swin UNETR, and a modality-conditioned text branch built from a frozen CLIP-text encoder. The prompt template is “A {CT/MR} imaging of [CLS]”, and a FiLM-style controller MLP produces the parameters of a lightweight segmentation head. Training alternates CT and MR batches within each epoch, and no adversarial or consistency losses are required in that formulation. On AMOS with a UNet backbone, alternating training with text improves the average Dice for both CT and MR. On MMWHS, the corresponding UNet averages likewise improve for both CT and MR (Li et al., 2024).

Continual-learning formulations push the same objective into domain-incremental settings. CLMU-Net introduces a replay-based continual learning framework for 3D brain lesion segmentation with arbitrary and variable modality combinations, without prior knowledge of the maximum set. Its channel-inflation mechanism pads an arbitrary modality subset into a fixed-size tensor, inflating the first convolution when a new modality appears by copying existing channels and zero-initializing the new one. To enrich local 3D patch features, it adds domain-conditioned textual embeddings through cross-attention or FiLM-style modulation, and it reduces forgetting by replay using a compact buffer of prototypical and challenging samples. On five heterogeneous MRI brain datasets, the method is reported to improve the average Dice score while remaining robust under heterogeneous-modality conditions (Sadegheih et al., 20 Jan 2026).
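The channel-inflation idea can be illustrated with a small NumPy sketch that operates directly on a first-layer convolution weight of shape $(C_{\text{out}}, C_{\text{in}}, k_D, k_H, k_W)$. The function names are hypothetical, and the sketch covers only the weight surgery and input padding, not replay or text conditioning.

```python
import numpy as np

def inflate_first_conv(weight):
    """Add one input channel: existing weights are copied, the new slice is zero-initialized."""
    c_out, c_in, *kernel = weight.shape
    inflated = np.zeros((c_out, c_in + 1, *kernel), dtype=weight.dtype)
    inflated[:, :c_in] = weight
    return inflated

def pad_modalities(x, num_channels):
    """Zero-pad a (C, D, H, W) input along the channel axis up to num_channels."""
    missing = num_channels - x.shape[0]
    if missing < 0:
        raise ValueError("input has more channels than the network accepts")
    return np.concatenate([x, np.zeros((missing, *x.shape[1:]), dtype=x.dtype)], axis=0)

rng = np.random.default_rng(0)
w_old = rng.normal(size=(8, 3, 1, 1, 1))       # pointwise kernel keeps the check simple
w_new = inflate_first_conv(w_old)
x_old = rng.normal(size=(3, 4, 4, 4))
x_padded = pad_modalities(x_old, 4)

# With a 1x1x1 kernel the convolution is a per-voxel matrix product over channels.
y_old = np.einsum("oc,cdhw->odhw", w_old[..., 0, 0, 0], x_old)
y_new = np.einsum("oc,cdhw->odhw", w_new[..., 0, 0, 0], x_padded)
```

Because the new input slice starts at zero, `y_new` equals `y_old` for previously seen inputs, so the inflated layer initially preserves the network's behaviour on earlier modality sets.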

Prompt-driven retinal imaging offers a further related direction. CLAPS uses a CLIP-based image encoder, GroundingDINO bounding-box prompt generation, a modality-signature-enhanced text prompt, and MedSAM for unified segmentation across fundus and OCT datasets. In that work, expansion to a “Unified Modality-Relax Segmentation Network” is described as a potential extension, specifically by adding new prompt types such as scribbles and point-clouds and integrating other foundation models such as UNIVERSEG and OmniFuse (Zhao et al., 10 Sep 2025). This indicates that the UniMRSeg design objective is beginning to interact with foundation-model pipelines, even when the full UniMRSeg framework is not itself the reported method.

6. Advantages, Limitations, and Open Questions

Several advantages recur across UniMRSeg formulations. The URN-based version handles any subset of available modalities with a single model, avoiding a combinatorial explosion of networks. By fusing at the feature level, downstream tasks operate on a modality-agnostic representation and benefit from shared structures. Modality-dropout acts as a strong regularizer and improves even the all-modalities-present case. Unsupervised pre-training can leverage large unlabelled or partially overlapping datasets (Lau et al., 2019). The HSSC-based version adds a second advantage profile: higher mean accuracy under diverse missing-modality scenarios together with smaller performance fluctuations across modality combinations, as indicated by the reported reductions in standard deviation on BraTS2020, STERE, VT1000, and SUN-RGBD (Zhao et al., 19 Sep 2025).

The limitations are equally specific. In the URN formulation, the variance regularizer requires at least two modalities present; the experiments are all 2D; the fusion is a fixed arithmetic mean; and no geometric or intensity augmentations were used. Proposed future directions there include a 3D URN, learned attention-weighted fusions, invertible networks to parameterize the generalized mean used for fusion, standard augmentations such as flips, rotations, and intensity jitter, and domain-adaptation techniques for cross-scanner or cross-institution shifts (Lau et al., 2019). In unseen-modality segmentation, a stated concern is that some models which generalize to unseen modalities may lose discriminative modality-specific information (Addison et al., 11 Sep 2025). This suggests a persistent design tension: stronger modality invariance can improve flexibility, but excessive invariance may suppress informative contrast-specific cues.

Open questions extend beyond missing known modalities. One direction is true zero-shot modality expansion, which is explicitly contemplated by proposals to learn dynamic modality embeddings rather than fixed one-hot modality encodings (Zhao et al., 10 Sep 2025). Another is continual domain-incremental deployment, where modality sets grow over time and forgetting becomes as important as modality flexibility (Sadegheih et al., 20 Jan 2026). A broader implication is that UniMRSeg is best understood not as a single immutable architecture, but as a technical program centered on unified inference under modality variability, implemented through shared representations, compensation mechanisms, modality-agnostic channels, text-conditioned control, or continual channel inflation depending on the deployment regime.