Image-Modality Code Representation

Updated 4 February 2026

Image-modality code representation is the explicit encoding and separation of modality-specific signals from anatomical content, enhancing interpretability.
It employs continuous, quantized, and contrastive architectures to support tasks such as synthesis, segmentation, and cross-modal registration.
Empirical evaluations demonstrate high classification accuracy, effective modality swapping, and robust retrieval through targeted loss functions and model designs.

Image-modality code representation refers to the explicit encoding, disentanglement, or alignment of imaging-modality information—whether it is contrast, style, device-specific characteristics, or the domain of acquisition—into a structured latent or symbolic code separate from anatomical or semantic content. This decomposition enables interpretable, manipulable, and often modality-agnostic representations used in multi-modal image analysis, synthesis, registration, segmentation, and retrieval. The field encompasses both continuous and quantized codes, with architectures spanning variational models, contrastive frameworks, and hybrid symbolic–pixel methods.

1. Fundamental Principles and Taxonomy

At its core, image-modality code representation formalizes the factorization of an image into (a) domain- or anatomy-specific signals and (b) modality-defining signals. Pioneering work in medical image analysis, such as SDNet, models a 2D medical image $x \in \mathbb{R}^{H \times W \times 1}$ as being generated from a spatial anatomical factor $s \in \{0,1\}^{H \times W \times C}$ and a non-spatial modality factor $z \in \mathbb{R}^{n_z}$, where $s$ assigns each pixel a one-hot vector over $C$ anatomical channels and $z$ encodes imaging style or modality (Chartsias et al., 2019). More generally, existing frameworks fall into three broad methodological classes:

Approach	Representation Form	Code Property
Disentangled Latents	$\mathbb{R}^d$ (continuous)	Semantic separation
Quantized/Symbolic	$\{0,\dots,L-1\}^d$ (discrete)	Compact, instance-specific
Contrastive Embedding	$\mathbb{R}^d$, $\ell_2$-normalized	Modality-aligned, task-tuned

The rationale is to provide codes that are: (i) informative (sufficient to classify the modality or to drive synthesis), (ii) disentangled (modality information kept separate from anatomy), and (iii) actionable (enabling downstream operations such as interpolation, swapping, or retrieval).

2. Architecture and Latent Code Formation

Disentanglement Architectures

Spatial Decomposition (SDNet): Utilizes an anatomy encoder (U-Net) that predicts the spatial factor $s$, a modality encoder (CNN+FC) that predicts the mean and variance of the modality factor $z$, and a decoder conditioned on $s$ and $z$ through FiLM modulation. The decoder reconstructs images $y = g(s,z)$, so that $z$ captures only intensity/style while $s$ encodes anatomy as per-pixel one-hot maps (Chartsias et al., 2019).
Deep Scalar Quantization (CodeBrain): Constructs per-instance, per-modality quantized codes $\hat Z_a \in \{0,\dots,L-1\}^{d \times H \times W}$ via bounded rounding of encoder output (Finite Scalar Quantization). These codes, together with modality-agnostic “common features” $F_c$ from another modality, are decoded to reconstruct the original image. In the second stage, a “prior” encoder predicts all modality codes from available inputs (Wu et al., 30 Jan 2025).
Contrastive Encoders (CoMIR, DSIR, M³Bind): Employ either one feed-forward encoder per modality (CoMIR) or a single shared encoder (DSIR) to map images into a shared representation space by maximizing inter-modality agreement via InfoNCE or other contrastive losses (Pielawski et al., 2020; Mok et al., 2024; Liu et al., 22 Jun 2025). In M³Bind, modality-specific CLIP variants are LoRA-fine-tuned and distilled into a shared text-aligned code space.
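The bounded-rounding step described for quantized codes can be illustrated with a minimal sketch. It is not CodeBrain's implementation: the function name `fsq_quantize`, the use of `tanh` as the bounding function, the restriction to an odd number of levels, and the sample encoder outputs are all assumptions made for illustration, and plain Python is used to keep the example dependency-free.

```python
import math

def fsq_quantize(values, levels=5):
    """Map unbounded encoder outputs to integer codes in {0, ..., levels-1}.

    Assumption: an odd number of levels, so the grid is symmetric around zero.
    """
    if levels % 2 == 0:
        raise ValueError("this sketch only handles an odd number of levels")
    half = (levels - 1) / 2
    codes = []
    for v in values:
        bounded = half * math.tanh(v)              # squash into (-half, half)
        codes.append(int(round(bounded) + half))   # round, then shift to start at 0
    return codes

# One spatial location with d = 7 channels (hypothetical encoder outputs).
encoder_output = [-3.0, -0.6, -0.1, 0.0, 0.2, 0.7, 4.0]
print(fsq_quantize(encoder_output))  # [0, 1, 2, 2, 2, 3, 4]
```

Rounding has zero gradient almost everywhere, so training setups of this kind typically pair it with a straight-through estimator; that part is omitted here.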

Losses Enforcing Code Disentanglement

Kullback–Leibler Prior (VAE): Encourages the approximate posterior over $z$ to match a standard normal prior, which regularizes the modality space and allows $z$ to be sampled.
Modality-Code Reconstruction Loss: Penalizes divergence between the input $z$ and the $z$ re-inferred from the reconstructed image, ensuring $z$ is actually used by the decoder (SDNet: $L_{z_{\rm rec}}$).
Cycle Consistency and GANs: Used in hybrid settings (SGDR) to align cross-domain appearance and segmentation predictions via adversarial objectives (Wang et al., 2022).
Contrastive/InfoNCE Losses: Maximize a lower bound on the mutual information between representations of corresponding images from different modalities, using explicit negative sampling strategies and, in CoMIR, rotational-equivariance constraints.
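As a rough illustration of the contrastive objective in the list above, the following sketch computes an InfoNCE loss over a small batch in which the $i$-th embedding of one modality is the positive for the $i$-th embedding of the other, and all other batch entries act as negatives. The function names, the temperature value, and the toy two-dimensional embeddings are assumptions; none of the cited methods is reproduced here.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(emb_a, emb_b, temperature=0.1):
    """Mean InfoNCE loss where emb_a[i] and emb_b[i] form the positive pair."""
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature
                  for other in b]
        # Numerically stable log-sum-exp over positives and negatives.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)

# Hypothetical embeddings of three paired MR/CT images.
emb_mr = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.2]]
emb_ct = [[0.9, 0.0], [0.1, 1.1], [-1.0, 0.0]]

aligned = info_nce(emb_mr, emb_ct)
mismatched = info_nce(emb_mr, emb_ct[1:] + emb_ct[:1])  # pairs rotated by one
print(aligned < mismatched)  # True
```

The loss is small when each embedding is closest to its own counterpart and grows when the pairing is broken, which is the behaviour the contrastive encoders rely on.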

3. Manipulability, Code Arithmetic, and Transferability

A defining feature is that these codes are explicitly manipulable and interpretable. In SDNet, “modality swapping” is realized by combining an anatomy code $s_{\rm MR}$ from an MR image with a modality code $z_{\rm CT}$ from a CT image to synthesize a new image $y = g(s_{\rm MR}, z_{\rm CT})$ that depicts the same anatomy in the style of CT (Chartsias et al., 2019). CodeBrain generalizes this via finite quantized codes: for any subject, the anchor code and common features from other modalities reconstruct the target, with code prediction supporting any-to-any imputation, including missing-modality synthesis (Wu et al., 30 Jan 2025).

In many frameworks, code interpolation along individual $z_i$ produces smooth transitions in image style, and, in the SDNet case, strong global control over intensity. In quantized settings, the codebook structure (e.g., $L^d$ possible codes) enables discrete control and instance-specific synthesis.
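Modality swapping and code interpolation can be made concrete with a toy decoder. This is only a sketch under strong simplifying assumptions: `film_decode` is a hypothetical stand-in for $g(s, z)$ that reads $z$ directly as per-channel FiLM scales and shifts (a real decoder would predict them from $z$ with learned layers), and the anatomy map and code values are arbitrary illustrative numbers.

```python
def film_decode(s, z):
    """Toy decoder g(s, z): FiLM-modulate each anatomy channel, then sum channels.

    s: list of pixels, each a one-hot list over C anatomy channels.
    z: 2*C numbers, read as per-channel scales (gamma) followed by shifts (beta).
    """
    c = len(s[0])
    gamma, beta = z[:c], z[c:]
    return [sum(gamma[k] * pixel[k] + beta[k] for k in range(c)) for pixel in s]

def lerp(z_a, z_b, t):
    """Linear interpolation between two modality codes."""
    return [(1 - t) * a + t * b for a, b in zip(z_a, z_b)]

# Anatomy taken from a hypothetical MR slice: 4 pixels, 2 anatomy channels.
s_mr = [[1, 0], [1, 0], [0, 1], [0, 1]]
z_mr = [0.8, 0.1, 0.0, 0.0]   # arbitrary "MR-like" intensities per channel
z_ct = [0.3, 0.9, 0.0, 0.0]   # arbitrary "CT-like" intensities per channel

swapped = film_decode(s_mr, z_ct)                    # same anatomy, other style
halfway = film_decode(s_mr, lerp(z_mr, z_ct, 0.5))   # intermediate style
print(swapped)  # [0.3, 0.3, 0.9, 0.9]
```

Because the anatomy map is untouched, every output shares the same spatial layout; only the per-channel intensities change as $z$ is swapped or interpolated.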

4. Evaluation Protocols and Empirical Validation

The informativeness, disentanglement, and utility of modality codes are validated empirically:

Classification Accuracy: In SDNet, an 8-dimensional $z$ achieves 92% accuracy on MR vs. CT classification and 96% on cine-MR vs. CP-BOLD MR; even the single dimension $z_5$ yields $\sim$82% accuracy (Chartsias et al., 2019).
Ablation Studies: Removing the modality-code reconstruction loss leads to posterior collapse, in which the decoder ignores $z$, and to leakage of modality content into the anatomy code.
Cross-Modal Synthesis: Modality swapping produces visually plausible and semantically consistent images, establishing the code’s utility in synthesis and data augmentation.
Downstream Task Performance: In segmentation and domain adaptation (SGDR), enforcing semantic meaning and domain invariance in the content code leads to significant improvements in Dice coefficient and ASSD compared to state-of-the-art baselines (Wang et al., 2022).
Retrieval and Classification Across Modalities: M³Bind uses its shared, 512-dimensional text-anchored embedding as a code, enabling zero-shot and few-shot classification and retrieval across X-ray, CT, retina, ECG, and pathology domains, with reported state-of-the-art results in all tested scenarios (Liu et al., 22 Jun 2025).
Registration Accuracy: DSIR and CoMIR directly use their representations as cross-modality similarity metrics, outperforming MI- or MIND-based baselines in DSC and HD95 for multimodal registration (Mok et al., 2024, Pielawski et al., 2020).
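Several of the protocols above report overlap as a Dice coefficient (DSC). For reference, a minimal sketch of the standard definition $2|A \cap B| / (|A| + |B|)$ on flat binary masks follows; the function name, the convention of returning 1.0 for two empty masks, and the example masks are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if size_sum == 0 else 2.0 * intersection / size_sum

pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
print(round(dice_coefficient(pred_mask, true_mask), 3))  # 0.667
```

Distance-based metrics such as HD95 and ASSD need surface extraction and are not sketched here.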

5. Symbolic and Structured Code Representations

Beyond continuous and quantized latent codes, certain applications formalize the code as an explicit symbolic structure. In glyph completion, a glyph image is encoded to a latent vector and further decoded to a graphic point-set and a graph (stroke connectivity), which is then rendered back to an image (image → point-set → graph → image cycle). This code supports disentanglement of style (local curve features) from content (global connectivity) and provides strong priors and editing interfaces for designers (Yuan et al., 2021). This approach generalizes: any domain-specific symbolic code (e.g., scene graph, CSG tree) can be similarly introduced as an intermediate representation bridging modalities.

6. Implementation and Training Considerations

The selection of code dimensionality, quantization level, and loss weighting is governed by task-specific requirements:

Typical continuous code dimensions: $n_z=8$ (SDNet) for style, $C=8$ to $16$ for spatial anatomical channels.
Quantized codes: $d=7$, $L=5$, yielding codebooks of size $L^d=78{,}125$ (CodeBrain) (Wu et al., 30 Jan 2025).
InfoNCE temperature hyperparameters are robust over wide ranges ($\tau=0.07$ to $0.5$), with gradient stability achieved via careful batch construction and negative sampling.
LoRA finetuning in text encoders preserves primary image–text alignment during modality unification (M³Bind) (Liu et al., 22 Jun 2025).
Deep self-similarity pipelines (DSIR) require no anatomical labels or prealigned pairs, relying instead on statistical and structural augmentation strategies.
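The codebook size quoted for the quantized setting follows directly from $L^d$, and each $d$-digit code can be read as a base-$L$ number that indexes the implicit codebook. The sketch below checks the arithmetic for $L=5$, $d=7$; the helper names are hypothetical and are not taken from any of the cited implementations.

```python
def codebook_size(levels, dims):
    """Number of distinct codes when each of `dims` channels takes `levels` values."""
    return levels ** dims

def code_to_index(code, levels):
    """Read a code as a base-`levels` number, most significant digit first."""
    index = 0
    for digit in code:
        if not 0 <= digit < levels:
            raise ValueError("digit out of range")
        index = index * levels + digit
    return index

def index_to_code(index, levels, dims):
    """Inverse of code_to_index."""
    digits = []
    for _ in range(dims):
        index, digit = divmod(index, levels)
        digits.append(digit)
    return digits[::-1]

print(codebook_size(5, 7))  # 78125

code = [0, 1, 2, 2, 2, 3, 4]
idx = code_to_index(code, 5)
print(index_to_code(idx, 5, 7) == code)  # True
```

Because the codebook is implicit in the rounding grid, no embedding table has to be stored or learned; the index is only needed when a flat identifier for a code is convenient.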

Optimization is typically performed using Adam or SGD, with learning rates from $10^{-4}$ to $10^{-3}$, batch sizes from 4 to 64, and up to several hundred epochs. Feature-matching, reconstruction, adversarial, and code-alignment losses are balanced via empirically chosen weights.

7. Applications and Future Directions

Image-modality codes enable a wide spectrum of multimodal image analysis tasks:

Semi-supervised and Domain-agnostic Segmentation: Enhanced cross-domain generalization by enforcing content–style separation at the representation level (SGDR, SDNet).
Many-to-many Modality Synthesis and Imputation: Unified models for arbitrary missing-modality inference (CodeBrain).
Cross-modal Retrieval and Zero-shot Classification: Embedding images from different modalities into a common code space to facilitate retrieval/classification without alignment or pairing constraints (M³Bind).
Deformable Image Registration: Substituting classical metrics with structural, modality-invariant codes for highly accurate registration (DSIR, CoMIR).
Interactive and Structured Design: Use of symbolic codes for editing and manipulation in domains with intrinsic structure, such as font design (Yuan et al., 2021).

Future extensions include richer priors (e.g., normalizing flows over modality codes), greater integration of symbolic or topological codes, direct code-cycle consistency objectives, volumetric/3D extensions, and incorporation of semi-supervised and self-supervised learning on unlabelled target data. Quantized modality codes, in particular, suggest further research into discrete, interpretable, and memory-efficient representations for large-scale, multimodal imaging datasets.