
Modality–Anatomy–Task (MAT) Schema

Updated 21 November 2025
  • The MAT schema is a formalized framework that factorizes medical imaging data into modality, anatomy, and task to standardize evaluation of compositional generalization.
  • It employs related, unrelated, and zero-overlap splits to benchmark model performance and assess zero-shot transfer in diverse imaging tasks.
  • The schema underpins innovative applications including cross-domain translation, segmentation, and domain adaptation, advancing modular and interpretable medical AI systems.

The Modality–Anatomy–Task (MAT) schema is a formalized framework for organizing, benchmarking, and analyzing compositional generalization in multimodal medical imaging AI. By factorizing medical image data along dimensions of modality, anatomical region, and clinical task, the MAT schema underpins new approaches to model validation, dataset construction, and zero-shot generalization—particularly in the context of large vision-language models (VLMs) and cross-domain medical image translation and segmentation.

1. Formal Definition and Construction of the MAT Schema

The MAT schema defines three core axes:

  • $M$: Set of imaging modalities (e.g., X-ray, MRI, CT)
  • $A$: Set of anatomical regions (e.g., Chest, Brain, Lung)
  • $T$: Set of task types (e.g., Classification, Segmentation)

Formally, the Cartesian product $S = M \times A \times T$ enumerates all possible MAT triplets $(m, a, t)$, representing unique combinations of modality, anatomy, and task. In CrossMed, for example, $M = \{\text{X-ray}, \text{MRI}, \text{CT}\}$, $A = \{\text{Chest}, \text{Brain}, \text{Lung}\}$, and $T = \{\text{Classification}, \text{Segmentation}\}$, which yields $|S| = 18$ theoretical combinations, though only a subset may be clinically instantiated (Singh et al., 14 Nov 2025).

A “MAT-Triplet” uniquely specifies the provenance and purpose of a given medical image sample. For instance, (MRI, Brain, Segmentation) strictly encodes an MRI scan of the brain for a segmentation task.
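The enumeration of $S$ can be sketched in a few lines of Python; the axis values mirror the CrossMed example above:

```python
from itertools import product

# Axis values from the CrossMed example in the text.
M = ["X-ray", "MRI", "CT"]              # modalities
A = ["Chest", "Brain", "Lung"]          # anatomical regions
T = ["Classification", "Segmentation"]  # task types

# S = M x A x T: every MAT triplet (m, a, t).
S = list(product(M, A, T))
assert len(S) == 18  # |S| = 3 * 3 * 2
```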

A central application of MAT is to partition data in a manner that probes compositional generalization (CG)—that is, the model’s capacity to solve for MAT triplets not explicitly observed during training. This is operationalized via three key split strategies:

  • Related: Training set contains all triplets sharing exactly two factors with the test triplet. E.g., to test (X-ray, Chest, Segmentation), train on triplets like (X-ray, Chest, Classification) or (MRI, Chest, Segmentation).
  • Unrelated: Training includes only triplets that share at most one factor with the test triplet.
  • Zero-Overlap: Training set contains no triplets sharing any factor with the test triplet; all three axes are disjoint.

Mathematically, for a test triplet $(m^*, a^*, t^*)$:

$$S_{\mathrm{train}}^{\mathrm{Rel}}(m^*, a^*, t^*) = \{ (m, a, t) \in S \setminus \{(m^*, a^*, t^*)\} \mid (m = m^* \wedge a = a^*) \vee (m = m^* \wedge t = t^*) \vee (a = a^* \wedge t = t^*) \}$$

$$S_{\mathrm{train}}^{\mathrm{Unrel}}(m^*, a^*, t^*) = \{ (m, a, t) \in S \setminus \{(m^*, a^*, t^*)\} \mid \mathbb{I}_{[m = m^*]} + \mathbb{I}_{[a = a^*]} + \mathbb{I}_{[t = t^*]} \leq 1 \}$$

$$S_{\mathrm{train}}^{0}(m^*, a^*, t^*) = \{ (m, a, t) \in S \mid m \neq m^*,\; a \neq a^*,\; t \neq t^* \}$$

These splits enable rigorous measurement of the ability to recompose learned representations for previously unseen MAT combinations (Singh et al., 14 Nov 2025, Cai et al., 28 Dec 2024).
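The three split definitions reduce to counting shared factors between triplets; a minimal sketch (function names are illustrative, not from any benchmark codebase):

```python
from itertools import product

def shared_factors(u, v):
    """Number of axes (modality, anatomy, task) on which two triplets agree."""
    return sum(x == y for x, y in zip(u, v))

def mat_splits(S, test):
    """Training pools induced by a held-out test triplet, per the definitions above."""
    rest = [s for s in S if s != test]
    related   = [s for s in rest if shared_factors(s, test) == 2]  # share exactly two factors
    unrelated = [s for s in rest if shared_factors(s, test) <= 1]  # share at most one factor
    zero      = [s for s in S if shared_factors(s, test) == 0]     # all three axes disjoint
    return related, unrelated, zero

S = list(product(["X-ray", "MRI", "CT"],
                 ["Chest", "Brain", "Lung"],
                 ["Classification", "Segmentation"]))
rel, unrel, zero = mat_splits(S, ("X-ray", "Chest", "Segmentation"))
# For this test triplet: 5 related, 12 unrelated, 4 zero-overlap triplets.
```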

3. Task Representation and Unified VQA Conversion

Within MAT-based benchmarks, each MAT triplet is mapped into a unified prompt-based format for downstream evaluation. The most common is the Visual Question Answering (VQA) paradigm:

  • For classification: Natural language prompts with multiple-choice answer sets frame pathology/diagnosis recognition as question answering (e.g., “Does this X-ray of the chest show pneumonia?” with four answer options).
  • For segmentation: Prompts select from candidate segmentation masks with only one correct and several distractors (e.g., “Which of these four masks correctly outlines the tumor region?”).

Images are preprocessed (resized, normalized), masks are overlaid or presented separately, and all textual components are instantiated systematically from MAT triplet descriptors. Distractor responses are sampled to ensure clinically meaningful alternatives. This schema enables direct, model-agnostic benchmarking of multimodal models and traditional architectures in an identical format (Singh et al., 14 Nov 2025, Cai et al., 28 Dec 2024).
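A sketch of how a MAT triplet might be instantiated as a VQA item; the prompt templates, four-option format, and distractor handling here are assumptions for illustration, not the benchmark's exact implementation:

```python
import random

# Hypothetical MAT-triplet -> VQA conversion (illustrative templates only).
def to_vqa(triplet, target, distractors, seed=0):
    modality, anatomy, task = triplet
    rng = random.Random(seed)  # deterministic shuffling for reproducibility
    if task == "Classification":
        question = f"Does this {modality} of the {anatomy.lower()} show {target}?"
        answer = target
    else:  # Segmentation: pick the correct mask among distractor masks
        question = (f"Which of these {1 + len(distractors)} masks "
                    f"correctly outlines the {target}?")
        answer = "mask_correct"
    options = [answer] + list(distractors)
    rng.shuffle(options)
    return {"question": question, "options": options, "answer": answer}

sample = to_vqa(("X-ray", "Chest", "Classification"),
                "pneumonia", ["effusion", "nodule", "no finding"])
```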

4. Evaluation Metrics and Analytical Framework

The MAT paradigm supports quantitative assessment via:

  • Classification Accuracy: $\mathrm{Acc} = \frac{\#\{\text{correct predictions}\}}{\#\{\text{total instances}\}}$
  • Segmentation Class-wise Intersection over Union (cIoU): $\mathrm{cIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{|P_i \cap G_i|}{|P_i \cup G_i|}$, where $P_i$ is the predicted mask and $G_i$ the ground truth for image $i$.

Models are fine-tuned and evaluated on each split, with per-task and per-split metrics reported and averaged as appropriate (Singh et al., 14 Nov 2025).
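Both metrics reduce to a few array operations; a minimal NumPy sketch over boolean masks (treating an empty union as a perfect score is an assumption, not specified in the source):

```python
import numpy as np

def accuracy(preds, labels):
    """Acc = #correct / #total."""
    return float(np.mean(np.asarray(preds) == np.asarray(labels)))

def class_iou(pred_masks, gt_masks):
    """Mean over images of |P intersect G| / |P union G| for boolean masks."""
    ious = []
    for P, G in zip(pred_masks, gt_masks):
        inter = np.logical_and(P, G).sum()
        union = np.logical_or(P, G).sum()
        ious.append(inter / union if union else 1.0)  # empty masks count as perfect
    return float(np.mean(ious))
```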

5. Experimental Insights and Cross-Task Transfer

Empirical studies using the MAT schema reveal:

  • Significant compositional generalization under Related splits ($83.2\% \pm 1.8$ accuracy, $0.75 \pm 0.02$ cIoU for multimodal LLMs).
  • Performance declines for Unrelated ($48.7\% \pm 2.1$ accuracy, $0.32 \pm 0.01$ cIoU) and Zero-Overlap conditions ($58.1\% \pm 1.6$ accuracy, $0.49 \pm 0.01$ cIoU), evidencing the challenge of factor recombination.
  • Pronounced cross-task transfer: segmentation performance can improve by $+7\%$ cIoU using classification-only training data, underscoring the shared representational space induced by the MAT framework.
  • Conventional models (e.g., ResNet-50, U-Net) show limited compositional gains under MAT splits compared to multimodal LLMs.

These findings validate the MAT schema’s sensitivity and discriminative granularity for compositionality and zero-shot generalization in medical VLMs (Singh et al., 14 Nov 2025, Cai et al., 28 Dec 2024).

6. Applications and Extensions of the MAT Paradigm

The MAT schema has been adopted in a range of applications beyond direct CG benchmarking:

  • Data curation and domain adaptation: MAT-driven sampling and labeling strategies enable principled dataset design for diverse anatomical and modality distributions (Cai et al., 28 Dec 2024).
  • Cross-domain translation and registration: In tasks such as MRI–CT or MR–US translation and registration, MAT elements precisely parameterize the reference (source/target) spaces, as seen in diffusion modeling approaches and attention-guided GANs (Ma et al., 1 Jun 2025, Emami et al., 2020, Lyu et al., 2020).
  • Unsupervised learning and disentanglement: Anatomy-aware networks exploit the MAT structure to disentangle domain and artifact variables, enabling effective knowledge transfer across modality–anatomy–task boundaries (Lyu et al., 2020).

A plausible implication is that the MAT schema provides a unifying backbone for the design of future medical AI systems, especially those emphasizing modular, factorized, and interpretable learning.

7. Analytical Significance and Outlook

The MAT schema delivers a rigorous, extensible approach for operationalizing compositional generalization in medical imaging. Its factorization enables precisely controlled benchmarks, interpretable attribution of model performance, and fair, comparable assessment across architectures. The schema’s adoption in large-scale public datasets (e.g., CrossMed, Med-MAT) and in specialized network designs for translation, segmentation, and registration underscores its versatility and impact (Singh et al., 14 Nov 2025, Cai et al., 28 Dec 2024, Ma et al., 1 Jun 2025, Emami et al., 2020, Lyu et al., 2020).

The MAT paradigm, by encoding modality, anatomy, and task as orthogonal axes, supports multi-task training, zero-shot transfer, cross-domain synthesis, and robust anatomical alignment—thereby advancing both theoretical understanding and practical capabilities in medical multimodal AI.
