SLIM-Diff: Compact Joint Diffusion for Imaging

Updated 10 February 2026

SLIM-Diff is a joint image-mask diffusion framework that uses a shared-bottleneck U-Net to couple anatomical and lesion features in data-scarce settings.
It employs a tunable Lp loss to balance image realism and mask sharpness, outperforming standard diffusion models in FLAIR MRI augmentation.
The framework reduces parameter count significantly while enabling simultaneous synthesis of high-fidelity FLAIR images and corresponding lesion masks for rare neurological disorders.

SLIM-Diff denotes a compact joint image–mask diffusion framework designed for data-scarce medical imaging regimes, particularly for epilepsy-focused FLAIR MRI. Its core contributions are (i) a single shared-bottleneck U-Net enforcing coupling between anatomy and lesion geometry through a 2-channel representation, and (ii) explicit loss-geometry tuning via a tunable $L_p$ objective. Unlike standard large-scale diffusion architectures, SLIM-Diff is tailored for low-data settings and the simultaneous synthesis of both FLAIR images and corresponding lesion masks, targeting robust data augmentation and generative segmentation for rare disorders such as focal cortical dysplasia (FCD) (Pascual-González et al., 3 Feb 2026).

1. Background and Motivation

Medical imaging datasets for rare neurological disorders, such as FCD II, are highly limited, typically yielding no more than $\sim$ 80 lesion-positive samples in public collections. This scarcity, combined with the subtle and highly-localized nature of lesion morphology on FLAIR MRI, triggers instability and overfitting in high-capacity generative models (e.g., canonical Stable Diffusion or LDM U-Nets with $\sim$ 860M parameters). The simultaneous generation of anatomically plausible images and spatially consistent lesion masks can enable task-specific data augmentation for segmentation and classification pipelines—but only if the generative process faithfully preserves both modalities' structure and does not merely memorize the limited training set.

SLIM-Diff addresses these requirements by adopting a tightly-coupled architecture (shared-bottleneck U-Net), reducing parameter count by more than an order of magnitude, and optimizing with a generalized loss geometry rather than canonical $L_2$ regression (Pascual-González et al., 3 Feb 2026).

2. Mathematical Formulation

SLIM-Diff implements the discrete-time DDPM framework ( $T=1000$ steps, cosine noise schedule) for a joint image–mask tensor $\mathbf{x}_0\in\mathbb{R}^{2\times H\times W}=[I,M]^\top$ . The forward process adds Gaussian noise: $q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}\left(\mathbf{x}_t;\,\sqrt{\bar\alpha_t}\,\mathbf{x}_0,\,(1-\bar\alpha_t)\mathbf{I}\right),\qquad\bar\alpha_t=\prod_{s=1}^t \alpha_s$ The reverse process, parameterized by U-Net $f_\theta$ , targets one of three prediction parameterizations:

$\epsilon$ -prediction (standard, predicts noise)
velocity $v_\theta = \sqrt{\bar\alpha_t}\,\epsilon_\theta - \sqrt{1-\bar\alpha_t}\,\mathbf{x}_0$
direct $x_{0,\theta}$ -prediction (clean image + mask)

The loss is a general $L_p$ norm over the prediction target: $\mathcal{L} = \mathbb{E}_{t,\mathbf{x}_0,\epsilon} \|\mathrm{target} - f_\theta(\mathbf{x}_t, t, c)\|_p^p$ where $p$ is tunable ( $p\in\{1.5,2.0,2.5\}$ ), with the specific choice impacting fidelity vs. mask morphology.

3. Shared-Bottleneck U-Net Architecture

The SLIM-Diff U-Net is strictly single-stream: both FLAIR image and binary lesion mask are concatenated as input. The encoder consists of four downsampling levels with channel widths $[64, 128, 256, 256]$ , each with two residual blocks using GroupNorm (32 groups), with strided convolutions for spatial reduction. Multihead self-attention (32-dim heads) is included at the deepest two scales. A $20\times 20\times 256$ shared bottleneck compresses both modalities, imposing an information choke point to enforce representational coupling.

The decoder mirrors the encoder. Skip connections preserve spatial resolution, and a final $1\times1$ convolution predicts both modalities. Conditioning is supplied via:

Axial + pathology indicator (60 discrete tokens: 30 axial bins $\times$ {control, lesion}), embedded via a learnable embedding and sinusoidal positional encoding.
Timestep encoding (sinusoidal + 2-layer MLP).
Both are injected into ResBlocks via FiLM-style bias modulation after the first convolution.

The total parameter count is $\sim$ 26.9M, a fraction of large-scale DMs.

4. Loss Geometry and Empirical Analysis

SLIM-Diff replaces the canonical $\epsilon$ -prediction $L_2$ loss with explicit tuning over target and $p$ :

For $x_0$ -prediction, $p=1.5$ (sub-quadratic) down-weights large residuals (e.g., hyperintense lesion pixels), improving global image realism (measured by KID, LPIPS).
$p=2.0$ gives optimal mask boundary sharpness (lowest MMD-MF), as the uniform quadratic penalty preserves subtle geometry.
$p=2.5$ over-penalizes outliers, degrading performance.
Across all $p$ , $x_0$ -prediction significantly outperforms $\epsilon$ -prediction and velocity-based targets in both image and mask metrics.

Table: Summary of best/typical configurations

Target / $p$	KID	LPIPS	MMD-MF
$\epsilon$ , $p=2.0$	0.432	0.821	15.06
$x_0$ , $p=1.5$	0.012	0.305	1.43
$x_0$ , $p=2.0$	0.034	0.310	0.95

Qualitative samples with $x_0/p=1.5$ exhibit realistic FLAIR contrast with coherent lesion masks; $x_0/p=2.0$ sharpens mask boundaries at the cost of slightly over-smoothed intensities.

No evidence of pixel-wise memorization was found; Kernel Maximum Mean Discrepancy tests confirmed distributional (not copy-based) generation.

5. Training and Sampling Procedures

The network is trained via AdamW (learning rate $10^{-4}$ ), cosine annealing, weight EMA ($0.999$ decay), and early stopping (validation patience 25 epochs). Preprocessing includes MNI registration, skull-stripping (ANTS SyN), N4 bias correction, and percentile normalization. Data is organized at the subject and slice level, with lesion oversampling to counteract class imbalance.

Sampling employs DDIM (300 steps, $\eta=0.2$ ), conditioned on desired axial depth and pathology indicator. Both training and inference require explicit condition tokens.

Pseudocode (Algorithm 1—Training):

for epoch in 1...max_epochs:
    for (x0, c) in minibatch:
        t ~ Uniform(1, T)
        eps ~ N(0, I)
        xt = sqrt(bar_alpha_t)*x0 + sqrt(1-bar_alpha_t)*eps
        output = f_theta(xt, t, c)
        L = ||target(x0, eps, t) - output||_p^p
        # Backpropagation and parameter update
        # EMA update
    # Early stopping check

Pseudocode (Algorithm 2—Sampling):

x_S ~ N(0, I)
for s in S...1:
    t = schedule(s)
    out = f_theta_bar(x_s, t, c)
    x_{s-1} = DDIM_update(x_s, out, t, eta)
return x0_image, x0_mask

6. Limitations and Applicability

SLIM-Diff operates on 2D slices, potentially introducing slight inter-slice inconsistencies. No direct comparison is provided against multi-stream and multi-stage joint generative frameworks (such as MedSegFactory or brainSPADE) under matched data conditions. Extension to full 3D or enforcing explicit slice-to-volume consistency is suggested as future work. For clinical deployment, care must be taken to evaluate robustness across lesion types, scanners, and domains; fine-tuning may be necessary.

Memory and computational requirements are modest, supporting real-time slice-level synthesis on standard GPUs.

7. Code and Reproducibility

The official implementation and pretrained models are available at https://github.com/MarioPasc/slim-diff. To reproduce, preprocessing (registration, skull stripping, bias correction), slice extraction, and balanced splitting are required. Training proceeds with configurable --target and --p_norm options; at convergence, the sample.py script enables generation under specified anatomical or pathological conditions.

In summary, SLIM-Diff establishes a methodological foundation for joint image–mask diffusion in data-scarce medical regimes, demonstrating that shared low-capacity architectures with explicit loss-geometry control enable robust synthesis of anatomically faithful images and masks while resisting memorization, and supporting downstream data augmentation for rare-disease imaging research (Pascual-González et al., 3 Feb 2026).

Markdown Report Issue Upgrade to Chat

References (1)

SLIM-Diff: Shared Latent Image-Mask Diffusion with Lp loss for Data-Scarce Epilepsy FLAIR MRI (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to SLIM-Diff.

SLIM-Diff: Compact Joint Diffusion for Imaging

1. Background and Motivation

2. Mathematical Formulation

3. Shared-Bottleneck U-Net Architecture

4. Loss Geometry and Empirical Analysis

5. Training and Sampling Procedures

6. Limitations and Applicability

7. Code and Reproducibility

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Don't miss out on important new AI/ML research

SLIM-Diff: Compact Joint Diffusion for Imaging

1. Background and Motivation

2. Mathematical Formulation

3. Shared-Bottleneck U-Net Architecture

4. Loss Geometry and Empirical Analysis

5. Training and Sampling Procedures

6. Limitations and Applicability

7. Code and Reproducibility

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Related Topics

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research