JoDiffusion: Joint Segmentation Generation
- JoDiffusion is a generative dataset framework that simultaneously synthesizes images and pixel-level annotation masks directly from textual prompts.
- It leverages a latent diffusion model paired with an annotation VAE to jointly model visual content and categorical segmentation maps, ensuring precise semantic alignment.
- Empirical evaluations show 3–8 point improvements in mIoU on benchmarks like Pascal VOC, COCO, and ADE20K over conventional Image2Mask and Mask2Image pipelines.
JoDiffusion is a generative dataset framework for semantic segmentation, designed to jointly synthesize images and precisely aligned pixel-level annotation masks directly from textual prompts. It addresses the twin challenges of annotation cost and semantic consistency that hamper traditional segmentation data pipelines, and leverages a tailored latent diffusion model to parameterize the joint distribution over visual content and categorical segmentation maps (Wang et al., 15 Dec 2025).
1. Motivation and Problem Formulation
Semantic segmentation benchmarks demand dense, per-pixel annotations, incurring significant manual labor. Given an image $x \in \mathbb{R}^{H \times W \times 3}$ and label map $y \in \{1, \dots, C\}^{H \times W}$, a spatial tensor of integer class assignments, the construction of large, diverse datasets is a key bottleneck. Synthetic data promises scalability, but two prevailing paradigms have notable drawbacks:
- Image2Mask: A standard text-to-image diffusion model produces $x$ conditioned on a textual prompt $c$; pseudo-masks $\hat{y}$ are then extracted from $x$ via attention clustering or saliency. This yields poorly localized or noisy $\hat{y}$, especially for complex layouts.
- Mask2Image: Images $x$ are generated from manually specified masks $y$ plus prompts $c$. This is limited by mask diversity and is infeasible for intricate or rare semantic compositions.
JoDiffusion directly models the paired distribution $p(x, y \mid c)$, ensuring semantic alignment while obviating the need for mask templates or post hoc mask generation.
2. Model Architecture and Latent Formulation
JoDiffusion consists of two primary modules:
- Latent Diffusion Backbone: A conventional text-to-image latent diffusion model (e.g., Stable Diffusion, U-ViT) parameterizes image latents $z_x$.
- Annotation VAE: A variational auto-encoder (VAE) for segmentation masks encodes discrete label maps $y$ into low-dimensional latents $z_y$.
The entire generative process operates in the joint latent space $(z_x, z_y)$, enabling coupled synthesis. All components are conditioned on the textual prompt $c$. Notation summary:
| Symbol | Description | Domain |
|---|---|---|
| RGB input image | ||
| Pixel-level annotation mask | ||
| Text prompt (caption) | — | |
| Image latent code | Typically | |
| Mask latent code (VAE) | Typically |
The objective is to learn $p_\theta(x, y \mid c)$ such that samples $(x, y)$ are both photorealistic and precisely labeled for downstream segmentation training.
3. Diffusion Process and Joint Training Objective
3.1 Standard Latent Diffusion
The core is the latent diffusion forward process for an image latent $z_x$:

$$q(z_x^t \mid z_x^{t-1}) = \mathcal{N}\!\left(z_x^t;\ \sqrt{1-\beta_t}\, z_x^{t-1},\ \beta_t I\right),$$

with cumulative reparameterization

$$z_x^t = \sqrt{\bar{\alpha}_t}\, z_x^0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s).$$

The noise predictor $\epsilon_\theta$ is trained via

$$\mathcal{L}_{\text{img}} = \mathbb{E}_{z_x^0,\, \epsilon,\, t}\left[\left\| \epsilon - \epsilon_\theta(z_x^t, t, c) \right\|^2\right].$$
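A minimal PyTorch sketch of this forward noising and noise-prediction loss is shown below; the function and argument names (`q_sample`, `eps_model`, `alpha_bar`) are illustrative rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def q_sample(z0, t, alpha_bar, noise=None):
    """Forward noising: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    if noise is None:
        noise = torch.randn_like(z0)
    a = alpha_bar[t].view(-1, 1, 1, 1)          # \bar{alpha}_t broadcast over (B, C, H, W)
    return a.sqrt() * z0 + (1.0 - a).sqrt() * noise, noise

def image_diffusion_loss(eps_model, z0, cond, alpha_bar):
    """Noise-prediction MSE on image latents with a uniformly sampled timestep."""
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    zt, eps = q_sample(z0, t, alpha_bar)
    return F.mse_loss(eps_model(zt, t, cond), eps)
```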
3.2 Annotation VAE
For a mask $y$, a VAE with encoder $E_\phi$ and decoder $D_\phi$ is trained:

$$z_y = E_\phi(y), \qquad \hat{y} = D_\phi(z_y),$$

with mask reconstruction loss (using deterministic encoding, no KL penalty):

$$\mathcal{L}_{\text{mask}} = \ell_{\text{rec}}\!\left(D_\phi(E_\phi(y)),\, y\right),$$

where $\ell_{\text{rec}}$ is a per-pixel categorical reconstruction loss (e.g., cross-entropy over the $C$ classes).
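A minimal sketch of such a deterministic mask autoencoder, assuming a one-hot input encoding, per-pixel cross-entropy reconstruction, and illustrative layer sizes (none of which are specified by the source):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskVAE(nn.Module):
    """Deterministic encoder/decoder for categorical masks (no KL penalty)."""

    def __init__(self, num_classes, latent_ch=4):
        super().__init__()
        self.num_classes = num_classes
        # Encoder: one-hot mask -> spatially downsampled latent z_y (4x smaller)
        self.enc = nn.Sequential(
            nn.Conv2d(num_classes, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, latent_ch, 3, stride=2, padding=1),
        )
        # Decoder: latent z_y -> per-pixel class logits at full resolution
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, y):
        # y: (B, H, W) integer class labels
        y_onehot = F.one_hot(y, self.num_classes).permute(0, 3, 1, 2).float()
        z_y = self.enc(y_onehot)
        logits = self.dec(z_y)
        loss = F.cross_entropy(logits, y)   # per-pixel categorical reconstruction
        return z_y, logits, loss
```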
3.3 Joint Diffusion Chain
The joint forward noising couples $(z_x, z_y)$: both latents are noised under a shared schedule,

$$z_x^t = \sqrt{\bar{\alpha}_t}\, z_x^0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_x, \qquad z_y^t = \sqrt{\bar{\alpha}_t}\, z_y^0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_y, \qquad \epsilon_x, \epsilon_y \sim \mathcal{N}(0, I).$$

Reverse denoising uses a unified predictor $\epsilon_\theta(z_x^t, z_y^t, t, c)$, optimized via MSE on the joint noise:

$$\mathcal{L}_{\text{joint}} = \mathbb{E}\left[\left\| (\epsilon_x, \epsilon_y) - \epsilon_\theta(z_x^t, z_y^t, t, c) \right\|^2\right].$$

Alternatively, the image and mask terms can be weighted by a coefficient $\lambda$.
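The joint objective can be sketched as follows, assuming the unified predictor returns separate noise estimates for the two latents; the names and the handling of the shared schedule are illustrative:

```python
import torch
import torch.nn.functional as F

def joint_diffusion_loss(eps_model, zx0, zy0, cond, alpha_bar, lam=1.0):
    """Joint noise-prediction loss over image and mask latents (shared schedule)."""
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (zx0.shape[0],), device=zx0.device)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    eps_x, eps_y = torch.randn_like(zx0), torch.randn_like(zy0)
    zxt = a.sqrt() * zx0 + (1 - a).sqrt() * eps_x   # noise both latents with the same t
    zyt = a.sqrt() * zy0 + (1 - a).sqrt() * eps_y
    eps_x_hat, eps_y_hat = eps_model(zxt, zyt, t, cond)   # unified predictor
    return F.mse_loss(eps_x_hat, eps_x) + lam * F.mse_loss(eps_y_hat, eps_y)
```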
4. Mask Optimization and Data Generation Pipeline
Despite joint modeling, generated masks can contain spurious small regions. To correct this, JoDiffusion uses boundary-mode correction: for each small connected region $R$ in $\hat{y}$ with area $|R| < \tau$ pixels (for a small threshold $\tau$), all pixels in $R$ are relabeled to the mode category among its boundary pixels $\partial R$:

$$\hat{y}(p) \leftarrow \operatorname{mode}\{\hat{y}(q) : q \in \partial R\} \quad \text{for all } p \in R.$$
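A possible implementation of this correction using SciPy connected-component labeling, treating the boundary as the pixels immediately surrounding each small region; the area threshold is a placeholder, as the paper's value is not reproduced here:

```python
import numpy as np
from scipy import ndimage

def boundary_mode_correction(mask, min_area=64):
    """Relabel small connected regions to the mode class of the surrounding pixels.

    `min_area` is a placeholder threshold, not the paper's value.
    """
    out = mask.copy()
    for cls in np.unique(mask):
        components, n = ndimage.label(mask == cls)       # connected components of this class
        for i in range(1, n + 1):
            region = components == i
            if region.sum() >= min_area:
                continue
            # Boundary: pixels just outside the region (dilation minus the region itself)
            boundary = ndimage.binary_dilation(region) & ~region
            if boundary.any():
                out[region] = np.bincount(out[boundary]).argmax()   # mode category
    return out
```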
Dataset synthesis proceeds as follows (a sampling-loop sketch follows the list):
- Sample a text prompt describing the scene.
- Initialize $(z_x^T, z_y^T) \sim \mathcal{N}(0, I)$.
- Perform $T$ joint reverse diffusion steps.
- Decode $z_x^0$ to the image $x$ and $z_y^0$ to the mask $y$ via the respective decoders.
- Optionally diversify prompts via template expansion or LLM-based paraphrasing.
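The loop below sketches the sampling steps above as a plain DDPM-style joint sampler; the decoder callables and the predictor interface (matching the training sketch) are assumptions for illustration:

```python
import torch

@torch.no_grad()
def sample_pair(eps_model, decode_image, decode_mask, prompt_embed, betas, shape_x, shape_y):
    """DDPM-style joint reverse diffusion yielding one (image, mask) pair."""
    alpha = 1.0 - betas
    alpha_bar = torch.cumprod(alpha, dim=0)
    zx, zy = torch.randn(shape_x), torch.randn(shape_y)     # init z_x^T, z_y^T ~ N(0, I)
    for t in reversed(range(len(betas))):                   # joint reverse diffusion steps
        tb = torch.full((shape_x[0],), t, dtype=torch.long)
        eps_x, eps_y = eps_model(zx, zy, tb, prompt_embed)  # unified noise predictor
        coef = (1 - alpha[t]) / (1 - alpha_bar[t]).sqrt()
        zx = (zx - coef * eps_x) / alpha[t].sqrt()
        zy = (zy - coef * eps_y) / alpha[t].sqrt()
        if t > 0:                                           # add posterior noise except at t = 0
            zx = zx + betas[t].sqrt() * torch.randn_like(zx)
            zy = zy + betas[t].sqrt() * torch.randn_like(zy)
    return decode_image(zx), decode_mask(zy)                # decode latents to x and y
```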
5. Implementation Specifics
- Image VAE: Derived from Stable Diffusion, 300M parameters.
- Annotation VAE: Lightweight convolutional network, 50M parameters, with a compact mask latent.
- Diffusion Backbone: U-ViT/Unidiffuser, 24 layers, 1024 hidden dimensions; diffusion uses a linearly scheduled noise variance $\beta_t$.
- Training Details: AdamW optimizer, batch size 64, 200k iterations. Data augmentation includes random flips and prompt augmentation.
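For concreteness, a linear noise schedule of the kind referenced above can be constructed as follows; the endpoint values and step count are common DDPM defaults used as placeholders, not values from the paper:

```python
import torch

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Linearly spaced noise variances beta_t; the values here are common DDPM defaults."""
    return torch.linspace(beta_start, beta_end, num_steps)

betas = linear_beta_schedule()
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t used by the loss sketches above
```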
6. Empirical Evaluation and Results
JoDiffusion demonstrates significant improvements in segmentation training when using generated datasets on standard benchmarks:
| Dataset & Model | Real mIoU | Best Prior (Synth/Real+Synth) | JoDiffusion (Synth/Real+Synth) |
|---|---|---|---|
| Pascal VOC (DeepLabV3-R50) | 77.4 | SDS: 60.4 / 77.6 | 72.5 / 78.3 |
| COCO (DeepLabV3-R50) | 48.9 | Dataset Diffusion: 32.4 / 54.6 | 42.6 / 56.4 |
| ADE20K (Mask2Former-R50) | 47.2 | FreeMask: 48.2 | 48.4 |
Across architectures (ResNet101, Swin-S), JoDiffusion delivered consistent 3–8 point improvements in mIoU over conventional Image2Mask and Mask2Image pipelines. Qualitative analysis reports sharp, pixel-precise contours and reliable object–label alignment, even for cluttered scenes and small object instances.
7. Scalability and Extensions
By conditioning generation solely on text, JoDiffusion can theoretically synthesize unlimited, diverse paired images and annotation maps without reliance on hand-crafted mask sets or semantic templates. Extensions proposed include multi-modal VAE integration (for depth, instance IDs), unified latent spaces for multi-task training, and LLM-guided prompt design to emphasize rare or challenging semantic classes (Wang et al., 15 Dec 2025). This suggests strong potential for domain-adaptive synthetic dataset construction and large-scale semantic segmentation model enhancement.