
JoDiffusion: Joint Segmentation Generation

Updated 22 December 2025
  • JoDiffusion is a generative dataset framework that simultaneously synthesizes images and pixel-level annotation masks directly from textual prompts.
  • It leverages a latent diffusion model paired with an annotation VAE to jointly model visual content and categorical segmentation maps, ensuring precise semantic alignment.
  • Empirical evaluations show 3–8 point improvements in mIoU on benchmarks like Pascal VOC, COCO, and ADE20K over conventional Image2Mask and Mask2Image pipelines.

JoDiffusion is a generative dataset framework for semantic segmentation, designed to jointly synthesize images and precisely aligned pixel-level annotation masks directly from textual prompts. It addresses the twin challenges of annotation cost and semantic consistency that hamper traditional segmentation data pipelines, and leverages a tailored latent diffusion model to parameterize the joint distribution over visual content and categorical segmentation maps (Wang et al., 15 Dec 2025).

1. Motivation and Problem Formulation

Semantic segmentation benchmarks demand dense, per-pixel annotations, incurring significant manual labor. Given an image $x$ and label map $m$, where $m$ is a spatial tensor of integer class assignments, the construction of large, diverse $(x, m)$ datasets is a key bottleneck. Synthetic data promises scalability, but two prevailing paradigms have notable drawbacks:

  • Image2Mask: A standard text-to-image diffusion model produces $x$ conditioned on a textual prompt $c$; pseudo-masks $m'$ are then extracted from $x$ via attention clustering or saliency. This yields poorly localized or noisy $m'$, especially for complex layouts.
  • Mask2Image: Images are generated from manual masks plus prompts. This is limited by mask diversity and is infeasible for intricate or rare semantic compositions.

JoDiffusion directly models the paired distribution $p_\theta(x, m \mid c)$, ensuring semantic alignment while obviating the need for mask templates or post hoc mask generation.

2. Model Architecture and Latent Formulation

JoDiffusion consists of two primary modules:

  • Latent Diffusion Backbone: A conventional text-to-image latent diffusion model (e.g., Stable Diffusion, U-ViT) parameterizes image latents $z_x$.
  • Annotation VAE: A variational auto-encoder (VAE) for segmentation masks encodes discrete label maps $m$ into low-dimensional latents $z_m$.

The entire generative process operates in the joint latent space $(z_x, z_m)$, enabling coupled synthesis. All components are conditioned on the textual prompt $c$. Notation summary:

| Symbol | Description | Domain |
|--------|-------------|--------|
| $x$ | RGB input image | $\mathbb{R}^{H \times W \times 3}$ |
| $m$ | Pixel-level annotation mask | $\{1,\ldots,K\}^{H \times W}$ |
| $c$ | Text prompt (caption) | |
| $z_x$ | Image latent code | typically $\mathbb{R}^{d}$ |
| $z_m$ | Mask latent code (VAE) | typically $\mathbb{R}^{d'}$ |

The objective is to learn $p_\theta(x, m \mid c)$ such that samples $(x, m) \sim p_\theta(\cdot \mid c)$ are both photorealistic and precisely labeled for downstream segmentation training.

3. Diffusion Process and Joint Training Objective

3.1 Standard Latent Diffusion

The core is the latent diffusion forward process for an image:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$$

with the cumulative reparameterization

$$x_t = \sqrt{\bar{\alpha}_t}\, x + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s).$$

The noise predictor $\varepsilon_\theta$ is trained via

$$\mathcal{L}_\mathrm{image} = \mathbb{E}\left[\|\varepsilon - \varepsilon_\theta(x_t, c, t)\|^2\right].$$
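
The closed-form noising and the noise-prediction loss can be summarized in a short PyTorch sketch; `eps_model(x_t, c, t)` is a hypothetical predictor interface rather than the paper's actual module:

```python
import torch

# Linear beta schedule (Section 5: beta_t in [1e-4, 0.02] over T = 1000 steps).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def image_diffusion_loss(eps_model, x0, c):
    """L_image = E[ ||eps - eps_theta(x_t, c, t)||^2 ], with x0 a batch of (latent) images."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)            # random timestep per sample
    eps = torch.randn_like(x0)                                  # Gaussian noise
    a_bar = alpha_bars.to(x0.device)[t].view(b, 1, 1, 1)        # broadcast over (C, H, W)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps        # cumulative reparameterization
    return ((eps - eps_model(x_t, c, t)) ** 2).mean()
```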

3.2 Annotation VAE

For a mask $m$, a VAE with encoder $q_\phi(z_m \mid m)$ and decoder $p_\phi(m \mid z_m)$ is trained with the mask loss (using a deterministic encoding, no KL penalty)

$$\mathcal{L}_\mathrm{anno} = -\sum_{i,j} \sum_{c=1}^{K} \mathbf{1}_{m_{i,j}=c} \log p_\phi(m_{i,j} = c \mid z_m).$$
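
A minimal sketch of this reconstruction loss under the stated deterministic encoding, assuming placeholder `mask_encoder` / `mask_decoder` modules and zero-indexed classes:

```python
import torch.nn.functional as F

def annotation_vae_loss(mask_encoder, mask_decoder, m):
    """Per-pixel cross-entropy over K classes; no KL term because the encoding is deterministic.

    m: LongTensor of shape (B, H, W) with entries in {0, ..., K-1}.
    mask_encoder / mask_decoder are placeholder modules, not the paper's exact architecture.
    """
    z_m = mask_encoder(m)              # deterministic mask latent, e.g. (B, 8, 16, 16)
    logits = mask_decoder(z_m)         # (B, K, H, W) class logits
    # cross_entropy averages -sum_c 1[m_ij = c] log p(m_ij = c | z_m) over pixels and batch
    return F.cross_entropy(logits, m)
```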

3.3 Joint Diffusion Chain

The joint forward noising couples $(z_x, z_m)$:

$$q(z_x^t, z_m^t \mid z_x^{t-1}, z_m^{t-1}) = \mathcal{N}\!\left( \begin{bmatrix} z_x^t \\ z_m^t \end{bmatrix};\ \sqrt{\alpha_t} \begin{bmatrix} z_x^{t-1} \\ z_m^{t-1} \end{bmatrix},\ \beta_t I \right), \qquad \alpha_t = 1 - \beta_t.$$

Reverse denoising uses a unified predictor:

$$p_\theta(z_x^{t-1}, z_m^{t-1} \mid z_x^t, z_m^t, c) = \mathcal{N}(\mu_\theta,\ \sigma_t^2 I),$$

optimized via MSE on the joint noise:

$$\mathcal{L}_\mathrm{joint} = \mathbb{E}\bigl[\|\varepsilon - \varepsilon_\theta(z_x^t, z_m^t, c, t)\|^2\bigr].$$

Alternatively, the image and mask terms can be weighted separately by a factor $\lambda$.
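
A hedged sketch of one joint training step, assuming both latents share the timestep and noise schedule and that a single predictor returns noise estimates for the pair (the `eps_model` interface and the $\lambda$ weighting below are illustrative):

```python
import torch

def joint_diffusion_loss(eps_model, z_x0, z_m0, c, alpha_bars, T=1000, lam=1.0):
    """L_joint with an optional weight lam on the mask term.

    eps_model(z_x_t, z_m_t, c, t) -> (eps_x_hat, eps_m_hat) is an assumed interface.
    """
    b = z_x0.shape[0]
    t = torch.randint(0, T, (b,), device=z_x0.device)           # shared timestep for both latents
    a_bar = alpha_bars.to(z_x0.device)[t]
    ax = a_bar.view(b, *([1] * (z_x0.dim() - 1)))
    am = a_bar.view(b, *([1] * (z_m0.dim() - 1)))

    eps_x, eps_m = torch.randn_like(z_x0), torch.randn_like(z_m0)
    z_x_t = ax.sqrt() * z_x0 + (1.0 - ax).sqrt() * eps_x        # coupled forward noising
    z_m_t = am.sqrt() * z_m0 + (1.0 - am).sqrt() * eps_m

    eps_x_hat, eps_m_hat = eps_model(z_x_t, z_m_t, c, t)        # unified predictor
    return ((eps_x - eps_x_hat) ** 2).mean() + lam * ((eps_m - eps_m_hat) ** 2).mean()
```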

4. Mask Optimization and Data Generation Pipeline

Despite the joint modeling, generated masks $\hat{m}$ can contain spurious small regions. To correct these, JoDiffusion uses a boundary-mode correction: for each small connected region $R$ (with $|R| < \tau$, typically $\tau \approx 20$ px), all pixels in $R$ are relabeled to the mode category among the boundary pixels $\partial R$:

$$c^* = \arg\max_{c} \sum_{(i,j) \in \partial R} \mathbf{1}(m_{i,j} = c),$$

$$m_{i,j} \leftarrow c^* \quad \forall\, (i,j) \in R.$$
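
A sketch of this correction using SciPy connected-component labeling. It reads $\partial R$ as the one-pixel ring just outside $R$, which is an interpretation on our part (an interior boundary would trivially vote for the region's own class); the default threshold follows the $\tau \approx 20$ px suggestion above:

```python
import numpy as np
from scipy import ndimage

def boundary_mode_correction(m, tau=20):
    """Relabel every connected region smaller than tau pixels to the mode class of its boundary.

    m: (H, W) integer label map; returns a corrected copy.
    """
    out = m.copy()
    for cls in np.unique(m):
        labeled, n = ndimage.label(m == cls)                 # connected components of this class
        for r in range(1, n + 1):
            region = labeled == r
            if region.sum() >= tau:                          # keep regions of at least tau pixels
                continue
            # Boundary pixels: one-pixel dilation of the region, minus the region itself.
            boundary = ndimage.binary_dilation(region) & ~region
            if boundary.any():
                vals, counts = np.unique(out[boundary], return_counts=True)
                out[region] = vals[np.argmax(counts)]        # c* = mode class on the boundary
    return out
```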

Dataset synthesis proceeds as follows:

  1. Sample a text prompt $c$ describing the scene.
  2. Initialize $(z_x^T, z_m^T) \sim \mathcal{N}(0, I)$.
  3. Perform $T$ joint reverse diffusion steps.
  4. Decode to $x = \mathrm{Dec}_x(z_x^0)$ and $m = \arg\max \mathrm{Dec}_m(z_m^0)$.
  5. Optionally diversify prompts via template expansion or LLM-based paraphrasing.
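
A compact sketch of steps 2–4 using plain DDPM ancestral sampling in the joint latent space; `eps_model`, `dec_x`, and `dec_m` are placeholder callables, and the sampler is a generic one rather than the paper's exact procedure:

```python
import torch

@torch.no_grad()
def sample_pair(eps_model, dec_x, dec_m, c, betas, shape_x, shape_m):
    """Generate one (image, mask) pair from a prompt embedding c."""
    T = betas.shape[0]
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    def reverse_step(z, eps, t):
        # DDPM posterior mean; add noise at every step except the last.
        mean = (z - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        return mean + betas[t].sqrt() * torch.randn_like(z) if t > 0 else mean

    z_x, z_m = torch.randn(shape_x), torch.randn(shape_m)    # (z_x^T, z_m^T) ~ N(0, I)
    for t in reversed(range(T)):                             # T joint reverse diffusion steps
        eps_x, eps_m = eps_model(z_x, z_m, c, torch.tensor([t]))
        z_x, z_m = reverse_step(z_x, eps_x, t), reverse_step(z_m, eps_m, t)

    x = dec_x(z_x)                                           # decode image latent
    m = dec_m(z_m).argmax(dim=1)                             # argmax over class logits -> label map
    return x, m
```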

5. Implementation Specifics

  • Image VAE: Derived from Stable Diffusion, ~300M parameters, operating at $64 \times 64$ resolution.
  • Annotation VAE: Lightweight convolutional network, ~50M parameters, latent size $16 \times 16 \times 8$.
  • Diffusion Backbone: U-ViT/Unidiffuser-style, 24 layers, 1024 dimensions. $T = 1000$ diffusion steps with $\beta_t$ linearly scheduled in $[10^{-4}, 0.02]$.
  • Training Details: AdamW optimizer, learning rate $1 \times 10^{-4}$, batch size 64, 200k iterations. Data augmentation includes random flips and prompt augmentation.
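
For reference, these hyperparameters can be collected into a single configuration object; the structure below is only an illustrative summary, not the authors' code:

```python
from dataclasses import dataclass

@dataclass
class JoDiffusionConfig:
    """Hyperparameters from Section 5, gathered for convenience."""
    # Diffusion backbone (U-ViT / Unidiffuser-style)
    num_layers: int = 24
    hidden_dim: int = 1024
    diffusion_steps: int = 1000
    beta_start: float = 1e-4           # linear beta schedule endpoints
    beta_end: float = 0.02
    # Latent representations
    image_latent_res: int = 64         # 64 x 64 image latents
    mask_latent_shape: tuple = (16, 16, 8)
    # Optimization
    optimizer: str = "AdamW"
    learning_rate: float = 1e-4
    batch_size: int = 64
    iterations: int = 200_000
```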

6. Empirical Evaluation and Results

JoDiffusion yields significant improvements on standard benchmarks when segmentation models are trained with its generated datasets:

| Dataset (Model) | Real mIoU | Best Prior (Synth / Real+Synth) | JoDiffusion (Synth / Real+Synth) |
|---|---|---|---|
| Pascal VOC (DeepLabV3-R50) | 77.4 | SDS: 60.4 / 77.6 | 72.5 / 78.3 |
| COCO (DeepLabV3-R50) | 48.9 | Dataset Diffusion: 32.4 / 54.6 | 42.6 / 56.4 |
| ADE20K (Mask2Former-R50) | 47.2 | FreeMask: 48.2 | 48.4 |

Across architectures (ResNet101, Swin-S), JoDiffusion delivered consistent 3–8 point improvements in mIoU over conventional Image2Mask and Mask2Image pipelines. Qualitative analysis reports sharp, pixel-precise contours and reliable object–label alignment, even for cluttered scenes and small object instances.

7. Scalability and Extensions

By conditioning generation solely on text, JoDiffusion can theoretically synthesize unlimited, diverse paired images and annotation maps without reliance on hand-crafted mask sets or semantic templates. Extensions proposed include multi-modal VAE integration (for depth, instance IDs), unified latent spaces for multi-task training, and LLM-guided prompt design to emphasize rare or challenging semantic classes (Wang et al., 15 Dec 2025). This suggests strong potential for domain-adaptive synthetic dataset construction and large-scale semantic segmentation model enhancement.
