JoDiffusion: Joint Segmentation Generation
- JoDiffusion is a generative dataset framework that simultaneously synthesizes images and pixel-level annotation masks directly from textual prompts.
- It leverages a latent diffusion model paired with an annotation VAE to jointly model visual content and categorical segmentation maps, ensuring precise semantic alignment.
- Empirical evaluations show 3–8 point improvements in mIoU on benchmarks like Pascal VOC, COCO, and ADE20K over conventional Image2Mask and Mask2Image pipelines.
JoDiffusion is a generative dataset framework for semantic segmentation, designed to jointly synthesize images and precisely aligned pixel-level annotation masks directly from textual prompts. It addresses the twin challenges of annotation cost and semantic consistency that hamper traditional segmentation data pipelines, and leverages a tailored latent diffusion model to parameterize the joint distribution over visual content and categorical segmentation maps (Wang et al., 15 Dec 2025).
1. Motivation and Problem Formulation
Semantic segmentation benchmarks demand dense, per-pixel annotations, incurring significant manual labor. Given an image $x \in \mathbb{R}^{H \times W \times 3}$ and label map $y \in \{1, \dots, C\}^{H \times W}$, a spatial tensor of integer class assignments, the construction of large, diverse datasets is a key bottleneck. Synthetic data promises scalability, but two prevailing paradigms have notable drawbacks:
- Image2Mask: A standard text-to-image diffusion model produces $x$ conditioned on a textual prompt $c$; pseudo-masks $\hat{y}$ are then extracted from $x$ via attention clustering or saliency. This yields poorly localized or noisy $\hat{y}$, especially for complex layouts.
- Mask2Image: Images $x$ are generated from manually specified masks $y$ plus prompts $c$. This is limited by mask diversity and is infeasible for intricate or rare semantic compositions.
JoDiffusion directly models the paired distribution $p(x, y \mid c)$, ensuring semantic alignment while obviating the need for mask templates or post hoc mask generation.
2. Model Architecture and Latent Formulation
JoDiffusion consists of two primary modules:
- Latent Diffusion Backbone: A conventional text-to-image latent diffusion model (e.g., Stable Diffusion, U-ViT) parameterizes image latents $z_x$.
- Annotation VAE: A variational auto-encoder (VAE) for segmentation masks encodes discrete label maps $y$ into low-dimensional latents $z_y$.
The entire generative process operates in the joint latent space $(z_x, z_y)$, enabling coupled synthesis. All components are conditioned on the textual prompt $c$. Notation summary:
| Symbol | Description | Domain |
|---|---|---|
| RGB input image | ||
| Pixel-level annotation mask | ||
| Text prompt (caption) | — | |
| Image latent code | Typically | |
| Mask latent code (VAE) | Typically |
The objective is to learn $p_\theta(x, y \mid c)$ such that samples $(x, y)$ are both photorealistic and precisely labeled for downstream segmentation training.
3. Diffusion Process and Joint Training Objective
3.1 Standard Latent Diffusion
The core is the latent diffusion forward process for an image latent $z_x$:

$$q(z_x^t \mid z_x^{t-1}) = \mathcal{N}\!\left(z_x^t;\ \sqrt{1-\beta_t}\, z_x^{t-1},\ \beta_t I\right),$$

with cumulative reparameterization

$$z_x^t = \sqrt{\bar{\alpha}_t}\, z_x^0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s).$$

The noise predictor $\epsilon_\theta$ is trained via

$$\mathcal{L}_{\text{img}} = \mathbb{E}_{z_x^0,\, \epsilon,\, t}\left[\left\| \epsilon - \epsilon_\theta(z_x^t, t, c) \right\|^2\right].$$
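A minimal PyTorch sketch of this forward noising and noise-prediction loss is shown below; the function and argument names (`q_sample`, `eps_model`, `alpha_bar`) are illustrative rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def q_sample(z0, t, alpha_bar, noise=None):
    """Forward noising: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    if noise is None:
        noise = torch.randn_like(z0)
    a = alpha_bar[t].view(-1, 1, 1, 1)          # \bar{alpha}_t broadcast over (B, C, H, W)
    return a.sqrt() * z0 + (1.0 - a).sqrt() * noise, noise

def image_diffusion_loss(eps_model, z0, cond, alpha_bar):
    """Noise-prediction MSE on image latents with a uniformly sampled timestep."""
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    zt, eps = q_sample(z0, t, alpha_bar)
    return F.mse_loss(eps_model(zt, t, cond), eps)
```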
3.2 Annotation VAE
For a mask $y$, a VAE with encoder $E_\phi$ and decoder $D_\phi$ is trained:

$$z_y = E_\phi(y), \qquad \hat{y} = D_\phi(z_y),$$

with mask reconstruction loss (using deterministic encoding, no KL penalty):

$$\mathcal{L}_{\text{mask}} = \ell_{\text{rec}}\!\left(D_\phi(E_\phi(y)),\, y\right),$$

where $\ell_{\text{rec}}$ is a per-pixel categorical reconstruction loss (e.g., cross-entropy over the $C$ classes).
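A minimal sketch of such a deterministic mask autoencoder, assuming a one-hot input encoding, per-pixel cross-entropy reconstruction, and illustrative layer sizes (none of which are specified by the source):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskVAE(nn.Module):
    """Deterministic encoder/decoder for categorical masks (no KL penalty)."""

    def __init__(self, num_classes, latent_ch=4):
        super().__init__()
        self.num_classes = num_classes
        # Encoder: one-hot mask -> spatially downsampled latent z_y (4x smaller)
        self.enc = nn.Sequential(
            nn.Conv2d(num_classes, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, latent_ch, 3, stride=2, padding=1),
        )
        # Decoder: latent z_y -> per-pixel class logits at full resolution
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, y):
        # y: (B, H, W) integer class labels
        y_onehot = F.one_hot(y, self.num_classes).permute(0, 3, 1, 2).float()
        z_y = self.enc(y_onehot)
        logits = self.dec(z_y)
        loss = F.cross_entropy(logits, y)   # per-pixel categorical reconstruction
        return z_y, logits, loss
```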
3.3 Joint Diffusion Chain
The joint forward noising couples $(z_x, z_y)$: both latents are noised under a shared schedule,

$$z_x^t = \sqrt{\bar{\alpha}_t}\, z_x^0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_x, \qquad z_y^t = \sqrt{\bar{\alpha}_t}\, z_y^0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_y, \qquad \epsilon_x, \epsilon_y \sim \mathcal{N}(0, I).$$

Reverse denoising uses a unified predictor $\epsilon_\theta(z_x^t, z_y^t, t, c)$, optimized via MSE on the joint noise:

$$\mathcal{L}_{\text{joint}} = \mathbb{E}\left[\left\| (\epsilon_x, \epsilon_y) - \epsilon_\theta(z_x^t, z_y^t, t, c) \right\|^2\right].$$

Alternatively, the image and mask terms can be weighted by a coefficient $\lambda$.
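The joint objective can be sketched as follows, assuming the unified predictor returns separate noise estimates for the two latents; the names and the handling of the shared schedule are illustrative:

```python
import torch
import torch.nn.functional as F

def joint_diffusion_loss(eps_model, zx0, zy0, cond, alpha_bar, lam=1.0):
    """Joint noise-prediction loss over image and mask latents (shared schedule)."""
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (zx0.shape[0],), device=zx0.device)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    eps_x, eps_y = torch.randn_like(zx0), torch.randn_like(zy0)
    zxt = a.sqrt() * zx0 + (1 - a).sqrt() * eps_x   # noise both latents with the same t
    zyt = a.sqrt() * zy0 + (1 - a).sqrt() * eps_y
    eps_x_hat, eps_y_hat = eps_model(zxt, zyt, t, cond)   # unified predictor
    return F.mse_loss(eps_x_hat, eps_x) + lam * F.mse_loss(eps_y_hat, eps_y)
```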
4. Mask Optimization and Data Generation Pipeline
Despite joint modeling, generated masks can contain spurious small regions. To correct this, JoDiffusion uses boundary-mode correction: for each small connected region $R$ in $\hat{y}$ with area $|R| < \tau$ pixels (for a small threshold $\tau$), all pixels in $R$ are relabeled to the mode category among its boundary pixels $\partial R$:

$$\hat{y}(p) \leftarrow \operatorname{mode}\{\hat{y}(q) : q \in \partial R\} \quad \text{for all } p \in R.$$
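A possible implementation of this correction using SciPy connected-component labeling, treating the boundary as the pixels immediately surrounding each small region; the area threshold is a placeholder, as the paper's value is not reproduced here:

```python
import numpy as np
from scipy import ndimage

def boundary_mode_correction(mask, min_area=64):
    """Relabel small connected regions to the mode class of the surrounding pixels.

    `min_area` is a placeholder threshold, not the paper's value.
    """
    out = mask.copy()
    for cls in np.unique(mask):
        components, n = ndimage.label(mask == cls)       # connected components of this class
        for i in range(1, n + 1):
            region = components == i
            if region.sum() >= min_area:
                continue
            # Boundary: pixels just outside the region (dilation minus the region itself)
            boundary = ndimage.binary_dilation(region) & ~region
            if boundary.any():
                out[region] = np.bincount(out[boundary]).argmax()   # mode category
    return out
```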
Dataset synthesis proceeds as follows (a sampling-loop sketch follows the list):
- Sample a text prompt describing the scene.
- Initialize $(z_x^T, z_y^T) \sim \mathcal{N}(0, I)$.
- Perform $T$ joint reverse diffusion steps.
- Decode $z_x^0$ to the image $x$ and $z_y^0$ to the mask $y$ via the respective decoders.
- Optionally diversify prompts via template expansion or LLM-based paraphrasing.
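The loop below sketches the sampling steps above as a plain DDPM-style joint sampler; the decoder callables and the predictor interface (matching the training sketch) are assumptions for illustration:

```python
import torch

@torch.no_grad()
def sample_pair(eps_model, decode_image, decode_mask, prompt_embed, betas, shape_x, shape_y):
    """DDPM-style joint reverse diffusion yielding one (image, mask) pair."""
    alpha = 1.0 - betas
    alpha_bar = torch.cumprod(alpha, dim=0)
    zx, zy = torch.randn(shape_x), torch.randn(shape_y)     # init z_x^T, z_y^T ~ N(0, I)
    for t in reversed(range(len(betas))):                   # joint reverse diffusion steps
        tb = torch.full((shape_x[0],), t, dtype=torch.long)
        eps_x, eps_y = eps_model(zx, zy, tb, prompt_embed)  # unified noise predictor
        coef = (1 - alpha[t]) / (1 - alpha_bar[t]).sqrt()
        zx = (zx - coef * eps_x) / alpha[t].sqrt()
        zy = (zy - coef * eps_y) / alpha[t].sqrt()
        if t > 0:                                           # add posterior noise except at t = 0
            zx = zx + betas[t].sqrt() * torch.randn_like(zx)
            zy = zy + betas[t].sqrt() * torch.randn_like(zy)
    return decode_image(zx), decode_mask(zy)                # decode latents to x and y
```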
5. Implementation Specifics
- Image VAE: Derived from Stable Diffusion, 300M parameters.
- Annotation VAE: Lightweight convolutional network, 50M parameters, with a compact mask latent.
- Diffusion Backbone: U-ViT/Unidiffuser, 24 layers, 1024 hidden dimensions; diffusion uses a linearly scheduled noise variance $\beta_t$.
- Training Details: AdamW optimizer, batch size 64, 200k iterations. Data augmentation includes random flips and prompt augmentation.
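For concreteness, a linear noise schedule of the kind referenced above can be constructed as follows; the endpoint values and step count are common DDPM defaults used as placeholders, not values from the paper:

```python
import torch

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Linearly spaced noise variances beta_t; the values here are common DDPM defaults."""
    return torch.linspace(beta_start, beta_end, num_steps)

betas = linear_beta_schedule()
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t used by the loss sketches above
```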
6. Empirical Evaluation and Results
JoDiffusion demonstrates significant improvements in segmentation training when using generated datasets on standard benchmarks:
| Dataset & Model | Real mIoU | Best Prior (Synth/Real+Synth) | JoDiffusion (Synth/Real+Synth) |
|---|---|---|---|
| Pascal VOC (DeepLabV3-R50) | 77.4 | SDS: 60.4 / 77.6 | 72.5 / 78.3 |
| COCO (DeepLabV3-R50) | 48.9 | Dataset Diffusion: 32.4 / 54.6 | 42.6 / 56.4 |
| ADE20K (Mask2Former-R50) | 47.2 | FreeMask: 48.2 | 48.4 |
Across architectures (ResNet101, Swin-S), JoDiffusion delivered consistent 3–8 point improvements in mIoU over conventional Image2Mask and Mask2Image pipelines. Qualitative analysis reports sharp, pixel-precise contours and reliable object–label alignment, even for cluttered scenes and small object instances.
7. Scalability and Extensions
By conditioning generation solely on text, JoDiffusion can theoretically synthesize unlimited, diverse paired images and annotation maps without reliance on hand-crafted mask sets or semantic templates. Extensions proposed include multi-modal VAE integration (for depth, instance IDs), unified latent spaces for multi-task training, and LLM-guided prompt design to emphasize rare or challenging semantic classes (Wang et al., 15 Dec 2025). This suggests strong potential for domain-adaptive synthetic dataset construction and large-scale semantic segmentation model enhancement.