SynthSeg Agents: Synthetic Data for ZSWSSS
- SynthSeg Agents is a multi-agent framework that synthesizes high-quality training datasets using coordinated LLM and VLM agents for zero-shot weakly supervised semantic segmentation.
- It employs iterative self-refinement of prompts with CLIP-based semantic scoring and ViT-driven relabeling to ensure diverse and accurate synthetic annotations.
- The framework achieves competitive mIoU scores on benchmarks like PASCAL VOC 2012 and MS COCO 2014, narrowing the gap with traditional real-image training approaches.
SynthSeg Agents is a multi-agent framework designed to generate high-quality synthetic data for Zero-Shot Weakly Supervised Semantic Segmentation (ZSWSSS), a task that trains dense prediction models using only synthetic images with image-level labels and no access to real images. The framework coordinates LLM-driven agents for prompt synthesis and image generation with CLIP-based filtering and Vision Transformer (ViT)-based relabeling to produce synthetic datasets suitable for weakly supervised semantic segmentation pipelines. SynthSeg Agents demonstrates competitive segmentation performance on benchmarks such as PASCAL VOC 2012 and MS COCO 2014 without using any real images at either the data generation or training stage (Wu et al., 17 Dec 2025).
1. Problem Formulation and Objectives
SynthSeg Agents addresses the problem of Zero-Shot Weakly Supervised Semantic Segmentation (ZSWSSS), defined as learning a segmentation model that predicts per-pixel class probabilities on real images while being trained solely on a synthetic dataset $\mathcal{D}_{\text{syn}} = \{(x_i, y_i)\}_{i=1}^{N}$ generated without any real images. Here, $x_i$ is a synthetic image and $y_i \in \{0,1\}^{|C|}$ is a multi-hot image-level label over a class set $C$. The training objective minimizes a multi-label classification loss on globally pooled features:

$$\mathcal{L}_{\text{cls}} = -\frac{1}{|C|} \sum_{c \in C} \big[ y_c \log p_c + (1 - y_c) \log (1 - p_c) \big],$$

where $p_c$ is the predicted probability for class $c$.
This formulation decouples synthetic data generation from segmentation model training: the dataset $\mathcal{D}_{\text{syn}}$ is synthesized entirely by LLM/VLM-driven agents and then used as input to any off-the-shelf WSSS model.
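A minimal PyTorch sketch of this objective is given below; the dense-feature backbone, global average pooling, and tensor shapes are illustrative assumptions rather than the authors' exact configuration.

```python
# Sketch of the ZSWSSS training objective: a multi-label classification loss
# on globally pooled features. Backbone, pooling, and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassifierHead(nn.Module):
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) dense features from any backbone.
        pooled = feats.mean(dim=(2, 3))   # global average pooling (assumed)
        return self.fc(pooled)            # per-class logits, shape (B, |C|)

def multilabel_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # targets: multi-hot image-level labels y in {0,1}^{|C|}
    return F.binary_cross_entropy_with_logits(logits, targets.float())

# Toy usage on random tensors.
head = ClassifierHead(in_channels=256, num_classes=20)
feats = torch.randn(4, 256, 28, 28)       # features of synthetic images
y = torch.randint(0, 2, (4, 20))          # synthetic image-level labels
loss = multilabel_loss(head(feats), y)
loss.backward()
```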
2. Self-Refine Prompt Agent
The Self-Refine Prompt Agent generates a bank of diverse, high-quality scene prompts for each class via a staged process:
2.1 Template Instantiation
A templating function leverages an LLM (e.g., GPT-4o) to instantiate scene prompts by populating descriptors such as background, pose, and style for each target class, yielding an initial prompt set $P_0$.
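The sketch below illustrates this step, assuming the OpenAI chat API as the LLM backend; the template wording and descriptor values are hypothetical and only show how descriptors are slotted into a class-specific scene prompt.

```python
# Sketch of template instantiation. TEMPLATE and the descriptor slots
# (background, pose, style) are illustrative placeholders, not the paper's prompts.
from openai import OpenAI

TEMPLATE = (
    "Write one photo-realistic scene description of a {cls} "
    "with background '{background}', pose '{pose}', in a {style} style."
)

def instantiate_prompts(cls: str, descriptors: list[dict], model: str = "gpt-4o") -> list[str]:
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    prompts = []
    for d in descriptors:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": TEMPLATE.format(cls=cls, **d)}],
        )
        prompts.append(resp.choices[0].message.content.strip())
    return prompts

# Example descriptor combinations for the class "dog".
p0 = instantiate_prompts("dog", [
    {"background": "snowy park", "pose": "running", "style": "documentary"},
    {"background": "living room", "pose": "sleeping", "style": "soft-lit"},
])
```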
2.2 Iterative Self-Refinement & Diversity Filtering
The agent maintains a memory buffer $\mathcal{M}$ of accepted prompts and their CLIP text embeddings. During each refinement iteration:
- For a candidate prompt $p$, compute its CLIP text embedding $e_p$.
- Retrieve the nearest-neighbor embedding $e_{\text{nn}}$ in $\mathcal{M}$ using approximate nearest-neighbor (ANN) search.
- If the similarity $\mathrm{score}(e_p, e_{\text{nn}})$ falls below a diversity threshold $\tau_{\text{div}}$, the prompt is deemed sufficiently diverse and is added to both the refined prompt set $P^{*}$ and the memory $\mathcal{M}$.
- Prompts also undergo LLM-based quality checks and are rewritten with targeted templates when their quality score falls below a threshold $\tau_{\text{q}}$.
Pseudocode for this refinement loop is provided in Algorithm 1 of the source.
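A simplified sketch of the loop follows, assuming an embed_text function that returns unit-norm CLIP text embeddings and placeholder quality_check / refine_with_llm callables; the thresholds are illustrative, and a brute-force nearest-neighbor search stands in for the ANN index.

```python
# Sketch of the diversity-filtering refinement loop (simplified Algorithm 1).
# embed_text, quality_check, and refine_with_llm are assumed callables.
import numpy as np

def refine_prompts(candidates, embed_text, quality_check, refine_with_llm,
                   div_threshold=0.9, qual_threshold=0.5):
    memory = []       # accepted prompts (the memory buffer M)
    memory_emb = []   # their unit-norm CLIP text embeddings
    accepted = []
    for p in candidates:
        if quality_check(p) < qual_threshold:
            p = refine_with_llm(p)            # rewrite with a stricter template
        e = embed_text(p)                      # unit-norm embedding of the prompt
        if memory_emb:
            # brute-force nearest neighbour; swap in FAISS/ANN for large banks
            sims = np.array(memory_emb) @ e
            if sims.max() >= div_threshold:    # too similar to an accepted prompt
                continue
        accepted.append(p)
        memory.append(p)
        memory_emb.append(e)
    return accepted
```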
2.3 CLIP-Based Semantic Scoring
Text–text CLIP similarity is exploited for semantic scoring, formalized as:

$$\mathrm{score}(a, b) = \frac{1}{2}\left(1 + \frac{f(a) \cdot f(b)}{\lVert f(a) \rVert \, \lVert f(b) \rVert}\right),$$

where $f(\cdot)$ denotes the CLIP encoder, with output values in $[0,1]$. This metric governs both diversity acceptance during prompt generation (text–text) and sample selection during image filtering (text–image).
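A sketch of the text–text variant using Hugging Face CLIP is shown below; the rescaling of cosine similarity from $[-1,1]$ to $[0,1]$ matches the range stated above but is otherwise an assumption about the exact scoring form.

```python
# Sketch of CLIP text-text semantic scoring with a [0,1]-rescaled cosine similarity.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def text_score(a: str, b: str) -> float:
    tokens = tokenizer([a, b], padding=True, return_tensors="pt")
    emb = model.get_text_features(**tokens)
    emb = emb / emb.norm(dim=-1, keepdim=True)    # unit-normalize embeddings
    cos = (emb[0] @ emb[1]).item()                # cosine similarity in [-1, 1]
    return 0.5 * (cos + 1.0)                      # rescale to [0, 1]

print(text_score("a dog running in a snowy park", "dog"))
```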
3. Image Generation Agent
The Image Generation Agent receives the refined prompt bank $P^{*}$ and synthesizes labeled image samples through a three-stage process:
3.1 Vision–LLM Sampling
Each prompt $p \in P^{*}$ is supplied to a pretrained Vision-LLM (VLM), such as GPT-Image-1, to generate an image $x$.
3.2 CLIP-Based Dual Filtering
Determination of the classes present in a generated image $x$ proceeds via dual alignment (a code sketch follows this list):
- Text alignment: compare CLIP text embeddings of the generating prompt and each class label, $\mathrm{score}(p, c)$; retain class $c$ if the score is above a threshold (e.g., 0.7).
- Image alignment: compute the similarity between the generated image embedding and each retained class-label embedding, $\mathrm{score}(x, c)$. Only the top-N scoring class–image associations are retained, forming the filtered set $\mathcal{D}_{\text{filt}}$.
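The sketch below shows the dual filter as plain control flow; text_score and image_score are assumed CLIP-based scoring functions in $[0,1]$ (e.g., the text_score sketch above plus an analogous image–text score), the 0.7 text threshold comes from the description, and top_n is an illustrative default.

```python
# Sketch of the dual CLIP filter: text alignment first, then image alignment.
def dual_filter(samples, class_names, text_score, image_score,
                text_thr=0.7, top_n=2):
    """samples: list of (prompt, image) pairs produced by the VLM."""
    kept = []
    for prompt, image in samples:
        # Stage 1: text alignment between the generating prompt and each label.
        candidates = [c for c in class_names if text_score(prompt, c) >= text_thr]
        # Stage 2: image alignment; keep only the top-N class-image associations.
        scored = sorted(((image_score(image, c), c) for c in candidates),
                        reverse=True)
        kept.append((image, [c for _, c in scored[:top_n]]))
    return kept
```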
3.3 ViT-Based Classifier & Relabeling
A ViT-B/32 model is trained on $\mathcal{D}_{\text{filt}}$ for multi-label classification. Each image is patch-embedded into tokens $z_1, \dots, z_P$, from which per-patch class logits $s_{p,c}$ are computed. Aggregation uses global max pooling over patches, $s_c = \max_p s_{p,c}$, with a binary cross-entropy loss:

$$\mathcal{L}_{\text{ViT}} = -\frac{1}{|C|} \sum_{c \in C} \big[ y_c \log \sigma(s_c) + (1 - y_c) \log(1 - \sigma(s_c)) \big],$$

where $\sigma$ denotes the sigmoid function.
After convergence, the classifier relabels the entire synthetic dataset, including lower-confidence images, yielding the final training set $\mathcal{D}_{\text{syn}}$.
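A sketch of such a classifier is shown below; the Hugging Face ViT checkpoint and the linear per-patch head are stand-in assumptions for the paper's ViT-B/32 setup.

```python
# Sketch of the ViT relabeling classifier: per-patch class logits, global max
# pooling over patches, and a BCE loss on multi-hot image-level labels.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import ViTModel

class PatchMaxPoolClassifier(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.backbone = ViTModel.from_pretrained("google/vit-base-patch32-224-in21k")
        self.head = nn.Linear(self.backbone.config.hidden_size, num_classes)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        tokens = self.backbone(pixel_values=pixel_values).last_hidden_state
        patch_logits = self.head(tokens[:, 1:, :])   # drop [CLS]; shape (B, P, |C|)
        return patch_logits.max(dim=1).values        # global max pool over patches

model = PatchMaxPoolClassifier(num_classes=20)
images = torch.randn(2, 3, 224, 224)                 # toy batch of synthetic images
labels = torch.randint(0, 2, (2, 20)).float()        # multi-hot image-level labels
loss = F.binary_cross_entropy_with_logits(model(images), labels)

# After training converges, relabel every synthetic image, e.g.:
# probs = torch.sigmoid(model(images)); new_labels = probs > 0.5
```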
4. Integrated Pipeline and Training
SynthSeg Agents operates in two sequential stages:
| Sequence | Module | Input / Output |
|---|---|---|
| 1 | Self-Refine Prompt Agent | class set $C$ → refined prompt bank $P^{*}$ |
| 2 | Image Generation Agent | $P^{*}$ → final synthetic dataset $\mathcal{D}_{\text{syn}}$ |
Once the synthetic dataset is established, any standard WSSS segmentation architecture (such as SEAM, ToCo, DeepLab) is trained with the classification loss and segmentation-specific objectives. Segmentation performance is measured in terms of mean Intersection-over-Union (mIoU):

$$\mathrm{mIoU} = \frac{1}{|C|} \sum_{c \in C} \frac{|P_c \cap G_c|}{|P_c \cup G_c|},$$

where $P_c$ and $G_c$ denote the predicted and ground-truth masks for class $c$.
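A compact numpy sketch of this metric is given below; the ignore_index handling is a common evaluation convention and an assumption here, not a detail from the source.

```python
# Sketch of per-class IoU and mIoU from predicted / ground-truth label maps.
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int,
             ignore_index: int = 255) -> float:
    valid = gt != ignore_index                 # mask out ignored pixels (assumed convention)
    ious = []
    for c in range(num_classes):
        p, g = (pred == c) & valid, (gt == c) & valid
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue                           # class absent in both masks; skip it
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```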
5. Experimental Evaluation
SynthSeg Agents is instantiated on two major benchmarks:
5.1 PASCAL VOC 2012
- 20 classes, 10k synthetic images generated.
- Baseline WSSS models (ToCo, SeCo) trained on real images reach mIoU scores of 70.2 and above.
- In zero-shot mode with purely synthetic data, SynthSeg Agents achieves competitive mIoU with both the ToCo and SeCo backbones, narrowing the gap to real-image training.
- Fine-tuning on real images further improves mIoU on seen classes.
5.2 MS COCO 2014
- 80 classes, 80k synthetic images synthesized.
- State-of-the-art real-image WSSS models reach mIoU scores of 43.6 and above.
- SynthSeg Agents achieves competitive mIoU without using any real images.
- Mixing synthetic and real data for fine-tuning surpasses the tested real-image baselines in mIoU.
5.3 Ablation Studies
Ablation experiments quantify contributions of agent modules:
| Component | mIoU (%) |
|---|---|
| Prompt Agent (Template Only) | 48.1 |
| + Self-Refine (quality scoring) | 49.9 |
| + CLIP diversity filtering | 52.5 |
| Image Agent (class-label only) | 46.7 |
| + CLIP filter | 50.8 |
| + CLIP + ViT relabel | 52.5 |
This demonstrates that both iterative prompt refinement with semantic diversity filtering and CLIP/ViT-driven relabeling yield substantial performance improvements.
5.4 Qualitative Analysis
Synthetic images for classes such as “dog,” “horse,” and “airplane” display diverse object poses, backgrounds, and multi-object compositions, outcomes directly attributable to prompt diversity and memory-based filtering.
6. Significance and Implications
SynthSeg Agents establishes that high-quality synthetic training datasets, generated entirely by coordinated LLM and VLM agents, can enable WSSS pipelines to achieve competitive segmentation performance in a true zero-shot setting. The modular architecture, comprising separate agents for prompt generation and for image generation with filtering and label refinement, supports semantic diversity and controllable data synthesis. Fine-tuning with real data further narrows the gap with traditionally supervised approaches and, in some settings, surpasses them. This framework highlights the potential for scalable WSSS and data-efficient semantic segmentation workflows unbounded by real-image availability (Wu et al., 17 Dec 2025).
A plausible implication is that LLM-driven synthetic data engines may become central to future semantic segmentation, particularly in domains or tasks where annotated datasets are scarce or unavailable.