
SynthSeg Agents: Synthetic Data for ZSWSSS

Updated 20 December 2025
  • SynthSeg Agents is a multi-agent framework that synthesizes high-quality training datasets using coordinated LLM and VLM agents for zero-shot weakly supervised semantic segmentation.
  • It employs iterative self-refinement of prompts with CLIP-based semantic scoring and ViT-driven relabeling to ensure diverse and accurate synthetic annotations.
  • The framework achieves competitive mIoU scores on benchmarks like PASCAL VOC 2012 and MS COCO 2014, narrowing the gap with traditional real-image training approaches.

SynthSeg Agents is a multi-agent framework designed to generate high-quality synthetic data for Zero-Shot Weakly Supervised Semantic Segmentation (ZSWSSS), a task that seeks to train dense prediction models using only synthetic images with image-level labels and no access to real images. The framework employs coordinated LLM-driven agents for prompt synthesis and image generation, together with CLIP-based filtering and Vision Transformer (ViT)-based relabeling, to produce synthetic datasets suitable for weakly supervised semantic segmentation pipelines. SynthSeg Agents demonstrates competitive segmentation performance on benchmarks such as PASCAL VOC 2012 and MS COCO 2014 without using any real images at either the data generation or training stage (Wu et al., 17 Dec 2025).

1. Problem Formulation and Objectives

SynthSeg Agents addresses the problem of Zero-Shot Weakly Supervised Semantic Segmentation (ZSWSSS), which is defined as learning a segmentation model $M_{\mathrm{seg}}(\cdot)$ that predicts per-pixel class probabilities for real images, but is trained solely on a synthetic dataset $\mathcal{D}_{\mathrm{ZSWSSS}} = \{(I_i, L_i)\}_{i=1}^n$ generated without real images. Here, $I_i \in \mathbb{R}^{H\times W\times 3}$ is a synthetic image, and $L_i \in \{0,1\}^{|\mathcal{C}|}$ is a multi-hot image-level label over a class set $\mathcal{C} = \{c_1, \ldots, c_{|\mathcal{C}|}\}$. The training objective minimizes a multi-label classification loss on globally pooled features:

$$\mathcal{L}_{\mathrm{cls}} = -\frac{1}{|\mathcal{C}|} \sum_{c\in\mathcal{C}} \left[\, y_c \log p_c + (1-y_c) \log (1-p_c) \,\right],$$

where $p_c$ is the predicted probability for class $c$ and $y_c \in \{0,1\}$ is the corresponding image-level label.

This decouples synthetic data generation from segmentation model training: the dataset $\mathcal{D}_{\mathrm{ZSWSSS}}$ is synthesized entirely by LLM/VLM-driven agents and used as input to any off-the-shelf WSSS model.
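
For concreteness, a minimal PyTorch sketch of this classification objective, assuming a backbone feature map, global max pooling, and a linear classification head (all names and shapes are illustrative, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def classification_loss(feature_map, labels, head_weight):
    """feature_map: (B, D, H, W) backbone features;
    labels: (B, C) multi-hot image-level labels;
    head_weight: (D, C) linear classification head."""
    pooled = feature_map.flatten(2).max(dim=2).values  # global max pooling over spatial dims -> (B, D)
    logits = pooled @ head_weight                      # per-class logits, shape (B, C)
    # binary cross-entropy averaged over classes (and the batch), matching L_cls
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```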

2. Self-Refine Prompt Agent

The Self-Refine Prompt Agent generates a bank of diverse, high-quality scene prompts for each class $c \in \mathcal{C}$ via a staged process:

2.1 Template Instantiation

A templating function $\mathcal{T}(c)$ leverages an LLM (e.g., GPT-4o) to instantiate scene prompts by populating descriptors such as background, pose, and style for each target class, yielding an initial prompt set $P_{\mathrm{init}}$.
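
A hypothetical sketch of this step, assuming a generic `llm(str) -> str` wrapper and an illustrative template (neither the template text nor the descriptor format is specified in the source):

```python
# Hypothetical sketch of the templating function T(c): an LLM proposes descriptor
# triples, and a fixed template turns them into scene prompts.
TEMPLATE = ("A photorealistic photo of a {cls} in {background}, "
            "{pose}, rendered in {style} style.")

def instantiate_prompts(cls_name, llm, n_prompts=20):
    """llm: any callable str -> str (e.g., a GPT-4o wrapper); returns P_init for one class."""
    instruction = (f"Propose {n_prompts} diverse background, pose, style descriptor "
                   f"triples for images of a {cls_name}, one per line, comma-separated.")
    prompts = []
    for line in llm(instruction).splitlines():
        parts = [s.strip() for s in line.split(",")]
        if len(parts) < 3:
            continue  # skip malformed lines returned by the LLM
        background, pose, style = parts[:3]
        prompts.append(TEMPLATE.format(cls=cls_name, background=background,
                                       pose=pose, style=style))
    return prompts
```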

2.2 Iterative Self-Refinement & Diversity Filtering

The agent maintains a memory buffer $\mathcal{M}$ of accepted prompts and their CLIP text embeddings. During each refinement iteration:

  • For a prompt $p$, compute the embedding $e_p = f_{\mathrm{clip\_text}}(p)$.
  • Retrieve the nearest-neighbor embedding $e_{nn}$ in $\mathcal{M}$ using approximate nearest-neighbor (ANN) search.
  • If $\mathrm{score}_{\mathrm{div}}(p) = \cos(e_p, e_{nn}) < \delta$ (e.g., $\delta = 0.92$), the prompt is sufficiently diverse and is added to the refined set $P_{\mathrm{refined}}$ and to $\mathcal{M}$.
  • Prompts also undergo LLM-based quality checks and are refined with targeted templates if their quality score falls below a threshold $\epsilon$.

Pseudocode for this refinement loop is provided in Algorithm 1 of the source.
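
A minimal Python sketch of the diversity-filtering step (not the paper's Algorithm 1; `clip_text`, the exact nearest-neighbor search, and the default `delta` are illustrative assumptions):

```python
# Diversity filtering: accept a prompt only if it is sufficiently far from all
# previously accepted prompts in CLIP text-embedding space.
import numpy as np

def filter_diverse(prompts, clip_text, delta=0.92):
    """clip_text: assumed CLIP text encoder returning L2-normalized numpy vectors."""
    memory = []    # accepted CLIP text embeddings (the buffer M)
    refined = []   # accepted prompts P_refined
    for p in prompts:
        e_p = clip_text(p)                     # unit-norm embedding of prompt p
        if memory:
            sims = np.stack(memory) @ e_p      # cosine similarities to all accepted prompts
            if sims.max() >= delta:
                continue                       # too close to an existing prompt: reject
        refined.append(p)
        memory.append(e_p)
    return refined
```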

2.3 CLIP-Based Semantic Scoring

Text–text CLIP similarity is exploited for semantic scoring, formalized as:

$$S_{\mathrm{clip}}(p, v) = \frac{1 + \cos\!\big(f_{\mathrm{clip\_text}}(p),\, v\big)}{2},$$

with output values in [0,1]. This metric governs both diversity acceptance during prompt generation (text–text) and selection in image filtering (text–image).
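
As a minimal illustration (assuming unit-normalized CLIP embeddings, whether text or image):

```python
# Map cosine similarity between unit-norm embeddings from [-1, 1] into [0, 1].
import numpy as np

def clip_score(e_a, e_b):
    return (1.0 + float(np.dot(e_a, e_b))) / 2.0
```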

3. Image Generation Agent

The Image Generation Agent receives the filtered prompt bank $P_{\mathrm{refined}}$ and synthesizes labeled image samples through a three-stage process:

3.1 Vision–LLM Sampling

Each prompt $p_i \in P_{\mathrm{refined}}$ is supplied to a pretrained Vision-LLM (VLM), such as GPT-Image-1, to generate an image $I_{\mathrm{gen}} = f_{\mathrm{VLM}}(p_i) \in \mathbb{R}^{H \times W \times 3}$.

3.2 CLIP-Based Dual Filtering

Determination of the classes present in $I_{\mathrm{gen}}$ proceeds via dual alignment:

  • Text alignment: compare CLIP embeddings of the prompt and the class label, $\mathrm{score}_{\mathrm{text}}(p_i, c)$; retain the class if the score exceeds a threshold $\gamma_1$ (e.g., 0.7).
  • Image alignment: compute the similarity between the generated image embedding and the class label embedding, $\mathrm{score}_{\mathrm{image}}(I_{\mathrm{gen}}, c)$. Only the top-$N$ scoring class–image associations are retained, forming the set $\mathcal{D}_{\mathrm{high}}$ (see the sketch after this list).
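
A hedged sketch of the dual filter under the stated assumptions (`clip_text` / `clip_image` are assumed CLIP encoders returning unit-norm vectors; `top_n` is an illustrative budget, not a value from the paper):

```python
import numpy as np

def clip_score(a, b):
    return (1.0 + float(np.dot(a, b))) / 2.0      # S_clip in [0, 1]

def dual_filter(samples, class_names, clip_text, clip_image,
                gamma1=0.7, top_n=1000):
    """samples: list of (prompt, generated_image); returns the high-confidence set D_high."""
    class_emb = {c: clip_text(c) for c in class_names}
    scored = []
    for prompt, image in samples:
        e_p, e_img = clip_text(prompt), clip_image(image)
        for c, e_c in class_emb.items():
            if clip_score(e_p, e_c) < gamma1:      # text alignment below gamma_1: skip class
                continue
            scored.append((clip_score(e_img, e_c), image, c))   # image-alignment score
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(img, c) for _, img, c in scored[:top_n]]            # keep top-N associations
```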

3.3 ViT-Based Classifier & Relabeling

A ViT-B/32 model is trained on $\mathcal{D}_{\mathrm{high}}$ for multi-label classification. Images are patch-embedded as $F = \mathrm{ViT}(I_{\mathrm{gen}}) \in \mathbb{R}^{s \times e}$ with class logits $Z = \mathrm{softmax}(FW)$, $W \in \mathbb{R}^{e \times |\mathcal{C}|}$. Aggregation uses global max pooling and a binary cross-entropy loss:

$$\mathcal{L}_{\mathrm{MCE}} = -\frac{1}{|\mathcal{C}|} \sum_{c} \left[\, y_c \log p_c + (1-y_c) \log(1-p_c) \,\right].$$

After convergence, the classifier relabels the entire synthetic dataset, including lower-confidence images, yielding the final training set $\mathcal{D}_{\mathrm{ZSWSSS}}$.
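
A minimal PyTorch sketch of the relabeling pass, assuming a trained multi-label classifier `classifier(image) -> (|C|,) logits`; the sigmoid with a 0.5 threshold is an illustrative multi-label readout, not the paper's exact scoring rule:

```python
import torch

@torch.no_grad()
def relabel_dataset(images, classifier, threshold=0.5):
    """Relabel every synthetic image (including low-confidence ones dropped by the
    CLIP filter), producing the final training set D_ZSWSSS."""
    dataset = []
    for img in images:
        probs = torch.sigmoid(classifier(img))     # per-class probabilities
        label = (probs > threshold).long()         # multi-hot image-level label L_i
        dataset.append((img, label))
    return dataset
```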

4. Integrated Pipeline and Training

SynthSeg Agents operates in two sequential stages:

| Sequence | Module | Input → Output |
|---|---|---|
| 1 | Self-Refine Prompt Agent | class set $\mathcal{C}$ → $P_{\mathrm{refined}}$ |
| 2 | Image Generation Agent | $P_{\mathrm{refined}}$ → $\mathcal{D}_{\mathrm{ZSWSSS}}$ |

Once the synthetic dataset is established, any standard WSSS segmentation architecture (such as SEAM, ToCo, or DeepLab) is trained with the classification loss $\mathcal{L}_{\mathrm{cls}}$ and segmentation-specific objectives. Segmentation performance is measured in terms of mean Intersection-over-Union (mIoU):

$$\mathrm{mIoU} = \frac{1}{|\mathcal{C}|} \sum_{c} \frac{|P_c \cap G_c|}{|P_c \cup G_c|},$$

where $P_c$ and $G_c$ denote the predicted and ground-truth masks for class $c$.
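
A short sketch of this metric, assuming integer-valued label maps of equal shape (a common per-class IoU averaging convention; details of the paper's evaluation protocol may differ):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """pred, gt: (H, W) arrays of class indices; returns mIoU averaged over classes."""
    ious = []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue                               # class absent from both: commonly skipped
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```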

5. Experimental Evaluation

SynthSeg Agents is instantiated on two major benchmarks:

5.1 PASCAL VOC 2012

  • 20 classes, 10k synthetic images generated.
  • Baseline WSSS models (ToCo, Seco) trained on real images yield 70.2–74.0% mIoU.
  • SynthSeg Agents, in zero-shot mode, achieves 57.4% (ToCo) and 60.1% (Seco) mIoU with purely synthetic data.
  • Fine-tuning on real images improves performance to 75.4% mIoU (on seen classes).

5.2 MS COCO 2014

  • 80 classes, 80k synthetic images synthesized.
  • State-of-the-art real-image WSSS performance: 43.6–46.7% mIoU.
  • SynthSeg Agents achieves 30.2% mIoU without any real images.
  • Mixing synthetic and real data for fine-tuning produces 47.8% mIoU, surpassing the tested baselines.

5.3 Ablation Studies

Ablation experiments quantify contributions of agent modules:

| Component | mIoU (%) |
|---|---|
| Prompt Agent (template only) | 48.1 |
| + Self-Refine (quality scoring) | 49.9 |
| + CLIP diversity filtering | 52.5 |
| Image Agent (class-label only) | 46.7 |
| + CLIP filter | 50.8 |
| + CLIP + ViT relabel | 52.5 |

This demonstrates that both iterative prompt refinement with semantic diversity filtering and CLIP/ViT-driven relabeling yield substantial performance improvements.

5.4 Qualitative Analysis

Synthetic images for classes such as “dog,” “horse,” and “airplane” display diverse object poses, backgrounds, and multi-object compositions, outcomes directly attributable to prompt diversity and memory-based filtering.

6. Significance and Implications

SynthSeg Agents establishes that high-quality synthetic training datasets, generated entirely by coordinated LLM and VLM agents, can enable WSSS pipelines to achieve competitive segmentation performance in a true zero-shot setting. The modular architecture, comprising separate agents for prompt generation and for image filtering and label refinement, supports semantic diversity and controllable data synthesis. Fine-tuning with real data further narrows the gap with, and in some settings exceeds, traditionally supervised approaches. The framework highlights the potential for scalable WSSS and data-efficient semantic segmentation workflows that are not bounded by real-image availability (Wu et al., 17 Dec 2025).

A plausible implication is that LLM-driven synthetic data engines may become central to future semantic segmentation, particularly in domains or tasks where annotated datasets are scarce or unavailable.
