
SynthSeg Agents: Synthetic Data for ZSWSSS

Updated 20 December 2025
  • SynthSeg Agents is a multi-agent framework that synthesizes high-quality training datasets using coordinated LLM and VLM agents for zero-shot weakly supervised semantic segmentation.
  • It employs iterative self-refinement of prompts with CLIP-based semantic scoring and ViT-driven relabeling to ensure diverse and accurate synthetic annotations.
  • The framework achieves competitive mIoU scores on benchmarks like PASCAL VOC 2012 and MS COCO 2014, narrowing the gap with traditional real-image training approaches.

SynthSeg Agents is a multi-agent framework designed to generate high-quality synthetic data for Zero-Shot Weakly Supervised Semantic Segmentation (ZSWSSS), a task that seeks to train dense prediction models using only synthetic images with image-level labels and no access to real images. The framework employs coordinated LLM-driven agents for prompt synthesis and image generation, together with CLIP-based filtering and Vision Transformer (ViT)-based relabeling, to produce synthetic datasets suitable for weakly supervised semantic segmentation pipelines. SynthSeg Agents demonstrates competitive segmentation performance on benchmarks such as PASCAL VOC 2012 and MS COCO 2014 without using any real images at either the data generation or training stage (Wu et al., 17 Dec 2025).

1. Problem Formulation and Objectives

SynthSeg Agents addresses the problem of Zero-Shot Weakly Supervised Semantic Segmentation (ZSWSSS), which is defined as learning a segmentation model $M_{\mathrm{seg}}(\cdot)$ that predicts per-pixel class probabilities for real images, but is trained solely on a synthetic dataset $\mathcal{D}_{\mathrm{ZSWSSS}} = \{(I_i, L_i)\}_{i=1}^n$ generated without real images. Here, $I_i \in \mathbb{R}^{H\times W\times 3}$ is a synthetic image, and $L_i \in \{0,1\}^{|\mathcal{C}|}$ is a multi-hot image-level label over a class set $\mathcal{C} = \{c_1, \ldots, c_{|\mathcal{C}|}\}$. The training objective minimizes a multi-label classification loss on globally pooled features:

$$\mathcal{L}_{\mathrm{cls}} = -\frac{1}{|\mathcal{C}|} \sum_{c\in\mathcal{C}} \left[\, y_c \log p_c + (1-y_c) \log (1-p_c) \,\right],$$

where $p_c$ is the predicted probability for class $c$ and $y_c \in \{0,1\}$ is the corresponding image-level label.

This decouples synthetic data generation from segmentation model training: the dataset $\mathcal{D}_{\mathrm{ZSWSSS}}$ is synthesized entirely by LLM/VLM-driven agents and used as input to any off-the-shelf WSSS model.
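
For concreteness, a minimal PyTorch sketch of this classification objective, assuming a backbone feature map, global max pooling, and a linear classification head (all names and shapes are illustrative, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def classification_loss(feature_map, labels, head_weight):
    """feature_map: (B, D, H, W) backbone features;
    labels: (B, C) multi-hot image-level labels;
    head_weight: (D, C) linear classification head."""
    pooled = feature_map.flatten(2).max(dim=2).values  # global max pooling over spatial dims -> (B, D)
    logits = pooled @ head_weight                      # per-class logits, shape (B, C)
    # binary cross-entropy averaged over classes (and the batch), matching L_cls
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```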

2. Self-Refine Prompt Agent

The Self-Refine Prompt Agent generates a bank of diverse, high-quality scene prompts for each class $c \in \mathcal{C}$ via a staged process:

2.1 Template Instantiation

A templating function $\mathcal{T}(c)$ leverages an LLM (e.g., GPT-4o) to instantiate scene prompts by populating descriptors such as background, pose, and style for each target class, yielding an initial prompt set $P_{\mathrm{init}}$.
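
A hypothetical sketch of this step, assuming a generic `llm(str) -> str` wrapper and an illustrative template (neither the template text nor the descriptor format is specified in the source):

```python
# Hypothetical sketch of the templating function T(c): an LLM proposes descriptor
# triples, and a fixed template turns them into scene prompts.
TEMPLATE = ("A photorealistic photo of a {cls} in {background}, "
            "{pose}, rendered in {style} style.")

def instantiate_prompts(cls_name, llm, n_prompts=20):
    """llm: any callable str -> str (e.g., a GPT-4o wrapper); returns P_init for one class."""
    instruction = (f"Propose {n_prompts} diverse background, pose, style descriptor "
                   f"triples for images of a {cls_name}, one per line, comma-separated.")
    prompts = []
    for line in llm(instruction).splitlines():
        parts = [s.strip() for s in line.split(",")]
        if len(parts) < 3:
            continue  # skip malformed lines returned by the LLM
        background, pose, style = parts[:3]
        prompts.append(TEMPLATE.format(cls=cls_name, background=background,
                                       pose=pose, style=style))
    return prompts
```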

2.2 Iterative Self-Refinement & Diversity Filtering

The agent maintains a memory buffer $\mathcal{M}$ of accepted prompts and their CLIP text embeddings. During each refinement iteration:

  • For a prompt $p$, compute the embedding $e_p = f_{\mathrm{clip\_text}}(p)$.
  • Retrieve the nearest-neighbor embedding $e_{nn}$ in $\mathcal{M}$ using approximate nearest-neighbor (ANN) search.
  • If $\mathrm{score}_{\mathrm{div}}(p) = \cos(e_p, e_{nn}) < \delta$ (e.g., $\delta = 0.92$), the prompt is sufficiently diverse and is added to the refined set $P_{\mathrm{refined}}$ and to $\mathcal{M}$.
  • Prompts also undergo LLM-based quality checks and are refined with targeted templates if their quality score falls below a threshold $\epsilon$.

Pseudocode for this refinement loop is provided in Algorithm 1 of the source.
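
A minimal Python sketch of the diversity-filtering step (not the paper's Algorithm 1; `clip_text`, the exact nearest-neighbor search, and the default `delta` are illustrative assumptions):

```python
# Diversity filtering: accept a prompt only if it is sufficiently far from all
# previously accepted prompts in CLIP text-embedding space.
import numpy as np

def filter_diverse(prompts, clip_text, delta=0.92):
    """clip_text: assumed CLIP text encoder returning L2-normalized numpy vectors."""
    memory = []    # accepted CLIP text embeddings (the buffer M)
    refined = []   # accepted prompts P_refined
    for p in prompts:
        e_p = clip_text(p)                     # unit-norm embedding of prompt p
        if memory:
            sims = np.stack(memory) @ e_p      # cosine similarities to all accepted prompts
            if sims.max() >= delta:
                continue                       # too close to an existing prompt: reject
        refined.append(p)
        memory.append(e_p)
    return refined
```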

2.3 CLIP-Based Semantic Scoring

Text–text CLIP similarity is exploited for semantic scoring, formalized as:

$$S_{\mathrm{clip}}(p, v) = \frac{1 + \cos\!\big(f_{\mathrm{clip\_text}}(p),\, v\big)}{2},$$

with output values in [0,1]. This metric governs both diversity acceptance during prompt generation (text–text) and selection in image filtering (text–image).
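
As a minimal illustration (assuming unit-normalized CLIP embeddings, whether text or image):

```python
# Map cosine similarity between unit-norm embeddings from [-1, 1] into [0, 1].
import numpy as np

def clip_score(e_a, e_b):
    return (1.0 + float(np.dot(e_a, e_b))) / 2.0
```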

3. Image Generation Agent

The Image Generation Agent receives the filtered prompt bank $P_{\mathrm{refined}}$ and synthesizes labeled image samples through a three-stage process:

3.1 Vision–LLM Sampling

Each prompt $p_i \in P_{\mathrm{refined}}$ is supplied to a pretrained Vision-LLM (VLM), such as GPT-Image-1, to generate an image $I_{\mathrm{gen}} = f_{\mathrm{VLM}}(p_i) \in \mathbb{R}^{H \times W \times 3}$.

3.2 CLIP-Based Dual Filtering

Determination of the classes present in $I_{\mathrm{gen}}$ proceeds via dual alignment:

  • Text alignment: compare CLIP embeddings of the prompt and the class label, $\mathrm{score}_{\mathrm{text}}(p_i, c)$; retain the class if the score exceeds a threshold $\gamma_1$ (e.g., 0.7).
  • Image alignment: compute the similarity between the generated image embedding and the class label embedding, $\mathrm{score}_{\mathrm{image}}(I_{\mathrm{gen}}, c)$. Only the top-$N$ scoring class–image associations are retained, forming the set $\mathcal{D}_{\mathrm{high}}$ (see the sketch after this list).
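
A hedged sketch of the dual filter under the stated assumptions (`clip_text` / `clip_image` are assumed CLIP encoders returning unit-norm vectors; `top_n` is an illustrative budget, not a value from the paper):

```python
import numpy as np

def clip_score(a, b):
    return (1.0 + float(np.dot(a, b))) / 2.0      # S_clip in [0, 1]

def dual_filter(samples, class_names, clip_text, clip_image,
                gamma1=0.7, top_n=1000):
    """samples: list of (prompt, generated_image); returns the high-confidence set D_high."""
    class_emb = {c: clip_text(c) for c in class_names}
    scored = []
    for prompt, image in samples:
        e_p, e_img = clip_text(prompt), clip_image(image)
        for c, e_c in class_emb.items():
            if clip_score(e_p, e_c) < gamma1:      # text alignment below gamma_1: skip class
                continue
            scored.append((clip_score(e_img, e_c), image, c))   # image-alignment score
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(img, c) for _, img, c in scored[:top_n]]            # keep top-N associations
```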

3.3 ViT-Based Classifier & Relabeling

A ViT-B/32 model is trained on $\mathcal{D}_{\mathrm{high}}$ for multi-label classification. Images are patch-embedded as $F = \mathrm{ViT}(I_{\mathrm{gen}}) \in \mathbb{R}^{s \times e}$ with class logits $Z = \mathrm{softmax}(FW)$, $W \in \mathbb{R}^{e \times |\mathcal{C}|}$. Aggregation uses global max pooling and a binary cross-entropy loss:

$$\mathcal{L}_{\mathrm{MCE}} = -\frac{1}{|\mathcal{C}|} \sum_{c} \left[\, y_c \log p_c + (1-y_c) \log(1-p_c) \,\right].$$

After convergence, the classifier relabels the entire synthetic dataset, including lower-confidence images, yielding the final training set $\mathcal{D}_{\mathrm{ZSWSSS}}$.
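
A minimal PyTorch sketch of the relabeling pass, assuming a trained multi-label classifier `classifier(image) -> (|C|,) logits`; the sigmoid with a 0.5 threshold is an illustrative multi-label readout, not the paper's exact scoring rule:

```python
import torch

@torch.no_grad()
def relabel_dataset(images, classifier, threshold=0.5):
    """Relabel every synthetic image (including low-confidence ones dropped by the
    CLIP filter), producing the final training set D_ZSWSSS."""
    dataset = []
    for img in images:
        probs = torch.sigmoid(classifier(img))     # per-class probabilities
        label = (probs > threshold).long()         # multi-hot image-level label L_i
        dataset.append((img, label))
    return dataset
```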

4. Integrated Pipeline and Training

SynthSeg Agents operates in two sequential stages:

| Sequence | Module | Input → Output |
|---|---|---|
| 1 | Self-Refine Prompt Agent | class set $\mathcal{C}$ → $P_{\mathrm{refined}}$ |
| 2 | Image Generation Agent | $P_{\mathrm{refined}}$ → $\mathcal{D}_{\mathrm{ZSWSSS}}$ |

Once the synthetic dataset is established, any standard WSSS segmentation architecture (such as SEAM, ToCo, or DeepLab) is trained with the classification loss $\mathcal{L}_{\mathrm{cls}}$ and segmentation-specific objectives. Segmentation performance is measured in terms of mean Intersection-over-Union (mIoU):

$$\mathrm{mIoU} = \frac{1}{|\mathcal{C}|} \sum_{c} \frac{|P_c \cap G_c|}{|P_c \cup G_c|},$$

where $P_c$ and $G_c$ denote the predicted and ground-truth masks for class $c$.
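
A short sketch of this metric, assuming integer-valued label maps of equal shape (a common per-class IoU averaging convention; details of the paper's evaluation protocol may differ):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """pred, gt: (H, W) arrays of class indices; returns mIoU averaged over classes."""
    ious = []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue                               # class absent from both: commonly skipped
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```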

5. Experimental Evaluation

SynthSeg Agents is instantiated on two major benchmarks:

5.1 PASCAL VOC 2012

  • 20 classes, 10k synthetic images generated.
  • Baseline WSSS models (ToCo, Seco) trained on real images yield 70.2–74.0% mIoU.
  • SynthSeg Agents, in zero-shot mode, achieves 57.4% (ToCo) and 60.1% (Seco) mIoU with purely synthetic data.
  • Fine-tuning on real images improves performance to 75.4% mIoU (on seen classes).

5.2 MS COCO 2014

  • 80 classes, 80k synthetic images synthesized.
  • State-of-the-art real-image WSSS performance: 43.6–46.7% mIoU.
  • SynthSeg Agents achieves 30.2% mIoU without any real images.
  • Mixing synthetic and real data for fine-tuning produces 47.8% mIoU, surpassing the tested baselines.

5.3 Ablation Studies

Ablation experiments quantify contributions of agent modules:

| Component | mIoU (%) |
|---|---|
| Prompt Agent (template only) | 48.1 |
| + Self-Refine (quality scoring) | 49.9 |
| + CLIP diversity filtering | 52.5 |
| Image Agent (class-label only) | 46.7 |
| + CLIP filter | 50.8 |
| + CLIP + ViT relabel | 52.5 |

This demonstrates that both iterative prompt refinement with semantic diversity filtering and CLIP/ViT-driven relabeling yield substantial performance improvements.

5.4 Qualitative Analysis

Synthetic images for classes such as “dog,” “horse,” and “airplane” display diverse object poses, backgrounds, and multi-object compositions, outcomes directly attributable to prompt diversity and memory-based filtering.

6. Significance and Implications

SynthSeg Agents establishes that high-quality synthetic training datasets, generated entirely by coordinated LLM and VLM agents, can enable WSSS pipelines to achieve competitive segmentation performance in a true zero-shot setting. The modular architecture, comprising separate agents for prompt generation and for image filtering and label refinement, supports semantic diversity and controllable data synthesis. Fine-tuning with real data further narrows the gap with, and in some settings exceeds, traditionally supervised approaches. The framework highlights the potential for scalable WSSS and data-efficient semantic segmentation workflows that are not bounded by real-image availability (Wu et al., 17 Dec 2025).

A plausible implication is that LLM-driven synthetic data engines may become central to future semantic segmentation, particularly in domains or tasks where annotated datasets are scarce or unavailable.
