Scene-Augmented Pseudo Prototypes (SAPP)

Updated 27 November 2025
  • The paper introduces SAPP to reduce the semantic gap in WS-OVOD by aligning context-enriched region features with scene-aware textual prototypes.
  • SAPP leverages LLM-generated scene phrases and computes cosine similarity with a soft multi-label sigmoid alignment to integrate contextual information.
  • Empirical results demonstrate notable gains in novel class detection, with improvements of +2.1 AP on OV-COCO and +1.9 AP on OV-LVIS benchmarks.

Scene-Augmented Pseudo Prototypes (SAPP) is a prototype enhancement module for Weakly Supervised Open-Vocabulary Object Detection (WS-OVOD). SAPP addresses the semantic gap inherent in weakly supervised paradigms, where region proposals, especially those obtained using maximal box heuristics from image-level labels, capture not only the object but also extensive contextual background. Standard prototypes—whether handcrafted, class-centric, or even augmented by state-aware generative models—fail to account for this co-occurring visual context. SAPP introduces a mechanism for extracting and modeling common scene-object configurations as a pool of class-specific, context-enriched textual prototypes and enforces a soft multi-label alignment between these scene-augmented textual embeddings and context-rich RoI features. This architecture bridges the modality gap between proposal features and textual prototypes, consistently improving detection of novel classes under weak supervision (Zhou et al., 22 Nov 2025).

1. Motivation and Problem Formulation

In WS-OVOD, models are required to detect objects from a large, open vocabulary with only a fraction of class labels annotated at the box level. The pseudo-labeling strategy commonly used—maximally sized region proposals—inevitably produces RoI features that entangle object-centric content with substantial scene context (background, co-occurring objects, geometry). Traditional semantic prototypes, whether static or enriched with state information via LLMs, remain predominantly object-centric and do not adequately represent this contextual information. This mismatch leads to suboptimal alignment between visual features and text-derived prototypes, particularly in context-dependent scenes.

SAPP is designed to reduce this semantic mismatch by directly modeling typical scene-object pairings at the prototype level, thereby ensuring that the semantic content of visual region features aligns more closely with the textual prototypes used in supervision.

2. Mathematical Framework and Algorithm

SAPP augments each class’s prototype representation with a finite, curated set of scene-aware textual phrases. For a class $c$, a prompt—“In which contexts is C most commonly found? Please output phrases in the form ‘C + context’.”—is issued to an LLM (e.g., GPT-4o), generating $L$ distinct scene phrases $\{\mathrm{scene}_c^l\}_{l=1}^{L}$ such as “cat on sofa”, “cat in grass”, etc. Each phrase is encoded by a vision-language model’s text encoder $f_t(\cdot)$ (e.g., CLIP ViT-B/32) to yield the scene-augmented prototype set $W_{\mathrm{scene}} = \{ w_{\mathrm{scene},c}^l \mid c \in C,\ l = 1 \ldots L \}$, where $w_{\mathrm{scene},c}^l = f_t(\mathrm{scene}_c^l)$.
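The following sketch shows how such a prototype pool could be assembled, assuming the Hugging Face CLIP text tower as $f_t(\cdot)$; the phrase lists, names, and helper function are illustrative placeholders, not the authors' released code.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

# Hypothetical scene phrases per class; in SAPP these come from an LLM query.
SCENE_PHRASES = {
    "cat": ["cat on sofa", "cat in grass", "cat under table",
            "cat on windowsill", "cat in cardboard box"],
    # ... one list of L phrases per class in the vocabulary
}

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
text_encoder.eval()

@torch.no_grad()
def build_scene_prototypes(phrases_by_class):
    """Encode each scene phrase with f_t(.) to get W_scene: class -> (L, d) tensor."""
    prototypes = {}
    for cls, phrases in phrases_by_class.items():
        tokens = tokenizer(phrases, padding=True, return_tensors="pt")
        embeds = text_encoder(**tokens).text_embeds          # (L, 512) for ViT-B/32
        prototypes[cls] = torch.nn.functional.normalize(embeds, dim=-1)
    return prototypes

W_scene = build_scene_prototypes(SCENE_PHRASES)
```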

Given a weakly labeled image with true class $c^*$, SAPP proceeds as follows:

  1. The maximal-area proposal $b$ is selected. Its RoI feature $w_{mr} \in \mathbb{R}^d$ is obtained post-projection.
  2. For each scene prototype $w_{\mathrm{scene},c}^l$, compute the cosine similarity:

$$s_{b,c,l} = \frac{\langle w_{mr},\, w_{\mathrm{scene},c}^l \rangle}{\|w_{mr}\| \cdot \|w_{\mathrm{scene},c}^l\|}$$

  3. Assign pseudo multi-labels:

$$y_{b,c,l} = \begin{cases} 1 & \text{if } c = c^* \text{ and } s_{b,c,l} \geq \tau \\ 0 & \text{otherwise} \end{cases}$$

with confidence weights $w_{b,c,l} = \sigma(s_{b,c,l})$, where $\sigma(\cdot)$ is the sigmoid and $\tau$ a threshold.

  4. The loss is a confidence-weighted, multi-label binary cross-entropy averaged over the $B$ proposals in the batch:

$$\mathcal{L}_{\mathrm{scene}} = -\frac{1}{B} \sum_{b=1}^{B} \sum_{c \in C} \sum_{l=1}^{L} w_{b,c,l}\, \ell_{\mathrm{bce}}(y_{b,c,l}, s_{b,c,l})$$

where $\ell_{\mathrm{bce}}(y, s) = y \log \sigma(s) + (1-y) \log(1-\sigma(s))$.
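A minimal PyTorch sketch of steps 1–4 is given below, under assumed tensor shapes; the function name and inputs (roi_feats, scene_protos, gt_class) are illustrative, and detaching the confidence weights is an assumption rather than a detail stated in the paper.

```python
import torch
import torch.nn.functional as F

def scene_alignment_loss(roi_feats, scene_protos, gt_class, tau=0.2):
    """
    roi_feats:    (B, d)     RoI features of the maximal-area proposals
    scene_protos: (C, L, d)  scene-augmented textual prototypes W_scene
    gt_class:     (B,)       image-level ground-truth class index c*
    """
    B = roi_feats.shape[0]
    C, L, d = scene_protos.shape

    # Cosine similarity s_{b,c,l} between each RoI feature and each scene prototype.
    r = F.normalize(roi_feats, dim=-1)                      # (B, d)
    p = F.normalize(scene_protos, dim=-1).view(C * L, d)    # (C*L, d)
    s = (r @ p.t()).view(B, C, L)                           # (B, C, L)

    # Pseudo multi-labels: positive only for the true class and sufficiently similar phrases.
    is_true_class = F.one_hot(gt_class, C).bool().unsqueeze(-1)   # (B, C, 1)
    y = (is_true_class & (s >= tau)).float()                      # (B, C, L)

    # Confidence weights w_{b,c,l} = sigmoid(s_{b,c,l}); detached here (assumption).
    w = torch.sigmoid(s).detach()

    # Confidence-weighted multi-label BCE, averaged over the B proposals.
    return F.binary_cross_entropy_with_logits(s, y, weight=w, reduction="sum") / B
```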

The full training objective jointly optimizes standard detection, weak supervision, and scene alignment:

$$\mathcal{L} = \mathcal{L}_{\mathrm{det}} + \mathcal{L}_{\mathrm{weak}} + \lambda \cdot \mathcal{L}_{\mathrm{scene}}$$

with $\lambda = 0.1$.

3. Scene Context Extraction and Prototype Pooling

Scene contexts are extracted per class by querying an LLM with a dedicated prompt. GPT-4o is employed with zero-shot prompting (top-p $= 0.9$, temperature $= 0.7$), generating $L = 5$ candidate scene phrases per class. Each phrase is encoded via the CLIP text encoder; the resulting $L$ prototypes for each class participate directly in the alignment loss without further aggregation—unlike State-Enhanced Semantic Prototypes (SESP), which average over state and generic prototypes.

Example for class “cat”:

  • “cat on sofa”
  • “cat in grass”
  • “cat under table”

This strategy systematically expands each class’s semantic embedding to include plausible visual contexts, capturing more of the variance present in context-rich RoI features observed during weakly supervised learning.
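As a sketch of this extraction step, the query could be issued with the OpenAI Python client as below; the prompt follows the template quoted in Section 2, while the parsing of the reply into $L = 5$ phrases is an assumption.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_scene_phrases(class_name, num_phrases=5):
    """Query GPT-4o (zero-shot, top-p 0.9, temperature 0.7) for L scene phrases."""
    prompt = (
        f"In which contexts is {class_name} most commonly found? "
        f"Please output phrases in the form '{class_name} + context'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        top_p=0.9,
    )
    # Assumed post-processing: one phrase per line, keep the first L.
    lines = response.choices[0].message.content.splitlines()
    phrases = [ln.strip("-*• ").strip() for ln in lines if ln.strip()]
    return phrases[:num_phrases]

print(generate_scene_phrases("cat"))   # e.g., ["cat on sofa", "cat in grass", ...]
```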

4. Training Protocol and Implementation Specifics

SAPP operates only during training. The following parameters are used:

  • Scene phrase count $L = 5$ (validated; larger $L$ introduces redundancy/noise),
  • Similarity threshold $\tau = 0.2$ (validated to balance recall and false positives),
  • Scene alignment loss weight $\lambda = 0.1$,
  • Text encoder: CLIP ViT-B/32, feature dimension $d = 512$,
  • No temperature scaling in cosine similarities; direct sigmoid is applied,
  • During inference, only state-enhanced prototypes ($p_c$) are used; SAPP modules are not active.

During joint training, the network alternates between supervised updates on detection-labeled data ($D_{\mathrm{det}}$) and weakly supervised updates on classification-only data ($D_{\mathrm{cls}}$), applying $\mathcal{L}_{\mathrm{scene}}$ only in the latter case. The full protocol—including RoI feature extraction, class-prototype similarity computation, assignment of pseudo targets, and confidence weighting—follows the reference implementation.
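A schematic of this alternating protocol is sketched below, reusing the scene_alignment_loss sketch from Section 2; the dataloader names and detector interface (detection_loss, weak_loss, max_area_roi_features) are hypothetical placeholders rather than the reference implementation.

```python
import itertools

LAMBDA_SCENE = 0.1  # scene alignment loss weight lambda from the paper

def train_epoch(detector, det_loader, cls_loader, optimizer, scene_protos, tau=0.2):
    """Alternate box-supervised steps on D_det with weakly supervised steps on D_cls."""
    for det_batch, cls_batch in zip(det_loader, itertools.cycle(cls_loader)):
        # Step A: supervised update on detection-labeled data (L_det only).
        optimizer.zero_grad()
        detector.detection_loss(det_batch).backward()
        optimizer.step()

        # Step B: weakly supervised update on classification-only data
        # (L_weak plus the SAPP scene alignment term).
        roi_feats, gt_class = detector.max_area_roi_features(cls_batch)
        loss = detector.weak_loss(cls_batch) \
             + LAMBDA_SCENE * scene_alignment_loss(roi_feats, scene_protos, gt_class, tau)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```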

5. Empirical Performance and Ablation Studies

SAPP’s performance is validated on major WS-OVOD benchmarks, including OV-COCO (48 base / 17 novel) and OV-LVIS. The following AP (average precision) performance highlights are reported:

Model Variant          OV-COCO $\mathrm{AP}^n_{50}$    OV-LVIS $\mathrm{AP}_r$
Baseline (Detic)       27.8                            24.9
+ SAPP                 29.9                            26.8
+ SESP                 29.6                            26.7
Full (SESP+SAPP)       31.3                            28.0

For SAPP, the standalone gain in AP for novel categories is +2.1 on OV-COCO; combined with SESP, the total gain is +3.5. On OV-LVIS, SAPP yields a +1.9 increase in rare-category precision, and the combined model yields +3.1. Ablation studies support $L = 5$ as optimal; higher $L$ increases redundancy and harms performance. The threshold $\tau = 0.2$ is essential—deviation in either direction degrades calibration. Use of a learnable temperature or scaling factor was tested but found unstable under weak supervision.

6. Limitations and Failure Modes

Several limitations are observed:

  • SAPP’s effectiveness hinges on the quality and coverage of LLM-generated scene phrases; rare or noisy contexts diminish utility.
  • Over-alignment is possible: RoIs lying in plausible contexts without the object can erroneously receive high loss weights.
  • Exclusivity to max-area proposals restricts context modeling granularity; multi-instance alignment remains unexplored.
  • Classes with extreme context diversity (e.g., “giraffe”) are not fully captured by a fixed number $L$ of scene phrases, limiting recall for rare scenes.

Failure modes include imprecise matching when context dominates, and underrepresentation of unusual class-context pairs.

7. Significance and Relationship to Prototype-Based Detection

SAPP constitutes a significant extension to prior open-vocabulary and prototype-based object detection methods by introducing scene context directly into the semantic prototype pool. Unlike approaches that merely enrich class representations through object states or large corpora expansions, SAPP’s explicit modeling of scene-object pairs enables more faithful alignment between visual region proposals and the semantic space used for supervision—even when only weak image-level labels are available. Combined with State-Enhanced Semantic Prototypes, SAPP delivers notable gains in novel-class localization and detection, especially in challenging weakly supervised and open-vocabulary contexts (Zhou et al., 22 Nov 2025).
