
State-Enhanced Semantic Prototypes (SESP)

Updated 27 November 2025
  • SESP is a paradigm that enhances semantic prototypes by explicitly incorporating state information to capture intra-class visual variability.
  • It employs GPT-4 for generating concise state-specific descriptions and CLIP for embedding, enabling effective aggregation of semantic cues.
  • Empirical results show that SESP significantly improves detection performance on novel and rare classes in weakly supervised settings.

State-Enhanced Semantic Prototypes (SESP) constitute a prototype construction paradigm specifically designed to address limitations of static semantic prototypes in weakly supervised open-vocabulary object detection (WS-OVOD). SESP explicitly injects state information into semantic representations, enhancing cross-modal alignment between visual region features and their corresponding textual prototypes. This approach enables detectors to capture intra-class visual variability driven by different object states—such as poses or activities—by aggregating state-aware textual descriptions into rich, discriminative prototypes that improve generalization to novel and weakly labeled object categories (Zhou et al., 22 Nov 2025).

1. Motivation: Limitations of Static Semantic Prototypes

Conventional open-vocabulary object detectors employ text embeddings of category names (e.g., “cat,” “dog”) as semantic prototypes to align with visual region features. Such static prototypes are sensitive to intra-class variation, as they cannot accommodate significant state-driven visual differences (e.g., a cat at rest versus mid-leap). Previous prompt-enrichment methods generate generic, state-agnostic descriptions (“a small furry feline”), failing to capture nuanced pose, activity, or state-induced appearance shifts. The SESP framework advances beyond state-agnostic prototypes by incorporating explicit state information, creating prototypes that are naturally more discriminative and better aligned with the diversity of real-world object appearances (Zhou et al., 22 Nov 2025).

2. Construction of State-Aware Textual Descriptions

SESP leverages an LLM (GPT-4), accessed via API, to generate concise, visually descriptive sentences for each class in its prevalent states. For a given class $C$, a two-part prompting strategy is used:

  • Query for common states or forms: "What are the common states or forms of $C$? For each common state of $C$, provide a one-sentence description of its visual appearance."
  • Query for a generic appearance description: "What does $C$ generally look like? Provide a one-sentence generic description of its appearance."

The model responds with $K$ state phrases (e.g., “a sleeping cat,” “a running dog”) and a broad generic description (e.g., “a small furry animal often found in homes”). These responses provide explicit state information and general category cues as textual inputs to the prototype formulation process (Zhou et al., 22 Nov 2025).
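
As an illustration, the sketch below shows how such descriptions might be gathered offline, assuming the OpenAI Python client. The prompt wording follows the templates above, but the model identifier, the response parsing, and the helper names (`query_llm`, `describe_class`) are hypothetical.

```python
# Hypothetical offline generation of state-aware descriptions per class.
# Assumes the OpenAI Python client; prompts follow the two-part strategy above,
# while parsing the reply into one description per line is a guess.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def query_llm(prompt: str) -> str:
    """Send one prompt to GPT-4 and return the raw text reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def describe_class(name: str) -> tuple[list[str], str]:
    """Return (state descriptions s_1..s_K, generic description des_c) for one class."""
    state_prompt = (
        f"What are the common states or forms of {name}? For each common state "
        f"of {name}, provide a one-sentence description of its visual appearance."
    )
    generic_prompt = (
        f"What does {name} generally look like? "
        f"Provide a one-sentence generic description of its appearance."
    )
    states = [line.strip("-• ").strip()
              for line in query_llm(state_prompt).splitlines()
              if line.strip()]
    generic = query_llm(generic_prompt).strip()
    return states, generic
```

Under these assumptions, `describe_class("cat")` would return phrases such as “a sleeping cat” together with one generic sentence, which are then cached as inputs to prototype construction.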

3. Prototype Formulation and Aggregation

For each class $c$, the $K$ generated state descriptions $S_c = \{s_1, \ldots, s_K\}$ and the generic description $des_c$ are embedded via a frozen pretrained text encoder $f_t(\cdot)$, specifically the CLIP (ViT-B/32) text encoder:

  • $t_{c,s_k} = f_t(s_k)$ for $k = 1, \ldots, K$
  • $t_{c,des} = f_t(des_c)$

Each vector may be $\ell_2$-normalized to unit length: $\hat{t}_{c,s_k} = t_{c,s_k} / \| t_{c,s_k} \|_2$.

The embeddings are aggregated using mean pooling, combining both state-specific and generic information:

$$P_c = \frac{1}{K+1} \left( t_{c,des} + \sum_{k=1}^{K} t_{c,s_k} \right)$$

Here, $P_c$ is a class-specific, state-enhanced prototype in the CLIP latent space, encoding both general and state-conditioned cues for improved discriminability (Zhou et al., 22 Nov 2025).
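
A minimal sketch of this aggregation is given below, assuming the Hugging Face `transformers` CLIP implementation (ViT-B/32) and applying the optional $\ell_2$-normalization before pooling; the function name and the normalization placement are illustrative rather than taken from the paper.

```python
# Prototype-construction sketch: embed K state descriptions plus the generic
# description with a frozen CLIP ViT-B/32 text encoder, then mean-pool.
# The pre-pooling normalization is an assumption; the paper only says it "may" be applied.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

MODEL_NAME = "openai/clip-vit-base-patch32"
clip = CLIPModel.from_pretrained(MODEL_NAME).eval()      # frozen: no fine-tuning
tokenizer = CLIPTokenizer.from_pretrained(MODEL_NAME)


@torch.no_grad()
def build_prototype(state_descriptions: list[str], generic_description: str) -> torch.Tensor:
    """Compute P_c as the mean of the K state embeddings and the generic embedding."""
    texts = state_descriptions + [generic_description]        # K + 1 sentences
    inputs = tokenizer(texts, padding=True, return_tensors="pt")
    t = clip.get_text_features(**inputs)                      # (K + 1, D) text embeddings
    t = F.normalize(t, dim=-1)                                # optional l2-normalization
    return t.mean(dim=0)                                      # class prototype P_c
```

For example, `build_prototype(["a sleeping cat", "a cat mid-leap"], "a small furry feline")` yields a single vector in the CLIP latent space that can be cached as the prototype for “cat.”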

4. Integration into Training and Classification Loss

SESP prototypes are incorporated into the classification objective for open-vocabulary detection models. For region proposal $i$ with visual feature $f_i$ and ground-truth class $y_i$, classification employs a cosine-based head with temperature $\tau$:

$$\mathcal{L}_{\mathrm{SESP}} = -\sum_i \log \frac{\exp(f_i^\top P_{y_i} / \tau)}{\sum_{c \in \mathcal{C}} \exp(f_i^\top P_c / \tau)}$$

  • $f_i \in \mathbb{R}^D$: vision feature for proposal $i$
  • $P_c \in \mathbb{R}^D$: SESP of class $c$
  • $y_i$: ground-truth class label for proposal $i$
  • $\mathcal{C}$: category set
  • $\tau$: softmax temperature (default $\sim 0.01$, following CLIP)

This loss function simultaneously governs fully-supervised and weakly supervised branches, enforcing alignment between region features and state-aware prototypes throughout the training regime (Zhou et al., 22 Nov 2025).
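
The loss can be expressed compactly in PyTorch. The sketch below assumes unit-normalized features and prototypes so that the dot product equals cosine similarity, and uses a summed cross-entropy to match the equation above; the function name and signature are illustrative.

```python
# Sketch of the SESP classification loss over cached prototypes.
# Assumes region features and prototypes are l2-normalized so f_i . P_c is a cosine score.
import torch
import torch.nn.functional as F


def sesp_loss(region_feats: torch.Tensor,   # (N, D) proposal features f_i
              prototypes: torch.Tensor,     # (C, D) state-enhanced prototypes P_c
              labels: torch.Tensor,         # (N,)   ground-truth classes y_i
              tau: float = 0.01) -> torch.Tensor:
    """Cross-entropy over temperature-scaled cosine similarities."""
    f = F.normalize(region_feats, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    logits = f @ p.t() / tau                          # (N, C) scaled similarities
    # reduction="sum" matches the summed form of L_SESP; use "mean" to average per batch
    return F.cross_entropy(logits, labels, reduction="sum")
```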

5. Implementation Details

Key components of the SESP pipeline include:

  • Text encoder: CLIP (ViT-B/32), frozen; keeps prototypes in the same CLIP latent space as the visual features
  • LLM for descriptions: GPT-4 via API; offline generation, prompted per class
  • Number of states $K$: 5 or 7; ablation finds the optimal trade-off at $K = 7$
  • Aggregation: mean pooling; simple, stable, and effective
  • Temperature $\tau$: $\sim 0.01$; adopted from CLIP
  • Storage: prototypes cached offline; one $P_c$ per class in the vocabulary

Prompt responses for all classes are generated and cached once per training session, ensuring efficient prototype lookup and stability across runs. No fine-tuning of the text encoder is performed during model training (Zhou et al., 22 Nov 2025).
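
A one-time caching step of this kind could look like the following sketch, which reuses the hypothetical `describe_class` and `build_prototype` helpers from the earlier snippets; the file name and dictionary layout are assumptions.

```python
# Hypothetical offline caching: one prototype per vocabulary class, saved once
# and loaded by the detector at training time (text encoder stays frozen).
import torch


def cache_prototypes(class_names: list[str], path: str = "sesp_prototypes.pt") -> None:
    prototypes = {}
    for name in class_names:
        states, generic = describe_class(name)       # GPT-4 state + generic descriptions
        prototypes[name] = build_prototype(states, generic)
    torch.save(prototypes, path)


# During training, the cached prototypes are simply loaded:
# prototypes = torch.load("sesp_prototypes.pt")
```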

6. Empirical Evaluation and Performance Impact

The inclusion of SESP in the Detic baseline on the OV-COCO WS-OVOD benchmark results in an increase of +1.8 APⁿ₅₀ on novel classes, demonstrating the practical benefit of modeling intra-class state variation. When $K = 7$, the performance peaks; values of $K > 7$ yield marginal decreases attributable to noisy or redundant state descriptions. Ablation studies show that alternative aggregation methods (median, two-stage mean, similarity-weighted mean) lead to only minor variation (±0.2 APⁿ₅₀), indicating the robustness of mean pooling.

In zero-shot evaluation on Objects365, SESP improves AP_r (rare classes) from 12.4 to 13.7 over the Detic baseline. Combined with Scene-Augmented Pseudo Prototypes (SAPP), the approach achieves a cumulative +3.5 APⁿ₅₀ improvement, establishing the importance of integrating both state-aware and context-aware semantics for robust weakly supervised open-vocabulary detection (Zhou et al., 22 Nov 2025).
