State-Enhanced Semantic Prototypes (SESP)
- SESP is a paradigm that enhances semantic prototypes by explicitly incorporating state information to capture intra-class visual variability.
- It uses GPT-4 to generate concise state-specific descriptions and the CLIP text encoder to embed them, enabling effective aggregation of semantic cues.
- Empirical results show that SESP significantly improves detection performance on novel and rare classes in weakly supervised settings.
State-Enhanced Semantic Prototypes (SESP) constitute a prototype construction paradigm specifically designed to address limitations of static semantic prototypes in weakly supervised open-vocabulary object detection (WS-OVOD). SESP explicitly injects state information into semantic representations, enhancing cross-modal alignment between visual region features and their corresponding textual prototypes. This approach enables detectors to capture intra-class visual variability driven by different object states—such as poses or activities—by aggregating state-aware textual descriptions into rich, discriminative prototypes that improve generalization to novel and weakly labeled object categories (Zhou et al., 22 Nov 2025).
1. Motivation: Limitations of Static Semantic Prototypes
Conventional open-vocabulary object detectors employ text embeddings of category names (e.g., “cat,” “dog”) as semantic prototypes to align with visual region features. Such static prototypes are sensitive to intra-class variation, as they cannot accommodate significant state-driven visual differences (e.g., a cat at rest versus mid-leap). Previous prompt-enrichment methods generate generic, state-agnostic descriptions (“a small furry feline”), failing to capture nuanced pose, activity, or state-induced appearance shifts. The SESP framework advances beyond state-agnostic prototypes by incorporating explicit state information, creating prototypes that are naturally more discriminative and better aligned with the diversity of real-world object appearances (Zhou et al., 22 Nov 2025).
2. Construction of State-Aware Textual Descriptions
SESP leverages an LLM (GPT-4), accessed via API, to generate concise, visually descriptive sentences for each class in its prevalent states. For a given class $c$, a two-part prompting strategy is used:
- Query for common states or forms: "What are the common states or forms of {class}? For each common state of {class}, provide a one-sentence description of its visual appearance."
- Query for a generic appearance description: "What does {class} generally look like? Provide a one-sentence generic description of its appearance."
The model responds with state phrases (e.g., “a sleeping cat,” “a running dog”) and a broad generic description (e.g., “a small furry animal often found in homes”). These responses provide explicit state information and general category cues as textual inputs to the prototype formulation process (Zhou et al., 22 Nov 2025).
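The following is a minimal sketch of this offline description-generation step, assuming the `openai` Python client (v1.x) and GPT-4 accessed through the chat-completions API. The prompt wording follows the two queries above, while the helper names and the line-by-line parsing of the response are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch of offline state-description generation (assumed helpers).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

STATE_PROMPT = (
    "What are the common states or forms of {cls}? For each common state "
    "of {cls}, provide a one-sentence description of its visual appearance."
)
GENERIC_PROMPT = (
    "What does {cls} generally look like? "
    "Provide a one-sentence generic description of its appearance."
)

def query(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def descriptions_for_class(cls: str) -> list[str]:
    """Return state-specific descriptions plus one generic description."""
    states = query(STATE_PROMPT.format(cls=cls))
    generic = query(GENERIC_PROMPT.format(cls=cls))
    # Assumption: the LLM lists one state description per line.
    state_descs = [line.strip("-• ").strip() for line in states.splitlines() if line.strip()]
    return state_descs + [generic.strip()]
```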
3. Prototype Formulation and Aggregation
For each class $c$, the generated state descriptions $\{d_c^k\}_{k=1}^{K}$ and the generic description $d_c^0$ are embedded via a frozen pretrained text encoder $\mathcal{T}(\cdot)$, specifically the CLIP (ViT-B/32) text encoder:
- $e_c^k = \mathcal{T}(d_c^k)$ for $k = 0, 1, \dots, K$
Each vector may be $\ell_2$-normalized to unit length: $\hat{e}_c^k = e_c^k / \lVert e_c^k \rVert_2$.
The embeddings are aggregated using mean pooling, combining both state-specific and generic information:

$$p_c = \frac{1}{K+1} \sum_{k=0}^{K} \hat{e}_c^k$$

Here, $p_c$ is a class-specific, state-enhanced prototype in the CLIP latent space, encoding both general and state-conditioned cues for improved discriminability (Zhou et al., 22 Nov 2025).
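A minimal sketch of this prototype construction, assuming the official `clip` package with the ViT-B/32 checkpoint; the helper name is hypothetical, and re-normalizing the pooled prototype to unit length is an assumption rather than a detail stated in the text.

```python
# Minimal sketch: embed K state descriptions + 1 generic description with a
# frozen CLIP text encoder and mean-pool them into a single prototype.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model.eval()  # text encoder stays frozen throughout

@torch.no_grad()
def build_prototype(descriptions: list[str]) -> torch.Tensor:
    tokens = clip.tokenize(descriptions).to(device)     # (K+1, 77)
    emb = model.encode_text(tokens).float()             # (K+1, 512)
    emb = emb / emb.norm(dim=-1, keepdim=True)          # l2-normalize each description embedding
    proto = emb.mean(dim=0)                             # mean pooling over descriptions
    return proto / proto.norm()                         # unit-length prototype (assumed)
```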
4. Integration into Training and Classification Loss
SESP prototypes are incorporated into the classification objective for open-vocabulary detection models. For region proposal $i$ with visual feature $v_i$ and ground-truth class $y_i$, classification employs a cosine-based head with temperature $\tau$:

$$\mathcal{L}_{\text{cls}} = -\log \frac{\exp\big(\cos(v_i, p_{y_i})/\tau\big)}{\sum_{c \in \mathcal{C}} \exp\big(\cos(v_i, p_c)/\tau\big)}$$

- $v_i$: vision feature for proposal $i$
- $p_c$: SESP of class $c$
- $y_i$: ground-truth class label for proposal $i$
- $\mathcal{C}$: category set
- $\tau$: softmax temperature (default 0.01 from CLIP)
This loss function simultaneously governs fully-supervised and weakly supervised branches, enforcing alignment between region features and state-aware prototypes throughout the training regime (Zhou et al., 22 Nov 2025).
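A minimal PyTorch sketch of this cosine-based classification head follows; the function and variable names are illustrative, not taken from the paper's released code.

```python
# Minimal sketch of the cosine-similarity classification loss with temperature.
import torch
import torch.nn.functional as F

def sesp_classification_loss(region_feats: torch.Tensor,   # (N, D) proposal visual features
                             prototypes: torch.Tensor,      # (C, D) cached SESP prototypes
                             labels: torch.Tensor,          # (N,) ground-truth class indices
                             tau: float = 0.01) -> torch.Tensor:
    v = F.normalize(region_feats, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    logits = v @ p.t() / tau            # cosine similarities scaled by temperature
    return F.cross_entropy(logits, labels)
```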
5. Implementation Details
Key components of the SESP pipeline include:
| Component | Implementation | Note |
|---|---|---|
| Text Encoder | CLIP (ViT-B/32), frozen | Ensures consistency and compatibility |
| LLM for Description | GPT-4 via API | Offline generation, prompted per class |
| Number of states $K$ | 5 or 7 | Ablation finds the best trade-off in this range |
| Aggregation | Mean pooling | Simple, stable, and effective |
| Temperature $\tau$ | 0.01 | Adopted from CLIP |
| Storage | Prototypes cached offline | One per class in vocabulary |
Prompt responses for all classes are generated and cached once per training session, ensuring efficient prototype lookup and stability across runs. No fine-tuning of the text encoder is performed during model training (Zhou et al., 22 Nov 2025).
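As a usage note, the offline caching step could look like the following minimal sketch, which reuses the hypothetical `descriptions_for_class` and `build_prototype` helpers from the earlier sketches; the file name and the (C, D) tensor layout are assumptions for illustration.

```python
# Minimal sketch of offline prototype caching, one prototype per vocabulary class.
import torch

def cache_prototypes(class_names: list[str], path: str = "sesp_prototypes.pt") -> None:
    protos = [build_prototype(descriptions_for_class(c)) for c in class_names]
    torch.save(torch.stack(protos), path)  # (C, D) tensor; row order follows class_names
```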
6. Empirical Evaluation and Performance Impact
The inclusion of SESP in the Detic baseline on the OV-COCO WS-OVOD benchmark yields an increase of +1.8 APⁿ₅₀ on novel classes, demonstrating the practical benefit of modeling intra-class state variation. Performance peaks at a moderate number of state descriptions $K$; larger values yield marginal decreases attributable to noisy or redundant state descriptions. Ablation studies show that alternative aggregation methods (median, two-stage mean, similarity-weighted mean) lead to only minor variation (0.2 APⁿ₅₀), indicating the robustness of mean pooling.
In zero-shot evaluation on Objects365, SESP improves AP_r (rare classes) from 12.4 to 13.7 over the Detic baseline. Combined with Scene-Augmented Pseudo Prototypes (SAPP), the approach achieves a cumulative +3.5 APⁿ₅₀ improvement, establishing the importance of integrating both state-aware and context-aware semantics for robust weakly supervised open-vocabulary detection (Zhou et al., 22 Nov 2025).