Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding (2509.06291v1)

Published 8 Sep 2025 in cs.CV

Abstract: Visual Grounding (VG) aims to locate specific target objects within images from given natural language queries. While current transformer-based approaches demonstrate strong localization performance in standard scenes (i.e., scenarios without any novel objects), they exhibit notable limitations in open-vocabulary scenes (i.e., scenarios in which both familiar and novel object categories appear during testing). These limitations primarily stem from three key factors: (1) imperfect alignment between visual and linguistic modalities, (2) insufficient cross-modal feature fusion, and (3) ineffective utilization of semantic prototype information. To overcome these challenges, we present Prototype-Aware Multimodal Learning (PAML), a framework that addresses these issues through several key components. First, we leverage ALBEF to establish robust cross-modal alignment during initial feature encoding. Subsequently, our Visual Discriminative Feature Encoder selectively enhances salient object representations while suppressing irrelevant visual context. The framework then incorporates a novel prototype discovering and inheriting mechanism that extracts and aggregates multi-neighbor semantic prototypes to facilitate open-vocabulary recognition. These enriched features undergo comprehensive multimodal integration through our Multi-stage Decoder before final bounding box regression. Extensive experiments across five benchmark datasets validate our approach, showing competitive performance in standard scenes while achieving state-of-the-art results in open-vocabulary scenes. Our code is available at https://github.com/plankXie/PAML.
