VisTex-OVLM: Visual Textualization for Object Detection
- VisTex-OVLM is a methodology that transforms visual exemplars into textualized tokens, enabling object-level vision-language models (OVLMs) to detect rare or novel categories.
- It employs Multi-Scale Textualizing Blocks and Multi-Stage Fusion to project multi-scale image features into the text space without altering the base model.
- The approach achieves superior performance on open-set and few-shot detection benchmarks while preserving original object-text alignment.
VisTex-OVLM is a methodology for image-prompted object detection that introduces a visual textualization mechanism, enabling object-level vision-language models (OVLMs) to better detect rare or novel categories that are underrepresented or absent in pre-training data and that lack informative textual descriptions. The method achieves this by projecting visual exemplars (support images) into the text feature space and fusing them with text prompts, all while keeping the base OVLM architecture and its original object-text alignment intact.
1. Visual Textualization: Bridging Vision and Text Feature Spaces
VisTex-OVLM’s core contribution is a visual textualization process that transforms a set of visual support exemplars into textualized tokens in the same space as OVLM text prompts. Visual exemplars, prepared via image prompt engineering (blurring the background, focusing on the target object), are encoded using the OVLM’s own visual encoder. The resulting multi-scale visual features are then projected through Multi-Scale Textualizing Blocks (MSTBs), which map the image-derived features into the text feature space—the same vector space as that of the OVLM’s text encoder.
This strategy enables the generation of tokens that encapsulate visual semantics and can be directly concatenated with text tokens for joint processing in the OVLM. Importantly, the OVLM backbone is never modified or fine-tuned, which preserves its generalization capability and pre-trained object-text alignment. Thus, VisTex-OVLM allows for the direct integration of one or more image prompts as guidance for object detection tasks, leveraging both textual and visual context.
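The image-prompt engineering step can be pictured with a short sketch. The following is a minimal illustration, assuming a Gaussian blur applied outside the exemplar's bounding box; the function name, blur radius, and exact recipe are assumptions for illustration rather than the authors' released implementation.

```python
from PIL import Image, ImageFilter


def make_image_prompt(image_path, box, blur_radius=10):
    """Blur everything outside the target box so the support image focuses
    on the exemplar object (illustrative; the exact recipe may differ)."""
    img = Image.open(image_path).convert("RGB")
    blurred = img.filter(ImageFilter.GaussianBlur(radius=blur_radius))
    # Paste the sharp object crop back onto the blurred background.
    x1, y1, x2, y2 = box
    blurred.paste(img.crop((x1, y1, x2, y2)), (x1, y1))
    return blurred


# Example: one support exemplar with a bounding box in pixel coordinates.
# prompt_img = make_image_prompt("support.jpg", box=(48, 32, 212, 190))
```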
2. Multi-Scale Textualizing Blocks (MSTBs): Architecture and Function
MSTBs are designed to project features from multiple receptive field sizes into a uniform text feature space, yielding a single visual textualized token for each scale at each network stage. Given input features $F_i^{\,s}$ from stage $i$ at spatial scale $s$, the MSTB applies parameter-shared $3 \times 3$ convolutions, potentially with downsampling for high-resolution scales, followed by an MLP that maps to the text embedding dimension $d_t$. The output over all scales and stages is a sequence of tokens that represents the visual support image in text space.
Mathematically, for each stage $i$ and scale $s$,
$$\hat{t}_i^{\,s} = \mathrm{MLP}\!\left(\mathrm{Conv}_{3\times 3}\!\left(F_i^{\,s}\right)\right) \in \mathbb{R}^{d_t},$$
the full projected set for stage $i$ is
$$\hat{T}_i = \left\{\hat{t}_i^{\,1}, \hat{t}_i^{\,2}, \dots, \hat{t}_i^{\,S}\right\},$$
while the token for that stage is obtained by pooling this set over scales,
$$t_i = \mathrm{Pool}\!\left(\hat{T}_i\right).$$
Parameter sharing among scales enhances efficiency and consistency, and only the MSTB itself is trained, with the remainder of the OVLM kept frozen.
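The following PyTorch sketch illustrates one plausible MSTB under the description above: a parameter-shared 3x3 convolution applied to each scale's feature map, followed by an MLP projecting to the text embedding dimension. The class name, layer widths, the global average pooling used to collapse each scale to one vector, and the assumption of uniform channel counts across scales are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn


class MSTB(nn.Module):
    """Multi-Scale Textualizing Block (illustrative sketch).

    Projects multi-scale visual features from one encoder stage into the
    OVLM text embedding space, yielding one textualized token per scale.
    """

    def __init__(self, in_channels: int, text_dim: int, hidden_dim: int = 1024):
        super().__init__()
        # Parameter-shared 3x3 convolution applied to every scale
        # (high-resolution scales could additionally be downsampled here).
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        # MLP mapping the pooled visual vector to the text embedding dimension d_t.
        self.mlp = nn.Sequential(
            nn.Linear(in_channels, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, text_dim),
        )

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        """feats: one (B, C, H_s, W_s) map per scale from a single encoder stage.
        Returns (B, S, text_dim): one textualized token per scale."""
        tokens = []
        for f in feats:
            f = self.conv(f)
            v = f.mean(dim=(2, 3))  # collapse the spatial grid (assumed pooling)
            tokens.append(self.mlp(v))
        return torch.stack(tokens, dim=1)


# Per-stage token as in the formula above: pool the scale tokens, e.g. by max.
# stage_token = mstb(stage_feats).max(dim=1).values   # (B, text_dim)
```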
3. Multi-Stage Fusion Strategy (MSF): Integrating Hierarchical Features
To fully harness hierarchical representations, MSTB outputs from all selected stages are combined via a Multi-Stage Fusion (MSF) mechanism. This process uses a non-parametric approach, with max pooling empirically found to be most effective, producing a single robust token per support image that aggregates the most salient features across network depths.
Formally,
$$t = \mathrm{MaxPool}\!\left(\{t_1, t_2, \dots, t_N\}\right),$$
where the maximum is taken along each feature dimension across all selected stages.
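Under this formulation, MSF reduces to an element-wise maximum over the per-stage tokens; a minimal sketch (names assumed):

```python
import torch


def multi_stage_fusion(stage_tokens: list[torch.Tensor]) -> torch.Tensor:
    """stage_tokens: one (B, text_dim) token per selected encoder stage.
    Returns a single (B, text_dim) token per support image via element-wise max."""
    stacked = torch.stack(stage_tokens, dim=1)  # (B, num_stages, text_dim)
    return stacked.max(dim=1).values
```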
At inference, for $K$ support shots, the prompt
$$P = \left[\,w_1, \dots, w_M,\; t^{(1)}, \dots, t^{(K)}\,\right]$$
is composed, where $w_1, \dots, w_M$ are the text prompt tokens and each $t^{(k)}$ is the fused visual textualized token of the $k$-th support image.
This results in seamless integration of support exemplars by including their textualized tokens in the sequence processed by the OVLM’s text encoder.
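A sketch of this prompt composition, assuming the text prompt has already been embedded at the token level before the frozen text encoder; tensor shapes and names are illustrative:

```python
import torch


def compose_prompt(text_token_embs: torch.Tensor,
                   visual_tokens: list[torch.Tensor]) -> torch.Tensor:
    """Append one textualized token per support shot to the text prompt.

    text_token_embs: (B, M, d_t) embeddings of the text prompt tokens.
    visual_tokens:   list of K fused tokens, each (B, d_t), one per support shot.
    Returns (B, M + K, d_t), the sequence fed to the frozen OVLM text encoder.
    """
    vis = torch.stack(visual_tokens, dim=1)          # (B, K, d_t)
    return torch.cat([text_token_embs, vis], dim=1)  # (B, M + K, d_t)
```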
4. Empirical Performance and Benchmarks
VisTex-OVLM demonstrates consistently superior performance across a diverse range of open-set and few-shot object detection benchmarks:
- Open-Set and Domain-Transfer Scenarios: On datasets such as LVIS MiniVal, ODinW (object detection in the wild), and medical datasets (e.g., MoNu, LIDC), VisTex-GLIP (an instantiation of VisTex-OVLM with the GLIP backbone) achieves higher mAP scores than previous image-prompting and fine-tuning baselines. For example, mean mAP on ODinW A is 77.5% for VisTex-GLIP, compared to 75.8% for GLIP-FF and 44.8% for MQ-Det.
- Few-Shot Learning: On standard few-shot detection tasks, including PASCAL VOC (1, 2, 3, 5, 10 shots) and MSCOCO (1, 5, 10, 30 shots), VisTex-GLIP outperforms SOTA approaches, with mean AP50 of 71.8 (10-shot) for VOC and AP of 47.9 (1-shot), 53.6 (30-shot) for COCO.
- Generalized FSOD: The method preserves base class performance while significantly improving novel class AP, indicating balanced transfer to both old and new classes.
- Ablation and Analysis: The inclusion of both MSTB and MSF is critical: omitting either reduces performance. The parameter-shared MSTB configuration and max pooling fusion are empirically best.
5. Generalization and Preservation of OVLM Alignment
A defining feature of VisTex-OVLM is its capacity to preserve the originally trained object-text alignment of the underlying OVLM. Because no fine-tuning or architectural modification is required, the alignment between features for text and visual object regions remains undistorted, supporting robust generalization to out-of-distribution classes and domains. Cosine similarity analyses in the paper indicate that the alignment distribution of VisTex-OVLM closely matches that of the original OVLM, in contrast to alternative methods (e.g., full fine-tuning, prompt tuning) that disturb this distribution, potentially harming generalization.
This property translates into robust zero-shot and open-vocabulary detection, even in domains that deviate significantly from the pre-training data, such as medical images.
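As a rough illustration of the kind of alignment analysis reported, cosine similarities between object region features and category text embeddings can be compared before and after adding the textualized tokens. The sketch below assumes pre-extracted feature matrices and is not the paper's exact protocol.

```python
import torch
import torch.nn.functional as F


def alignment_scores(region_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between object region features and category text features.

    region_feats: (N, d) features of detected or ground-truth object regions.
    text_feats:   (C, d) text embeddings of the corresponding category prompts.
    Returns an (N, C) similarity matrix whose distribution can be compared
    with and without VisTex tokens (illustrative analysis only).
    """
    r = F.normalize(region_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    return r @ t.T
```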
6. Code Availability and Implementation
The authors have committed to releasing code, pretrained models, and instructions for VisTex-OVLM at https://github.com/WitGotFlg/VisTex-OVLM. The software package is expected to include pipelines for both VisTex-GLIP and VisTex-DINO backbones, enabling reproducibility and practical application to a broad range of object detection challenges.
VisTex-OVLM empowers object-level vision-language models to leverage a small set of visual support exemplars for few-shot and open-set detection, integrating them as textualized tokens in the prompt space without sacrificing generalization. By combining multi-scale textualizing blocks with non-parametric multi-stage fusion, it achieves state-of-the-art results across multiple detection tasks and domains while preserving the alignment and scalability advantages of its base OVLMs.