
Multi-modal Queried Object Detection in the Wild (2305.18980v2)

Published 30 May 2023 in cs.CV

Abstract: We introduce MQ-Det, an efficient architecture and pre-training strategy designed to utilize both textual descriptions with open-set generalization and visual exemplars with rich description granularity as category queries, namely, Multi-modal Queried object Detection, for real-world detection with both open-vocabulary categories and varying granularity. MQ-Det incorporates vision queries into existing well-established language-queried-only detectors. A plug-and-play gated class-scalable perceiver module on top of the frozen detector is proposed to augment category text with class-wise visual information. To address the learning inertia problem brought by the frozen detector, a vision-conditioned masked language prediction strategy is proposed. MQ-Det's simple yet effective architecture and training strategy are compatible with most language-queried object detectors, yielding versatile applications. Experimental results demonstrate that multi-modal queries largely boost open-world detection. For instance, MQ-Det significantly improves the state-of-the-art open-set detector GLIP by +7.8% AP on the LVIS benchmark via multi-modal queries without any downstream finetuning, and by an average of +6.3% AP on 13 few-shot downstream tasks, requiring only 3% additional modulating time on top of GLIP. Code is available at https://github.com/YifanXu74/MQ-Det.

Analyzing Multi-Modal Queried Object Detection in the Wild

The paper "Multi-modal Queried Object Detection in the Wild" introduces MQ-Det, a novel approach for multi-modal object detection aimed at effectively utilizing both textual and visual cues to enhance open-world detection performance. The authors propose this architecture to address the inherent limitations of text-only object detection frameworks, which struggle with the generalization of fine-grained categories due to insufficient descriptive granularity.

Key Contributions

  1. Integration of Multi-modal Queries:
    • MQ-Det integrates visual exemplar queries alongside textual descriptions. This dual-query design combines the open-set generalization of language queries with the fine-grained information carried by visual exemplars, mitigating the ambiguity inherent in purely text-based queries and improving detection accuracy in complex scenarios.
  2. Gated Class-scalable Perceiver (GCP) Module:
    • A central component of MQ-Det is the GCP module. This plug-and-play component fuses visual and textual cues by dynamically injecting class-specific visual information into the textual semantic embeddings of the frozen detector. It does so with minimal computational overhead, making MQ-Det scalable across object categories of varying granularity (see the first sketch after this list).
  3. Vision Conditioned Masked Language Prediction:
    • To counteract the learning inertia caused by the frozen base detector, the authors introduce a vision-conditioned masking strategy: parts of the textual queries are masked, forcing the model to rely on visual cues during prediction. This ensures that visual queries genuinely influence the detection outcome rather than being ignored (see the second sketch after this list).
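To make the gated fusion concrete, below is a minimal PyTorch sketch of a gated cross-attention layer in the spirit of the GCP module. The class name, tensor shapes, and the tanh-gated residual are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GatedClassScalablePerceiver(nn.Module):
    """Minimal sketch of a GCP-style gated fusion layer (assumed design).

    Each category's text embedding attends to that category's visual
    exemplar features; a learnable gate initialized near zero controls how
    much visual information is added, so the frozen detector's original
    text features are preserved at the start of training.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate starts at ~0 so the module initially acts as an identity mapping.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_emb: torch.Tensor, vision_exemplars: torch.Tensor) -> torch.Tensor:
        # text_emb:         (num_classes, 1, dim)  one query token per class
        # vision_exemplars: (num_classes, k, dim)  k exemplar features per class
        attended, _ = self.cross_attn(text_emb, vision_exemplars, vision_exemplars)
        return text_emb + torch.tanh(self.gate) * attended
```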
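Likewise, a rough sketch of the vision-conditioned masked language prediction idea: category text tokens are randomly replaced with a mask token so that classification must fall back on the visual exemplars injected through the GCP layers. The masking probability and token-level granularity here are assumptions.

```python
import torch

def vision_conditioned_text_mask(text_token_ids: torch.Tensor,
                                 mask_token_id: int,
                                 mask_prob: float = 0.4) -> torch.Tensor:
    """Illustrative sketch (not the paper's exact recipe): randomly replace
    category text tokens with [MASK] so the detector must rely on the
    visual exemplars to classify the masked categories."""
    mask = torch.rand_like(text_token_ids, dtype=torch.float) < mask_prob
    masked_ids = text_token_ids.clone()
    masked_ids[mask] = mask_token_id
    return masked_ids
```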

Experimental Findings

The paper reports strong numerical improvements in detection accuracy. Notably, MQ-Det outperforms existing state-of-the-art open-set detectors such as GLIP on challenging benchmarks, improving GLIP by +7.8% AP on the LVIS benchmark using multi-modal queries without any downstream fine-tuning. It also demonstrates consistent gains, averaging +6.3% AP, across 13 few-shot downstream tasks, highlighting its robustness and transferability.

Implications and Future Directions

The MQ-Det framework offers substantial advances in aligning visual and language modalities for object detection, with promising implications for applications where text queries alone may not suffice due to nuanced subclass variations. Furthermore, its modest additional cost, roughly 3% extra modulating time on top of GLIP, suggests practicality in large-scale deployments.

Theoretically, this work underscores the potential of utilizing comprehensive multi-modal data to expand the boundaries of open-world detection systems. Future research may explore extending MQ-Det's capabilities to other dense prediction tasks like segmentation or implementing its architecture on emerging foundational models with enhanced multimodal understanding capabilities.

In summary, the paper presents a sophisticated approach that significantly augments the state-of-the-art in open-world object detection through an innovative fusion of multimodal data streams, demonstrating notable improvements in both accuracy and computational efficiency.

Authors (7)
  1. Yifan Xu (92 papers)
  2. Mengdan Zhang (18 papers)
  3. Chaoyou Fu (46 papers)
  4. Peixian Chen (21 papers)
  5. Xiaoshan Yang (19 papers)
  6. Ke Li (722 papers)
  7. Changsheng Xu (100 papers)
Citations (23)