Open-Vocabulary Object Detection via Scene Graph Discovery (2307.03339v1)
Abstract: In recent years, open-vocabulary (OV) object detection has attracted increasing research attention. Unlike traditional detection, which recognizes only objects from a fixed set of categories, OV detection aims to detect objects from an open category set. Previous works often leverage vision-language (VL) training data (e.g., referring grounding data) to recognize OV objects. However, they use only pairs of nouns and individual objects in VL data, although these data usually contain much more information, such as scene graphs, which is also crucial for OV detection. In this paper, we propose a novel Scene-Graph-Based Discovery Network (SGDN) that exploits scene graph cues for OV detection. First, we present a scene-graph-based decoder (SGDecoder) that includes sparse scene-graph-guided attention (SSGA). It captures scene graphs and leverages them to discover OV objects. Second, we propose scene-graph-based prediction (SGPred), in which a scene-graph-based offset regression (SGOR) mechanism enables mutual enhancement between scene graph extraction and object localization. Third, we design a cross-modal learning mechanism in SGPred. It uses scene graphs as bridges to improve the consistency between cross-modal embeddings for OV object classification. Experiments on COCO and LVIS demonstrate the effectiveness of our approach. Moreover, we show that our model can perform OV scene graph detection, a task that previous OV scene graph generation methods cannot tackle.
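The abstract refers to OV object classification through cross-modal embeddings but gives no implementation details. As a rough illustration of the general idea only (not SGDN's scene-graph-based mechanism), the sketch below assigns a region to whichever category name has the most similar text embedding under cosine similarity. The function names, the three-dimensional toy embeddings, and the category names are all hypothetical; a real system would obtain embeddings from trained visual and language encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_scores(region_embedding, text_embeddings):
    """Cosine similarity between one region embedding and each category text embedding."""
    region = l2_normalize(region_embedding)
    return {
        name: sum(r * t for r, t in zip(region, l2_normalize(emb)))
        for name, emb in text_embeddings.items()
    }


def classify_region(region_embedding, text_embeddings):
    """Return the best-matching category name and the full score table."""
    scores = cosine_scores(region_embedding, text_embeddings)
    best = max(scores, key=scores.get)
    return best, scores


# Toy, hand-written embeddings (assumption: purely illustrative values).
# The category set is just a dict, so new names can be added at inference time
# without retraining, which is what makes the vocabulary "open".
text_embeddings = {
    "dog": [1.0, 0.0, 0.0],
    "frisbee": [0.0, 1.0, 0.0],
    "zebra": [0.0, 0.0, 1.0],
}
region = [0.1, 0.9, 0.2]

label, scores = classify_region(region, text_embeddings)
print(label)
```

Because classification reduces to a nearest-neighbor lookup in a shared embedding space, its quality depends on how consistent the visual and text embeddings are, which is the property the abstract says the cross-modal learning mechanism aims to improve.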
Authors: Hengcan Shi, Munawar Hayat, Jianfei Cai