Language-conditioned Detection Transformer (2311.17902v1)
Abstract: We present a new open-vocabulary detection framework that uses both image-level labels and detailed detection annotations when available. The framework proceeds in three steps. First, we train a language-conditioned object detector on fully supervised detection data. This detector sees the presence or absence of ground-truth classes during training and conditions its predictions on the set of present classes. Second, we use this detector to pseudo-label images with image-level labels; thanks to its conditioning mechanism, it provides much more accurate pseudo-labels than prior approaches. Finally, we train an unconditioned open-vocabulary detector on the pseudo-annotated images. The resulting detector, named DECOLA, shows strong zero-shot performance on the open-vocabulary LVIS benchmark as well as on direct zero-shot transfer benchmarks on LVIS, COCO, Objects365, and OpenImages. DECOLA outperforms prior art by 17.1 AP-rare and 9.4 mAP on the zero-shot LVIS benchmark. DECOLA achieves state-of-the-art results across model sizes, architectures, and datasets while training only on open-source data with academic-scale compute. Code is available at https://github.com/janghyuncho/DECOLA.
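The second step, pseudo-labeling conditioned on the classes known to be present, can be illustrated with a toy sketch. This is not the DECOLA implementation: the `Candidate` type, the `pseudo_label` function, the per-class score dictionaries, and the "keep the highest-scoring box per present class" rule are all assumptions made for illustration. The only idea taken from the abstract is that predictions are restricted to the set of classes given by the image-level labels.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Candidate:
    """A hypothetical detector output: one box with per-class confidences."""
    box: Box
    scores: Dict[str, float]


def pseudo_label(
    candidates: List[Candidate],
    present_classes: Set[str],
    min_score: float = 0.0,
) -> List[Tuple[str, Box, float]]:
    """Assign at most one box to each class named by the image-level labels.

    Classes outside `present_classes` are never considered, which is the
    (simplified) effect of conditioning on the set of present classes.
    The selection rule, highest-scoring box per class, is an assumption.
    """
    labels = []
    for cls in sorted(present_classes):
        # Pick the candidate that scores highest for this class, if any exist.
        best = max(candidates, key=lambda c: c.scores.get(cls, 0.0), default=None)
        if best is not None and best.scores.get(cls, 0.0) > min_score:
            labels.append((cls, best.box, best.scores[cls]))
    return labels


# Toy example: the image-level labels say "cat" and "sofa" are present.
candidates = [
    Candidate((0, 0, 10, 10), {"cat": 0.6, "dog": 0.7}),
    Candidate((5, 5, 20, 20), {"cat": 0.2, "dog": 0.1, "sofa": 0.5}),
]
demo = pseudo_label(candidates, {"cat", "sofa"})
print(demo)
```

In the toy example, the first box scores higher for "dog" than for "cat", so an unconditioned argmax over the full vocabulary would label it "dog". Restricting the choice to the present classes assigns it to "cat" instead, which is the intuition behind the more accurate pseudo-labels the abstract describes.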
Authors: Jang Hyun Cho, Philipp Krähenbühl