OpenSD: Unified Open-Vocabulary Segmentation and Detection (2312.06703v1)
Abstract: Recently, a few open-vocabulary methods have been proposed that employ a unified architecture to tackle generic segmentation and detection tasks. However, their performance still lags behind task-specific models due to the conflict between different tasks, and their open-vocabulary capability is limited due to the inadequate use of CLIP. To address these challenges, we present a universal transformer-based framework, abbreviated as OpenSD, which uses the same architecture and network parameters to handle open-vocabulary segmentation and detection tasks. First, we introduce a decoder decoupled learning strategy to alleviate the semantic conflict between thing and stuff categories so that each individual task can be learned more effectively under the same framework. Second, to better leverage CLIP for end-to-end segmentation and detection, we propose dual classifiers to handle the in-vocabulary domain and out-of-vocabulary domain, respectively. The text encoder is further trained to be region-aware for both thing and stuff categories through decoupled prompt learning, enabling it to filter out duplicated and low-quality predictions, which is important for end-to-end segmentation and detection. Extensive experiments are conducted on multiple datasets under various circumstances. The results demonstrate that OpenSD outperforms state-of-the-art open-vocabulary segmentation and detection methods in both closed- and open-vocabulary settings. Code is available at https://github.com/strongwolf/OpenSD
Authors: Shuai Li, Minghan Li, Pengfei Wang, Lei Zhang