Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation (2401.09732v1)
Abstract: Temporally locating objects with arbitrary class texts is the primary pursuit of open-vocabulary Video Instance Segmentation (VIS). Because the vocabulary of video data is insufficient, previous methods leverage image-text pretraining models to recognize object instances by aligning each frame with class texts separately, ignoring the correlation between frames. This separation breaks the instance movement context of videos and causes inferior alignment between video and text. To tackle this issue, we propose to link frame-level instance representations as a Brownian bridge to model instance dynamics, and to align the bridge-level instance representation with class texts for more precise open-vocabulary VIS (BriVIS). Specifically, we build our system upon a frozen video segmentor that generates frame-level instance queries, and design a Temporal Instance Resampler (TIR) to generate queries with temporal context from the frame queries. To mold instance queries to follow a Brownian bridge and to accomplish alignment with class texts, we design Bridge-Text Alignment (BTA), which learns discriminative bridge-level representations of instances via contrastive objectives. With MinVIS as the basic video segmentor, BriVIS surpasses the open-vocabulary SOTA (OV2Seg) by a clear margin. For example, on the challenging large-vocabulary VIS dataset (BURST), BriVIS achieves 7.43 mAP, a 49.49% relative improvement over OV2Seg (4.97 mAP).
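To make the Brownian bridge idea concrete, here is a minimal pure-Python sketch. It is not the paper's BTA objective, which the abstract does not spell out. It relies only on the standard property that a Brownian bridge pinned at a start and an end point has a mean that interpolates linearly between them and a variance of t(T - t)/T at step t. The function names, the use of the endpoint midpoint as a "bridge-level" representation, and the temperature value are all hypothetical choices made for illustration.

```python
import math


def bridge_mean(z_start, z_end, t, T):
    """Expected position at step t of a Brownian bridge pinned at z_start (t=0) and z_end (t=T)."""
    a = t / T
    return [(1.0 - a) * s + a * e for s, e in zip(z_start, z_end)]


def bridge_variance(t, T):
    """Per-dimension variance of a standard Brownian bridge at step t (zero at both ends)."""
    return t * (T - t) / T


def bridge_deviation(frames):
    """Variance-scaled squared deviation of intermediate frames from the bridge mean.

    `frames` is a list of per-frame instance embeddings (lists of floats).
    Returns 0.0 when there are no intermediate frames.
    """
    T = len(frames) - 1
    if T < 2:
        return 0.0
    total = 0.0
    for t in range(1, T):
        mean = bridge_mean(frames[0], frames[-1], t, T)
        sq = sum((x - m) ** 2 for x, m in zip(frames[t], mean))
        total += sq / (2.0 * bridge_variance(t, T))
    return total / (T - 1)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_bridge(frames, text_embeds, temperature=0.07):
    """Softmax over cosine similarities between a bridge-level vector and class-text embeddings.

    Assumption (not from the paper): the bridge-level vector is the midpoint
    of the first and last frame embeddings.
    """
    center = [0.5 * (s + e) for s, e in zip(frames[0], frames[-1])]
    logits = [cosine(center, t) / temperature for t in text_embeds]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    norm = sum(exps)
    return [e / norm for e in exps]
```

A trajectory whose intermediate frames lie on the straight line between its endpoints has zero deviation, while one that wanders away from that line is penalized most heavily near the endpoints, where the bridge variance is smallest.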
- Zesen Cheng
- Kehan Li
- Hao Li
- Peng Jin
- Chang Liu
- Xiawu Zheng
- Rongrong Ji
- Jie Chen