Fully Transformer-Equipped Architecture for End-to-End Referring Video Object Segmentation (2309.11933v1)
Abstract: Referring Video Object Segmentation (RVOS) requires segmenting the object in a video that is referred to by a natural language query. Existing methods mainly rely on sophisticated pipelines to tackle this cross-modal task, and do not explicitly model the object-level spatial context, which plays an important role in locating the referred object. Therefore, we propose an end-to-end RVOS framework built entirely upon transformers, termed Fully Transformer-Equipped Architecture (FTEA), which treats the RVOS task as a mask sequence learning problem and regards all the objects in the video as candidate objects. Given a video clip with a text query, the encoder yields the visual-textual features, while the corresponding pixel-level and word-level features are aligned in terms of semantic similarity. To capture the object-level spatial context, we develop the Stacked Transformer, which individually characterizes the visual appearance of each candidate object, and whose feature map is decoded directly into the binary mask sequence in order. Finally, the model finds the best matching between mask sequence and text query. In addition, to diversify the generated masks for candidate objects, we impose a diversity loss on the model so that it captures a more accurate mask of the referred object. Empirical studies show the superiority of the proposed method on three benchmarks: FTEA achieves 45.1% and 38.7% mAP on A2D Sentences (3782 videos) and J-HMDB Sentences (928 videos), respectively, and 56.6% $\mathcal{J}\&\mathcal{F}$ on Ref-YouTube-VOS (3975 videos and 7451 objects). In particular, compared to the best candidate method, it gains 2.1% and 3.2% in P@0.5 on the former two, respectively, and 2.9% in $\mathcal{J}$ on the latter.
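The abstract does not give the form of the diversity loss or the matching step, so the following is only an illustrative sketch under stated assumptions, not the authors' formulation. It assumes each candidate object's mask is a flattened list of per-pixel probabilities, takes the mean pairwise cosine similarity between candidates as a stand-in diversity penalty (lower means more diverse masks), and picks the referred object as the candidate with the highest query-matching score. The names `diversity_penalty` and `pick_referred` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def diversity_penalty(masks):
    """Assumed stand-in for a diversity loss: mean pairwise cosine
    similarity between flattened candidate masks. Minimizing it pushes
    the candidates toward covering different regions."""
    n = len(masks)
    if n < 2:
        return 0.0
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine(masks[i], masks[j])
            pairs += 1
    return total / pairs


def pick_referred(scores):
    """Index of the candidate whose (assumed) text-matching score is highest."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy example: three candidate masks over a 4-pixel frame.
candidates = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
]
penalty = diversity_penalty(candidates)
referred = pick_referred([0.1, 0.7, 0.2])
```

In this toy setup, identical candidates would give a penalty near 1 and non-overlapping candidates a penalty of 0, which is the behavior one would want from any term meant to spread the candidate masks apart.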