ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation (2103.10702v4)
Abstract: Text-based video segmentation is a challenging task that segments the objects referred to by a natural language expression in videos. It essentially requires both semantic comprehension and fine-grained video understanding. Existing methods introduce language representations into segmentation models in a bottom-up manner, conducting vision-language interaction only within the local receptive fields of ConvNets. We argue that such interaction is insufficient, since the model can barely build region-level relationships from partial observations, which runs contrary to the way natural language referring expressions are constructed. In practice, people usually describe a target object through its relations to other objects, which cannot be easily understood without seeing the whole video. To address this issue, we introduce a novel top-down approach that imitates how humans segment an object under language guidance: we first identify all candidate objects in the video and then select the referred one by parsing relations among these high-level objects. Three kinds of object-level relations are investigated for precise relationship understanding, i.e., positional relations, text-guided semantic relations, and temporal relations. Extensive experiments on A2D Sentences and J-HMDB Sentences show that our method outperforms state-of-the-art methods by a large margin. Qualitative results further show that our predictions are more explainable.
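The top-down pipeline described above — enumerate candidate objects, then score each one against the query by combining positional, text-guided semantic, and temporal relation cues — can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the feature shapes, the specific cue formulas, and the fusion weights (`0.1` for the positional cue) are all assumptions made for clarity.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Unit-normalize vectors along the given axis (with a small epsilon)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def score_candidates(obj_feats, obj_centers, text_feat):
    """Score candidate objects with three object-level relation cues.

    obj_feats:   (T, N, D) per-frame appearance features for N candidates
    obj_centers: (T, N, 2) per-frame object center coordinates
    text_feat:   (D,)      sentence embedding of the referring expression
    Returns a (N,) score vector; argmax is the predicted referred object.
    All names, shapes, and formulas are illustrative assumptions.
    """
    # Positional relation cue: encode each object's offsets to all other
    # objects; here, a smaller total offset (a more "central" object) scores higher.
    offsets = obj_centers[:, :, None, :] - obj_centers[:, None, :, :]  # (T, N, N, 2)
    pos_cue = -np.abs(offsets).sum(axis=(2, 3))                        # (T, N)

    # Text-guided semantic relation cue: cosine similarity between each
    # object's appearance feature and the sentence embedding.
    sem_cue = l2_normalize(obj_feats) @ l2_normalize(text_feat)        # (T, N)

    # Temporal relation cue: reward objects whose appearance stays
    # consistent with their own temporal mean across frames.
    mean_feat = l2_normalize(obj_feats.mean(axis=0))                   # (N, D)
    temp_cue = (l2_normalize(obj_feats) * mean_feat[None]).sum(-1)     # (T, N)

    # Fuse the three cues and average over time (weights are assumptions).
    return (sem_cue + temp_cue + 0.1 * pos_cue).mean(axis=0)           # (N,)
```

In the actual model these cues would be produced by learned relation modules over instance embeddings (e.g., from an instance segmentation backbone), but the sketch shows the key top-down design choice: relations are computed among whole-object representations rather than within local convolutional receptive fields.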