End-to-End Dense Video Grounding via Parallel Regression (2109.11265v5)
Abstract: Video grounding aims to localize the video moment corresponding to a language query in an untrimmed video. Existing methods often address this task indirectly, casting it as a proposal-and-match or fusion-and-detection problem. Solving these surrogate problems typically requires sophisticated label assignment during training and hand-crafted removal of near-duplicate results. Moreover, existing work usually focuses on sparse video grounding with a single sentence as input, which can yield ambiguous localization when the description is unclear. In this paper, we tackle a new problem of dense video grounding: simultaneously localizing multiple moments given a paragraph as input. Viewing video grounding as language-conditioned regression, we present an end-to-end parallel decoding paradigm that re-purposes a Transformer-like architecture (PRVG). The key design in PRVG is to use language as queries and directly regress moment boundaries from language-modulated visual representations. Thanks to its simple design, our PRVG framework can be applied in different testing schemes (sparse or dense grounding) and allows efficient inference without any post-processing. In addition, we devise a robust proposal-level attention loss to guide the training of PRVG, which is invariant to moment duration and aids model convergence. Experiments on two video grounding benchmarks, ActivityNet Captions and TACoS, demonstrate that PRVG significantly outperforms previous methods. We also perform in-depth studies to investigate the effectiveness of the parallel regression paradigm for video grounding.
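To make the parallel decoding idea concrete, below is a minimal PyTorch sketch of a PRVG-style decoder. All names, dimensions, and the (center, width) boundary parameterization are illustrative assumptions rather than the authors' released implementation; the only elements taken from the abstract are that sentence-level language features act as decoder queries over encoded video features, and that moment boundaries are regressed directly, one moment per sentence, with no post-processing.

```python
# Hypothetical sketch of parallel language-conditioned regression.
# Assumptions: sentence embeddings as queries, (center, width) output head.
import torch
import torch.nn as nn

class ParallelRegressionDecoder(nn.Module):
    """Transformer decoder that takes one query per sentence and directly
    regresses a normalized (start, end) boundary for each query."""

    def __init__(self, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.boundary_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 2),  # predicts (center, width)
        )

    def forward(self, sentence_queries, video_memory):
        # sentence_queries: (B, N_sent, D) -- one query per paragraph sentence
        # video_memory:     (B, T, D)      -- encoded clip features
        h = self.decoder(sentence_queries, video_memory)
        # Sigmoid keeps (center, width) in [0, 1]; convert to (start, end).
        cw = self.boundary_head(h).sigmoid()
        center, width = cw[..., 0], cw[..., 1]
        start = (center - 0.5 * width).clamp(0.0, 1.0)
        end = (center + 0.5 * width).clamp(0.0, 1.0)
        return torch.stack([start, end], dim=-1)  # (B, N_sent, 2)

# Example: a paragraph of 3 sentences over a 128-clip video (batch of 2).
model = ParallelRegressionDecoder()
queries = torch.randn(2, 3, 512)    # e.g., pooled sentence features
memory = torch.randn(2, 128, 512)   # e.g., clip features after the encoder
spans = model(queries, memory)      # (2, 3, 2) normalized (start, end) pairs
```

Because each sentence query yields exactly one boundary prediction, there is no proposal matching, ranking, or near-duplicate removal at inference time, which is the source of the efficiency claim in the abstract.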
Authors: Fengyuan Shi, Weilin Huang, Limin Wang