Open-Vocabulary Video Relation Extraction (2312.15670v1)
Abstract: A comprehensive understanding of videos is inseparable from describing the action together with its contextual action-object interactions. However, many current video understanding tasks prioritize general action classification and overlook the actors and relationships that shape the nature of the action, resulting in a superficial understanding. Motivated by this, we introduce Open-vocabulary Video Relation Extraction (OVRE), a novel task that views action understanding through the lens of action-centric relation triplets. OVRE focuses on the pairwise relations that take part in the action and describes these relation triplets in natural language. Moreover, we curate the Moments-OVRE dataset, which comprises 180K videos with action-centric relation triplets, sourced from a multi-label action classification dataset. With Moments-OVRE, we further propose a cross-modal mapping model to generate relation triplets as a sequence. Finally, we benchmark existing cross-modal generation models on the new task of OVRE.
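The abstract frames relation extraction as sequence generation: the model emits all of a video's relation triplets as one text sequence. As a concrete illustration, the sketch below shows one plausible way to linearize (subject, predicate, object) triplets into a single target string and to parse generated output back into triplets. The separator tokens and the exact format are assumptions made for this example, not the serialization scheme used by the paper.

```python
# Minimal sketch of triplet linearization for sequence generation.
# The marker tokens below are illustrative assumptions, not the
# paper's actual serialization scheme.

SUBJ, PRED, OBJ, SEP = "<subj>", "<pred>", "<obj>", "<sep>"

def linearize(triplets):
    """Flatten (subject, predicate, object) triplets into one target string."""
    parts = [f"{SUBJ} {s} {PRED} {p} {OBJ} {o}" for s, p, o in triplets]
    return f" {SEP} ".join(parts)

def parse(sequence):
    """Recover triplets from a generated sequence; inverse of linearize()."""
    triplets = []
    for chunk in sequence.split(SEP):
        try:
            s = chunk.split(SUBJ)[1].split(PRED)[0].strip()
            p = chunk.split(PRED)[1].split(OBJ)[0].strip()
            o = chunk.split(OBJ)[1].strip()
            triplets.append((s, p, o))
        except IndexError:
            continue  # skip malformed chunks from imperfect decoding
    return triplets

if __name__ == "__main__":
    demo = [("person", "riding", "horse"), ("horse", "running on", "beach")]
    seq = linearize(demo)
    print(seq)
    assert parse(seq) == demo
```

Decoding back into discrete triplets (rather than scoring free-form captions) is what lets open-vocabulary predictions be evaluated triplet-by-triplet against ground truth.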