
Open-Vocabulary Video Relation Extraction (2312.15670v1)

Published 25 Dec 2023 in cs.CV

Abstract: A comprehensive understanding of videos is inseparable from describing the action with its contextual action-object interactions. However, many current video understanding tasks prioritize general action classification and overlook the actors and relationships that shape the nature of the action, resulting in a superficial understanding of the action. Motivated by this, we introduce Open-vocabulary Video Relation Extraction (OVRE), a novel task that views action understanding through the lens of action-centric relation triplets. OVRE focuses on pairwise relations that take part in the action and describes these relation triplets with natural languages. Moreover, we curate the Moments-OVRE dataset, which comprises 180K videos with action-centric relation triplets, sourced from a multi-label action classification dataset. With Moments-OVRE, we further propose a crossmodal mapping model to generate relation triplets as a sequence. Finally, we benchmark existing cross-modal generation models on the new task of OVRE.

References (50)
  1. REBEL: Relation extraction by end-to-end language generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, 2370–2381.
  2. Valor: Vision-audio-language omni-perception pretraining model and dataset. arXiv preprint arXiv:2304.08345.
  3. Semantic proposal for activity localization in videos via sentence query. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 8199–8206.
  4. Deep Learning for Video Captioning: A Review. In IJCAI, volume 1, 2.
  5. RandAugment: Practical data augmentation with no separate search. CoRR, abs/1909.13719.
  6. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  7. Video relation detection via tracklet based visual transformer. In Proceedings of the 29th ACM international conference on multimedia, 4833–4837.
  8. Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
  9. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.
  10. Detecting and recognizing human-object interactions. In Proceedings of the IEEE conference on computer vision and pattern recognition, 8359–8367.
  11. The “Something Something” Video Database for Learning and Evaluating Visual Common Sense. In 2017 IEEE International Conference on Computer Vision (ICCV), 5843–5851.
  12. Seq-NMS for Video Object Detection. arXiv:1602.08465.
  13. ActivityNet: A large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 961–970.
  14. Action Genome: Actions as Composition of Spatio-temporal Scene Graphs. CoRR, abs/1912.06992.
  15. The Kinetics Human Action Video Dataset. ArXiv, abs/1705.06950.
  16. Human action recognition and prediction: A survey. International Journal of Computer Vision, 130(5): 1366–1401.
  17. Hake: a knowledge engine foundation for human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  18. Knowledge-guided semantic transfer network for few-shot image recognition. IEEE Transactions on Neural Networks and Learning Systems.
  19. Deep collaborative embedding for social image understanding. IEEE transactions on pattern analysis and machine intelligence, 41(9): 2070–2083.
  20. Swinbert: End-to-end transformers with sparse attention for video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 17949–17958.
  21. FineAction: A Fine-Grained Video Dataset for Temporal Action Localization. arXiv:2105.11107.
  22. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing, 508: 293–304.
  23. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734.
  24. Moments in Time Dataset: one million videos for event understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–8.
  25. Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–1.
  26. Expanding Language-Image Pretrained Models for General Video Recognition. arXiv:2208.02816.
  27. Locate before answering: Answer guided question localization for video question answering. IEEE Transactions on Multimedia.
  28. Video Relation Detection with Spatio-Temporal Graph. In Proceedings of the 27th ACM International Conference on Multimedia, MM ’19, 84–93. New York, NY, USA: Association for Computing Machinery. ISBN 9781450368896.
  29. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763. PMLR.
  30. Language models are unsupervised multitask learners. OpenAI blog, 1(8): 9.
  31. Finetuned CLIP models are efficient video learners. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  32. Visual semantic role labeling for video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5589–5600.
  33. Annotating Objects and Relations in User-Generated Videos. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, 279–287. ACM.
  34. Video Visual Relation Detection. Proceedings of the 25th ACM international conference on Multimedia.
  35. Relation Triplet Construction for Cross-modal Text-to-Video Retrieval. In Proceedings of the 31st ACM International Conference on Multimedia, 4759–4767.
  36. Spatial-temporal graphs for cross-modal text2video retrieval. IEEE Transactions on Multimedia, 24: 2914–2923.
  37. Clip4caption: Clip for video caption. In Proceedings of the 29th ACM International Conference on Multimedia, 4858–4862.
  38. GIT: A Generative Image-to-text Transformer for Vision and Language. arXiv:2205.14100.
  39. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4581–4591.
  40. Visual co-occurrence alignment learning for weakly-supervised video moment retrieval. In Proceedings of the 29th ACM International Conference on Multimedia, 1459–1468.
  41. A Survey on Temporal Action Localization. IEEE Access, 8: 70477–70487.
  42. mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video. arXiv:2302.00402.
  43. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5288–5296.
  44. VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners. arXiv:2212.04979.
  45. Commonsense justification for action explanation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2627–2637.
  46. Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, 5534–5542.
  47. Reconstruct and represent video contents for captioning via reinforcement learning. IEEE transactions on pattern analysis and machine intelligence, 42(12): 3088–3101.
  48. Videolt: Large-scale long-tailed video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 7960–7969.
  49. VRDFormer: End-to-End Video Visual Relation Detection with Transformers. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18814–18824.
  50. RegionCLIP: Region-based Language-Image Pretraining. arXiv:2112.09106.

Summary

  • The paper introduces OVRE, a novel task that extracts detailed action-centric relation triplets to capture nuanced actor-action relationships.
  • It leverages a cross-modal mapping model that uses the CLIP visual encoder and a pre-trained LLM to translate video semantics into natural language.
  • The Moments-OVRE dataset, comprising 180,000 videos with unrestricted-vocabulary annotations, provides large-scale training and evaluation data for open-vocabulary video relation extraction.

The paper "Open-Vocabulary Video Relation Extraction" introduces a novel task designed to enhance video understanding by focusing on action-centric relation triplets. This task, termed Open-vocabulary Video Relation Extraction (OVRE), seeks to transcend traditional action classification methods that often overlook the nuanced actors and relationships involved in video actions. Instead, OVRE emphasizes the extraction of pairwise relations in videos and describes these relation triplets using natural language.

Key elements of the paper include the introduction of the Moments-OVRE dataset, a vast collection consisting of 180,000 videos annotated with action-centric relation triplets. This dataset is derived from the Multi-Moments in Time (M-MiT) dataset, which is known for its multi-label nature and brief video durations, typically around three seconds.
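
As a rough illustration of what a Moments-OVRE annotation might contain, the record below pairs an M-MiT clip with its action labels and open-vocabulary triplets. The field names and values are hypothetical and do not reflect the released schema.

```python
# Illustrative only: field names and values are assumptions, not the released schema.
example_record = {
    "video_id": "mmit_clip_0001",              # clip inherited from Multi-Moments in Time
    "duration_sec": 3.0,                       # M-MiT clips run roughly three seconds
    "action_labels": ["cooking", "chopping"],  # multi-label actions from the source dataset
    "relation_triplets": [                     # open-vocabulary, action-centric annotations
        ["chef", "chopping", "vegetables"],
        ["chef", "holding", "knife"],
    ],
}
```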

The authors propose a cross-modal mapping model that uses the CLIP visual encoder to encode video semantics, which a pre-trained language model then translates into natural-language relation triplets. This approach allows generation with an unconstrained vocabulary, moving beyond the fixed label sets typical of related tasks such as Video Visual Relation Detection (VidVRD) and Action Genome, which restrict objects and predicates to limited categories and therefore often fail to capture the full complexity of actions and their contexts.
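
This description matches a ClipCap-style prefix design: frozen CLIP frame features are mapped into the language model's embedding space and prepended as a soft prefix for autoregressive decoding of the triplet sequence. The sketch below follows that pattern; the checkpoint names, mean pooling, prefix length, and mapping network are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, GPT2LMHeadModel

class VideoToTripletModel(nn.Module):
    """Hedged sketch: CLIP frame encoder -> mapping network -> GPT-2 prefix decoding."""

    def __init__(self, prefix_len: int = 8):
        super().__init__()
        self.visual = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.lm = GPT2LMHeadModel.from_pretrained("gpt2")
        self.prefix_len = prefix_len
        d_clip = self.visual.config.hidden_size  # 768 for ViT-B/32
        d_lm = self.lm.config.n_embd             # 768 for GPT-2
        # Mapping network that turns pooled video features into a soft prefix.
        self.mapper = nn.Sequential(nn.Linear(d_clip, d_lm * prefix_len), nn.Tanh())

    def forward(self, frames: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, 3, 224, 224); target_ids: (1, seq_len) tokenized triplet sequence.
        with torch.no_grad():  # keep the visual encoder frozen
            feats = self.visual(pixel_values=frames).pooler_output  # (num_frames, d_clip)
        video_feat = feats.mean(dim=0, keepdim=True)                # mean-pool over frames
        prefix = self.mapper(video_feat).view(1, self.prefix_len, -1)
        token_embeds = self.lm.transformer.wte(target_ids)
        inputs = torch.cat([prefix, token_embeds], dim=1)
        # Mask the prefix positions out of the loss with -100 labels.
        labels = torch.cat(
            [torch.full((1, self.prefix_len), -100, dtype=torch.long), target_ids], dim=1
        )
        return self.lm(inputs_embeds=inputs, labels=labels).loss
```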

In benchmarking for OVRE, existing cross-modal generation models are evaluated, establishing baselines for further research on video relation extraction (a simple scoring sketch is given after the list below). By leveraging the expansive vocabulary of pre-trained language models, the authors aim to capture more dynamic and nuanced actions within videos, offering a deeper understanding of video content. The Moments-OVRE dataset is also positioned as the most extensive video relation extraction dataset, featuring:

  1. Unrestricted Vocabulary Annotations: Annotations encompass diverse actors and relations, providing a more accurate representation of real-world scenarios.
  2. Focus on Action-Centric Annotations: The emphasis is on annotations relevant to video actions/events.
  3. Scale: Its 180,000 videos make it a substantial addition to existing video relation extraction datasets.
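
For the benchmarking mentioned above, generated sequences must be parsed back into triplets and compared against the ground-truth annotations. The snippet below shows a simple exact-match precision/recall/F1 computation; this is an illustrative baseline metric, not necessarily the paper's evaluation protocol, which in an open-vocabulary setting would more plausibly rely on softer, similarity-based matching.

```python
from typing import List, Tuple

Triplet = Tuple[str, str, str]

def precision_recall_f1(pred: List[Triplet], gold: List[Triplet]):
    """Exact-match triplet scoring; an illustrative baseline, not the paper's metric."""
    pred_set, gold_set = set(pred), set(gold)
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

pred = [("chef", "chopping", "vegetables"), ("chef", "wearing", "apron")]
gold = [("chef", "chopping", "vegetables"), ("chef", "holding", "knife")]
print(precision_recall_f1(pred, gold))  # (0.5, 0.5, 0.5)
```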

In conclusion, the paper presents OVRE as a step forward in video understanding, bridging general action classification and detailed linguistic description through context-level comprehension of video content. The proposed dataset and task formulation provide a richer framework for modeling human-like comprehension of video scenes and are poised to advance automatic video understanding.