OmniVid: A Generative Framework for Universal Video Understanding (2403.17935v1)

Published 26 Mar 2024 in cs.CV

Abstract: The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution. Despite sharing a common goal, different tasks often rely on distinct model architectures and annotation formats. In contrast, natural language processing benefits from a unified output space, i.e., text sequences, which simplifies the training of powerful foundational LLMs, such as GPT-3, on extensive training corpora. Inspired by this, we seek to unify the output space of video understanding tasks by using language as labels and additionally introducing time and box tokens. In this way, a variety of video tasks can be formulated as video-grounded token generation. This enables us to address various types of video tasks, including classification (such as action recognition), captioning (covering clip captioning, video question answering, and dense video captioning), and localization tasks (such as visual object tracking) within a fully shared encoder-decoder architecture, following a generative framework. Through comprehensive experiments, we demonstrate that this simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results on seven video benchmarks, providing a novel perspective for more universal video understanding. Code is available at https://github.com/wangjk666/OmniVid.
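
To make the unified output space concrete, the sketch below shows how heterogeneous targets (an action label, a timestamped event caption, and a per-frame bounding box) can all be serialized into one token sequence once discrete time and box tokens are added to the text vocabulary. This is a minimal illustration under assumed bin counts and token names (e.g. <time_k>, <box_k>), not the paper's exact tokenization.

# Minimal sketch (not the authors' exact implementation) of serializing
# different video-task targets into one token space. Bin counts, token
# names, and the serialization order are illustrative assumptions.

TIME_BINS = 100   # assumed number of discrete time bins
BOX_BINS = 1000   # assumed number of discrete coordinate bins

def time_token(t_sec, duration_sec):
    """Map a timestamp to a discrete time token, e.g. <time_42>."""
    b = min(int(t_sec / duration_sec * TIME_BINS), TIME_BINS - 1)
    return f"<time_{b}>"

def box_tokens(box, width, height):
    """Map an (x1, y1, x2, y2) box in pixels to four coordinate tokens."""
    x1, y1, x2, y2 = box
    coords = [x1 / width, y1 / height, x2 / width, y2 / height]
    return [f"<box_{min(int(c * BOX_BINS), BOX_BINS - 1)}>" for c in coords]

# Action recognition: the class name itself is the target sequence.
recognition_target = "playing guitar"

# Dense video captioning: time tokens bracket each event caption.
dense_caption_target = " ".join([
    time_token(12.0, 180.0), time_token(25.5, 180.0),
    "a man tunes the guitar",
])

# Visual object tracking: one box (four coordinate tokens) per frame.
tracking_target = " ".join(box_tokens((48, 30, 210, 175), 640, 360))

Because every target reduces to a plain token sequence, a single encoder-decoder trained with one generative loss can, in principle, cover classification, captioning, and localization, which is the unification the abstract describes.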

Authors (7)
  1. Junke Wang (18 papers)
  2. Dongdong Chen (164 papers)
  3. Chong Luo (58 papers)
  4. Bo He (32 papers)
  5. Lu Yuan (130 papers)
  6. Zuxuan Wu (144 papers)
  7. Yu-Gang Jiang (223 papers)
Citations (8)