
Accurate and Fast Compressed Video Captioning (2309.12867v2)

Published 22 Sep 2023 in cs.CV and cs.AI

Abstract: Existing video captioning approaches typically first sample frames from a decoded video and then run subsequent processing (e.g., feature extraction and/or captioning-model learning). In this pipeline, manual frame sampling may miss key information in the video and thus degrade performance, while redundant information in the sampled frames makes video-captioning inference inefficient. To address this, we study video captioning from a different perspective, in the compressed domain, which brings multiple advantages over the existing pipeline: 1) compared to raw images from the decoded video, the compressed video, consisting of I-frames, motion vectors, and residuals, is highly distinguishable, which allows us to leverage the entire video for learning, without manual sampling, through a specialized model design; 2) the captioning model is more efficient at inference because it processes smaller and less redundant inputs. We propose a simple yet effective end-to-end transformer that learns directly from the compressed video for captioning. We show that, even with this simple design, our method achieves state-of-the-art performance on different benchmarks while running almost 2x faster than existing approaches. Code is available at https://github.com/acherstyx/CoCap.
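To make the pipeline described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of a compressed-domain captioner: three linear projections stand in for the per-stream encoders of I-frame, motion-vector, and residual features, a small transformer encoder mixes the three token streams, and a transformer decoder generates the caption. All module names, feature dimensions, and layer counts here are illustrative assumptions; this is not the authors' CoCap design, which lives in the linked repository.

```python
# Hypothetical sketch of a compressed-domain video captioner, loosely
# following the abstract's pipeline (I-frames + motion vectors + residuals
# -> transformer -> caption). NOT the authors' CoCap implementation;
# see https://github.com/acherstyx/CoCap for the real code.
import torch
import torch.nn as nn


class CompressedVideoCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512):
        super().__init__()
        # Linear projections stand in for the per-stream encoders.
        # Input feature dims (2048 / 256 / 256) are illustrative assumptions.
        self.iframe_proj = nn.Linear(2048, d_model)  # I-frame features
        self.mv_proj = nn.Linear(256, d_model)       # motion-vector features
        self.res_proj = nn.Linear(256, d_model)      # residual features
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, iframes, mvs, residuals, caption_ids):
        # Concatenate the three token streams so attention can mix them.
        tokens = torch.cat([
            self.iframe_proj(iframes),   # (B, N_i, d)
            self.mv_proj(mvs),           # (B, N_m, d)
            self.res_proj(residuals),    # (B, N_r, d)
        ], dim=1)
        memory = self.encoder(tokens)
        tgt = self.embed(caption_ids)    # (B, T, d)
        # Causal mask: each caption token attends only to earlier tokens.
        T = caption_ids.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)         # (B, T, vocab) logits


# Toy usage with random tensors (batch of 2; token counts are arbitrary):
model = CompressedVideoCaptioner()
logits = model(
    torch.randn(2, 8, 2048),            # 8 I-frame tokens
    torch.randn(2, 32, 256),            # 32 motion-vector tokens
    torch.randn(2, 32, 256),            # 32 residual tokens
    torch.randint(0, 10000, (2, 12)),   # 12 caption token ids
)
print(logits.shape)                      # torch.Size([2, 12, 10000])
```

The design choice the sketch illustrates is the abstract's second claim: because motion vectors and residuals are far smaller than decoded RGB frames, the concatenated token sequence stays short, so inference touches less redundant data than a frame-sampling pipeline.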

Authors (6)
  1. Yaojie Shen
  2. Xin Gu
  3. Kai Xu
  4. Heng Fan
  5. Longyin Wen
  6. Libo Zhang
Citations (14)