Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment (2307.02682v2)

Published 5 Jul 2023 in cs.CV and cs.CL

Abstract: Dense video captioning, a task of localizing meaningful moments and generating relevant captions for videos, often requires a large, expensive corpus of annotated video segments paired with text. In an effort to minimize the annotation cost, we propose ZeroTA, a novel method for dense video captioning in a zero-shot manner. Our method does not require any videos or annotations for training; instead, it localizes and describes events within each input video at test time by optimizing solely on the input. This is accomplished by introducing a soft moment mask that represents a temporal segment in the video and jointly optimizing it with the prefix parameters of a language model. This joint optimization aligns a frozen language generation model (i.e., GPT-2) with a frozen vision-language contrastive model (i.e., CLIP) by maximizing the matching score between the generated text and a moment within the video. We also introduce a pairwise temporal IoU loss to let a set of soft moment masks capture multiple distinct events within the video. Our method effectively discovers diverse significant events within the video, with the resulting captions appropriately describing these events. The empirical results demonstrate that ZeroTA surpasses zero-shot baselines and even outperforms the state-of-the-art few-shot method on the widely-used benchmark ActivityNet Captions. Moreover, our method shows greater robustness compared to supervised methods when evaluated in out-of-domain scenarios. This research provides insight into the potential of aligning widely-used models, such as language generation models and vision-language models, to unlock a new capability: understanding temporal aspects of videos.
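The two components that make the test-time optimization differentiable are the soft moment mask and the pairwise temporal IoU penalty. The sketch below (PyTorch) illustrates one plausible realization: each moment is parameterized by a learnable center and width on a normalized timeline, turned into a per-frame soft mask via sigmoids, and pairs of masks are discouraged from overlapping through a soft temporal IoU term. The exact parameterization, sharpness, and loss weighting used in ZeroTA may differ; this is an illustrative assumption, and the CLIP-vs-GPT-2 matching objective that drives the prefix parameters is omitted here.

```python
# Minimal sketch of a differentiable soft moment mask and a pairwise
# temporal IoU penalty, as described in the abstract. Parameterization
# (center/width sigmoids, sharpness) is assumed, not taken from the paper.
import torch


def soft_moment_mask(center, width, num_frames, sharpness=10.0):
    """Return a (num_frames,) mask in [0, 1] that is ~1 inside the moment
    [center - width/2, center + width/2] and ~0 outside, yet remains
    differentiable with respect to center and width."""
    t = torch.linspace(0.0, 1.0, num_frames)              # normalized frame times
    left = torch.sigmoid(sharpness * (t - (center - width / 2)))
    right = torch.sigmoid(sharpness * ((center + width / 2) - t))
    return left * right


def pairwise_tiou_loss(masks):
    """Average soft temporal IoU over all mask pairs; minimizing it pushes
    the moments toward distinct, non-overlapping events. masks: (k, T)."""
    k = masks.shape[0]
    loss, pairs = masks.new_zeros(()), 0
    for i in range(k):
        for j in range(i + 1, k):
            inter = torch.minimum(masks[i], masks[j]).sum()
            union = torch.maximum(masks[i], masks[j]).sum() + 1e-6
            loss = loss + inter / union                    # soft temporal IoU
            pairs += 1
    return loss / max(pairs, 1)


# Toy usage: two learnable moments over a 64-frame video. In the full method,
# this penalty would be added to a CLIP matching loss between mask-weighted
# video features and the text generated from the learned GPT-2 prefix.
centers = torch.tensor([0.3, 0.7], requires_grad=True)
widths = torch.tensor([0.2, 0.2], requires_grad=True)
masks = torch.stack([soft_moment_mask(c, w, 64) for c, w in zip(centers, widths)])
overlap_penalty = pairwise_tiou_loss(masks)
overlap_penalty.backward()   # gradients flow back to the moment parameters
```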

Authors (6)
  1. Yongrae Jo (9 papers)
  2. Seongyun Lee (13 papers)
  3. Aiden SJ Lee (1 paper)
  4. Hyunji Lee (19 papers)
  5. Hanseok Oh (8 papers)
  6. Minjoon Seo (82 papers)
Citations (2)