AutoAD III: The Prequel -- Back to the Pixels (2404.14412v1)

Published 22 Apr 2024 in cs.CV

Abstract: Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual LLMs for AD generation are limited by a lack of suitable training data, and also their evaluation is hampered by using performance measures not specialized to the AD domain. In this paper, we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these. These datasets will be publicly released; (ii) We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and LLMs; and (iii) We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance. Taken together, we improve the state of the art on AD generation.
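
The abstract's second contribution is a Q-former-based architecture that bridges a frozen pre-trained visual encoder and a frozen LLM: a small set of learnable query tokens cross-attends to the raw-video features and is projected into the LLM's embedding space to condition AD generation. The following is a minimal PyTorch sketch of that general (BLIP-2-style) design only; all module names, dimensions, and layer counts are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    """Illustrative Q-Former-style bridge: learnable query tokens cross-attend
    to features from a frozen visual encoder and are projected into the
    embedding space of a frozen LLM (encoder and LLM are omitted here)."""

    def __init__(self, vis_dim=1408, hidden_dim=768, llm_dim=4096,
                 num_queries=32, num_layers=2, num_heads=8):
        super().__init__()
        # Learnable query tokens (the only trainable interface to the frozen models).
        self.queries = nn.Parameter(torch.randn(1, num_queries, hidden_dim) * 0.02)
        # Project frozen visual features into the bridge's working dimension.
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        # A small stack of cross-attention (decoder-style) layers.
        self.layers = nn.ModuleList([
            nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=num_heads,
                                       batch_first=True)
            for _ in range(num_layers)
        ])
        # Project the query outputs into the frozen LLM's embedding space.
        self.llm_proj = nn.Linear(hidden_dim, llm_dim)

    def forward(self, frame_feats):
        # frame_feats: (B, T*P, vis_dim) patch tokens from a frozen visual encoder.
        mem = self.vis_proj(frame_feats)
        q = self.queries.expand(frame_feats.size(0), -1, -1)
        for layer in self.layers:
            q = layer(q, mem)            # queries attend to the visual tokens
        return self.llm_proj(q)          # (B, num_queries, llm_dim) soft visual prompt

if __name__ == "__main__":
    bridge = QFormerBridge()
    feats = torch.randn(2, 8 * 257, 1408)   # e.g. 8 frames x 257 patch tokens each
    prompts = bridge(feats)
    print(prompts.shape)                     # torch.Size([2, 32, 4096])
```

In a pipeline of this kind, the projected query tokens would be prepended to the frozen LLM's input embeddings as a soft visual prompt, with any textual context supplied as ordinary tokens; only the bridge is trained, which is what makes the limited AD training data workable.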

Authors (6)
  1. Tengda Han (23 papers)
  2. Max Bain (15 papers)
  3. Arsha Nagrani (62 papers)
  4. Gül Varol (39 papers)
  5. Weidi Xie (132 papers)
  6. Andrew Zisserman (248 papers)
Citations (12)
