AutoAD III: The Prequel -- Back to the Pixels (2404.14412v1)
Abstract: Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual LLMs for AD generation are limited by a lack of suitable training data, and their evaluation is hampered by performance measures that are not specialized to the AD domain. In this paper, we make three contributions: (i) we propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these; the datasets will be publicly released; (ii) we develop a Q-former-based architecture that ingests raw video and generates AD, using frozen pre-trained visual encoders and LLMs; and (iii) we provide new evaluation metrics to benchmark AD quality that are well-matched to human performance. Taken together, we improve the state of the art on AD generation.
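Contribution (ii) describes a Q-former-style bridging module that maps features from a frozen visual encoder into soft prompts for a frozen LLM. The sketch below is a minimal illustration of that general pattern, not the authors' implementation: the dimensions, query count, depth, and module names are assumptions, and a plain PyTorch transformer decoder stands in for the Q-former's self- and cross-attention blocks.

```python
# Minimal, illustrative sketch (not the paper's code) of a Q-former-style bridge:
# learnable query tokens cross-attend to frozen video features and are projected
# into the LLM's embedding space as soft prompt vectors. All hyperparameters here
# are assumptions chosen for readability.
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, num_queries=32, depth=4):
        super().__init__()
        # Learnable query tokens, shared across all videos.
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=vis_dim, nhead=8, batch_first=True)
        # Decoder layers give the queries self-attention plus cross-attention
        # over the (frozen) visual tokens, mimicking a Q-former block.
        self.qformer = nn.TransformerDecoder(layer, num_layers=depth)
        # Linear projection into the frozen LLM's token-embedding dimension.
        self.to_llm = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_feats):
        # vis_feats: (batch, num_frames * num_patches, vis_dim) from a frozen encoder.
        q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        q = self.qformer(tgt=q, memory=vis_feats)  # queries attend to video tokens
        return self.to_llm(q)                      # (batch, num_queries, llm_dim)

if __name__ == "__main__":
    bridge = QFormerBridge()
    frozen_video_tokens = torch.randn(2, 8 * 256, 1024)  # stand-in for encoder output
    soft_prompts = bridge(frozen_video_tokens)
    print(soft_prompts.shape)  # (2, 32, 4096): prepend to the LLM's text embeddings
```

In this setup only the bridge (queries, decoder layers, and projection) would be trained, while the visual encoder and LLM remain frozen, which is the usual motivation for Q-former-style designs when paired training data is scarce.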
Authors: Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman