MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions (2407.20962v3)

Published 30 Jul 2024 in cs.CV, cs.MM, cs.SD, and eess.AS

Abstract: Massive multimodal datasets play a significant role in the success of large video-language models. However, current video-language datasets primarily provide text descriptions for visual frames, treating audio as weakly related information. They usually overlook the inherent audio-visual correlation, which leads to monotonous annotation within each modality rather than comprehensive and precise descriptions. This neglect makes many cross-modality studies difficult. To fill this gap, we present MMTrail, a large-scale multimodal video-language dataset comprising more than 20M trailer clips with visual captions and 2M high-quality clips with multimodal captions. Trailers preview full-length video works and integrate context, visual frames, and background music. In particular, trailers have two main advantages: (1) the topics are diverse and the content spans various types, e.g., film, news, and gaming; and (2) the background music is custom-designed, making it more coherent with the visual context. Building on these insights, we propose a systematic captioning framework that produces annotations for each modality over more than 27.1k hours of trailer videos. To ensure the caption retains the music perspective while preserving the authority of the visual context, we leverage an advanced LLM to merge all annotations adaptively. In this fashion, our MMTrail dataset potentially paves the way for fine-grained training of large multimodal language models. In experiments, we provide evaluation metrics and benchmark results on our dataset, demonstrating the high quality of our annotations and their effectiveness for model training.
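
The abstract's key pipeline step is LLM-based adaptive merging: per-modality annotations (visual caption, music caption) are fused into one multimodal caption, with the visual description kept authoritative. Below is a minimal, illustrative sketch of that merging step, not the authors' code: `merge_captions` and `call_llm` are hypothetical names, and the prompt wording is an assumption about how such a merge could be phrased.

```python
from typing import Callable

def merge_captions(
    visual_caption: str,
    music_caption: str,
    call_llm: Callable[[str], str],
) -> str:
    """Fuse per-modality annotations into one multimodal caption.

    Builds a prompt that treats the visual description as the authoritative
    context and folds in the music perspective, then delegates the actual
    merging to whatever text-completion backend `call_llm` wraps.
    """
    prompt = (
        "Merge the following annotations of one trailer clip into a single "
        "fluent multimodal caption. Treat the visual description as the "
        "authoritative context and weave the music description into it.\n"
        f"Visual: {visual_caption}\n"
        f"Music: {music_caption}\n"
        "Merged caption:"
    )
    return call_llm(prompt).strip()

if __name__ == "__main__":
    # Stub backend standing in for a real LLM call.
    stub = lambda p: "An aerial city skyline at dusk unfolds over tense, slow-building orchestral strings."
    print(merge_captions(
        "Aerial shot of a city skyline at dusk.",
        "Tense, slow-building orchestral strings.",
        stub,
    ))
```

In a real pipeline, `call_llm` would wrap the paper's chosen LLM, and additional annotation streams (e.g., speech or scene metadata) could be appended to the same prompt in the same pattern.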

Authors (19)
  1. Xiaowei Chi
  2. Yatian Wang
  3. Aosong Cheng
  4. Pengjun Fang
  5. Zeyue Tian
  6. Yingqing He
  7. Zhaoyang Liu
  8. Xingqun Qi
  9. Jiahao Pan
  10. Rongyu Zhang
  11. Mengfei Li
  12. Ruibin Yuan
  13. Yanbing Jiang
  14. Wei Xue
  15. Wenhan Luo
  16. Qifeng Chen
  17. Shanghang Zhang
  18. Qifeng Liu
  19. Yike Guo
