Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers (2402.19479v1)

Published 29 Feb 2024 in cs.CV

Abstract: The quality of the data and annotation upper-bounds the quality of a downstream model. While there exist large text corpora and image-text pairs, high-quality video-text data is much harder to collect. First of all, manual labeling is more time-consuming, as it requires an annotator to watch an entire video. Second, videos have a temporal dimension, consisting of several scenes stacked together, and showing multiple actions. Accordingly, to establish a video dataset with high-quality captions, we propose an automatic approach leveraging multimodal inputs, such as textual video description, subtitles, and individual video frames. Specifically, we curate 3.8M high-resolution videos from the publicly available HD-VILA-100M dataset. We then split them into semantically consistent video clips, and apply multiple cross-modality teacher models to obtain captions for each video. Next, we finetune a retrieval model on a small subset where the best caption of each video is manually selected and then employ the model in the whole dataset to select the best caption as the annotation. In this way, we get 70M videos paired with high-quality text captions. We dub the dataset as Panda-70M. We show the value of the proposed dataset on three downstream tasks: video captioning, video and text retrieval, and text-driven video generation. The models trained on the proposed data score substantially better on the majority of metrics across all the tasks.

An Analysis of Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

The paper introduces Panda-70M, a large-scale video-caption dataset that surpasses existing video-language datasets in both scale and annotation quality. By leveraging multimodal inputs, such as textual video descriptions, subtitles, and individual video frames, the proposed pipeline produces high-quality captions for roughly 70 million video clips, substantially narrowing the gap in available video-text data.

Methodology

The core innovation lies in an automated captioning pipeline built on multiple cross-modality teacher models. This automation allows captions to be produced for a very large number of videos while sidestepping the labor-intensive, time-consuming practice of manual annotation. The paper also emphasizes that, unlike typical video datasets annotated with Automatic Speech Recognition (ASR) transcripts, which are frequently misaligned with the visual content, Panda-70M derives its captions directly from the clips and their associated text, yielding a more robust annotation strategy. A minimal sketch of the multi-teacher captioning step is shown below.
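To make the multi-teacher idea concrete, the following minimal Python sketch shows how several teachers, each consuming a different mix of modalities, could each propose a caption for one clip. The `Clip` structure, the teacher functions, and their stand-in outputs are illustrative placeholders, not the specific pretrained models used in the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Clip:
    frames: list        # sampled RGB frames of one semantically consistent clip
    subtitles: str      # ASR transcript overlapping the clip (may be empty)
    description: str    # title/description metadata of the source video

# Placeholder teachers: in the actual pipeline these would be pretrained
# cross-modality models (e.g., image captioners, video QA models, text models).
def vision_teacher(frames: list) -> str:
    return "a person rides a bicycle down a street"       # stand-in output

def video_text_teacher(frames: list, prompt: str) -> str:
    return "a cyclist passes parked cars on a sunny day"  # stand-in output

def text_teacher(description: str, subtitles: str) -> str:
    return f"summary of: {description}"                   # stand-in output

def candidate_captions(clip: Clip) -> List[str]:
    """Collect one candidate caption per teacher; downstream, a retrieval
    model picks the candidate that best matches the clip."""
    return [
        vision_teacher(clip.frames),
        video_text_teacher(clip.frames, "Describe this video."),
        text_teacher(clip.description, clip.subtitles),
    ]
```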

The authors curated 3.8 million high-resolution videos from the publicly available HD-VILA-100M dataset and segmented them into semantically consistent video clips. To generate captions, multiple cross-modality teacher models were applied, producing a diverse set of candidate captions per clip. A retrieval model was then finetuned on a small, manually annotated subset and used to select, for every clip, the candidate that best aligns with the visual content. This methodology yields precise pairings of textual descriptions with video content; a sketch of the selection step follows.
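The selection step can be sketched as scoring each candidate with a video-text retrieval model and keeping the highest-scoring one. In the example below, cosine similarity between precomputed embeddings stands in for the finetuned retrieval model's score; the random embeddings and the `select_best_caption` helper are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def select_best_caption(clip_emb: np.ndarray,
                        caption_embs: np.ndarray,
                        candidates: list) -> str:
    """Return the candidate whose embedding has the highest cosine
    similarity with the clip embedding (the retrieval model's role)."""
    v = clip_emb / np.linalg.norm(clip_emb)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    return candidates[int(np.argmax(c @ v))]

# Toy usage with random vectors standing in for retrieval-model embeddings.
rng = np.random.default_rng(0)
clip_emb = rng.normal(size=256)
caption_embs = rng.normal(size=(3, 256))
print(select_best_caption(clip_emb, caption_embs,
                          ["caption A", "caption B", "caption C"]))
```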

Results and Implications

The dataset's efficacy is demonstrated on three primary tasks: video captioning, video and text retrieval, and text-driven video generation. Models trained on Panda-70M showed substantial performance improvements across multiple metrics. Importantly, the paper provides concrete numerical results, showcasing a marked increase in the accuracy and relevance of machine-generated captions and improved outcomes in related video-language tasks.
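For context on how the video-and-text retrieval task is typically scored, the sketch below computes text-to-video Recall@K from a similarity matrix, assuming query i is paired with video i. This is a generic illustration of the standard metric, not the paper's evaluation code.

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int = 1) -> float:
    """similarity[i, j]: score between text query i and video j; the
    ground-truth match for query i is video i (diagonal pairing)."""
    ranks = (-similarity).argsort(axis=1)   # best-scoring videos first
    hits = (ranks[:, :k] == np.arange(len(similarity))[:, None]).any(axis=1)
    return float(hits.mean())

# Example with random scores for 5 text-video pairs (illustration only).
sim = np.random.default_rng(0).random((5, 5))
print(recall_at_k(sim, k=1), recall_at_k(sim, k=5))
```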

The paper's contribution has potential implications both in practical applications and theoretical explorations. On a practical level, Panda-70M provides a valuable resource for training more accurate video analysis models that can be deployed in various real-world applications, such as video content moderation, automated video summaries, and improved accessibility features. Theoretically, this dataset opens avenues for further exploration into cross-modal learning strategies and the refinement of multimodal models, taking advantage of the rich annotations provided.

Future Prospects

While the dataset represents a significant enhancement to available resources within AI research, the authors note areas for further improvement. These include expanding the dataset beyond vocal-intensive content to a broader range of video types, potentially offering a more comprehensive training ground for models. Additionally, incorporating longer videos and denser captions could enhance its applicability to tasks that require understanding extended narratives or intricate video content.

The research undoubtedly advances the field by addressing the data bottleneck in video-language modeling, yet it also leaves open questions regarding the scalability of similar methods and the potential biases inherent in automated annotation processes. This paper lays foundational work for the continued expansion and refinement of video-text datasets, which is crucial for progressing toward more nuanced AI comprehension of multimodal data.

Authors (11)
  1. Tsai-Shien Chen (9 papers)
  2. Aliaksandr Siarohin (58 papers)
  3. Willi Menapace (33 papers)
  4. Ekaterina Deyneka (2 papers)
  5. Hsiang-wei Chao (1 paper)
  6. Byung Eun Jeon (1 paper)
  7. Yuwei Fang (31 papers)
  8. Hsin-Ying Lee (60 papers)
  9. Jian Ren (97 papers)
  10. Ming-Hsuan Yang (376 papers)
  11. Sergey Tulyakov (108 papers)