OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation (2407.02371v2)

Published 2 Jul 2024 in cs.CV

Abstract: Text-to-video (T2V) generation has recently garnered significant attention thanks to the large multi-modality model Sora. However, T2V generation still faces two important challenges: 1) the lack of a precise, open-sourced, high-quality dataset. Previous popular video datasets, e.g., WebVid-10M and Panda-70M, are either low in quality or too large for most research institutions. It is therefore challenging but crucial to collect precise, high-quality text-video pairs for T2V generation. 2) Underuse of textual information. Recent T2V methods have focused on vision transformers, using a simple cross-attention module for video generation, which falls short of thoroughly extracting semantic information from the text prompt. To address these issues, we introduce OpenVid-1M, a precise, high-quality dataset with expressive captions. This open-scenario dataset contains over 1 million text-video pairs, facilitating research on T2V generation. Furthermore, we curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M, advancing high-definition video generation. Additionally, we propose a novel Multi-modal Video Diffusion Transformer (MVDiT) capable of mining both structure information from visual tokens and semantic information from text tokens. Extensive experiments and ablation studies verify the superiority of OpenVid-1M over previous datasets and the effectiveness of our MVDiT.
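
The abstract contrasts MVDiT, which mines structure from visual tokens and semantics from text tokens, with prior methods that bolt a simple cross-attention module onto a vision transformer. As a rough illustration of that general pattern only (not the paper's actual MVDiT block; the class name, dimensions, and layer layout below are assumptions for the sketch), here is a minimal PyTorch block that attends over visual tokens for structure and cross-attends to text tokens for semantics:

```python
# Hedged sketch of a visual/text transformer block, assuming a standard
# pre-norm layout. Names and dimensions are illustrative, not MVDiT's code.
import torch
import torch.nn as nn

class MultiModalBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Self-attention over visual tokens captures spatio-temporal structure.
        self.visual_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-attention lets visual tokens query text tokens for semantics.
        self.text_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # visual: (batch, num_video_tokens, dim); text: (batch, num_text_tokens, dim)
        v = self.norm1(visual)
        visual = visual + self.visual_self_attn(v, v, v, need_weights=False)[0]
        v = self.norm2(visual)
        visual = visual + self.text_cross_attn(v, text, text, need_weights=False)[0]
        return visual + self.mlp(self.norm3(visual))

# Usage: 16 video tokens conditioned on 8 text tokens.
block = MultiModalBlock()
out = block(torch.randn(2, 16, 512), torch.randn(2, 8, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```

The sketch shows the baseline split the abstract criticizes: structure comes from self-attention over visual tokens, semantics from a single cross-attention into the text. MVDiT's contribution, per the abstract, is to mine both token streams more thoroughly than this simple pattern.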

Authors (9)
  1. Kepan Nan (5 papers)
  2. Rui Xie (59 papers)
  3. Penghao Zhou (6 papers)
  4. Tiehan Fan (3 papers)
  5. Zhenheng Yang (30 papers)
  6. Zhijie Chen (54 papers)
  7. Xiang Li (1002 papers)
  8. Jian Yang (503 papers)
  9. Ying Tai (88 papers)
Citations (26)