Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks (2306.04362v1)

Published 7 Jun 2023 in cs.CV and cs.CL

Abstract: To promote the development of Vision-Language Pre-training (VLP) and multimodal LLM in the Chinese community, we firstly release the largest public Chinese high-quality video-language dataset named Youku-mPLUG, which is collected from Youku, a well-known Chinese video-sharing website, with strict criteria of safety, diversity, and quality. Youku-mPLUG contains 10 million Chinese video-text pairs filtered from 400 million raw videos across a wide range of 45 diverse categories for large-scale pre-training. In addition, to facilitate a comprehensive evaluation of video-LLMs, we carefully build the largest human-annotated Chinese benchmarks covering three popular video-language tasks of cross-modal retrieval, video captioning, and video category classification. Youku-mPLUG can enable researchers to conduct more in-depth multimodal research and develop better applications in the future. Furthermore, we release popular video-language pre-training models, ALPRO and mPLUG-2, and our proposed modularized decoder-only model mPLUG-video pre-trained on Youku-mPLUG. Experiments show that models pre-trained on Youku-mPLUG gain up to 23.1% improvement in video category classification. Besides, mPLUG-video achieves a new state-of-the-art result on these benchmarks with 80.5% top-1 accuracy in video category classification and 68.9 CIDEr score in video captioning, respectively. Finally, we scale up mPLUG-video based on the frozen Bloomz with only 1.7% trainable parameters as Chinese multimodal LLM, and demonstrate impressive instruction and video understanding ability. The zero-shot instruction understanding experiment indicates that pretraining with Youku-mPLUG can enhance the ability to comprehend overall and detailed visual semantics, recognize scene text, and leverage open-domain knowledge.

Overview of the Youku-mPLUG Dataset and Model

This paper introduces the Youku-mPLUG project, which aims to advance multimodal large language modeling within the Chinese linguistic and cultural context using video-language data. The dataset, a substantial contribution to the field, comprises 10 million Chinese video-text pairs, making it a formidable resource for Chinese vision-language pre-training (VLP). Complemented by an extensive set of human-annotated benchmarks, this work marks a significant step toward closing the gap between the availability of Chinese and English video-language datasets.

Dataset Features

The Youku-mPLUG dataset is curated from the Chinese video-sharing platform Youku, employing stringent selection criteria to ensure quality, diversity, and safety. From an initial corpus of 400 million raw videos, the dataset was distilled down to 10 million high-quality video-text pairs. Videos span 45 categories across 20 super-categories, ensuring a balanced representation of various themes and subjects. Specific filters—covering video and text length, content integrity, and semantic quality—are employed, leveraging Chinese pre-trained models to ensure consistency and relevance.
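
To make the filtering step concrete, here is a minimal Python sketch of rule-based length filters in the spirit of the pipeline described above. The thresholds and field names are illustrative assumptions, not the authors' exact criteria, and the content-integrity and semantic-quality checks with Chinese pre-trained models are only indicated in a comment.

```python
# Minimal sketch of rule-based filtering; thresholds and fields are assumptions.
from dataclasses import dataclass

@dataclass
class Clip:
    video_seconds: float
    title: str
    category: str

MIN_SECONDS, MAX_SECONDS = 10, 120          # assumed video-length window
MIN_TITLE_CHARS, MAX_TITLE_CHARS = 5, 80    # assumed text-length window

def passes_basic_filters(clip: Clip) -> bool:
    """Length filters only; the real pipeline additionally checks content
    integrity and scores semantic quality with Chinese pre-trained models."""
    if not (MIN_SECONDS <= clip.video_seconds <= MAX_SECONDS):
        return False
    if not (MIN_TITLE_CHARS <= len(clip.title) <= MAX_TITLE_CHARS):
        return False
    return True

corpus = [Clip(45.0, "美食探店：老字号小笼包测评", "美食"),
          Clip(3.0, "短", "其他")]
filtered = [c for c in corpus if passes_basic_filters(c)]
print(len(filtered))  # -> 1
```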

Benchmark and Model Contributions

To evaluate video-LLMs comprehensively, the authors have introduced multiple benchmarks tailored to critical tasks: cross-modal retrieval, video captioning, and video category classification. The benchmarks set a new standard with their size and annotation quality in the Chinese language context.
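
As a rough illustration of how two of the benchmark tasks are scored, the sketch below computes text-to-video recall@k for retrieval and top-1 accuracy for category classification. The similarity matrix, logits, and labels are random toy placeholders, not benchmark data, and the captioning metric (CIDEr) is omitted here.

```python
# Toy metric computations for the retrieval and classification benchmarks.
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Text-to-video retrieval: row i is a query, column i is its ground truth."""
    ranks = (-sim).argsort(axis=1)                       # columns sorted by similarity
    hits = (ranks[:, :k] == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

def top1_accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Video category classification over the 45 categories."""
    return float((logits.argmax(axis=1) == labels).mean())

rng = np.random.default_rng(0)
sim = rng.random((8, 8))
print(recall_at_k(sim, k=5))
print(top1_accuracy(rng.random((8, 45)), rng.integers(0, 45, size=8)))
```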

In terms of modeling, several pre-trained models are released: ALPRO, mPLUG-2, and the newly proposed mPLUG-video. mPLUG-video, a modularized decoder-only architecture, is particularly noteworthy: it interfaces a video encoder with a pre-trained LLM decoder while keeping most parameters frozen (the scaled-up version built on frozen Bloomz trains only about 1.7% of parameters). This modular approach integrates visual data with text efficiently, improving performance with nominal computational overhead.
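
The PyTorch sketch below illustrates this modular idea under stated assumptions: a small trainable "visual abstractor" maps features from a frozen video encoder into the embedding space of a frozen decoder-only language model. The module names, layer sizes, and the Hugging Face-style `inputs_embeds` call are assumptions for illustration, not the released mPLUG-video implementation.

```python
# Sketch of a modular decoder-only video-language model: only the abstractor trains.
import torch
import torch.nn as nn

class VisualAbstractor(nn.Module):
    """Compresses per-frame patch features into a few learned query tokens."""
    def __init__(self, vis_dim=768, lm_dim=2048, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim))
        self.attn = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vis_dim, lm_dim)

    def forward(self, video_feats):                    # (B, T*P, vis_dim)
        q = self.queries.expand(video_feats.size(0), -1, -1)
        out, _ = self.attn(q, video_feats, video_feats)
        return self.proj(out)                          # (B, num_queries, lm_dim)

class VideoLM(nn.Module):
    def __init__(self, video_encoder, lm, lm_dim=2048):
        super().__init__()
        self.video_encoder = video_encoder.eval()      # frozen feature extractor
        self.lm = lm.eval()                            # frozen decoder (e.g. Bloomz-style)
        for p in list(self.video_encoder.parameters()) + list(self.lm.parameters()):
            p.requires_grad_(False)
        self.abstractor = VisualAbstractor(lm_dim=lm_dim)  # the only trainable module

    def forward(self, video, text_embeds):
        with torch.no_grad():
            feats = self.video_encoder(video)          # assumed (B, T*P, 768)
        visual_tokens = self.abstractor(feats)
        # Prepend visual tokens to the text embeddings and run the frozen decoder;
        # the inputs_embeds keyword assumes a Hugging Face-style causal LM interface.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.lm(inputs_embeds=inputs)
```

Because only the abstractor carries gradients, the trainable fraction shrinks as the frozen language model grows, which is the regime the paper reports when scaling up with frozen Bloomz.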

Experimental Evaluation

Substantial performance improvements are observed across all evaluation benchmarks. Notably, models pre-trained on Youku-mPLUG gain up to a 23.1% improvement in video category classification accuracy. Furthermore, mPLUG-video sets new state-of-the-art results on these benchmarks, with 80.5% top-1 accuracy in video category classification and a 68.9 CIDEr score in video captioning.

Zero-shot video instruction understanding tests reveal meaningful gains in visual semantic comprehension, scene text recognition, and the use of open-domain knowledge, showcasing Youku-mPLUG's robust impact on multimodal LLM capabilities.

Implications and Future Directions

The Youku-mPLUG project remedies the scarcity of high-quality public Chinese video-language datasets, thereby unlocking new research and application possibilities in the Chinese VLP community. By offering pre-trained models and evaluation frameworks for public use, the proposed work lays a foundation for significant exploration and technological development in Chinese language understanding through multimodal data.

These contributions herald potential applications in areas such as automated video summarization, content-based video retrieval, and enhanced natural language video interaction systems, directly enriching user experiences in digital and educational platforms.

Looking forward, the scalability and robustness in performance demonstrated through mPLUG-video and Youku-mPLUG provide fertile ground for further exploring modular architectures in handling dynamic, large-scale datasets. This direction could see more adaptive models that leverage varying amounts of visual and textual input, fine-tuned for specific domain constraints or user needs. There's also an inherent opportunity to expand the dataset with more nuanced cultural and subtitling data, stretching the use of video-LLMs to cater to increasingly complex narrative forms and styles.

Authors (16)
  1. Haiyang Xu (67 papers)
  2. Qinghao Ye (31 papers)
  3. Xuan Wu (59 papers)
  4. Ming Yan (190 papers)
  5. Yuan Miao (24 papers)
  6. Jiabo Ye (17 papers)
  7. Guohai Xu (21 papers)
  8. Anwen Hu (22 papers)
  9. Yaya Shi (13 papers)
  10. Guangwei Xu (18 papers)
  11. Chenliang Li (92 papers)
  12. Qi Qian (54 papers)
  13. Maofei Que (4 papers)
  14. Ji Zhang (176 papers)
  15. Xiao Zeng (13 papers)
  16. Fei Huang (408 papers)
Citations (14)