Overview of the Youku-mPLUG Dataset and Model
This paper introduces Youku-mPLUG, a project aimed at advancing multimodal large language models (LLMs) in the Chinese linguistic and cultural context through video-language data. The dataset comprises 10 million Chinese video-text pairs, making it a substantial resource for training Chinese video-language pre-training (VLP) models. Complemented by a suite of benchmarks, the work is a significant step toward closing the gap between the availability of Chinese and English video-language datasets.
Dataset Features
The Youku-mPLUG dataset is curated from the Chinese video-sharing platform Youku under strict selection criteria for quality, diversity, and safety. From an initial corpus of 400 million raw videos, it was distilled down to 10 million high-quality video-text pairs. The videos span 45 categories grouped into 20 super-categories, ensuring balanced coverage of themes and subjects. Filters on video and text length, content integrity, and semantic quality are applied, with Chinese pre-trained models used to ensure consistency and relevance.
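The paper does not publish its filtering code, so the snippet below is only a hypothetical sketch of how a rule-based pre-filtering stage of this kind might look; the thresholds and the `VideoSample`/`passes_filters` names are illustrative assumptions, and semantic-quality scoring with a Chinese pre-trained model would happen in a later stage.

```python
from dataclasses import dataclass

@dataclass
class VideoSample:
    video_id: str
    duration_s: float   # clip length in seconds
    title: str          # associated Chinese text
    category: str       # one of the 45 categories

def passes_filters(sample: VideoSample,
                   min_duration: float = 5.0,
                   max_duration: float = 600.0,
                   min_title_chars: int = 5,
                   max_title_chars: int = 80,
                   banned_terms: frozenset = frozenset()) -> bool:
    """Rule-based pre-filtering: length limits plus a crude content check.
    All thresholds here are placeholders, not the paper's actual values."""
    if not (min_duration <= sample.duration_s <= max_duration):
        return False
    if not (min_title_chars <= len(sample.title) <= max_title_chars):
        return False
    if any(term in sample.title for term in banned_terms):
        return False
    return True

# Keep only samples that survive the rule-based stage
raw_corpus = [VideoSample("v001", 42.0, "美食教程：红烧肉的做法", "美食")]
clean = [s for s in raw_corpus if passes_filters(s)]
```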
Benchmark and Model Contributions
To evaluate video-LLMs comprehensively, the authors have introduced multiple benchmarks tailored to critical tasks: cross-modal retrieval, video captioning, and video category classification. The benchmarks set a new standard with their size and annotation quality in the Chinese language context.
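For concreteness, the sketch below shows how the classification and retrieval benchmarks are typically scored; these are standard metric definitions (top-1 accuracy, text-to-video Recall@K), not code from the paper.

```python
import numpy as np

def top1_accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of videos whose highest-scoring category matches the label."""
    return float((logits.argmax(axis=1) == labels).mean())

def recall_at_k(sim: np.ndarray, k: int = 1) -> float:
    """Text-to-video Recall@K for a similarity matrix where sim[i, j] scores
    text i against video j and the ground-truth pairing lies on the diagonal."""
    ranks = (-sim).argsort(axis=1)   # best-matching videos first
    hits = (ranks[:, :k] == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 texts vs. 3 videos, correct pairs on the diagonal
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1],
                [0.2, 0.4, 0.7]])
print(recall_at_k(sim, k=1))  # 1.0: every text retrieves its own video
```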
On the modeling side, the authors pre-train two established video-language models, ALPRO and mPLUG-2, on the dataset and propose a new modularized decoder-only model, mPLUG-video. The latter is particularly noteworthy: it keeps the number of trainable parameters small by connecting a video encoder to a frozen pre-trained LLM decoder through a lightweight visual abstractor. This modular design integrates visual information with text efficiently, delivering strong performance with modest computational overhead.
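To make the modular design concrete, here is a simplified PyTorch approximation (not the released implementation): a video encoder yields frame features, a small visual abstractor compresses them into a handful of tokens, and those tokens are prepended to the text embeddings fed into a frozen LLM decoder. Class names, dimensions, and the number of query tokens are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VisualAbstractor(nn.Module):
    """Compresses per-frame patch features into a few query tokens via
    cross-attention; hyperparameters here are illustrative only."""
    def __init__(self, vis_dim: int, llm_dim: int, num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim))
        self.attn = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames * num_patches, vis_dim)
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        out, _ = self.attn(q, frame_feats, frame_feats)
        return self.proj(out)  # (batch, num_queries, llm_dim)

class MPlugVideoSketch(nn.Module):
    """Video encoder + visual abstractor feeding a frozen LLM decoder."""
    def __init__(self, video_encoder: nn.Module, llm: nn.Module,
                 vis_dim: int, llm_dim: int):
        super().__init__()
        self.video_encoder = video_encoder
        self.abstractor = VisualAbstractor(vis_dim, llm_dim)
        self.llm = llm
        for p in self.llm.parameters():
            p.requires_grad = False  # the LLM stays frozen; only the much
                                     # smaller visual components are trained

    def forward(self, video: torch.Tensor, text_embeds: torch.Tensor):
        frame_feats = self.video_encoder(video)        # (B, T*P, vis_dim)
        visual_tokens = self.abstractor(frame_feats)   # (B, Q, llm_dim)
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        # Assumes an LLM that accepts pre-computed embeddings, e.g. a
        # Hugging Face decoder-only model via `inputs_embeds`.
        return self.llm(inputs_embeds=inputs)
```

Keeping the language model frozen is what makes the approach economical: only the visual components receive parameter updates, so the bulk of the model's weights never needs optimizer state.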
Experimental Evaluation
Substantial performance improvements are observed across all evaluation benchmarks. Notably, models pre-trained on Youku-mPLUG gain up to 23.1% in video category classification accuracy. Furthermore, mPLUG-video sets new state-of-the-art results on these benchmarks, with 80.5% top-1 accuracy in video category classification and a 68.9 CIDEr score in video captioning.
Zero-shot video instruction understanding tests show clear gains in visual semantic comprehension, scene text recognition, and the use of open-domain knowledge, underscoring the impact of Youku-mPLUG pre-training on multimodal LLM capabilities.
Implications and Future Directions
The Youku-mPLUG project remedies the scarcity of high-quality public Chinese video-language datasets, thereby unlocking new research and application possibilities in the Chinese VLP community. By offering pre-trained models and evaluation frameworks for public use, the proposed work lays a foundation for significant exploration and technological development in Chinese language understanding through multimodal data.
These contributions herald potential applications in areas such as automated video summarization, content-based video retrieval, and enhanced natural language video interaction systems, directly enriching user experiences in digital and educational platforms.
Looking forward, the scalability and robust performance demonstrated by mPLUG-video and Youku-mPLUG make modular architectures a promising direction for handling dynamic, large-scale datasets. Future models could adaptively balance visual and textual input, fine-tuned for specific domain constraints or user needs. There is also a clear opportunity to extend the dataset with richer cultural and subtitle annotations, allowing video-LLMs to handle increasingly complex narrative forms and styles.