WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models (2308.10755v3)
Abstract: The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the development of large models, leading to the creation of numerous impressive large language models (LLMs) and multimodal LLMs (MLLMs). These cutting-edge models owe their remarkable performance to high-quality data. However, the details of the training data used in leading paradigms are often kept confidential. This lack of transparency, coupled with the scarcity of open-source data, impedes further progress within the community. In response, this paper presents "WanJuan", a large-scale multimodal dataset composed of both Chinese and English data collected from a wide range of web sources. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB. It was used to train InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations compared to models of a similar scale. All data can be accessed at https://opendatalab.org.cn/WanJuan1.0.
- Conghui He (114 papers)
- Zhenjiang Jin (7 papers)
- Chao Xu (283 papers)
- Jiantao Qiu (14 papers)
- Bin Wang (750 papers)
- Wei Li (1121 papers)
- Hang Yan (86 papers)
- Jiaqi Wang (218 papers)
- Dahua Lin (336 papers)