
WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models (2308.10755v3)

Published 21 Aug 2023 in cs.CL and cs.CV

Abstract: The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the development of large models, leading to the creation of numerous impressive large language models (LLMs) and multimodal LLMs (MLLMs). These cutting-edge models owe their remarkable performance to high-quality data. However, the details of the training data used in leading paradigms are often kept confidential. This lack of transparency, coupled with the scarcity of open-source data, impedes further developments within the community. In response, this paper presents "WanJuan", a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB. It was utilized in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale. All data can be accessed at https://opendatalab.org.cn/WanJuan1.0.
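The abstract describes a corpus spanning text, image-text, and video modalities. As a minimal sketch of how one might stream records from such a dataset, assuming the text portion is distributed as JSON Lines files (the field names "id" and "content" below are illustrative assumptions, not taken from the paper):

```python
import json
from typing import Iterable, Iterator

def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Illustrative records; actual WanJuan field names may differ.
sample = [
    '{"id": "0001", "content": "An example English text record."}',
    '{"id": "0002", "content": "An example Chinese text record."}',
]
records = list(iter_jsonl(sample))
print(len(records))  # 2
```

In practice the same iterator could wrap an open file handle instead of an in-memory list, which keeps memory use constant even for multi-terabyte corpora.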

Authors (9)
  1. Conghui He
  2. Zhenjiang Jin
  3. Chao Xu
  4. Jiantao Qiu
  5. Bin Wang
  6. Wei Li
  7. Hang Yan
  8. Jiaqi Wang
  9. Dahua Lin
Citations (30)