A Comprehensive Survey on Datasets for LLMs
Introduction
The exponential growth in the capabilities of LLMs has garnered significant attention from the research community. Central to this development are the diversity and quality of the datasets used for training, fine-tuning, and evaluating these models. This survey presents a thorough examination of datasets along four dimensions: pre-training corpora, instruction fine-tuning datasets, preference datasets, and evaluation datasets, providing an essential reference for researchers in the field.
Pre-training Corpora: Foundation for Language Understanding
Pre-training corpora serve as the foundational layer for LLMs, offering vast amounts of text data to learn from. This survey distinguishes between general pre-training corpora, which consist of mixed data from numerous domains, and domain-specific pre-training corpora tailored for particular fields. The notable characteristics of these datasets include their massive scale and diversity, which directly influence the models' generalization abilities and performance on downstream tasks.
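Because scale and diversity dominate, raw web-scale corpora are typically cleaned before use. As a minimal sketch (real pipelines also apply language identification, perplexity filtering, and fuzzy deduplication, none of which are shown here), a quality filter with exact-duplicate removal might look like:

```python
import hashlib

def clean_corpus(documents, min_length=200):
    """Keep documents that pass a length filter and are not exact
    duplicates of an earlier document. Illustrative only: the
    min_length threshold is an arbitrary assumption."""
    seen = set()
    kept = []
    for doc in documents:
        text = doc.strip()
        if len(text) < min_length:  # drop very short documents
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:  # drop exact duplicates
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```

Hashing each document keeps memory bounded relative to storing full texts, which matters at the corpus sizes the survey describes.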
Instruction Fine-tuning Datasets: Improving Model Responsiveness
Instruction fine-tuning datasets consist of instruction-response pairs that guide LLMs in understanding and executing specific instructions. These datasets are pivotal in enhancing a model's capacity to follow human commands accurately. They cover both general instructions not limited to any domain and domain-specific instructions for fields such as medicine, law, and education. Through fine-tuning on these datasets, models become more adept at task-specific operations, demonstrating improved adaptability and task performance.
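To make the pair structure concrete, here is a sketch of rendering one record into a training prompt and target. The field names `instruction`, `input`, and `output` are illustrative assumptions modeled on common open instruction datasets, not a schema the survey prescribes:

```python
def format_example(record):
    """Render one instruction-tuning record into a (prompt, target)
    pair. Field names are assumed, not a fixed standard."""
    if record.get("input"):  # optional extra context for the task
        prompt = (f"### Instruction:\n{record['instruction']}\n\n"
                  f"### Input:\n{record['input']}\n\n"
                  f"### Response:\n")
    else:
        prompt = (f"### Instruction:\n{record['instruction']}\n\n"
                  f"### Response:\n")
    return prompt, record["output"]
```

During fine-tuning, the loss is usually computed only on the target portion, so the model learns to produce responses rather than to reproduce instructions.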
Preference Datasets: Aligning Models with Human Judgment
Preference datasets are designed to align model outputs more closely with human preferences. These datasets contain multiple candidate responses to the same instruction, evaluated through human or model-generated feedback using methods such as voting, sorting, and scoring. The aim is to refine models' outputs to be more helpful, honest, and safe by human standards. Training with these datasets supports model alignment, a critical step in developing models that act in accord with human values and expectations.
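Pairwise comparisons from such datasets are commonly turned into a training signal via a Bradley-Terry-style objective, as used in reward modeling for RLHF. A minimal sketch of that loss for one chosen/rejected pair (a simplification; production systems batch this and compute scores with a learned model):

```python
import math

def pairwise_preference_loss(score_chosen, score_rejected):
    """Bradley-Terry pairwise loss:
    -log(sigmoid(score_chosen - score_rejected)).
    Lower loss means the scorer already ranks the human-preferred
    response above the rejected one."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the two scores are equal the loss is log 2 (about 0.693), and it shrinks toward zero as the chosen response is scored increasingly above the rejected one.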
Evaluation Datasets: Assessing Model Performance
Evaluation datasets play a crucial role in gauging model performance across a spectrum of tasks. The survey categorizes these datasets into domains such as general language understanding, reasoning, knowledge, and law. These datasets not only provide benchmarks for assessing the breadth of models' capabilities but also highlight areas requiring further improvement. By comparing model performance on these datasets, researchers can identify strengths and shortcomings, informing future developments and optimizations.
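For multiple-choice benchmarks, the comparison usually reduces to accuracy over gold answer keys. A minimal scoring sketch (the `answer` field name is an assumption for illustration, not a fixed benchmark schema):

```python
def accuracy(predictions, dataset):
    """Fraction of evaluation items where the model's predicted
    choice matches the gold answer key. Assumes predictions and
    dataset items are aligned by position."""
    correct = sum(1 for pred, item in zip(predictions, dataset)
                  if pred == item["answer"])
    return correct / len(dataset)
```

Per-domain accuracies computed this way are what allow the cross-model comparisons the survey tabulates, and they make capability gaps visible at the level of individual task categories.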
Conclusion
The landscape of LLM datasets is vast, encompassing pre-training, instruction fine-tuning, preference, and evaluation datasets, each serving distinct purposes in model development and assessment. This survey offers a comprehensive overview, presenting a structured analysis of datasets across multiple dimensions. By understanding the role and characteristics of these datasets, researchers can make informed choices in dataset selection and use, fueling advancements in LLM research and application.