
1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data

Published 7 Aug 2024 in cs.CL (arXiv:2408.03506v1)

Abstract: This paper presents a compute-efficient approach to pre-training an LLM, the "1.5-Pints", in only 9 days, while outperforming state-of-the-art models as an instruction-following assistant. Based on MT-Bench (a benchmark that emulates human judgments), 1.5-Pints outperforms Apple's OpenELM and Microsoft's Phi. This is achieved by a carefully curated pre-training dataset of 57 billion tokens, using a mix of automated workflows and manual human review. The selection of the dataset prioritizes content that is considered expository and "textbook-like" to aid the model in reasoning and logical deduction, culminating in its overall ability as a strong and versatile AI model. In terms of the model architecture, we employed a modified Mistral tokenizer, alongside a Llama-2 architecture for wider compatibility. For training, we adopted the methodologies used by StableLM, TinyLlama, and Huggingface Zephyr. 1.5-Pints demonstrates that by focusing on data quality over quantity in LLM training, we can significantly reduce training time and resources required. We believe this approach will not only make pre-training more accessible but also reduce our carbon footprint. Our findings and resources from this research are open-sourced, aiming to facilitate further advancements in the field. The 1.5-Pints model is available in two versions: 2K and 16K context windows.


Summary

  • The paper demonstrates that a carefully curated dataset can drastically reduce pre-training time while achieving SOTA benchmarks.
  • The model uses 57 billion tokens from high-quality sources to enhance reasoning and logical deduction capabilities.
  • A 1.57B parameter model based on Llama-2, with an optimized tokenizer, was pre-trained in just 9 days on 8 A100 GPUs.

A Compute-Efficient Approach to Pre-Train LLMs: A Detailed Examination of "1.5-Pints Technical Report"

Introduction

This essay provides a comprehensive overview of the paper "1.5-Pints Technical Report: Pretraining in Days, Not Months — Your LLM Thrives on Quality Data." Authored by Calvin Tan and Jerome Wang from Pints.ai Labs, this report introduces an LLM pre-training method termed "1.5-Pints," which emphasizes the importance of data quality over data quantity. Notably, the 1.5-Pints model achieves state-of-the-art (SOTA) performance on MT-Bench, surpassing models like Apple's OpenELM and Microsoft's Phi while requiring significantly fewer computational resources.

Core Contributions

The primary thesis of the paper is that a carefully curated dataset can outperform voluminous but lower-quality data in both training efficiency and model performance.

Data Quality Over Quantity

The authors curated a pre-training dataset comprising 57 billion tokens from expository and "textbook-like" sources, including research papers, copyright-free books, parliamentary debates, and synthetically generated content. This focus on expository prose aids the model in reasoning and logical deduction. The dataset was compiled through automated workflows combined with manual human review to ensure data quality.
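The paper summary does not detail the specific filtering criteria used in the automated workflows, but a curation pipeline of this kind can be sketched as a scoring function over candidate documents. The heuristics and thresholds below are purely illustrative assumptions, not the authors' actual criteria:

```python
import re

def expository_score(text: str) -> float:
    """Score a document on rough "textbook-like" heuristics.

    All features and thresholds here are illustrative assumptions,
    not the paper's actual filtering criteria.
    """
    words = text.split()
    if len(words) < 20:  # too short to judge
        return 0.0
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    avg_sentence_len = len(words) / max(len(sentences), 1)
    # Expository prose tends toward moderate sentence length.
    length_ok = 1.0 if 10 <= avg_sentence_len <= 35 else 0.5
    # Penalize boilerplate-heavy text (links, markup remnants).
    junk_ratio = sum(1 for w in words if "http" in w) / len(words)
    return length_ok * (1.0 - junk_ratio)

def keep(text: str, threshold: float = 0.8) -> bool:
    """Automated pass; borderline documents would go to human review."""
    return expository_score(text) >= threshold
```

In a real pipeline, scores like these would gate documents into keep, discard, or manual-review buckets, mirroring the paper's mix of automation and human review.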

Training Efficiency

One of the standout features of the 1.5-Pints approach is its remarkable training efficiency. The model, consisting of 1.57 billion parameters, was pre-trained in just 9 days using 8 A100 GPUs, contrasting sharply with typical training paradigms that span months. This approach drastically reduces both time and energy costs, rendering SOTA LLMs more accessible and environmentally friendly.
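The compute budget implied by the figures above can be checked with simple arithmetic. The 90-day baseline below is a hypothetical months-long run for comparison, not a number from the paper:

```python
# Back-of-the-envelope compute cost for the reported training run:
# 9 days on 8 A100 GPUs (figures taken from the summary above).
days = 9
gpus = 8
gpu_hours = days * 24 * gpus
print(gpu_hours)  # 1728 GPU-hours

# Hypothetical months-long run on the same hardware, for scale.
baseline_days = 90
baseline_gpu_hours = baseline_days * 24 * gpus
print(gpu_hours / baseline_gpu_hours)  # 0.1 -- a 10x reduction
```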

Model Architecture and Techniques

The model uses the Llama-2 architecture with enhancements, including an enlarged Multi-Layer Perceptron (MLP) layer. It also employs a modified Mistral tokenizer, which demonstrates superior token compression (approximately 3.61% fewer tokens than Llama-2's tokenizer), a critical factor for efficient training. Supervised fine-tuning and Direct Preference Optimization (DPO) were then applied to align the model with human preferences.
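DPO aligns a model on preference pairs directly, without training a separate reward model: it pushes the policy to assign higher likelihood to the chosen response than to the rejected one, relative to a frozen reference model. A minimal per-pair sketch of the standard DPO loss (not the authors' training code; log-probabilities are taken as plain floats for illustration):

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Direct Preference Optimization loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the trained policy and a frozen reference model.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) computed stably as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))
```

The loss shrinks as the policy favors the chosen response over the rejected one more strongly than the reference does; `beta` controls how far the policy may drift from the reference.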

Results

The effectiveness of this approach is quantitatively demonstrated using the MT-Bench benchmark. The 1.5-Pints model achieved a score of 3.73 for its 2K context window variant and 3.40 for the 16K context window variant. In comparison, OpenELM-1B-Instruct scored 3.34, and Microsoft's Phi-1.5 scored 3.33. These results underscore that the 1.5-Pints model can outperform competitors while using significantly fewer pre-training tokens.
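The margin over the strongest baseline can be made explicit with the scores reported above:

```python
# MT-Bench scores reported in the summary above.
pints_2k = 3.73
openelm_1b = 3.34
phi_1_5 = 3.33

margin_vs_openelm = (pints_2k - openelm_1b) / openelm_1b
print(f"{margin_vs_openelm:.1%}")  # 11.7% higher than OpenELM-1B-Instruct
```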

Implications and Future Directions

The implications of this research span both practical and theoretical domains. Practically, reducing the training time and resource requirements for LLMs democratizes access to high-performing models, facilitating broader adoption of AI technologies. It also contributes to environmental sustainability by minimizing the carbon footprint associated with extensive AI training cycles.

Theoretically, this work reinforces the importance of data curation in machine learning. It suggests a promising direction where smaller, high-quality datasets can achieve better performance than larger, less-filtered corpora. Future research could explore further optimizations in data curation, leveraging techniques like Retrieval-Augmented Generation (RAG) and Knowledge Graph integration to refine the quality of pre-training datasets continually.

Conclusion

This paper exemplifies the crucial role of data quality in the pre-training of LLMs. Through the 1.5-Pints model, Calvin Tan and Jerome Wang have set a new benchmark in resource-efficient AI development. Their approach reduces training time dramatically and achieves state-of-the-art performance, laying the groundwork for future innovations that could further enhance the accessibility and efficacy of AI models. By open-sourcing their findings, the authors encourage ongoing advancements and collaborations within the research community.
