To Code, or Not To Code? Exploring Impact of Code in Pre-training (2408.10914v1)

Published 20 Aug 2024 in cs.CL

Abstract: Including code in the pre-training data mixture, even for models not specifically designed for code, has become a common practice in LLM pre-training. While there is anecdotal consensus among practitioners that code data plays a vital role in general LLM performance, only limited work has analyzed the precise impact of code on non-code tasks. In this work, we systematically investigate the impact of code data on general performance. We ask: "what is the impact of code data used in pre-training on a large variety of downstream tasks beyond code generation?" We conduct extensive ablations and evaluate across a broad range of natural language reasoning tasks, world knowledge tasks, code benchmarks, and LLM-as-a-judge win-rates for models ranging from 470M to 2.8B parameters. Across settings, we find consistent results that code is a critical building block for generalization far beyond coding tasks, and that improvements to code quality have an outsized impact across all tasks. In particular, compared to text-only pre-training, the addition of code results in up to a relative increase of 8.2% in natural language (NL) reasoning, 4.2% in world knowledge, a 6.6% improvement in generative win-rates, and a 12x boost in code performance. Our work suggests that investments in code quality and preserving code during pre-training have positive impacts.

Summary

  • The paper finds that adding code data to LLM pre-training boosts performance on non-code tasks, with relative gains of up to 8.2% in natural language reasoning and a 12x improvement in code performance.
  • It shows that initializing models with a balanced code-text mixture outperforms text-only baselines, elevating both world knowledge and generative quality.
  • The study identifies a 25% code share as the best-performing pre-training ratio for natural language reasoning and highlights the outsized impact of high-quality synthetic code, informing pre-training recipes for versatile LLMs.

Overview of "To Code, or Not To Code? Exploring Impact of Code in Pre-training"

The paper "To Code, or Not To Code? Exploring Impact of Code in Pre-training" by Viraat Aryabumi et al. provides a thorough investigation of the role of code in the pre-training mixtures of LLMs, specifically focusing on its impact on downstream tasks that extend beyond code generation. The paper is predicated on anecdotal consensus among LLM practitioners about the importance of code data for improving general performance, but it aims to systematically and empirically analyze this impact across various tasks and model sizes.

Key Findings and Contributions

The authors address multiple aspects of code data utilization in pre-training through a series of well-defined, large-scale experiments. These examine initialization strategies, varying proportions of code, the quality and properties of code datasets, and the introduction of code during the pre-training cooldown phase. The results confirm several significant points:

  1. Importance of Code in Pre-training:
    • The inclusion of code data significantly boosts performance in non-code tasks. The best model configuration, balanced→text followed by cooldown with code, outperformed the text-only pre-training baseline with a relative increase of 8.2% in natural language (NL) reasoning, 4.2% in world knowledge, a 6.6% improvement in generative win-rates, and a 12x boost in code performance.
  2. Impact of Initialization:
    • Models initialized from code-pretrained checkpoints (code→text and balanced→text) generally outperformed those whose initialization included no code, underlining the benefit of starting from mixed pre-training even when the focus is on non-code tasks.
  3. Optimal Proportion of Code:
    • The paper found that a proportion of 25% code data in the pre-training mix maximized performance across NL reasoning tasks, with higher proportions improving code generation performance linearly but potentially degrading world knowledge task performance.
  4. Quality and Type of Code Data:
    • Including high-quality, synthetically generated code data even in small proportions had a significant positive impact. The authors report a 9% improvement in NL reasoning and a 44.9% increase in code performance when synthetic data was included.
  5. Role of Cooldown Phase:
    • Enhancements were observed when code data was included in the cooldown phase, a stage in which high-quality datasets are up-weighted, with improvements of 3.6% in NL reasoning, 10.1% in world knowledge, and 20% in code performance relative to the model before the cooldown phase (a sketch of such a phase-dependent data mixture follows this list).
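
The following is a minimal sketch of how such a phase-dependent data mixture could be expressed in code. Only the roughly 25% code share for the main pre-training phase comes from the paper; the source names, the cooldown proportions, and the `sample_source` helper are illustrative assumptions rather than the authors' actual configuration.

```python
import random

# Hypothetical sampling weights per training phase. The ~25% total code share
# in the main phase follows the paper's finding; the cooldown up-weighting of
# high-quality and synthetic code is an illustrative assumption.
PHASE_MIXTURES = {
    "pretrain": {"text": 0.75, "web_code": 0.20, "synthetic_code": 0.05},
    "cooldown": {"text": 0.50, "web_code": 0.25, "synthetic_code": 0.25},
}

def sample_source(phase: str, rng: random.Random) -> str:
    """Pick the data source for the next training batch according to the phase mixture."""
    sources, weights = zip(*PHASE_MIXTURES[phase].items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source("pretrain", rng) for _ in range(5)])
```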

Methodology and Evaluation

The experimental framework of this paper is robust, involving models ranging from 470M to 2.8B parameters, evaluated across a wide spectrum of tasks including natural language reasoning, world knowledge, code benchmarks, and generative quality assessed via LLM-as-a-judge win-rates.

  1. Pre-training Data:
    • The authors used a variety of code data sources including web-based code, markup-style data, synthetic code data, and code-adjacent datasets. Text data was drawn from the SlimPajama dataset, excluding code-related documents to isolate the effect of code data.
  2. Training Strategy:
    • Models were subjected to continued pre-training and a dedicated cooldown phase, with controlled variations in learning-rate schedules and data weightings to assess the specific contribution of each stage (a simplified schedule sketch follows this list).
  3. Evaluation Suite:
    • The evaluation suite consisted of benchmarks that tested world knowledge (e.g., TriviaQA, Natural Questions Open), NL reasoning (e.g., BoolQ, PiQA, HellaSwag), and code performance (e.g., HumanEval-Python, MBPP). Generative performance was also assessed using win-rates from LLM-as-a-judge evaluations.
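
As a rough illustration of such a two-stage schedule, the sketch below combines a linear warmup, a cosine decay over the main pre-training run, and a linear cooldown anneal; during the cooldown the data mixture would simultaneously up-weight high-quality sources, as described above. The step counts, peak learning rate, and schedule shapes here are assumptions for illustration; the paper only states that learning-rate schedules and data weightings were varied across stages.

```python
import math

def learning_rate(step: int,
                  warmup_steps: int = 2_000,
                  pretrain_steps: int = 100_000,
                  cooldown_steps: int = 10_000,
                  peak_lr: float = 3e-4,
                  cooldown_start_lr: float = 3e-5) -> float:
    """Hypothetical two-stage schedule: linear warmup, cosine decay during the
    main pre-training run, then a linear cooldown anneal toward zero."""
    if step < warmup_steps:                    # linear warmup
        return peak_lr * step / warmup_steps
    if step < pretrain_steps:                  # cosine decay down to cooldown_start_lr
        progress = (step - warmup_steps) / (pretrain_steps - warmup_steps)
        return cooldown_start_lr + 0.5 * (peak_lr - cooldown_start_lr) * (1 + math.cos(math.pi * progress))
    # Cooldown phase: anneal linearly from cooldown_start_lr to zero.
    progress = min(1.0, (step - pretrain_steps) / cooldown_steps)
    return cooldown_start_lr * (1.0 - progress)

for s in (0, 1_000, 50_000, 100_000, 105_000, 110_000):
    print(s, f"{learning_rate(s):.2e}")
```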

Implications and Future Directions

The outcomes of this paper underscore the significant role of code in pre-training mixtures beyond the domain of code generation. This not only suggests a reassessment of current pre-training data compositions but also points to strategic investments in quality-controlled and synthetic code data.

From a practical perspective, the insights from this work could guide the design of more versatile LLMs that are capable of excelling across diverse tasks, including those requiring sophisticated reasoning and general knowledge. Future research could expand on these insights by exploring larger model scales, investigating the role of code in safety and ethical considerations, and dynamically adjusting the proportion of code data during different pre-training phases.

Conclusion

This paper provides a comprehensive and empirical basis for the inclusion of code data in LLM pre-training, highlighting its multifaceted benefits and offering pragmatic guidelines on optimizing pre-training recipes. The thorough experimental methodology and extensive evaluation strengthen the credibility of the findings, which collectively advance our understanding of the critical role of code in augmenting LLM performance.
