- The paper establishes that adding code data during pretraining boosts large language model (LLM) performance on structured tasks such as compositional generalization and arithmetic operations.
- The study employs competitive and additive pretraining settings to reveal nuanced trade-offs in performance as code proportions vary.
- The results highlight that while code pretraining specializes LLMs for structured outputs, it can degrade their performance on natural language understanding and real-world knowledge tasks.
Introduction
The paper "How Does Code Pretraining Affect LLM Task Performance?" presents a detailed investigation into the impacts of incorporating code into the pretraining datasets of LLMs. The paper aims to establish a causal link between the inclusion of code data and the performance of LLMs on diverse downstream tasks.
Methodology
The authors employ two distinct pretraining settings: competitive and additive. In the competitive setting, the total volume of training data is held constant while the proportions of code and natural language are varied, so code displaces natural language. In the additive setting, the volume of natural language data is fixed and code data is added on top, increasing the total volume of training data. This design allows the authors to analyze the marginal utility of code pretraining under different conditions, as sketched below.
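A minimal sketch of how the two settings determine corpus composition, assuming hypothetical token budgets and taking the code fraction to be the share of code in the final mixture (the paper does not specify these exact quantities):

```python
# Illustrative sketch (not the authors' code): how the competitive and additive
# settings change corpus composition for a given code fraction.
# TOTAL_TOKENS and NL_TOKENS are hypothetical budgets chosen for illustration.

TOTAL_TOKENS = 10_000_000_000  # fixed total budget in the competitive setting
NL_TOKENS = 10_000_000_000     # fixed natural-language budget in the additive setting

def competitive_mix(code_fraction: float) -> dict:
    """Total volume is constant; code displaces natural language."""
    return {
        "code_tokens": int(TOTAL_TOKENS * code_fraction),
        "nl_tokens": int(TOTAL_TOKENS * (1 - code_fraction)),
    }

def additive_mix(code_fraction: float) -> dict:
    """Natural-language volume is constant; enough code is added on top
    for it to make up `code_fraction` of the enlarged corpus."""
    code_tokens = int(NL_TOKENS * code_fraction / (1 - code_fraction))
    return {"code_tokens": code_tokens, "nl_tokens": NL_TOKENS}

print(competitive_mix(0.3))  # 3B code + 7B natural language = 10B total
print(additive_mix(0.3))     # ~4.3B code + 10B natural language ≈ 14.3B total
```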
Experimental Setup
The paper uses 12-layer decoder-only transformer models, each with approximately 374 million parameters. These models are pretrained on mixed datasets consisting of English text from the Colossal Clean Crawled Corpus (C4) and cleaned code from GitHub, with code proportions ranging from 0% to 90% in the competitive setting and 0% to 50% in the additive setting. The performance of these models is evaluated on a range of tasks, including compositional generalization benchmarks (COGS, COGS-vf, English Passivization) and non-code tasks from the BigBench benchmark.
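To make the scale concrete, here is a rough parameter-count sketch for a decoder-only transformer. Only the layer count (12) and total size (~374M) come from the summary above; the hidden size, feed-forward multiplier, and vocabulary size are hypothetical choices that merely land in the same ballpark:

```python
# Approximate parameter count for a decoder-only transformer, ignoring biases,
# layer norms, and positional embeddings. The specific d_model and vocab_size
# below are assumptions, not values reported in the paper.

def transformer_params(n_layers: int, d_model: int, vocab_size: int,
                       d_ff_mult: int = 4) -> int:
    attn = 4 * d_model * d_model               # Q, K, V, and output projections
    ffn = 2 * d_model * (d_ff_mult * d_model)  # two feed-forward projections
    embeddings = vocab_size * d_model          # token embedding matrix
    return n_layers * (attn + ffn) + embeddings

# One hypothetical configuration in the vicinity of ~374M parameters:
print(transformer_params(n_layers=12, d_model=1536, vocab_size=32_000))  # ≈ 389M
```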
Results
Compositional Generalization
The findings indicate that higher proportions of code in the pretraining dataset can improve performance on tasks with structured output domains. For instance, accuracy on the generalization sets of COGS and COGS-vf—both of which involve transforming natural-language inputs into formal semantic representations—improved with increased code mixture. This effect was particularly pronounced in the structural generalization examples from COGS-vf. However, the influence of code pretraining on the well-formedness of outputs was minimal, suggesting that the improvements are likely due to better argument distribution generalization rather than syntactic well-formedness.
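For intuition, the sketch below shows what COGS-style targets look like; it is a paraphrased approximation of the public benchmark formats rather than an item taken from the paper, and the exact tokenization is assumed:

```python
# Hypothetical COGS-style example: the model must map a natural-language
# sentence to a formal semantic representation with rigid, code-like syntax.

example = {
    "input": "A cat ate a cake .",
    # COGS-style logical form: role predicates over indexed variables.
    "cogs_target": "cat ( x _ 1 ) AND eat . agent ( x _ 2 , x _ 1 ) "
                   "AND eat . theme ( x _ 2 , x _ 4 ) AND cake ( x _ 4 )",
    # COGS-vf uses a variable-free, nested function-call style.
    "cogs_vf_target": "eat ( agent = cat , theme = cake )",
}
```

Producing such outputs requires balancing parentheses and tracking argument roles, which is plausibly where exposure to code helps.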
Arithmetic Tasks
Interestingly, the models also performed better on multi-digit arithmetic tasks as the proportion of code increased. The trend was broadly positive in both settings, although in the competitive setting performance peaked and then declined once the proportion of code exceeded 50%.
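As a concrete illustration, a multi-digit arithmetic evaluation can be sketched as below; the prompt template, digit range, and the `model_generate` interface are assumptions rather than details from the paper:

```python
# Minimal exact-match evaluation for multi-digit addition.
import random

def make_item(n_digits: int) -> tuple[str, str]:
    """Build one prompt/answer pair with two n-digit operands."""
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    a, b = random.randint(lo, hi), random.randint(lo, hi)
    return f"What is {a} plus {b}?", str(a + b)

def exact_match_accuracy(model_generate, n_digits: int, n_items: int = 100) -> float:
    """Fraction of items whose generated answer matches the reference exactly."""
    correct = 0
    for _ in range(n_items):
        prompt, answer = make_item(n_digits)
        correct += model_generate(prompt).strip() == answer
    return correct / n_items
```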
Linguistic and Real-World Knowledge Tasks
Conversely, increased exposure to code tended to hurt performance on tasks requiring purely linguistic or real-world knowledge, such as English Passivization and BigBench's Common Morpheme, Fantasy Reasoning, General Knowledge, and Implicatures tasks. The negative impact was evident in both the competitive and additive settings, indicating a trade-off between code-related improvements and performance on natural language understanding tasks.
To quantify the overall impact of code pretraining, the authors performed permutation tests on the slopes derived from best-fit lines of task performance versus code mixture. The results showed a statistically significant increase in performance variance and upper-quartile performance on multiple-choice tasks in the competitive setting. This suggests that while code pretraining improves performance on some tasks, it also increases the overall variability in performance, potentially making models more specialized but less generalizable.
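The statistical analysis can be pictured with the following sketch of a permutation test on fitted slopes; the number of permutations and the two-sided test statistic are assumptions, since the summary does not specify them:

```python
# Permutation test on the slope of task performance vs. code fraction.
import numpy as np

def fitted_slope(code_fracs: np.ndarray, scores: np.ndarray) -> float:
    """Slope of the ordinary-least-squares best-fit line."""
    return np.polyfit(code_fracs, scores, deg=1)[0]

def permutation_pvalue(code_fracs, scores, n_perm: int = 10_000, seed: int = 0) -> float:
    """How often randomly re-pairing scores with code fractions yields a slope
    at least as steep (in absolute value) as the observed one."""
    rng = np.random.default_rng(seed)
    observed = abs(fitted_slope(code_fracs, scores))
    null = np.array([
        abs(fitted_slope(code_fracs, rng.permutation(scores)))
        for _ in range(n_perm)
    ])
    return float(np.mean(null >= observed))
```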
Discussion
The paper provides important insights into the benefits and trade-offs of incorporating code into the pretraining datasets of LLMs. The observed improvements in compositional generalization and arithmetic tasks suggest that the structured nature of code can enhance a model's ability to handle formally structured tasks. However, the adverse effects on linguistic and real-world knowledge tasks highlight the need for careful consideration when balancing code and natural language data in pretraining corpora.
Implications and Future Research
These findings have several practical and theoretical implications. For developers, the results indicate that incorporating code into the pretraining corpus can be beneficial for specific applications, particularly those involving structured data or arithmetic. Theoretically, the paper raises questions about the inductive biases introduced by code pretraining and how these might interact with other forms of data.
Future research could explore larger models to see whether these trends hold at greater scale, investigate the impact of code from different programming languages, and examine how varying code proportions affect long-term performance in dynamic, real-world applications.
Conclusion
The paper provides a rigorous analysis of the impacts of code pretraining on LLMs. The findings contribute to a nuanced understanding of how code influences model capabilities, offering valuable insights for both researchers and practitioners in the field of AI.