- The paper establishes that adding code data during pretraining boosts large language model (LLM) performance on structured tasks such as compositional generalization and arithmetic operations.
- The study employs competitive and additive pretraining settings to reveal nuanced trade-offs in performance as code proportions vary.
- The results highlight that while code pretraining specializes LLMs for structured outputs, it can degrade their performance on natural language understanding and real-world knowledge tasks.
Introduction
The paper "How Does Code Pretraining Affect LLM Task Performance?" presents a detailed investigation into the impacts of incorporating code into the pretraining datasets of LLMs. The paper aims to establish a causal link between the inclusion of code data and the performance of LLMs on diverse downstream tasks.
Methodology
The authors employ two distinct pretraining settings: competitive and additive. In the competitive setting, the total volume of training data is held constant while the proportions of code and natural language are varied, so code displaces natural language. In the additive setting, the volume of natural language data is fixed and code data is added on top, increasing the total volume of training data. This design allows the authors to analyze the marginal utility of code pretraining under different conditions, as sketched below.
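A minimal sketch of how the two settings determine corpus composition, assuming hypothetical token budgets and taking the code fraction to be the share of code in the final mixture (the paper does not specify these exact quantities):

```python
# Illustrative sketch (not the authors' code): how the competitive and additive
# settings change corpus composition for a given code fraction.
# TOTAL_TOKENS and NL_TOKENS are hypothetical budgets chosen for illustration.

TOTAL_TOKENS = 10_000_000_000  # fixed total budget in the competitive setting
NL_TOKENS = 10_000_000_000     # fixed natural-language budget in the additive setting

def competitive_mix(code_fraction: float) -> dict:
    """Total volume is constant; code displaces natural language."""
    return {
        "code_tokens": int(TOTAL_TOKENS * code_fraction),
        "nl_tokens": int(TOTAL_TOKENS * (1 - code_fraction)),
    }

def additive_mix(code_fraction: float) -> dict:
    """Natural-language volume is constant; enough code is added on top
    for it to make up `code_fraction` of the enlarged corpus."""
    code_tokens = int(NL_TOKENS * code_fraction / (1 - code_fraction))
    return {"code_tokens": code_tokens, "nl_tokens": NL_TOKENS}

print(competitive_mix(0.3))  # 3B code + 7B natural language = 10B total
print(additive_mix(0.3))     # ~4.3B code + 10B natural language ≈ 14.3B total
```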
Experimental Setup
The paper uses 12-layer decoder-only transformer models, each with approximately 374 million parameters. These models are pretrained on mixed datasets consisting of English text from the Colossal Clean Crawled Corpus (C4) and cleaned code from GitHub, with code proportions ranging from 0% to 90% in the competitive setting and 0% to 50% in the additive setting. The performance of these models is evaluated on a range of tasks, including compositional generalization benchmarks (COGS, COGS-vf, English Passivization) and non-code tasks from the BigBench benchmark.
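To make the scale concrete, here is a rough parameter-count sketch for a decoder-only transformer. Only the layer count (12) and total size (~374M) come from the summary above; the hidden size, feed-forward multiplier, and vocabulary size are hypothetical choices that merely land in the same ballpark:

```python
# Approximate parameter count for a decoder-only transformer, ignoring biases,
# layer norms, and positional embeddings. The specific d_model and vocab_size
# below are assumptions, not values reported in the paper.

def transformer_params(n_layers: int, d_model: int, vocab_size: int,
                       d_ff_mult: int = 4) -> int:
    attn = 4 * d_model * d_model               # Q, K, V, and output projections
    ffn = 2 * d_model * (d_ff_mult * d_model)  # two feed-forward projections
    embeddings = vocab_size * d_model          # token embedding matrix
    return n_layers * (attn + ffn) + embeddings

# One hypothetical configuration in the vicinity of ~374M parameters:
print(transformer_params(n_layers=12, d_model=1536, vocab_size=32_000))  # ≈ 389M
```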
Results
Compositional Generalization
The findings indicate that higher proportions of code in the pretraining dataset can improve performance on tasks with structured output domains. For instance, accuracy on the generalization sets of COGS and COGS-vf—both of which involve transforming natural-language inputs into formal semantic representations—improved with increased code mixture. This effect was particularly pronounced in the structural generalization examples from COGS-vf. However, the influence of code pretraining on the well-formedness of outputs was minimal, suggesting that the improvements are likely due to better argument distribution generalization rather than syntactic well-formedness.
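For intuition, the sketch below shows what COGS-style targets look like; it is a paraphrased approximation of the public benchmark formats rather than an item taken from the paper, and the exact tokenization is assumed:

```python
# Hypothetical COGS-style example: the model must map a natural-language
# sentence to a formal semantic representation with rigid, code-like syntax.

example = {
    "input": "A cat ate a cake .",
    # COGS-style logical form: role predicates over indexed variables.
    "cogs_target": "cat ( x _ 1 ) AND eat . agent ( x _ 2 , x _ 1 ) "
                   "AND eat . theme ( x _ 2 , x _ 4 ) AND cake ( x _ 4 )",
    # COGS-vf uses a variable-free, nested function-call style.
    "cogs_vf_target": "eat ( agent = cat , theme = cake )",
}
```

Producing such outputs requires balancing parentheses and tracking argument roles, which is plausibly where exposure to code helps.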
Arithmetic Tasks
Interestingly, the models also performed better on multi-digit arithmetic tasks as the proportion of code increased. The trend was broadly positive in both settings, although in the competitive setting performance peaked and then declined once the proportion of code exceeded 50%.
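As a concrete illustration, a multi-digit arithmetic evaluation can be sketched as below; the prompt template, digit range, and the `model_generate` interface are assumptions rather than details from the paper:

```python
# Minimal exact-match evaluation for multi-digit addition.
import random

def make_item(n_digits: int) -> tuple[str, str]:
    """Build one prompt/answer pair with two n-digit operands."""
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    a, b = random.randint(lo, hi), random.randint(lo, hi)
    return f"What is {a} plus {b}?", str(a + b)

def exact_match_accuracy(model_generate, n_digits: int, n_items: int = 100) -> float:
    """Fraction of items whose generated answer matches the reference exactly."""
    correct = 0
    for _ in range(n_items):
        prompt, answer = make_item(n_digits)
        correct += model_generate(prompt).strip() == answer
    return correct / n_items
```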
Linguistic and Real-World Knowledge Tasks
Conversely, increased exposure to code tended to hurt performance on tasks requiring purely linguistic or real-world knowledge, such as English Passivization and BigBench's Common Morpheme, Fantasy Reasoning, General Knowledge, and Implicatures tasks. The negative impact was evident in both the competitive and additive settings, indicating a trade-off between code-related improvements and performance on natural language understanding tasks.
To quantify the overall impact of code pretraining, the authors performed permutation tests on the slopes derived from best-fit lines of task performance versus code mixture. The results showed a statistically significant increase in performance variance and upper-quartile performance on multiple-choice tasks in the competitive setting. This suggests that while code pretraining improves performance on some tasks, it also increases the overall variability in performance, potentially making models more specialized but less generalizable.
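The statistical analysis can be pictured with the following sketch of a permutation test on fitted slopes; the number of permutations and the two-sided test statistic are assumptions, since the summary does not specify them:

```python
# Permutation test on the slope of task performance vs. code fraction.
import numpy as np

def fitted_slope(code_fracs: np.ndarray, scores: np.ndarray) -> float:
    """Slope of the ordinary-least-squares best-fit line."""
    return np.polyfit(code_fracs, scores, deg=1)[0]

def permutation_pvalue(code_fracs, scores, n_perm: int = 10_000, seed: int = 0) -> float:
    """How often randomly re-pairing scores with code fractions yields a slope
    at least as steep (in absolute value) as the observed one."""
    rng = np.random.default_rng(seed)
    observed = abs(fitted_slope(code_fracs, scores))
    null = np.array([
        abs(fitted_slope(code_fracs, rng.permutation(scores)))
        for _ in range(n_perm)
    ])
    return float(np.mean(null >= observed))
```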
Discussion
The paper provides important insights into the benefits and trade-offs of incorporating code into the pretraining datasets of LLMs. The observed improvements in compositional generalization and arithmetic tasks suggest that the structured nature of code can enhance a model's ability to handle formally structured tasks. However, the adverse effects on linguistic and real-world knowledge tasks highlight the need for careful consideration when balancing code and natural language data in pretraining corpora.
Implications and Future Research
These findings have several practical and theoretical implications. For developers, the results indicate that incorporating code into the pretraining corpus can be beneficial for specific applications, particularly those involving structured data or arithmetic. Theoretically, the paper raises questions about the inductive biases introduced by code pretraining and how these might interact with other forms of data.
Future research could explore larger models to see whether these trends hold at greater scale, investigate the impact of code from different programming languages, and examine how varying code proportions affect long-term performance in dynamic, real-world applications.
Conclusion
The paper provides a rigorous analysis of the impacts of code pretraining on LLMs. The findings contribute to a nuanced understanding of how code influences model capabilities, offering valuable insights for both researchers and practitioners in the field of AI.