This paper investigates the differences in generalization patterns exhibited by LLMs when learning from data via in-context learning (ICL) versus fine-tuning. The authors use controlled synthetic datasets specifically designed to contain novel knowledge structures (like relation reversals, syllogisms, and semantic hierarchies) that are unlikely to have been encountered during pre-training. This approach allows for a clean evaluation of how LLMs generalize new information acquired through different learning mechanisms.
The core problem addressed is the observation that while fine-tuning can yield strong performance on the trained task, it sometimes results in surprisingly narrow generalization, such as failing on simple relation reversals (inferring "B is A" after training on "A is B"). In contrast, ICL often demonstrates more flexible generalization in such cases. The paper aims to understand these differences and to propose practical methods for improving fine-tuning generalization.
To achieve this, the authors constructed several datasets:
- Simple Reversals and Syllogisms: Datasets featuring basic comparisons or syllogistic structures using nonsense words to ensure the knowledge is novel.
- Reversal Curse Paper Dataset: A dataset based on prior work (Shao et al., 19 Feb 2024) containing fictional celebrity names and descriptions, used to study the reversal phenomenon.
- Semantic Structure Benchmark: A more complex dataset representing a hierarchical knowledge structure (like animal/object categories with properties and relations) where all nouns, adjectives, and verbs are replaced with nonsense terms derived from plausible phoneme combinations (Wei et al., 2023). This dataset allows testing generalization on rephrased facts, reversals, syllogisms, and category holdouts.
- Process Knowledge Benchmark: A dataset for testing the ability to learn and apply a novel pseudo-mathematical procedure ("derivatoids"), exploring compositional generalization of rules.
The evaluation primarily uses multiple-choice likelihood scoring on held-out test sets requiring different types of generalization.
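As a rough illustration of this scoring scheme (not the authors' released code), one might compare option log-likelihoods as in the sketch below, where `loglik` stands in for an assumed API that returns a model's log-probability of a continuation given a prompt:

```python
from typing import Callable, Sequence

def score_multiple_choice(
    question: str,
    options: Sequence[str],
    loglik: Callable[[str, str], float],  # assumed API: log P(continuation | prompt)
) -> int:
    """Return the index of the option the model assigns the highest log-likelihood.

    Length normalization, a common variant, is omitted here for simplicity.
    """
    scores = [loglik(question, option) for option in options]
    return max(range(len(options)), key=scores.__getitem__)
```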
For fine-tuning, the authors typically train Gemini 1.5 Flash (or Gemini 1.5 Flash-8B) models with standard hyperparameters (batch size of 8 or 16 and one of a few standard learning rates) for 200-1000 steps.
For in-context evaluation, the entire training dataset (or large subsamples for larger datasets to manage context length) is concatenated and provided as context to the instruction-tuned model, which is then prompted to answer questions based on this context.
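For instance, the context assembly step might look like the following minimal sketch; the character-based budget and prompt wording are illustrative assumptions rather than the paper's exact setup:

```python
import random

def build_icl_prompt(
    train_docs: list[str],
    question: str,
    max_chars: int = 400_000,  # crude stand-in for a token-based context budget
    seed: int = 0,
) -> str:
    """Concatenate (a subsample of) the training set as context, then append the question."""
    docs = list(train_docs)
    random.Random(seed).shuffle(docs)
    context, used = [], 0
    for doc in docs:
        if used + len(doc) > max_chars:
            break  # subsample larger datasets to fit the context window
        context.append(doc)
        used += len(doc) + 1
    return (
        "Answer the question using the documents below.\n\n"
        + "\n".join(context)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
```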
A key practical contribution is the proposed dataset augmentation method. This method leverages the generalization capabilities of ICL to create additional training data for fine-tuning. The idea is to use an LLM (the same one being trained or a more capable one) to generate inferences (like reversals, rephrasings, or logical consequences) from the original training data by processing it in context. This generated "augmented" data is then added to the fine-tuning dataset. Two main augmentation strategies are explored:
- Local (sentence) augmentation: Using a prompt to generate rephrasings or reversals for individual sentences from the training data. The prompt provides examples of how to rephrase or reverse statements.
```python
LOCAL_PROMPT = (
    'Please generate possible novel statements and rephrasings that can be '
    'inferred from each sentence on its own. Even a simple sentence has '
    'some alternative phrasings that are logically equivalent. Please '
    'only use logic and language to draw your conclusions, regardless of '
    'whether the entities in question actually exist.\n',
    'Statement: trillips are taller than zax.',
    'Inferences: trillips have greater height than zax. zax are shorter'
    ' than trillips. zax have lower heights than trillips.',
    'Statement: Note: Engineering is simpler than science.',
    'Inferences: Science is more complex than engineering. Engineering is '
    'less complex than science. Engineering is not as complex as science.',
    'Statement: "{text_to_augment}"',
)
```
- Global (document) augmentation: Using a prompt that includes the full training dataset in context, then presenting a specific document and prompting the model to generate inferences by linking that document to the overall context.
```python
GLOBAL_PROMPT = (
    'I am going to give you a bunch of documents, and then please use them as '
    'context to reason through all the logical consequences you can produce '
    'from a final target document. First, here are the source documents:\n\n'
    '{full_context}\n\n'
    'Now, please use that context to help me rephrase this document or reason '
    'through the consequences of the statements it contains. Please state all '
    'consequences as explicitly as possible, following the format of the '
    'source documents, and be as complete as possible.\n'
    '{target_document}\n'
)
```
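To make the pipeline concrete, the sketch below shows one way these two templates could be applied to build an augmented fine-tuning set. The `generate` callable is a placeholder for any LLM sampling API (an assumption, not part of the paper's code); note that `LOCAL_PROMPT` is a tuple of few-shot lines while `GLOBAL_PROMPT` is a single template string.

```python
from typing import Callable

def augment_dataset(
    train_docs: list[str],
    generate: Callable[[str], str],  # hypothetical LLM sampling call: prompt -> completion
) -> list[str]:
    """Return the original documents plus in-context inferences drawn from them."""
    full_context = "\n\n".join(train_docs)
    augmented = list(train_docs)
    for doc in train_docs:
        # Local augmentation: join the few-shot lines, then fill in the target sentence.
        local = generate("\n".join(LOCAL_PROMPT).format(text_to_augment=doc))
        # Global augmentation: provide the whole corpus as context for this document.
        glob = generate(GLOBAL_PROMPT.format(full_context=full_context, target_document=doc))
        augmented.extend(s for s in (local.strip(), glob.strip()) if s)
    return augmented
```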
The paper also investigates sentence-splitting, where multi-sentence training documents are broken down into individual sentence-level examples for fine-tuning. This is found to be particularly beneficial when using augmented data.
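As a minimal sketch, a sentence-splitting pass could be as simple as the following; the punctuation-based regex is an assumption standing in for whatever segmentation the authors used:

```python
import re

def split_into_sentences(docs: list[str]) -> list[str]:
    """Break each multi-sentence document into one training example per sentence."""
    examples: list[str] = []
    for doc in docs:
        for sentence in re.split(r"(?<=[.!?])\s+", doc.strip()):
            if sentence:
                examples.append(sentence)
    return examples
```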
The experimental results demonstrate several key findings:
- ICL vs. Fine-tuning: In controlled settings with novel data, ICL often generalizes more flexibly than standard fine-tuning, especially on tasks requiring systematic generalization such as reversals and syllogisms. For instance, on the Reversal Curse dataset (Shao et al., 19 Feb 2024), ICL achieves near-ceiling performance on reversals, while models fine-tuned on the forward direction show near-zero accuracy on the reversed direction. On simple nonsense reversals and syllogisms, ICL consistently outperforms fine-tuning. On the Semantic Structure benchmark, ICL shows benefits on reversals and syllogisms, though category holdouts remain challenging.
- Augmented Fine-tuning Effectiveness: Augmenting the fine-tuning dataset with in-context inferences significantly improves the generalization performance of fine-tuned models. This augmented fine-tuning approach often matches or surpasses the performance of ICL on the original training data across various tasks and datasets. This indicates that using train-time inference to generate more diverse or explicitly related examples for fine-tuning is a practical strategy for improving generalization.
- Sentence Splitting Impact: Splitting training documents into sentence-level examples improves fine-tuning performance, especially when combined with data augmentation. This suggests that presenting facts or inferences in isolated contexts can enhance learning, potentially by avoiding "explaining-away" phenomena where the model relies on existing context cues rather than learning the information directly.
- Nonsensification and ICL: Preliminary results suggest that ICL performance on tasks like the Reversal Curse dataset degrades when entity names are replaced with nonsense terms. This indicates that LLMs might rely on their pre-training knowledge and priors for effective long-context processing and ICL, and novel, out-of-distribution terms can interfere with this process.
- Model Size Effects: The benefits of augmented fine-tuning are observed across different model scales (Gemini 1.5 Flash and Flash-8B), suggesting the approach is generally applicable. Smaller models show weaker ICL performance compared to larger ones, consistent with prior work.
- Process Knowledge: Exploratory experiments on the Process Knowledge benchmark show that ICL is more data-efficient in the low-shot regime than standard supervised fine-tuning (SFT) for compositional generalization. Programmatic augmentation can improve SFT performance, suggesting that while challenging, procedural generalization can also benefit from data augmentation strategies.
Practical Implications:
These findings have significant implications for adapting LLMs to downstream tasks involving novel information:
- Choosing Adaptation Strategy: When flexible generalization to variations (like reversals) or logical inferences from newly learned facts is critical, ICL appears to be the better default option compared to basic fine-tuning. However, ICL can be computationally expensive due to long context windows and may be sensitive to the novelty of terms.
- Improving Fine-tuning Generalization: The proposed data augmentation method provides a practical technique to enhance fine-tuning. By using the LLM itself to generate explicit variations or inferences from the training data, practitioners can create richer datasets that lead to better generalization without relying solely on manual data collection or simple programmatic rephrasing. This is particularly useful when only limited original training data is available or when targeting specific types of generalization failure modes observed in standard fine-tuning (like the reversal curse).
- Data Preparation: Techniques like sentence-splitting, especially when applied to augmented datasets, should be considered during data preparation for fine-tuning complex documents to potentially improve learning efficiency and generalization.
- Compute Allocation: The results suggest that allocating compute towards train-time inference (for data augmentation) can be a valuable strategy for improving model performance and generalization at test time, complementing traditional scaling methods like training larger models or training on more original data.
The work highlights that leveraging the distinct strengths of ICL (flexible inference on novel data) and fine-tuning (efficient encoding of learned patterns) through synergistic methods like augmented fine-tuning is a promising direction for building more capable and reliable AI systems for real-world applications involving learning new knowledge. While the use of synthetic/nonsense data limits direct extrapolation to all real-world scenarios, it provides valuable controlled insights into the underlying learning mechanisms.