- The paper demonstrates that Sys2-FT effectively bridges the FT-ICL gap by converting in-context learning signals into model-internalized knowledge.
- Self-QA emerges as the most effective protocol, significantly boosting performance in quantitative domains through active question generation and answering.
- Including source context during fine-tuning triggers a contextual shadowing effect that undermines the model’s ability to internalize new information.
This paper introduces New News (2505.01812), a dataset and methodology designed to address a significant challenge in LLMs: reliably integrating new information into their weights through fine-tuning. While LLMs are adept at using new facts provided in context (in-context learning or ICL), consolidating this knowledge permanently in the model's parameters via fine-tuning (FT) remains difficult.
The Problem: The FT-ICL Gap
The authors demonstrate a substantial gap (the FT-ICL gap) between models given the new information in context and models naively fine-tuned on that same information. The New News dataset comprises hypothetical but plausible "news" articles across domains such as mathematics, coding, scientific discoveries, leaderboards, and events, each paired with downstream questions that require reasoning based on the news. On this dataset, naive fine-tuning performs significantly worse than simply providing the news in the prompt as context.
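For intuition about the data structure, the snippet below illustrates the general shape of a dataset record: a news article paired with downstream questions. The content is a made-up example for illustration only, not an item from the actual New News dataset.

```python
# Hypothetical illustration of the news + downstream-question structure
# (not an actual record from the New News dataset).
record = {
    "domain": "leaderboards",
    "news": "Team Borealis overtook Team Austral at the top of the 2025 "
            "robotics league standings after winning the final round.",
    "downstream_questions": [
        "Which team currently leads the 2025 robotics league standings?",
        "Did Team Austral keep first place after the final round?",
    ],
}
```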
System-2 Fine-tuning (Sys2-FT)
Inspired by human cognitive processes like memory consolidation, rehearsal, and self-explanation, the paper proposes System-2 Fine-tuning (Sys2-FT). The core idea is to leverage the LLM's own ICL capabilities to generate synthetic "replay elements" related to the new news. This generated data is then used for fine-tuning, aiming to distill the knowledge gained from ICL into the model's weights.
The paper explores several Sys2-FT protocols for generating these replay elements:
- Paraphrase: The model is prompted to generate diverse rephrasings of the original news.
- Implication: The model is prompted to reason about and generate text detailing the downstream consequences and implications of the news.
- Self-QA: A two-step process: first, with the news in context, the model is prompted to generate relevant questions about the news; then, in a separate interaction that still includes the news as context, it is prompted to answer those questions. The resulting QA pairs are used as fine-tuning data.
The generated data for each protocol is formatted into a multi-turn conversation suitable for standard supervised fine-tuning (SFT), where the loss is computed on the assistant's response tokens.
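As a concrete illustration, here is a minimal sketch of how the Self-QA protocol could be implemented around a generic chat-completion helper. The `chat` callable, the prompt wording, and the number of questions are assumptions for illustration; the paper's exact prompts are given in its Appendix A.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

def self_qa_replay(news: str,
                   chat: Callable[[List[Message]], str],
                   n_questions: int = 4) -> List[List[Message]]:
    """Generate Self-QA replay conversations for one news item.

    `chat` is a placeholder for any chat-completion call (local model or API)
    that takes a list of {"role", "content"} messages and returns a string.
    """
    # Step 1: with the news in context, ask the model to pose questions about it.
    q_prompt = (f"Here is a piece of news:\n\n{news}\n\n"
                f"Write {n_questions} questions whose answers depend on this news, "
                f"one per line.")
    questions = [q.strip()
                 for q in chat([{"role": "user", "content": q_prompt}]).splitlines()
                 if q.strip()]

    replay_conversations = []
    for question in questions:
        # Step 2: answer each question in a separate interaction,
        # again with the news available in context (ICL).
        a_prompt = (f"News:\n{news}\n\n"
                    f"Answer the following question based on the news.\n\n{question}")
        answer = chat([{"role": "user", "content": a_prompt}])

        # The fine-tuning example keeps only the bare QA pair; the news itself is
        # deliberately left out of the prefix (see "contextual shadowing" below).
        replay_conversations.append([
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ])
    return replay_conversations
```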
Key Findings and Practical Implications
- Sys2-FT bridges the FT-ICL gap: Experiments using the Qwen 2.5 model family show that Sys2-FT methods consistently outperform naive fine-tuning on the New News dataset.
- Self-QA is the most effective protocol: Among the explored methods, the Self-QA protocol proves to be the most robust and effective for internalizing new knowledge, particularly for quantitative domains like Math and Coding. This suggests that requiring the model to actively generate and answer questions about the new information is a powerful form of rehearsal and consolidation.
- Contextual Shadowing Effect: A surprising and practically important finding is the "Contextual Shadowing Effect": including the original news as a context prefix before the generated replay elements during fine-tuning (e.g., placing the news at the start of the conversation, ahead of the generated QA pairs) significantly degrades the model's ability to internalize the knowledge into its weights. The authors hypothesize that, because the model can already read the information from the prefix, predicting the replay elements becomes trivial, "shadowing" the learning signal. Fine-tuning data therefore needs to be structured so the learning task is not trivialized; see the data-layout sketch after this list.
- The Curse of Overexposure: The paper also observes instances where fine-tuning with the news negatively impacts the model's ICL performance with the same news. This "curse of overexposure" suggests a complex interaction between in-weight and in-context knowledge representations, where over-training on specific facts in weights might interfere with the ICL mechanism.
- Scaling of Sys2-FT: Preliminary evidence suggests an emerging scaling law for Sys2-FT evaluation accuracy with respect to compute, particularly for larger models (3B+). This indicates that Sys2-FT is a scalable approach for knowledge integration.
- Importance of Model and Data Quality: Analyzing performance across models trained on data generated by different models shows that successful knowledge integration requires both a sufficiently capable base model and high-quality generated data. Stronger models can benefit from data generated by weaker models, demonstrating "Weak-to-Strong generalization" in this context.
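To make the contextual shadowing pitfall concrete, the sketch below contrasts two layouts of the same Self-QA replay element as a fine-tuning conversation. The variable names and placeholders are illustrative assumptions; the point from the paper is only which layout includes the news as a prefix.

```python
news = "..."        # the original news article (placeholder)
question = "..."    # a question generated by the model from the news
answer = "..."      # the model's in-context answer to that question

# Layout that triggers contextual shadowing: the news sits in the prefix,
# so predicting the answer is trivial and little is written into the weights.
shadowed_example = [
    {"role": "user", "content": f"News:\n{news}\n\n{question}"},
    {"role": "assistant", "content": answer},
]

# Effective layout: the replay element stands alone, forcing the model
# to rely on (and thus consolidate) in-weight knowledge of the news.
effective_example = [
    {"role": "user", "content": question},
    {"role": "assistant", "content": answer},
]
```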
Implementation Considerations
Implementing Sys2-FT involves several steps:
- Data Generation:
- Choose a base model capable of strong ICL performance. The prompts and the quality of the generated data are crucial, and Self-QA is the recommended protocol.
- Implement prompting strategies (similar to those detailed in Appendix A) to generate paraphrases, implications, or QA pairs based on the "news" you want to integrate.
- Generate a sufficiently diverse set of replay elements for each piece of news. The paper used 1024 conversations per news item.
- Randomize prompts and conversation structures during generation to enhance diversity and avoid overfitting to specific prompt formats.
- Data Formatting:
- Format the generated replay elements into standard SFT conversation turns (User-Assistant pairs).
- Crucially, avoid including the original "news" as a direct prefix in the conversation for the fine-tuning data itself, due to the contextual shadowing effect. The model should learn to produce the information or answer questions about it without the original news explicitly in the prompt.
- Fine-tuning:
- Use standard SFT techniques (e.g., next token prediction with cross-entropy loss).
- Calculate loss only on the assistant's generated tokens.
- Leverage efficient fine-tuning methods like LoRA (Low-Rank Adaptation) to reduce compute and storage requirements (checkpoints need only save the LoRA adapters). The paper used LoRA with r=16, alpha=32; a minimal setup sketch follows this list.
- Monitor training dynamics carefully, as abrupt changes and temporary drops in ICL performance (curse of overexposure) can occur.
- Evaluation:
- Evaluate the fine-tuned model's ability to answer downstream questions about the news without providing the news in the prompt context.
- Compare performance against a base model (pre-fine-tuning), a naive fine-tuning baseline, and the ICL performance (providing the news in the prompt during evaluation).
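As a rough sketch of the fine-tuning step referenced in the list above, the code below applies LoRA with the paper's reported hyperparameters (r=16, alpha=32) and masks the loss so it is computed only on assistant tokens. The model name, target modules, and use of Hugging Face transformers/peft are assumptions for illustration; the masking recipe also assumes the chat template renders the prompt plus generation header as a prefix of the full conversation, which holds for most chat models.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-3B-Instruct"  # illustrative choice of chat model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# LoRA with the paper's reported hyperparameters; the target modules are a
# common choice for this architecture, not specified here by the paper.
lora_cfg = LoraConfig(r=16, lora_alpha=32,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

def build_example(conversation):
    """Tokenize a user->assistant replay pair, masking the loss on all
    tokens except the assistant's reply (labels of -100 are ignored)."""
    full_ids = tokenizer.apply_chat_template(conversation, tokenize=True)
    prompt_ids = tokenizer.apply_chat_template(conversation[:-1], tokenize=True,
                                               add_generation_prompt=True)
    labels = [-100] * len(prompt_ids) + full_ids[len(prompt_ids):]
    return {"input_ids": torch.tensor([full_ids]),
            "labels": torch.tensor([labels])}

batch = build_example([
    {"role": "user", "content": "Which team now leads the standings?"},
    {"role": "assistant", "content": "Team Borealis leads after the final round."},
])
loss = model(**batch).loss  # next-token cross-entropy on assistant tokens only
loss.backward()             # only the LoRA adapter weights receive gradients
```

In a real run, examples would be batched with padding (padding positions also labeled -100) and wrapped in a standard training loop; the sketch omits that for clarity.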
Trade-offs and Limitations
- Computational Cost: Generating synthetic data using a large model can be computationally expensive.
- Data Quality: The quality of the generated replay elements depends heavily on the capability of the model used for generation and the effectiveness of the prompts.
- Generalization Beyond Training Domains: The paper notes that benefits were most pronounced in quantitative domains. Applying Sys2-FT effectively to more subjective or open-ended domains might require further research into appropriate replay element types and generation strategies.
- Understanding Curses: The exact mechanisms behind contextual shadowing and the curse of overexposure require further investigation, which could lead to more robust training strategies.
In summary, New News (2505.01812) presents a practical approach, Sys2-FT (specifically the Self-QA protocol), for improving LLMs' ability to internalize new knowledge via fine-tuning. It also uncovers important phenomena like contextual shadowing that inform how training data should be structured when aiming to consolidate new information in model weights. This research paves the way for developing models that can more effectively learn and adapt to a continuously changing world.