Why Fine-Tuning Encourages Hallucinations and How to Fix It
Abstract: LLMs are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since they arise as a by-product of knowledge degradation during training. We propose a self-distillation-based SFT method that facilitates effective factual learning while minimizing hallucinations w.r.t. pre-existing knowledge by regularizing output-distribution drift. We also show that, in settings where new knowledge acquisition is unnecessary, suppressing factual plasticity by freezing parameter groups can preserve task performance while reducing hallucinations. Lastly, we investigate the mechanism behind SFT-induced hallucinations through three hypotheses: capacity limitations, behavior cloning, and localized interference. Our experiments show that a main driver is interference among overlapping semantic representations, and that self-distillation succeeds by mitigating this interference.
Explain it Like I'm 14
Overview: What this paper is about
This paper asks a simple question: Why do LLMs sometimes “make stuff up” (hallucinate) more after we fine‑tune them, and how can we stop that? The authors show that fine‑tuning a model on new facts can make it “forget” some of what it already knew, which then shows up as wrong answers. They reframe this as a classic learning problem called the stability–plasticity tradeoff: learning new things (plasticity) can harm old knowledge (stability). They then test and propose two practical fixes that keep old knowledge safe while still letting the model learn.
Key objectives in plain language
The paper set out to:
- Check if fine‑tuning on new facts really causes models to forget old facts (leading to more hallucinations).
- Find ways to fine‑tune that keep the model’s old knowledge intact.
- Figure out what, inside the model, causes this forgetting: is it limited memory, copying behavior, or interference between similar facts?
How they tested it (methods explained simply)
Think of the model like a big library:
- “Pre‑training” stocked the library with lots of books (facts).
- “Fine‑tuning” teaches the librarian how to answer questions in a certain style, and sometimes adds new books.
The authors separated two things that usually get mixed together:
- Learning the task format (how to answer questions nicely), and
- Learning new facts.
To do that, they:
- Used a method called SLiCK to sort questions into “HighlyKnown” (the model already knows the answer) and “Unknown” (the model doesn’t).
- Trained on different mixes:
- Only HighlyKnown questions (teaches the format, no new facts added).
- A mix of HighlyKnown and Unknown questions (teaches format and adds new facts).
- Measured three things at the same time:
- Task learning: How well the model adopts the Q&A style on known facts.
- New fact learning (plasticity): How well it picks up the new facts.
- Old fact stability: Whether answers on held‑out known facts get worse (i.e., forgetting/hallucinating).
They also tried two mitigation strategies:
- Freezing parameter groups: Imagine locking some drawers in the library so they can’t be changed. They compared updating only the “attention” parts versus only the “FFN” (feed‑forward) parts to see which updates cause forgetting.
- Self‑distillation: Think of the model as a student who checks with its earlier self (a “teacher snapshot”) to avoid drifting too far from what it used to say. During fine‑tuning, they added a gentle “stay close to your previous outputs” rule so the model learns new facts without overwriting old ones.
Finally, to uncover the cause of forgetting, they ran a controlled experiment with synthetic facts:
- Some new “locations” had name‑like labels (e.g., “Bergadena”), which sound similar to real places.
- Others used random ID‑style labels (e.g., “Loc_fcfb46”), which share no resemblance with real names.
- They added from 1,000 up to 1,000,000 such facts and watched when and how the model started forgetting old facts.
- They also measured “representation drift” (like checking if books got moved around on the shelves inside the model): how much the model’s internal representation of known entities shifted during training. A small sketch of this check appears right after this list.
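To make the drift measurement concrete, here is a minimal, hedged sketch of one way to compute it with a Hugging Face-style causal LM: compare the hidden state of an entity prompt before and after fine-tuning using cosine distance. The layer choice and the use of the final prompt token's hidden state are illustrative assumptions, not the paper's exact protocol.

```python
import torch
import torch.nn.functional as F

def entity_hidden_state(model, tokenizer, prompt, layer=-1):
    """Hidden state of the final prompt token at a chosen layer (illustrative choice)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1]  # shape: [hidden_dim]

def representation_drift(base_model, tuned_model, tokenizer, entity_prompts, layer=-1):
    """Average cosine distance between pre- and post-fine-tuning entity representations."""
    dists = []
    for prompt in entity_prompts:
        h_before = entity_hidden_state(base_model, tokenizer, prompt, layer)
        h_after = entity_hidden_state(tuned_model, tokenizer, prompt, layer)
        dists.append(1.0 - F.cosine_similarity(h_before, h_after, dim=0).item())
    return sum(dists) / len(dists)
```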
Main findings and why they matter
Here are the highlights:
- Learning new facts triggers forgetting of old facts (and thus more hallucinations):
- When trained on both known and new facts, the model learned the new facts but its accuracy on previously known facts dropped by about 15%.
- When trained only on known facts (no new facts), accuracy on old facts stayed stable. So the problem isn’t fine‑tuning itself—it’s the addition of new facts.
- Freezing certain parts can prevent new facts from overwriting old ones:
- Updating only attention layers: the model learned the Q&A format but barely picked up new facts, and old knowledge remained safe (few hallucinations).
- Updating only FFN layers behaved like standard fine‑tuning: the model learned new facts but also forgot more old ones.
- This is useful when you want to adapt how the model answers without injecting new factual content (e.g., safety tuning or private-domain formatting).
- Self‑distillation lets the model learn new facts without much forgetting:
- With a “stay close to your past answers” constraint, the model learned new facts at a similar speed but the drop on old facts shrank from ~15% to ~3%.
- This is great when you do want to inject new knowledge but keep the old knowledge intact.
- The main cause of forgetting is interference between similar facts, not just limited capacity or blind behavior copying:
- Adding millions of random‑ID facts barely caused forgetting.
- Adding many name‑like facts (that resemble real entities) caused much more forgetting—even at smaller scales.
- Inside the model, the “representations” of old entities moved a lot more when the new facts used name‑like labels. Self‑distillation reduced this movement, which matched the drop in hallucinations.
- This means the problem is localized “bumping” in the model’s semantic neighborhoods: updating info about a new, similar‑sounding entity can jostle old, nearby knowledge.
What this means going forward (implications)
- Treat hallucinations after fine‑tuning as a forgetting problem:
- It’s not inevitable. It happens when new facts disrupt nearby, similar knowledge.
- Pick the right strategy for your goal:
- If you don’t want the model to absorb new facts (just teach it style or safety), freeze the parts that tend to store facts (e.g., avoid updating FFNs). This preserves old knowledge and reduces hallucinations.
- If you do want to add new facts, use self‑distillation. It keeps the model’s answers close to its earlier self and prevents harmful drift, cutting forgetting dramatically.
- Be careful about data that resembles existing entities:
- New facts with names that look like real names are more likely to cause interference. If possible, use less confusable labels or apply stronger preservation techniques (like self‑distillation).
- Big picture:
- Fine‑tuning doesn’t have to make hallucinations worse. With the right tools, we can keep old knowledge stable while adding new knowledge. Future work should treat “factual stability” as a first‑class goal during fine‑tuning, borrowing ideas from continual learning to make models more reliable.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, concrete list of what remains uncertain or unexplored, framed to guide follow‑up research.
- External validity beyond closed-book entity-centric QA: Do the findings hold for long-form generation, summarization, multi-hop reasoning, code, math, or tool-augmented tasks where hallucinations arise via different pathways?
- Model scale and regime dependence: Are the plasticity–stability dynamics and mitigations unchanged for larger, instruction-tuned, or RLHF/DPO-tuned frontier models (e.g., ≥70B) with different inductive biases and calibration properties?
- Cross-lingual and domain generalization: Does overlap-driven interference behave similarly for non-English entities and in specialized domains (biomed, legal) with different tokenization statistics and entity distributions?
- Sequential SFT and continual updates: How do hallucinations accumulate across multiple fine-tuning rounds (e.g., domain adaptation → instruction tuning → preference optimization), and can self-distillation remain effective over long sequences?
- Alternative continual-learning methods: How do replay buffers, parameter isolation (e.g., adapters, dynamic expansion), EWC/Fisher penalties, orthogonal gradient projection, trust-region KL constraints, or EMA teachers compare to self-distillation on forgetting vs. plasticity?
- Distillation design choices: What is the impact of teacher selection (snapshot vs. EMA, earlier vs. later epochs), temperature/λ schedules, token-level masking (e.g., entity tokens only), or online teacher updates on the stability–plasticity trade-off?
- Compute/latency budget: What are the training-time and memory overheads of self-distillation relative to simpler constraints, and how do they scale with model size and dataset size?
- Side effects beyond accuracy: How do freezing and self-distillation affect calibration, uncertainty, refusal behavior, harmful content, bias amplification, and output diversity?
- Interaction with retrieval: Does representational interference persist or diminish in RAG settings, and does forgetting harm retrieval conditioning (e.g., alignment between retrieved passages and generation)?
- Robustness to label noise and spurious correlations: Are interference and mitigations stable when injected “facts” contain noise, ambiguity, or conflicting evidence across sources?
- Sensitivity to data curriculum and mixing: How do sampling ratios (Known/Unknown), curriculum pacing, batching, and example ordering modulate forgetting and interference?
- Optimizer and hyperparameter dependence: Are the conclusions robust across optimizers (AdamW vs. Adafactor), learning-rate schedules, batch sizes, and regularization strengths?
- Parameter-group granularity: Beyond attention vs. FFN, which specific subcomponents (e.g., embeddings, layer norms, q/k/v/o projections, FFN gates, positional encodings) drive factual plasticity vs. stability, and can finer-grained freezing or low-rank updates improve the trade-off?
- Mechanistic localization: Which layers/heads/neurons exhibit the strongest representational drift during interference, and can neuron-/head-level constraints or routing avoid corrupting factual neighborhoods?
- Representation similarity vs. surface-form overlap: The paper varies surface forms (name-like vs. UUID) but does not quantify embedding-space similarity; does forgetting correlate more tightly with token-, subword-, or contextual-embedding proximity than with string form?
- Tokenization artifacts: Are results driven by subword segmentation (e.g., shared morphemes) rather than semantic overlap per se? Controls with equal token counts/frequencies and synthetic tokens with controlled segmentation are needed.
- Relation and fact type coverage: Do effects replicate across more relations (beyond P17 Location→Country), multi-value facts, temporal facts, commonsense, procedural, or causal knowledge?
- Persistence and relearning: After forgetting, can the model quickly recover held-out facts with minimal training, and does self-distillation ease recovery (i.e., was knowledge masked vs. erased)?
- Long-term drift and stability: How stable are mitigations under prolonged training (more epochs) and larger knowledge injections (>1M facts) without catastrophic plasticity loss?
- Causal evidence for interference: Hidden-state drift is a proxy; can gradient attributions, Fisher overlap, or causal circuit interventions directly demonstrate that updates to new entities perturb specific preexisting entities?
- Targeted constraints: Would constraining logits only on entity tokens or trust-region constraints on entity spans suffice, reducing distillation costs while preserving effectiveness?
- Interaction with adapters and LoRA: Can isolating new facts in adapters/LoRA while freezing base weights achieve similar stability without sacrificing plasticity, and how does this compare to distillation?
- Evaluation breadth and metrics: Exact-match accuracy misses near-synonyms and partial credit; does the observed forgetting persist under semantic matching, human evaluation, or contradiction detection?
- Statistical rigor: What are confidence intervals, variance across seeds/splits, and sensitivity to SLiCK misclassification of “Known/Unknown,” especially for Maybe/WeaklyKnown items that were filtered?
- Safety and privacy use cases: The paper motivates privacy/alignment scenarios but does not test whether freezing or distillation reliably prevent undesired memorization of sensitive facts while maintaining task performance.
- Curriculum for minimizing interference: Can scheduling new facts by low embedding similarity to held-out facts (or by spaced repetition) reduce interference while maintaining learning speed?
- Theory of interference scaling: Can a predictive model connect capacity, representational overlap, and update magnitudes to expected forgetting, enabling principled hyperparameter choices?
- Layerwise drift profiling: Only a middle layer is analyzed; how does drift evolve layer-by-layer, and do early vs. late layers differentially mediate interference and mitigation?
- Generalization to multimodal LLMs: Do analogous interference patterns arise when learning new visual entities/names, and does self-distillation curb cross-modal forgetting?
- Open-source tooling and reproducibility: Clearer release of data generation code (synthetic keys), seeds, and configurations would allow precise replication and ablation of the reported effects.
Practical Applications
Immediate Applications
The paper offers concrete practices that can be applied today to reduce fine-tuning–induced hallucinations by managing the stability–plasticity tradeoff and mitigating representational interference.
- Safer task-format adaptation by freezing FFN (update only attention)
- Sectors: software, enterprise AI, healthcare, finance, public sector, education
- Tools/workflows: configure full-parameter SFT to freeze FFN layers and train only attention (or place LoRA adapters only on attention projections) when the goal is instruction/style adaptation (alignment) rather than knowledge injection; use standard supervised losses (sketch below)
- Assumptions/dependencies: access to model weights or adapter targeting; may suppress new fact acquisition (intended); tested on 1.5B–8B non-reasoning LLMs—validate on larger/reasoning models before deployment
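As a rough illustration of the attention-only configuration, the sketch below freezes every parameter whose name does not contain "self_attn". The name substring follows common Llama-style Hugging Face naming and is an assumption; adapt it to your model family.

```python
import torch

def train_attention_only(model, lr=1e-5):
    """Freeze all parameters except attention sublayers (name matching is an assumption)."""
    for name, param in model.named_parameters():
        param.requires_grad = "self_attn" in name
    # Build the optimizer over the remaining trainable parameters only.
    return torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
```

The same name-matching idea can be inverted (train only FFN, freeze attention) to reproduce the comparison discussed in the findings above.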
- Knowledge updates with reduced forgetting via self-distillation (Learning without Forgetting)
- Sectors: enterprise knowledge management (support, legal ops), healthcare guidelines, finance (policy/manual updates), consumer assistants
- Tools/workflows: two-loss SFT objective combining task loss with KL regularization to a frozen “teacher” snapshot (e.g., after 1 epoch on Known-only data), with temperature τ≈0.5 and λ≈1 as starting points; maintain throughput by caching teacher logits (sketch below)
- Assumptions/dependencies: extra compute for teacher forward passes; hyperparameter tuning needed; verified on EntityQuestions QA—extend evaluation to downstream generation tasks
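The two-loss objective can be sketched as below: standard next-token cross-entropy plus a temperature-softened KL term toward a frozen teacher snapshot. The τ≈0.5 and λ≈1 values echo the starting points above; details such as which token positions are regularized are assumptions rather than the paper's exact recipe.

```python
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, labels, tau=0.5, lam=1.0):
    # Task loss: next-token prediction; padding positions carry label -100.
    task = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    # Distillation loss: KL(teacher || student) over temperature-softened distributions.
    log_p_teacher = F.log_softmax(teacher_logits / tau, dim=-1)
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    distill = F.kl_div(
        log_p_student, log_p_teacher, log_target=True, reduction="batchmean"
    ) * (tau ** 2)
    return task + lam * distill
```

In a training loop, teacher_logits would come from the frozen snapshot run (or cached) on the same batch, with no gradient flowing through it.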
- Two-stage fine-tuning curriculum to separate task learning from factual updates
- Sectors: most industries deploying adapted LLMs
- Tools/workflows: Stage 1—“Known-only” SFT to learn task format (collect or detect high-confidence-known items); snapshot becomes teacher; Stage 2—SFT on Known+Unknown with self-distillation; monitor held-out known facts
- Assumptions/dependencies: requires identifying “Known” vs “Unknown” items; if SLiCK is unavailable, approximate with confidence heuristics or retrieval-backed checks
- LoRA targeting strategy to reduce drift
- Sectors: MLOps, platform teams fine-tuning OSS LLMs
- Tools/workflows: apply LoRA adapters only to attention layers for alignment/task-format tuning; for factual updates, combine LoRA (FFN+attention) with self-distillation (sketch below)
- Assumptions/dependencies: adapter support; trade-offs in capacity and convergence; validate on your data
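A hedged sketch of attention-only LoRA with the peft library is shown below; the target module names assume a Llama-style architecture, and the rank/alpha values are illustrative.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)  # base_model: a loaded causal LM
```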
- Dataset linting for surface-form overlap to prevent interference
- Sectors: data engineering for LLM fine-tunes across domains
- Tools/workflows: “similarity linter” that flags new entity names with high lexical/embedding similarity to known entities; if high overlap, either (i) rename to unique tokens (e.g., internal IDs), (ii) strengthen distillation/regularization, or (iii) route to retrieval instead of parametric injection (sketch below)
- Assumptions/dependencies: access to embedding models; potential UX trade-offs if names are altered; best for internal KBs or intermediate representations
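One possible form of such a linter is sketched below, using sentence-transformers embeddings to flag new entity names that sit too close to known ones. The embedding model name and the threshold are illustrative choices, not prescribed by the paper.

```python
from sentence_transformers import SentenceTransformer, util

def flag_overlapping_entities(new_names, known_names, threshold=0.8):
    """Return (new_name, closest_known_name, similarity) triples above the threshold."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice
    new_emb = encoder.encode(new_names, convert_to_tensor=True, normalize_embeddings=True)
    known_emb = encoder.encode(known_names, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(new_emb, known_emb)  # shape: [len(new), len(known)]
    flagged = []
    for i, name in enumerate(new_names):
        score, j = sims[i].max(dim=0)
        if score.item() >= threshold:
            flagged.append((name, known_names[j.item()], score.item()))
    return flagged
```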
- Pseudonymization/UUIDs for sensitive entities during SFT
- Sectors: healthcare, finance, HR, legal (privacy + stability)
- Tools/workflows: replace PII or sensitive entity names with stable UUID-like identifiers during fine-tuning; maintain a secure mapping layer to render surface forms at inference (sketch below)
- Assumptions/dependencies: pipeline to remap identifiers; risk of domain shift if final usage requires natural names; best for internal/structured workflows
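A minimal sketch of the pseudonymization step is below; the identifier format and plain string replacement are simplifications, and a production pipeline would handle tokenization boundaries and secure storage of the mapping.

```python
import uuid

def pseudonymize(texts, sensitive_names):
    """Replace sensitive entity names with stable UUID-style identifiers."""
    mapping = {name: f"Ent_{uuid.uuid4().hex[:8]}" for name in sensitive_names}
    rewritten = []
    for text in texts:
        for name, alias in mapping.items():
            text = text.replace(name, alias)
        rewritten.append(text)
    return rewritten, mapping  # keep `mapping` in a secure store for inference-time remapping
```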
- Factual stability monitoring in MLOps
- Sectors: regulated industries; any team practicing continuous model updates
- Tools/workflows: add “held-out known facts” checks and acceptance thresholds to CI/CD; track a “factual stability score” (e.g., accuracy on held-out known facts vs peak); optionally add representational drift monitors (cosine drift on entity-token hidden states) when weights are accessible (sketch below)
- Assumptions/dependencies: requires curated held-out known sets or synthetic surrogates; hidden-state metrics require model access; thresholds must be calibrated
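A minimal version of the stability gate might look like the sketch below: exact-match-style accuracy on held-out known facts, compared against the best score previously recorded. The generation settings and the containment-based match are assumptions.

```python
def factual_stability_gate(model, tokenizer, known_qa_pairs, peak_accuracy, max_drop=0.03):
    """Return (accuracy, passed): fail if accuracy falls more than max_drop below the peak."""
    correct = 0
    for question, answer in known_qa_pairs:
        inputs = tokenizer(question, return_tensors="pt")
        output = model.generate(**inputs, max_new_tokens=16, do_sample=False)
        completion = tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        correct += int(answer.lower() in completion.lower())
    accuracy = correct / len(known_qa_pairs)
    return accuracy, (peak_accuracy - accuracy) <= max_drop
```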
- Evaluation protocol that separates task learning from factual stability
- Sectors: academia, industry evaluation teams
- Tools/workflows: adopt a SLiCK-like pipeline to tag dataset items as HighlyKnown vs Unknown; report (i) task-format learning (Known), (ii) factual plasticity (Unknown), and (iii) factual stability (Held-out Known) in all fine-tuning reports (sketch below)
- Assumptions/dependencies: compute for prompting-based classification; domain adaptation may require custom tagging logic
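If the original SLiCK pipeline is unavailable, a rough approximation is sketched below: sample the model several times under a few-shot prefix and tag a question by how consistently it produces the gold answer. This is a loose stand-in for illustration, not the published method.

```python
def classify_knowledge(model, tokenizer, question, answer, few_shot_prefix, n_samples=4):
    """Tag a QA pair as HighlyKnown / Unknown / PartiallyKnown by sampled accuracy."""
    hits = 0
    prompt = few_shot_prefix + f"Q: {question}\nA:"
    inputs = tokenizer(prompt, return_tensors="pt")
    for _ in range(n_samples):
        out = model.generate(**inputs, max_new_tokens=16,
                             do_sample=True, temperature=0.7, top_p=0.9)
        completion = tokenizer.decode(
            out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        hits += int(answer.lower() in completion.lower())
    if hits == n_samples:
        return "HighlyKnown"
    if hits == 0:
        return "Unknown"
    return "PartiallyKnown"  # intermediate categories can be filtered out, as in the paper's setup
```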
- Prefer retrieval or tool-use for knowledge injection
- Sectors: product teams adding fast-changing facts (news, prices, policies)
- Tools/workflows: for large volumes of new facts—especially with entity names similar to existing ones—prioritize RAG/tool-use; restrict SFT to task-format alignment or use self-distillation when parametric updates are essential
- Assumptions/dependencies: high-quality retrievers/corpus; orchestration complexity; latency budgets
- Developer best practices for small-scale/custom fine-tunes
- Sectors: SMEs, individual developers
- Tools/workflows: avoid training on synthetic name-like entities; use unique tags or IDs for internal terms; if using off-the-shelf fine-tuning services with LoRA, target attention only for alignment tasks; add a small held-out quiz of known facts to detect regressions post-tuning
- Assumptions/dependencies: limited control with closed APIs; small eval sets must be carefully curated
Long-Term Applications
These opportunities build on the paper’s insights but require additional research, scaling, or engineering before broad deployment.
- Interference-aware optimizers and schedulers
- Sectors: model providers, MLOps platforms
- Tools/products: optimizers that dynamically constrain gradients on tokens/entities with high overlap to protected knowledge; schedule stronger distillation when interference risk spikes
- Assumptions/dependencies: reliable online detection of overlap and protected spans; instrumentation access; careful trade-off with plasticity
- Factual stability regularization suites (“Factual Stability Kit”)
- Sectors: AI platforms, open-source toolchains
- Tools/products: turnkey libraries offering self-distillation, drift monitors, entity-overlap linting, and evaluation harnesses; CI integrations for “stability gates”
- Assumptions/dependencies: standard interfaces for different model families; maintenance of teacher snapshots/logits at scale
- Interference graphs for safe knowledge injection
- Sectors: knowledge engineering, enterprise AI
- Tools/products: build “interference graphs” mapping semantic neighborhoods of entities; plan updates to minimize cross-entity interference (e.g., batch updates by low-overlap groups)
- Assumptions/dependencies: scalable neighborhood construction; evolving vocabularies; proprietary data constraints
- Continual learning for on-device/edge LLMs
- Sectors: robotics, wearables, embedded systems
- Tools/products: self-distillation–based online learning that updates local facts while guarding global knowledge; compressed “teacher-on-chip” approximations
- Assumptions/dependencies: compute/memory constraints; privacy-preserving logging; robust fallback when confidence is low
- Standardized “factual stability” certification for regulated deployments
- Sectors: healthcare, finance, government
- Tools/policy: regulatory guidance requiring stability tests (held-out known performance, drift thresholds) after each model update; audit logs of teacher snapshots and distillation configs
- Assumptions/dependencies: consensus test suites; sector-specific known-fact panels; enforcement mechanisms
- Entity aliasing and canonicalization frameworks
- Sectors: enterprises with proprietary taxonomies
- Tools/products: pipelines that canonicalize internal entities to unique tokens for training and remap to human-readable names at runtime; reduce interference and support privacy
- Assumptions/dependencies: consistency across teams/systems; UX impacts; multilingual coverage
- Editing-at-scale with interference-aware planning
- Sectors: search, virtual assistants, e-commerce catalogs
- Tools/products: large-scale knowledge editing that prioritizes low-overlap updates and activates stronger regularization for high-overlap batches; mixed RAG/SFT routing
- Assumptions/dependencies: accurate overlap estimates; orchestration across retrieval and SFT; monitoring to prevent regressions
- Alignment methods that preserve factual stability
- Sectors: foundation model providers, safety research
- Tools/products: combine RLHF/DPO with self-distillation objectives or output-distribution constraints to prevent factual drift during alignment
- Assumptions/dependencies: interaction with preference optimization; careful tuning to avoid over-regularization of beneficial behaviors
- Benchmarking and research extensions
- Sectors: academia, standards bodies
- Tools/products: multilingual and cross-domain datasets that disentangle task learning from factual stability; benchmarks capturing interference effects and mitigation efficacy
- Assumptions/dependencies: community adoption; reproducibility across model sizes and families; extensions beyond QA to long-form generation
- Sector-specific continual updates with guarantees
- Sectors: healthcare (clinical guidelines), finance (regulatory changes), law (statute updates)
- Tools/products: pipelines that schedule periodic small updates with self-distillation, interference checks, and acceptance thresholds; provide documented “stability deltas” per release
- Assumptions/dependencies: access to authoritative “Known” panels; governance processes; integration with change management systems
- Retrieval–parametric hybrid strategies
- Sectors: news/media, real-time analytics, scientific assistants
- Tools/products: policies that keep rapidly changing facts in retrieval while parameterizing only stable, low-overlap knowledge; self-distillation guards both regimes
- Assumptions/dependencies: high-quality retrievers; freshness/consistency guarantees; cost models for dual infrastructure
These applications rely on the paper’s core findings: (i) SFT-induced hallucinations often stem from representational interference, especially when new facts share surface-form similarity with known entities; (ii) freezing FFN (training attention only) supports task adaptation with minimal factual drift; and (iii) self-distillation regularizes output distribution drift to preserve prior knowledge while still acquiring new facts. Adopting these practices can reduce induced hallucinations from ~15% to ~3% in settings similar to those tested, with validation recommended for specific domains and model scales.
Glossary
- behavior cloning: A training effect where models imitate observed outputs, potentially producing answers regardless of knowledge boundaries. "behavior cloning derived by SFT"
- catastrophic forgetting: Abrupt loss or degradation of previously learned knowledge when learning new information. "catastrophic forgetting over parametric factual knowledge"
- closed-book QA: Question answering without external retrieval, relying only on the model’s internal (parametric) knowledge. "in closed-book QA"
- continual learning: A paradigm where models learn from sequential tasks/data while balancing retention of old knowledge and acquisition of new knowledge. "In continual learning, forgetting typically arises"
- cosine distance: A similarity measure based on the angle between two vectors, often used to compare representations. "cosine distance captures directional shifts in representation space"
- distillation loss: A regularization term that encourages a student model’s output distribution to match a teacher’s outputs. "The distillation loss penalizes divergence between the output distributions"
- EntityQuestions: An entity-centric QA dataset built from relational facts (e.g., from Wikipedia/Wikidata) used to evaluate factual knowledge. "We use the EntityQuestions dataset"
- feed-forward network (FFN): The position-wise multilayer perceptron sublayers in Transformers that contribute to storing and transforming information. "Training only the FFN closely tracks standard SFT"
- few-shot prompting: Providing a small number of example Q–A pairs in the prompt to condition model behavior. "under multiple randomized few-shot prompting configurations"
- hidden-state drift: The change in internal representations (hidden states) over training, used to assess representational stability. "We track hidden-state drift for held-out entities throughout training"
- ℓ2 regularization: A penalty on the squared magnitude of parameters (or their deviation from a reference) to constrain updates. "and regularization toward "
- logits: Pre-softmax scores produced by a model for each token/class, which determine output probabilities after softmax. "the logits at token position "
- next-token prediction loss: The standard cross-entropy objective for language modeling that predicts the next token in a sequence. "the standard next-token prediction loss"
- non-reasoning LLMs: LLMs evaluated without explicit reasoning or chain-of-thought mechanisms. "several non-reasoning LLMs"
- output-distribution drift: Changes in the model’s output probability distribution over training, potentially reflecting knowledge drift. "output-distribution drift."
- parameter freezing: Keeping a subset of model parameters fixed during fine-tuning to preserve certain behaviors or knowledge. "freezing parameter groups"
- parameter space: The high-dimensional space of all model parameters where training trajectories and forgetting are analyzed. "factual forgetting in parameter space"
- representational interference: Detrimental interactions where updates for new information perturb existing representations, causing forgetting. "representational interference during fine-tuning"
- representational overlap: Shared or nearby internal representations between different entities/facts that can lead to interference. "representational overlap as a primary driver"
- semantic keys: Synthetic, name-like entity identifiers designed to share surface form with known entities, increasing overlap. "semantic keys, formed by recombining tokens from real location names"
- self-distillation: A technique where the model regularizes against its own earlier outputs (teacher snapshots) to reduce forgetting. "we apply self-distillation"
- SLiCK: A method to classify a model’s prior knowledge of facts into categories (e.g., HighlyKnown, Unknown) before training. "We apply the SLiCK method"
- stability–plasticity tradeoff: The balance between preserving existing knowledge (stability) and acquiring new knowledge (plasticity). "stability–plasticity tradeoff"
- student model: The model being optimized during distillation to align with a fixed teacher while learning new data. "the student model being fine-tuned"
- supervised fine-tuning (SFT): Adapting a pretrained model on labeled data for downstream tasks or updated knowledge. "supervised fine-tuning (SFT)"
- teacher model: A frozen snapshot of the model that provides soft targets for distillation during further training. "the parameters of a frozen teacher model"
- temperature parameter: A scalar controlling the softness of the softmax distribution, used in distillation to emphasize less likely tokens. "temperature parameter"
- UUID-style identifiers: Random, non-semantic strings used as entity keys that minimize representational overlap with known entities. "random UUID-style identifiers"