
Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers (2506.10887v1)

Published 12 Jun 2025 in cs.CL and cs.LG

Abstract: LLMs can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.

Summary

  • The paper demonstrates that out-of-context reasoning in transformers underlies both correct causal generalization and erroneous hallucinations from spurious associations.
  • Experiments on five LLMs with synthetic datasets confirm that OCR drives both behaviors, while a factorized one-layer attention-only model reproduces it, achieving zero test loss on held-out implications.
  • The theoretical analysis highlights that gradient descent’s implicit bias, favoring nuclear norm minimization over Frobenius norm, is critical for enabling sample-efficient knowledge injection.

This paper investigates the dual behavior of LLMs after fine-tuning with new factual knowledge: their ability to generalize from new facts and their propensity to hallucinate incorrect information. The authors propose that both phenomena stem from a single mechanism called out-of-context reasoning (OCR), defined as the ability to deduce implications by associating concepts, even if those concepts lack a direct causal link.

The core argument is that OCR leads to generalization when the associated concepts are causally related and to hallucination when they are not. This is demonstrated through experiments on five LLMs (Gemma-2-9B, OLMo-7B, Qwen-2-7B, Mistral-7B-v0.3, and Llama-3-8B) using synthetic datasets. The setup involves fine-tuning models on facts $(s, r_1, b_i)$ (e.g., "Alice lives in France") and corresponding implications $(s, r_2, c_i)$ (e.g., "Alice speaks French") for a subset of subjects $S_{train}$. The models are then tested on their ability to infer implications $(s', r_2, c_i)$ for new subjects $s' \in S_{test}$ for whom only the fact $(s', r_1, b_i)$ was provided during training.

Experimental Setup for LLMs:

The synthetic dataset uses fictitious names for subjects ($S$) and pairs facts from a set $A_1$ with implications from a set $A_2$. Five types of associations are tested:

  1. City-Language (Generalization): Uses real-world causal links (e.g., "People living in Paris speak French").
  2. City-Language (CF - Counterfactual Hallucination): Uses incorrect pairings (e.g., "Paris" mapped to "Japanese").
  3. Country-Code (Hallucination): Fictitious association.
  4. Profession-Color (Hallucination): Fictitious association.
  5. Sport-Music (Hallucination): Fictitious association.

Subjects are partitioned, and each partition is assigned a distinct fact-implication pair $(b_i, c_i)$. Training subjects constitute 20% of each partition and test subjects 80%. The training data includes facts for all subjects and implications only for training subjects. Evaluation uses mean rank: the average rank of the ground-truth implication among all candidates, sorted by prediction probability.
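
For concreteness, the following is a minimal sketch of how such a synthetic fine-tuning set and the mean-rank metric could be constructed. It is illustrative only (function names, subject names, and sentence templates are placeholders, not the authors' released code):

```python
import random

def build_dataset(subjects, fact_impl_pairs, train_frac=0.2, seed=0):
    """Partition subjects across (fact, implication) pairs; emit facts for
    every subject but implications only for the training fraction."""
    rng = random.Random(seed)
    subjects = list(subjects)
    rng.shuffle(subjects)
    n_pairs = len(fact_impl_pairs)
    partitions = [subjects[i::n_pairs] for i in range(n_pairs)]
    train_statements, test_queries = [], []
    for part, (fact, impl) in zip(partitions, fact_impl_pairs):
        n_train = max(1, round(train_frac * len(part)))
        for j, s in enumerate(part):
            train_statements.append(f"{s} lives in {fact}.")    # fact: seen for all subjects
            if j < n_train:
                train_statements.append(f"{s} speaks {impl}.")  # implication: train subjects only
            else:
                test_queries.append((f"{s} speaks", impl))      # held-out implication
    return train_statements, test_queries

def mean_rank(ranked_candidates, gold_answers):
    """Average 1-indexed rank of the ground-truth answer, where each entry of
    ranked_candidates is the candidate list sorted by model probability."""
    ranks = [cands.index(g) + 1 for cands, g in zip(ranked_candidates, gold_answers)]
    return sum(ranks) / len(ranks)

# Example with the City-Language association (fictitious subject names):
train, test = build_dataset(
    ["Aldric Vane", "Mira Solen", "Tobin Rhee", "Lena Korvic", "Oskar Pell"],
    [("Paris", "French"), ("Tokyo", "Japanese")],
)
```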

LLM Experimental Results:

The results in Table 1 show that LLMs exhibit strong generalization for causally related knowledge ("City-Language") but also learn spurious associations, leading to hallucinations for non-causally related or counterfactual pairings. This learning is sample-efficient, occurring with few training examples (e.g., four training subjects per subset). Generalization results tend to be stronger than hallucination, possibly because new causal knowledge aligns with pretrained knowledge.

Formalizing OCR with One-Layer Transformers:

To understand the underlying mechanism, OCR is formalized as a symbolic factual recall task.

  • Knowledge Representation: Atomic knowledge is a triple $(s, r, a)$, where $s$ is a subject, $r \in \{r_1, r_2\}$ is a relation, and $a \in A$ is an answer. $A_1 = \{b_i\}_{i=1}^n$ are facts and $A_2 = \{c_i\}_{i=1}^n$ are implications, linked by the rule $(s, r_1, b_i) \implies (s, r_2, c_i)$.
  • Dataset Construction:
    • $\mathcal{D}_{train}^{(b)}$: facts for $S_{train}$.
    • $\mathcal{D}_{train}^{(c)}$: implications for $S_{train}$.
    • $\mathcal{D}_{test}^{(b)}$: facts for $S_{test}$.
    • $\mathcal{D}_{test}^{(c)}$: implications for $S_{test}$ (held out).
    • Training set: $\mathcal{D}_{train} = \mathcal{D}_{train}^{(b)} \cup \mathcal{D}_{train}^{(c)} \cup \mathcal{D}_{test}^{(b)}$.
    • Test set: $\mathcal{D}_{test} = \mathcal{D}_{test}^{(c)}$.
  • Transformer Architecture: A one-layer, single-head, attention-only, decoder-only transformer is used (a minimal sketch follows after this list).
    • Factorized Model: $f_\Theta(\mathbf{x}) = \mathbf{W}_O \mathbf{W}_V^\top \mathbf{W}_{KQ}^\top \mathbf{x}_T$, with parameters $\Theta = (\mathbf{W}_O, \mathbf{W}_V, \mathbf{W}_{KQ})$, where $\mathbf{W}_{KQ} = \mathbf{W}_K \mathbf{W}_Q^\top$.
    • Non-Factorized Model: $\tilde{\Theta} = (\mathbf{W}_{OV}, \mathbf{W}_{KQ})$ with $\mathbf{W}_{OV} = \mathbf{W}_O \mathbf{W}_V^\top$, giving $f_{\tilde{\Theta}}(\mathbf{x}) = \mathbf{W}_{OV} \mathbf{W}_{KQ}^\top \mathbf{x}_T$.
    • Both models are trained with cross-entropy loss.
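
Below is a minimal PyTorch sketch of the two parameterizations, written to mirror the notation above. The class names, dimensions, learned embeddings, and the use of softmax attention are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class FactorizedAttnOnly(nn.Module):
    """One-layer, single-head, attention-only decoder with separate output
    and value matrices (W_O, W_V) and a merged key-query matrix W_KQ."""
    def __init__(self, vocab_size, d_model, d_head):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.W_KQ = nn.Parameter(torch.randn(d_model, d_model) / d_model ** 0.5)
        self.W_V = nn.Parameter(torch.randn(d_model, d_head) / d_model ** 0.5)
        self.W_O = nn.Parameter(torch.randn(vocab_size, d_head) / d_head ** 0.5)

    def attend(self, tokens):
        x = self.embed(tokens)                           # (batch, seq, d_model)
        q = x[:, -1]                                     # last-token query x_T
        scores = torch.einsum("bd,de,bte->bt", q, self.W_KQ, x)
        return torch.einsum("bt,btd->bd", scores.softmax(-1), x)

    def forward(self, tokens):
        ctx = self.attend(tokens)
        return ctx @ self.W_V @ self.W_O.T               # logits = (W_O W_V^T) applied to context

class NonFactorizedAttnOnly(FactorizedAttnOnly):
    """Same architecture, but with W_O W_V^T merged into a single trainable W_OV."""
    def __init__(self, vocab_size, d_model, d_head):
        super().__init__(vocab_size, d_model, d_head)
        del self.W_V, self.W_O
        self.W_OV = nn.Parameter(torch.randn(vocab_size, d_model) / d_model ** 0.5)

    def forward(self, tokens):
        return self.attend(tokens) @ self.W_OV.T
```

Both variants would then be trained on $\mathcal{D}_{train}$ with standard cross-entropy loss over the answer vocabulary; the only difference is whether the output and value matrices are kept separate.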

One-Layer Transformer Experimental Results:

Experiments show that the factorized model can solve the OCR task (achieving zero test loss), while the non-factorized model fails to generalize to test implications, despite both achieving zero training loss. Mechanism analysis (Figure 1) reveals that the factorized model learns a structured $\mathbf{W}_O \mathbf{W}_V^\top$ matrix that enables OCR, specifically by having non-zero weights in the "test-implication" block, whereas the non-factorized model learns zero weights in this block.

Theoretical Analysis:

The paper provides a theoretical explanation for this difference, focusing on the implicit bias of gradient descent, even when the key-query matrix $\mathbf{W}_{KQ}$ is held fixed.

  • Equivalent Expressivity: Proposition 3.1 states that the factorized $(\mathbf{W}_O, \mathbf{W}_V)$ and non-factorized $\mathbf{W}_{OV}$ parameterizations have equivalent expressive power if $d_h \ge d$.
  • Implicit Bias and SVM Forms (Theorem 4.1):
    • Training the factorized model $(\mathbf{W}_O, \mathbf{W}_V)$ with gradient descent converges to a solution whose product $\mathbf{W}_{OV}^{\text{F}} = \mathbf{W}_O \mathbf{W}_V^\top$ minimizes the squared nuclear norm subject to margin constraints:
      $$\min_{\mathbf{W}_{OV}^{\text{F}}} \frac{1}{2} \| \mathbf{W}_{OV}^{\text{F}} \|_{\star}^2 \quad \text{s.t.} \quad h_{(s,r),a'}(\mathbf{W}_{OV}^{\text{F}}) \geq 1.$$
    • Training the non-factorized model $\mathbf{W}_{OV}$ converges to a solution minimizing the squared Frobenius norm under the same constraints:
      $$\min_{\mathbf{W}_{OV}} \frac{1}{2} \| \mathbf{W}_{OV} \|_F^2 \quad \text{s.t.} \quad h_{(s,r),a'}(\mathbf{W}_{OV}) \geq 1.$$
  • OCR Abilities (Theorem 4.2):
    • The factorized model's solution $\mathbf{W}_{OV}^{\text{F}}$ (nuclear-norm minimizer) exhibits OCR: for test data $(s,r) \in \mathcal{D}_{test}$, the margin satisfies $h_{(s,r),a'}(\mathbf{W}_{OV}^{\text{F}}) \geq \min \{ \sqrt{m_{train}/m_{test}}, 1 \}$. Thus, as long as $m_{train} > 0$ (some entities are seen with both the fact and its implication), the model generalizes.
    • The non-factorized model's solution $\mathbf{W}_{OV}$ (Frobenius-norm minimizer) does not exhibit OCR for implications: for $(s,r) \in \mathcal{D}_{test}$ and $a' \in A_2 \setminus \{a^*(s,r)\}$, $h_{(s,r),a'}(\mathbf{W}_{OV}) = 0$. Frobenius-norm minimization zeroes out the weights for unseen test implications.

This theoretical difference explains the empirical findings. Nuclear norm minimization implicitly encourages low-rank solutions that capture the shared relationship between facts and implications, enabling generalization. Frobenius norm minimization, in contrast, leaves the entries corresponding to unseen test implications at zero, since no training constraint pushes them away from the minimum-norm solution.
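
The contrast can be illustrated with a small, self-contained toy. This is a sketch under different assumptions than the paper's analysis (squared loss and masked matrix completion with small initialization rather than cross-entropy and margin constraints), but it shows the same qualitative gap: both parameterizations are fit to the same observed entries of a subject-by-answer target, and only the factorized model assigns clearly non-zero values to the held-out "test implication" entries.

```python
import torch

torch.manual_seed(0)
m_train, m_test, d = 2, 6, 16
n_subj = m_train + m_test

# Target "logit" matrix: column 0 = fact (observed for all subjects),
# column 1 = implication (observed only for the first m_train subjects).
target = torch.zeros(n_subj, 2)
target[:, 0] = 1.0
target[:m_train, 1] = 1.0
mask = torch.ones_like(target)
mask[m_train:, 1] = 0.0          # held-out test implications

def fit(factorized, steps=20000, lr=0.1):
    if factorized:               # W = A @ B.T with small init: nuclear-norm-like implicit bias
        A = (1e-3 * torch.randn(n_subj, d)).requires_grad_()
        B = (1e-3 * torch.randn(2, d)).requires_grad_()
        params, W = [A, B], lambda: A @ B.T
    else:                        # W trained directly: minimum-Frobenius-norm-like implicit bias
        W0 = torch.zeros(n_subj, 2, requires_grad=True)
        params, W = [W0], lambda: W0
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        loss = ((mask * (W() - target)) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return W().detach()

for factorized in (True, False):
    held_out = fit(factorized)[m_train:, 1].mean().item()
    print(f"factorized={factorized}: mean held-out implication value = {held_out:.3f}")
```

In this toy, the non-factorized fit leaves the held-out entries at exactly zero (their gradients are masked, and nothing couples them to the observed entries), while the factorized fit couples all entries through the shared factors and pushes the held-out entries toward a positive value, loosely mirroring the $\min\{\sqrt{m_{train}/m_{test}}, 1\}$ margin bound above.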

Implications of Theoretical Findings:

  1. Sample Efficiency of OCR: The OCR capability depends on the ratio $m_{train}/m_{test}$, explaining why LLMs can generalize or hallucinate from only a few examples.
  2. Caution with Reparameterization: Combining weight matrices (e.g., $\mathbf{W}_O \mathbf{W}_V^\top \to \mathbf{W}_{OV}$) is common in theoretical analyses but can fundamentally change training dynamics and generalization behavior.

Dynamics with Trainable Key-Query Matrix:

The analysis is extended to a trainable $\mathbf{W}_{KQ}$ matrix. Theorem 4.3 shows that, under a specific initialization (Assumption 4.2) and if $|A_2| > 1$, the non-factorized model still fails to generalize to test implications, with test loss $\mathcal{L}_{test}(\tilde{\Theta}_t) \geq \log |A_2|$. This is due to parameter symmetries that lead to a uniform prediction probability over $A_2$ for test implications.
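
The bound itself is just the cross-entropy of a uniform prediction: if the parameter symmetry forces equal probability over all candidate implications for a test query, then

$$p(c \mid s, r_2) = \frac{1}{|A_2|} \;\; \forall c \in A_2 \quad\Longrightarrow\quad -\log p\big(c^*(s, r_2) \mid s, r_2\big) = \log |A_2|,$$

so the test loss cannot fall below $\log |A_2|$.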

Conclusions and Future Work:

The paper concludes that generalization and hallucination in LLMs after knowledge injection are two sides of the same OCR coin. The implicit bias of gradient descent in factorized transformer models favors solutions that enable this associative reasoning, which is beneficial for causal links but detrimental for spurious correlations. Future work includes extending the analysis to multi-layer transformers and developing methods to mitigate undesirable hallucinations during knowledge injection.

Practical Implementation Considerations:

  • Model Architecture: The factorization of the output and value matrices $(\mathbf{W}_O, \mathbf{W}_V)$ is crucial for enabling OCR. Combining them into a single $\mathbf{W}_{OV}$ before training can hinder this capability.
  • Training Data for OCR: To encourage OCR, training data should include examples where entities are associated with both a fact and its implication. The ratio of such entities ($m_{train}$) to entities seen only with facts ($m_{test}$) governs the strength of OCR.
  • Mitigating Hallucinations: Since OCR can lead to hallucinations by learning spurious correlations, strategies might involve:
    • Carefully curating fine-tuning data to avoid spurious co-occurrences.
    • Techniques to explicitly distinguish causal from correlational relationships during training.
    • Regularization methods beyond standard weight decay that might penalize learning non-causal associations, potentially by influencing the nuclear norm minimization landscape.
  • Debugging Generalization/Hallucination: If a model fails to generalize or excessively hallucinates after fine-tuning, examining the structure of the learned $\mathbf{W}_O \mathbf{W}_V^\top$ matrix (or its equivalent in deeper models) might provide insight. A low-rank structure that captures the intended associations is desirable for generalization (a minimal inspection snippet follows after this list).
  • Computational Cost: The theoretical analysis focuses on one-layer models. In practice, LLMs are much deeper. However, the principle that matrix factorization influences implicit regularization and learning dynamics is likely to hold, albeit in more complex ways. The sample efficiency of OCR (needing few examples) is a key takeaway for practical fine-tuning.
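
As a concrete, purely illustrative example of the debugging idea above, one could inspect the spectrum and the relevant block of the combined output-value matrix of a trained one-layer model. The weights and index lists below are placeholders (assuming one-hot token embeddings, so columns of the combined matrix index vocabulary items directly):

```python
import torch

# Placeholders for a trained factorized model's weights and vocabulary indices.
W_O = torch.randn(100, 32)        # (vocab_size, d_head), trained output matrix
W_V = torch.randn(100, 32)        # (d_model = vocab_size, d_head), trained value matrix
test_subject_ids = [40, 41, 42]   # columns: test subjects (facts only during training)
implication_ids = [80, 81, 82]    # rows: implication answer tokens

W_OV = W_O @ W_V.T                                   # combined output-value matrix
singular_values = torch.linalg.svdvals(W_OV)
effective_rank = int((singular_values > 1e-3 * singular_values[0]).sum())

block = W_OV[implication_ids][:, test_subject_ids]   # "test-implication" block
print("effective rank:", effective_rank)
print("test-implication block norm:", block.norm().item())
```

In the one-layer setting analyzed here, a model that has learned the intended association should place substantial weight in this block and have an approximately low-rank $\mathbf{W}_{OV}$; near-zero weights in the block correspond to the non-factorized model's failure mode.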

This work provides a theoretical basis for understanding a fundamental aspect of LLM behavior, offering a new perspective for analyzing and improving knowledge injection techniques.
