- The paper demonstrates that LLM truthfulness representations collapse under out-of-distribution transformations, with AUC scores sharply declining even when semantics remain intact.
- It employs non-linear, linear, and next-token probability probes across diverse datasets, revealing that high benchmark accuracy does not guarantee robust internal knowledge.
- Findings show that larger models are not inherently more resilient, highlighting the need for training innovations to achieve invariant, robust representations.
Brittle Knowledge in LLMs: Truthfulness Representations Depend on Superficial Resemblance
Introduction
This paper presents a systematic investigation into the robustness of truthfulness representations in LLMs, focusing on whether internal knowledge representations generalize beyond the superficial forms encountered during pre-training. The authors build on prior work demonstrating that LLM hidden states encode separable representations of true versus false statements, and extend this line of inquiry by probing the stability of these representations under semantically preserving but distribution-shifting transformations. The central claim is that LLM knowledge is brittle: separability of truthfulness representations degrades rapidly as input samples become more out-of-distribution (OOD) with respect to the pre-training data, even when semantic content is preserved.

Figure 1: Semantically preserving transformations (e.g., typos, paraphrases, translation) drive samples OOD, causing collapse of true/false statement representations and degrading separability.
Methodology
Probing Techniques
Three probing methods are employed to assess the separability of true and false statements:
- Non-linear probe: A 3-layer feedforward neural network trained on residual stream activations of the final token.
- Linear probe: A single-layer logistic regression on the same activations.
- P(True): An output-based method using next-token probabilities in a multiple-choice format, normalized to yield a probability of truthfulness.
Probe performance is quantified via AUC (Area Under the ROC Curve), with higher AUC indicating better separability.
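As a concrete illustration, the sketch below trains a linear probe of this kind on placeholder activations and scores separability with AUC. The array shapes, the synthetic data, and the scikit-learn implementation are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of a linear truthfulness probe, assuming activations have
# already been extracted: X is an (n_statements, d_model) array of final-token
# residual-stream activations, y marks each statement true (1) or false (0).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4096))      # placeholder activations (hypothetical)
y = rng.integers(0, 2, size=1000)      # placeholder true/false labels
X[y == 1] += 0.05                      # inject a weak "truth direction" so the probe has signal

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# AUC on held-out statements: 1.0 = perfectly separable, 0.5 = chance.
scores = probe.predict_proba(X_test)[:, 1]
print(f"linear probe AUC: {roc_auc_score(y_test, scores):.3f}")
```

The non-linear probe would follow the same recipe with a small feedforward network in place of `LogisticRegression`, while P(True) reads separability directly from the model's next-token distribution rather than from a trained classifier.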


Figure 2: Linear probe architecture for classifying true/false statements from LLM activations.
Models and Datasets
Ten decoder-only autoregressive LLMs spanning four model families (OLMo, Llama 3.1, Llama 3.2, and Gemma-3) are evaluated. Four benchmark datasets are used: True-False, MMLU, OpenBookQA, and TruthfulQA, covering a range of domains and question formats.
To drive samples OOD, the following transformations are applied:
- Typos and punctuation noise (character-level perturbations)
- Syntactic negation
- Clause reordering ("Yoda speak")
- Translation to French and Spanish
These transformations preserve semantic content but alter surface form, enabling analysis of robustness to superficial changes.
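For intuition, here is a minimal sketch of the character-level typo perturbation; the specific edit operations and the 10% noise rate are illustrative assumptions rather than the paper's exact procedure.

```python
# Sketch of a character-level "typo" perturbation that keeps meaning intact
# while shifting surface form. Edit types and the 10% rate are assumptions.
import random

def add_typos(text: str, rate: float = 0.1, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            op = rng.choice(["drop", "double", "swap_case"])
            if op == "drop":
                continue                      # delete the character
            elif op == "double":
                out.extend([ch, ch])          # duplicate the character
            else:
                out.append(ch.swapcase())     # flip the case
        else:
            out.append(ch)
    return "".join(out)

print(add_typos("The Eiffel Tower is located in Paris."))
```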
Measuring OOD-ness
Statement perplexity, computed by the LLM, is used as a proxy for OOD-ness. Validation against n-gram statistics from the OLMo pre-training corpus confirms that higher perplexity correlates with lower representation in the training data.
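A minimal sketch of computing statement perplexity with a Hugging Face causal LM follows; the `gpt2` checkpoint is a stand-in for the evaluated models (OLMo, Llama 3.x, Gemma-3), and the paper's exact normalization may differ.

```python
# Sketch: statement perplexity from a causal LM as an OOD proxy.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper evaluates OLMo, Llama 3.x, and Gemma-3 checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(statement: str) -> float:
    inputs = tokenizer(statement, return_tensors="pt")
    with torch.no_grad():
        # With labels=input_ids the model returns mean token-level
        # cross-entropy; exponentiating gives perplexity.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("The Eiffel Tower is located in Paris."))
print(perplexity("Teh Eifel Towr is locatd in Paris."))  # perturbed version, expected higher
```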


Figure 3: Strong negative correlation between statement perplexity and n-gram count in pre-training data, validating perplexity as an OOD proxy.
Main Findings
Degradation of Truthfulness Representations
Across all probing methods, datasets, and model families, separability of true and false statements degrades as samples become more OOD. On the original True-False dataset, all probes achieve high AUC (≥0.96), but AUC drops sharply with increasing perplexity. The degradation slope (β) is consistently negative, with P(True) showing the steepest decline.
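To make the slope concrete, the sketch below estimates a degradation slope by binning statements by perplexity, computing probe AUC per bin, and fitting a line against mean log-perplexity. The binning scheme, the log scale, and the synthetic data are assumptions for illustration, not necessarily the authors' fitting procedure.

```python
# Sketch: estimating a degradation slope beta by regressing per-bin probe AUC
# on average log-perplexity. Inputs assumed: `probs` are probe scores,
# `labels` are true/false labels, `ppl` are statement perplexities.
import numpy as np
from sklearn.metrics import roc_auc_score

def degradation_slope(probs, labels, ppl, n_bins=5):
    probs, labels, ppl = map(np.asarray, (probs, labels, ppl))
    order = np.argsort(ppl)
    bins = np.array_split(order, n_bins)         # equal-size perplexity bins

    xs, ys = [], []
    for idx in bins:
        if len(set(labels[idx])) < 2:            # AUC needs both classes present
            continue
        xs.append(np.log(ppl[idx]).mean())       # mean log-perplexity of the bin
        ys.append(roc_auc_score(labels[idx], probs[idx]))

    beta, intercept = np.polyfit(xs, ys, deg=1)  # slope of AUC vs. log-perplexity
    return beta

# Synthetic example: separability shrinks as perplexity rises, so beta < 0.
rng = np.random.default_rng(0)
ppl = rng.uniform(5, 500, size=2000)
labels = rng.integers(0, 2, size=2000)
signal = 1.0 - np.log(ppl) / np.log(500)
probs = np.clip(0.5 + signal * (labels - 0.5) + rng.normal(0, 0.3, 2000), 0, 1)
print(f"beta ≈ {degradation_slope(probs, labels, ppl):.2f}")
```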


Figure 4: Probe AUC decreases as average perplexity increases, indicating collapse of truthfulness representations under OOD shift.
This pattern generalizes to MMLU, OpenBookQA, and TruthfulQA, with the steepest degradation observed for MMLU (β=−1.76 for the non-linear probe).


Figure 5: Non-linear probe AUC on MMLU degrades with increasing perplexity, demonstrating brittleness across domains.
Model Scale and Family Effects
All tested model families exhibit the same qualitative pattern of degradation. Notably, larger models (e.g., Llama 3.1 70B) are less robust than their smaller counterparts, contradicting the expectation that scale improves generalization of internal knowledge representations.

Figure 6: Non-linear probe AUC vs. perplexity for multiple model families; all show degradation, with larger models degrading faster.
Filtering MMLU to questions answered correctly by the model does not yield more robust truthfulness representations. The degradation slope for the correct subset is nearly parallel to that of the full dataset, indicating that high benchmark accuracy does not imply robust internal knowledge.


Figure 7: Probe AUC for correct-only MMLU subset degrades at the same rate as the full set, showing that correct answers do not guarantee robust representations.
Topic-Level Variability
Analysis by MMLU topic reveals substantial variability in robustness. Some topics (e.g., sociology, marketing) maintain high separability and robustness under OOD shift, while others (e.g., history) degrade sharply. Importantly, robustness is not explained by pre-training coverage (as measured by perplexity), nor by benchmark accuracy.

Figure 8: Benchmark accuracy vs. probe AUC by topic; high accuracy does not guarantee robust separability.
Most transformations degrade AUC in proportion to their effect on perplexity. However, translation causes a pronounced drop in AUC without a corresponding increase in perplexity, indicating that knowledge representations can collapse even when samples remain relatively in-distribution.
Implications
Theoretical Implications
The findings challenge the assumption that LLMs acquire robust, generalizable knowledge during pre-training. Instead, internal representations of truthfulness are tightly coupled to superficial resemblance to training data, undermining the reliability of LLMs in OOD scenarios. This brittleness is not mitigated by increased model scale or pre-training coverage, suggesting a fundamental limitation in current LLM architectures and training regimes.
Practical Implications
The brittleness of truthfulness representations has direct consequences for the deployment of LLMs in real-world settings, where input variation is inevitable. Reliance on probing methods to improve factuality or reliability is fundamentally limited by the instability of underlying representations. Benchmark scores are insufficient indicators of robust knowledge, and post-training methods that elicit latent knowledge may not address the core issue.
Future Directions
Improving the robustness of knowledge representations requires architectural or training innovations that promote invariance to superficial input variation. Data-centric approaches focusing on diversity and quality, as well as methods for enforcing semantic invariance in internal representations, are promising avenues. Further research is needed to disentangle the effects of memorization, reasoning, and generalization in LLMs, and to develop evaluation protocols that capture robustness beyond benchmark accuracy.
Conclusion
This work provides compelling evidence that LLM knowledge is brittle: internal truthfulness representations degrade rapidly under semantically preserving, superficial input transformations. The phenomenon is consistent across models, datasets, and probing methods, and is not alleviated by increased scale or pre-training coverage. These results highlight a critical challenge for the reliability and generalizability of LLMs, motivating future research into robust knowledge encoding and evaluation.