Exploring the Generalization of LLM Truth Directions on Conversational Formats
This paper, authored by Timour Ichmoukhamedov and David Martens, offers a detailed exploration of how well truth detection in LLMs generalizes to conversational formats. The investigation builds on the hypothesis, suggested by recent research, that true and false statements are linearly separable in an LLM's activation space.
Core Findings
The paper primarily examines how these truth directions behave across various types of conversational inputs. It extends previous studies by testing whether linear probes trained on the hidden states of simple true/false statements transfer to different conversation formats (a minimal probe-training sketch follows the list below). The three key findings are:
- Generalization Successes: Probes trained on simple true/false statements generalize well to short conversational inputs in which the lie appears at the end of the sequence.
- Generalization Failures: Conversely, the probes generalize poorly to longer conversational inputs in which the falsehood appears earlier in the sequence. This suggests that a single truth direction does not transfer robustly across conversation formats, posing a challenge for the development of reliable LLM-based lie detectors.
- Improvement Strategies: Appending a fixed key phrase to the end of each conversation substantially improves generalization. The phrase appears to refocus the model on assessing truthfulness at the point where the probe reads the activation, even when the lie occurs earlier in the input sequence.
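To make the probing setup concrete, here is a minimal sketch of training a linear truth probe on last-token hidden states, with the fixed key phrase appended before the activation is read. The model name, layer index, key-phrase wording, and toy statements are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: linear truth probe on last-token hidden states (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "EleutherAI/pythia-1.4b"   # assumed model; any causal LM works
LAYER = 12                              # assumed intermediate layer
KEY_PHRASE = " Please assess whether the above is true."  # assumed wording

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_state(text: str) -> torch.Tensor:
    """Return the hidden state of the final token at the chosen layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1, :]

# Toy labelled statements (1 = true, 0 = false); real experiments use
# curated true/false datasets and conversational inputs.
statements = [
    ("The city of Paris is in France.", 1),
    ("The city of Paris is in Brazil.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Water boils at 10 degrees Celsius at sea level.", 0),
]

# Appending the fixed key phrase nudges the model to evaluate truthfulness
# at the final token, where the probe reads the activation.
X = torch.stack(
    [last_token_state(s + KEY_PHRASE) for s, _ in statements]
).float().numpy()
y = [label for _, label in statements]

probe = LogisticRegression(max_iter=1000).fit(X, y)
# At test time, append the same key phrase to conversational inputs and
# score them with probe.predict_proba(...).
```

The same probe can then be applied to conversations of different lengths and formats, which is exactly the kind of transfer the paper evaluates.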
Implications
The implications of these findings are significant for AI safety and mechanistic interpretability. Understanding the activation patterns that correspond to truth and lies is a prerequisite for building reliable lie detectors on top of LLMs, which would improve trustworthiness and reduce the risk of AI deception. Furthermore, the research points to standardized prompt formats as a practical way to make falsehood detection more reliable across diverse contexts.
Future Research Directions
Future work could extend these findings by testing the robustness of the observed generalization patterns across different LLM architectures and larger models. Additionally, investigating synthetic, LLM-generated conversation formats might offer pathways to more versatile lie detection mechanisms. Addressing biases and exploring more complex conversational structures could also refine the effectiveness of truth detection probes.
Overall, the paper contributes valuable insights into truth detection capabilities within LLMs and highlights the complexities and challenges in ensuring broad generalization across various conversational formats. The strategic addition of a key phrase points towards practical approaches to improving these capabilities, paving the way for future exploration and development within the field.