- The paper demonstrates that truthfulness geometries in LLMs are highly task-specific: sparsity-regularized linear classifiers rely on nearly disjoint sets of activation dimensions across different tasks.
- It shows that the weight vectors of linear probes trained on different tasks are largely orthogonal to one another, and that this orthogonality tracks the limited transferability of probes across domains.
- The study finds that even mixture-of-tasks training fails to overcome domain-specific limitations, emphasizing the need for task-aware calibration in LLMs.
Analysis of Task-Specific Truthfulness Geometries in LLMs
The paper "The Geometries of Truth Are Orthogonal Across Tasks" presents a comprehensive examination of truthfulness detection in LLMs across different tasks by analyzing the activation patterns at inference time. This work critically evaluates the concept of a "geometry of truth" within these models, which posits that the activations correlating with correct answers can be linearly separable from those related to incorrect ones. The authors aim to elucidate the properties and limitations of such geometries when considering distinct task domains.
To achieve this, the paper systematically investigates the cross-task generalization of linear classifiers trained to discern truthfulness from LLM activations. The key findings are outlined as follows:
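To make the probing setup concrete, below is a minimal sketch (not the authors' code) of how a linear truthfulness probe can be trained on hidden-state activations for a single task; the activation matrix, layer choice, and dataset sizes here are placeholder assumptions.

```python
# Minimal sketch of a per-task linear truthfulness probe (illustrative only).
# Assumes `acts` holds one activation vector per model answer (e.g., a chosen
# layer's last-token hidden state) and `labels` marks whether the answer was correct.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 768))      # placeholder activations: (n_answers, hidden_dim)
labels = rng.integers(0, 2, size=1000)   # placeholder correctness labels

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("in-task accuracy:", probe.score(X_test, y_test))

# Cross-task generalization is assessed by evaluating this probe on activations
# collected from a *different* task; the paper reports that this transfer
# accuracy degrades sharply.
```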
Key Findings
- Task-Specific Truthfulness Geometries:
- Linear classifiers trained on activations from distinct tasks diverge substantially and transfer poorly between tasks, revealing that truthfulness geometries are inherently task-specific.
- When these classifiers are trained with sparsity-inducing regularizers, the supports of the resulting weight vectors are nearly disjoint across tasks, suggesting weak inter-task transferability (a minimal sketch after this list illustrates how support overlap and weight orthogonality can be measured).
- Geometric Analysis of Orthogonality:
- The weight vectors of linear probes trained on different tasks are largely orthogonal to one another, supporting the notion that the internal representation of truthfulness varies significantly across domains.
- This orthogonality correlates strongly with the drop in cross-task generalization performance, highlighting the difficulty of applying a single probe across multiple task categories.
- Limited Impact of Mixture-of-Tasks Training:
- Training on a mixture of diverse tasks did not close the generalization gap, as the optimal truthfulness direction for one task could not be reliably composed from those learned on other tasks.
- Even advanced architectures like "mixture of probes," designed to handle multiple tasks simultaneously, failed to surpass the performance of simple linear probes trained on individual tasks.
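The two measurements above, support overlap under sparsity and orthogonality of probe directions, can be illustrated with a short sketch. This is not the paper's implementation; the per-task data, regularization strength, and dimensions are placeholder assumptions.

```python
# Illustrative sketch: fit an L1-regularized (sparsity-inducing) probe per task,
# then compare supports (Jaccard overlap) and directions (cosine similarity).
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_sparse_probe(X, y, C=0.1):
    """L1-penalized linear probe; a small C encourages a sparse weight vector."""
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X, y)
    return clf.coef_.ravel()

rng = np.random.default_rng(0)
task_data = {  # placeholder activations/labels standing in for real per-task data
    name: (rng.normal(size=(500, 512)), rng.integers(0, 2, size=500))
    for name in ("task_a", "task_b")
}
w_a = fit_sparse_probe(*task_data["task_a"])
w_b = fit_sparse_probe(*task_data["task_b"])

# Support overlap: fraction of shared nonzero coordinates (Jaccard index);
# a value near 0 means the probes rely on nearly disjoint activation dimensions.
supp_a, supp_b = w_a != 0, w_b != 0
jaccard = (supp_a & supp_b).sum() / max((supp_a | supp_b).sum(), 1)

# Orthogonality: cosine similarity between the probe weight vectors;
# a value near 0 means the truthfulness directions are nearly orthogonal.
cosine = float(w_a @ w_b) / (np.linalg.norm(w_a) * np.linalg.norm(w_b) + 1e-12)

print(f"support Jaccard overlap: {jaccard:.3f}")
print(f"weight cosine similarity: {cosine:.3f}")
```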
Implications and Future Directions
The findings have significant implications for the deployment and reliability of LLMs in real-world applications. The strong task-dependence of truthfulness geometries indicates the necessity of task-specific training or fine-tuning, especially in high-stakes areas where accuracy is critical. The failure of more complex probing architectures to close the gap suggests that the limitation is intrinsic to how truthfulness is represented, rather than a shortcoming of existing probing methodology.
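For readers curious what a "mixture of probes" head might look like, the following is a hypothetical sketch under stated assumptions (the module name, sizes, and softmax gating scheme are illustrative, not the authors' architecture): a gating network softly routes each activation to several task-specific linear probes and combines their truthfulness logits. The paper's finding is that even heads of this kind did not outperform simple per-task linear probes.

```python
# Hypothetical "mixture of probes" head (illustrative sketch, not the paper's code).
import torch
import torch.nn as nn

class MixtureOfProbes(nn.Module):
    def __init__(self, hidden_dim: int, n_probes: int):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, n_probes)    # per-example routing weights
        self.probes = nn.Linear(hidden_dim, n_probes)  # one truthfulness logit per probe

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) activations -> (batch,) combined truthfulness logits
        gate_weights = torch.softmax(self.gate(h), dim=-1)
        probe_logits = self.probes(h)
        return (gate_weights * probe_logits).sum(dim=-1)

# Usage on placeholder activations:
head = MixtureOfProbes(hidden_dim=4096, n_probes=4)
h = torch.randn(8, 4096)
print(head(h).shape)  # torch.Size([8])
```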
Future research could focus on:
- Developing methodologies to dynamically recognize and adapt to domain shifts during deployment to mitigate false outputs in unseen tasks.
- Investigating the potential of non-linear embedding spaces that may accommodate more cohesive truthfulness representations across diverse tasks.
- Further exploration of task-contextualization approaches that adapt LLMs on the fly to produce task-specific truthfulness representations.
This paper underscores the difficulty of generalizing LLM truthfulness detection across multiple domains and counsels caution in relying on LLM assertions without task-aware calibration. The ongoing pursuit of reliable AI systems should treat these findings as essential insights toward more robust and adaptable LLMs. The paper thereby sharpens our understanding of post-hoc evaluation tools for verifying LLM outputs across varied applications.