Reasoning Subspace Projection in HARP
- Reasoning Subspace Projection is a technique that decomposes LLM hidden states into semantic and reasoning subspaces using SVD.
- It isolates internal cognitive traces from direct linguistic output, yielding compact features for accurate hallucination detection.
- Empirical results demonstrate significant AUROC improvements on benchmark datasets, showcasing its efficiency and robustness.
Reasoning Subspace Projection is a core mechanism of HARP (Hallucination Detection via Reasoning Subspace Projection), a framework designed to disentangle and isolate internal reasoning information within the hidden states of LLMs for the purpose of robust hallucination detection. HARP establishes that the hidden state space of an LLM can be orthogonally decomposed into two complementary subspaces: the semantic subspace, responsible for linguistic and predictive content, and the reasoning subspace, which encodes internal cognitive traces not used directly in output generation. This decomposition leverages Singular Value Decomposition (SVD) applied to the Unembedding layer’s parameter matrix, resulting in a low-dimensional representation that centers on reasoning traces. Subsequent projection onto this reasoning subspace provides highly compact and discriminative features for detecting hallucinations in generated sequences, yielding state-of-the-art performance across standard benchmarks (Hu et al., 15 Sep 2025).
1. Direct-Sum Decomposition of Hidden States
Let $\mathcal{H} \subseteq \mathbb{R}^{d}$ denote the $d$-dimensional hidden state space at layer $\ell$ of an LLM. The fundamental hypothesis of HARP is that $\mathcal{H}$ admits an orthogonal decomposition:

$$\mathcal{H} = \mathcal{S} \oplus \mathcal{R},$$

where:
- $\mathcal{S}$ (the semantic subspace) encodes information essential for next-token prediction,
- $\mathcal{R}$ (the reasoning subspace) captures internal reasoning activity disentangled from immediate token output.
For any hidden state $h \in \mathcal{H}$, this yields the additive split:

$$h = h_{\mathcal{S}} + h_{\mathcal{R}},$$

with $h_{\mathcal{S}} \in \mathcal{S}$, $h_{\mathcal{R}} \in \mathcal{R}$, and $h_{\mathcal{S}} \perp h_{\mathcal{R}}$. This orthogonality is a key property enabling explicit separation of reasoning-related dynamics from surface-level linguistic information.
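As a minimal numerical illustration of this direct-sum property (not taken from the paper; all names and dimensions below are illustrative), the following sketch splits an arbitrary orthonormal basis of $\mathbb{R}^d$ into two complementary blocks and verifies that the resulting components are orthogonal and sum back to the original hidden state:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 12                       # illustrative hidden size and semantic rank

# Any orthonormal basis of R^d, split into two complementary blocks.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
V_sem, V_rea = Q[:, :k], Q[:, k:]   # bases for S and R

h = rng.standard_normal(d)          # a hidden state
h_sem = V_sem @ (V_sem.T @ h)       # component lying in the semantic subspace
h_rea = V_rea @ (V_rea.T @ h)       # component lying in the reasoning subspace

assert np.allclose(h, h_sem + h_rea)   # additive split h = h_S + h_R
assert abs(h_sem @ h_rea) < 1e-10      # orthogonality <h_S, h_R> = 0
```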
2. Disentanglement via Unembedding Singular Value Decomposition
HARP empirically demonstrates that the Unembedding layer's parameter matrix, $W_U \in \mathbb{R}^{|V| \times d}$ (with $|V|$ the vocabulary size), acts as a filter that projects away the reasoning subspace components. To expose the underlying structure, one performs the SVD:

$$W_U = U \Sigma V^{\top},$$

where $U \in \mathbb{R}^{|V| \times |V|}$, $V \in \mathbb{R}^{d \times d}$, and $\Sigma \in \mathbb{R}^{|V| \times d}$ contains the singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_d \geq 0$.
Subspace assignment is determined by the energy captured in the singular values:
- the top $k$ singular values (capturing the dominant fraction of the Frobenius-norm energy) correspond to semantics,
- the remaining $d - k$ dimensions are designated as reasoning.
The right singular vectors $V_{\mathcal{S}} = [v_1, \ldots, v_k]$ and $V_{\mathcal{R}} = [v_{k+1}, \ldots, v_d]$ form orthonormal bases for $\mathcal{S}$ and $\mathcal{R}$, respectively.
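A minimal sketch of this construction in NumPy follows; the function name and the Frobenius-energy cutoff `energy` are illustrative assumptions, not the paper's exact values or API:

```python
import numpy as np

def split_unembedding_subspaces(W_U: np.ndarray, energy: float = 0.99):
    """Split the right singular vectors of the unembedding matrix W_U
    (shape: vocab_size x d) into semantic and reasoning bases.

    `energy` is an illustrative Frobenius-energy cutoff, not the paper's value.
    """
    # Economy SVD: W_U = U @ diag(s) @ Vt, with Vt of shape (d, d) when vocab >= d.
    _, s, Vt = np.linalg.svd(W_U, full_matrices=False)
    cum_energy = np.cumsum(s**2) / np.sum(s**2)       # cumulative Frobenius-norm energy
    k = int(np.searchsorted(cum_energy, energy)) + 1  # smallest k reaching the cutoff
    V = Vt.T                                          # columns = right singular vectors
    return V[:, :k], V[:, k:], k                      # V_S, V_R, split rank

# Example: V_sem, V_rea, k = split_unembedding_subspaces(W_U)
```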
3. Reasoning Subspace Projection and Feature Construction
After constructing the basis for the reasoning subspace, any hidden state $h \in \mathcal{H}$ is decomposed via

$$h = V_{\mathcal{S}} V_{\mathcal{S}}^{\top} h + V_{\mathcal{R}} V_{\mathcal{R}}^{\top} h,$$

and the projection onto the reasoning subspace is given by

$$r = V_{\mathcal{R}}^{\top} h \in \mathbb{R}^{d-k}.$$

The resulting $r$ is a vector of dimension $d - k$, providing a noise-filtered, reasoning-centric representation. This compactness is leveraged for efficient and robust downstream discrimination.
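Given the bases above, the projection step reduces to a single matrix product per token. The helper below is an illustrative sketch (the function name and shapes are assumptions, not the paper's API):

```python
import numpy as np

def reasoning_features(hidden_states: np.ndarray, V_rea: np.ndarray) -> np.ndarray:
    """Project per-token hidden states (seq_len x d) onto the reasoning basis
    V_rea (d x (d - k)); each row of the result is r_t = V_R^T h_t."""
    return hidden_states @ V_rea                      # shape: (seq_len, d - k)

# Usage with the bases from the previous sketch:
# R = reasoning_features(H_layer, V_rea)              # compact, reasoning-centric features
```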
4. Detector Architecture, Training Paradigm, and Loss
For hallucination detection, HARP employs the following methodology:
- At each token position $t$ in an answer $a$, extract the reasoning-subspace feature $r_t = V_{\mathcal{R}}^{\top} h_t$.
- Pass each $r_t$ through a two-layer MLP with ReLU activations to obtain a scalar token-level score $s_t$.
- Aggregate the token-level scores $\{s_t\}$ into a sequence-level hallucination score $S(a)$.
- Label training data using beam search to sample answer candidates and annotate them with hallucination flags.
- Apply the binary cross-entropy loss: $\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log S(a_i) + (1 - y_i) \log\bigl(1 - S(a_i)\bigr) \right]$.
- Optimization uses Adam with batch size 128, a cosine-decayed learning rate, and 50 training epochs.
This configuration yields high discrimination power while maintaining computational efficiency due to the severe dimensionality reduction of the input features.
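The detector itself can be sketched in a few lines of PyTorch; the hidden width, learning rate, and mean-pooling aggregation of token scores below are illustrative placeholders rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class HallucinationHead(nn.Module):
    """Two-layer MLP scorer over reasoning-subspace features; hidden width and
    mean-pooling of token scores are illustrative, not the paper's choices."""
    def __init__(self, in_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, r: torch.Tensor) -> torch.Tensor:
        # r: (batch, seq_len, d - k) reasoning features
        token_scores = self.mlp(r).squeeze(-1)        # per-token scores s_t
        return token_scores.mean(dim=-1)              # sequence-level logit

model = HallucinationHead(in_dim=64)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is a placeholder value
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=50)
loss_fn = nn.BCEWithLogitsLoss()                      # binary cross-entropy

# One illustrative step on dummy data (batch size 128, as in the training setup).
r = torch.randn(128, 32, 64)                          # fake reasoning features
y = torch.randint(0, 2, (128,)).float()               # hallucination labels
loss = loss_fn(model(r), y)
loss.backward(); opt.step(); opt.zero_grad(); sched.step()
```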
5. Empirical Performance and Benchmark Results
HARP was evaluated using Qwen-2.5-7B-Instruct and LLaMA-3.1-8B as backbone LLMs across four QA-oriented datasets:
| Dataset | HARP AUROC | Next-Best AUROC | Absolute Gain |
|---|---|---|---|
| NQ-Open | 84.0% | 78.9% | +5.1 pts |
| TruthfulQA | 88.1% | 77.7% | +10.4 pts |
| TriviaQA | 92.8% | 85.3% | +7.5 pts |
| TyDiQA-GP | 88.4% | 74.8% | +13.6 pts |
Comparable or greater absolute gains are obtained using LLaMA-3.1-8B (e.g., on TriviaQA, HARP achieves 92.9% versus previous best near 76%). These results establish HARP’s Reasoning Subspace Projection as state-of-the-art for hallucination detection, surpassing prior detectors by significant margins (Hu et al., 15 Sep 2025).
6. Implications, Robustness, and Potential Extensions
Filtering out the high-dimensional semantic subspace (the top-$k$ directions of $\mathcal{H}$) and focusing on the residual reasoning subspace permits the learning of highly informative yet compact features. This approach produces a single-pass hallucination detector and demonstrates strong robustness under distribution shift, with models trained on one QA dataset maintaining performance when evaluated on others.
Potential extensions documented include:
- Adaptive selection of the split rank $k$ to flexibly trade off between semantic and reasoning content.
- Integration of projection-based hallucination avoidance into generation loops, e.g., by manipulating $h_{\mathcal{R}}$ in intermediate model layers.
- Application of the decomposition to additional text generation domains beyond QA, such as summarization or dialogue, and to non-autoregressive architectures.
- Further interpretability studies examining individual reasoning basis vectors (columns of $V_{\mathcal{R}}$) in connection with logical or knowledge-centric processing steps.
A plausible implication is that further exploration of the reasoning subspace could yield insights into internal cognitive representations in LLMs, and that suppression or augmentation of $h_{\mathcal{R}}$ may impact not only hallucination but general model reliability.
7. Context and Related Methodologies
Reasoning Subspace Projection as operationalized in HARP is among the first frameworks to systematically leverage structural decomposition of LLM hidden states for hallucination detection, addressing limitations of prior detectors that did not separate semantic and reasoning components. The approach is notable for its robustness, efficiency (through dimensionality reduction), and adaptability to different architectures and tasks (Hu et al., 15 Sep 2025).
Related future research directions include deeper analysis of Unembedding layer properties, extensions to more diverse multilingual and multimodal tasks, and development of causal or contrastive probing tools to further dissect the functional role of semantic and reasoning subspaces.