ActivationReasoning: Logical Reasoning in LLMs
- ActivationReasoning is a framework that extracts interpretable latent features from LLMs using sparse autoencoders to represent logical propositions.
- It maps neural activations to explicit concept representations, enabling formal logic and structured reasoning in tasks like multi-hop question answering and safety classification.
- The framework includes a latent steering mechanism to bias model outputs, enhancing trust, transparency, and adherence to safety rules.
ActivationReasoning (AR) is a framework that embeds explicit logical reasoning in the latent activation spaces of LLMs by organizing and manipulating interpretable latent features extracted via sparse autoencoders (SAEs). AR transforms opaque neural activations into transparent, compositional propositions, enabling the application of formal logic, structured reasoning, and model control in domains such as multi-hop question answering, abstraction, safety classification, and context-sensitive inference (Helff et al., 21 Oct 2025).
1. Latent Concept Identification
The first stage of AR involves constructing a dictionary of latent concept representations. Each concept $c$ is characterized by:
- A concept name (e.g., "Bridge", "USA")
- A latent representation $\ell_c$, which can be:
  - Single-feature: $\ell_c = \{i\}$ (tied to a single SAE feature $i$)
  - Multi-feature: $\ell_c = \{i_1, \dots, i_k\}$ (a group of SAE features capturing polysemy)
  - Relational-feature: $\ell_c = T_c$ (encoding relations via, e.g., shallow decision trees over SAE features)

For each concept $c$, $\ell_c$ is defined as:
- For single-feature concepts: the index of a single SAE feature
- For multi-feature concepts: a set of SAE feature indices together with a weighting vector $w_c$
- For relational concepts: $T_c$, a decision tree induced from SAE feature activations
A soft threshold $\tau_c$ is then selected for each concept by maximizing balanced accuracy, i.e., the mean of the true positive and true negative rates:

$$\tau_c = \arg\max_{\tau} \; \tfrac{1}{2}\bigl(\mathrm{TPR}_c(\tau) + \mathrm{TNR}_c(\tau)\bigr)$$
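To make the dictionary construction concrete, the following is a minimal Python sketch of a per-concept record and balanced-accuracy threshold selection; the names (`Concept`, `fit_threshold`) and field layout are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed API): a concept dictionary entry and threshold fitting
# via balanced accuracy; not the authors' implementation.
from dataclasses import dataclass

import numpy as np


@dataclass
class Concept:
    name: str                          # e.g., "Bridge", "USA"
    feature_ids: list[int]             # one SAE feature id (single) or several (multi)
    weights: np.ndarray | None = None  # weighting vector w_c for multi-feature concepts
    threshold: float = 0.0             # soft threshold tau_c


def fit_threshold(scores: np.ndarray, labels: np.ndarray) -> float:
    """Select tau_c maximizing balanced accuracy = (TPR + TNR) / 2."""
    best_tau, best_bacc = 0.0, -1.0
    for tau in np.unique(scores):
        pred = scores >= tau
        tpr = (pred & (labels == 1)).sum() / max((labels == 1).sum(), 1)
        tnr = (~pred & (labels == 0)).sum() / max((labels == 0).sum(), 1)
        bacc = 0.5 * (tpr + tnr)
        if bacc > best_bacc:
            best_tau, best_bacc = float(tau), bacc
    return best_tau
```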
2. Activation Mapping and Proposition Extraction
At inference time, AR computes, for each token $t$ and concept $c$, an activation score $s_c(t)$ from the SAE feature activations $z_t$:
- For single-feature concepts: $s_c(t) = z_{t,i}$ (direct SAE feature lookup)
- For multi-feature concepts: $s_c(t) = \sum_{i \in \ell_c} w_{c,i}\, z_{t,i}$, where $w_c$ is a weighting vector
- For relational concepts: $s_c(t) = T_c(z_t)$, i.e., via the decision tree

The thresholding operation yields a binary activation matrix $A \in \{0,1\}^{T \times |\mathcal{C}|}$ with entries

$$A_{t,c} = \mathbb{1}\bigl[s_c(t) \geq \tau_c\bigr]$$
This matrix encodes the presence ("activation") of human-interpretable concepts at each token instance, which form atomic logical propositions.
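A sketch of this activation-mapping step, under the same assumed `Concept` structure as above (relational concepts via decision trees are omitted for brevity):

```python
import numpy as np


def activation_matrix(z: np.ndarray, concepts: list["Concept"]) -> np.ndarray:
    """Map SAE activations z of shape (T, d_sae) to a binary matrix A of shape (T, |C|)."""
    A = np.zeros((z.shape[0], len(concepts)), dtype=bool)
    for j, c in enumerate(concepts):
        feats = z[:, c.feature_ids]          # (T, k) slice of SAE feature activations
        if c.weights is None:                # single-feature: direct lookup s_c(t) = z_{t,i}
            scores = feats[:, 0]
        else:                                # multi-feature: weighted combination w_c^T z_t
            scores = feats @ c.weights
        A[:, j] = scores >= c.threshold      # proposition "concept c active at token t"
    return A
```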
3. Logical Reasoning Over Latent Propositions
Having extracted an explicit set of activated propositions, AR applies user-specified logical rules (propositional logic, forward chaining) to these propositions. The rules may be compositional:
- Example: a rule of the form $\mathrm{Bridge}(t) \wedge \mathrm{USA}(t) \Rightarrow \mathrm{AmericanBridge}(t)$ combines atomic concepts into a composite one

This process infers composite concepts not directly represented by individual SAE features.
Forward chaining is performed from the atomic activations, iterating until closure. The resulting enriched activation matrix includes both detected and inferred concept-level evidence.
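The forward-chaining step can be sketched as a small fixed-point loop over per-token propositions; the rule format and the composite concept name below are illustrative assumptions:

```python
def forward_chain(active: set[str], rules: list[tuple[set[str], str]]) -> set[str]:
    """Apply propositional rules (premises -> conclusion) until closure."""
    derived = set(active)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived


# Usage: compose atomic concepts detected at a token into a composite one.
rules = [({"Bridge", "USA"}, "AmericanBridge")]
print(forward_chain({"Bridge", "USA"}, rules))  # {'Bridge', 'USA', 'AmericanBridge'}
```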
4. Latent Steering Mechanism
AR introduces a mechanism for steering latent activations to control downstream generation. For any concept $c$,

$$h' = h + \alpha \sum_{i \in \ell_c} w_{c,i}\, W^{\mathrm{dec}}_{i}$$

where $h$ is the latent activation, $W^{\mathrm{dec}}_{i}$ are the SAE decoder weights for feature $i \in \ell_c$, $w_c$ is the weighting vector (multi-feature case; $w_{c,i} = 1$ for single-feature concepts), and $\alpha$ is a steering factor. This operation biases the model toward desired behaviors as dictated by compositional or safety rules.
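This update can be sketched as adding a (weighted) SAE decoder direction to the residual-stream activation; treating `W_dec` rows as decoder directions and reusing the `Concept` fields are assumptions carried over from the sketches above:

```python
import numpy as np


def steer(h: np.ndarray, W_dec: np.ndarray, concept: "Concept", alpha: float) -> np.ndarray:
    """Return h' = h + alpha * sum_i w_{c,i} * W_dec[i] over the concept's SAE features."""
    directions = W_dec[concept.feature_ids]            # (k, d_model) decoder rows
    if concept.weights is not None:
        w = concept.weights                            # multi-feature weighting vector w_c
    else:
        w = np.ones(len(concept.feature_ids))          # w_{c,i} = 1 for single-feature concepts
    return h + alpha * (w @ directions)                # steered latent activation
```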
5. Multi-task Evaluation and Generalization
ActivationReasoning has been evaluated across benchmarks requiring complex, multi-hop, or context-sensitive reasoning:
Table: AR Performance Gains (selected benchmarks)
| Task | Baseline Accuracy | AR Accuracy | Backbone Models | 
|---|---|---|---|
| PrOntoQA (multi-hop) | 50% (chance) | 93–95% | Llama3.1-8B, Gemma2-9B | 
| Rail2Country (mono) | 41–35% | 74.7–93.7% | Various | 
| Rail2Country (meta) | 29–26% | 62.7–86.0% | Various | 
| ProverQA (hard tier) | <50% | ~70% | Instruction-tuned LLMs | 
AR outperforms much larger, instruction-tuned models as reasoning complexity increases, scales robustly, and generalizes to abstract cues (e.g., indirect color references).
6. Implications for Model Transparency and Control
By mapping from high-dimensional neural activations to explicit logical propositions, AR provides auditability and interpretability of internal model reasoning. It supports explanation tracing—failure cases can be linked to concrete concept activations—and enables direct intervention via steering, enhancing model controllability and alignment with safety policies.
Notably, AR's mechanism allows the derivation and imposition of user-preferred abstractions or ethical constraints in generation tasks that require alignment with regulatory, contextual, or compositional requirements.
7. Limitations and Prospective Directions
AR relies on the quality of dictionary construction and appropriateness of SAE feature extraction for meaningful concept alignment. Current logical rules are user-defined; automated induction or probabilistic reasoning integration is identified as a promising future direction. The framework could be extended by leveraging alternative representation learning, integrating external knowledge sources, and scaling latent reasoning to open-ended real-world contexts.
A plausible implication is that ActivationReasoning, by structurally embedding logic in latent spaces, may serve as a model for the integration of symbolic reasoning and continuous neural systems, contributing to reliable, auditable reasoning in future LLM architectures.