ActivationReasoning: Logical Reasoning in LLMs
- ActivationReasoning is a framework that extracts interpretable latent features from LLMs using sparse autoencoders to represent logical propositions.
- It maps neural activations to explicit concept representations, enabling formal logic and structured reasoning in tasks like multi-hop question answering and safety classification.
- The framework includes a latent steering mechanism to bias model outputs, enhancing trust, transparency, and adherence to safety rules.
ActivationReasoning (AR) is a framework that embeds explicit logical reasoning in the latent activation spaces of LLMs by organizing and manipulating interpretable latent features extracted via sparse autoencoders (SAEs). AR transforms opaque neural activations into transparent, compositional propositions, enabling the application of formal logic, structured reasoning, and model control in domains such as multi-hop question answering, abstraction, safety classification, and context-sensitive inference (Helff et al., 21 Oct 2025).
1. Latent Concept Identification
The first stage of AR involves constructing a dictionary of latent concept representations. Each concept $c$ is characterized by:
- A concept name (e.g., "Bridge", "USA")
- A latent representation $\ell_c$, which can be:
  - Single-feature: $\ell_c = \{i\}$ (tied to a single SAE feature $i$)
  - Multi-feature: $\ell_c = \{i_1, \dots, i_k\}$ (a group of SAE features capturing polysemy)
  - Relational-feature: $\ell_c = T_c$ (encoding relations via, e.g., shallow decision trees over SAE features)

For each concept $c$, $\ell_c$ is defined as:
- For single-feature concepts: the index of a single SAE feature
- For multi-feature concepts: a set of SAE feature indices together with a weighting vector $w_c$
- For relational concepts: $T_c$, a decision tree induced from SAE feature activations
A soft threshold $\tau_c$ is then selected for each concept by maximizing balanced accuracy, i.e., the mean of the true positive and true negative rates:

$$\tau_c = \arg\max_{\tau} \; \tfrac{1}{2}\bigl(\mathrm{TPR}_c(\tau) + \mathrm{TNR}_c(\tau)\bigr)$$
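To make the dictionary construction concrete, the following is a minimal Python sketch of a per-concept record and balanced-accuracy threshold selection; the names (`Concept`, `fit_threshold`) and field layout are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed API): a concept dictionary entry and threshold fitting
# via balanced accuracy; not the authors' implementation.
from dataclasses import dataclass

import numpy as np


@dataclass
class Concept:
    name: str                          # e.g., "Bridge", "USA"
    feature_ids: list[int]             # one SAE feature id (single) or several (multi)
    weights: np.ndarray | None = None  # weighting vector w_c for multi-feature concepts
    threshold: float = 0.0             # soft threshold tau_c


def fit_threshold(scores: np.ndarray, labels: np.ndarray) -> float:
    """Select tau_c maximizing balanced accuracy = (TPR + TNR) / 2."""
    best_tau, best_bacc = 0.0, -1.0
    for tau in np.unique(scores):
        pred = scores >= tau
        tpr = (pred & (labels == 1)).sum() / max((labels == 1).sum(), 1)
        tnr = (~pred & (labels == 0)).sum() / max((labels == 0).sum(), 1)
        bacc = 0.5 * (tpr + tnr)
        if bacc > best_bacc:
            best_tau, best_bacc = float(tau), bacc
    return best_tau
```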
2. Activation Mapping and Proposition Extraction
At inference time, AR computes, for each token $t$ and concept $c$, an activation score $s_c(t)$ from the SAE feature activations $z_t$:
- For single-feature concepts: $s_c(t) = z_{t,i}$ (direct SAE feature lookup)
- For multi-feature concepts: $s_c(t) = \sum_{i \in \ell_c} w_{c,i}\, z_{t,i}$, where $w_c$ is a weighting vector
- For relational concepts: $s_c(t) = T_c(z_t)$, i.e., via the decision tree

The thresholding operation yields a binary activation matrix $A \in \{0,1\}^{T \times |\mathcal{C}|}$ with entries

$$A_{t,c} = \mathbb{1}\bigl[s_c(t) \geq \tau_c\bigr]$$
This matrix encodes the presence ("activation") of human-interpretable concepts at each token instance, which form atomic logical propositions.
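A sketch of this activation-mapping step, under the same assumed `Concept` structure as above (relational concepts via decision trees are omitted for brevity):

```python
import numpy as np


def activation_matrix(z: np.ndarray, concepts: list["Concept"]) -> np.ndarray:
    """Map SAE activations z of shape (T, d_sae) to a binary matrix A of shape (T, |C|)."""
    A = np.zeros((z.shape[0], len(concepts)), dtype=bool)
    for j, c in enumerate(concepts):
        feats = z[:, c.feature_ids]          # (T, k) slice of SAE feature activations
        if c.weights is None:                # single-feature: direct lookup s_c(t) = z_{t,i}
            scores = feats[:, 0]
        else:                                # multi-feature: weighted combination w_c^T z_t
            scores = feats @ c.weights
        A[:, j] = scores >= c.threshold      # proposition "concept c active at token t"
    return A
```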
3. Logical Reasoning Over Latent Propositions
Having extracted an explicit set of activated propositions, AR applies user-specified logical rules (propositional logic, forward chaining) to these propositions. The rules may be compositional:
- Example: a rule of the form $\mathrm{Bridge}(t) \wedge \mathrm{USA}(t) \Rightarrow \mathrm{AmericanBridge}(t)$ combines atomic concepts into a composite one

This process infers composite concepts not directly represented by individual SAE features.
Forward chaining is performed from the atomic activations, iterating until closure. The resulting enriched activation matrix includes both detected and inferred concept-level evidence.
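The forward-chaining step can be sketched as a small fixed-point loop over per-token propositions; the rule format and the composite concept name below are illustrative assumptions:

```python
def forward_chain(active: set[str], rules: list[tuple[set[str], str]]) -> set[str]:
    """Apply propositional rules (premises -> conclusion) until closure."""
    derived = set(active)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived


# Usage: compose atomic concepts detected at a token into a composite one.
rules = [({"Bridge", "USA"}, "AmericanBridge")]
print(forward_chain({"Bridge", "USA"}, rules))  # {'Bridge', 'USA', 'AmericanBridge'}
```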
4. Latent Steering Mechanism
AR introduces a mechanism for steering latent activations to control downstream generation. For any concept $c$,

$$h' = h + \alpha \sum_{i \in \ell_c} w_{c,i}\, W^{\mathrm{dec}}_{i}$$

where $h$ is the latent activation, $W^{\mathrm{dec}}_{i}$ are the SAE decoder weights for feature $i \in \ell_c$, $w_c$ is the weighting vector (multi-feature case; $w_{c,i} = 1$ for single-feature concepts), and $\alpha$ is a steering factor. This operation biases the model toward desired behaviors as dictated by compositional or safety rules.
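This update can be sketched as adding a (weighted) SAE decoder direction to the residual-stream activation; treating `W_dec` rows as decoder directions and reusing the `Concept` fields are assumptions carried over from the sketches above:

```python
import numpy as np


def steer(h: np.ndarray, W_dec: np.ndarray, concept: "Concept", alpha: float) -> np.ndarray:
    """Return h' = h + alpha * sum_i w_{c,i} * W_dec[i] over the concept's SAE features."""
    directions = W_dec[concept.feature_ids]            # (k, d_model) decoder rows
    if concept.weights is not None:
        w = concept.weights                            # multi-feature weighting vector w_c
    else:
        w = np.ones(len(concept.feature_ids))          # w_{c,i} = 1 for single-feature concepts
    return h + alpha * (w @ directions)                # steered latent activation
```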
5. Multi-task Evaluation and Generalization
ActivationReasoning has been evaluated across benchmarks requiring complex, multi-hop, or context-sensitive reasoning:
Table: AR Performance Gains (selected benchmarks)
| Task | Baseline Accuracy | AR Accuracy | Backbone Models | 
|---|---|---|---|
| PrOntoQA (multi-hop) | 50% (chance) | 93–95% | Llama3.1-8B, Gemma2-9B | 
| Rail2Country (mono) | 41–35% | 74.7–93.7% | Various | 
| Rail2Country (meta) | 29–26% | 62.7–86.0% | Various | 
| ProverQA (hard tier) | <50% | ~70% | Instruction-tuned LLMs | 
AR outperforms much larger, instruction-tuned models as reasoning complexity increases, scales robustly, and generalizes to abstract cues (e.g., indirect color references).
6. Implications for Model Transparency and Control
By mapping from high-dimensional neural activations to explicit logical propositions, AR provides auditability and interpretability of internal model reasoning. It supports explanation tracing—failure cases can be linked to concrete concept activations—and enables direct intervention via steering, enhancing model controllability and alignment with safety policies.
Notably, AR's mechanism allows the derivation and imposition of user-preferred abstractions or ethical constraints in generation tasks that require alignment with regulatory, contextual, or compositional requirements.
7. Limitations and Prospective Directions
AR relies on the quality of dictionary construction and appropriateness of SAE feature extraction for meaningful concept alignment. Current logical rules are user-defined; automated induction or probabilistic reasoning integration is identified as a promising future direction. The framework could be extended by leveraging alternative representation learning, integrating external knowledge sources, and scaling latent reasoning to open-ended real-world contexts.
A plausible implication is that ActivationReasoning, by structurally embedding logic in latent spaces, may serve as a model for the integration of symbolic reasoning and continuous neural systems, contributing to reliable, auditable reasoning in future LLM architectures.