ActivationReasoning: Logical Reasoning in LLMs

Updated 28 October 2025
  • ActivationReasoning is a framework that extracts interpretable latent features from LLMs using sparse autoencoders to represent logical propositions.
  • It maps neural activations to explicit concept representations, enabling formal logic and structured reasoning in tasks like multi-hop question answering and safety classification.
  • The framework includes a latent steering mechanism to bias model outputs, enhancing trust, transparency, and adherence to safety rules.

ActivationReasoning (AR) is a framework that embeds explicit logical reasoning in the latent activation spaces of LLMs by organizing and manipulating interpretable latent features extracted via sparse autoencoders (SAEs). AR transforms opaque neural activations into transparent, compositional propositions, enabling the application of formal logic, structured reasoning, and model control in domains such as multi-hop question answering, abstraction, safety classification, and context-sensitive inference (Helff et al., 21 Oct 2025).

1. Latent Concept Identification

The first stage of AR involves constructing a dictionary of latent concept representations. Each concept $c$ is characterized by:

  • A concept name (e.g., "Bridge", "USA")
  • A latent representation $r_c$, which can be:
    • Single-feature: $\mathcal{R}_{\rm single}$ (tied to a single SAE feature)
    • Multi-feature: $\mathcal{R}_{\rm multi}$ (a group of $k$ SAE features capturing polysemy)
    • Relational-feature: $\mathcal{R}_{\rm relation}$ (encoding relations via, e.g., shallow decision trees)

For each $c$, $r_c$ is defined as follows (a code sketch follows the list):

  • For $\mathcal{R}_{\rm single}$: $r_c = \arg\max\left(E[\ell_t \mid y_{c,t}=1] - E[\ell_t \mid y_{c,t}=0]\right)$
  • For $\mathcal{R}_{\rm multi}$: $r_c = \text{top-}k\left(E[\ell_t \mid y_{c,t}=1] - E[\ell_t \mid y_{c,t}=0]\right)$
  • For $\mathcal{R}_{\rm relation}$: $r_c$ is a decision tree induced from $(\ell_t, y_{c,t})$
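The single- and multi-feature cases reduce to a difference of class-conditional means over SAE latents. Below is a minimal sketch of that selection step, assuming token-level SAE activations and binary concept labels have already been collected as NumPy arrays; the names `sae_latents` and `labels` are illustrative, and the relational (decision-tree) case is omitted:

```python
import numpy as np

def concept_representation(sae_latents, labels, k=1):
    """Select SAE features for a concept via difference of class-conditional means.

    sae_latents: (n_tokens, n_features) SAE activations ell_t per token.
    labels:      (n_tokens,) binary y_{c,t}; 1 where concept c is present.
    k:           k = 1 recovers R_single (argmax); k > 1 gives R_multi (top-k).
    """
    pos_mean = sae_latents[labels == 1].mean(axis=0)  # E[ell_t | y_{c,t} = 1]
    neg_mean = sae_latents[labels == 0].mean(axis=0)  # E[ell_t | y_{c,t} = 0]
    diff = pos_mean - neg_mean                        # per-feature separation
    top = np.argsort(diff)[::-1][:k]                  # indices of top-k features
    return int(top[0]) if k == 1 else top
```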

A soft threshold $\tau_c$ is selected by maximizing balanced accuracy, i.e., the mean of the true positive and true negative rates:

$$\tau_c = \arg\max_{\tau \geq 0} \tfrac{1}{2}\bigl(\mathrm{TPR}_c(\tau) + \mathrm{TNR}_c(\tau)\bigr)$$
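A corresponding sketch of the threshold search, sweeping the observed scores as candidate values of $\tau$ and keeping the one with the highest balanced accuracy (variable names again illustrative):

```python
import numpy as np

def select_threshold(scores, labels):
    """Pick tau_c = argmax_{tau >= 0} 0.5 * (TPR_c(tau) + TNR_c(tau)).

    scores: (n_tokens,) activation scores a(c, t) for the concept.
    labels: (n_tokens,) binary ground-truth labels y_{c,t}.
    """
    pos, neg = labels == 1, labels == 0
    candidates = np.unique(scores[scores >= 0])  # tau is constrained to be >= 0
    best_tau, best_bacc = 0.0, -np.inf
    for tau in candidates:
        tpr = (scores[pos] >= tau).mean()  # true positive rate at this tau
        tnr = (scores[neg] < tau).mean()   # true negative rate at this tau
        bacc = 0.5 * (tpr + tnr)
        if bacc > best_bacc:
            best_tau, best_bacc = float(tau), bacc
    return best_tau
```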

2. Activation Mapping and Proposition Extraction

At inference time, AR computes, for each token $t$ and concept $c$, an activation score $a(c, t)$:

  • For $\mathcal{R}_{\rm single}$: direct SAE feature lookup
  • For $\mathcal{R}_{\rm multi}$: $a(c, t) = w^\top \ell_t$, where $w$ is a weighting vector
  • For $\mathcal{R}_{\rm relation}$: $a(c, t)$ is computed via the decision tree $r_c$

The thresholding operation yields an activation matrix $A$:

$$A_{\rm local}[c, t] = \max\bigl(a(c, t) - \tau_c,\ 0\bigr)$$

This matrix encodes the presence ("activation") of human-interpretable concepts at each token position; these detected activations serve as atomic logical propositions.
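A sketch of this mapping for a dictionary of single- and multi-feature concepts; the dictionary layout is an assumption for illustration, and the decision-tree case is again omitted:

```python
import numpy as np

def activation_matrix(sae_latents, concepts):
    """Build A_local[c, t] = max(a(c, t) - tau_c, 0) over a concept dictionary.

    sae_latents: (n_tokens, n_features) SAE activations per token.
    concepts:    list of dicts with 'features' (SAE indices), 'weights'
                 (vector w; use [1.0] for a single feature), and 'tau'.
    """
    A_local = np.zeros((len(concepts), sae_latents.shape[0]))
    for i, c in enumerate(concepts):
        # a(c, t) = w^T ell_t, restricted to the concept's SAE features
        scores = sae_latents[:, c["features"]] @ np.asarray(c["weights"])
        A_local[i] = np.maximum(scores - c["tau"], 0.0)  # soft thresholding
    return A_local
```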

3. Logical Reasoning Over Latent Propositions

Having extracted an explicit set of activated propositions, AR applies user-specified logical rules (propositional logic with forward chaining) to them. The rules may be compositional:

  • Example:

$$\#1\{\text{Bridge}\} \land \#1\{\text{San Francisco}\} \land \#1\{\text{USA}\} \rightarrow \#1\{\text{Golden Gate Bridge}\}$$

This process infers composite concepts not directly represented by individual SAE features.

Forward chaining is performed from the atomic activations, iterating until closure. The resulting enriched activation matrix $A'$ includes both detected and inferred concept-level evidence.
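A minimal sketch of the closure computation over named concepts. The `(premises, conclusion)` encoding is an illustrative stand-in for rules like the bridge example above, not the paper's rule syntax:

```python
def forward_chain(active, rules):
    """Apply forward chaining until closure over atomic concept activations.

    active: set of concept names detected at a position (atomic propositions).
    rules:  list of (premises, conclusion) pairs, premises given as a set.
    """
    derived = set(active)
    changed = True
    while changed:  # repeat until no rule adds a new fact
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

rules = [({"Bridge", "San Francisco", "USA"}, "Golden Gate Bridge")]
print(forward_chain({"Bridge", "San Francisco", "USA"}, rules))
# -> {'Bridge', 'San Francisco', 'USA', 'Golden Gate Bridge'}
```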

4. Latent Steering Mechanism

AR introduces a mechanism for steering latent activations to control downstream generation. For any concept $c$,

$$h' = h + \alpha \cdot \bigl(\mathrm{SAE}_D[r_c] \times w\bigr) \cdot \frac{\|h\|_2}{\|\mathrm{SAE}_D[r_c]\|_2}$$

where $h$ is the latent activation, $\mathrm{SAE}_D[r_c]$ are the SAE decoder weights for $r_c$, $w$ is the weighting vector (multi-feature case), and $\alpha$ is a steering factor. This operation biases the model toward desired behaviors as dictated by compositional or safety rules.
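A sketch of this update as it might run inside a forward hook at the layer where the SAE was trained; the decoder rows and weighting vector are assumed to be available from the trained SAE, and all names are illustrative:

```python
import numpy as np

def steer(h, decoder_rows, w, alpha):
    """Return h' = h + alpha * (SAE_D[r_c] x w) * (||h||_2 / ||SAE_D[r_c]||_2).

    h:            (d_model,) latent activation to modify.
    decoder_rows: (k, d_model) SAE decoder weights SAE_D[r_c] for concept c.
    w:            (k,) weighting vector; use [1.0] for a single feature.
    alpha:        scalar steering strength; its sign promotes or suppresses c.
    """
    direction = np.asarray(w) @ decoder_rows  # combine rows: SAE_D[r_c] x w
    scale = np.linalg.norm(h) / np.linalg.norm(decoder_rows)  # norm matching
    return h + alpha * direction * scale
```

The norm-matching factor scales the steering direction relative to the activation being edited, so $\alpha$ acts as a relative rather than an absolute strength.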

5. Multi-task Evaluation and Generalization

ActivationReasoning has been evaluated across benchmarks requiring complex, multi-hop, or context-sensitive reasoning:

Table: AR Performance Gains (selected benchmarks)

| Task | Baseline Accuracy | AR Accuracy | Backbone Models |
|------|-------------------|-------------|-----------------|
| PrOntoQA (multi-hop) | 50% (chance) | 93–95% | Llama3.1-8B, Gemma2-9B |
| Rail2Country (mono) | 35–41% | 74.7–93.7% | Various |
| Rail2Country (meta) | 26–29% | 62.7–86.0% | Various |
| ProverQA (hard tier) | <50% | ~70% | Instruction-tuned LLMs |

As reasoning complexity increases, AR outperforms much larger instruction-tuned models; it scales robustly and generalizes to abstract cues (e.g., indirect color references).

6. Implications for Model Transparency and Control

By mapping from high-dimensional neural activations to explicit logical propositions, AR provides auditability and interpretability of internal model reasoning. It supports explanation tracing—failure cases can be linked to concrete concept activations—and enables direct intervention via steering, enhancing model controllability and alignment with safety policies.

Notably, AR's mechanism allows the derivation and imposition of user-preferred abstractions or ethical constraints in generation tasks that require alignment with regulatory, contextual, or compositional requirements.

7. Limitations and Prospective Directions

AR relies on the quality of dictionary construction and appropriateness of SAE feature extraction for meaningful concept alignment. Current logical rules are user-defined; automated induction or probabilistic reasoning integration is identified as a promising future direction. The framework could be extended by leveraging alternative representation learning, integrating external knowledge sources, and scaling latent reasoning to open-ended real-world contexts.

A plausible implication is that ActivationReasoning, by structurally embedding logic in latent spaces, may serve as a model for the integration of symbolic reasoning and continuous neural systems, contributing to reliable, auditable reasoning in future LLM architectures.
