Logistic Regression Probe
- Logistic regression probes are linear mappings that predict target variables from neural activations using a logistic sigmoid formulation.
- They measure the linear separability of features, enabling detailed analysis of where interpretable information emerges across network layers.
- While probes efficiently assess feature encoding, their correlational nature necessitates supplementary causal methods to validate true model influence.
A logistic regression probe, in the context of mechanistic interpretability, is a linear diagnostic mapping trained to predict a target variable (often human-interpretable or task-relevant) from hidden activations or representations within a neural network. Logistic regression probes are deployed as a systematic, quantitative tool to test for the encoding of specific information within learned representations, serving as a key observational technique in both feature-level and causal-variable localization studies.
1. Definition and Mathematical Formulation
A logistic regression probe is a function $f : \mathbb{R}^d \to [0, 1]$ that takes as input the activation vector $h \in \mathbb{R}^d$ from a network layer and outputs a prediction $\hat{y}$, typically for classification tasks. Formally, for a binary target variable $y \in \{0, 1\}$,

$$\hat{y} = \sigma(w^\top h + b),$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the logistic sigmoid and $(w, b)$ are the learned probe parameters. The probe is trained to minimize the cross-entropy loss,

$$\mathcal{L}(w, b) = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big].$$
For multi-class classification, the formulation generalizes to the softmax regression case.
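The binary formulation above can be sketched with scikit-learn on synthetic activations. Everything here is an illustrative assumption (the data is fabricated, with the target planted along a single linear direction); real probing studies fit the same model to cached hidden states from an actual network.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden activations: 1000 examples, 256-dim layer.
# We plant a hypothetical binary feature along one linear direction.
d = 256
direction = rng.normal(size=d)
y = rng.integers(0, 2, size=1000)
H = rng.normal(size=(1000, d)) + np.outer(y - 0.5, direction)

# The probe: sigma(w^T h + b), fit by minimizing cross-entropy
# (scikit-learn's default objective for LogisticRegression).
probe = LogisticRegression(max_iter=1000)
probe.fit(H[:800], y[:800])
acc = probe.score(H[800:], y[800:])
print(f"held-out probe accuracy: {acc:.2f}")
```

Because the feature is planted with a strong linear signal, the probe recovers it nearly perfectly; probe accuracy on held-out examples is the quantity tracked in the studies discussed below.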
2. Role in Mechanistic Interpretability
Logistic regression probes are primarily used to assess the presence and linear separability of interpretable features in network activations. By training probes on hidden states from a fixed, pre-trained model, researchers can systematically evaluate which layers, components, or extracted feature directions encode information about variables such as token identity, syntactic category, or task-specific roles (Rai et al., 2024). Probes provide a quantitative measure—probe accuracy—that indicates how well a feature is “readable” by a linear classifier.
In the context of causal variable localization, as formalized in MIB (Mueller et al., 17 Apr 2025), the probe can serve as an attribution tool: mapping high-level causal variables onto low-level model activations, typically by identifying subspaces or feature directions that allow for accurate classification or prediction.
3. Methodological Context and Limitations
Logistic regression probes operate in an observational, correlational regime: they measure the statistical relationship between model activations and target variables but do not establish causal influence. High probe accuracy suggests linearly encoded information but does not imply that the probed information is used downstream by the model. This limitation motivates their supplementation with causal and interventional methods, such as activation patching, knockout/ablation, or interchange interventions (Mueller et al., 17 Apr 2025, Bereska et al., 2024).
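The gap between decodability and downstream use can be made concrete with a toy example. The following sketch (entirely synthetic; the "model" is a hypothetical threshold on one coordinate) shows a feature that a probe reads out almost perfectly from a hidden coordinate, even though ablating that coordinate leaves the model's output untouched.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

# Toy "model": its output depends only on coordinate 0 of a 2-d hidden state,
# but coordinate 1 also happens to correlate with the input feature.
n = 1000
feat = rng.integers(0, 2, size=n)
h = np.column_stack([
    feat + 0.1 * rng.normal(size=n),   # used downstream
    feat + 0.1 * rng.normal(size=n),   # encoded but never read
])
model_output = (h[:, 0] > 0.5).astype(int)  # reads coordinate 0 only

# A probe on coordinate 1 alone decodes the feature almost perfectly...
probe = LogisticRegression(max_iter=1000).fit(h[:800, [1]], feat[:800])
probe_acc = probe.score(h[800:, [1]], feat[800:])

# ...but ablating coordinate 1 (zeroing it out) changes no outputs,
# so the probed information was never causally used.
h_ablated = h.copy()
h_ablated[:, 1] = 0.0
output_after = (h_ablated[:, 0] > 0.5).astype(int)
unchanged = np.mean(output_after == model_output)

print(f"probe accuracy on unused coordinate: {probe_acc:.2f}")
print(f"fraction of outputs unchanged after ablation: {unchanged:.2f}")
```

High probe accuracy with zero ablation effect is exactly the failure mode that motivates interventional follow-ups such as activation patching.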
To avoid over-attribution, probe capacity is sometimes regularized (e.g., L2 norm penalty) or compared to random baselines. Probe performance can vary by layer and model component, providing insights into the layerwise emergence and localization of semantic or algorithmic features (Rai et al., 2024).
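A common form of the baseline comparison mentioned above is a random-label (shuffled-label) control: if a regularized probe scores well above its shuffled-label counterpart, the accuracy reflects encoded information rather than probe capacity. A minimal sketch, again on fabricated activations with an assumed weak one-dimensional signal:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Hypothetical activations with a weakly encoded binary feature.
n, d = 600, 128
y = rng.integers(0, 2, size=n)
H = rng.normal(size=(n, d))
H[:, 0] += 1.5 * (y - 0.5)  # the feature lives in one coordinate

# L2-regularized probe (C is the inverse regularization strength in sklearn).
probe = LogisticRegression(C=0.1, max_iter=1000)
real_acc = cross_val_score(probe, H, y, cv=5).mean()

# Random-label control: shuffling y destroys the activation-label link,
# so remaining "accuracy" reflects probe capacity, not encoded information.
y_shuffled = rng.permutation(y)
control_acc = cross_val_score(probe, H, y_shuffled, cv=5).mean()

print(f"real labels: {real_acc:.2f}, shuffled labels: {control_acc:.2f}")
```

Cross-validation is used on both runs so that the control measures generalization, not memorization, of the shuffled labels.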
4. Use Cases and Experimental Protocols
Logistic regression probes are widely deployed in both feature-level and circuit-level interpretability studies:
- Feature localization: Probes are trained on activations to predict linguistic or semantic classes (e.g., part-of-speech tags, object types), with probe accuracy tracked as a function of layer depth. This reveals where in the architecture specific information becomes linearly accessible (Rai et al., 2024).
- Comparative benchmark evaluation: MIB (Mueller et al., 17 Apr 2025) uses logistic regression probes as part of causal variable localization pipelines, and as a baseline to compare with more sophisticated feature extraction approaches such as sparse autoencoders or principal component analysis.
- Evaluation of learned representations: In studies of superposition and polysemanticity, probes help quantify the degree to which a "feature" (in the representational sense) is recoverable from linear projections or sparse codes (Rai et al., 2024).
- Control in experimental methodology: Probes serve as negative controls in ablation and patching experiments, to distinguish between genuinely causal features and features that are merely linearly present in hidden states.
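The feature-localization protocol in the first bullet, training one probe per layer and tracking accuracy against depth, can be sketched as follows. The per-layer activations here are fabricated, with the toy assumption that the feature's linear signal strengthens with depth; in practice they would be cached from forward passes over a dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Hypothetical cached activations: 6 layers, 500 inputs, 64 dims per layer,
# where the feature becomes more linearly accessible in deeper layers.
n, d, n_layers = 500, 64, 6
y = rng.integers(0, 2, size=n)
layers = []
for ell in range(n_layers):
    signal = 0.5 * ell  # toy assumption: signal grows with depth
    H = rng.normal(size=(n, d))
    H[:, 0] += signal * (y - 0.5)
    layers.append(H)

# One probe per layer; accuracy as a function of depth localizes the feature.
accs = []
for H in layers:
    H_tr, H_te, y_tr, y_te = train_test_split(H, y, test_size=0.3,
                                              random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(H_tr, y_tr)
    accs.append(probe.score(H_te, y_te))

for ell, acc in enumerate(accs):
    print(f"layer {ell}: probe accuracy {acc:.2f}")
```

The resulting accuracy-versus-depth curve (near chance at layer 0, high in late layers) is the kind of evidence used to claim that a feature "emerges" at a particular depth.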
5. Probes Versus Alternative Methods
While logistic regression probes are computationally efficient and provide interpretable metrics, they are limited to detecting linearly encoded (monosemantic) features and do not capture nonlinear or distributed representations. More expressive nonlinear probes (e.g., MLPs) can overfit small datasets and confound the origin of encoded information. Causal-intervention-based attribution techniques (such as activation patching, edge attribution patching, or distributed alignment search) overcome the correlational ambiguity, revealing which features are truly manipulated and used by the model (Mueller et al., 17 Apr 2025).
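The linear-only limitation can be demonstrated with an XOR-encoded feature, the classic case a linear probe cannot read out but a small MLP probe can. The data below is synthetic and the architecture choice (16 hidden units) is arbitrary, purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)

# XOR-style feature: present in the activations, but not linearly decodable.
n = 1000
a = rng.integers(0, 2, size=n)
b = rng.integers(0, 2, size=n)
y = a ^ b
H = np.column_stack([a, b]).astype(float) + 0.1 * rng.normal(size=(n, 2))

# Linear probe: cross-entropy is minimized near w = 0 on symmetric XOR data,
# so accuracy stays near chance.
linear_probe = LogisticRegression(max_iter=1000).fit(H[:800], y[:800])
lin_acc = linear_probe.score(H[800:], y[800:])

# Nonlinear MLP probe: can represent XOR, so it recovers the feature.
mlp_probe = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                          random_state=0).fit(H[:800], y[:800])
mlp_acc = mlp_probe.score(H[800:], y[800:])

print(f"linear probe accuracy: {lin_acc:.2f}")
print(f"MLP probe accuracy:    {mlp_acc:.2f}")
```

The flip side, noted above, is that the extra capacity of MLP probes makes their successes harder to interpret: a high score may reflect the probe computing the feature itself rather than the model encoding it.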
A summary table contrasts logistic regression probes with several alternative causal variable localization techniques, as assessed in the MIB benchmark (Mueller et al., 17 Apr 2025):
| Method | Linear/Nonlinear | Causality Assessed | Typical Use |
|---|---|---|---|
| Logistic Regression | Linear | No | Feature localization |
| Principal Components | Linear | No | Dimensionality reduction, baseline probe |
| Sparse Autoencoder | Nonlinear, sparse | No | Feature disentanglement, unsupervised discovery |
| DAS (Distributed Alignment Search) | Linear, supervised | Yes | Causal variable localization |
Only methods in the last category (e.g., DAS) yield high Interchange Intervention Accuracy (IIA), the causal metric formalized in (Mueller et al., 17 Apr 2025). Logistic probes, while an essential baseline, are not sufficient for establishing causal variable presence.
6. Impact and Recommendations
Logistic regression probes enable researchers to operationalize the question of “what is present” in activations. According to recent benchmarking studies, they are effective for rapid, high-throughput assessment of feature encoding, but should be complemented with causal-intervention methods for robust variable identification (Mueller et al., 17 Apr 2025, Rai et al., 2024). Ongoing research seeks to refine probe methodologies (e.g., subspace alignment, probe capacity control, integration with causal patching) and clarify the interpretability of probe-derived insights in the context of distributed and polysemantic representations.
7. References
- "MIB: A Mechanistic Interpretability Benchmark" (Mueller et al., 17 Apr 2025)
- "A Practical Review of Mechanistic Interpretability for Transformer-Based LLMs" (Rai et al., 2024)
- "Mechanistic Interpretability for AI Safety -- A Review" (Bereska et al., 2024)
These works establish the theoretical grounding, methodological applications, and benchmark-driven evaluation of logistic regression probes within the broader field of mechanistic interpretability.