Logistic Regression Probe
- Logistic regression probes are linear mappings that predict target variables from neural activations using a logistic sigmoid formulation.
- They measure the linear separability of features, enabling detailed analysis of where interpretable information emerges across network layers.
- While probes efficiently assess feature encoding, their correlational nature necessitates supplementary causal methods to validate true model influence.
A logistic regression probe, in the context of mechanistic interpretability, is a linear diagnostic mapping trained to predict a target variable (often human-interpretable or task-relevant) from hidden activations or representations within a neural network. Logistic regression probes are deployed as a systematic, quantitative tool to test for the encoding of specific information within learned representations, serving as a key observational technique in both feature-level and causal-variable localization studies.
1. Definition and Mathematical Formulation
A logistic regression probe is a function $f : \mathbb{R}^d \to [0, 1]$ that takes as input the activation vector $h \in \mathbb{R}^d$ from a network layer and outputs a prediction $\hat{y}$, typically for classification tasks. Formally, for a binary target variable $y \in \{0, 1\}$,

$$\hat{y} = \sigma(w^\top h + b),$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the logistic sigmoid and $(w, b)$ are the learned probe parameters. The probe is trained to minimize the cross-entropy loss,

$$\mathcal{L}(w, b) = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big].$$
For multi-class classification, the formulation generalizes to the softmax regression case.
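The binary formulation above can be sketched with scikit-learn on synthetic activations. Everything here is an illustrative assumption (the data is fabricated, with the target planted along a single linear direction); real probing studies fit the same model to cached hidden states from an actual network.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden activations: 1000 examples, 256-dim layer.
# We plant a hypothetical binary feature along one linear direction.
d = 256
direction = rng.normal(size=d)
y = rng.integers(0, 2, size=1000)
H = rng.normal(size=(1000, d)) + np.outer(y - 0.5, direction)

# The probe: sigma(w^T h + b), fit by minimizing cross-entropy
# (scikit-learn's default objective for LogisticRegression).
probe = LogisticRegression(max_iter=1000)
probe.fit(H[:800], y[:800])
acc = probe.score(H[800:], y[800:])
print(f"held-out probe accuracy: {acc:.2f}")
```

Because the feature is planted with a strong linear signal, the probe recovers it nearly perfectly; probe accuracy on held-out examples is the quantity tracked in the studies discussed below.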
2. Role in Mechanistic Interpretability
Logistic regression probes are primarily used to assess the presence and linear separability of interpretable features in network activations. By training probes on hidden states from a fixed, pre-trained model, researchers can systematically evaluate which layers, components, or extracted feature directions encode information about variables such as token identity, syntactic category, or task-specific roles (Rai et al., 2024). Probes provide a quantitative measure—probe accuracy—that indicates how well a feature is “readable” by a linear classifier.
In the context of causal variable localization, as formalized in MIB (Mueller et al., 17 Apr 2025), the probe can serve as an attribution tool: mapping high-level causal variables onto low-level model activations, typically by identifying subspaces or feature directions that allow for accurate classification or prediction.
3. Methodological Context and Limitations
Logistic regression probes operate in an observational, correlational regime: they measure the statistical relationship between model activations and target variables but do not establish causal influence. High probe accuracy suggests linearly encoded information but does not imply that the probed information is used downstream by the model. This limitation motivates their supplementation with causal and interventional methods, such as activation patching, knockout/ablation, or interchange interventions (Mueller et al., 17 Apr 2025, Bereska et al., 2024).
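The gap between decodability and downstream use can be made concrete with a toy example. The following sketch (entirely synthetic; the "model" is a hypothetical threshold on one coordinate) shows a feature that a probe reads out almost perfectly from a hidden coordinate, even though ablating that coordinate leaves the model's output untouched.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

# Toy "model": its output depends only on coordinate 0 of a 2-d hidden state,
# but coordinate 1 also happens to correlate with the input feature.
n = 1000
feat = rng.integers(0, 2, size=n)
h = np.column_stack([
    feat + 0.1 * rng.normal(size=n),   # used downstream
    feat + 0.1 * rng.normal(size=n),   # encoded but never read
])
model_output = (h[:, 0] > 0.5).astype(int)  # reads coordinate 0 only

# A probe on coordinate 1 alone decodes the feature almost perfectly...
probe = LogisticRegression(max_iter=1000).fit(h[:800, [1]], feat[:800])
probe_acc = probe.score(h[800:, [1]], feat[800:])

# ...but ablating coordinate 1 (zeroing it out) changes no outputs,
# so the probed information was never causally used.
h_ablated = h.copy()
h_ablated[:, 1] = 0.0
output_after = (h_ablated[:, 0] > 0.5).astype(int)
unchanged = np.mean(output_after == model_output)

print(f"probe accuracy on unused coordinate: {probe_acc:.2f}")
print(f"fraction of outputs unchanged after ablation: {unchanged:.2f}")
```

High probe accuracy with zero ablation effect is exactly the failure mode that motivates interventional follow-ups such as activation patching.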
To avoid over-attribution, probe capacity is sometimes regularized (e.g., L2 norm penalty) or compared to random baselines. Probe performance can vary by layer and model component, providing insights into the layerwise emergence and localization of semantic or algorithmic features (Rai et al., 2024).
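A common form of the baseline comparison mentioned above is a random-label (shuffled-label) control: if a regularized probe scores well above its shuffled-label counterpart, the accuracy reflects encoded information rather than probe capacity. A minimal sketch, again on fabricated activations with an assumed weak one-dimensional signal:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Hypothetical activations with a weakly encoded binary feature.
n, d = 600, 128
y = rng.integers(0, 2, size=n)
H = rng.normal(size=(n, d))
H[:, 0] += 1.5 * (y - 0.5)  # the feature lives in one coordinate

# L2-regularized probe (C is the inverse regularization strength in sklearn).
probe = LogisticRegression(C=0.1, max_iter=1000)
real_acc = cross_val_score(probe, H, y, cv=5).mean()

# Random-label control: shuffling y destroys the activation-label link,
# so remaining "accuracy" reflects probe capacity, not encoded information.
y_shuffled = rng.permutation(y)
control_acc = cross_val_score(probe, H, y_shuffled, cv=5).mean()

print(f"real labels: {real_acc:.2f}, shuffled labels: {control_acc:.2f}")
```

Cross-validation is used on both runs so that the control measures generalization, not memorization, of the shuffled labels.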
4. Use Cases and Experimental Protocols
Logistic regression probes are widely deployed in both feature-level and circuit-level interpretability studies:
- Feature localization: Probes are trained on activations to predict linguistic or semantic classes (e.g., part-of-speech tags, object types), with probe accuracy tracked as a function of layer depth. This reveals where in the architecture specific information becomes linearly accessible (Rai et al., 2024).
- Comparative benchmark evaluation: MIB (Mueller et al., 17 Apr 2025) uses logistic regression probes as part of causal variable localization pipelines, and as a baseline to compare with more sophisticated feature extraction approaches such as sparse autoencoders or principal component analysis.
- Evaluation of learned representations: In studies of superposition and polysemanticity, probes help quantify the degree to which a "feature" (in the representational sense) is recoverable from linear projections or sparse codes (Rai et al., 2024).
- Control in experimental methodology: Probes serve as negative controls in ablation and patching experiments, to distinguish between genuinely causal features and features that are merely linearly present in hidden states.
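The feature-localization protocol in the first bullet, training one probe per layer and tracking accuracy against depth, can be sketched as follows. The per-layer activations here are fabricated, with the toy assumption that the feature's linear signal strengthens with depth; in practice they would be cached from forward passes over a dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Hypothetical cached activations: 6 layers, 500 inputs, 64 dims per layer,
# where the feature becomes more linearly accessible in deeper layers.
n, d, n_layers = 500, 64, 6
y = rng.integers(0, 2, size=n)
layers = []
for ell in range(n_layers):
    signal = 0.5 * ell  # toy assumption: signal grows with depth
    H = rng.normal(size=(n, d))
    H[:, 0] += signal * (y - 0.5)
    layers.append(H)

# One probe per layer; accuracy as a function of depth localizes the feature.
accs = []
for H in layers:
    H_tr, H_te, y_tr, y_te = train_test_split(H, y, test_size=0.3,
                                              random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(H_tr, y_tr)
    accs.append(probe.score(H_te, y_te))

for ell, acc in enumerate(accs):
    print(f"layer {ell}: probe accuracy {acc:.2f}")
```

The resulting accuracy-versus-depth curve (near chance at layer 0, high in late layers) is the kind of evidence used to claim that a feature "emerges" at a particular depth.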
5. Probes Versus Alternative Methods
While logistic regression probes are computationally efficient and provide interpretable metrics, they are limited to detecting linearly encoded (monosemantic) features and do not capture nonlinear or distributed representations. More expressive nonlinear probes (e.g., MLPs) can overfit small datasets and confound the origin of encoded information. Causal-intervention-based attribution techniques (such as activation patching, edge attribution patching, or distributed alignment search) overcome the correlational ambiguity, revealing which features are truly manipulated and used by the model (Mueller et al., 17 Apr 2025).
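The linear-only limitation can be demonstrated with an XOR-encoded feature, the classic case a linear probe cannot read out but a small MLP probe can. The data below is synthetic and the architecture choice (16 hidden units) is arbitrary, purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)

# XOR-style feature: present in the activations, but not linearly decodable.
n = 1000
a = rng.integers(0, 2, size=n)
b = rng.integers(0, 2, size=n)
y = a ^ b
H = np.column_stack([a, b]).astype(float) + 0.1 * rng.normal(size=(n, 2))

# Linear probe: cross-entropy is minimized near w = 0 on symmetric XOR data,
# so accuracy stays near chance.
linear_probe = LogisticRegression(max_iter=1000).fit(H[:800], y[:800])
lin_acc = linear_probe.score(H[800:], y[800:])

# Nonlinear MLP probe: can represent XOR, so it recovers the feature.
mlp_probe = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                          random_state=0).fit(H[:800], y[:800])
mlp_acc = mlp_probe.score(H[800:], y[800:])

print(f"linear probe accuracy: {lin_acc:.2f}")
print(f"MLP probe accuracy:    {mlp_acc:.2f}")
```

The flip side, noted above, is that the extra capacity of MLP probes makes their successes harder to interpret: a high score may reflect the probe computing the feature itself rather than the model encoding it.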
A summary table contrasts logistic regression probes with several alternative causal variable localization techniques, as assessed in the MIB benchmark (Mueller et al., 17 Apr 2025):
| Method | Linear/Nonlinear | Causality Assessed | Typical Use |
|---|---|---|---|
| Logistic Regression | Linear | No | Feature localization |
| Principal Components | Linear | No | Dimensionality reduction, baseline probe |
| Sparse Autoencoder | Nonlinear, sparse | No | Feature disentanglement, unsupervised discovery |
| DAS (Distributed Alignment Search) | Linear, supervised | Yes | Causal variable localization |
Only methods in the last category (e.g., DAS) yield high Interchange Intervention Accuracy (IIA), the causal metric formalized in (Mueller et al., 17 Apr 2025). Logistic probes, while an essential baseline, are not sufficient for establishing causal variable presence.
6. Impact and Recommendations
Logistic regression probes enable researchers to operationalize the question of “what is present” in activations. According to recent benchmarking studies, they are effective for rapid, high-throughput assessment of feature encoding, but should be complemented with causal-intervention methods for robust variable identification (Mueller et al., 17 Apr 2025, Rai et al., 2024). Ongoing research seeks to refine probe methodologies (e.g., subspace alignment, probe capacity control, integration with causal patching) and clarify the interpretability of probe-derived insights in the context of distributed and polysemantic representations.
7. References
- "MIB: A Mechanistic Interpretability Benchmark" (Mueller et al., 17 Apr 2025)
- "A Practical Review of Mechanistic Interpretability for Transformer-Based LLMs" (Rai et al., 2024)
- "Mechanistic Interpretability for AI Safety -- A Review" (Bereska et al., 2024)
These works establish the theoretical grounding, methodological applications, and benchmark-driven evaluation of logistic regression probes within the broader field of mechanistic interpretability.