Linear Probe in Deep Learning

Updated 4 July 2026
  • A linear probe is a simple linear classifier or regressor trained on fixed representations to gauge what information is linearly accessible; it serves primarily as a diagnostic tool.
  • Linear probes are used across domains, including language, vision, and reward modeling, to calibrate predictions, analyze syntactic geometry, and characterize model behavior.
  • Recent research extends linear probes to post-hoc calibration, weight-space learning, and control signals, highlighting both their efficiency and inherent interpretive limitations.

A linear probe is a simple linear classifier or regressor trained on top of frozen representations, hidden activations, or other fixed outputs in order to test what information is linearly decodable, calibrate predictions, or perform cheap adaptation. In the papers considered here, the probed object ranges from contextual token states in BERT, to residual-stream activations in LLMs, to frozen visual features from pre-trained encoders, and even to class-probability vectors produced by in-context learning. A common theme is that the base model remains fixed while the probe supplies a low-capacity readout, although some recent work relocates the linearity to probe generation itself rather than to the final readout (Pal et al., 2024, Abbas et al., 2024, Peng et al., 8 May 2026, Kahana et al., 2024).

1. Canonical definition and formal variants

In deep learning, a linear probe usually means training a simple linear classifier or regressor on top of frozen representations. In the most familiar classification form, the probed representation is a hidden vector $h$, and prediction is written as

$$\hat{y} = \text{softmax}(W h + b).$$

This basic template persists across many settings, but the representation being probed can differ substantially: hidden states, residual activations, frozen encoder features, or probability vectors (Abbas et al., 2024).
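
As a minimal sketch of the classical template, the following NumPy code fits $\hat{y} = \text{softmax}(Wh + b)$ on features that are never updated. The synthetic Gaussian clusters stand in for frozen hidden states, and the function name `train_linear_probe` and all hyperparameters are assumptions made for illustration, not taken from any of the cited papers. Features are stored as rows, so the code computes `H @ W + b` rather than $Wh + b$.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_linear_probe(H, y, num_classes, lr=0.1, steps=300, weight_decay=1e-3):
    """Fit softmax(H W + b) on frozen features H with full-batch gradient descent."""
    n, d = H.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]             # one-hot targets
    for _ in range(steps):
        P = softmax(H @ W + b)
        G = (P - Y) / n                    # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (H.T @ G + weight_decay * W)
        b -= lr * G.sum(axis=0)
    return W, b

# Synthetic stand-in for frozen hidden states: three Gaussian clusters in 16 dimensions.
rng = np.random.default_rng(0)
centers = 3.0 * rng.normal(size=(3, 16))
y = rng.integers(0, 3, size=600)
H = centers[y] + rng.normal(size=(600, 16))

# Standardize with training statistics only; the "encoder" itself is never touched.
mu, sd = H[:400].mean(axis=0), H[:400].std(axis=0)
H = (H - mu) / sd

W, b = train_linear_probe(H[:400], y[:400], num_classes=3)
test_acc = (softmax(H[400:] @ W + b).argmax(axis=1) == y[400:]).mean()
print(f"held-out probe accuracy: {test_acc:.3f}")
```

Only `W` and `b` are trained, so held-out accuracy reflects what a linear readout can extract from the fixed features rather than what further representation learning could add.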

| Variant | Probed object | Core linear form |
| Classical classifier | Hidden activation $h$ | $\hat{y}=\text{softmax}(Wh+b)$ |
| Structural probe | Token states $h_i^l, h_j^l$ | $d_B(h_i^l,h_j^l)=\lVert Bh_i^l-Bh_j^l\rVert_2$ |
| Calibration probe | Class-probability vector $\mathbf p$ | $\tilde{\mathbf p}=\text{softmax}(\mathbf A\mathbf p+\mathbf b)$ |
| Frozen-feature linear probe | Feature matrix $X$ | $W^*=(X^\top X+\lambda I)^{-1}X^\top Y$ |
| Residual-stream logistic probe | Residual activation $h_\ell$ | $s_\ell=\sigma(w_\ell^\top h_\ell)$ |

What remains invariant across these cases is the use of a linear map on a fixed representation. What changes is the epistemic role of the probe. In some papers the probe is diagnostic, asking whether syntax, deception, or hallucination is linearly accessible; in others it is operational, serving as a post-hoc calibrator, a closed-form transfer head, or a control signal inside a reward function (Pal et al., 2024, Abbas et al., 2024, Peng et al., 8 May 2026, Nordby et al., 15 Apr 2026, O'Neill et al., 31 Jul 2025, Papadatos et al., 2024).

2. Structural probing and syntactic geometry

A particularly influential line of work uses linear probes to study whether syntactic structure is encoded geometrically in language-model states. In the structural-probe formulation adopted by "Hitting 'Probe'rty with Non-Linearity, and More," one learns a linear transformation $B$ such that Euclidean distances in the transformed space approximate tree distances in a gold dependency tree:

$$d_B(h_i^l, h_j^l) = \lVert B h_i^l - B h_j^l \rVert_2,$$

with training objective

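$$\min_B \sum_{l} \frac{1}{|s^l|^2} \sum_{i,j} \left| d_{T^l}(w_i^l, w_j^l) - d_B(h_i^l, h_j^l)^2 \right|,$$

where $d_{T^l}$ is the tree distance between words $w_i^l$ and $w_j^l$ in the gold parse of sentence $s^l$.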

Trees are reconstructed from the resulting distances and evaluated with Unlabeled Undirected Attachment Score (UUAS), the percentage of undirected edges in the gold tree that appear in the predicted tree (Pal et al., 2024).
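
The decoding and scoring step can be sketched as follows: compute pairwise probe distances, take a minimum spanning tree, and count how many gold edges it recovers. This is an illustrative toy under stated assumptions. The projection `B` is constructed by hand rather than learned, the five-token gold tree is invented, and the token states are built so that squared distance in the first four coordinates equals tree distance while the remaining coordinates are pure nuisance.

```python
import numpy as np

def probe_distances(H, B):
    """Pairwise d_B(h_i, h_j) = ||B h_i - B h_j||_2 for token states H of shape (n, d)."""
    Z = H @ B.T
    diff = Z[:, None, :] - Z[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def minimum_spanning_tree(D):
    """Prim's algorithm on a dense distance matrix; returns a set of undirected edges."""
    n = D.shape[0]
    in_tree = [0]
    edges = set()
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or D[i, j] < D[best]):
                    best = (i, j)
        edges.add(frozenset(best))
        in_tree.append(best[1])
    return edges

def uuas(pred_edges, gold_edges):
    """Fraction of undirected gold edges that appear in the predicted tree."""
    gold = {frozenset(e) for e in gold_edges}
    return len(pred_edges & gold) / len(gold)

# Hypothetical gold dependency tree over 5 tokens.
gold_edges = [(0, 1), (1, 2), (1, 3), (3, 4)]

# One orthonormal direction per edge; a node's position is the sum along its path from token 0.
E = np.eye(4)
Z = np.array([0 * E[0], E[0], E[0] + E[1], E[0] + E[2], E[0] + E[2] + E[3]])

rng = np.random.default_rng(0)
H = np.hstack([Z, 5.0 * rng.normal(size=(5, 4))])   # syntactic subspace plus nuisance dims
B = np.hstack([np.eye(4), np.zeros((4, 4))])         # hand-built probe selecting the subspace

uuas_probe = uuas(minimum_spanning_tree(probe_distances(H, B)), gold_edges)
uuas_raw = uuas(minimum_spanning_tree(probe_distances(H, np.eye(8))), gold_edges)
print("UUAS with probe:", uuas_probe, "| UUAS on raw states:", uuas_raw)
```

With the hand-built projection every gold edge has distance 1 and every non-edge has distance at least $\sqrt{2}$, so the spanning tree recovers the gold tree; on the raw states the nuisance coordinates dominate the distances.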

This linear construction is used as both a baseline and a conceptual reference for non-linear extensions. The paper emphasizes that the simplicity of linear probes is both their key advantage and a potential limitation: a single linear transformation may be too weak if syntax is stored in a more complex, non-linearly separable way. Even so, the empirical picture is not that linear probes fail. Rather, linear probes remain very competitive, and the gap to modest non-linearity is often small. On the Universal Dependencies English Web Treebank, the RBF probe exceeds the linear probe at the final layer for both BERT and a second BERT variant: for BERT, 69.71 UUAS versus 66.65; for the second variant, 60.77 versus 60.10 (Pal et al., 2024).

The same paper also argues that evaluation based only on UUAS can be misleading. Its edge-strength metric is used to visualize whether a probe assigns high-confidence structure to correct or incorrect edges. This suggests that a linear probe can achieve high UUAS while still spreading mass over many spurious relations, especially in middle layers where syntactic signals are diffuse. A plausible implication is that linear-probe quality cannot be reduced to a single tree-reconstruction score; the geometry induced by the probe matters as much as the discrete decoded structure (Pal et al., 2024).

3. Post-hoc calibration and probability-space probes

Linear probes need not operate on hidden states. "Enhancing In-context Learning via Linear Probe Calibration" defines Linear Probe Calibration (LinC) as an affine map on the class-probability vector produced by a frozen GPT-like model under in-context learning:

$$\tilde{\mathbf p} = \text{softmax}(\mathbf A \mathbf p + \mathbf b).$$

Here the representation being linearly probed is the vector of predicted class probabilities $\mathbf p$, not the model’s internal activations or logits. The parameters $\mathbf A$ and $\mathbf b$ are trained on a small validation set by minimizing cross-entropy over held-out prompts (Abbas et al., 2024).

This formulation is explicitly post-hoc: the base model remains frozen, and inference consists of running ordinary ICL to obtain $\mathbf p$, followed by the learned linear remapping and softmax normalization. The paper positions LinC relative to temperature scaling and Platt scaling, but with a full matrix $\mathbf A$ that allows both class-wise scaling and off-diagonal couplings between classes. Because the transformation is learned in the exact prompting regime used at test time, it targets prompt-template sensitivity, demonstration-order sensitivity, and label-imbalance sensitivity as calibration failures rather than as failures of representation learning per se (Abbas et al., 2024).
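
A rough sketch of this post-hoc procedure is given below: an affine map on class-probability vectors, followed by softmax, is fitted by gradient descent on a small labeled validation set. The simulated "ICL probabilities" with a built-in class bias, the identity initialization, the step size, and the names `fit_linc` and `calibrate` are all assumptions of this sketch rather than details confirmed by the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(P, y):
    return -np.log(P[np.arange(len(y)), y] + 1e-12).mean()

def calibrate(P, A, b):
    """Apply the affine remapping to probability vectors (rows of P), then renormalize."""
    return softmax(P @ A.T + b)

def fit_linc(P_val, y_val, lr=0.5, steps=2000):
    """Learn (A, b) on held-out examples; the base model is never updated."""
    k = P_val.shape[1]
    A, b = np.eye(k), np.zeros(k)          # identity start is an assumption of this sketch
    Y = np.eye(k)[y_val]
    for _ in range(steps):
        G = (calibrate(P_val, A, b) - Y) / len(y_val)   # gradient w.r.t. the new logits
        A -= lr * (G.T @ P_val)
        b -= lr * G.sum(axis=0)
    return A, b

rng = np.random.default_rng(0)

def simulate_icl_probs(n):
    """Hypothetical frozen model whose outputs are biased toward class 0 and against class 2."""
    y = rng.integers(0, 3, size=n)
    logits = 3.0 * np.eye(3)[y] + rng.normal(size=(n, 3)) + np.array([1.5, 0.0, -1.5])
    return softmax(logits), y

P_val, y_val = simulate_icl_probs(30)
P_test, y_test = simulate_icl_probs(1000)

A, b = fit_linc(P_val, y_val)
loss_before = cross_entropy(calibrate(P_val, np.eye(3), np.zeros(3)), y_val)
loss_after = cross_entropy(calibrate(P_val, A, b), y_val)
acc_raw = (P_test.argmax(axis=1) == y_test).mean()
acc_cal = (calibrate(P_test, A, b).argmax(axis=1) == y_test).mean()
print(f"val loss {loss_before:.3f} -> {loss_after:.3f}; test acc {acc_raw:.3f} -> {acc_cal:.3f}")
```

Because $\mathbf A$ is a full matrix, the fitted map can both rescale individual classes and trade probability mass between classes, which a single temperature cannot do.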

The empirical claims are unusually strong for such a small probe. LinC is reported to require only minimal additional samples, as few as five labeled data samples; performance often saturates quickly as the validation size grows. Across text-classification benchmarks and several GPT-like models, the paper reports an average improvement of up to 21%, and up to a 50% improvement in some cases, together with lower expected calibration error and markedly reduced variance across prompt templates, label proportions, and demonstration permutations. On GPT-J in the 0-shot setting averaged over seven datasets, NoC gives 43.1%, ConC 50.5%, and LinC 64.6%; on GPT-J with DBPedia in the 0-shot setting, NoC gives 19.7%, ConC 48.0%, and LinC 69.3% (Abbas et al., 2024).

Conceptually, LinC broadens the term linear probe. The linear map no longer asks what the hidden state contains; instead it corrects a miscalibrated output simplex. This suggests that linear probing is not confined to representational diagnosis. It can also be a compact, supervised correction layer over a fixed predictive system (Abbas et al., 2024).

4. Frozen-feature linear probes in vision and transfer

In vision, linear probing often means fitting a linear head on top of frozen features extracted by a pre-trained encoder. "Closed-Form Linear-Probe Dataset Distillation for Pre-trained Vision Models" makes this formulation explicit. Given a fixed feature extractor, let $X$ be the feature matrix, $Y$ the one-hot labels, and $W$ the probe. The inner objective is ridge-regularized least squares:

$$\min_W \; \lVert XW - Y \rVert_F^2 + \lambda \lVert W \rVert_F^2,$$

with closed-form solution

$$W^* = (X^\top X + \lambda I)^{-1} X^\top Y.$$

Using the push-through identity, the paper rewrites this in sample-space kernel form,

$$W^* = X^\top (K + \lambda I)^{-1} Y,$$

where $K = XX^\top$. In this frozen-encoder regime, the relevant kernel is exactly the empirical Gram of frozen features; no neural tangent kernel or infinite-width approximation is required (Peng et al., 8 May 2026).
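
The equivalence between the feature-space solution $W^* = (X^\top X + \lambda I)^{-1} X^\top Y$ and the sample-space form $W^* = X^\top (XX^\top + \lambda I)^{-1} Y$ can be checked numerically. The sketch below uses random matrices as a stand-in for frozen features and labels; the sizes and the value of $\lambda$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c, lam = 50, 200, 5, 0.1               # few samples, many feature dimensions
X = rng.normal(size=(n, d))                   # stand-in for frozen encoder features
Y = np.eye(c)[rng.integers(0, c, size=n)]     # one-hot labels

# Feature-space form: solves a d x d linear system.
W_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Sample-space form via the push-through identity: solves an n x n system on the Gram matrix.
K = X @ X.T
W_kernel = X.T @ np.linalg.solve(K + lam * np.eye(n), Y)

max_gap = np.abs(W_primal - W_kernel).max()
print("largest entrywise difference:", max_gap)
```

When the number of samples is much smaller than the feature dimension, as with a handful of distilled images per class, the sample-space system is the smaller of the two to solve.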

The same paper argues that the choice of outer objective is decisive. A standard MSE outer loss underperforms a temperature-scaled softmax cross-entropy in which the columns of the probe matrix act as class anchors in feature space. On ImageNet-100 with IPC=1, CLP-DD with MSE outer achieves 82.1% average accuracy, while the class-anchor objective reaches 85.0%; DSA-free LGM gives 68.6%. On ImageNet-1K with IPC=1, CLP-DD averages 68.3%, slightly above LGM with DSA at 67.8%, while running faster and using less than one-eighth of the GPU memory (Peng et al., 8 May 2026).

The geometric interpretation is that linear probes on frozen features carve out class-specific anchors rather than merely fitting centroids. The paper reports that distilled embeddings lie near the margins of class clusters rather than at class means, which differentiates the probe-induced solution from centroid selection. This suggests that in strong pre-trained feature spaces, the utility of a linear probe depends less on summarizing class appearance than on locating discriminative directions that best partition the feature manifold (Peng et al., 8 May 2026).

A related debate appears in few-shot CLIP adaptation. The abstract of "LP++: A Surprisingly Strong Linear Probe for Few-Shot CLIP" states that Linear Probe has been often reported as a weak baseline, motivating prompt learning and feature adaptation methods. LP++ generalizes the standard LP baseline by making linear classifier weights learnable functions of the text embedding, with class-wise multipliers blending image and text knowledge, and reports highly competitive few-shot CLIP performances together with black-box operation and orders-of-magnitudes faster runtime than state-of-the-art few-shot CLIP adaptation methods (Huang et al., 2024). A plausible implication is that some apparent weaknesses of linear probes are optimization- and formulation-dependent rather than intrinsic to linear readouts themselves.

5. Residual-stream probes for deception, hallucination, and sycophancy

In recent LLM interpretability and safety work, linear probes are trained directly on residual-stream activations. "Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling" defines a layerwise deception detector as an L2-regularized logistic regression classifier on standardized residual activations, with no intercept and a fixed regularization strength. For layer $\ell$, with standardized activation $h_\ell$ and probe weights $w_\ell$, the score is

$$s_\ell = \sigma(w_\ell^\top h_\ell),$$

and performance is evaluated by AUROC. The paper finds that best-layer probe accuracy improves with model size, by roughly 5% AUROC per scaling increment in parameter count, but also that single-layer probes are brittle: best layers vary widely across tasks and models, and some deception types fail at essentially chance under a single-layer probe. Multi-layer ensembles recover strong performance, improving AUROC by +29.3% on Insider Trading and +78.4% on Harm-Pressure Knowledge relative to the best single-layer baseline used in that comparison (Nordby et al., 15 Apr 2026).
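
The sketch below illustrates why combining layers can help when each layer carries only a weak signal. It fits one L2-regularized, no-intercept logistic probe per layer on standardized synthetic activations and then averages the per-layer logits. The synthetic data, the gradient-descent fitting routine, and the averaging rule are assumptions made for illustration; the paper's actual ensembling scheme may differ.

```python
import numpy as np

def standardize(train, test):
    mu, sd = train.mean(axis=0), train.std(axis=0) + 1e-8
    return (train - mu) / sd, (test - mu) / sd

def fit_logistic_no_intercept(X, y, l2=1.0, lr=0.1, steps=500):
    """L2-regularized logistic regression without intercept, fitted by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y) + l2 * w) / len(y)
    return w

def auroc(scores, y):
    """Rank-based AUROC; assumes continuous scores without ties."""
    ranks = np.argsort(np.argsort(scores)) + 1
    n_pos, n_neg = y.sum(), (1 - y).sum()
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
n_train, n_test, n_layers, d = 400, 400, 6, 32
y_train = rng.integers(0, 2, size=n_train)
y_test = rng.integers(0, 2, size=n_test)

single_aucs, test_logits = [], []
for layer in range(n_layers):
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                      # layer-specific signal direction
    # Hypothetical activations: unit Gaussian noise plus a weak label-dependent shift along u.
    make = lambda y: rng.normal(size=(len(y), d)) + 0.5 * (2 * y[:, None] - 1) * u
    X_tr, X_te = standardize(make(y_train), make(y_test))
    w = fit_logistic_no_intercept(X_tr, y_train)
    logits = X_te @ w
    single_aucs.append(auroc(logits, y_test))
    test_logits.append(logits)

ensemble_auc = auroc(np.mean(test_logits, axis=0), y_test)
print("single-layer AUROCs:", np.round(single_aucs, 3))
print("multi-layer ensemble AUROC:", round(float(ensemble_auc), 3))
```

In this toy setting the layers have independent noise, which is the most favorable case for averaging; real residual streams are correlated across depth, so gains of this kind should not be assumed.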

"A Single Direction of Truth" applies a closely related idea to contextual hallucinations. It trains logistic regression probes on the post-layer-norm residual stream at a single token, the final token of the last sentence of a continuation, with a logistic score of the form

$$s = \sigma(w^\top h + b).$$

Across Gemma-2 observers and several datasets, the paper reports F1 up to 0.99 on CNN/DM, 0.97 on XSum, and 0.75 on ContraTales, outperforming lexical overlap, entity verification, semantic similarity, and Lookback Lens baselines by 5–27 points. It further localizes the signal with gradient-times-activation, finding a sparse, consistent MLP attribution pattern concentrated in layers 7, 8, and 9, and shows causal steering by adding a scaled copy of the probe direction to the residual stream at generation start. At one steering strength, hallucination rate is approximately 0.86 and repetition rate is below 0.05; at another, repetition rate is approximately 0.84 and hallucination rate approximately 0.35 (O'Neill et al., 31 Jul 2025).
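
To make the steering idea concrete, the following toy assumes the intervention adds a multiple $\alpha w$ of the probe weight vector to a residual activation; the exact steering vector and scale used in the paper are not reproduced here. Under that assumption the probe logit at the intervened position shifts by exactly $\alpha \lVert w \rVert^2$. The probe parameters and activation are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
w, b = rng.normal(size=d) / np.sqrt(d), -0.2    # hypothetical trained probe parameters
h = rng.normal(size=d)                           # hypothetical residual activation

def probe_logit(h):
    return float(w @ h + b)

def steer(h, alpha):
    """Shift the activation along the probe direction (assumed form of the intervention)."""
    return h + alpha * w

alpha = 4.0
shift = probe_logit(steer(h, alpha)) - probe_logit(h)
expected_shift = alpha * float(w @ w)
print("logit shift:", shift, "| alpha * ||w||^2:", expected_shift)
```

This only describes the immediate linear effect on the probe's own readout. The behavioral effect in a real model depends on how later layers process the shifted activation, which is why the reported hallucination and repetition rates have to be measured empirically.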

"Linear Probe Penalties Reduce LLM Sycophancy" moves from detection to reward shaping. It trains a single fully connected layer with sigmoid on reward-model activations $h$ to output a sycophancy score of the form

$$\hat{s} = \sigma(w^\top h + b),$$

using binary cross-entropy. At inference the raw logit $z = w^\top h + b$ is used in a surrogate reward of the form

$$R' = R - \lambda z,$$

where $R$ is the original reward and $\lambda$ sets the strength of the penalty.

For UltraRM, the paper reports that layers 12–25 give greater than 90% accuracy, chooses layer 16 for the final method, and reports test accuracy of approximately 94%, together with sycophancy-score differences of about 2.9 on POLI and about 3.2 on a separate feedback dataset. Under best-of-$N$ sampling, optimizing the original reward increases the positivity gap associated with feedback sycophancy, while the probe-penalized surrogate substantially reduces it (Papadatos et al., 2024).
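
A small sketch of probe-penalized best-of-N selection follows, assuming the surrogate reward subtracts a weighted probe logit from the original reward. The activations, probe weights, rewards, and penalty weight are all random or arbitrary placeholders, and the function name `surrogate_reward` is hypothetical.

```python
import numpy as np

def surrogate_reward(reward, probe_logit, penalty=1.0):
    """Original reward minus a weighted sycophancy logit from the linear probe."""
    return reward - penalty * probe_logit

rng = np.random.default_rng(0)
N, d = 8, 32
acts = rng.normal(size=(N, d))                    # hypothetical reward-model activations
w, b = rng.normal(size=d) / np.sqrt(d), 0.0       # stand-in for trained probe weights
probe_logits = acts @ w + b
rewards = rng.normal(size=N) + 0.8 * probe_logits  # toy reward that partly favors sycophancy

i_orig = int(np.argmax(rewards))
i_pen = int(np.argmax(surrogate_reward(rewards, probe_logits, penalty=0.8)))
print("original pick:", i_orig, "probe logit", round(float(probe_logits[i_orig]), 3))
print("penalized pick:", i_pen, "probe logit", round(float(probe_logits[i_pen]), 3))
```

For any positive penalty weight, the candidate chosen under the surrogate can never have a higher probe logit than the candidate chosen under the original reward, since it must beat the original pick on the penalized score while losing or tying on the raw reward.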

Taken together, these results show that a linear probe on residual or reward-model activations can function as a detector, a scaling-law diagnostic, an attribution handle, and a control signal. This suggests that, at least for the studied behaviors, a substantial part of the relevant information is linearly accessible in activation space (Nordby et al., 15 Apr 2026, O'Neill et al., 31 Jul 2025, Papadatos et al., 2024).

6. Probe-generated inputs and weight-space learning

Not all work uses linear probes as readouts on hidden states. "Deep Linear Probe Generators for Weight Space Learning" studies probing as a way to represent a model $f$ by its outputs on a learned set of inputs $p_1, \dots, p_k$:

$$\big(f(p_1), f(p_2), \dots, f(p_k)\big).$$

A vanilla probing baseline learns the probe inputs $p_1, \dots, p_k$ directly and then predicts properties of the model, such as dataset identity or generalization error, from the resulting outputs. The paper reports that this baseline is already surprisingly strong, but also that current probe learning strategies are ineffective: random synthetic probes are often comparable to or slightly better than independently learned probes, which is interpreted as evidence of overfitting (Kahana et al., 2024).

The proposed modification is ProbeGen, which factorizes each probe as

$$p_i = G(z_i),$$

where $z_i$ is a learned latent code and $G$ is a shared deep linear generator. For images, the generator uses transposed convolution layers without nonlinear activations; for INR coordinates, it uses fully connected linear layers. The stated goal is to impose an inductive bias toward structured probes, reduce overfitting, and preserve efficiency (Kahana et al., 2024).
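
The factorization can be sketched with fully connected layers as below. Latent codes pass through a shared stack of linear maps with no activations to produce probe inputs, and a model is then represented by its concatenated outputs on those probes. Everything here is a placeholder: the codes and generator weights are random rather than trained, the two-layer generator depth and all dimensions are arbitrary, and `make_model` is a hypothetical tiny MLP standing in for the networks being analyzed.

```python
import numpy as np

rng = np.random.default_rng(0)
k, latent_dim, hidden_dim, input_dim = 16, 8, 32, 64

# One latent code per probe (random here; these would be trained in practice).
Z = rng.normal(size=(k, latent_dim))
# Shared deep linear generator: stacked linear layers with no activation in between.
G1 = rng.normal(size=(latent_dim, hidden_dim)) / np.sqrt(latent_dim)
G2 = rng.normal(size=(hidden_dim, input_dim)) / np.sqrt(hidden_dim)

def generate_probes(Z):
    return Z @ G1 @ G2                       # shape (k, input_dim)

def represent_model(model_fn, probes):
    """Concatenate a frozen model's outputs on all probes into one feature vector."""
    return np.concatenate([model_fn(p) for p in probes])

def make_model(seed):
    """Hypothetical tiny MLP with 4 outputs, standing in for a model to be probed."""
    r = np.random.default_rng(seed)
    W1, W2 = r.normal(size=(input_dim, 16)), r.normal(size=(16, 4))
    return lambda x: np.tanh(x @ W1) @ W2

probes = generate_probes(Z)
feat = represent_model(make_model(1), probes)
print("probe matrix rank:", np.linalg.matrix_rank(probes), "| representation size:", feat.shape)
```

The composed generator `G1 @ G2` is itself a single linear map, so all probes lie in a subspace of dimension at most `latent_dim`. Any benefit from depth therefore comes from how the factorized parameters are trained, not from extra expressivity. A downstream predictor, not shown here, would be trained on `feat` to predict properties of the model.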

Empirically, ProbeGen is reported to outperform state-of-the-art weight-space baselines while remaining much cheaper. On small-scale benchmarks, ProbeGen with 128 probes reaches 0.984 accuracy on MNIST INRs, 0.877 on FMNIST INRs, 0.957 Kendall’s $\tau$ on CIFAR10-GS, and 0.932 on CIFAR10 Wild Park. The paper also reports that ProbeGen requires between 30 and 1000 times fewer FLOPs than other top approaches, with an explicit comparison of 63.40 versus 0.02 billion FLOPs on MNIST INRs and 94.56 versus 3.41 billion FLOPs on CIFAR10-GS (Kahana et al., 2024).

This is an important boundary case for the term linear probe. The model is still frozen, and the method still interrogates what can be extracted from a fixed system by a restricted linear mechanism, but the linearity now lies in probe generation rather than in the output head. A plausible implication is that "linear probe" now denotes a family of low-capacity interrogators, not only a single architectural template (Kahana et al., 2024).

7. Interpretive limits, recurring misconceptions, and open questions

Several papers emphasize that the simplicity of linear probes is both their key advantage and a potential limitation. In structural probing, a single linear map may underfit syntactic information that is present but not linearly accessible; modest non-linearity, especially the RBF probe, can reveal small but consistent gains and cleaner qualitative structure (Pal et al., 2024). In deception detection, a single layer can be catastrophically unreliable for some tasks even when nearby tasks are easy, and multi-layer ensembling is needed because the relevant direction rotates gradually across depth rather than appearing at one privileged location (Nordby et al., 15 Apr 2026). In hallucination detection and sycophancy control, linear probes are powerful but depend on the observer model, the reward model, and the specific data distribution used for training (O'Neill et al., 31 Jul 2025, Papadatos et al., 2024).

Another recurring issue is probe power. Probing papers repeatedly note the classical concern that a sufficiently expressive probe may learn the task rather than reveal the structure already present in the representation. One response is to keep the probe weakly non-linear or strictly linear, as in the structural-probe paper; another is to preserve linearity but alter optimization or parameterization, as in closed-form ridge probing and ProbeGen (Pal et al., 2024, Peng et al., 8 May 2026, Kahana et al., 2024). This suggests that the question is not simply linear versus non-linear, but which constraints preserve interpretability while retaining enough expressivity to expose the relevant signal.

A further misconception is that weak empirical performance of a particular LP baseline always implies that linear probing is intrinsically inadequate. The few-shot CLIP discussion around LP++ directly contests this, arguing that Linear Probe has often been reported as a weak baseline partly because optimization and formulation choices were suboptimal (Huang et al., 2024). Likewise, CLP-DD shows that in frozen-feature regimes the linear probe can often be solved exactly in closed form, eliminating step-size, initialization, and inner-loop hyperparameters as confounders (Peng et al., 8 May 2026).

More broadly, the papers point to an unresolved interpretive distinction between information presence and ease of extraction. ProbeGen explicitly notes an active debate about whether linear probes measure information presence or ease of extraction (Kahana et al., 2024). The results surveyed here do not settle that debate, but they do narrow it. When a linear probe supports closed-form solutions, scales with model size, transfers across datasets, localizes to specific sub-circuits, or steers behavior causally, the probe is doing more than fitting a convenient decoder. At minimum, it is identifying a stable linear direction or linear decision surface that the underlying system already makes available (Peng et al., 8 May 2026, Nordby et al., 15 Apr 2026, O'Neill et al., 31 Jul 2025).
