Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness

Published 14 Apr 2026 in cs.CL | (2604.12373v1)

Abstract: Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether LLMs possess similar privileged knowledge about answer correctness, information unavailable through external observation. We train correctness classifiers on question representations from both a model's own hidden states and external models, testing whether self-representations provide a performance advantage. On standard evaluation, we find no advantage: self-probes perform comparably to peer-model probes. We hypothesize this is due to high inter-model agreement of answer correctness. To isolate genuine privileged knowledge, we evaluate on disagreement subsets, where models produce conflicting predictions. Here, we discover domain-specific privileged knowledge: self-representations consistently outperform peer representations in factual knowledge tasks, but show no advantage in math reasoning. We further localize this domain asymmetry across model layers, finding that the factual advantage emerges progressively from early-to-mid layers onward, consistent with model-specific memory retrieval, while math reasoning shows no consistent advantage at any depth.

Summary

  • The paper demonstrates that isolating disagreement cases reveals a significant premium gap in factual recall, exposing latent correctness signals in LLM representations.
  • The paper applies both self- and external-probe classifiers across diverse datasets with rigorous stratified evaluation to uncover internal signal dynamics.
  • The paper finds that privileged knowledge in factual tasks emerges from early-to-mid layers onward and grows with depth, suggesting actionable paths for improved hallucination detection.

Disentangling Privileged Knowledge in LLM Correctness: A Domain-Specific Perspective

Motivation and Theoretical Framing

The paper "Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness" (2604.12373) asks whether LLMs possess epistemic privilege concerning the correctness of their answers: signals about their own accuracy that are embedded in internal representations and unavailable to external observers. Drawing on philosophical accounts of self-knowledge and introspection, the authors challenge recent claims that LLMs lack unique correctness indicators by showing that standard cross-model evaluation is confounded by high inter-model agreement. This agreement masks genuine privileged knowledge and motivates a refined methodology built on disagreement subsets, in which models differ in the correctness of their answers.

Methodological Design

The authors construct a probing framework centered on two classifier configurations: self-probes, trained on the target model's own internal states, and external-probes, trained on a peer model's representations. Critically, both probe types are tasked with predicting the correctness of the target model's responses. The primary experimental manipulation is evaluation on disagreement subsets (instances where the target and source models' correctness labels differ), which removes the proxy signal an external probe gets from its source model's own correctness. The evaluation uses five datasets spanning factual recall (Mintaka, TriviaQA, HotPotQA) and mathematical reasoning (MATH, GSM1K), along with three instruction-tuned decoder LMs of comparable size (Qwen-2.5-7B, Llama-3.1-8B, Gemma-2-9B) and an embedding-model baseline. Probes use both linear and MLP architectures and are evaluated with nested stratified 10-fold cross-validation, with AUC as the central metric.

Empirical Findings

Absence of Premium Gap Under Random Sampling

Initial results show that, on random samples, external probes match or surpass self-probe performance in both factual and mathematical domains, especially when the source is a strong peer model. This finding suggests that correctness prediction is largely governed by question features accessible to any model, consistent with prior claims that LLMs carry no privileged correctness signal.

Identification of Agreement Confound

The paper shows that high inter-model agreement (approx. 80% for factual and 75% for mathematical datasets) lets external probes exploit the source model's own correctness patterns as proxies for the target model's behavior. Consequently, any genuine privileged signal in the target's representations is obscured during standard evaluation, which makes it necessary to isolate disagreement cases.
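As a rough illustration of the agreement confound, the sketch below computes an inter-model agreement rate and the indices of a disagreement subset from two aligned lists of correctness labels. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def agreement_rate(target_correct: Sequence[int], source_correct: Sequence[int]) -> float:
    """Fraction of questions on which both models share a correctness label."""
    if len(target_correct) != len(source_correct) or not target_correct:
        raise ValueError("label sequences must be non-empty and aligned per question")
    same = sum(t == s for t, s in zip(target_correct, source_correct))
    return same / len(target_correct)


def disagreement_indices(target_correct: Sequence[int], source_correct: Sequence[int]) -> List[int]:
    """Indices of questions where exactly one of the two models is correct."""
    return [i for i, (t, s) in enumerate(zip(target_correct, source_correct)) if t != s]


# Toy labels (1 = correct, 0 = incorrect), invented for illustration only.
target = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
source = [1, 1, 0, 1, 1, 0, 0, 1, 0, 1]

print(agreement_rate(target, source))        # 0.8
print(disagreement_indices(target, source))  # [3, 6]
```

With 80% agreement, a probe that only learns the source model's own difficulty pattern would already match the target's label on most questions, which is why evaluation is restricted to the remaining indices.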

Domain-Specific Emergence of Privileged Knowledge

When evaluating on disagreement subsets, a statistically significant premium gap emerges in factual tasks (up to ~8.9% AUC difference), indicating that models retain unique internal signals predictive of their own correctness. This privileged signal is strictly domain-specific: it is conspicuous in factual recall but absent in mathematical reasoning, where external probes remain comparably effective even when probe expressivity is increased (MLP probes).
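The premium gap can be read as the difference between two AUC values computed on the same disagreement subset. The following is a minimal sketch under that reading; `roc_auc`, `premium_gap`, and all scores are illustrative assumptions, and the pairwise AUC here is a simple stand-in for whatever implementation the authors used.

```python
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def premium_gap(labels: Sequence[int], self_scores: Sequence[float],
                external_scores: Sequence[float], subset: List[int]) -> float:
    """Self-probe AUC minus external-probe AUC, restricted to the given indices."""
    y = [labels[i] for i in subset]
    auc_self = roc_auc(y, [self_scores[i] for i in subset])
    auc_external = roc_auc(y, [external_scores[i] for i in subset])
    return auc_self - auc_external


# Toy target-model labels and probe scores, invented for illustration only.
labels = [1, 1, 0, 1, 0, 0]
self_scores = [0.9, 0.8, 0.2, 0.6, 0.1, 0.4]
external_scores = [0.3, 0.9, 0.6, 0.7, 0.2, 0.5]
disagreement = [0, 2, 3, 5]  # hypothetical disagreement-subset indices

print(premium_gap(labels, self_scores, external_scores, disagreement))  # 0.5
```

A positive value means the self-probe separates the target's correct and incorrect answers better than the external probe on exactly the questions where shared difficulty cannot help.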

Layer-Wise Localization

The privileged signal in factual tasks gradually intensifies from early-to-mid layers onward, with the gap increasing toward deeper layers. This progression is consistent with model-specific memory retrieval that accumulates across successive layers of the forward pass. In contrast, mathematical reasoning displays no consistent premium gap at any network depth, suggesting that correctness is governed by structural input features rather than idiosyncratic retrieval or confidence signals.
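To make the layer-wise comparison concrete, the sketch below maps a set of probed layers onto a normalized depth in [0, 1] (0 = first probed layer, 1 = last) and finds the first depth at which a per-layer gap turns positive. The 32-layer model, the stride of 5 starting at layer 0, and the gap values are assumptions chosen for illustration.

```python
from typing import List, Optional, Sequence


def probed_layers(num_layers: int, stride: int = 5) -> List[int]:
    """Layer indices sampled at a fixed stride (starting at layer 0 is an assumption)."""
    return list(range(0, num_layers, stride))


def normalized_depth(layers: Sequence[int]) -> List[float]:
    """Rescale layer indices so the first probed layer is 0.0 and the last is 1.0."""
    first, last = layers[0], layers[-1]
    if first == last:
        return [0.0]
    return [(layer - first) / (last - first) for layer in layers]


def first_positive_depth(depths: Sequence[float], gaps: Sequence[float]) -> Optional[float]:
    """Normalized depth of the first probed layer whose premium gap is above zero."""
    for depth, gap in zip(depths, gaps):
        if gap > 0:
            return depth
    return None


layers = probed_layers(32)          # hypothetical 32-layer model
depths = normalized_depth(layers)
toy_gaps = [-0.01, 0.00, 0.01, 0.02, 0.03, 0.04, 0.05]  # invented per-layer gaps

print(layers)  # [0, 5, 10, 15, 20, 25, 30]
print(first_positive_depth(depths, toy_gaps))
```

Normalizing depth this way lets curves from models with different layer counts share one axis.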

Concept-Level Drivers and Lexical Analysis

Control experiments stripping questions to named entities and nouns ('Lexical-Only' inputs) recover a substantial proportion of probe performance in factual and MATH datasets, substantiating the role of concept-level familiarity and topic indicators. For GSM1K, correctness prediction drops to chance, indicating reliance on problem structure rather than surface tokens.

Implications for AI Research and Practice

The findings delineate the boundaries of LLM introspection and correctness detection, reconciling conflicting prior literature by showing that privileged correctness signals are not universal but domain-specific, and are masked by consensus when models agree. Practically, monitoring internal activations on factual tasks may offer better hallucination detection and output reliability, and the methodology extends to hybrid domains (e.g., coding, commonsense reasoning). The emergence of privileged knowledge from early-to-mid layers onward opens avenues for targeted intervention such as activation steering. The causal origins and mechanisms underlying privileged correctness remain open; a natural next step is to intervene experimentally on the identified correctness directions in the residual stream.

Limitations and Directions for Future Research

The study focuses primarily on 7B–9B LMs, with a secondary evaluation on Qwen-3-32B, so results may differ for larger models or unsupervised settings. The probe-based analysis is correlational and does not establish causality. Furthermore, linear and MLP probe architectures may not extract every privileged signal present. The scope is restricted to factual recall and mathematical reasoning; extending the analysis to hybrid domains is required to generalize the domain-specificity finding.

Conclusion

The paper offers a rigorous methodological correction to privileged-signal detection in LLMs, demonstrating that genuine self-knowledge about answer correctness is present in factual domains but absent in mathematical reasoning, and is masked by consensus when models agree. By isolating disagreement cases and conducting layer-wise analysis, the authors provide evidence that privileged correctness signals emerge progressively in deeper model representations for factual tasks, consistent with idiosyncratic memory retrieval. These results bear on introspection research, hallucination detection, and the structure of LLM internal knowledge, and set the agenda for future mechanistic and causal investigation of privileged signals in LLMs.

Explain it Like I'm 14

Overview: What this paper is about

This paper asks a simple but deep question: do LLMs have a kind of “gut feeling” about whether their own answers will be right or wrong that other models can’t see? In other words, besides what we can observe from the outside, do models carry private, inside information that helps them judge their own correctness?

The key questions in plain terms

The researchers focused on two kid-friendly questions:

  • Do models have special, private cues in their “thinking process” that tell them when they’re right?
  • If such private cues exist, when do they show up (for facts, for math, in early or later stages of the model’s computations)?

How they studied it (with simple analogies)

Think of each model like a student answering quiz questions. While a student thinks, their brain has activity patterns. Models have something similar called “hidden states”—internal signals that show what’s going on while they read a question.

The team built small “detectors” (classifiers, also called probes) that try to predict whether the model’s answer will be correct. They trained two kinds of detectors:

  • Self-probe: reads the target model’s own hidden states (like putting a stethoscope on the same student’s brain).
  • External-probe: reads a different model’s hidden states to predict the target model’s correctness (like using a classmate’s brain activity to guess whether the first student will be right).

If a self-probe beats any external-probe, that suggests the target model has private, hard-to-copy signals about its own correctness.

Two more details, in everyday language:

  • Disagreement subset: If two students usually get the same questions right or wrong, a classmate’s brain patterns can look like a good predictor simply because both students find the same questions easy or hard. To avoid that “copying,” the researchers focused on the questions where the two models disagreed (one was right, the other wrong). This is like only checking the questions where students’ answers differ, to see who really has unique knowledge.
  • Layers: Models process text in steps called layers (think of them as stages in reasoning). The researchers checked early, middle, and late stages to see where any private signals show up.

Technical bits translated:

  • “Linear probe” and “MLP probe”: simple and slightly less simple detectors that read internal signals and try to predict correct vs. incorrect.
  • “AUC” (Area Under the Curve): a score from 0.5 (random guessing) to 1.0 (perfect) that says how well a detector separates right answers from wrong ones without needing to pick a single threshold.

What they found and why it matters

Here are the key results:

  • On normal test sets, self-probes were not better than external-probes. At first glance, that looks like there’s no private “gut feeling.”
  • But models often agree—about 75–80% of the time—on which questions are easy or hard. That shared pattern lets an external-probe look good without seeing the target model’s private signals. This is called the “agreement confound.”
  • On the disagreement questions (where one model is right and the other is wrong), a clear pattern appears:
    • Factual questions (like trivia) show a real self-probe advantage (about a 5% boost). That’s evidence of genuine private knowledge: the model’s own internal signals help predict whether it will recall a fact correctly.
    • Math questions show no self-probe advantage. External-probes do just as well as self-probes, even when models disagree. That suggests math success depends more on public, visible features of the problem rather than private, model-specific signals.
  • Where in the model do these private signals show up?
    • For factual questions, the self-probe advantage starts in early-to-middle layers and grows in later layers—consistent with how a model retrieves and consolidates facts as it processes the question.
    • For math, there’s no consistent advantage at any layer.

Why this matters:

  • It shows that “knowing you’re right” is domain-specific. Models seem to have private, internal hints about factual recall (like feeling whether a memory is accessible), but not about math reasoning.
  • It explains why earlier studies sometimes didn’t find private signals: the strong agreement across models masked them.

What this could mean going forward

  • Better fact-checking and hallucination detection: Tools that peek into a model’s internal signals may spot when it’s likely to make up facts—even if the final answer looks confident.
  • Smarter evaluation methods: Testing only on “where models disagree” gives a clearer view of true, model-specific strengths and blind spots.
  • Deeper research on “where” private signals live: The fact that factual signals grow in mid-to-late layers suggests new ways to steer or improve memory retrieval inside models.

A short note on limits

  • The study focused on medium-sized models and two domains (facts and math); results might differ for bigger models or other areas like coding or commonsense reasoning.
  • Probes read signals but don’t prove cause-and-effect. Future work could try “nudging” those internal signals to see if that actually changes correctness.

In one sentence

Models do seem to carry a private “gut feeling” about factual correctness hidden in their internal signals—especially in mid-to-late processing steps—but don’t show the same private edge for math, where success appears more publicly visible from the problem itself.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The paper leaves the following issues unresolved and open for future investigation:

  • Causal mechanism: The work is correlational; it does not identify which circuits, heads, or residual directions causally implement the “privileged” correctness signal, nor test whether intervening on them modulates outputs.
  • Probe expressivity: Only linear and small MLP probes are tested; it remains unknown whether more expressive or contrastive probes would change the presence or size of the premium gap, especially in math.
  • Token/time localization: Probes use the question’s final-token representation every 5th layer; it is unclear which tokens (e.g., subject tokens, BOS, delimiter tokens) or time steps carry the strongest signal, and whether answer- or intermediate-generation states encode stronger privileged knowledge.
  • Logits vs hidden states: The study does not compare hidden-state probes against strong black-box baselines (e.g., logit margins, entropy, self-reported confidence, sequence-level likelihoods), leaving the incremental value of internals underexplored.
  • Agreement-confound control: Beyond pairwise disagreement, there is no use of item-response theory or triad/ensemble disagreement to factor out shared difficulty more systematically; how conclusions change under such controls is unknown.
  • Domain coverage: Results are limited to factual QA and math; it is unknown whether privileged knowledge appears in coding, commonsense reasoning, multimodal tasks, scientific QA, or tool-augmented settings.
  • Retrieval-augmented scenarios: The study uses question-only inputs; whether privileged knowledge persists or changes with external documents, retrieval-augmented generation, or tool use is not tested.
  • Prompting/generation regime: Effects of chain-of-thought prompting, self-consistency, temperature, decoding strategies, and instruction style on privileged knowledge (especially for math) are not evaluated.
  • Base vs instruction-tuned models: Only instruction-tuned LMs are analyzed; whether base models exhibit different privileged-knowledge patterns is unknown.
  • Scale and architecture: Main analysis focuses on 7B–9B models (plus one 32B); the scaling behavior and architecture dependence (e.g., MoE, state-space, multimodal encoders) of privileged knowledge remain open.
  • Training-data overlap: The role of pretraining/instruction-tuning data overlap across models in creating high agreement (and masking) is not controlled, leaving uncertainty about whether the factual premium depends on shared corpora.
  • Cross-lingual and paraphrase robustness: It is unclear if the privileged signal is tied to lexical surface forms; robustness to paraphrases, multilingual variants, and adversarial perturbations is untested.
  • Temporal drift/novelty: Whether privileged knowledge holds for temporally novel facts (post-cutoff data) versus memorized facts is not examined.
  • Label quality and scoring: Correctness labeling details (e.g., EM vs fuzzy match, symbolic math checking, partial credit) may affect conclusions; sensitivity to label noise and alternative scoring is not reported.
  • Sample efficiency and generalization: The probe’s data efficiency, cross-dataset transfer, and OOD generalization (train on one dataset/model, test on another) are not characterized.
  • Multi-source external observers: The work uses single-source external probes; whether ensembling multiple external models or jointly training on multi-source representations closes the factual premium gap is unknown.
  • Embedding-model coverage: Only one embedding model is used; it is unclear if conclusions generalize across embedding families/sizes or sentence-transformers trained with different objectives.
  • Layer granularity: Layers are sampled every 5th step; a full per-layer/per-head analysis could reveal sharper emergence points or specific attention heads responsible for the factual premium.
  • Mechanistic link to memory retrieval: The hypothesized connection to model-specific memory retrieval is not directly validated via activation patching, path attribution, or targeted ablations.
  • Math-negative result diagnosis: It is unclear whether the absence of a premium gap in math stems from dataset composition, lack of chain-of-thought, weaker math specialization of the chosen models, or fundamentally “public” reasoning signals.
  • Calibration metrics: Only AUC is reported; calibration (ECE/Brier), decision-theoretic utility, and thresholded performance under cost asymmetries are not assessed.
  • Statistical dependence: Fold-wise paired t-tests assume adequate independence; potential dependence from cross-validation splits and multiple reuse of the same base models is not probed with hierarchical or mixed-effects models.
  • Disagreement-subset artifacts: Although training is done on full data, possible distributional shift or selection artifacts in the disagreement subset (e.g., boundary cases with atypical features) are not fully ruled out.
  • Direction of predictiveness: The paper probes pre-generation question states; whether privileged correctness can be predicted earlier (e.g., after entity encoding) or only after specific mid-layer computations is not established beyond coarse trends.
  • Practical impact: The reported ~5% factual premium’s operational significance (for hallucination detection, routing, or safety monitoring) and its cost–benefit trade-off versus black-box confidence methods remain unquantified.

Practical Applications

Immediate Applications

The paper’s findings enable several deployable workflows and tools, particularly for factual tasks where models exhibit privileged (self-only) correctness signals that external observers cannot fully capture.

  • Factual hallucination detection via self-probes
    • Sector: software, enterprise search, customer support, healthcare (non-diagnostic knowledge assistance), legal research
    • Workflow/product: Integrate a lightweight “self-probe” (linear/MLP) on early-to-mid layers to predict whether the model’s factual answer will be correct; if low, trigger RAG, web search, or human review before responding.
    • Assumptions/dependencies:
    • Access to internal activations (open-weight or self-hosted models like Llama/Qwen/Gemma).
    • Modest but real AUC gains (~5%) are domain-specific (factual only); calibration and thresholding must be tuned per model/dataset.
    • Overhead of extracting mid-layer states must be budgeted for latency-sensitive applications.
  • Confidence-aware routing for factual queries
    • Sector: enterprise assistant platforms, developer tools
    • Workflow/product: Use self-probe scores to route factual prompts to: (a) retrieval-enhanced pipelines, (b) stronger specialist models, or (c) abstention. For math tasks, bypass self-probe and rely on external verifiers or step-checkers (no privileged advantage found).
    • Assumptions/dependencies:
    • Requires per-domain routing logic (facts vs math).
    • Needs ground-truth evaluation to set decision thresholds.
  • Disagreement-based evaluation for model introspection
    • Sector: academia, model evaluation teams, ML ops
    • Workflow/product: Adopt “disagreement subsets” (examples where models disagree on correctness) to evaluate introspection and avoid the inter-model agreement confound. Report the premium gap (self vs external probes) as a metric of genuine privileged knowledge.
    • Assumptions/dependencies:
    • Requires at least two comparable models to construct disagreement subsets.
    • Probes should be trained on full data and evaluated on disagreement-only splits (as per paper’s protocol).
  • Layer-targeted confidence heads for factual QA
    • Sector: software/ML infrastructure
    • Workflow/product: Attach small confidence heads to early-to-mid layers (where the factual advantage emerges) to predict correctness during inference with minimal overhead.
    • Assumptions/dependencies:
    • Head placement tuned per architecture; validate where advantage turns positive (typically early-to-mid layers).
    • Ongoing monitoring for distribution shift.
  • Audit and compliance logging for factual answers
    • Sector: finance, legal, enterprise risk/compliance
    • Workflow/product: Record the self-probe correctness score alongside outputs as an internal confidence tag; use for post-hoc audits and failure analysis of factual claims.
    • Assumptions/dependencies:
    • Scores are informative but not guarantees; use in conjunction with evidence retrieval/citations.
    • Activation logging must respect privacy/security constraints.
  • Dataset curation and training prioritization
    • Sector: academia, model training teams
    • Workflow/product: Use model–model disagreement to mine “boundary” factual examples for fine-tuning/evaluation to reduce hallucinations and calibrate confidence.
    • Assumptions/dependencies:
    • Requires multiple models and label pipelines; disagreements reflect harder/uncertain regions with lower baseline AUC.
  • Model selection and ensemble routing
    • Sector: platform engineering
    • Workflow/product: Identify domains where each model’s self-introspection is strongest (factual recall) and route accordingly; for math, prefer external verifiers/step checkers since self-advantage is absent.
    • Assumptions/dependencies:
    • Benchmark per task family; maintain routing rules over time.
  • Pretrained “introspection probe packs”
    • Sector: open-source tooling, ML ops
    • Workflow/product: Ship ready-to-use probes for popular open models (e.g., Llama-3.1-8B, Qwen-2.5-7B, Gemma-2-9B) to provide factual correctness scores out-of-the-box.
    • Assumptions/dependencies:
    • Distribution shift can degrade performance; include fine-tuning hooks.
    • Probes are model- and layer-specific; version control is needed.
  • Better benchmarks for correctness prediction
    • Sector: academia, benchmarking consortia
    • Workflow/product: Update correctness prediction benchmarks to report results on full sets and disagreement subsets; include per-layer analyses to localize where privileged signals arise.
    • Assumptions/dependencies:
    • Community adoption; standardized data splits and reporting protocols.
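The confidence-aware routing idea from the list above could look roughly like the sketch below: factual queries are gated on a self-probe score, while math queries skip the probe and go to an external verifier. The route names, the `RoutingDecision` type, and the 0.5 threshold are all hypothetical; a real threshold would need tuning per model and dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoutingDecision:
    route: str
    reason: str


def route_query(domain: str, self_probe_score: Optional[float],
                threshold: float = 0.5) -> RoutingDecision:
    """Pick a handling path from the task domain and an optional self-probe score."""
    if domain == "math":
        # No self-probe advantage was found for math, so the score is ignored.
        return RoutingDecision("external_verifier", "no self-probe advantage for math")
    if domain != "factual":
        # Domains outside the study are sent to review instead of being guessed at.
        return RoutingDecision("human_review", "domain not covered by the study")
    if self_probe_score is None:
        raise ValueError("factual routing needs a self-probe score")
    if self_probe_score < threshold:
        return RoutingDecision("retrieval", "self-probe predicts a likely incorrect answer")
    return RoutingDecision("answer_directly", "self-probe predicts a likely correct answer")


print(route_query("factual", 0.31).route)  # retrieval
print(route_query("factual", 0.87).route)  # answer_directly
print(route_query("math", None).route)     # external_verifier
```

Keeping the math branch separate reflects the domain asymmetry: the probe score is only treated as informative where a premium gap was observed.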

Long-Term Applications

Translating the paper’s insights into broader, scalable systems will require further research, engineering, and policy development.

  • Activation steering to improve factual correctness
    • Sector: software, healthcare/finance (high-stakes factuality), safety research
    • Workflow/product: Learn “correctness directions” tied to factual retrieval in residual streams; steer activations to increase the probability of correct factual recall.
    • Assumptions/dependencies:
    • Causal validation is needed (the paper suggests this as future work).
    • Guard against unintended behavior shifts and alignment regressions.
  • Standardized introspection APIs from model providers
    • Sector: cloud AI platforms, policy
    • Workflow/product: Expose provider-level “activation-based confidence” endpoints for factual tasks (e.g., mid-layer correctness scores) to enable black-box users to benefit from self-knowledge without raw activation access.
    • Assumptions/dependencies:
    • Requires provider cooperation and privacy/security safeguards.
    • Standardization bodies may define metrics and interfaces.
  • Regulatory guidance for high-stakes AI correctness
    • Sector: policy, healthcare, finance, public sector
    • Workflow/product: Incorporate disagreement-based introspection audits into certification for systems making factual claims (e.g., requiring evidence of privileged-signal harnessing or robust alternatives like human-in-the-loop/RAG).
    • Assumptions/dependencies:
    • Evidence that introspective signals improve safety in practice.
    • Harmonization with existing risk management frameworks.
  • Training objectives to enhance factual introspection
    • Sector: model development
    • Workflow/product: Multi-task or auxiliary losses that make models explicitly encode retrieval success/confidence in mid layers; improve the premium gap and downstream confidence calibration.
    • Assumptions/dependencies:
    • Must avoid teaching-to-the-test artifacts where probes pick up question-only features.
    • Eval must use disagreement subsets to measure true gains.
  • Domain-specific confidence strategies in education and tutoring
    • Sector: education technology
    • Workflow/product: For factual questions, rely on model introspection to decide when to fetch sources; for math, default to step-verification and external proof-checkers since no self-advantage exists.
    • Assumptions/dependencies:
    • Integration with step-verification pipelines; user experience design for adaptive feedback.
  • Hardware/runtime support for mid-layer readouts
    • Sector: AI hardware, compilers, inference systems
    • Workflow/product: Optimize inference runtimes to expose mid-layer vectors and run tiny heads/probes with minimal overhead (e.g., event-driven readouts at specific layers).
    • Assumptions/dependencies:
    • Requires scheduler/runtime support and minimal memory penalties.
    • Security constraints for activation access in multi-tenant settings.
  • Multi-agent/task allocation using introspective signals
    • Sector: autonomous agents, orchestration frameworks
    • Workflow/product: Route factual subtasks to agents/models with strong self-privileged signals; for math agents, prioritize external verifiers or formal solvers.
    • Assumptions/dependencies:
    • Requires accurate task typing (factual vs math) and reliable agent-level scoring.
  • Cross-domain extension (coding, commonsense, scientific QA)
    • Sector: software engineering tools, research assistants
    • Workflow/product: Apply the disagreement-subset protocol to hybrid domains to determine whether privileged correctness signals exist and where (by layer); tailor guardrails accordingly.
    • Assumptions/dependencies:
    • New benchmarks and careful task design (e.g., coding correctness vs test outcomes).
    • Per-domain calibration likely needed.
  • Model governance dashboards for introspection health
    • Sector: ML ops, compliance
    • Workflow/product: Track premium gaps over time across domains and layers; alert when introspection degrades (e.g., after fine-tuning or quantization).
    • Assumptions/dependencies:
    • Continuous evaluation infrastructure with multiple reference models.
    • Clear thresholds for action and rollback procedures.

Glossary

  • Activation steering: Manipulating internal model activations to steer behavior in a desired direction. "A natural avenue for future work is activation steering: if the factual correctness signal is genuinely tied to subject-specific retrieval, intervening on the identified correctness direction in the residual stream should predictably modulate output correctness."
  • Area Under the ROC Curve (AUC): A threshold-independent metric measuring the ability of a classifier to distinguish classes. "We evaluate performance using the Area Under the ROC Curve (AUC)."
  • Bonferroni-Holm correction: A stepwise multiple-comparison procedure that controls the family-wise error rate. "To control for family-wise error rates in multiple comparisons, we apply the Bonferroni-Holm correction (Holm, 1979) ($p < 0.05$)."
  • Cross-Model: An external-probe configuration where the source model is a peer LLM of comparable size. "Cross-Model: $M_{source}$ is a peer LLM of comparable size (e.g., predicting Qwen's correctness using Llama's hidden states)."
  • Disagreement subset: The set of examples where two models disagree on correctness, used to isolate model-specific signals. "To address this challenge, we construct disagreement subsets: questions where models produce conflicting correctness labels."
  • Embedding-Model: An external-probe configuration where the source model is an embedding model. "Embedding-Model: $M_{source}$ is an embedding model of comparable size."
  • Epistemic privilege: The philosophical notion that an agent has special access to its own internal states not recoverable from external observation. "In the philosophy of mind, epistemic privilege refers to the idea that an agent has special access to its own internal states—information that cannot be fully recovered from external observation alone"
  • External-Probe: A probe trained on an external model’s representations to predict the target model’s correctness. "We refer to the advantage in correctness prediction performance of a self-probe over an external-probe as the premium gap."
  • Forward pass: The progression of computations through model layers when processing input. "consistent with model-specific memory retrieval that accumulates through the forward pass"
  • Instruction-tuned decoder LMs: Decoder-only LLMs fine-tuned using instruction-following data. "We evaluate three instruction-tuned decoder LMs of comparable size: Llama-3.1-8B (Grattafiori et al., 2024), Qwen2.5-7B (Qwen, 2025), and Gemma-2-9B (Gemma Team, 2024), alongside the embedding model Qwen3-Embedding-8B (Yang et al., 2025)."
  • Inter-model agreement: The extent to which different models yield the same correctness labels, which can confound evaluations. "We identify inter-model agreement as a critical confound: probes leverage shared difficulty patterns to predict correctness without needing access to the target's internal state."
  • Linear probe: A linear classifier trained on hidden states to predict a target property (e.g., correctness). "Our primary analysis uses a linear probe (logistic regression with $L_2$ regularization)."
  • Logistic regression with L2 regularization: A linear classification method with weight decay used for probing representations. "Our primary analysis uses a linear probe (logistic regression with $L_2$ regularization)."
  • Mechanistic analyses: Interpretability studies that examine internal computations to explain model behavior. "This view is further supported by mechanistic analyses from Chi et al. (2025), who show that when LLMs hallucinate due to retrieving incorrect knowledge, their internal states are indistinguishable from those of correct answers, suggesting that LLMs do not explicitly encode correctness."
  • MLP probes: Non-linear multilayer perceptron classifiers used to probe internal representations. "To ensure findings are not artifacts of linearity, we replicated all experiments using non-linear MLP probes, yielding qualitatively similar results (see the appendix on MLP probe results)."
  • Multi-hop reasoning: Reasoning that requires chaining multiple pieces of information across steps. "Note that while HotpotQA is often considered a multi-hop reasoning dataset, we use question-only evaluation without supporting documents, making it a test of parametric memory retrieval."
  • Nested Stratified K-Fold Cross-Validation (k=10): A two-level cross-validation procedure with stratification to tune and evaluate models robustly. "All probes are evaluated via Nested Stratified K-Fold Cross-Validation ($k=10$), reporting AUC on the aggregated out-of-fold probabilities."
  • Normalized layer depth: A scaled measure of layer position used to compare across architectures or subsets of layers. "and plot the premium gap against normalized layer depth (0 = first probed layer, 1 = last)."
  • Out-of-distribution tasks: Evaluation settings that differ from the data distribution seen during training or development. "\citet{binder2025looking} note that the observed self-prediction advantage is often limited to simple settings and does not consistently generalize to out-of-distribution tasks."
  • Paired t-test: A statistical test comparing means of paired measurements (e.g., across folds) to assess significance. "We assess the significance of the premium gap using paired $t$-tests across validation folds."
  • Parametric memory retrieval: Recalling factual information stored in the model’s parameters rather than consulting external context. "Note that while HotpotQA is often considered a multi-hop reasoning dataset, we use question-only evaluation without supporting documents, making it a test of parametric memory retrieval."
  • Premium gap: The performance advantage of self-probes over external probes in predicting correctness. "we measure the premium gap: the performance advantage of a correctness classifier trained on a model's own internal representations over one trained on external model representations (\Cref{fig:main_method})."
  • Privileged knowledge: Internal, model-specific information predictive of answer correctness that is not fully inferable from external signals. "We investigate whether LLMs possess similar privileged knowledge about answer correctness, information unavailable through external observation."
  • Residual stream: The running hidden state carried along a transformer's residual connections, to which each attention and MLP block adds its output as token representations pass through the layers. "intervening on the identified correctness direction in the residual stream should predictably modulate output correctness."
  • Residual-stream features: Features derived from the transformer’s residual stream that encode specific signals (e.g., factual self-awareness). "Similarly, \citet{tamoyan2025factual} demonstrate that residual-stream features encode a 'factual self-awareness' signal: simple linear projections can predict whether a model will recall a fact correctly."
  • Self-Probe: A probe trained on the target model’s own hidden states to predict its correctness. "self-probes successfully predict correctness across both factual knowledge and mathematical reasoning."
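
Several of the terms above (premium gap, AUC on out-of-fold probabilities, normalized layer depth, paired t-test across folds) are simple quantities, and a small sketch may make them concrete. The code below is an illustration only, not the authors' implementation: the function names, the toy scores, the layer indices, and the per-fold AUC values are all hypothetical, and probe training itself is omitted. It uses only the Python standard library; in practice one would more likely rely on an established statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def auc(labels, scores):
    """ROC AUC: chance that a random correct item outscores a random incorrect one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def premium_gap(labels, self_scores, external_scores):
    """Self-probe AUC minus external-probe AUC on the same correctness labels."""
    return auc(labels, self_scores) - auc(labels, external_scores)


def normalized_depth(layer, probed_layers):
    """Map a layer index to [0, 1]: 0 = first probed layer, 1 = last probed layer."""
    first, last = probed_layers[0], probed_layers[-1]
    if last == first:
        raise ValueError("need at least two distinct probed layers")
    return (layer - first) / (last - first)


def paired_t_statistic(a, b):
    """t statistic of a paired t-test on per-fold measurements a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical toy data: 1 = target model answered correctly, 0 = incorrectly.
labels = [1, 0, 1, 0, 1, 0]
self_scores = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1]      # assumed out-of-fold probabilities, self-probe
external_scores = [0.6, 0.5, 0.7, 0.4, 0.3, 0.2]  # assumed out-of-fold probabilities, external-probe

gap = premium_gap(labels, self_scores, external_scores)
depth = normalized_depth(16, [8, 12, 16, 20, 24])  # hypothetical probed layers

# Hypothetical per-fold AUCs for the two probes (three folds for brevity).
t_stat = paired_t_statistic([0.80, 0.82, 0.79], [0.75, 0.78, 0.76])

print(f"premium gap = {gap:.3f}, normalized depth = {depth:.2f}, t = {t_stat:.2f}")
```

A positive gap means the self-probe ranks correct and incorrect answers better than the external probe on the same examples. The paired statistic treats each fold as one matched observation, which is why both probes must be evaluated on identical folds.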
