Mirage Probes: Unmasking Hidden Behaviors

Updated 16 June 2026

Mirage probes are diagnostic instruments that detect covert, illusory, or degenerate behaviors across computational, physical, and condensed-matter systems by analyzing hidden dynamics.
They employ methods such as low-dimensional geometric analysis, temporal burst aggregation, and contrastive evaluation to expose spurious internal encoding and reasoning pathways.
Applications range from monitoring LLM agentic encoding and unlearning robustness to enhancing visual grounding, repository-level code reasoning, and probing collective modes in condensed matter.

A mirage probe is an evaluative instrument, often operationalized as a diagnostic, probing, or readout mechanism, deployed to surface otherwise covert, illusory, or degenerate behaviors within computational, physical, or condensed-matter systems. The name "mirage" refers to phenomena that are invisible to direct observation or classic surface-level metrics and can be identified only by interrogating internal, dynamical, or representational structure for latent computation, knowledge, or collective modes. Across domains (LLMs, vision-language models, code agents, Fermi liquids, superconducting heterostructures), mirage probes serve as the principal methodology for unmasking such hidden or spurious processes. This entry surveys the core design principles, mathematical structures, methodological workflows, and empirical results associated with mirage probes in various disciplines.

1. Mirage Probes in LLM Agentic Encoding Detection

The MIRAGE probe apparatus for LLM agents exemplifies a model-internal, mechanistically grounded monitoring technique targeting covert information encoding, particularly in scenarios involving agentic exfiltration or adversarial manipulation (Revankar et al., 9 Jun 2026). The probe exploits the emergence of a low-dimensional encoding subspace within the residual stream of autoregressive LLMs, empirically demonstrated to support diverse encoding strategies (Base64, ROT13, acrostic, synonym chains, etc.) across multiple architectures.

Formally, for each encoding style $s$, the mean-difference direction is computed as

$\mathbf v_s = \bar{\mathbf h}_s^+ - \bar{\mathbf h}_s^-,\qquad \bar{\mathbf h}_s^\pm = \frac{1}{N^\pm}\sum_{i=1}^{N^\pm} \mathbf h_{i,s}^\pm,$

where $\mathbf h_{t}^{(\ell)}\in \mathbb{R}^d$ is the residual-stream hidden state at token $t$, layer $\ell$, and $N^\pm$ count the "positive" (encoding) and "negative" (normal) traces respectively. Principal-component analysis reveals that the $\mathbf v_s$ for the styles $s$ in a test suite span a subspace of dimension $\leq 5$, with at least $94\%$ of the leave-one-encoding-out (LOEO) probe's area-under-curve (AUC) attainable using just the leading principal component(s).
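As an illustration of the mean-difference construction and the low-dimensionality check, the following is a minimal sketch on synthetic data. The activations, the dimensions, and the function names are hypothetical stand-ins, and an uncentered SVD is used in place of a full PCA pipeline because the question here is which subspace the directions span.

```python
import numpy as np

def mean_difference_direction(pos, neg):
    """v_s = mean of positive traces minus mean of negative traces; inputs are (N, d)."""
    return pos.mean(axis=0) - neg.mean(axis=0)

def singular_energy_ratio(directions):
    """Share of squared singular-value mass per component (uncentered SVD)."""
    singular_values = np.linalg.svd(directions, compute_uv=False)
    energy = singular_values ** 2
    return energy / energy.sum()

# Synthetic stand-in for pooled residual-stream activations (not real model data).
rng = np.random.default_rng(0)
d, n_traces, n_styles = 64, 500, 8
shared = rng.normal(size=d)
shared /= np.linalg.norm(shared)  # one shared "encoding" direction, by construction

directions = []
for _ in range(n_styles):
    neg = rng.normal(size=(n_traces, d))
    # Positive traces are shifted along the shared direction with a style-specific strength.
    pos = rng.normal(size=(n_traces, d)) + rng.uniform(2.0, 4.0) * shared
    directions.append(mean_difference_direction(pos, neg))
directions = np.stack(directions)  # shape (n_styles, d)

ratios = singular_energy_ratio(directions)
cosines = directions @ shared / np.linalg.norm(directions, axis=1)
print("energy share of leading component:", round(float(ratios[0]), 3))
print("min cosine to shared direction:", round(float(cosines.min()), 3))
```

Because the synthetic styles share one direction by construction, the leading component dominates here; on real activations the spectrum would have to be inspected rather than assumed.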

A logistic regression probe is then trained on L2- and robust-scaled, mean-pooled hidden activations extracted from the top layers. For a normalized feature vector $\mathbf x$, the probability of internal encoding at each generation step is given by $p = \sigma(\mathbf w^\top \mathbf x + b)$, where $\sigma$ is the logistic function and $\mathbf w$, $b$ are learned parameters. Temporal aggregation via a sliding-window "burst" maximization characterizes the occurrence of encoding events. A second signal comes from the polarity of the planning token: projecting the hidden state of the last input token onto a shared encoding direction yields a sign that linearly separates inline (simulated) encoding from tool-call delegation.
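The per-step probe, the burst aggregation, and the polarity readout described above can be sketched in a few lines. This is an illustrative sketch only: the weights, the window length, and the choice to average probabilities inside each window before taking the maximum are assumptions, not details given by the source.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def step_probabilities(features, w, b):
    """Logistic probe applied to one feature vector per generation step."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in features]

def burst_score(probs, window=3):
    """Maximum over sliding windows of the mean per-step probability (assumed rule)."""
    if not probs:
        raise ValueError("need at least one generation step")
    if len(probs) < window:
        return sum(probs) / len(probs)
    return max(sum(probs[i:i + window]) / window
               for i in range(len(probs) - window + 1))

def planning_polarity(hidden, direction):
    """Sign of the projection of a hidden state onto a shared encoding direction."""
    projection = sum(h * v for h, v in zip(hidden, direction))
    return 1 if projection >= 0 else -1

# Hypothetical probe parameters and a short trace with a three-step burst in the middle.
w, b = [2.0, -1.0], -1.0
trace = [[0, 0], [0, 1], [2, 0], [2, 0], [2, 0], [0, 0]]
probs = step_probabilities(trace, w, b)
burst = burst_score(probs, window=3)
mean_prob = sum(probs) / len(probs)
```

The burst score stays high for a short run of encoding steps even when the trace-level mean is diluted by benign steps, which is the motivation for windowed maximization over plain averaging.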

Fusing the burst and intent (planning-token polarity) channels in real time yields a two-channel MIRAGE monitor whose empirical AUC on a suite of 126 exfiltration scenarios far exceeds that of output-side detection. The method is robust to adversarial suppression: probe-aware attacks that obscure the encoding subspace simultaneously destroy encoding fidelity, establishing a tight coupling between geometric separability and the recoverability of covert computation. False positive rates on benign inputs are model-dependent, differing between Qwen-7B and Phi-3.5, which indicates that probe success is fundamentally governed by host-model geometry (Revankar et al., 9 Jun 2026).
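To make the channel comparison concrete, the sketch below computes AUC as a rank statistic for each channel and for a fused score. The fusion rule (a convex combination) and every number in the toy data are assumptions chosen for illustration; the source does not specify how the two channels are combined.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fuse(burst, intent, weight=0.5):
    """Hypothetical fusion rule: convex combination of the two channel scores."""
    return weight * burst + (1.0 - weight) * intent

# Made-up per-scenario channel scores in [0, 1]; label 1 marks an exfiltration scenario.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
burst = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1, 0.35]
intent = [0.6, 0.7, 0.9, 0.3, 0.2, 0.5, 0.1, 0.3]
fused = [fuse(bu, it) for bu, it in zip(burst, intent)]

for name, scores in [("burst", burst), ("intent", intent), ("fused", fused)]:
    print(name, roc_auc(scores, labels))
```

In this toy data each channel misses a different positive, so the fused score separates the classes better than either channel alone; whether that holds in practice depends on how correlated the channels' errors are.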

2. Mirage Probes in Evaluating LLM Unlearning Robustness

In the context of machine unlearning, Mirage Probes provide a dynamic, model-driven framework for evaluating residual knowledge in LLMs following an unlearning procedure (Shah et al., 11 Mar 2026). In contrast to static cloze or Q→A benchmarks, Mirage Probes are constructed by extracting a knowledge graph from the pre-unlearned model via BFS expansion, relevance filtering, and alias resolution, and they cover explicit single-hop, multi-hop, and surface-form-perturbed queries.

The generated probes are tuned along two axes: hop count (reasoning depth) and surface perturbations (aliasing, paraphrasing). Difficulty is parameterized by the BFS decay rate, the expansion depth, and semantic similarity/relevance thresholds. The probe pipeline retains only queries that the pre-unlearned model answered correctly, so that post-unlearning successes reflect genuinely residual knowledge.
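A probe generator of this kind can be sketched as a breadth-first walk over a toy knowledge graph. The decay rule (multiplying the path score by a decay factor and the edge relevance at every hop), the threshold values, the toy graph, and the `answered_correctly` callback are all assumptions for illustration; the source names these controls without defining them.

```python
from collections import deque

def expand_probes(graph, seed, max_depth=2, decay=0.8, min_score=0.3):
    """Breadth-first expansion from `seed`.

    `graph` maps an entity to (relation, neighbor, relevance) edges. Returns a
    list of (path, hops, score), where path is a list of (head, relation, tail).
    """
    probes = []
    visited = {seed}
    queue = deque([(seed, [], 1.0)])
    while queue:
        node, path, score = queue.popleft()
        if len(path) == max_depth:
            continue  # depth budget exhausted for this branch
        for relation, neighbor, relevance in graph.get(node, []):
            new_score = score * decay * relevance  # assumed decay rule
            if neighbor in visited or new_score < min_score:
                continue  # relevance filtering and cycle avoidance
            visited.add(neighbor)
            new_path = path + [(node, relation, neighbor)]
            probes.append((new_path, len(new_path), new_score))
            queue.append((neighbor, new_path, new_score))
    return probes

def keep_answerable(probes, answered_correctly):
    """Retain only probes the pre-unlearned model answered correctly."""
    return [probe for probe in probes if answered_correctly(probe)]

toy_graph = {
    "Ada Lovelace": [("collaborated_with", "Charles Babbage", 0.9),
                     ("born_in", "London", 0.3)],
    "Charles Babbage": [("designed", "Analytical Engine", 0.9)],
    "Analytical Engine": [("programmed_with", "punched cards", 0.9)],
}
probes = expand_probes(toy_graph, "Ada Lovelace")
# Stub callback: a real pipeline would query the pre-unlearned model here.
retained = keep_answerable(probes, lambda probe: True)
```

Here the low-relevance edge is filtered out, and the third hop is cut by the depth limit, leaving one single-hop and one two-hop probe. Raising `max_depth` or lowering `min_score` produces harder and more numerous probes.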

A comprehensive suite of metrics is proposed: average forgetting (multi-hop accuracy loss post-unlearning), average retention (performance on unrelated or retained content), and a harmonic overall measure. Empirical results on LLaMA-3.1-8B indicate that methods showing near-complete forgetting on static benchmarks are penetrated by Mirage Probes, with 20–35% multi-hop recovery typical even for strong unlearning methods. Aliasing increases recovery further.
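The three metrics can be illustrated with a small calculation. This sketch assumes that the overall measure is the harmonic mean of forgetting and retention, and that forgetting equals one minus post-unlearning accuracy (which follows if every retained probe was answered correctly before unlearning). The outcome lists are made up.

```python
def accuracy(outcomes):
    """Fraction of correct answers; `outcomes` holds 1 (correct) or 0 (incorrect)."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)

def unlearning_scores(forget_after, retain_after):
    """Return (forgetting, retention, overall) for one unlearning method."""
    # Probes are kept only if the pre-unlearned model was correct, so the
    # pre-unlearning accuracy on the forget set is 1 by construction.
    forgetting = 1.0 - accuracy(forget_after)
    retention = accuracy(retain_after)
    if forgetting + retention == 0:
        return forgetting, retention, 0.0
    overall = 2 * forgetting * retention / (forgetting + retention)  # assumed harmonic mean
    return forgetting, retention, overall

# Hypothetical outcomes: 3 of 10 multi-hop forget probes still answered correctly.
forget_after = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
retain_after = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
forgetting, retention, overall = unlearning_scores(forget_after, retain_after)
```

A harmonic combination penalizes methods that achieve forgetting by degrading the whole model, since a low retention score pulls the overall measure down sharply.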

Activation analysis reveals that single-hop queries are resolved by dominant mid-layer pathways commonly targeted by unlearning updates, while multi-hop queries exploit alternative early-plus-late pathways left largely intact. This anatomical bypass mechanism explains the systematic brittleness of unlearning to compositional probing (Shah et al., 11 Mar 2026).

3. Mirage Probes for Visual Grounding in VLMs

Vision-language models (VLMs) frequently exhibit so-called mirage behavior: they answer visual questions confidently whether or not an image is provided, which creates illusory benchmark gains driven by ungrounded or spurious reasoning (Ben-Levi et al., 11 Jun 2026). The Mirage Probe methodology here comprises a contrastive, representation-level decoding framework.

Mirage behavior is defined operationally: for a given visual question, if the model's runs with and without the image yield the same answer and high cosine similarity, the instance is labeled mirage (surface lexical matches are controlled by paraphrased contrastive pairs). Linear and shallow-MLP probes are then fit to activations across network sites (residual, MLP, and attention outputs) to decode the mirage state. The crucial finding is that the mirage-vs-non-mirage signal is linearly decodable, consistent with its encoding along a single direction.
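The operational labeling rule can be written as a short predicate. This is a sketch under assumptions: the similarity threshold, the answer normalization, and the idea that the compared vectors are some pooled representation of each run are hypothetical choices, since the source states only that the answers must match and the cosine similarity must be high.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def is_mirage(answer_with, answer_without, rep_with, rep_without, threshold=0.9):
    """Label an instance mirage when removing the image changes neither the
    answer nor (to within the assumed threshold) the representation."""
    same_answer = answer_with.strip().lower() == answer_without.strip().lower()
    return same_answer and cosine(rep_with, rep_without) >= threshold

# Hypothetical examples: same answer with near-identical vs. orthogonal representations.
grounded_free = is_mirage("Pneumonia", "pneumonia ", [1.0, 0.0, 0.1], [1.0, 0.0, 0.0])
image_dependent = is_mirage("pneumonia", "pneumonia", [1.0, 0.0], [0.0, 1.0])
answer_changed = is_mirage("pneumonia", "normal", [1.0, 0.0], [1.0, 0.0])
```

Labels produced this way would then serve as targets for the linear or shallow-MLP probes fit on internal activations.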

A Prior Harnessing Index (PHI) further quantifies the extent to which the model’s predictions are determined by linguistic priors rather than visual content. Two regimes of mirage are distinguished:

Textual-bias mirages—model never builds a visual representation, instead answering from dataset priors and benchmark patterning.
Spurious-image mirages—model synthesizes a “hallucinated” latent visual representation and answers as if grounded.

Mitigation differs accordingly: text-distribution cleaning reduces textual-bias mirages by lowering PHI, but spurious-image mirages persist because they reside in the model’s visual representations. Empirically, linear probes (evaluated on VQA-RAD) and difference-of-activation probes both decode the mirage state, and these effects persist under contrastive, surface-form-controlled evaluation, which rules out surface lexical confounds (Ben-Levi et al., 11 Jun 2026).

4. Mirage Probes in Repository-Level Code Agent Benchmarking

Mirage-inspired probing frameworks have been adapted to the benchmarking and analysis of code agents—specifically, to assess authentic repository context reasoning rather than surface-level success. RepoMirage [Editor's term], implemented in RepoMirage-Perturb and RepoMirage-Extend, introduces semantics-preserving repository-level perturbations (proxy chains, runtime masking, constant externalization) and turns the corresponding structural bottlenecks into explicit subtasks.

Table: Perturbation-induced performance drop across models (Li et al., 25 May 2026):

Model	Resolved (original)	Resolved (perturbed)	Drop (%)
GPT-4.1	0.384	0.182	–52.6
GPT-5	0.650	0.490	–24.6
Qwen3.6-35B	0.678	0.534	–21.2
Average	—	—	≈27.0
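The drop column is consistent with a relative change in resolved rate, (perturbed − original) / original. The following sketch recomputes it from the three listed rows; the function name is illustrative.

```python
def relative_drop(resolved_original, resolved_perturbed):
    """Relative change in resolved rate, in percent (negative means a drop)."""
    return 100.0 * (resolved_perturbed - resolved_original) / resolved_original

# Resolved rates copied from the table above.
rows = {
    "GPT-4.1": (0.384, 0.182),
    "GPT-5": (0.650, 0.490),
    "Qwen3.6-35B": (0.678, 0.534),
}
drops = {model: round(relative_drop(orig, pert), 1) for model, (orig, pert) in rows.items()}
mean_listed = sum(drops.values()) / len(drops)
print(drops)
print(round(mean_listed, 1))
```

The recomputed per-model values match the table. The mean over the three listed rows is about −32.8%, which differs from the table's average row; the source does not state which models that row covers.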

Following perturbation, agents show a 20–50% collapse in patch success rates, despite a 2–6× increase in repository file access. Solve rates on directly posed recovery subtasks vary across state-of-the-art agents. Trajectory analysis identifies “exploration drift”: agents inspect more files but fail to integrate the information into effective structural edits.

A structure-first scaffolding workflow (RepoAnchor) that separates repository traversal from editing and explicitly summarizes discovered structure yields absolute gains on extension tasks. These results reveal that capability on end-to-end software engineering benchmarks does not imply robust context reasoning, and they point to structural modeling as a necessary research direction (Li et al., 25 May 2026).

5. Mirage Modes and Gaps in Condensed Matter Physics

The concept of a mirage probe extends to physical systems, particularly strongly interacting quantum matter:

Mirage Collective Modes in 2D Fermi Liquids

In two-dimensional Fermi liquids, mirage modes constitute a new class of collective excitation corresponding to poles of the dynamical susceptibility that reside on the unphysical Riemann sheet (Klein et al., 2019). Although they are not bona fide poles on the physical sheet, mirage modes create sharp, Lorentzian-like peaks in the spectral response that are indistinguishable from zero-sound resonances in frequency-domain spectroscopy. Time-domain (pump-probe) experiments are required to distinguish them, via a rapidly decaying pole term that eventually crosses over to a branch-cut continuum. The physical observability of mirage modes is dictated by the interaction strength, the accessible parameter regimes in high-mobility 2DEGs, and the experimentally feasible time resolution.

Mirage Gaps in Altermagnet–Superconductor Heterostructures

Proximity-coupled 2D altermagnets and s-wave superconductors exhibit gapless superconductivity with finite-energy mirage gaps (Wei et al., 2023). The Bogoliubov–de Gennes spectrum features both zero-energy nodal segments (gapless) and finite-energy gaps (mirage) pinned by the anisotropic exchange term. Tunneling spectroscopy reveals distinctive conductance features: collapse of the zero-bias Andreev plateau and re-emergence of a quantized plateau at finite mirage-gap voltages. Detecting these mirage gaps calls for sub-millivolt voltage sweeps of the kind available in low-temperature STM or planar junction experiments.

6. Implications and Best Practices

Mirage probes reveal a broad class of failure modes, degenerate capabilities, or covert behaviors that are invisible under standard evaluation protocols but present at the level of internal representation or dynamical response. Across application domains, key methodological principles include:

Model- or system-driven probe generation, with precise control over semantic complexity and surface variability.
Statistical and geometrical quantification of recoverable signals via low-dimensional or contrastive probes.
Fusion of multiple channels (e.g., temporal burst, intent/polarity signals) for robust monitoring.
Interpretation of negative results: suppression of the mirage signal via perturbation, defense, or adversarial attack is often structurally indistinguishable from destruction of the core function (e.g., encoding fidelity, knowledge composition).
Empirical evaluation must consider geometry-induced limitations: some systems (e.g., Phi-3.5 among LLMs) are inherently non-separable for linear probes, precluding reliable detection or reading of covert behaviors.

Mirage probes thus set new benchmarks for evaluation rigor, model interpretability, and the diagnosis of hidden computation, with implications well beyond their original target domains.