Robust Auto-Interpretability Metrics

Updated 3 July 2025
  • Auto-interpretability metrics are rigorous evaluation measures that define and quantify the quality of concept representations in machine learning models.
  • They assess robustness by applying controlled adversarial perturbations to analyze the stability of interpretations under minimal input changes.
  • Empirical results reveal that slight token modifications can significantly alter sparse autoencoder outputs, challenging model oversight and safety assumptions.

Auto-interpretability metrics are a rigorous class of evaluation measures designed to objectively quantify the quality, reliability, and real-world viability of concept representations, such as those produced by sparse autoencoders (SAEs) in LLMs. While traditional metrics focus on factors like sparsity, reconstruction fidelity, or surface-level alignment with human-meaningful concepts, recent research has underscored that robustness to input perturbations is a foundational requirement for faithful and actionable interpretability. Robustness metrics, formulated as specific input-space optimization problems, systematically probe how easily the concept features assigned by an interpretable representation can be altered or manipulated without materially changing the underlying model's output. This exposes the potentially illusory nature of current interpretability practices for model monitoring and oversight.

1. Robustness Evaluation: Framework and Formalization

The core principle motivating the robustness evaluation framework for auto-interpretability metrics is that the mapping from model inputs to human-interpretable concept representations should be stable with respect to small, plausible perturbations to the input. Formally, consider the composite mapping

$$z = f_{\text{SAE}}(f_{\text{LLM}}(x))$$

where $f_{\text{LLM}}$ maps an input token sequence $x \in \mathcal{X}$ to a model hidden state, and $f_{\text{SAE}}$ transforms this hidden state into a sparse conceptual representation $z \in \mathcal{Z}$. The robustness of SAEs is assessed according to the degree to which $z$ can be changed by adversarially chosen modifications to $x$, subject to explicit constraints.
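As a concrete illustration, the composite mapping can be sketched as follows, assuming a HuggingFace-style causal LM that returns hidden states and a hypothetical `sae.encode` method; the layer index and token position are illustrative choices, not prescribed here.

```python
import torch

def concept_representation(input_ids, llm, sae, layer=12):
    """Composite mapping z = f_SAE(f_LLM(x)) (illustrative sketch).

    `llm` is assumed to be a HuggingFace-style model that can return hidden
    states; `sae.encode` is a hypothetical encoder from hidden states to the
    sparse concept space Z.
    """
    with torch.no_grad():
        outputs = llm(input_ids, output_hidden_states=True)  # f_LLM: token ids -> hidden states
        hidden = outputs.hidden_states[layer]                 # shape (batch, seq_len, d_model)
        z = sae.encode(hidden[:, -1, :])                      # f_SAE at the final token position
    return z
```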

The evaluation leverages a bi-Lipschitz assumption on the unobservable ground-truth concept mapping $f_c: \mathcal{X} \rightarrow \mathcal{C}$, using input-level (token-based) perturbations as a practical surrogate for semantic variation:

$$L_1\, d_x(x_i, x_j) \leq d_c(f_c(x_i), f_c(x_j)) \leq L_2\, d_x(x_i, x_j)$$

with $d_x$ a string metric (typically Levenshtein distance) and $d_c$ a conceptual distance. This formulation ensures that robustness quantification is grounded in mathematically bounded and plausible input changes.
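A minimal sketch of the surrogate quantities, using token-level Levenshtein distance for $d_x$; the Euclidean choice for the concept-space distance is an assumption made only for illustration.

```python
import numpy as np

def levenshtein(a, b):
    """Edit distance between two token sequences (the string metric d_x)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # deletion
                        dp[j - 1] + 1,                  # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def shift_per_edit(x1, x2, z1, z2):
    """Concept shift per token edit: large values under small d_x flag fragility."""
    d_x = levenshtein(x1, x2)
    d_z = float(np.linalg.norm(np.asarray(z1) - np.asarray(z2)))
    return d_z / max(d_x, 1)
```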

2. Adversarial Perturbation Scenarios and Optimization Methods

The robustness metric is instantiated as a family of input-space optimization problems, targeting different adversarial scenarios vital for practical oversight:

  • Untargeted robustness: For a given input $x_1$, maximize the SAE concept divergence,

$$\max_{x_1'} d_z(z_1, z_1') \quad \text{s.t.} \quad d_x(x_1, x_1') \leq \epsilon_x$$

seeking to find the largest possible conceptual shift produced by minimal edits.

  • Targeted robustness: Given inputs $x_1, x_2$,

$$\min_{x_2} d_z(z_1, z_2) \quad \text{s.t.} \quad d_x(x_1, x_2) \geq \delta_x$$

where the adversary tries to bring disparate semantic inputs as close together as possible in the concept space (both objectives are sketched as code after this list).
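Both objectives can be written as simple loss terms, assuming Euclidean distance for $d_z$ (an illustrative choice); the edit-distance constraints are enforced by the discrete search procedure rather than by the loss itself.

```python
import torch

def untargeted_loss(z1, z1_adv):
    # Untargeted: maximize d_z(z1, z1'). Returning the negated distance lets a
    # minimizer push the perturbed concepts as far from the original as possible.
    return -torch.linalg.vector_norm(z1 - z1_adv)

def targeted_loss(z1, z2):
    # Targeted: minimize d_z(z1, z2) while the search keeps d_x(x1, x2) >= delta_x,
    # i.e. semantically distant inputs are pulled together in concept space.
    return torch.linalg.vector_norm(z1 - z2)
```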

Evaluation is performed along three binary axes (semantic goal: untargeted/targeted; activation goal: population/individual; perturbation mode: suffix/replacement), yielding eight concrete scenarios. Each scenario is characterized by a definite objective and is approached using a discrete optimization method based on GCG (Greedy Coordinate Gradient) to efficiently construct sequences of token edits.
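The discrete search can be approximated by a simple greedy coordinate loop over single-token edits; the sketch below samples random candidate substitutions instead of using GCG's gradient-guided candidate ranking, so it is a simplified stand-in rather than the exact procedure.

```python
import random

def greedy_token_attack(tokens, loss_fn, vocab, budget=3, candidates=64):
    """Greedy coordinate search over single-token substitutions.

    `loss_fn(tokens) -> float` is assumed to run the full LLM+SAE pipeline and
    return one of the objectives above; `budget` caps the number of token edits
    (the epsilon_x constraint in the untargeted scenario).
    """
    best = list(tokens)
    best_loss = loss_fn(best)
    for _ in range(budget):                   # at most `budget` token edits
        improved = False
        for _ in range(candidates):           # sample candidate single-token edits
            pos = random.randrange(len(best))
            cand = list(best)
            cand[pos] = random.choice(vocab)
            cand_loss = loss_fn(cand)
            if cand_loss < best_loss:
                best, best_loss, improved = cand, cand_loss, True
        if not improved:
            break                             # no single edit improves the objective
    return best, best_loss
```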

3. Empirical Analysis: Fragility and Limitations of SAE Interpretability

Systematic empirical evaluation reveals that, across all tested scenarios, tiny adversarial perturbations can substantially manipulate SAE concept representations without notable changes to the base LLM's output, as judged by LLM evaluators such as GPT-4.1 (agreement ≥95%).

  • Population-level attacks: Typically reduce overlap between original and attacked top-k concept units by >80% (untargeted) or align non-overlapping patterns by >70% (targeted) with just 1–3 token edits (an illustrative overlap computation follows this list).
  • Individual-level attacks: Achieve activation/deactivation of target units with success rates >90%, even for human-interpretable or safety-critical units.
  • Generalizability: The phenomenon holds across model layers and activation functions, and persists under transfer across LLM/SAE pairs.
  • Interpretability “contract” violation: SAEs may present apparently meaningful and stable concept assignments that, in reality, are arbitrarily alterable, breaking the coupling between interpretation and true model behavior.
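For reference, the population-level overlap statistic reported above can be computed, in one illustrative form, as the fraction of shared top-k concept indices before and after the attack (the paper's exact definition may differ).

```python
import numpy as np

def topk_overlap(z_orig, z_attacked, k=20):
    """Fraction of top-k (by activation magnitude) concept units shared
    between the original and attacked representations."""
    top_orig = set(np.argsort(-np.abs(z_orig))[:k])
    top_attacked = set(np.argsort(-np.abs(z_attacked))[:k])
    return len(top_orig & top_attacked) / k
```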

Practical implication: SAE-based model oversight and monitoring may be easily fooled or bypassed, raising concerns for downstream reliability and safety.

4. The Role of Robustness in Auto-Interpretability Metrics

The findings underscore robustness as a necessary foundation for trustworthy auto-interpretability. Conventional metrics—such as sparsity/reconstruction tradeoff, human probe alignment, and feature disentanglement—are static, assuming unperturbed data, and fail to capture the persistence of interpretations under plausible adversarial or accidental changes. As a result:

  • Metrics must explicitly penalize non-robust concept assignments.
  • High variance or easily manipulated concept labeling under bounded perturbations should downweight auto-interpretability scores (one illustrative downweighting rule is sketched after this list).
  • Traditional high performance on static interpretability benchmarks may mask severe vulnerabilities.
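One way this downweighting could be operationalized, purely as an illustrative sketch (the scoring rule below is an assumption, not a proposal from the paper): scale a static interpretability score by the worst-case normalized concept shift observed under bounded perturbations.

```python
def robust_interpretability_score(static_score, worst_case_shift, max_shift=1.0):
    """Downweight a static auto-interpretability score by the worst-case
    normalized concept shift found under bounded input perturbations.

    A feature set whose concepts can be fully manipulated within the edit
    budget (worst_case_shift >= max_shift) receives a robust score of zero.
    """
    penalty = min(worst_case_shift / max_shift, 1.0)
    return static_score * (1.0 - penalty)
```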

The paper recommends incorporating adversarial robustness into future evaluation and competition protocols, and designing new training objectives for SAEs and dictionary learning that regularize for robustness as a core principle.
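A robustness-regularized training objective could look like the following sketch: a standard reconstruction-plus-L1 SAE loss with an added penalty on concept drift between clean and slightly perturbed inputs. The specific form, the perturbation sampler, and the weights are assumptions for illustration, not the paper's proposal.

```python
import torch
import torch.nn.functional as F

def robust_sae_loss(sae, h_clean, h_perturbed, l1_weight=1e-3, robust_weight=1.0):
    """Reconstruction + sparsity + robustness penalty for SAE training.

    `h_clean` are hidden states for clean inputs and `h_perturbed` for inputs
    differing by a small number of token edits (hypothetical sampler upstream);
    `sae.encode`/`sae.decode` are assumed encoder/decoder methods.
    """
    z_clean = sae.encode(h_clean)
    z_pert = sae.encode(h_perturbed)
    recon = F.mse_loss(sae.decode(z_clean), h_clean)   # reconstruction term
    sparsity = z_clean.abs().mean()                    # L1 sparsity term
    drift = (z_clean - z_pert).pow(2).mean()           # concept-drift (robustness) penalty
    return recon + l1_weight * sparsity + robust_weight * drift
```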

5. Comparison with Classic Interpretability Metrics

Existing metrics are critically challenged by these results:

| Metric | Scope | Limitation (as revealed) | Robustness Perspective |
|---|---|---|---|
| Reconstruction-Sparsity | Information preservation, sparsity | Unstable features may reconstruct well yet be brittle | Robustness checks persistence of information |
| Human Interpretability (probe) | Probe or annotation correspondence | Can be manipulated adversarially, misleading in practice | Robustness checks faithfulness of labels |
| Feature Disentanglement | Latent independence/separation | Easy to adversarially disrupt, swap, or obfuscate | Robustness checks real-world faithfulness |

Robustness is thus an orthogonal and indispensable complement to classical interpretability metrics.

6. Recommendations and Implications for Metric Design

  • Incorporate robust, adversarial evaluations into all SAE and concept-level interpretability analysis pipelines.
  • Combine static and robustness metrics for thorough assessment, including both worst-case and expected changes under permissible input perturbations (see the sketch after this list).
  • Develop and adopt robustness-regularized objectives in future architecture and training designs for interpretable autoencoders and feature selectors.
  • Benchmark against adversarial as well as null and unperturbed settings to avoid overestimating trustworthiness.
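A minimal sketch of such a combined report, returning both the worst-case and the expected concept shift over sampled perturbations within the edit budget (the perturbation sampler, concept extractor, and distance are placeholders).

```python
import numpy as np

def robustness_report(x, sample_perturbation, concept_fn, n_samples=100):
    """Worst-case and expected concept shift under permissible perturbations.

    `sample_perturbation(x)` returns an input within the allowed edit budget;
    `concept_fn(x)` returns the SAE concept vector for input x.
    """
    z = np.asarray(concept_fn(x))
    shifts = [float(np.linalg.norm(z - np.asarray(concept_fn(sample_perturbation(x)))))
              for _ in range(n_samples)]
    return {"worst_case": max(shifts), "expected": float(np.mean(shifts))}
```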

A plausible implication is that only by passing both classic and robustness-oriented benchmarks can a method’s interpretability claims be considered suitable for high-stakes oversight or safety tasks.


| Metric Type | Evaluates | Robustness Limitation Found | Complementarity of Robustness |
|---|---|---|---|
| Reconstruction-Sparsity | Information/sparsity trade-off | Features easily flipped or disguised | Robustness: persistent assignment |
| Probe/Human Alignment | Label assignment | Forged under adversarial input | Robustness: faithful labeling |
| Disentanglement/Modularity | Feature separation | Can be swapped or entangled maliciously | Robustness: real-world separation |

In summary, robustness of concept representations under input perturbation must be considered a core desideratum—on equal footing with sparsity, human alignment, and disentanglement—when designing and applying auto-interpretability metrics. Without robustness, any purported interpretability may be an “illusion,” potentially creating misleading assurances in model oversight and safety-critical deployments.