Robust Auto-Interpretability Metrics

Updated 3 July 2025
  • Auto-interpretability metrics are rigorous evaluation measures that define and quantify the quality of concept representations in machine learning models.
  • They assess robustness by applying controlled adversarial perturbations to analyze the stability of interpretations under minimal input changes.
  • Empirical results reveal that slight token modifications can significantly alter sparse autoencoder outputs, challenging model oversight and safety assumptions.

Auto-interpretability metrics are a rigorous class of evaluation measures designed to objectively quantify the quality, reliability, and real-world viability of concept representations, such as those produced by sparse autoencoders (SAEs) in LLMs. While traditional metrics focus on factors like sparsity, reconstruction fidelity, or surface-level alignment with human-meaningful concepts, recent research has underscored that robustness to input perturbations is a foundational requirement for faithful and actionable interpretability. Robustness metrics, formulated as specific input-space optimization problems, systematically probe how easily the concept features assigned by an interpretable representation can be altered or manipulated without materially changing the underlying model's output. This exposes the potentially illusory nature of current interpretability practices for model monitoring and oversight.

1. Robustness Evaluation: Framework and Formalization

The core principle motivating the robustness evaluation framework for auto-interpretability metrics is that the mapping from model inputs to human-interpretable concept representations should be stable with respect to small, plausible perturbations to the input. Formally, consider the composite mapping

$$z = f_{\text{SAE}}(f_{\text{LLM}}(x))$$

where $f_{\text{LLM}}$ maps an input token sequence $x \in \mathcal{X}$ to a model hidden state, and $f_{\text{SAE}}$ transforms this hidden state into a sparse conceptual representation $z \in \mathcal{Z}$. The robustness of SAEs is assessed according to the degree to which $z$ can be changed by adversarially chosen modifications to $x$, subject to explicit constraints.
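As a concrete illustration, the composite mapping can be sketched as follows, assuming a HuggingFace-style causal LM that returns hidden states and a hypothetical `sae.encode` method; the layer index and token position are illustrative choices, not prescribed here.

```python
import torch

def concept_representation(input_ids, llm, sae, layer=12):
    """Composite mapping z = f_SAE(f_LLM(x)) (illustrative sketch).

    `llm` is assumed to be a HuggingFace-style model that can return hidden
    states; `sae.encode` is a hypothetical encoder from hidden states to the
    sparse concept space Z.
    """
    with torch.no_grad():
        outputs = llm(input_ids, output_hidden_states=True)  # f_LLM: token ids -> hidden states
        hidden = outputs.hidden_states[layer]                 # shape (batch, seq_len, d_model)
        z = sae.encode(hidden[:, -1, :])                      # f_SAE at the final token position
    return z
```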

The evaluation leverages a bi-Lipschitz assumption on the unobservable ground-truth concept mapping $f_c: \mathcal{X} \rightarrow \mathcal{C}$, using input-level (token-based) perturbations as a practical surrogate for semantic variation:

$$L_1\, d_x(x_i, x_j) \leq d_c(f_c(x_i), f_c(x_j)) \leq L_2\, d_x(x_i, x_j)$$

with $d_x$ a string metric (typically Levenshtein distance) and $d_c$ a conceptual distance. This formulation ensures that robustness quantification is grounded in mathematically bounded and plausible input changes.
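A minimal sketch of the surrogate quantities, using token-level Levenshtein distance for $d_x$; the Euclidean choice for the concept-space distance is an assumption made only for illustration.

```python
import numpy as np

def levenshtein(a, b):
    """Edit distance between two token sequences (the string metric d_x)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # deletion
                        dp[j - 1] + 1,                  # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def shift_per_edit(x1, x2, z1, z2):
    """Concept shift per token edit: large values under small d_x flag fragility."""
    d_x = levenshtein(x1, x2)
    d_z = float(np.linalg.norm(np.asarray(z1) - np.asarray(z2)))
    return d_z / max(d_x, 1)
```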

2. Adversarial Perturbation Scenarios and Optimization Methods

The robustness metric is instantiated as a family of input-space optimization problems, targeting different adversarial scenarios vital for practical oversight:

  • Untargeted robustness: For a given input $x_1$, maximize the SAE concept divergence,

$$\max_{x_1'} d_z(z_1, z_1') \quad \text{s.t.} \quad d_x(x_1, x_1') \leq \epsilon_x$$

seeking to find the largest possible conceptual shift produced by minimal edits.

  • Targeted robustness: Given inputs $x_1, x_2$,

$$\min_{x_2} d_z(z_1, z_2) \quad \text{s.t.} \quad d_x(x_1, x_2) \geq \delta_x$$

where the adversary tries to bring disparate semantic inputs as close together as possible in the concept space (both objectives are sketched as code after this list).
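Both objectives can be written as simple loss terms, assuming Euclidean distance for $d_z$ (an illustrative choice); the edit-distance constraints are enforced by the discrete search procedure rather than by the loss itself.

```python
import torch

def untargeted_loss(z1, z1_adv):
    # Untargeted: maximize d_z(z1, z1'). Returning the negated distance lets a
    # minimizer push the perturbed concepts as far from the original as possible.
    return -torch.linalg.vector_norm(z1 - z1_adv)

def targeted_loss(z1, z2):
    # Targeted: minimize d_z(z1, z2) while the search keeps d_x(x1, x2) >= delta_x,
    # i.e. semantically distant inputs are pulled together in concept space.
    return torch.linalg.vector_norm(z1 - z2)
```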

Evaluation is performed along three binary axes (semantic goal: untargeted/targeted; activation goal: population/individual; perturbation mode: suffix/replacement), yielding eight concrete scenarios. Each scenario is characterized by a definite objective and is approached using a discrete optimization method based on GCG (Greedy Coordinate Gradient) to efficiently construct sequences of token edits.
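The discrete search can be approximated by a simple greedy coordinate loop over single-token edits; the sketch below samples random candidate substitutions instead of using GCG's gradient-guided candidate ranking, so it is a simplified stand-in rather than the exact procedure.

```python
import random

def greedy_token_attack(tokens, loss_fn, vocab, budget=3, candidates=64):
    """Greedy coordinate search over single-token substitutions.

    `loss_fn(tokens) -> float` is assumed to run the full LLM+SAE pipeline and
    return one of the objectives above; `budget` caps the number of token edits
    (the epsilon_x constraint in the untargeted scenario).
    """
    best = list(tokens)
    best_loss = loss_fn(best)
    for _ in range(budget):                   # at most `budget` token edits
        improved = False
        for _ in range(candidates):           # sample candidate single-token edits
            pos = random.randrange(len(best))
            cand = list(best)
            cand[pos] = random.choice(vocab)
            cand_loss = loss_fn(cand)
            if cand_loss < best_loss:
                best, best_loss, improved = cand, cand_loss, True
        if not improved:
            break                             # no single edit improves the objective
    return best, best_loss
```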

3. Empirical Analysis: Fragility and Limitations of SAE Interpretability

Systematic empirical evaluation reveals that, across all tested scenarios, tiny adversarial perturbations can substantially manipulate SAE concept representations without notable changes to the base LLM's output, as judged by LLM evaluators such as GPT-4.1 (agreement ≥95%).

  • Population-level attacks: Typically reduce overlap between original and attacked top-k concept units by >80% (untargeted) or align non-overlapping patterns by >70% (targeted) with just 1–3 token edits (an illustrative overlap computation follows this list).
  • Individual-level attacks: Achieve activation/deactivation of target units with success rates >90%, even for human-interpretable or safety-critical units.
  • Generalizability: The phenomenon holds across model layers and activation functions, and persists under transfer across LLM/SAE pairs.
  • Interpretability “contract” violation: SAEs may present apparently meaningful and stable concept assignments that, in reality, are arbitrarily alterable, breaking the coupling between interpretation and true model behavior.
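For reference, the population-level overlap statistic reported above can be computed, in one illustrative form, as the fraction of shared top-k concept indices before and after the attack (the paper's exact definition may differ).

```python
import numpy as np

def topk_overlap(z_orig, z_attacked, k=20):
    """Fraction of top-k (by activation magnitude) concept units shared
    between the original and attacked representations."""
    top_orig = set(np.argsort(-np.abs(z_orig))[:k])
    top_attacked = set(np.argsort(-np.abs(z_attacked))[:k])
    return len(top_orig & top_attacked) / k
```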

Practical implication: SAE-based model oversight and monitoring may be easily fooled or bypassed, raising concerns for downstream reliability and safety.

4. The Role of Robustness in Auto-Interpretability Metrics

The findings underscore robustness as a necessary foundation for trustworthy auto-interpretability. Conventional metrics—such as sparsity/reconstruction tradeoff, human probe alignment, and feature disentanglement—are static, assuming unperturbed data, and fail to capture the persistence of interpretations under plausible adversarial or accidental changes. As a result:

  • Metrics must explicitly penalize non-robust concept assignments.
  • High variance or easily manipulated concept labeling under bounded perturbations should downweight auto-interpretability scores (one illustrative downweighting rule is sketched after this list).
  • Traditional high performance on static interpretability benchmarks may mask severe vulnerabilities.
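One way this downweighting could be operationalized, purely as an illustrative sketch (the scoring rule below is an assumption, not a proposal from the paper): scale a static interpretability score by the worst-case normalized concept shift observed under bounded perturbations.

```python
def robust_interpretability_score(static_score, worst_case_shift, max_shift=1.0):
    """Downweight a static auto-interpretability score by the worst-case
    normalized concept shift found under bounded input perturbations.

    A feature set whose concepts can be fully manipulated within the edit
    budget (worst_case_shift >= max_shift) receives a robust score of zero.
    """
    penalty = min(worst_case_shift / max_shift, 1.0)
    return static_score * (1.0 - penalty)
```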

The paper recommends incorporating adversarial robustness into future evaluation and competition protocols, and designing new training objectives for SAEs and dictionary learning that regularize for robustness as a core principle.
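A robustness-regularized training objective could look like the following sketch: a standard reconstruction-plus-L1 SAE loss with an added penalty on concept drift between clean and slightly perturbed inputs. The specific form, the perturbation sampler, and the weights are assumptions for illustration, not the paper's proposal.

```python
import torch
import torch.nn.functional as F

def robust_sae_loss(sae, h_clean, h_perturbed, l1_weight=1e-3, robust_weight=1.0):
    """Reconstruction + sparsity + robustness penalty for SAE training.

    `h_clean` are hidden states for clean inputs and `h_perturbed` for inputs
    differing by a small number of token edits (hypothetical sampler upstream);
    `sae.encode`/`sae.decode` are assumed encoder/decoder methods.
    """
    z_clean = sae.encode(h_clean)
    z_pert = sae.encode(h_perturbed)
    recon = F.mse_loss(sae.decode(z_clean), h_clean)   # reconstruction term
    sparsity = z_clean.abs().mean()                    # L1 sparsity term
    drift = (z_clean - z_pert).pow(2).mean()           # concept-drift (robustness) penalty
    return recon + l1_weight * sparsity + robust_weight * drift
```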

5. Comparison with Classic Interpretability Metrics

Existing metrics are critically challenged by these results:

| Metric | Scope | Limitation (as revealed) | Robustness Perspective |
|---|---|---|---|
| Reconstruction-Sparsity | Information preservation, sparsity | Unstable features may reconstruct well yet be brittle | Robustness checks persistence of information |
| Human Interpretability (probe) | Probe or annotation correspondence | Can be manipulated adversarially, misleading in practice | Robustness checks faithfulness of labels |
| Feature Disentanglement | Latent independence/separation | Easy to adversarially disrupt, swap, or obfuscate | Robustness checks real-world faithfulness |

Robustness is thus an orthogonal and indispensable complement to classical interpretability metrics.

6. Recommendations and Implications for Metric Design

  • Incorporate robust, adversarial evaluations into all SAE and concept-level interpretability analysis pipelines.
  • Combine static and robustness metrics for thorough assessment, including both worst-case and expected changes under permissible input perturbations (see the sketch after this list).
  • Develop and adopt robustness-regularized objectives in future architecture and training designs for interpretable autoencoders and feature selectors.
  • Benchmark against adversarial as well as null and unperturbed settings to avoid overestimating trustworthiness.
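A minimal sketch of such a combined report, returning both the worst-case and the expected concept shift over sampled perturbations within the edit budget (the perturbation sampler, concept extractor, and distance are placeholders).

```python
import numpy as np

def robustness_report(x, sample_perturbation, concept_fn, n_samples=100):
    """Worst-case and expected concept shift under permissible perturbations.

    `sample_perturbation(x)` returns an input within the allowed edit budget;
    `concept_fn(x)` returns the SAE concept vector for input x.
    """
    z = np.asarray(concept_fn(x))
    shifts = [float(np.linalg.norm(z - np.asarray(concept_fn(sample_perturbation(x)))))
              for _ in range(n_samples)]
    return {"worst_case": max(shifts), "expected": float(np.mean(shifts))}
```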

A plausible implication is that only by passing both classic and robustness-oriented benchmarks can a method’s interpretability claims be considered suitable for high-stakes oversight or safety tasks.


| Metric Type | Evaluates | Robustness Limitation Found | Complementarity of Robustness |
|---|---|---|---|
| Reconstruction-Sparsity | Information/sparsity trade-off | Features easily flipped or disguised | Robustness: persistent assignment |
| Probe/Human Alignment | Label assignment | Forged under adversarial input | Robustness: faithful labeling |
| Disentanglement/Modularity | Feature separation | Can be swapped or entangled maliciously | Robustness: real-world separation |

In summary, robustness of concept representations under input perturbation must be considered a core desideratum—on equal footing with sparsity, human alignment, and disentanglement—when designing and applying auto-interpretability metrics. Without robustness, any purported interpretability may be an “illusion,” potentially creating misleading assurances in model oversight and safety-critical deployments.