Because we have LLMs, we Can and Should Pursue Agentic Interpretability (2506.12152v1)
Abstract: The era of LLMs presents a new opportunity for interpretability--agentic interpretability: a multi-turn conversation with an LLM wherein the LLM proactively assists human understanding by developing and leveraging a mental model of the user, which in turn enables humans to develop better mental models of the LLM. Such conversation is a new capability that traditional 'inspective' interpretability methods (opening the black-box) do not use. Having an LLM that aims to teach and explain--beyond just knowing how to talk--is similar to a teacher whose goal is to teach well, understanding that their success will be measured by the student's comprehension. While agentic interpretability may trade off completeness for interactivity, making it less suitable for high-stakes safety situations with potentially deceptive models, it leverages a cooperative model to discover potentially superhuman concepts that can improve humans' mental model of machines. Agentic interpretability introduces challenges, particularly in evaluation, due to what we call its 'human-entangled-in-the-loop' nature (human responses are an integral part of the algorithm), making design and evaluation difficult. We discuss possible solutions and proxy goals. As LLMs approach human parity in many tasks, agentic interpretability's promise is to help humans learn the potentially superhuman concepts of the LLMs, rather than see us fall increasingly far from understanding them.
Summary
- The paper introduces agentic interpretability, empowering LLMs to proactively explain their behaviors and build mutual mental models in interactive dialogues.
- It demonstrates a multi-turn, conversational method where explanations adapt to user knowledge, enhancing model training and the learning of superhuman concepts.
- The approach contrasts with static inspective methods by emphasizing dynamic human-AI collaboration and highlighting challenges in reproducibility and computational efficiency.
This paper, "Because we have LLMs, we Can and Should Pursue Agentic Interpretability" (2506.12152), argues for a new approach to understanding LLMs called agentic interpretability. This method leverages the LLM itself as a proactive, conversational partner to help humans build better mental models of the LLM's behavior and knowledge. In turn, the LLM develops a mental model of the user to tailor its explanations effectively, creating a dynamic similar to a teacher-student relationship.
Core Concepts of Agentic Interpretability
Agentic interpretability is defined by three core components:
- Proactive Assistance: The LLM takes initiative in explaining itself, going beyond answering direct questions. It might offer unsolicited clarifications, suggest areas for exploration, or adapt its explanatory strategy based on its (inferred) understanding of the user's needs and knowledge gaps.
- Multi-Turn Interaction: Understanding is developed through an extended dialogue, allowing for iterative refinement, clarification, and deeper exploration of complex topics. This is crucial for opaque model behaviors that might involve counter-intuitive or even superhuman knowledge.
- Mutual Mental Model:
- Machine's model of human: The LLM develops and maintains a representation (implicit or explicit) of the user's current knowledge, understanding, and potential misconceptions. This allows the LLM to tailor explanations effectively, much like a teacher adapts to a student's level. For example, an LLM might remember a user previously stated unfamiliarity with a concept and adjust its explanation accordingly.
- Human's model of machine: The process also helps humans build a more accurate and nuanced mental model of the LLM's capabilities, limitations, and reasoning processes. This is seen as valuable in itself, helping humans "keep up" with increasingly complex AI.
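As a concrete (if simplified) picture of these components, below is a minimal sketch of an agentic-interpretability loop. The `query_llm` helper is a hypothetical stand-in for any chat-completion API, and the explicit `user_model` string plays the role of the machine's model of the human, updated each turn; none of this is code from the paper.

```python
# Minimal sketch of an agentic-interpretability loop. `query_llm` is a
# hypothetical stand-in for any chat-completion API; nothing here is an
# API described in the paper.

def query_llm(system: str, messages: list[dict]) -> str:
    """Placeholder for a chat-completion call (system prompt + message history)."""
    raise NotImplementedError

def agentic_dialogue(user_turns: list[str]):
    user_model = "No information about the user yet."  # machine's model of the human
    history: list[dict] = []
    for user_msg in user_turns:
        history.append({"role": "user", "content": user_msg})
        # 1. Update the explicit model of the user from the conversation so far.
        user_model = query_llm(
            "Summarize what this user appears to know, misunderstand, and want "
            f"to learn about the model's behavior. Previous notes: {user_model}",
            history,
        )
        # 2. Proactively explain, with depth and vocabulary tailored to that model.
        reply = query_llm(
            "You are explaining your own behavior. Adapt your explanation to this "
            f"picture of the user, and volunteer clarifications they may need: {user_model}",
            history,
        )
        history.append({"role": "assistant", "content": reply})
        yield reply, user_model  # the human, in turn, refines their model of the machine
```

In practice the machine's model of the user might remain implicit in the conversation context; the sketch makes it explicit only to mirror the mutual-mental-model framing.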
The paper contrasts agentic interpretability with traditional "inspective" methods (e.g., mechanistic interpretability, feature attribution) which focus on opening the black box but don't typically involve the model as an active, conversational participant in its own explanation. While inspective methods are vital for high-stakes safety scenarios, especially with potentially deceptive models, agentic interpretability is proposed as particularly useful for integrating AI into society and for humans to learn potentially superhuman concepts from AI.
Practical Applications and Examples
The paper outlines several hypothetical scenarios where agentic interpretability could be applied:
- Model Trainer Model: Current model development involves developers building mental models of LLMs through trial and error. Agentic interpretability could transform this by creating a "meta-model" trained on a project's development history (code, experiments, discussions). Researchers could then converse with this meta-model, which would proactively suggest hypotheses, identify patterns, or guide debugging, based on its understanding of the developers' goals and knowledge gaps.
- Implementation Idea: Fine-tune an LLM on project-specific data (code commits, bug reports, Slack discussions, experimental logs). The interface would let developers query the model about past decisions and unexpected results, or ask for suggestions for new experiments. The model would try to infer the developer's current understanding and tailor its advice (see the first sketch after this list).
- Teaching Superhuman Knowledge: Drawing on Vygotsky's Zone of Proximal Development (ZPD), an agentic LLM could identify what a user is ready to learn and provide tailored guidance.
- Example: An LLM-based AlphaZero could teach superhuman chess concepts by engaging a player in a Socratic dialogue. It would discern the player's ZPD (e.g., understands sacrifices but struggles with complex positional play) and introduce new "superhuman chess concepts" (perhaps even coining new terms like `super_chess_37`, as explored in related work [hewitt2025cantunderstandaiusing]) through targeted puzzles and incremental explanations. The user can ask clarifying questions, guiding their learning. (A toy sketch of this ZPD-guided selection appears after this list.)
- Agentic Mechanistic Interpretability: Even with potentially deceptive models, agentic interpretability could enhance mechanistic interpretability. Researchers could perform "open-model surgery" (ablating connections, amplifying activations) while conversing with the model. Guided by its understanding of the researchers' goal (to understand a component's function), the model would explain the resulting changes (see the ablation sketch after this list).
- Benefit with Deceptive Models: If a model is deceptive, it must reconcile its explanations with the observable internal changes, which creates opportunities for inconsistencies to emerge. The cognitive load of maintaining a facade during such interactive probing could lead a deceptive model to reveal its true nature.
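The "Implementation Idea" for the model-trainer scenario lends itself to a concrete starting point. Below is a minimal sketch of assembling a project-history corpus for such a meta-model, assuming artifacts live as local text files; the paths, file types, and record fields are illustrative, not taken from the paper.

```python
# Sketch of assembling a project-history corpus for a "meta-model", assuming
# artifacts (code, logs, discussion notes) live as local text files. Paths,
# extensions, and record fields are illustrative, not from the paper.
import json
from pathlib import Path

def build_metamodel_corpus(project_dir: str, out_path: str) -> int:
    """Collect project artifacts into prompt-style records usable for
    fine-tuning or as a retrieval index."""
    records = []
    for path in Path(project_dir).rglob("*"):
        if path.suffix not in {".py", ".md", ".log", ".txt"}:
            continue
        text = path.read_text(errors="ignore")
        records.append({
            "source": str(path),
            "kind": "code" if path.suffix == ".py" else "notes",
            "text": text[:4000],  # crude truncation; real chunking would be smarter
        })
    with open(out_path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return len(records)

# Example: build_metamodel_corpus("my_project/", "metamodel_corpus.jsonl")
```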
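For the ZPD-based teaching example, the core mechanism is deciding what the learner is ready for next. The toy sketch below picks the first unmastered concept whose prerequisites are mastered; the concept names (including `super_chess_37`) and the mastery threshold are placeholders. In the dialogue setting, the mastery estimates themselves would come from the model's quizzes and the learner's clarifying questions.

```python
# Toy sketch of ZPD-guided concept selection: teach the first unmastered concept
# whose prerequisites are mastered. Concept names (including "super_chess_37")
# and the mastery threshold are illustrative placeholders.

CONCEPTS = ["basic_sacrifice", "positional_squeeze", "super_chess_37"]
PREREQS = {
    "positional_squeeze": ["basic_sacrifice"],
    "super_chess_37": ["positional_squeeze"],
}

def next_concept(mastery: dict[str, float], threshold: float = 0.8) -> str | None:
    """Crude proxy for the learner's zone of proximal development: the first
    concept not yet mastered whose prerequisites are."""
    for concept in CONCEPTS:
        if mastery.get(concept, 0.0) >= threshold:
            continue  # already mastered
        if all(mastery.get(p, 0.0) >= threshold for p in PREREQS.get(concept, [])):
            return concept
    return None  # nothing teachable right now (or everything mastered)

# A learner who understands sacrifices but struggles with positional play:
print(next_concept({"basic_sacrifice": 0.9, "positional_squeeze": 0.4}))
# -> "positional_squeeze"
```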
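For "open-model surgery", one illustrative setup is to ablate a single attention block and hand the before/after behavior back into the conversation. The sketch below assumes a GPT-2-style Hugging Face model and zeroes one attention module's output via a forward hook; it is one way such an experiment could be wired up, not the paper's protocol.

```python
# Sketch of one form of "open-model surgery": zero out a single attention
# block's output and compare behavior before and after. Assumes the GPT-2
# module layout in Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def zero_attn_output(module, inputs, output):
    # The attention module may return a tensor or a tuple whose first element
    # is the attention output; ablate only that contribution.
    if isinstance(output, tuple):
        return (torch.zeros_like(output[0]),) + output[1:]
    return torch.zeros_like(output)

def generate(prompt: str) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=20, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    return tok.decode(out[0], skip_special_tokens=True)

prompt = "The capital of France is"
baseline = generate(prompt)

handle = model.transformer.h[5].attn.register_forward_hook(zero_attn_output)
ablated = generate(prompt)
handle.remove()

# In the agentic setting, the (baseline, ablated) pair would be fed back into
# the conversation: "Layer 5 attention was ablated; here is how your answer
# changed -- explain what that block was contributing."
print(baseline, "\n---\n", ablated)
```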
Challenges and Trade-offs
The paper acknowledges significant challenges:
- Human-Entangled-in-the-Loop Evaluation: Human responses are integral to the interpretability algorithm itself, not just feedback. This "human-entangled-in-the-loop" nature makes reproducibility, controlled comparisons, and isolating variables very difficult.
- Potential Solution: Use LLMs as proxies for human users in early-stage development and evaluation, where appropriate, to allow for faster, less expensive iteration.
- High Variance: Users have diverse backgrounds, needs, and preferred learning styles. LLMs also exhibit variability in their responses. This leads to a vast space of possible conversational trajectories, making it hard to design methods that cover all possibilities, especially when dealing with complex or superhuman knowledge.
- Inefficiency for Completeness: Agentic interpretability, with its focus on interaction, may not be the most efficient way to achieve a "complete" explanation of every model behavior; that goal may be better served by exhaustive inspective methods such as circuit finding.
- Difficulty in Hill-Climbing for Computational Efficiency: Unlike some interpretability methods that can be optimized computationally once verified (e.g., making influence functions faster), the human-interactive nature makes it hard to use purely computational metrics for improvement without human interaction.
Evaluation Strategies
Evaluating agentic interpretability is complex because directly measuring mental models is impossible. The paper proposes focusing on proxy goals and end-task metrics:
Two main cases for evaluation:
- Case Improve: Make machines do what we want.
- Goal: The user wants to modify the LLM's behavior to better align with a human-defined concept (e.g., make outputs funnier, more factually correct for a specific domain).
- Process: The agentic LLM provides insights (e.g., "my understanding of this code base is poor; add documentation to my system prompt") that lead to a modified model M′.
- Evaluation: Measure whether M′ scores better on the human-defined concept than M, i.e., whether f(x, M′(x)) improves over f(x, M(x)), subject to constraints on how much M′ can differ from M (a sketch follows the notation below).
- Example: An LLM helps a developer understand why it generates suboptimal code for a specific task. The developer, through conversation, learns to adjust the system prompt or fine-tuning data, resulting in improved code generation for that task. The improvement in code quality is the metric.
- Case Learn: Learn about machine concepts (potentially superhuman).
- Goal: The user wants to understand a concept internal to or defined by the machine (e.g., why the model refuses certain requests, what characterizes a "high-quality" response from the model's perspective, or a superhuman chess strategy).
- Process: The LLM teaches the human about this machine concept through interactive dialogue, examples, and quizzes.
- Evaluation:
- Simulatability: Test the human's ability to predict the machine's concept for new examples. For instance, after discussing what the LLM considers a "good" response, can the human accurately classify new responses according to the LLM's criteria? (A sketch follows the notation below.)
- End-task metric: If direct prediction is hard or the concept is abstract, assess how well the human leverages their new understanding to achieve a specific goal (e.g., does a chess player's Elo rating improve after learning a superhuman chess concept from the LLM?).
Notation for Evaluation:
- x: input to LLM
- y: output from LLM
- f(x,y)↦c: a concept function, where c∈C is a property of inputs/outputs.
- Human-defined concept: e.g., Is y correct for x?
- Machine-defined concept: e.g., Would the model judge y as good for x?
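Using this notation, the "improve" case can be operationalized as a before/after comparison with a drift constraint. A minimal sketch, where `concept_score` stands in for the human-defined f and `divergence` is any application-specific distance between outputs (both hypothetical placeholders):

```python
# Sketch of the "improve" evaluation: does M' score higher than M on the
# human-defined concept f, without drifting too far from M?

def evaluate_improve(inputs, M, M_prime, concept_score, divergence, max_drift=0.1):
    outs = [(x, M(x), M_prime(x)) for x in inputs]
    base = sum(concept_score(x, y) for x, y, _ in outs) / len(outs)        # mean f(x, M(x))
    improved = sum(concept_score(x, y2) for x, _, y2 in outs) / len(outs)  # mean f(x, M'(x))
    drift = sum(divergence(y, y2) for _, y, y2 in outs) / len(outs)
    return {
        "improvement": improved - base,
        "constraint_ok": drift <= max_drift,  # M' must stay close to M
    }
```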
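The "learn" case's simulatability test reduces to agreement between the human's predictions and the machine-defined concept on held-out examples. A minimal sketch, again with placeholder callables:

```python
# Sketch of the simulatability check for the "learn" case: after the teaching
# dialogue, how often does the human's prediction match the machine-defined
# concept f(x, y) on held-out examples?

def simulatability(held_out, machine_concept, human_prediction):
    """held_out: (x, y) pairs not seen during teaching.
    machine_concept(x, y) -> c; human_prediction(x, y) -> the human's guess at c."""
    agreements = sum(
        machine_concept(x, y) == human_prediction(x, y) for x, y in held_out
    )
    return agreements / len(held_out)
```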
Specific Evaluation Challenges & Paths Forward:
- Models don't know "why" they behave as they do: LLMs lack deep meta-understanding of their own behavior. The paper suggests interactive dialogue can be a co-discovery process where human and LLM collaboratively build better mutual mental models.
- Human vs. Machine Concept Distinction: The line can be blurry. The advice is to focus on whether the "improve" or "learn" evaluation framework is more scientifically interesting or useful for the specific application.
- Human evaluation is expensive/hard to replicate: Use LLMs as proxies for humans during development to enable faster, cheaper iteration, while acknowledging that real human evaluation is the ultimate goal.
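As one way to make the LLM-proxy idea concrete, the sketch below role-plays a user with a fixed persona against the explaining model and records the transcript for offline scoring. It reuses the hypothetical `query_llm` placeholder from the earlier sketch and is an assumption-laden illustration, not a method from the paper.

```python
# Sketch of LLM-as-proxy evaluation: role-play a user with a fixed persona
# against the explaining model and record the transcript for offline scoring.

def query_llm(system: str, messages: list[dict]) -> str:
    """Placeholder for a chat-completion call (same assumption as earlier)."""
    raise NotImplementedError

def simulate_session(persona: str, explainer_system: str, n_turns: int = 4) -> list[dict]:
    history: list[dict] = []
    for _ in range(n_turns):
        # Proxy "user" asks the next question a person with this persona would.
        user_msg = query_llm(
            f"You are role-playing this user: {persona}. Given the conversation so "
            "far, ask the single next question you would genuinely ask.",
            history,
        )
        history.append({"role": "user", "content": user_msg})
        # Explaining model responds; in real evaluation this is the system under test.
        history.append({"role": "assistant", "content": query_llm(explainer_system, history)})
    return history

# Transcripts from many personas can then be scored (e.g., with a rubric or a
# follow-up quiz) far more cheaply than running human studies at every iteration.
```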
Relationship to Other Fields
- Cognitive Science: Draws on theories of mental models, grounding in communication, Rational Speech Acts (RSA), and pedagogy (diagnosing learner understanding).
- HCI (Human-Computer Interaction): Relates to XAI interfaces (but emphasizes dialogue over static presentation), interactive machine learning (but focuses on post-hoc explanation over model improvement during training), and human-AI collaboration (agentic interpretability is key for the mutual understanding needed).
Conclusion
The paper argues that while LLMs present enormous complexity, their agentic capabilities also offer a unique opportunity to accelerate human understanding of these systems. By fostering a "teaching mode" in LLMs, agentic interpretability aims to help humans keep pace with AI advancements, potentially even learning superhuman concepts. This approach shifts from solely treating LLMs as objects of study to leveraging them as intelligent tools for their own interpretation.