Agent-Centric Interpretability

Updated 9 May 2026

Agent-centric interpretability is an approach that defines and explains intelligent agents' internal states, decision-making, and coordination in dynamic, complex environments.
It employs methodologies including unified Bayesian models, multi-agent logging, and interactive protocols that iteratively refine explanations and support human oversight.
This framework enhances system-level accountability by ensuring transparent decision traceability and alignment of agent behaviors with human expectations.

Agent-centric interpretability encompasses the principles, frameworks, and methodologies that position intelligent agents—autonomous entities capable of learning, planning, and acting in dynamic environments—at the center of the interpretability and explainability process. Unlike traditional post-hoc or model-centric methods, agent-centric interpretability treats agents as entities whose internal models, states, coordination structures, learning dynamics, and interaction histories must be elucidated to ensure robust oversight, human alignment, and system-level accountability. Research in this area integrates multi-agent architectures, iterative explanation/refinement loops, human-aware modeling, system logging, causal graph analysis, and interactive conversational protocols, all tailored for agents and agentic systems operating in complex, often safety-critical domains.

1. Conceptual Foundations and Motivations

Agent-centric interpretability has emerged in response to the growing deployment of black-box agents and agentic systems in real-world, multi-step, and potentially open-world settings, where emergent and opaque behaviors, temporal dependence, and goal misalignment pose new risks for trust and accountability. The core motivation is to build infrastructure and protocols that let humans understand, predict, and diagnose the decision-making processes of agents and their broader coordination structures, rather than merely producing isolated feature attributions or static global model summaries.

Central to the approach is treating interpretability as fundamentally observer-dependent and embedding the interpretive context within the agent's lifecycle, including goal formation, planning, coordination, and environment interaction (Zhu et al., 23 Jan 2026). Both the human observer's evolving beliefs and the agent's own internal state representations are modeled explicitly, and communication and reasoning are made contextually adaptive (Sreedharan et al., 2021, Sreedharan et al., 2020).

Human-in-the-loop and agent-as-teacher paradigms further expand the scope: interpretability is achieved not merely by artifact presentation but through iterative, sometimes bi-directional, interactions where an agent models the user and adapts its explanations or behaviors accordingly (Kim et al., 13 Jun 2025, He et al., 20 Mar 2026).

2. Key Theoretical Frameworks

The formalization of agent-centric interpretability is grounded in unified Bayesian models, multi-agent system design, and causal analysis.

Unified Bayesian Formulation: The human observer maintains a hypothesis set $\mathbb{M}^R$ over possible agent models (including an explicit "unknown" or "random" model $\mathcal{M}^0$), and updates their posterior after each observed behavior prefix via

$P(\mathcal{M}\mid \hat\tau_{\mathrm{obs}}) = \frac{P_\ell(\hat\tau_{\mathrm{obs}} \mid \mathcal{M})\, P(\mathcal{M})}{\sum_{\mathcal{M}'\in\mathbb{M}^R} P_\ell(\hat\tau_{\mathrm{obs}} \mid \mathcal{M}')\, P(\mathcal{M}')}$

where $P_\ell$ is typically a noisy-rational model likelihood (Sreedharan et al., 2021, Sreedharan et al., 2020). Explicability, legibility, and predictability are unified as functionals on this posterior.
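The posterior update above can be sketched in a few lines of Python. This is a minimal illustration, not the cited authors' implementation: the Boltzmann-style form of the noisy-rational likelihood, the model names, the priors, the plan costs, and the flat likelihood assigned to $\mathcal{M}^0$ are all assumptions made for the example.

```python
import math


def noisy_rational_likelihood(observed_cost, optimal_cost, beta=1.0):
    """Assumed Boltzmann-style score: behavior that is closer to optimal
    under a model is exponentially more likely under that model."""
    return math.exp(-beta * max(0.0, observed_cost - optimal_cost))


def posterior(priors, likelihoods):
    """Bayes' rule over a finite hypothesis set (dicts keyed by model name)."""
    unnormalized = {m: likelihoods[m] * priors[m] for m in priors}
    z = sum(unnormalized.values())
    if z == 0:
        raise ValueError("all hypotheses have zero posterior mass")
    return {m: v / z for m, v in unnormalized.items()}


# Hypothetical hypothesis set: two candidate agent models plus the "unknown" model M0.
priors = {"M_fast": 0.45, "M_safe": 0.45, "M0": 0.10}

# Made-up costs: (cost of the observed prefix's best completion, optimal plan cost)
# as evaluated under each candidate model.
costs = {"M_fast": (12.0, 10.0), "M_safe": (10.0, 10.0)}

likelihoods = {m: noisy_rational_likelihood(obs, opt) for m, (obs, opt) in costs.items()}
likelihoods["M0"] = 0.05  # assumption: M0 explains any behavior equally, but weakly

post = posterior(priors, likelihoods)
for model, p in sorted(post.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {p:.3f}")
```

In this toy setting the observed prefix is optimal under `M_safe` and suboptimal under `M_fast`, so the observer's belief shifts toward `M_safe`. Keeping $\mathcal{M}^0$ in the hypothesis set means that behavior poorly explained by every named model still has somewhere for posterior mass to go.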

System-Level Agentic Interpretability: For a system $\mathcal{A} = \langle G, M, P, \Omega, \Lambda \rangle$ (agents, memory, planning, orchestration, environment tools), interpretability requires complete event and causal logging:
- Decision traceability: $\tau_t^{(g)} = \langle s_t, h_t, a_t, r_t, \psi_t \rangle$ per agent $g$ at each step $t$.
- Temporal causal chains: DAGs reflecting cross-step influences.
- Coordination transparency: Full metadata on communication and orchestration events (Zhu et al., 23 Jan 2026).
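As a rough sketch of what such logging could look like, the Python below stores one record per agent and step and links records into a causal graph. The class names, field names, and the reading of $r_t$ as a rationale and $\psi_t$ as free-form metadata are assumptions for illustration; the source does not specify a schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class TraceRecord:
    """One decision trace tuple for agent g at step t (field meanings assumed)."""
    agent: str                                     # g
    step: int                                      # t
    state: Any                                     # s_t
    history: Any                                   # h_t
    action: Any                                    # a_t
    rationale: str                                 # r_t (assumed reading)
    metadata: dict = field(default_factory=dict)   # psi_t (assumed reading)


class TraceLog:
    """Append-only log; each record names the earlier records that influenced it."""

    def __init__(self):
        self.records = {}   # (agent, step) -> TraceRecord
        self.parents = {}   # (agent, step) -> set of influencing (agent, step) keys

    def add(self, record, influenced_by=()):
        key = (record.agent, record.step)
        if key in self.records:
            raise ValueError(f"record {key} already logged")
        for cause in influenced_by:
            if cause not in self.records:
                raise KeyError(f"unknown cause {cause}")
        # Causes must already be logged, so the graph stays acyclic by construction.
        self.records[key] = record
        self.parents[key] = set(influenced_by)

    def causal_ancestors(self, key):
        """All records that directly or indirectly influenced `key`."""
        seen, stack = set(), list(self.parents[key])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(self.parents[node])
        return seen


log = TraceLog()
log.add(TraceRecord("planner", 0, "task received", [], "decompose", "task is too large"))
log.add(TraceRecord("retriever", 1, "subtask A", [], "search", "needs sources"),
        influenced_by=[("planner", 0)])
log.add(TraceRecord("writer", 2, "sources found", [], "draft", "enough evidence"),
        influenced_by=[("retriever", 1)])
ancestors = log.causal_ancestors(("writer", 2))
print(sorted(ancestors))
```

Requiring every cause to be logged before its effect is one simple way to guarantee a DAG; a production system would likely also persist the log and record communication and orchestration events as their own record types.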

Agentic Multi-agent Enhancement: Modular sub-agents specialize in explanation, discovery, and critique, supporting auditable and composable workflows. Each agent's output, reasoning, and feedback contributions are logged and traceable (He et al., 20 Mar 2026, Marin-Llobet et al., 2 May 2026).

3. Methodologies and System Architectures

A wide range of methodologies instantiate agent-centric interpretability, including:

Iterative, Multi-agent Workflows: For example, explainable narrative generation is decomposed into narrator, evaluator, critic, and coherence agents, each contributing explicit outputs and enabling stepwise audit trails (He et al., 20 Mar 2026). These workflows support iterative refinement, modularity, and ensemble feedback (e.g., majority voting in evaluator ensembles).
Interactive, Agentic Protocols: Agentic interpretability sessions unfold over multiple dialogue turns. The agent (often an LLM) maintains an explicit or implicit model of the user's knowledge ($M_m$), adapts its choices and explanations accordingly, and solicits feedback to refine alignment. This contrasts with inspective one-shot methods (Kim et al., 13 Jun 2025).
Model-Agnostic Surrogates and Explanation Conditioning: Black-box agent behavior is distilled into interpretable surrogates (e.g., local decision-tree paths) which then serve as the only permissible basis for natural language explanation by LLMs, minimizing hallucination and constraining explanations to grounded rationales (Xi-Jia et al., 8 Apr 2025, Zhang et al., 2023).
Mechanistic and Empirical Agent Loops: Automated multi-agent frameworks drive both feature discovery (e.g., kNN graphs, clustering) and explanation refinement (hypothesis testing, metric-based selection, polysemanticity detection) within LLMs and neural agents. Each candidate explanation is subjected to targeted intervention and falsification loops, supporting auditability and robustness (Marin-Llobet et al., 2 May 2026).
Systematic Policy and Learning Visualization: Frameworks such as REVEAL-IT visualize agent policy evolution, weight changes, and task curriculum using graph-based renderings and GNN-based explainers, directly mapping agent learning dynamics for inspection and optimization (Ao et al., 2024).
Human-Centered Extrospective Modeling: Explanation selection is driven not just by what is salient to the agent but by what is most likely novel to the specific user, estimated via dynamic support scores over knowledge items within a SUDO context framework (Spillner et al., 29 Jul 2025).
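To make the surrogate-conditioning idea from the list above concrete, the following Python sketch walks a small decision tree, collects the conditions on the path taken for one observation, and builds a prompt that restricts a language model to those conditions. The tree here is hand-written and hypothetical (in practice it would be fitted to logged agent behavior), and the feature names, thresholds, and prompt wording are assumptions rather than anything taken from the cited papers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Internal nodes carry a test; leaves carry only an action."""
    feature: Optional[str] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None     # branch taken when value <= threshold
    right: Optional["Node"] = None    # branch taken when value > threshold
    action: Optional[str] = None


def decision_path(root, observation):
    """Return the surrogate's action and the conditions satisfied along the path."""
    conditions = []
    node = root
    while node.action is None:
        value = observation[node.feature]
        if value <= node.threshold:
            conditions.append(f"{node.feature} = {value} <= {node.threshold}")
            node = node.left
        else:
            conditions.append(f"{node.feature} = {value} > {node.threshold}")
            node = node.right
    return node.action, conditions


def grounded_prompt(action, conditions):
    """Prompt that limits the explanation to the surrogate's path conditions."""
    facts = "\n".join(f"- {c}" for c in conditions)
    return (
        "Explain why the agent chose the action below. "
        "Use only the listed conditions; do not add other reasons.\n"
        f"Action: {action}\nConditions:\n{facts}"
    )


# Hypothetical surrogate for a driving agent.
tree = Node(
    feature="distance_to_obstacle", threshold=2.0,
    left=Node(action="brake"),
    right=Node(
        feature="speed", threshold=5.0,
        left=Node(action="accelerate"),
        right=Node(action="cruise"),
    ),
)

action, conditions = decision_path(tree, {"distance_to_obstacle": 6.0, "speed": 3.0})
prompt = grounded_prompt(action, conditions)
print(prompt)
```

The design choice is that the language model never sees the black-box policy, only the path conditions, so any rationale it produces can be checked line by line against the surrogate.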

4. Metrics and Evaluation Protocols

Agent-centric interpretability draws on tailored metrics that go beyond static feature faithfulness:

Metric / Protocol	Purpose	Example (Paper)
Faithfulness (Rank, Sign, Value accuracy)	Quantify alignment of generated narratives with ground truth (e.g., SHAP explanations)	(He et al., 20 Mar 2026)
Decision Traceability Score (DTS)	Fraction of steps with complete per-agent decision logs	(Zhu et al., 23 Jan 2026)
Goal Alignment Score (GAS)	Measures preservation of explicit constraints across multi-agent system steps	(Zhu et al., 23 Jan 2026)
Compounding Error Bound (CEB)	Upper bound on cumulative propagated error over a decision sequence	(Zhu et al., 23 Jan 2026)
Polysemanticity/Coherence metrics	Detect multiple irreducible hypotheses and assess the linguistic quality of explanations	(Marin-Llobet et al., 2 May 2026, He et al., 20 Mar 2026)
Agent Interpretability Score	Proportion of LLM-graded interpretability tests (across simulation/counterfactuals) passed by a model representation	(Singh et al., 5 May 2026)
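As an illustration of how one of the tabulated metrics might be computed, the sketch below implements the Decision Traceability Score under a literal reading of its description in the table: the fraction of steps whose per-agent logs are complete. The log layout, the list of required fields, and the rule that a step with no entries counts as incomplete are assumptions; the cited paper's exact definition may differ.

```python
# Assumed set of fields that a complete decision log entry must contain.
REQUIRED_FIELDS = ("state", "history", "action", "rationale", "metadata")


def entry_is_complete(entry):
    """An entry is complete when every required field is present and not None."""
    return all(entry.get(name) is not None for name in REQUIRED_FIELDS)


def decision_traceability_score(steps):
    """steps: list of dicts, one per time step, mapping agent name -> log entry.

    A step counts as traceable when it has at least one entry and every
    entry in it is complete. Returns 0.0 for an empty run.
    """
    if not steps:
        return 0.0
    complete = sum(
        1 for step in steps
        if step and all(entry_is_complete(entry) for entry in step.values())
    )
    return complete / len(steps)


def full_entry(action):
    """Helper producing a complete, made-up log entry."""
    return {"state": "s", "history": [], "action": action,
            "rationale": "because", "metadata": {}}


# Hypothetical four-step run; the third step is missing the coder's rationale.
steps = [
    {"planner": full_entry("plan")},
    {"planner": full_entry("delegate"), "coder": full_entry("write")},
    {"coder": {**full_entry("run"), "rationale": None}},
    {"planner": full_entry("finish")},
]

dts = decision_traceability_score(steps)
print(f"DTS = {dts:.2f}")
```

Three of the four steps are fully logged, so the score for this toy run is 0.75.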

Human-in-the-loop studies supplement these metrics, measuring end-user prediction accuracy, subjective preferences over explanation style, and interaction helpfulness, often using controlled conditions to assess the efficacy of agent-generated explanations in real decision tasks (Xi-Jia et al., 8 Apr 2025, Zhang et al., 2023, He et al., 20 Mar 2026). Evaluations may also be entangled with agentic refinements, creating non-i.i.d. interactions that challenge classical benchmarking (Kim et al., 13 Jun 2025).

5. Applications and System Implementations

Agent-centric interpretability is implemented in a spectrum of system types, including:

LLM-based multi-agent narrative and XAI systems: Architectures such as those of He et al. (20 Mar 2026) and Prasai et al. (5 Nov 2025) employ explicit agent routing, modular tool invocation (e.g., BertViz, TransformerLens, RAG-explainer), and conversational, auditable interfaces.
Mechanistic LLM feature discovery and validation systems: Multi-agent frameworks automate discovery and empirical validation of internal model features, supporting scalable mechanistic interpretability (Marin-Llobet et al., 2 May 2026).
Automated data science pipelines for agents: The agentic-imodels framework evolves new regressor classes that optimize for both agent-interpretable display and predictive performance, directly benefiting downstream agentic work (Singh et al., 5 May 2026).
Reinforcement learning curriculum/diagnosis: Visualization frameworks such as REVEAL-IT make agents’ evolving policy structures and training dynamics directly accessible via GNN-annotated graph renderings, supporting both capacity analysis and curriculum learning (Ao et al., 2024).
Human-centered, personalized AI assistants: Agent-worldview models surface “uncommon ground” and tailor explanations in real time based on the individual user’s knowledge trajectory, enabling adaptive and extrospective explanation (Spillner et al., 29 Jul 2025).

6. Limitations, Open Challenges, and Future Directions

Key challenges in agent-centric interpretability include:

Faithfulness and auditability: LLM-generated rationales are not guaranteed to be faithful in complex, time-dependent, or deceptive settings (reported chain-of-thought faithfulness rates are 20–40%), and compounding temporal dependencies exceed the reach of feature-attribution methods (Zhu et al., 23 Jan 2026).
Evaluation complexity: Human-agent entanglement in interactive interpretability loops poses difficulties for reproducibility, statistical analysis, and protocol standardization (Kim et al., 13 Jun 2025).
Cross-agent and system-level reasoning: Existing methods often do not scale to capture system-level causality, policy composition, and error propagation in multi-agent collaborations (Zhu et al., 23 Jan 2026).
Human vs. agent interpretability: Tools traditionally optimized for human simulatability may be suboptimal for fully automated agent pipelines; agentic-imodels and related tools seek to close this gap (Singh et al., 5 May 2026).
Polysemanticity and abstraction: Mechanistic interpretability struggles with features exhibiting polysemanticity; explicit detection and reporting in agentic frameworks address only part of the problem (Marin-Llobet et al., 2 May 2026).
Scalability and modality coverage: Many distillation and surrogate approaches are limited to structured, non-dense features and may not generalize to fully visual or multi-modal systems (Xi-Jia et al., 8 Apr 2025, Zhang et al., 2023).

Active research targets the development of:

Formal causal models and meta-explanation layers for system-level query and fusion (Zhu et al., 23 Jan 2026);
Adaptive, personalized explanation agents rooted in real-time user modeling (Spillner et al., 29 Jul 2025, He et al., 20 Mar 2026);
Integration of mechanistic and conversational (agentic) interpretability paradigms (Kim et al., 13 Jun 2025);
Standardized benchmarks and evaluation ecosystems for holistic, agent-centric interpretability infrastructures (Zhu et al., 23 Jan 2026).

7. Practical Implications and System-Level Accountability

Agent-centric interpretability is catalyzing a shift from explanation as a localized, model-centric artifact to a comprehensive, lifecycle-embedded infrastructure. System-level accountability is realized through auditable, modular agent protocols, explicit causal graph architectures, and dynamic, personalized explanations that bridge both human and autonomous agent users. Such infrastructures are now seen as critical enablers of trustworthy, controllable, and aligned deployment of advanced agentic systems in domains ranging from data science automation and AI assistants to embodied robotics and autonomous multi-agent collaboration (Zhu et al., 23 Jan 2026, Singh et al., 5 May 2026, Prasai et al., 5 Nov 2025).

These developments are driving the field toward transparent, interactive, and diagnosis-ready AI, where both human users and agentic subsystems can jointly understand and scrutinize decision logic across scales and abstraction levels.