Automated Interpretability and Feature Discovery in Language Models with Agents

Published 2 May 2026 in cs.CL, cs.AI, and cs.HC | (2605.01555v1)

Abstract: We introduce an autonomous multiagent framework for mechanistic interpretability that automates both explaining and finding internal features in LLMs. The system runs two coupled loops: (1) explanation refinement, in which an agent proposes competing hypotheses and iteratively tests them with targeted prompt controls and a multi-metric evaluation; and (2) feature discovery, in which an agent generates prompt sets, constructs a k-nearest-neighbor graph in activation space, and retrieves candidate features using statistical separability and semantic coherence criteria. On Gemma-2 family models and on MLP neurons in weight-sparse transformers, the system improves on one-shot auto-interpretations, discovers language-specific and safety-relevant features, and produces auditable explanation traces. These results show that agent-driven empirical loops yield sharper and more falsifiable explanations than one-shot labels.
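
The feature-discovery loop described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the distance metric, the separability statistic, or any thresholds, so cosine similarity, a standardized mean difference, and the synthetic activations below are all assumptions chosen for illustration. The function names `knn_graph` and `separability` are hypothetical, and the semantic coherence criterion is omitted.

```python
import numpy as np

def knn_graph(acts, k):
    """Indices of the k nearest neighbors of each row under cosine similarity (assumed metric)."""
    norms = np.linalg.norm(acts, axis=1, keepdims=True)
    unit = acts / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a prompt is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

def separability(acts, labels):
    """Per-feature standardized mean difference between target (1) and control (0) prompts."""
    pos, neg = acts[labels == 1], acts[labels == 0]
    pooled = np.sqrt(0.5 * (pos.var(axis=0) + neg.var(axis=0))) + 1e-8
    return (pos.mean(axis=0) - neg.mean(axis=0)) / pooled

# Synthetic stand-in for model activations: 20 target and 20 control prompts, 16 features.
rng = np.random.default_rng(0)
n_per_group, n_features = 20, 16
labels = np.array([1] * n_per_group + [0] * n_per_group)
acts = rng.normal(size=(2 * n_per_group, n_features))
acts[labels == 1, 3] += 5.0  # by construction, feature 3 fires on target prompts

neighbors = knn_graph(acts, k=5)          # shape (40, 5): neighbor prompt indices
scores = separability(acts, labels)       # one score per feature
candidates = np.argsort(-scores)[:3]      # top-3 candidate features by separability
print("candidate features:", candidates)
```

In a real pipeline the rows of `acts` would come from the model's activations on agent-generated prompt sets, and the neighbor graph would be used to group related prompts before scoring; here it is only constructed to show the data structure.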