- The paper introduces MoE-X, a novel Mixture-of-Experts architecture designed for intrinsic interpretability in language models by leveraging sparse activations and refined routing.
- MoE-X employs ReLU experts and a novel sparsity-aware routing mechanism to promote disentangled representations and efficient processing of salient features.
- Benchmarked against dense models and SAEs, MoE-X achieves superior interpretability and competitive performance (perplexity better than GPT-2), enabling more trustworthy AI in sensitive sectors.
The paper "Mixture of Experts Made Intrinsically Interpretable" (2503.07639) introduces MoE-X, a novel Mixture-of-Experts (MoE) LLM designed for intrinsic interpretability, addressing the challenge of polysemanticity in LLMs. The paper posits that wider networks with sparse activations are more likely to capture interpretable factors. MoE-X aims to enhance transparency in AI systems without compromising performance.
Key Components of MoE-X
MoE-X distinguishes itself through several key architectural and training innovations:
Intrinsic Interpretability via MoE Architecture
MoE-X is designed as an intrinsically interpretable architecture, diverging from traditional LLMs that suffer from polysemantic neurons. The MoE structure inherently promotes disentangled and interpretable representations. This design choice aligns with the principle that sparse activations in wider networks can capture more coherent factors, thereby facilitating a clearer understanding of the model's internal representations.
ReLU Experts and Sparsity-Aware Routing
The model incorporates ReLU activations within its experts to encourage sparse hidden activations. On top of this, a redesigned routing mechanism prioritizes experts with the highest expected activation sparsity, so that only the most salient features are selected for processing. Together, these choices address the computational challenges typically associated with training wide, sparse networks.
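A minimal PyTorch sketch of this idea follows. It is illustrative rather than the paper's implementation: `ReLUExpert`, `SparsityAwareMoE`, and the sparsity score used here (the negated count of positive pre-activations) are assumptions made to show the mechanism concretely.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLUExpert(nn.Module):
    """A small feed-forward expert with ReLU, so many hidden units are exactly zero."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)

    def pre_activation(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_in(x)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.relu(self.w_in(x)))

class SparsityAwareMoE(nn.Module):
    """Illustrative MoE layer: route each token to the experts whose ReLU
    activations are expected to be sparsest (fewest non-zero hidden units)."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            ReLUExpert(d_model, d_hidden) for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model). Score each expert by how sparsely it would fire on x.
        # Stand-in estimator: fewer positive pre-activations => higher score.
        scores = []
        for expert in self.experts:
            pre = expert.pre_activation(x)             # (batch, d_hidden)
            n_active = (pre > 0).float().sum(dim=-1)   # non-zeros after ReLU
            scores.append(-n_active)
        scores = torch.stack(scores, dim=-1)           # (batch, n_experts)

        weights = F.softmax(scores, dim=-1)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalize over kept experts

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = SparsityAwareMoE(d_model=64, d_hidden=256, n_experts=8, top_k=2)
tokens = torch.randn(4, 64)
print(moe(tokens).shape)  # torch.Size([4, 64])
```

In this sketch the router relies on a cheap estimate of post-ReLU sparsity rather than a separate learned gate; the paper's actual estimator and routing details may differ.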
Scalability and Performance Evaluation
MoE-X is benchmarked against dense models and post-hoc interpretability techniques such as Sparse Autoencoders (SAEs) on chess and natural language tasks. Results indicate that MoE-X achieves superior interpretability compared to SAE-based approaches while remaining competitive on raw performance, attaining perplexity better than GPT-2.
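For reference, perplexity is simply the exponential of the average per-token cross-entropy. The snippet below shows the computation on toy data; the actual evaluation corpus, tokenization, and models are the paper's and are not reproduced here.

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean negative log-likelihood over tokens).
    logits: (n_tokens, vocab_size), targets: (n_tokens,)."""
    nll = F.cross_entropy(logits, targets, reduction="mean")
    return torch.exp(nll).item()

# Toy example with random logits; real numbers come from scoring a
# held-out corpus with MoE-X and GPT-2 under the same tokenizer.
logits = torch.randn(10, 50257)   # GPT-2-sized vocabulary, hypothetical
targets = torch.randint(0, 50257, (10,))
print(perplexity(logits, targets))
```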
Theoretical and Practical Implications
Theoretical Advances
The paper examines architectural choices that enhance interpretability, finding that expanding the hidden size and promoting activation sparsity contribute significantly to clearer representations without extensive post-hoc analysis. This work advances the understanding of structural properties that facilitate interpretability in neural networks.
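To make "activation sparsity" concrete, one simple proxy is the fraction of hidden units that are exactly or nearly zero for a batch of tokens. The snippet below is an illustrative measurement, not the paper's protocol; the `activation_sparsity` helper and the ReLU-versus-GELU comparison are assumptions for the sake of the example.

```python
import torch
import torch.nn.functional as F

def activation_sparsity(hidden: torch.Tensor, eps: float = 1e-6) -> float:
    """Fraction of post-activation values whose magnitude is (near) zero.
    hidden: (n_tokens, d_hidden)."""
    return (hidden.abs() <= eps).float().mean().item()

# Toy comparison: ReLU zeroes out roughly half of random pre-activations,
# whereas GELU leaves almost every unit slightly non-zero.
pre = torch.randn(1000, 4096)
print(activation_sparsity(torch.relu(pre)))   # ~0.5
print(activation_sparsity(F.gelu(pre)))       # ~0.0
```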
Practical Applications
MoE-X offers a pathway for deploying AI systems in sectors where transparency and predictability are critical, such as healthcare, law, and autonomous systems. By reducing the complexity barrier, MoE-X makes AI more manageable and trustworthy in sensitive applications.
Future Research Directions
Future research directions include scaling MoE-X to larger models, refining sparsity techniques to better balance performance and interpretability across diverse tasks, and applying these principles to neural architectures beyond transformers.
In summary, the MoE-X architecture offers a promising approach to enhancing the interpretability of LLMs. By integrating intrinsic interpretability into the model architecture, leveraging ReLU activations and sparsity-aware routing, and achieving competitive performance, MoE-X paves the way for more transparent and manageable AI systems.