- The paper’s main contribution is MONET, a novel architecture that integrates sparse dictionary learning into a Mixture-of-Experts framework to reduce polysemanticity in LLMs.
- The approach employs parameter-efficient expert composition, scaling to 262,144 experts per layer while letting the total parameter count grow with the square root of the number of experts rather than linearly.
- Empirical evaluations demonstrate MONET’s competitive performance and its ability to control domain-specific, linguistic, and toxicity-related features in language models.
The paper introduces "Mixture of Monosemantic Experts for Transformers" (MONET), a novel architecture that targets polysemanticity in LLMs in order to improve mechanistic interpretability. The core issue addressed is that individual neurons in LLMs often respond to multiple, unrelated concepts, a challenge termed polysemanticity. The authors argue that understanding internal computations is crucial for aligning LLMs with human values and preventing undesirable behaviors such as generating toxic content.
Technical Contribution
MONET is proposed as an advancement over existing Sparse Autoencoders (SAEs), which seek to disentangle superposed features through sparse representations but compromise LLM performance because they rely on a post-hoc reconstruction loss. Instead, MONET incorporates sparse dictionary learning directly into end-to-end pretraining via a Mixture-of-Experts (MoE) architecture, scaling the number of experts to 262,144 per layer while keeping parameter growth efficient: total parameters grow with the square root of the number of experts rather than linearly.
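To make the square-root scaling concrete, here is a minimal PyTorch sketch of product-style expert composition: two banks of m sub-components are combined pairwise, so m × m virtual experts are expressed with only 2m sets of weights. The class name, the soft routing over each bank, and the hyperparameters below are illustrative assumptions, not MONET's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProductExpertLayer(nn.Module):
    """Sketch only: compose m * m virtual experts from two banks of m
    sub-components, so parameters grow with m = sqrt(num_experts)."""

    def __init__(self, d_model: int, d_hidden: int, m: int):
        super().__init__()
        self.m = m  # square root of the virtual expert count
        # Bank A: m candidate "down" projections (d_model -> d_hidden).
        self.down = nn.Parameter(torch.randn(m, d_hidden, d_model) * d_model ** -0.5)
        # Bank B: m candidate "up" projections (d_hidden -> d_model).
        self.up = nn.Parameter(torch.randn(m, d_model, d_hidden) * d_hidden ** -0.5)
        # One light router per bank; a virtual expert is a (down_i, up_j) pair.
        self.router_a = nn.Linear(d_model, m)
        self.router_b = nn.Linear(d_model, m)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model). Dense softmax routing is used here for brevity;
        # a sparse gate would be used in practice.
        g_a = F.softmax(self.router_a(x), dim=-1)  # (batch, m)
        g_b = F.softmax(self.router_b(x), dim=-1)  # (batch, m)
        # Mix the down projections, apply a nonlinearity, then mix the up projections.
        h = F.relu(torch.einsum("bm,mhd,bd->bh", g_a, self.down, x))  # (batch, d_hidden)
        return torch.einsum("bm,mdh,bh->bd", g_b, self.up, h)         # (batch, d_model)

layer = ProductExpertLayer(d_model=64, d_hidden=128, m=512)  # 512**2 = 262,144 virtual experts
print(layer(torch.randn(2, 64)).shape)                       # torch.Size([2, 64])
```

With m = 512, the layer exposes 262,144 virtual experts while storing only 1,024 projection matrices, which is the essence of the square-root parameter scaling described above.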
The key contributions of this work include:
- Parameter-Efficient Expert Composition: A new expert decomposition approach that scales expert count efficiently, addressing both memory usage and computational cost constraints typically associated with large-scale MoE models such as PEER.
- Mechanistic Interpretability: The architecture facilitates fine-grained analysis of expert routing, confirming mutual exclusivity in knowledge representation between different expert groups.
- Robust Knowledge Manipulation: The architecture permits control over domain-specific, linguistic, and toxicity-related knowledge without degrading the model’s general performance; a toy sketch of this idea follows the list.
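As a toy illustration of the knowledge-manipulation idea (not the paper's actual unlearning procedure), a chosen set of experts can be silenced at inference time by zeroing their routing weights and renormalizing. The helper name `prune_experts` and the generic gate-tensor shape are assumptions for this sketch.

```python
import torch

def prune_experts(routing_weights: torch.Tensor,
                  experts_to_remove: torch.Tensor) -> torch.Tensor:
    """Zero out routing weights for selected expert indices and renormalize.

    Illustrative only: `routing_weights` is a generic (batch, num_experts)
    tensor of non-negative gate values, not MONET's internal routing tensor.
    """
    masked = routing_weights.clone()
    masked[:, experts_to_remove] = 0.0
    # Renormalize so the remaining experts still mix to a total weight of 1.
    denom = masked.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    return masked / denom

# Toy usage: suppose experts 3 and 7 were identified (e.g. via activation
# statistics on a target-domain corpus) as carrying the knowledge to remove.
gates = torch.softmax(torch.randn(4, 16), dim=-1)
pruned = prune_experts(gates, torch.tensor([3, 7]))
print(pruned[:, [3, 7]].abs().max())  # tensor(0.) -- target experts silenced
```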
Empirical Evaluation
Empirical results presented in the paper demonstrate that MONET achieves performance competitive with dense LLMs of equivalent parameter counts across diverse language modeling benchmarks. The evaluation spans models ranging from 850 million to 4.1 billion parameters, showcasing MONET's scalability and practical applicability. Notably, the vertical decomposition (VD) variant of MONET consistently outperforms the horizontal decomposition (HD) variant across multiple benchmarks, indicating that the choice of decomposition matters for downstream performance.
Moreover, qualitative analyses reveal monosemantic specialization of individual experts within MONET, with particular experts capturing knowledge tied to specific domains and languages, and demonstrate the potential for manipulating domain-specific or linguistic features.
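A hedged sketch of how such specialization might be surfaced in practice: accumulate routing mass per expert on corpora from different domains and flag experts whose mass concentrates on a single domain. The `route_fn` callable and tensor shapes below are placeholders, not a MONET API.

```python
import torch
from collections import defaultdict

def expert_domain_profile(route_fn, batches_by_domain, num_experts: int):
    """Accumulate how much routing mass each expert receives per domain.

    `route_fn` stands in for whatever returns per-token expert gate values
    of shape (tokens, num_experts) for a batch. Experts whose mass
    concentrates on one domain are candidates for monosemantic specialists.
    """
    profile = defaultdict(lambda: torch.zeros(num_experts))
    for domain, batches in batches_by_domain.items():
        for batch in batches:
            gates = route_fn(batch)            # (tokens, num_experts)
            profile[domain] += gates.sum(dim=0)
    return {d: v / v.sum() for d, v in profile.items()}

# Toy usage with random routing as a stand-in for a trained model.
fake_route = lambda batch: torch.softmax(torch.randn(batch.shape[0], 8), dim=-1)
data = {"code": [torch.zeros(32, 16)], "chemistry": [torch.zeros(32, 16)]}
profiles = expert_domain_profile(fake_route, data, num_experts=8)
print({d: p.argmax().item() for d, p in profiles.items()})  # most-used expert per domain
```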
Implications and Future Directions
MONET presents a significant step towards achieving mechanistic interpretability in LLMs, offering insights into feature disentanglement and the possibilities of more nuanced model behaviors through expert specialization. By effectively scaling the number of monosemantic experts and aligning them with specific features or domains, MONET sets the stage for future explorations in interpretability and controllability within AI systems, as well as in the mitigation of biases and toxic outputs.
The implications for practical applications are profound, potentially extending to safer AI deployments where machine outputs can be more readily audited and adjusted by human operators. The methodology proposed for interpreting and manipulating model knowledge at the expert level opens avenues for future breakthroughs in transparency and model oversight, furnishing AI researchers with a more granular toolkit for model fine-tuning and behavior analysis.
In conclusion, MONET is a valuable addition to the landscape of LLM interpretability, promising not only improved alignment with human expectations but also greater visibility into and control over the computational processes underpinning large-scale AI models. The research underscores the importance of continued exploration into expert specialization within LLMs and suggests new directions for enhancing the interpretability and trustworthiness of AI technologies.