- The paper introduces Universal Sparse Autoencoders to create a shared latent space that aligns activations and interpretable concepts across diverse models.
- It trains a single overcomplete sparse autoencoder on activations from a randomly selected model at each step, minimizing reconstruction loss across all models to keep the concept mapping coherent.
- Results show USAEs capture both common visual features and model-specific distinctions, providing novel insights into AI model interpretability and design.
Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment
The paper presents Universal Sparse Autoencoders (USAEs), a framework for interpretable concept alignment across deep neural networks. The method discovers common, interpretable concepts spanning multiple pretrained models, providing a unified lens into how diverse architectures encode information. The core mechanism is a single overcomplete sparse autoencoder trained simultaneously on activations from several models, so that those activations can be aligned and reconstructed through a shared conceptual space.
Methodology
USAEs extend traditional sparse autoencoders from a single model to multiple models. Activations from any participating model are encoded into a universal, sparse representation, which can then be decoded to approximate the activations of any other model in the framework. This aligns concepts across models through a shared latent space in which each model's features are reconstructed from a common dictionary of concepts.
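To make this concrete, here is a minimal PyTorch sketch of what such a universal autoencoder could look like: per-model linear encoders and decoders routed through one overcomplete concept space. The class name, dimensions, and the TopK sparsity choice are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class UniversalSAE(nn.Module):
    """Hypothetical sketch: per-model encoders/decoders sharing one sparse concept space."""

    def __init__(self, model_dims: dict[str, int], n_concepts: int, k: int):
        super().__init__()
        # One encoder and one decoder per model, all passing through the same
        # overcomplete n_concepts-dimensional concept space.
        self.encoders = nn.ModuleDict(
            {name: nn.Linear(dim, n_concepts) for name, dim in model_dims.items()})
        self.decoders = nn.ModuleDict(
            {name: nn.Linear(n_concepts, dim) for name, dim in model_dims.items()})
        self.k = k  # number of active concepts per sample (TopK sparsity, an assumption)

    def encode(self, model_name: str, activations: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.encoders[model_name](activations))
        # Keep only the k largest activations per sample; zero out the rest.
        topk = torch.topk(z, self.k, dim=-1)
        return torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)

    def decode(self, model_name: str, code: torch.Tensor) -> torch.Tensor:
        return self.decoders[model_name](code)
```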
Training selects a model at random at each iteration, generates a batch of its activations, and encodes them into the shared space. That shared representation is then used to reconstruct the activations of every model in the framework, and the objective minimizes the total reconstruction loss across all models. Because every decoder must reconstruct from the same code, this encourages concept alignment while keeping each step's encoder and decoder updates efficient.
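A minimal training-step sketch of this procedure, assuming the `UniversalSAE` interface from the previous snippet; the data layout (one batch of activations per model for the same inputs) and the unweighted sum of losses are assumptions for illustration.

```python
import random
import torch
import torch.nn.functional as F

def train_step(usae, batches: dict[str, torch.Tensor], optimizer) -> float:
    # batches maps each model name to its activations for the *same* batch of inputs.
    source = random.choice(list(batches))            # pick one model at random
    code = usae.encode(source, batches[source])      # encode into the shared sparse space

    loss = 0.0
    for target, target_acts in batches.items():
        recon = usae.decode(target, code)            # reconstruct every model's activations
        loss = loss + F.mse_loss(recon, target_acts)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```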
Results and Findings
Qualitative assessments reveal that USAEs successfully capture a range of semantically coherent concepts across different models—extending from fundamental visual primitives such as colors and shapes to more complex, higher-level abstractions like objects and parts. Moreover, quantitative analysis using metrics like firing entropy and co-fire proportions indicates that many of these identified concepts are both essential for model reconstruction and consistently shared across different pretrained architectures and tasks.
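For intuition, the sketch below shows one plausible way to compute such diagnostics from concept codes gathered on a shared set of inputs. The exact definitions of firing entropy and co-fire proportion in the paper may differ; here entropy measures how evenly a concept's activation mass is spread across models, and the co-fire proportion is the fraction of inputs on which a concept fires for more than one model.

```python
import torch

def firing_entropy(codes: dict[str, torch.Tensor]) -> torch.Tensor:
    # codes[model]: (n_inputs, n_concepts) sparse concept activations on shared inputs.
    mass = torch.stack([c.abs().sum(dim=0) for c in codes.values()])  # (n_models, n_concepts)
    p = mass / mass.sum(dim=0, keepdim=True).clamp_min(1e-8)
    return -(p * p.clamp_min(1e-8).log()).sum(dim=0)                  # one entropy per concept

def cofire_proportion(codes: dict[str, torch.Tensor]) -> torch.Tensor:
    fired = torch.stack([c > 0 for c in codes.values()])              # (n_models, n_inputs, n_concepts)
    fired_anywhere = fired.any(dim=0).float()                         # active for at least one model
    fired_jointly = (fired.sum(dim=0) > 1).float()                    # active for more than one model
    return fired_jointly.sum(dim=0) / fired_anywhere.sum(dim=0).clamp_min(1.0)
```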
An interesting aspect of USAEs is their ability to uncover concepts unique to specific models, indicating how individual architectures or training objectives shape concept encoding. For example, DinoV2 exhibits distinct concepts related to depth and perspective, likely a reflection of its architecture and training regime.
Implications and Future Directions
The introduction of USAEs has profound implications for AI interpretability, especially in multi-model environments. By providing a means to explore and understand how concepts are encoded across different model internals, USAEs pave the way for more insightful analysis of model behaviors and interactions. This understanding could guide better model design, contribute toward mitigating deployment risks, and ensure alignment with regulatory frameworks.
One of the paper’s unique contributions is the application of USAEs to coordinated activation maximization, which allows for the simultaneous visualization of shared concepts across multiple networks. This capability presents new opportunities for identifying not just similarities, but also divergences in how models process information.
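A hedged sketch of how coordinated activation maximization might be set up with the `UniversalSAE` interface above: a single input image is optimized so that one shared concept activates strongly in every model at once. The feature extractors, image size, and optimizer settings are assumptions, and preprocessing and image regularizers are omitted for brevity.

```python
import torch

def coordinated_act_max(usae, feature_extractors: dict[str, torch.nn.Module],
                        concept_id: int, steps: int = 256, lr: float = 0.05) -> torch.Tensor:
    # feature_extractors[name](image) is assumed to return the layer activations
    # that usae.encode(name, ...) expects.
    image = torch.randn(1, 3, 224, 224, requires_grad=True)
    optimizer = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        objective = 0.0
        for name, extractor in feature_extractors.items():
            acts = extractor(image)
            code = usae.encode(name, acts)
            objective = objective + code[..., concept_id].mean()
        loss = -objective            # gradient ascent on the shared concept's activation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return image.detach()
```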
Future work could focus on enhancing the scalability of USAEs, exploring their application beyond vision models to domains such as NLP, and refining their ability to highlight not just universal, but also subtle model-specific concept distinctions. As AI systems become increasingly complex and diverse, approaches like USAEs will prove invaluable in deepening our understanding of the underlying representations that drive these advances.