Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models

Published 18 Feb 2025 in cs.CV | (2502.12892v2)

Abstract: Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability, enabling the unsupervised decomposition of model representations into a dictionary of abstract, human-interpretable concepts. However, we reveal a fundamental limitation: existing SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries, undermining their reliability as an interpretability tool. To address this issue, we draw inspiration from the Archetypal Analysis framework introduced by Cutler & Breiman (1994) and present Archetypal SAEs (A-SAE), wherein dictionary atoms are constrained to the convex hull of data. This geometric anchoring significantly enhances the stability of inferred dictionaries, and their mildly relaxed variants RA-SAEs further match state-of-the-art reconstruction abilities. To rigorously assess dictionary quality learned by SAEs, we introduce two new benchmarks that test (i) plausibility, if dictionaries recover "true" classification directions and (ii) identifiability, if dictionaries disentangle synthetic concept mixtures. Across all evaluations, RA-SAEs consistently yield more structured representations while uncovering novel, semantically meaningful concepts in large-scale vision models.

Abstract PDF Upgrade to Chat

Authors (10)

Summary

The paper introduces Archetypal SAE that embeds convex hull constraints to enhance the stability of dictionary learning in sparse autoencoders.
It demonstrates a trade-off balancing reconstruction fidelity and sparsity, validated through extensive experiments on large vision models.
The approach improves model interpretability and paves the way for applications in diverse domains beyond vision.

Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models

Introduction

The paper "Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models" (2502.12892) addresses a critical limitation in Sparse Autoencoders (SAEs), which have been prominent in machine learning for interpretability tasks by decomposing model representations into a dictionary of human-interpretable concepts. Despite their utility, SAEs exhibit significant instability, leading to inconsistent dictionaries across similar datasets, thus undermining reliability in interpretability applications. To tackle this issue, the authors introduce Archetypal SAEs, drawing inspiration from the Archetypal Analysis framework. This novel approach constrains dictionary atoms to the convex hull of the data, significantly enhancing stability while preserving state-of-the-art reconstruction abilities.

Figure 1: Archetypal-SAEs constrain dictionary atoms to the data's convex hull, improving stability, which contrasts the instability seen in standard SAEs.

Methodology

Concept Extraction as Dictionary Learning

The framework proposed operates by mapping activations into a sparse concept space using dictionary learning methods. The optimization task involves minimizing reconstruction errors while adhering to sparsity constraints. Classical methods such as Non-negative Matrix Factorization and K-means have been utilized but lack modern scalability. Conversely, SAEs leverage backpropagation and batch-learning capabilities, making them appealing for large-scale datasets. However, these methods suffer from instability, leading to distinct dictionaries upon reruns, as demonstrated in recent empirical results.

Archetypal Sparse Autoencoder (A-SAE)

Archetypal SAEs introduce a geometric constraint by ensuring dictionary atoms reside within the convex hull of sample representations. This geometric anchoring yields substantial stability across training runs by preventing dictionary atoms from pointing in directions unrelated to the dataset.

Archetypal Constraint: Each dictionary atom is defined as a convex combination of sample representations, preventing pathological dictionary formations.
Relaxed Variant: Incorporates a mild deviation term allowing slight flexibility beyond the convex hull, balancing modeling adaptability and stability benefits.
Figure 2: Performance comparison across dictionary learning methods, highlighting SAEs' superior trade-offs between sparsity and reconstruction fidelity.

Experimental Validation

Stability and Reconstruction Trade-offs

Extensive experimentation demonstrates the stability-reconstruction trade-off achieved by Archetypal SAEs compared to conventional SAE methods. By implementing dictionary learning techniques across a variety of vision models, the authors establish Archetypal SAEs offering better stability without compromising reconstruction fidelity. The newly introduced stability metric leverages optimal average cosine similarity between dictionaries across multiple training runs, a vital aspect highlighted by empirical findings.

Figure 3: Stability-reconstruction tradeoff showcasing archetypal dictionaries reducing instability.

Impact of Relaxation Parameter

The work explores the impact of relaxation parameter adjustments, which allows exploration beyond the convex hull of data. This mechanism maintains a balance between standard SAE methods and the stringent constraints of archetypal analysis, proving effective in achieving comparable reconstruction performance alongside enhanced stability.

Figure 4: Impact of relaxation parameter on the performance of Archetypal SAE methods.

Implications and Future Work

The introduction of Archetypal SAEs posits significant implications for explainability in large vision models by enhancing stability and interpretability of learned representations. This approach paves the way for extending its application beyond vision, including LLMs and structured data domains. The balance between reconstruction fidelity and stability holds the potential for developing more reliable interpretability protocols, crucial for real-world applications requiring transparency, accountability, and ethical compliance.

Conclusion

In summary, the authors have presented an innovative solution to the instability problem in SAEs, leveraging archetypal anchoring to enhance dictionary learning stability. Experimental results cement the approach's validity, demonstrating marked improvements in structured concept extraction and interpretability. This work contributes meaningfully to the interpretability community, offering a robust method that addresses foundational reliability concerns in dictionary learning for AI models.