- The paper proposes a centroid-based method for feature identification that overcomes limitations of latent activation techniques.
- It leverages piecewise affine approximations and input-output Jacobians to compute centroids, reducing spurious feature detection and enhancing generalization.
- Empirical tests on vision and language models indicate that centroid-based probes improve linear probe accuracy and offer robust interpretability.
Authoritative Summary of "The Linear Centroids Hypothesis: How Deep Network Features Represent Data" (2604.11962)
Motivation and Hypothesis
The paper addresses foundational issues in mechanistic interpretability by proposing the Linear Centroids Hypothesis (LCH), an alternative to the Linear Representation Hypothesis (LRH) for feature identification in deep networks (DNs). While LRH asserts that features are represented as linear directions in the latent activation space, LCH asserts that features correspond to the linear directions formed by centroids: explicitly computable vectors summarizing the local functional behavior of a DN across partitions of its input space. This geometric formulation leverages the piecewise affine approximation of DNs and generalizes to all differentiable architectures via their input-output Jacobians.
LCH mitigates limitations of LRH, notably its abstraction from DN subcomponents, propensity for detecting spurious features, and difficulties in contextualizing features across layers or architectural hierarchies. The hypothesis is formally anchored in the geometry induced by power diagram subdivisions of the DN input space, wherein centroids correspond to local experts operating within implicitly defined polytopal regions.
Theoretical Framework
The paper formalizes features as characteristics extracted and utilized by DN subcomponents, with their influence traced via circuits. It establishes, via geometric arguments and analytic proofs, that centroids of DN-induced regions form approximate affine subspaces, so collections of centroids that align into linear directions are equivalent to non-spurious features. For continuous piecewise affine (CPA) DNs, input-output Jacobian-vector products yield centroids efficiently, grounding feature analysis in operational terms.
This centroid-based scheme enables mechanistic interpretability that is robust to the extraction of spurious features. Specifically, the LCH directly ties features to the computational graph, as centroids summarize the action of local experts specific to each region, rather than generic latent activations which may reflect extracted but unused features.
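To make the Jacobian-based computation concrete, the following is a minimal sketch for a one-hidden-layer ReLU network. It assumes the centroid at an input x is the transposed input-output Jacobian applied to the all-ones vector, J(x)^T 1, as in the power-diagram view of CPA networks; the summary does not state the exact formula, so treat that definition, the function names, and the random toy weights as illustrative assumptions rather than the paper's implementation.

```python
import random

def relu_mlp(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer ReLU network; returns (output, mask)."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    mask = [1.0 if p > 0 else 0.0 for p in pre]  # activation pattern of the region
    hidden = [p * m for p, m in zip(pre, mask)]
    out = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]
    return out, mask

def centroid(x, W1, b1, W2, b2):
    """Assumed centroid J(x)^T 1: a vector-Jacobian product with the ones vector."""
    _, mask = relu_mlp(x, W1, b1, W2, b2)
    # Pull the ones vector back through the output layer: v = W2^T 1.
    v = [sum(W2[k][j] for k in range(len(W2))) for j in range(len(mask))]
    # Gate by the ReLU pattern, which is constant inside one affine region.
    v = [vj * mj for vj, mj in zip(v, mask)]
    # Pull back through the first layer: W1^T v.
    return [sum(W1[j][i] * v[j] for j in range(len(W1))) for i in range(len(x))]

# Hypothetical toy weights, for illustration only.
rng = random.Random(0)
d_in, d_hid, d_out = 3, 5, 2
W1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hid)]
b1 = [rng.gauss(0, 1) for _ in range(d_hid)]
W2 = [[rng.gauss(0, 1) for _ in range(d_hid)] for _ in range(d_out)]
b2 = [rng.gauss(0, 1) for _ in range(d_out)]

x = [0.3, -0.7, 1.1]
mu = centroid(x, W1, b1, W2, b2)
print(mu)
```

Under this assumed definition, every input inside the same affine region shares one centroid, because the activation pattern, and therefore the Jacobian, is constant there. In an autodiff framework the same quantity would typically be obtained with one reverse-mode pass (for example `jax.vjp` with a ones cotangent) instead of the hand-written backward pass used here.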
Empirical Evidence and Comparative Analyses
Evidence for LCH spans synthetic and real-world tasks. For example, in DNs trained to classify geometric regions, the centroids for points along domain boundaries cluster into linear subspaces, substantiating the hypothesis. In pre-trained vision models (e.g., ResNet50, Swin-B) and LLMs (e.g., GPT2, Llama-3.1-8B), centroids with distinct semantic input features separate into linear directions under principal component analysis.
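The PCA check described above can be sketched in a few lines. The example below uses synthetic stand-ins for centroids (two hypothetical concept groups scattered around opposite ends of one axis) and finds the leading principal direction by power iteration; the data, group names, and noise level are assumptions made for illustration and are not results from the paper.

```python
import math
import random

def top_principal_direction(vectors, n_iter=100):
    """Leading principal component of the rows of `vectors` via power iteration."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(d)]
    centered = [[v[i] - mean[i] for i in range(d)] for v in vectors]
    direction = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        # Apply the unnormalized covariance: C u = X^T (X u).
        scores = [sum(c[i] * direction[i] for i in range(d)) for c in centered]
        new = [sum(s * c[i] for s, c in zip(scores, centered)) for i in range(d)]
        norm = math.sqrt(sum(a * a for a in new))
        direction = [a / norm for a in new]
    return mean, direction

def project(vectors, mean, direction):
    """Scalar coordinate of each vector along the principal direction."""
    d = len(mean)
    return [sum((v[i] - mean[i]) * direction[i] for i in range(d)) for v in vectors]

# Synthetic "centroids": two concept groups around +axis and -axis (assumed data).
rng = random.Random(0)
axis = [2.0, -1.0, 0.5, 0.0]
group_a = [[a + rng.gauss(0, 0.1) for a in axis] for _ in range(20)]
group_b = [[-a + rng.gauss(0, 0.1) for a in axis] for _ in range(20)]

mean, direction = top_principal_direction(group_a + group_b)
proj_a = project(group_a, mean, direction)
proj_b = project(group_b, mean, direction)
print(min(proj_a), max(proj_a), min(proj_b), max(proj_b))
```

If centroids for two semantic groups do line up along a linear direction, their projections onto the leading component fall into two disjoint intervals, as they do for this synthetic data. With real models the centroids would come from the network itself, and a library PCA routine would normally replace the hand-rolled power iteration.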
Quantitative experiments demonstrate several advantages for LCH over LRH:
- Reduction of Spurious Features: Centroids avoid linear representation of features not causally tied to the computation (e.g., random coloring in FashionMNIST), whereas latent activations do not differentiate meaningful features from spurious ones, as evidenced by linear probe accuracy and feature transfer studies.
- Generalization Across Architectures/Sizes: Centroid-derived feature dictionaries consistently align across model versions (DINOv2 to DINOv3), supporting the Platonic Representation Hypothesis, which predicts convergence of representations in larger models.
- Improved Downstream Utility: Feature dictionaries based on centroids show higher linear probe accuracy and increased activation frequency on unseen inputs, indicating better generalization and semantic coherence.
- Efficiency and Robustness in Circuit Discovery: Centroid-based attribution metrics facilitate rapid neuron filtering, exemplified in GPT2-Large, obviating the need for exhaustive ablation studies.
- Enhanced Probing and Saliency: Centroid-based linear probes trained on plausibility datasets generalize more reliably to truth-based datasets, indicating that centroids capture computational action rather than abstract semantic concepts. Local centroid-based saliency maps, averaged over neighborhoods, faithfully highlight relevant input features and outperform gradient-based approaches, particularly under adversarial training.
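The neighborhood-averaged saliency idea from the last bullet can be illustrated with a small sketch. It assumes the centroid is J(x)^T 1, samples the neighborhood with isotropic Gaussian noise, and reports the absolute value of the averaged centroid; none of these choices is specified in the summary, and the hand-picked toy network is hypothetical.

```python
import random

# Hypothetical toy network f(x) = |x0| + relu(2*x1 - 0.5); x2 is never used.
W1 = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
b1 = [0.0, 0.0, -0.5]
W2 = [[1.0, 1.0, 1.0]]

def centroid(x):
    """Assumed centroid J(x)^T 1 for the toy ReLU network above."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    mask = [1.0 if p > 0 else 0.0 for p in pre]
    v = [sum(row[j] for row in W2) * mask[j] for j in range(len(mask))]
    return [sum(W1[j][i] * v[j] for j in range(len(W1))) for i in range(len(x))]

def local_centroid_saliency(x, centroid_fn, sigma=0.1, n_samples=64, seed=0):
    """Average centroids over Gaussian perturbations of x, then take magnitudes."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        x_near = [xi + rng.gauss(0.0, sigma) for xi in x]
        mu = centroid_fn(x_near)
        total = [t + m for t, m in zip(total, mu)]
    return [abs(t / n_samples) for t in total]

x = [0.05, 1.0, 3.0]
saliency = local_centroid_saliency(x, centroid)
print(saliency)
```

In this toy case the unused coordinate x2 receives zero saliency, and x1 keeps its full weight because no sampled point crosses its region boundary. The contribution of x0 is damped because x sits close to the kink at x0 = 0, so neighboring regions with opposite slopes partly cancel in the average, which a single pointwise gradient would not reveal.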
Computational Considerations
Centroid extraction incurs approximately 10–15% computational overhead compared to collecting latent activations, primarily due to Jacobian calculations. However, this marginal increase is negligible relative to overall interpretability pipeline costs and is justified by the practical gains in interpretability, robustness, and downstream task performance.
Limitations and Future Directions
The current study restricts analysis to centroids; power diagram radii, which further parameterize the partition geometry and potentially refine feature boundaries, are not explored. Future research should integrate radius analysis to provide a more comprehensive geometric interpretability. Additionally, large-scale assessments of local centroid-based saliency across heterogeneous architectures and inputs are warranted.
A notable theoretical implication is the unification of disparate interpretability techniques (dictionary learning, probing, circuit tracing, and saliency) under the centroid framework, enabling mechanistic analysis at any granularity (layer, neuron, or architectural module). This geometric, computation-grounded approach aligns interpretability research with the intrinsic partition structure of DNs rather than post hoc statistical abstractions.
Practical and Theoretical Implications
Adoption of the LCH framework enhances model reliability by tying feature extraction directly to the network's computation and by facilitating robust downstream probing, circuit analysis, and saliency generation. This reduces risk in critical deployments by suppressing the detection of spurious features, supports cross-model feature alignment, and aids transparency in auditing neural computations.
As centroids are computable for any differentiable DN component, the framework is broadly applicable across architectures, including transformers and convolutional networks. Its operational simplicity supports straightforward integration into interpretability pipelines, with potential benefits for debugging, safety, and transparency. However, the same mechanistic clarity could enable targeted manipulation of model behaviors, underscoring the need for responsible tool development and careful evaluation of impact.
Conclusion
The Linear Centroids Hypothesis reframes deep network feature interpretability from latent activation linearity to geometric centroids, aligning feature identification with the intrinsic structure and computation of DNs. Empirical and theoretical analyses show improved reliability, semantic coherence, and generalization in feature discovery and circuit analysis compared to traditional LRH-based methods. The LCH's operational efficiency enables broad applicability, offering a principled and mechanistic foundation for future interpretability research and practical model analysis.