The Linear Centroids Hypothesis: How Deep Network Features Represent Data

Published 13 Apr 2026 in cs.LG | (2604.11962v1)

Abstract: Identifying and understanding the features that a deep network (DN) extracts from its inputs to produce its outputs is a focal point of interpretability research. The Linear Representation Hypothesis (LRH) identifies features in terms of the linear directions formed by the inputs in a DN's latent space. However, the LRH is limited as it abstracts away from individual components (e.g., neurons and layers), is susceptible to identifying spurious features, and cannot be applied across sub-components (e.g., multiple layers). In this paper, we introduce the Linear Centroids Hypothesis (LCH) as a new framework for identifying the features of a DN. The LCH posits that features correspond to linear directions of centroids, which are vector summarizations of the functional behavior of a DN in a local region of its input space. Interpretability studies under the LCH can leverage existing LRH tools, such as sparse autoencoders, by applying them to the DN's centroids rather than to its latent activations. We demonstrate that doing so yields sparser feature dictionaries for DINO vision transformers, which also perform better on downstream tasks. The LCH also inspires novel approaches to interpretability; for example, LCH can readily identify circuits in GPT2-Large. Code to study the LCH is available at https://github.com/ThomasWalker1/LinearCentroidsHypothesis .


Authors (4)

Summary

The paper proposes a centroid-based method for feature identification that overcomes limitations of latent activation techniques.
It leverages piecewise affine approximations and input-output Jacobians to compute centroids, reducing spurious feature detection and enhancing generalization.
Empirical tests on vision and language models indicate that centroid-based probes improve linear probe accuracy and offer robust interpretability.

Authoritative Summary of "The Linear Centroids Hypothesis: How Deep Network Features Represent Data" (2604.11962)

Motivation and Hypothesis

The paper addresses foundational issues in mechanistic interpretability by proposing the Linear Centroids Hypothesis (LCH), an alternative to the Linear Representation Hypothesis (LRH) for feature identification in deep networks (DNs). While LRH asserts that features are represented as linear directions in the latent activation space, LCH asserts that features correspond to the linear directions formed by centroids—explicitly computable vectors summarizing the local functional behavior of a DN across partitions of its input space. This geometric formulation leverages the piecewise affine approximation of DNs and generalizes to all differentiable architectures via their input-output Jacobians.

LCH mitigates limitations of LRH, notably its abstraction from DN subcomponents, propensity for detecting spurious features, and difficulties in contextualizing features across layers or architectural hierarchies. The hypothesis is formally anchored in the geometry induced by power diagram subdivisions of the DN input space, wherein centroids correspond to local experts operating within implicitly defined polytopal regions.

Theoretical Framework

The paper formalizes features as characteristics extracted and utilized by DN subcomponents, with their influence traced via circuits. It establishes, via geometric arguments and analytic proofs, that centroids of DN-induced regions form approximate affine subspaces, so that collections of centroids aligning into linear directions are equivalent to non-spurious features. For continuous piecewise affine (CPA) DNs, centroids can be computed efficiently from input-output Jacobian vector products, which grounds feature analysis in operational terms.
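As a rough illustration of the Jacobian-based computation, the sketch below computes a centroid for a toy two-layer ReLU network in NumPy. It assumes that the centroid at an input x is the transposed input-output Jacobian applied to an all-ones vector (equivalently, the gradient of the summed outputs). That definition is common in the power diagram literature but is not spelled out in this summary, so treat it as an assumption. The network, its weights, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: 4 inputs -> 8 hidden ReLU units -> 3 outputs.
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)


def forward(x):
    """Continuous piecewise affine map: affine -> ReLU -> affine."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2


def centroid(x):
    """Assumed definition: J(x)^T 1, computed as a vector-Jacobian product.

    Within one linear region the ReLU pattern is fixed, so the Jacobian is
    W2 @ diag(mask) @ W1. Pulling the ones vector back layer by layer
    avoids forming the full Jacobian matrix.
    """
    mask = (W1 @ x + b1 > 0).astype(float)  # active hidden units at x
    v = W2.T @ np.ones(W2.shape[0])         # ones vector pulled back through layer 2
    return W1.T @ (mask * v)                # then through the ReLU and layer 1


x = rng.normal(size=4)
mu = centroid(x)  # one vector in input space summarizing the local affine map
```

Because the ReLU pattern is constant inside a linear region, every input in that region shares the same centroid under this definition. For a general differentiable network, the same quantity could be obtained from an automatic differentiation library's vector-Jacobian product instead of the hand-written backward pass used here.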

This centroid-based scheme enables mechanistic interpretability that is robust to the extraction of spurious features. Specifically, the LCH directly ties features to the computational graph, as centroids summarize the action of local experts specific to each region, rather than generic latent activations which may reflect extracted but unused features.

Empirical Evidence and Comparative Analyses

Evidence for LCH spans synthetic and real-world tasks. For example, in DNs trained to classify geometric regions, the centroids for points along domain boundaries cluster into linear subspaces, substantiating the hypothesis. In pre-trained vision models (e.g., ResNet50, Swin-B) and LLMs (e.g., GPT2, Llama-3.1-8B), centroids with distinct semantic input features separate into linear directions under principal component analysis.
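The PCA step mentioned above can be sketched as follows. This is only an illustration of the analysis pipeline (stack centroids, center, project onto the top principal components). The centroid matrix here is a synthetic stand-in built from two hypothetical directions plus noise. It is not data from any of the models named in the summary, and the `pca` helper is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a centroid matrix (one row per input); NOT real model data.
d = 16
dir_a = rng.normal(size=d)
dir_a /= np.linalg.norm(dir_a)
dir_b = rng.normal(size=d)
dir_b /= np.linalg.norm(dir_b)

coeff_a = rng.uniform(1.0, 3.0, size=(100, 1))  # group 0 varies along dir_a
coeff_b = rng.uniform(1.0, 3.0, size=(100, 1))  # group 1 varies along dir_b
centroids = np.vstack([coeff_a * dir_a, coeff_b * dir_b])
centroids += 0.05 * rng.normal(size=centroids.shape)  # small isotropic noise
labels = np.repeat([0, 1], 100)


def pca(X, k):
    """Project rows of X onto the top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)  # fraction of variance per component
    return Xc @ Vt[:k].T, Vt[:k], explained[:k]


proj, components, explained = pca(centroids, k=2)
```

Plotting `proj` colored by `labels` would show whether the two groups fall along distinct directions. With real centroids, the rows of `centroids` would instead come from the network under study.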

Quantitative experiments demonstrate several advantages for LCH over LRH:

Reduction of Spurious Features: Centroids avoid linear representation of features not causally tied to the computation (e.g., random coloring in FashionMNIST), whereas latent activations do not differentiate meaningful features from spurious ones, as evidenced by linear probe accuracy and feature transfer studies.
Generalization Across Architectures/Sizes: Centroid-derived feature dictionaries consistently align across model versions (DINOv2→DINOv3), supporting the Platonic Representation Hypothesis, which predicts convergence of representations in larger models.
Improved Downstream Utility: Feature dictionaries based on centroids show higher linear probe accuracy and increased activation frequency on unseen inputs, indicating better generalization and semantic coherence.
Efficiency and Robustness in Circuit Discovery: Centroid-based attribution metrics facilitate rapid neuron filtering, exemplified in GPT2-Large, obviating the need for exhaustive ablation studies.
Enhanced Probing and Saliency: Centroid-based linear probes trained on plausibility datasets generalize more reliably to truth-based datasets, indicating that centroids capture computational action rather than abstract semantic concepts. Local centroid-based saliency maps, averaged over neighborhoods, faithfully highlight relevant input features and outperform gradient-based approaches, particularly under adversarial training.
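The local saliency idea in the last item can be sketched as below. The sketch assumes that "averaged over neighborhoods" means a Monte Carlo average of centroids over Gaussian perturbations of the input. The paper's actual sampling scheme, noise scale, and sample count are not given in this summary, so `sigma`, `n_samples`, and the function names are hypothetical. The centroid is again assumed to be the transposed Jacobian applied to an all-ones vector, on a toy ReLU network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: 4 inputs -> 8 hidden ReLU units -> 3 outputs.
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)


def centroid(x):
    """Assumed centroid: J(x)^T 1 for the affine -> ReLU -> affine network."""
    mask = (W1 @ x + b1 > 0).astype(float)
    return W1.T @ (mask * (W2.T @ np.ones(W2.shape[0])))


def local_centroid_saliency(x, sigma=0.1, n_samples=64, seed=0):
    """Average centroids over Gaussian perturbations of x (assumed scheme)."""
    noise = np.random.default_rng(seed).normal(scale=sigma, size=(n_samples, x.size))
    return np.mean([centroid(x + eps) for eps in noise], axis=0)


x = rng.normal(size=4)
saliency = local_centroid_saliency(x)  # one score per input coordinate
```

Averaging smooths over the region boundaries near x, so the map reflects the network's behavior in a neighborhood rather than in the single linear region containing x. With `sigma=0.0` the function reduces to the plain centroid at x.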

Computational Considerations

Centroid extraction incurs approximately 10–15% computational overhead compared to collecting latent activations, primarily due to Jacobian calculations. However, this marginal increase is negligible relative to overall interpretability pipeline costs and is justified by the practical gains in interpretability, robustness, and downstream task performance.

Limitations and Future Directions

The current study restricts analysis to centroids; power diagram radii, which further parameterize the partition geometry and potentially refine feature boundaries, are not explored. Future research should integrate radius analysis to provide a more complete geometric account of DN features. Additionally, large-scale assessments of local centroid-based saliency across heterogeneous architectures and inputs are warranted.

A notable theoretical implication is the unification of disparate interpretability techniques—dictionary learning, probing, circuit tracing, and saliency—under the centroid framework, enabling mechanistic analysis at any granularity (layer, neuron, or architectural module). This geometric, computation-grounded approach aligns interpretability research with the intrinsic partition structure of DNs rather than post hoc statistical abstractions.

Practical and Theoretical Implications

Adoption of the LCH framework enhances model reliability by directly correlating feature extraction with computational statistics and by facilitating robust downstream probing, circuit analysis, and saliency generation. This reduces risk in critical deployments by suppressing the detection of spurious features, supports cross-model feature alignment, and aids transparency in auditing neural computations.

As centroids are computable for any differentiable DN component, the framework is broadly applicable across architectures, including transformers and convolutional networks. Its operational simplicity supports straightforward integration into interpretability pipelines, promising advancements in debugging, safety, and transparency. However, the same mechanistic clarity could enable targeted manipulation of model behaviors, underscoring the need for responsible tool development and careful evaluation of impact.

Conclusion

The Linear Centroids Hypothesis reframes deep network feature interpretability from latent activation linearity to geometric centroids, aligning feature identification with the intrinsic structure and computation of DNs. Empirical and theoretical analyses show improved reliability, semantic coherence, and generalization in feature discovery and circuit analysis compared to traditional LRH-based methods. The LCH’s operational efficiency enables broad applicability, offering a principled and mechanistic foundation for future interpretability research and practical model analysis.
