Crosscoder-Based Feature Comparisons
- Crosscoder-based feature comparisons are methodologies that align neural representations across models using shared sparse autoencoder dictionaries.
- They quantify shared and specialized features through metrics like weighted max pairwise Pearson correlation and comparative sharedness.
- These techniques provide actionable insights for model interpretability, safety auditing, and causal attribution by isolating behaviorally relevant features.
Crosscoder-based feature comparisons are a class of methods for quantitatively and causally analyzing how neural representations (features, directions, or concepts) are shared, transferred, or specialized between neural network models or checkpoints. They are increasingly used as a unified, interpretable basis for comparing encoders across modalities, model sizes, architectures, fine-tuning regimes, and individual training stages. Unlike simple activation statistics or representation-similarity metrics, crosscoder-based comparisons are grounded in shared sparse autoencoder dictionaries, which give directly aligned, interpretable feature spaces and allow fine-grained attribution of conceptual overlap or divergence across models.
1. Mathematical Framework and Metric Definitions
Central to crosscoder-based comparison is learning a shared (or partitioned) dictionary of sparse latent features that aligns the activation spaces of two or more models. Let $a^A(x)$ and $a^B(x)$ denote the activations of models $A$ and $B$ on an input $x$ (e.g., hidden states at a particular layer). The standard procedure trains sparse autoencoders, typically with a TopK or other hard sparsity constraint, such that both models' activations are reconstructed from the same code, or from codes sharing a portion of dictionary indices.
Key metrics for feature comparison include:
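- Weighted Max Pairwise Pearson Correlation (wMPPC): For features $f_i^A$ (from a sparse autoencoder on model $A$) and $f_j^B$ (from one on model $B$), the maximum cross-model Pearson correlation $\rho_i^{A \to B} = \max_j \rho(f_i^A, f_j^B)$ is computed and weighted by the total activation $w_i$ of $f_i^A$. The aggregate wMPPC is given by
$$\mathrm{wMPPC}(A \to B) = \frac{\sum_i w_i \, \rho_i^{A \to B}}{\sum_i w_i}.$$
This quantifies feature-level representational alignment between entire models (Cornet et al., 24 Jul 2025).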
- Comparative Sharedness: For a source feature $f_i^A$ in model $A$, comparative sharedness quantifies how much more strongly that feature aligns with a group of models $\mathcal{G}$ than with a second group $\mathcal{H}$.
This construct identifies features that are shared with one model or family but not another (Cornet et al., 24 Jul 2025).
- Relative Decoder Norms (Model-Specificity): For decoder vectors $d_j^A$ and $d_j^B$ associated with feature $j$, their relative norm, for example
$$r_j = \frac{\lVert d_j^B \rVert_2}{\lVert d_j^A \rVert_2 + \lVert d_j^B \rVert_2},$$
serves as a post-hoc measure of model-specificity or exclusivity (Jiralerspong et al., 12 Feb 2026).
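- Relative Indirect Effect (RelIE): When studying evolution across training checkpoints, the causal importance of a feature is quantified via RelIE, which normalizes its ablation-induced performance drop (indirect effect, $\mathrm{IE}$) across snapshots. For two checkpoints $c_1$ and $c_2$, one form consistent with this description is
$$\mathrm{RelIE}_i = \frac{\mathrm{IE}_i^{(c_2)}}{\mathrm{IE}_i^{(c_1)} + \mathrm{IE}_i^{(c_2)}}.$$
Features with RelIE near one are specific to a checkpoint; intermediate values denote shared features (Bayazit et al., 5 Sep 2025).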
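As a rough illustration of the first and third metrics above, the following NumPy sketch computes an activation-weighted max pairwise Pearson correlation and a relative decoder norm. The function names, the array layouts, and the specific ratio used for the relative norm are assumptions made for this example; the cited papers may normalize differently.

```python
import numpy as np

def pearson_matrix(fa, fb, eps=1e-8):
    """Pearson correlation between every column of fa and every column of fb.

    fa: (n_samples, n_feat_a), fb: (n_samples, n_feat_b); both hold feature
    activations recorded on the same inputs (assumed layout).
    """
    za = (fa - fa.mean(axis=0)) / (fa.std(axis=0) + eps)
    zb = (fb - fb.mean(axis=0)) / (fb.std(axis=0) + eps)
    return za.T @ zb / fa.shape[0]

def wmppc(fa, fb):
    """Activation-weighted mean of each A-feature's best correlation with B."""
    best = pearson_matrix(fa, fb).max(axis=1)   # max over B features
    weights = np.abs(fa).sum(axis=0)            # total activation per A feature
    return float((weights * best).sum() / weights.sum())

def relative_decoder_norm(dec_a, dec_b, eps=1e-8):
    """dec_a, dec_b: (n_latents, d_model) decoder matrices (assumed layout).

    Returns values near 0 for A-specific latents, near 1 for B-specific
    latents, and near 0.5 for latents that write equally to both models.
    """
    norm_a = np.linalg.norm(dec_a, axis=1)
    norm_b = np.linalg.norm(dec_b, axis=1)
    return norm_b / (norm_a + norm_b + eps)
```

If model B's features are just a permutation of model A's, `wmppc` comes out close to 1, which is a quick sanity check before running it on real activations.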
2. Crosscoder Architectures and Training Protocols
2.1 Sparse Autoencoder (SAE) and TopK Crosscoders
The principal architecture comprises an encoder and a decoder for each model $m$, with a shared latent code $z$ constrained by sparsity. For the TopK variant, with $a^m$ the activation of model $m$:
$$z = \mathrm{TopK}\Big(\sum_m W_{\mathrm{enc}}^m a^m + b_{\mathrm{enc}}\Big).$$
Reconstructions are produced for each model:
$$\hat{a}^m = W_{\mathrm{dec}}^m z + b_{\mathrm{dec}}^m.$$
Losses include mean-squared reconstruction error and per-feature sparsity regularization, enforced via L1 penalties or hard BatchTopK gating. The BatchTopK variant prevents representation drift and yields more robust inference of model-specific features (Minder et al., 3 Apr 2025, Chaudhari et al., 6 Mar 2026).
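The forward pass described above can be sketched in a few lines of NumPy. This is a minimal, untrained illustration under assumed conventions: per-model encoder and decoder matrices stored in dictionaries, a shared encoder bias, and a per-sample TopK applied after a ReLU. None of these names come from the cited papers.

```python
import numpy as np

def topk_rows(pre, k):
    """Apply ReLU, then keep the k largest entries in each row and zero the rest."""
    acts = np.maximum(pre, 0.0)
    idx = np.argpartition(acts, -k, axis=1)[:, -k:]
    out = np.zeros_like(acts)
    np.put_along_axis(out, idx, np.take_along_axis(acts, idx, axis=1), axis=1)
    return out

def crosscoder_forward(acts, enc, dec, b_enc, b_dec, k):
    """Encode all models into one sparse code, then decode per model.

    acts[m]: (batch, d_m) activations of model m
    enc[m]:  (d_m, n_latents) encoder, dec[m]: (n_latents, d_m) decoder
    """
    # Shared pre-activation: sum of per-model encoder outputs plus a shared bias.
    pre = sum(acts[m] @ enc[m] for m in acts) + b_enc
    z = topk_rows(pre, k)
    # Each model is reconstructed from the same code with its own decoder.
    recon = {m: z @ dec[m] + b_dec[m] for m in acts}
    # Summed per-model mean squared reconstruction error.
    loss = sum(np.mean(np.sum((recon[m] - acts[m]) ** 2, axis=1)) for m in acts)
    return z, recon, loss
```

Because the models only share the latent code, their activation widths `d_m` do not need to match.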
2.2 Dedicated Feature Crosscoder (DFC)
The DFC extends standard crosscoders by partitioning the dictionary into three non-overlapping index sets: $A$-exclusive, $B$-exclusive, and shared. Structural constraints, enforced via decoder clamping, yield exact model exclusivity without the approximation error of post-hoc norm analysis. This architecture targets robust model diffing and has demonstrated sharper separation of model-unique features, particularly for safety-relevant or policy-alignment features (Jiralerspong et al., 12 Feb 2026, Shportko et al., 25 Jun 2026).
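One way to picture decoder clamping is as a pair of fixed binary masks over latent indices. The sketch below assumes a latent layout of `[A-exclusive | B-exclusive | shared]` and zeroes the decoder rows a model is not allowed to use; the layout, the function names, and the choice to mask only decoders are illustrative assumptions, not details confirmed by the source.

```python
import numpy as np

def partition_masks(n_a, n_b, n_shared):
    """Build per-model masks for the layout [A-exclusive | B-exclusive | shared]."""
    n_latents = n_a + n_b + n_shared
    mask_a = np.ones(n_latents)
    mask_b = np.ones(n_latents)
    mask_a[n_a:n_a + n_b] = 0.0   # model A cannot be written by B-exclusive latents
    mask_b[:n_a] = 0.0            # model B cannot be written by A-exclusive latents
    return mask_a, mask_b

def clamp_decoders(dec_a, dec_b, mask_a, mask_b):
    """Zero the forbidden decoder rows; dec_*: (n_latents, d_model)."""
    return dec_a * mask_a[:, None], dec_b * mask_b[:, None]
```

In a training loop, such a clamp would typically be reapplied after every optimizer step so that the forbidden rows stay exactly zero.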
2.3 Delta-Crosscoder and Narrow-Diff Regimes
For detecting subtle or extremely sparse representation changes (e.g., backdoors, fine-grained misalignment), the Delta-Crosscoder incorporates:
- Dual-K sparsity (shared and $\Delta$ code blocks),
- A dedicated $\Delta$-loss on the difference of code activations between models,
- An implicit contrastive loss on code similarity, yielding reliable isolation of causally relevant directions that govern the fine-tuned behavior (Kassem et al., 16 Feb 2026).
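The dual-K idea in the first item can be read as giving each code block its own sparsity budget. The sketch below is one interpretation under stated assumptions: the shared block occupies the first `n_shared` latents, the remaining latents form the difference block, and each block gets an independent per-sample TopK. The difference loss and contrastive term are not sketched because the source does not specify their form.

```python
import numpy as np

def topk_rows(pre, k):
    """Apply ReLU, then keep the k largest entries in each row and zero the rest."""
    acts = np.maximum(pre, 0.0)
    idx = np.argpartition(acts, -k, axis=1)[:, -k:]
    out = np.zeros_like(acts)
    np.put_along_axis(out, idx, np.take_along_axis(acts, idx, axis=1), axis=1)
    return out

def dual_k_code(pre, n_shared, k_shared, k_delta):
    """Sparsify the shared block and the difference block with separate budgets.

    pre: (batch, n_latents) encoder pre-activations; the first n_shared
    columns are treated as the shared block (assumed layout).
    """
    shared = topk_rows(pre[:, :n_shared], k_shared)
    delta = topk_rows(pre[:, n_shared:], k_delta)
    return np.concatenate([shared, delta], axis=1)
```

A separate, small `k_delta` keeps the difference block from being crowded out by the far more numerous shared features, which is the failure mode a single global TopK would risk in narrow-diff settings.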
3. Empirical Findings in Crosscoder-based Comparisons
Studies applying crosscoder-based feature comparison have produced the following results:
| Comparison Context | Key Empirical Finding | Reference |
|---|---|---|
| Vision-Text-MM Models | Last-layer representations show high wMPPC both within and across modalities; shared high-level concepts concentrate in final layers. | (Cornet et al., 24 Jul 2025) |
| LLM Pretraining | Crosscoders track feature emergence and maintenance; RelIE reliably attributes causal importance; early-to-late checkpoints show distinct feature clusters. | (Bayazit et al., 5 Sep 2025) |
| Model Distillation | Unique reasoning features emerge in distilled models (e.g., "self-reflection" direction); geometry shifts correlate with performance improvement. | (Baek et al., 5 Mar 2025) |
| RL Fine-tuning | DFCs localize tool-use capability to a minimal set of features; single-neuron steering achieves large behavioral shifts; capability "spillover" is observed. | (Shportko et al., 25 Jun 2026) |
| Compression (VLMs) | Pruning rotates/attenuates features (high FSR, low alignment), quantization preserves alignment in surviving features; safety-critical circuits are affected. | (Elluru et al., 26 Mar 2026) |
| MoE vs Dense Models | MoE models develop fewer, higher-activation-density exclusive features; dense models distribute concepts across more sparse latents. | (Chaudhari et al., 6 Mar 2026) |
These findings indicate that crosscoder analysis is sensitive to both shared semantic structure and model-specific or stage-specific innovations, across a broad spectrum of architectures and interventions.
4. Practical Applications and Interpretability
Crosscoder-based feature comparison provides concrete tools for:
- Model Interpretability: Enabling concept-level analyses of which semantic features are shared or model-specific, including identification of safety-relevant or refusal-related features, reasoning directions, and even narrow behaviors such as tool-call triggers (Cornet et al., 24 Jul 2025, Jiralerspong et al., 12 Feb 2026, Zhang et al., 3 Apr 2026, Shportko et al., 25 Jun 2026).
- Model Diffing and Safety Auditing: Isolating precise directions responsible for behavioral shifts induced by fine-tuning, distillation, compression, or architecture changes, with direct connections to capabilities and safety (e.g., refusal mechanisms, policy alignment) (Elluru et al., 26 Mar 2026, Jiralerspong et al., 12 Feb 2026, Shportko et al., 25 Jun 2026).
- Causal Attribution and Feature Steering: Crosscoders support targeted ablation, steering, and runtime behavioral control by modulating activation along identified feature directions, with effects validated by reconstruction gains, capability spillover, or performance changes under feature intervention (Baek et al., 5 Mar 2025, Shportko et al., 25 Jun 2026, Kassem et al., 16 Feb 2026).
- Transfer Learning and Dataset Curation: Both wMPPC and comparative sharedness metrics are used to select robust, transferable features or audit cross-domain alignment (e.g., image-caption consistency) (Cornet et al., 24 Jul 2025).
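The ablation and steering interventions mentioned in the list above reduce, in their simplest form, to linear operations along a feature's decoder direction. The following is a minimal sketch, assuming activations of shape `(batch, d_model)` and a single direction vector; the function names and the unit-normalization are choices made for this example.

```python
import numpy as np

def steer(acts, direction, alpha):
    """Add alpha times the unit-normalized feature direction to every activation."""
    unit = direction / np.linalg.norm(direction)
    return acts + alpha * unit

def ablate(acts, direction):
    """Remove the component of every activation along the feature direction."""
    unit = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ unit, unit)
```

In practice the modified activations would be written back into the model's forward pass (for example through a hook) and the behavioral change measured downstream; that plumbing is model-specific and omitted here.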
5. Limitations, Challenges, and Best Practices
Several challenges are highlighted in the literature:
- Sparsity Artifacts: Standard L1 crosscoder objectives are vulnerable to "Complete Shrinkage" and "Latent Decoupling," which can misclassify shared features as specific to one model. Remedies include the BatchTopK constraint and Latent Scaling diagnostics, which provide more faithful partitions (Minder et al., 3 Apr 2025).
- Fine-tuning Specificity: In narrow-diff settings, the standard joint-reconstruction approach underfits rare, fine-tuned feature directions. Dedicated $\Delta$-blocks and a contrastive loss, as in Delta-Crosscoder, are required for recovery (Kassem et al., 16 Feb 2026).
- Partitioning Sensitivity: The precision and recall of exclusive-feature discovery depend on DFC partition sizes and hyperparameters. Multi-run consensus and partitioning priors are suggested future directions (Jiralerspong et al., 12 Feb 2026).
- Benchmark and Annotation Bias: The interpretability and benchmarking of discovered features may depend on external automated LLM judges, and current causal validation remains semi-manual. Robust downstream annotations and large-scale pattern mining may address some limitations.
Best practice recommendations include:
- Use BatchTopK instead of L1 sparsity for cross-model feature attribution.
- Validate model-specific directions by intervention, not just decoder norms.
- Employ DFCs for clear model-exclusive feature demarcation, especially in safety contexts.
- Use Delta-Crosscoders for tasks involving extremely subtle or sparse fine-tuning changes.
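For the first recommendation, it may help to see how BatchTopK differs from per-sample TopK: the sparsity budget is pooled over the batch, so individual samples can use more or fewer latents. The sketch below is a simplified NumPy version under that reading; `batch_topk` is a name chosen for this example, and ties at the threshold are kept rather than broken.

```python
import numpy as np

def batch_topk(pre, k):
    """Keep the k * batch_size largest activations across the whole batch.

    pre: (batch, n_latents) encoder pre-activations. A ReLU is applied first.
    """
    acts = np.maximum(pre, 0.0)
    n_keep = k * acts.shape[0]
    flat = acts.ravel()
    if n_keep >= flat.size:
        return acts
    # Value of the n_keep-th largest activation in the batch.
    threshold = np.partition(flat, -n_keep)[-n_keep]
    return np.where(acts >= threshold, acts, 0.0)

pre = np.array([[5.0, 4.0, 3.0],
                [0.1, 0.2, 0.3]])
print(batch_topk(pre, k=1))
# [[5. 4. 0.]
#  [0. 0. 0.]]
```

With `k=1` and two samples the budget is two activations, and both go to the first sample, whereas a per-sample TopK would have kept one activation in each row.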
6. Extensions and Future Directions
Potential avenues of ongoing research and extensions include:
- Scaling to Multiple Models/Modalities: Extending crosscoder-based comparison to entire families or groups, including continuous pretraining trajectories and multimodal encoders (Bayazit et al., 5 Sep 2025, Cornet et al., 24 Jul 2025).
- Automated Annotation and Cluster Discovery: Pattern-mining and unsupervised clustering of latent feature activation to automate semantic labeling at scale (Bayazit et al., 5 Sep 2025).
- Circuit-Level Analyses: Aligning groups of features ("circuits") and tracking their evolution or rewiring under interventions (Elluru et al., 26 Mar 2026, Bayazit et al., 5 Sep 2025).
- Interactive Auditing Pipelines: Developing interfaces that surface candidate exclusive features and present iterative behavioral or attributional analysis to human auditors (Jiralerspong et al., 12 Feb 2026).
Ultimately, crosscoder-based feature comparison constitutes a central pillar in mechanistic interpretability, providing fine-to-coarse tools for comparing, attributing, and intervening on concept representations across models, modalities, and developmental trajectories.