
Crosscoder-Based Feature Comparisons

Updated 3 July 2026
  • Crosscoder-based feature comparisons are methodologies that align neural representations across models using shared sparse autoencoder dictionaries.
  • They quantify shared and specialized features through metrics like weighted max pairwise Pearson correlation and comparative sharedness.
  • These techniques provide actionable insights for model interpretability, safety auditing, and causal attribution by isolating behaviorally relevant features.

Crosscoder-based feature comparisons are a class of methodologies for quantitatively and causally analyzing how neural representations (features, directions, or concepts) are shared, transferred, or specialized between neural network models or checkpoints. They are increasingly used as a unified, interpretable basis for comparing encoders across modalities, model sizes, architectures, fine-tuning regimes, and individual training stages. Unlike simple activation statistics or representation-similarity metrics, crosscoder-based comparisons are grounded in shared sparse autoencoder dictionaries, which yield directly aligned, interpretable feature spaces and fine-grained attribution of conceptual overlap or divergence across models.

1. Mathematical Framework and Metric Definitions

Central to crosscoder-based comparison is the learning of a shared (or partitioned) dictionary of sparse latent features that serves to align activation spaces across two or more models. Let $A$ and $B$ represent model activations (e.g., hidden states at a particular layer). The standard procedure involves training sparse autoencoders, typically with a TopK or hard $\ell_0$ constraint, such that both models' activations are reconstructed from the same code, or from codes sharing a portion of dictionary indices.

Key metrics for feature comparison include:

  • Weighted Max Pairwise Pearson Correlation (wMPPC): For each feature $f_i^A$ (from a sparse autoencoder on model $A$), the maximum Pearson correlation with any feature $f_j^B$ of model $B$, denoted $\rho_i^{A\to B}$, is computed and weighted by the feature's total activation $S_i^A$. The aggregate wMPPC is given by

$$\text{wMPPC}^{A\to B} = \frac{ \sum_{i=1}^M S_i^A \cdot \rho_i^{A\to B} }{ \sum_{i=1}^M S_i^A }$$

This quantifies feature-level representational alignment between entire models (Cornet et al., 24 Jul 2025).
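The wMPPC formula above can be sketched in NumPy as follows. This is a minimal illustration rather than the cited authors' implementation: it assumes feature activations are stored as `(n_samples, n_features)` arrays collected on the same inputs, and it takes the total activation $S_i^A$ to be the sum of feature $i$'s activations over those samples. The function names are hypothetical.

```python
import numpy as np

def max_pairwise_pearson(acts_a, acts_b, eps=1e-8):
    """For each feature of A, the max Pearson correlation with any feature of B.

    acts_a: (n_samples, M_a) feature activations of model A
    acts_b: (n_samples, M_b) feature activations of model B (same inputs)
    """
    za = acts_a - acts_a.mean(axis=0)
    zb = acts_b - acts_b.mean(axis=0)
    # Normalize each centered column to unit length so dot products are correlations.
    za = za / (np.linalg.norm(za, axis=0) + eps)
    zb = zb / (np.linalg.norm(zb, axis=0) + eps)
    corr = za.T @ zb                 # (M_a, M_b) correlation matrix
    return corr.max(axis=1)          # rho_i^{A->B}

def wmppc(acts_a, acts_b):
    """Activation-weighted mean of the per-feature max correlations."""
    rho = max_pairwise_pearson(acts_a, acts_b)
    weights = acts_a.sum(axis=0)     # assumed definition of S_i^A
    return float((weights * rho).sum() / weights.sum())
```

Note that the measure is directional: `wmppc(acts_a, acts_b)` and `wmppc(acts_b, acts_a)` generally differ, because the maximum and the weights are both taken from the source model's side.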

  • Comparative Sharedness ($\Delta_i^{M \to A,B}$): For a source feature $i$ in model $M$, comparative sharedness quantifies how much more strongly that feature aligns with group $A$ versus group $B$:

[equation missing from the source]

This construct identifies features that are shared with one model or family but not another (Cornet et al., 24 Jul 2025).

  • Relative Decoder Norms (Model-Specificity): For the two decoder vectors associated with a feature $i$, one per model, their relative norm

[equation missing from the source]

serves as a post-hoc measure of model-specificity or exclusivity (Jiralerspong et al., 12 Feb 2026).
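As an illustration of the idea, the sketch below uses one common normalization of the relative decoder norm, $\|d_i^B\| / (\|d_i^A\| + \|d_i^B\|)$. This specific form and the symbols $d_i^A$, $d_i^B$ are assumptions made here for illustration; the exact definition in the cited work may differ.

```python
import numpy as np

def relative_decoder_norm(dec_a, dec_b, eps=1e-12):
    """Per-feature relative decoder norm (assumed form: |d_B| / (|d_A| + |d_B|)).

    dec_a, dec_b: (n_features, d_model) arrays, one decoder row per feature.
    Values near 0 suggest an A-specific feature, near 1 a B-specific feature,
    and near 0.5 a feature written with similar strength into both models.
    """
    norm_a = np.linalg.norm(dec_a, axis=1)
    norm_b = np.linalg.norm(dec_b, axis=1)
    return norm_b / (norm_a + norm_b + eps)
```

Because this is a post-hoc statistic, a value near 0 or 1 is only suggestive; it does not by itself show that the feature is causally exclusive to one model.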

  • Relative Indirect Effect (RelIE): When studying evolution across training checkpoints, causal importance of features is quantified via RelIE, which normalizes ablation-induced performance drops across snapshots:

[equation missing from the source]

Features with RelIE near one are specific to a checkpoint; intermediate values denote shared features (Bayazit et al., 5 Sep 2025).
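A rough sketch of how such a normalized ablation score could be computed is given below. The ratio form used here, $|\text{IE}_{\text{late}}| / (|\text{IE}_{\text{early}}| + |\text{IE}_{\text{late}}|)$, is an assumption chosen to match the verbal description (values near the extremes are checkpoint-specific, intermediate values are shared); it is not taken from the cited paper, and all names are hypothetical.

```python
import numpy as np

def indirect_effect(metric_clean, metric_ablated):
    """Per-feature drop in a task metric when that feature is ablated."""
    return np.asarray(metric_clean, dtype=float) - np.asarray(metric_ablated, dtype=float)

def relative_indirect_effect(ie_early, ie_late, eps=1e-12):
    """Assumed normalization: |IE_late| / (|IE_early| + |IE_late|).

    Under this convention, values near 1 mark features that matter only at the
    late checkpoint, values near 0 only at the early one, and values near 0.5
    mark features that matter at both.
    """
    early = np.abs(np.asarray(ie_early, dtype=float))
    late = np.abs(np.asarray(ie_late, dtype=float))
    return late / (early + late + eps)
```

For example, a feature whose ablation costs 0.2 accuracy at both checkpoints scores about 0.5, while one that only matters at the later checkpoint scores close to 1.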

2. Crosscoder Architectures and Training Protocols

2.1 Sparse Autoencoder (SAE) and TopK Crosscoders

The principal architecture comprises an encoder and a decoder for each model, with the latent code typically constrained by sparsity. A reconstruction is produced for each model from that code. Losses include mean-squared reconstruction error and per-feature sparsity regularization, enforced via L1 penalties or hard BatchTopK gating. The BatchTopK variant prevents representation drift and yields more robust inference of model-specific features (Minder et al., 3 Apr 2025, Chaudhari et al., 6 Mar 2026).
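The forward pass described here can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not a reference implementation: encoder contributions from each model are summed into one shared pre-activation, a ReLU plus per-sample TopK enforces sparsity, and each model has its own decoder. The dictionary layout, bias handling, and function names are hypothetical.

```python
import numpy as np

def topk(x, k):
    """Keep the k largest entries in each row of x and zero the rest."""
    idx = np.argpartition(x, -k, axis=1)[:, -k:]
    out = np.zeros_like(x)
    np.put_along_axis(out, idx, np.take_along_axis(x, idx, axis=1), axis=1)
    return out

def crosscoder_forward(acts, enc, dec, b_enc, b_dec, k):
    """One forward pass of a TopK crosscoder over several models.

    acts[m]: (batch, d_m) activations of model m
    enc[m]:  (d_m, n_latents) encoder weights for model m
    dec[m]:  (n_latents, d_m) decoder weights for model m
    """
    # Shared code: sum the per-model encoder outputs, then sparsify.
    pre = sum(acts[m] @ enc[m] for m in acts) + b_enc
    code = topk(np.maximum(pre, 0.0), k)
    # Each model is reconstructed from the same code by its own decoder.
    recon = {m: code @ dec[m] + b_dec[m] for m in acts}
    # Mean over the batch of the summed squared error, added across models.
    loss = sum(np.mean(np.sum((recon[m] - acts[m]) ** 2, axis=1)) for m in acts)
    return code, recon, loss
```

Since both reconstructions are driven by the same code, a latent that fires on an input is forced to explain that input in every model, which is what aligns the feature spaces.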

2.2 Dedicated Feature Crosscoder (DFC)

The DFC extends standard crosscoders by partitioning the dictionary into three non-overlapping index sets: $A$-exclusive, $B$-exclusive, and shared. Structural constraints, enforced via decoder clamping, yield exact model exclusivity without the approximation error of post-hoc norm analysis. This architecture is essential for robust model diffing and has demonstrated sharper separation of model-unique features, particularly for safety-relevant or policy-alignment features (Jiralerspong et al., 12 Feb 2026, Shportko et al., 25 Jun 2026).
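One way to picture the partition is as a pair of binary masks over decoder rows. The sketch below interprets "decoder clamping" as zeroing the decoder rows that an exclusive feature is not allowed to use; that reading, the index layout `[A-exclusive | B-exclusive | shared]`, and the function names are assumptions made for illustration.

```python
import numpy as np

def dfc_masks(n_a_only, n_b_only, n_shared):
    """Build per-model masks over latents laid out as [A-only | B-only | shared]."""
    n_latents = n_a_only + n_b_only + n_shared
    mask_a = np.ones(n_latents)
    mask_b = np.ones(n_latents)
    mask_a[n_a_only:n_a_only + n_b_only] = 0.0   # B-exclusive latents cannot write to A
    mask_b[:n_a_only] = 0.0                      # A-exclusive latents cannot write to B
    return mask_a, mask_b

def clamp_decoders(dec_a, dec_b, mask_a, mask_b):
    """Zero the forbidden decoder rows; intended to be re-applied after each update."""
    return dec_a * mask_a[:, None], dec_b * mask_b[:, None]
```

With the forbidden rows held at exactly zero, exclusivity is a structural property of the dictionary rather than something inferred afterwards from decoder norms.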

2.3 Delta-Crosscoder and Narrow-Diff Regimes

For detecting subtle or extremely sparse representation changes (e.g., backdoors, fine-grained misalignment), the Delta-Crosscoder incorporates:

  • Dual-K sparsity (separate sparsity budgets for the shared and $\Delta$ code blocks),
  • A dedicated loss on the difference of code activations between models,
  • An implicit contrastive loss on code similarity, yielding reliable isolation of causally relevant directions that govern the fine-tuned behavior (Kassem et al., 16 Feb 2026).

3. Empirical Findings in Crosscoder-based Comparisons

Studies applying crosscoder-based feature comparison have produced the following results:

| Comparison Context | Key Empirical Finding | Reference |
|---|---|---|
| Vision-Text-MM Models | Last-layer representations show high wMPPC both within and across modalities; shared high-level concepts concentrate in final layers. | (Cornet et al., 24 Jul 2025) |
| LLM Pretraining | Crosscoders track feature emergence and maintenance; RelIE reliably attributes causal importance; early-to-late checkpoints show distinct feature clusters. | (Bayazit et al., 5 Sep 2025) |
| Model Distillation | Unique reasoning features emerge in distilled models (e.g., "self-reflection" direction); geometry shifts correlate with performance improvement. | (Baek et al., 5 Mar 2025) |
| RL Fine-tuning | DFCs localize tool-use capability to a minimal set of features; single-neuron steering achieves large behavioral shifts; capability "spillover" is observed. | (Shportko et al., 25 Jun 2026) |
| Compression (VLMs) | Pruning rotates/attenuates features (high FSR, low alignment), quantization preserves alignment in surviving features; safety-critical circuits are affected. | (Elluru et al., 26 Mar 2026) |
| MoE vs Dense Models | MoE models develop fewer, higher-activation-density exclusive features; dense models distribute concepts across more sparse latents. | (Chaudhari et al., 6 Mar 2026) |

These findings indicate that crosscoder analysis is sensitive to both shared semantic structure and model-specific or stage-specific innovations, across a broad spectrum of architectures and interventions.

4. Practical Applications and Interpretability

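Crosscoder-based feature comparison provides concrete tools for model interpretability, safety auditing, and causal attribution, in each case by isolating behaviorally relevant features.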

5. Limitations, Challenges, and Best Practices

Several challenges are highlighted in the literature:

  • Sparsity Artifacts: Standard L1 crosscoder objectives are vulnerable to "Complete Shrinkage" and "Latent Decoupling," which can misclassify shared features as specific to one model. Remedies include the BatchTopK constraint and Latent Scaling diagnostics, which provide more faithful partitions (Minder et al., 3 Apr 2025).
  • Fine-tuning Specificity: In narrow-diff settings, the standard joint-reconstruction approach underfits rare, fine-tuned feature directions. Dedicated $\Delta$-blocks and contrastive loss, as in Delta-Crosscoder, are required for recovery (Kassem et al., 16 Feb 2026).
  • Partitioning Sensitivity: The precision and recall of exclusive-feature discovery depend on DFC partition sizes and hyperparameters. Multi-run consensus and partitioning priors are suggested future directions (Jiralerspong et al., 12 Feb 2026).
  • Benchmark and Annotation Bias: The interpretability and benchmarking of discovered features may depend on external automated LLM judges, and current causal validation remains semi-manual. Robust downstream annotations and large-scale pattern mining may address some limitations.

Best practice recommendations include:

  • Use BatchTopK instead of L1 sparsity for cross-model feature attribution.
  • Validate model-specific directions by intervention, not just decoder norms.
  • Employ DFCs for clear model-exclusive feature demarcation, especially in safety contexts.
  • Use Delta-Crosscoders for tasks involving extremely subtle or sparse fine-tuning changes.
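To make the first recommendation concrete, the sketch below shows the BatchTopK gating idea: rather than keeping exactly `k` latents per sample, it keeps the `k * batch_size` largest activations across the whole batch, so individual samples may use more or fewer latents. This is a simplified illustration with a hypothetical function name; a practical version also needs a fixed threshold for inference, which is omitted here.

```python
import numpy as np

def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest activations across the whole batch.

    pre_acts: (batch, n_latents) encoder pre-activations.
    Returns an array of the same shape with all other entries set to zero.
    """
    acts = np.maximum(pre_acts, 0.0)             # ReLU before gating
    n_keep = k * acts.shape[0]
    flat = acts.ravel()
    if n_keep >= flat.size:
        return acts
    # Value of the n_keep-th largest activation in the batch.
    threshold = np.partition(flat, -n_keep)[-n_keep]
    return np.where(acts >= threshold, acts, 0.0)
```

Unlike an L1 penalty, this gate does not shrink the magnitude of the activations it keeps, which is why it is less prone to the shrinkage artifacts described above.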

6. Extensions and Future Directions

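Potential avenues of ongoing research and extensions include multi-run consensus and partitioning priors for exclusive-feature discovery, as well as more robust downstream annotation and large-scale pattern mining for validating discovered features.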

Ultimately, crosscoder-based feature comparison constitutes a central pillar in mechanistic interpretability, providing fine-to-coarse tools for comparing, attributing, and intervening on concept representations across models, modalities, and developmental trajectories.
