Cross-Snapshot Crosscoders
- The paper presents a novel sparse dictionary learning framework that discovers and aligns interpretable features across distinct model snapshots and transformations.
- It extends sparse autoencoders with a joint feature code and per-source decoders to enable precise tracking of feature evolution and symmetry in neural networks.
- Applications include mechanistic interpretability, model diffing, and causal feature attribution in language and vision models.
Cross-Snapshot Crosscoders are a family of sparse dictionary learning methods designed to systematically discover, track, and analyse interpretable features across distinct model snapshots, input transformations, or model variants. These techniques address the challenge of aligning concept representations that emerge, shift, or disappear between layers, training checkpoints, or even models differing by fine-tuning, symmetry, or domain shift. Originally motivated by mechanistic interpretability and model diffing in neural networks and LLMs, cross-snapshot crosscoders now provide an essential toolkit for causal, fine-grained analysis of feature formation, evolution, and symmetry.
1. Foundations of Cross-Snapshot Crosscoders
Crosscoders extend sparse autoencoders (SAEs) by learning a joint interpretable feature space aligned across two or more sources. Given a collection of model activations (snapshots)—which may be different layers, models, transformations, or checkpoints—the crosscoder trains a shared sparse code with separate encoder and decoder mappings for each source. The loss function incentivizes sparse, interpretable features that reconstruct the activations well across sources, allowing direct tracking and comparison of latent directions (“features”) as they evolve or transform.
Formally, for snapshots indexed by $s \in \{1, \dots, S\}$ (e.g., layers, training epochs, group-transformed inputs), activations $a^{(s)}(x)$ from each source are encoded into a shared sparse code

$$f(x) = \mathrm{ReLU}\Big(\sum_{s} W_{\mathrm{enc}}^{(s)}\, a^{(s)}(x) + b_{\mathrm{enc}}\Big),$$

with per-source decoders $\hat{a}^{(s)}(x) = W_{\mathrm{dec}}^{(s)} f(x) + b_{\mathrm{dec}}^{(s)}$. The loss aggregates reconstruction error and sparsity penalties across all sources:

$$\mathcal{L} = \sum_{s} \big\lVert a^{(s)}(x) - \hat{a}^{(s)}(x) \big\rVert^{2} + \lambda \sum_{j} f_j(x) \sum_{s} \big\lVert W_{\mathrm{dec},\,j}^{(s)} \big\rVert.$$
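The shared-code, per-source-decoder structure can be sketched in a few lines of numpy. This is a minimal illustration, not any paper's reference implementation; the sizes, weight initialisations, and the decoder-norm-weighted L1 penalty are assumptions chosen to mirror the loss described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 2 sources (snapshots), activation dim 8, dictionary size 16.
n_sources, d_act, d_dict = 2, 8, 16

# Each source gets its own encoder and decoder weights; the sparse code is shared.
W_enc = rng.normal(0, 0.1, (n_sources, d_act, d_dict))
W_dec = rng.normal(0, 0.1, (n_sources, d_dict, d_act))
b_enc = np.zeros(d_dict)
b_dec = np.zeros((n_sources, d_act))

def crosscoder_loss(acts, l1_coeff=1e-3):
    """acts: (n_sources, batch, d_act) activations from each snapshot."""
    # Joint sparse code: sum per-source encoder outputs, then ReLU.
    pre = sum(acts[s] @ W_enc[s] for s in range(n_sources)) + b_enc
    f = np.maximum(pre, 0.0)                        # shared sparse features
    recon_err = 0.0
    for s in range(n_sources):
        recon = f @ W_dec[s] + b_dec[s]             # per-source reconstruction
        recon_err += np.mean(np.sum((acts[s] - recon) ** 2, axis=-1))
    # Sparsity penalty weighted by summed decoder norms across sources.
    dec_norms = np.linalg.norm(W_dec, axis=-1).sum(axis=0)   # (d_dict,)
    sparsity = np.mean(f @ dec_norms)
    return recon_err + l1_coeff * sparsity, f

acts = rng.normal(size=(n_sources, 4, d_act))
loss, f = crosscoder_loss(acts)
```

Weighting the L1 term by decoder norms ties a feature's sparsity cost to how strongly it is expressed in each source, which is what later makes per-source decoder norms usable as a feature-strength proxy.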
2. Group Crosscoders: Dictionary Learning Over Symmetry Groups
Group crosscoders introduce group actions as the fundamental axis for cross-snapshot analysis (Gorton, 31 Oct 2024). Rather than training across model layers or checkpoints, group crosscoders are trained across responses to inputs transformed by the elements of a symmetry group $G$. For instance, in image models, activations are collected for every group-transformed input $g \cdot x$, $g \in G$; the group crosscoder must reconstruct the entire orbit of activations from the untransformed input, making symmetry a core property of the learned representation.
This leads to a block-structured dictionary, where each block corresponds to a group element (e.g., rotation, reflection). Feature clustering under a metric respecting group symmetry reveals interpretable “feature families” (e.g., curve and line detectors for InceptionV1's mixed3b layer under a dihedral group), significantly sharpening the separation compared to SAE clustering. Symmetry properties (invariance/equivariance) become directly testable by inspecting blockwise similarities and permutations induced by group actions.
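The orbit that a group crosscoder must reconstruct is easy to make concrete for the dihedral group of the square (four rotations, each optionally composed with a flip), which acts directly on image inputs. A minimal sketch, assuming square image arrays:

```python
import numpy as np

def d4_orbit(img):
    """Return the orbit of a square image under the dihedral group of order 8:
    four 90-degree rotations, each with and without a horizontal flip."""
    orbit = []
    for k in range(4):
        r = np.rot90(img, k)
        orbit.append(r)
        orbit.append(np.fliplr(r))
    return orbit

img = np.arange(9).reshape(3, 3)   # an asymmetric test image
orbit = d4_orbit(img)              # eight distinct transformed inputs
```

A group crosscoder collects model activations for every element of such an orbit and learns one dictionary whose blocks are indexed by the group elements, so invariance or equivariance of a feature shows up as structure across those blocks.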
| Method | Trained On | Discovers |
|---|---|---|
| Standard SAE | Activations, single source | Disentangled features |
| Standard Crosscoder | Multiple sources (layer/model) | Analogous cross-source features |
| Group Crosscoder | Group transformations of input | Full symmetry families, equivariance/invariance |
3. Feature Tracking Across Pretraining Snapshots
In LLMs, crosscoders enable detailed tracking of feature emergence, persistence, and retirement throughout pretraining (Ge et al., 21 Sep 2025, Bayazit et al., 5 Sep 2025). By training on a set of model checkpoints, a unified sparse feature code is established so that each feature in the shared dictionary has interpretable decoder projections into all snapshots.
Decoder norm trajectories and similarity matrices provide quantitative proxies for feature “strength” and directional stability. Analysis reveals key phenomena: features are typically born abruptly during a “turning point” in training (e.g., after statistical fitting of unigrams/bigrams), with more complex features appearing later; feature directions reorient dramatically at the onset of feature learning. Empirically, earlier features tend to be simple lexical detectors, whereas those forming after the turning point are compositional or context-sensitive.
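The two proxies mentioned above fall out of the decoder tensor directly. The sketch below assumes a hypothetical multi-checkpoint decoder of shape `(n_checkpoints, d_dict, d_act)` and an illustrative norm threshold for calling a feature “born”; neither the threshold nor the tensor layout is prescribed by the cited papers.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ckpt, d_dict, d_act = 5, 6, 8
# Hypothetical per-checkpoint decoder weights from a multi-snapshot crosscoder.
W_dec = rng.normal(size=(n_ckpt, d_dict, d_act))

# Feature "strength" trajectory: decoder norm per feature per checkpoint.
norms = np.linalg.norm(W_dec, axis=-1)               # (n_ckpt, d_dict)

# Directional stability: cosine similarity of each feature's decoder
# direction between consecutive checkpoints.
unit = W_dec / norms[..., None]
cos_sim = np.einsum('tjd,tjd->tj', unit[:-1], unit[1:])   # (n_ckpt-1, d_dict)

# A crude "birth" annotation: first checkpoint where the norm crosses
# an illustrative threshold (assumption, not a published criterion).
birth = (norms > 1.0).argmax(axis=0)                 # (d_dict,)
```

Abrupt feature birth then appears as a step in a row of `norms`, and the reorientation at the onset of feature learning appears as a dip in the corresponding entries of `cos_sim`.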
4. Model Diffing and Chat-Specific Feature Identification
Crosscoders are also central to model diffing—interpreting representational shifts during fine-tuning, e.g., when a base model becomes chat-tuned (Minder et al., 3 Apr 2025). Here, crosscoders learn a shared dictionary for the base and fine-tuned model. However, standard L1 sparsity induces artifacts: (1) complete shrinkage, whereby base decoder norms are zeroed and chat-latents are spuriously labelled “chat-only,” and (2) latent decoupling, hiding shared concepts under non-overlapping latents.
To address this, the BatchTopK crosscoder method activates only the top latents per batch, reducing spurious attribution and yielding genuinely chat-specific features (e.g., refusal, false information detection, personal questions). Latent scaling analyses further disentangle true model-specific features from artifacts. At a technical level, robust causal testing via KL divergence and activation patching validates the functional impact of discovered features.
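The core BatchTopK operation selects the largest pre-activations across the whole batch rather than applying a per-example L1 penalty. A minimal sketch of that selection step (not the paper's implementation; ties at the threshold may keep slightly more than `k` entries):

```python
import numpy as np

def batch_topk(pre_acts, k):
    """Keep the k largest pre-activation values across the entire batch,
    zeroing everything else, then apply ReLU."""
    flat = pre_acts.ravel()
    if k >= flat.size:
        return np.maximum(pre_acts, 0.0)
    thresh = np.partition(flat, -k)[-k]          # k-th largest value overall
    kept = np.where(flat >= thresh, flat, 0.0)
    return np.maximum(kept.reshape(pre_acts.shape), 0.0)

pre = np.array([[0.2, -1.0, 3.0],
                [1.5,  0.1, 0.4]])
out = batch_topk(pre, k=2)   # only the two largest values (3.0 and 1.5) survive
```

Because the budget of `k` active latents is global rather than per-latent, no latent is pressured to shrink its decoder toward zero in one model, which is precisely the complete-shrinkage artifact the L1 penalty induces.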
5. Metrics and Causal Attribution of Feature Importance
Novel metrics such as the Relative Indirect Effect (RelIE) provide causal, checkpoint-specific quantification of feature importance (Bayazit et al., 5 Sep 2025). For each feature, RelIE measures its proportional contribution to downstream task metrics via integrated gradients or ablation. For checkpoint $t$, RelIE normalises the feature's indirect effect by its total effect across all checkpoints,

$$\mathrm{RelIE}^{(t)} = \frac{\mathrm{IE}^{(t)}}{\sum_{t'} \mathrm{IE}^{(t')}},$$

enabling explicit annotation of whether, for example, a grammatical feature is born at a specific stage, persists, or vanishes. Table-based annotations directly map features to stages of emergence and discontinuation with genuine causal grounding.
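As a proportional contribution, the normalisation step is simple to compute once per-checkpoint indirect effects are in hand. A sketch under that assumption (the example effect values are hypothetical):

```python
import numpy as np

def relie(indirect_effects):
    """Relative Indirect Effect per checkpoint: each checkpoint's (absolute)
    indirect effect as a fraction of the feature's total effect across
    checkpoints. Sketch of the normalisation described in the text."""
    ie = np.abs(np.asarray(indirect_effects, dtype=float))
    total = ie.sum()
    return ie / total if total > 0 else ie

# Hypothetical per-checkpoint indirect effects for one feature:
effects = [0.0, 0.1, 0.7, 0.2]
scores = relie(effects)   # peaks at the checkpoint where the feature matters most
```

A feature whose RelIE mass concentrates at late checkpoints was born late; one whose mass drops to zero after some checkpoint has been retired.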
6. Broader Mechanistic Interpretability and Applications
Cross-snapshot crosscoders generalize across domains and axes—layers, models, group actions, language/model checkpoints. Their capacity for systematic, objective discovery and tracking of feature families, symmetry properties, and evolution makes them suitable for fine-grained mechanistic interpretability, model comparison, and longitudinal analysis. Group crosscoders reveal automatic symmetry clustering in vision models; multi-snapshot crosscoders track the emergence of linguistic abstractions; model diffing crosscoders diagnose and attribute chat-specific behavioral changes at the latent-feature level.
A plausible implication is that these techniques will enable circuit-level interventions, monitoring, and alignment validation during model pretraining, fine-tuning, or structure-aware model design. Their architecture-agnostic framework offers a scalable path to precise mechanistic understanding, rigorous model diffing, and robust feature-level interpretability.
7. Limitations and Practical Considerations
Artifacts may arise under suboptimal sparsity regimes—standard L1 penalty is susceptible to spurious latent attribution and decoupling, particularly when models differ drastically. For model diffing, BatchTopK crosscoder and latent scaling analyses are recommended best practices. Group crosscoders presuppose symmetry groups that are manifest in input space and may not generalize to abstract (non-geometric) transformations without substantial redesign. Mechanistic interpretability remains challenging for features representing distributed or highly nonlinear concepts.
In summary, cross-snapshot crosscoders and their extensions—most notably group crosscoders and BatchTopK diffing—constitute a rigorously formulated, empirically validated methodology for discovering, aligning, and causally interpreting concept representations across model axes, transformations, and training trajectories. They provide unique analytical leverage for mechanistic interpretability, symmetry analysis, and targeted model comparison.