Cross-Layer Transcoder for Transformer Analysis
- Cross-Layer Transcoder (CLT) is a mechanistic interpretability tool that decomposes MLP outputs across transformer layers into sparse, unified features.
- It reconstructs outputs using cross-layer encoder–decoder mappings, facilitating circuit-level analysis in domains like language, vision, and protein modeling.
- Empirical findings demonstrate that CLTs enhance model interpretability with efficient attribution graphs, enabling targeted interventions and scalable analysis.
A Cross-Layer Transcoder (CLT) is a mechanistic interpretability tool for deep neural networks, particularly transformer-based models. Unlike traditional single-layer dictionary learning, a CLT provides a unified, sparse feature decomposition of all MLP (multi-layer perceptron) outputs across layers, so that the transformations applied through the depth of the network can be traced and reconstructed explicitly. CLTs are trained as stand-alone, post-hoc surrogates: the base model is neither finetuned nor augmented, and the CLT instead learns cross-layer encoder–decoder mappings that reconstruct activations from a sparse latent space. This architecture supports circuit-level analysis of model computations and the discovery of semantic, syntactic, decision, and function-specific features across domains including language, vision, and protein sequence modeling (Harrasse et al., 13 Nov 2025, Kim et al., 2 Apr 2026, Draye et al., 22 Mar 2026, Tsui et al., 12 Feb 2026).
1. Formal Architecture of Cross-Layer Transcoders
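A Cross-Layer Transcoder replaces every transformer MLP block with a sum of cross-layer contributions from sparse latent features. Let $L$ denote the number of transformer layers and $d$ the model’s hidden dimension. For each layer $\ell \in \{1, \dots, L\}$, the input $x^{\ell} \in \mathbb{R}^{d}$ to the MLP block is encoded to a high-dimensional sparse feature vector $a^{\ell} \in \mathbb{R}^{F}$:
$$a^{\ell} = \sigma\left(W_{\mathrm{enc}}^{\ell}\, x^{\ell} + b_{\mathrm{enc}}^{\ell}\right)$$
where $W_{\mathrm{enc}}^{\ell} \in \mathbb{R}^{F \times d}$, $b_{\mathrm{enc}}^{\ell} \in \mathbb{R}^{F}$, and $\sigma$ is a sparsifying nonlinearity (such as JumpReLU or TopK).
The MLP output at any downstream layer $\ell$ is reconstructed as:
$$\hat{y}^{\ell} = \sum_{\ell' \le \ell} W_{\mathrm{dec}}^{\ell' \to \ell}\, a^{\ell'} + b_{\mathrm{dec}}^{\ell}$$
with $W_{\mathrm{dec}}^{\ell' \to \ell} \in \mathbb{R}^{d \times F}$ and $b_{\mathrm{dec}}^{\ell} \in \mathbb{R}^{d}$.
The full model uses $O(L^{2} d F)$ parameters. Sparsity is enforced by a nonlinearity (JumpReLU, TopK), an $\ell_0$ or $\ell_1$ regularization term, and explicit dead-unit penalties. The cross-layer decoding architecture enables interpreting the function of latent features as they propagate through (and interact within) the computational graph of the transformer (Harrasse et al., 13 Nov 2025, Draye et al., 22 Mar 2026, Tsui et al., 12 Feb 2026).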
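The encode-then-decode structure described above can be sketched in a few lines of plain Python. This is a toy illustration, not the API of any CLT library: the function names, the threshold value, and the dictionary layout `W_dec[(s, l)]` are assumptions made for the example, and biases are omitted for brevity.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def jump_relu(z, theta=0.5):
    """JumpReLU: keep a pre-activation unchanged only if it exceeds theta."""
    return [v if v > theta else 0.0 for v in z]

def top_k(z, k):
    """TopK: keep the k largest pre-activations and zero out the rest."""
    keep = sorted(range(len(z)), key=lambda i: z[i], reverse=True)[:k]
    return [v if i in keep else 0.0 for i, v in enumerate(z)]

def clt_forward(xs, W_enc, W_dec):
    """Reconstruct every layer's MLP output from sparse cross-layer features.

    xs[l]         : MLP input at layer l (length d)
    W_enc[l]      : F x d encoder matrix for layer l
    W_dec[(s, l)] : d x F decoder from layer-s features to the layer-l output (s <= l)
    """
    n_layers = len(xs)
    feats = [jump_relu(matvec(W_enc[l], xs[l])) for l in range(n_layers)]
    outputs = []
    for l in range(n_layers):
        y_hat = [0.0] * len(xs[l])
        for s in range(l + 1):  # only layers at or before l write to layer l
            contrib = matvec(W_dec[(s, l)], feats[s])
            y_hat = [a + b for a, b in zip(y_hat, contrib)]
        outputs.append(y_hat)
    return feats, outputs

# Toy setup: 2 layers, hidden size d = 2, F = 3 features per layer.
xs = [[1.0, 0.0], [0.0, 2.0]]
W_enc = [
    [[1.0, 0.0], [0.0, 1.0], [0.2, 0.2]],  # layer 0 encoder
    [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1]],  # layer 1 encoder
]
read_first_two = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_dec = {(0, 0): read_first_two, (0, 1): read_first_two, (1, 1): read_first_two}

feats, outputs = clt_forward(xs, W_enc, W_dec)
print(feats)    # sparse features per layer
print(outputs)  # reconstructed MLP outputs per layer
```

In this toy setup the layer-1 reconstruction receives contributions from both the layer-0 and layer-1 features, which is the property that distinguishes a cross-layer transcoder from independent per-layer transcoders.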
2. Training Objectives and Optimization Procedure
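CLTs are trained on a large set of frozen activation pairs $(x^{\ell}, y^{\ell})$ (the MLP input and output at each layer $\ell$) extracted from the base model. The objective is to minimize a composite loss:
$$\mathcal{L} = \sum_{\ell} \left\lVert y^{\ell} - \hat{y}^{\ell} \right\rVert_2^2 + \lambda_{\mathrm{sparse}} \sum_{\ell} \mathcal{L}_{\mathrm{sparse}}\left(a^{\ell}\right) + \lambda_{\mathrm{dead}}\, \mathcal{L}_{\mathrm{dead}}$$
where $\hat{y}^{\ell}$ is the CLT reconstruction of $y^{\ell}$ and $a^{\ell}$ is the feature activation vector at layer $\ell$. The first term is a mean-squared error reconstruction loss, the second enforces sparsity over feature activations, and the third penalizes “dead” (unused) features (Harrasse et al., 13 Nov 2025, Draye et al., 22 Mar 2026).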
Specialized formulations appear in domain-specific settings. For example, in protein LLMs, the ProtoMech framework applies a two-headed loss: a principal mean-squared reconstruction loss and an auxiliary “dead unit” loss over TopK-activated latents, supporting joint encoder–decoder optimization and consistent use of the latent space (Tsui et al., 12 Feb 2026). Typical optimization employs Adam, distributed feature-wise sharding, activation caching with quantization, and batched data streaming for scalability (Draye et al., 22 Mar 2026).
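As a rough illustration of the three-term objective described in this section, the sketch below combines a reconstruction term, an L1 sparsity term, and a dead-feature term. The function names and weights are hypothetical, and the dead-feature term used here (the fraction of features that never fire in the batch) is a deliberately simple stand-in, not the penalty used in the cited papers.

```python
def mse(y, y_hat):
    """Mean squared error between a target vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def composite_loss(targets, recons, feats, lam_sparse=0.01, lam_dead=0.1):
    """Reconstruction + sparsity + dead-feature terms over one batch.

    targets, recons : lists of output vectors, one per sample
    feats           : list of feature-activation vectors, one per sample
    """
    n = len(targets)
    recon = sum(mse(y, y_hat) for y, y_hat in zip(targets, recons)) / n
    sparsity = sum(sum(abs(a) for a in f) for f in feats) / n  # mean L1 norm
    n_feats = len(feats[0])
    # Assumed stand-in: count features that are zero on every sample in the batch.
    dead = sum(1 for j in range(n_feats) if all(f[j] == 0.0 for f in feats))
    return recon + lam_sparse * sparsity + lam_dead * dead / n_feats

targets = [[1.0, 2.0], [0.0, 1.0]]
recons = [[1.0, 1.0], [0.0, 1.0]]
feats = [[2.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
loss = composite_loss(targets, recons, feats)
print(loss)
```

Raising `lam_sparse` pushes the optimizer toward fewer active features at the cost of reconstruction error, which is the sparsity versus fidelity trade-off that recurs throughout CLT design.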
3. Attribution Graphs and Mechanistic Analysis
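Once trained, a CLT supports the construction of attribution graphs: directed graphs where nodes are feature activations at different layers or positions, and edges represent mechanistic influence traced via the CLT’s linear decoders and the frozen transformer Jacobian:
$$A_{s \to t} = a_s \sum_{\ell_s \le \ell < \ell_t} \left(w_{\mathrm{enc}}^{t}\right)^{\top} J^{\ell \to \ell_t}\, w_{\mathrm{dec}}^{s \to \ell}$$
where $s$ and $t$ index decoders and encoders (the source feature's decoder vectors $w_{\mathrm{dec}}^{s \to \ell}$ and the target feature's encoder vector $w_{\mathrm{enc}}^{t}$), $a_s$ is the source feature's activation, $\ell_s$ and $\ell_t$ are the source and target layers, $J^{\ell \to \ell_t}$ is the frozen Jacobian, and the sum proceeds over allowed cross-layer connections (Harrasse et al., 13 Nov 2025, Draye et al., 22 Mar 2026). Pruning to the dominant features and edges produces compact, human-interpretable circuits underlying specific outputs or behaviors.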
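A minimal sketch of how one such edge could be computed and how a graph might then be pruned, assuming the source feature writes through one decoder vector per intermediate layer and that a frozen Jacobian is available for each of those layers. The helper names, the toy Jacobians, and the cumulative-weight pruning rule are illustrative assumptions rather than the procedure of any cited library.

```python
def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [dot(row, v) for row in M]

def attribution_edge(a_src, dec_vecs, jacobians, enc_vec):
    """Source activation times the summed linear paths into the target encoder.

    dec_vecs[i]  : source decoder vector written at the i-th intermediate layer
    jacobians[i] : frozen Jacobian from that layer to the target layer
    enc_vec      : encoder vector of the target feature
    """
    total = 0.0
    for dec_vec, J in zip(dec_vecs, jacobians):
        total += dot(enc_vec, matvec(J, dec_vec))
    return a_src * total

def prune_edges(edges, keep_fraction=0.8):
    """Keep the largest-magnitude edges until keep_fraction of total |weight| is covered."""
    ranked = sorted(edges.items(), key=lambda kv: abs(kv[1]), reverse=True)
    total = sum(abs(w) for _, w in ranked)
    kept, running = {}, 0.0
    for name, w in ranked:
        if running >= keep_fraction * total:
            break
        kept[name] = w
        running += abs(w)
    return kept

identity = [[1.0, 0.0], [0.0, 1.0]]
halved = [[0.5, 0.0], [0.0, 0.5]]
edge = attribution_edge(
    a_src=2.0,
    dec_vecs=[[1.0, 0.0], [0.0, 1.0]],
    jacobians=[identity, halved],
    enc_vec=[1.0, 1.0],
)

edges = {("f1", "f3"): edge, ("f2", "f3"): 0.5, ("f1", "f4"): -1.5}
kept = prune_edges(edges)
print(edge, kept)
```

Because the Jacobians are frozen at the values observed on a specific prompt, each edge is linear in the source activation, which is what makes the resulting graph additive and easy to prune.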
In circuit-guided safety frameworks such as CRaFT, CLT-based attribution graphs enable selecting features by their causal influence (as opposed to merely their activation magnitude) on specific behaviors, e.g., next-token refusal heads. Circuit influence for a feature $f$ is defined as:
$$\mathrm{Inf}(f) = \left[\, p^{\top} \left( (I - A)^{-1} - I \right) \right]_{f}$$
where $A$ is the normalized adjacency matrix of direct-effect weights (entry $A_{t,s}$ is the direct effect of node $s$ on node $t$) and $p$ specifies the distribution over terminal nodes (e.g., refusal logits) (Kim et al., 2 Apr 2026).
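One way to make this notion of influence concrete is to propagate the terminal-node distribution backwards through the normalized adjacency matrix and sum the contributions over all path lengths. The sketch below does this for a three-node toy graph. It assumes the graph is acyclic, so the sum terminates, and the convention that `A[t][s]` holds the effect of node `s` on node `t`. Whether CRaFT aggregates paths in exactly this way is an assumption.

```python
def vecmat(v, M):
    """Row vector times matrix: result[j] = sum_i v[i] * M[i][j]."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def circuit_influence(A, p, max_hops=None):
    """Sum p^T A^k over k >= 1, i.e. influence along paths of every length.

    A[t][s] : normalized direct effect of node s on node t
    p       : weights over terminal nodes (e.g. the logits of interest)
    For an acyclic graph with n nodes, n hops are enough for an exact result.
    """
    n = len(A)
    hops = max_hops or n
    influence = [0.0] * n
    current = list(p)
    for _ in range(hops):
        current = vecmat(current, A)
        influence = [a + b for a, b in zip(influence, current)]
    return influence

# Nodes: 0 = feature f0, 1 = feature f1, 2 = terminal logit.
A = [
    [0.0, 0.0, 0.0],  # f0 has no upstream nodes in this toy graph
    [1.0, 0.0, 0.0],  # f1 is driven entirely by f0
    [0.2, 0.8, 0.0],  # the logit reads 20% from f0 directly and 80% from f1
]
p = [0.0, 0.0, 1.0]
influence = circuit_influence(A, p)
print(influence)
```

In this toy graph `f0` has a small direct edge to the logit but the largest total influence, because it also acts through `f1`. Ranking by direct edge weight or activation alone would miss that.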
4. Empirical Findings and Applications
CLTs have enabled a series of empirical discoveries in both language and protein modeling:
- Multilingual LLMs: CLT studies on multilingual GPT-2–style and LLaMA models reveal a three-phase organization. Early layers show language-specific token reassembly, intermediate layers display high “multilingual entropy” (a U-shaped rise in feature entropy), signaling a shared “pivot” representation, and late layers specialize into language-specific decoding via a small set of high-frequency “language identity” features. Causal manipulation of these features in the CLT surrogates can systematically switch the model’s language output channel (Harrasse et al., 13 Nov 2025).
- Safety and Refusal Circuits: In CRaFT, CLT-based circuits enable identification and ablation of truly causal features for refusal (or noncompliance) behavior in instructions. CLT-based intervention achieves a ∼48.2% attack success rate (vs. baseline ∼6.7%) in jailbreak settings, highlighting the method’s sensitivity to functional causality (Kim et al., 2 Apr 2026).
- Protein LLMs: ProtoMech demonstrates that CLTs can recover up to 89% of original model performance in protein family/function prediction, and compress interpretable circuits to less than 1% of the latent space while retaining 74–79% of performance. Analysis reveals that top-activated latents track known sequence motifs (e.g., enzymatic binding domains, catalytic sites), and steering along these circuits substantially improves design of high-fitness protein sequences (Tsui et al., 12 Feb 2026).
- Model Replacement and Compression: On LLMs such as GPT-2 and LLaMA 1B, CLTs achieve explained variance ≈0.80 at 100 active features per layer, and their compact attribution graphs reduce typical node count 3–4× relative to per-layer (single-layer) transcoders (Draye et al., 22 Mar 2026).
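For readers unfamiliar with the explained-variance figure quoted above, the sketch below shows one common definition: one minus the ratio of residual to total sum of squares, pooled over samples and dimensions. The cited work may normalize differently (for example, per layer or per dimension), so treat this as an assumed definition.

```python
def explained_variance(targets, recons):
    """1 - SSE / SST, pooled over all samples and output dimensions."""
    n = len(targets)
    dim = len(targets[0])
    means = [sum(y[j] for y in targets) / n for j in range(dim)]
    sse = sum((y[j] - r[j]) ** 2 for y, r in zip(targets, recons) for j in range(dim))
    sst = sum((y[j] - means[j]) ** 2 for y in targets for j in range(dim))
    return 1.0 - sse / sst

targets = [[1.0, 0.0], [3.0, 0.0], [5.0, 2.0], [7.0, 2.0]]
recons = [[2.0, 0.0], [3.0, 0.0], [5.0, 1.0], [7.0, 2.0]]
ev = explained_variance(targets, recons)
print(round(ev, 4))
```

A value of 1.0 means perfect reconstruction, while 0.0 means the reconstruction is no better than predicting the mean activation.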
5. Implementation Infrastructure and Scalability
The CLT-Forge open-source library provides an end-to-end platform for scalable CLT training, analysis, and visualization (Draye et al., 22 Mar 2026). Key components include:
- Activation Caching and Quantization: Efficient storage through int8/int4 quantization and zstd compression yields 4–12× storage reduction with marginal (<3%) loss in reconstruction quality.
- Feature-wise Model Sharding: GPU memory footprint is minimized by splitting the feature dimension across devices; distributed microbatching enables training large CLTs (e.g., 1.5M features on LLaMA 1B) within tractable timescales (∼10 days on 8x80GB GPUs).
- Automated Interpretability (AutoInterp) and Circuit-Tracer: Parallelized analysis of top-activating tokens/sequences per feature, LLM-based feature annotation, and rapid construction/pruning of attribution graphs.
- Visualization: Lightweight Dash-based interface supporting cluster analysis, edge inspection, and interactive interventions in the latent feature space.
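To give a feel for the activation-caching component listed above, here is a minimal sketch of symmetric per-tensor int8 quantization using only the Python standard library. It is not CLT-Forge code: the function names are hypothetical, and the library's actual scheme (per-channel scales, int4 packing, zstd compression) is not reproduced here.

```python
import struct

def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize_int8(codes, scale):
    """Map integer codes back to approximate float activations."""
    return [c * scale for c in codes]

activations = [0.0, 0.5, -1.27, 1.27]
codes, scale = quantize_int8(activations)
restored = dequantize_int8(codes, scale)

# Byte sizes when stored as float32 versus int8 (scale stored once, separately).
raw_f32 = struct.pack(f"{len(activations)}f", *activations)
packed_i8 = struct.pack(f"{len(codes)}b", *codes)
max_error = max(abs(a - b) for a, b in zip(activations, restored))
print(len(raw_f32), len(packed_i8), max_error)
```

Going from float32 to int8 alone accounts for a 4× reduction, and the rounding error per value is bounded by half the scale. Any further savings would come from lower bit widths and a general-purpose compressor applied on top.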
By merging cross-layer feature dictionaries and layer-dependent decoding, CLTs deliver more compact, interpretable, and scalable mechanistic models of transformer computation than independent per-layer transcoders (Draye et al., 22 Mar 2026).
6. Extensions, Limitations, and Prospective Directions
Trade-offs in CLT design include a parameter count that grows quadratically with the number of layers and a tension between sparsity (circuit compressibility) and fidelity (explained variance). Approaches such as low-rank or parameter-shared decoders are plausible directions to reduce resource requirements for very deep or wide models (Tsui et al., 12 Feb 2026). Extensions of the CLT paradigm encompass:
- Cross-modal and multi-domain applications (e.g., vision-LLMs, demographic/bias feature tracing, factual editing circuits).
- Hybrid models combining CLTs for MLPs with simplified surrogates for attention layers.
- Automated circuit annotation linking latents to external knowledge sources (e.g., PDB for proteins).
- Comparative representation analyses across architectures or fine-tuning objectives (Harrasse et al., 13 Nov 2025, Tsui et al., 12 Feb 2026).
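The quadratic growth in parameter count noted at the start of this section can be made concrete with simple arithmetic. The sketch below counts weights for a CLT with one encoder per layer and one decoder per (source layer, target layer) pair, then compares that with a hypothetical low-rank factorization of each decoder. The model sizes, the rank, and both function names are illustrative assumptions, and biases are ignored.

```python
def clt_param_count(n_layers, d_model, n_features):
    """Weights in a CLT with a full decoder for every pair of layers (s <= l)."""
    enc = n_layers * d_model * n_features
    n_decoders = n_layers * (n_layers + 1) // 2
    dec = n_decoders * d_model * n_features
    return enc + dec

def low_rank_param_count(n_layers, d_model, n_features, rank):
    """Same layout, but each decoder factored as (d_model x rank) @ (rank x n_features)."""
    enc = n_layers * d_model * n_features
    n_decoders = n_layers * (n_layers + 1) // 2
    dec = n_decoders * rank * (d_model + n_features)
    return enc + dec

full = clt_param_count(n_layers=12, d_model=768, n_features=10_000)
low_rank = low_rank_param_count(n_layers=12, d_model=768, n_features=10_000, rank=64)
print(f"{full:,} vs {low_rank:,}")  # 691,200,000 vs 145,913,856
```

Doubling the number of layers roughly quadruples the decoder count, which is why decoder factorization or sharing becomes attractive for deep models. Whether a low-rank decoder preserves reconstruction fidelity is an empirical question this arithmetic does not address.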
A plausible implication is that by systematically mapping computational pathways in state-of-the-art models, CLTs offer a practical foundation for causal interpretability, alignment diagnostics, circuit discovery, and targeted model steering. In domains such as language safety or biomolecular design, sparse cross-layer circuit discovery may support both transparency and fine-grained functional control.
Key References:
- "Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders" (Harrasse et al., 13 Nov 2025)
- "CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders" (Kim et al., 2 Apr 2026)
- "CLT-Forge: A Scalable Library for Cross-Layer Transcoders and Attribution Graphs" (Draye et al., 22 Mar 2026)
- "Protein Circuit Tracing via Cross-layer Transcoders" (Tsui et al., 12 Feb 2026)