Cross-Layer Transcoder Analysis
- CLT analysis is a mechanistic interpretability paradigm that explicitly traces representational transformations and information flow across deep neural network layers.
- It factorizes layer transformations into dedicated encoders and decoders to compress activations into sparse features and reconstruct downstream computations accurately.
- CLTs enable targeted interventions across languages and domains by extracting functional circuits, improving model interpretability and steering capabilities.
Cross-Layer Transcoder (CLT) analysis is a mechanistic interpretability paradigm for deep neural networks, designed to explicitly trace representational transformations and information flow across multiple network layers. The CLT framework factorizes the mapping between layers, compresses activations into sparse, interpretable features, and reconstructs downstream computations via dedicated decoder projections. This approach has been applied to LLMs for multilingual interpretability (Harrasse et al., 13 Nov 2025) and to protein language models (pLMs) for computational circuit tracing (Tsui et al., 12 Feb 2026), providing insights into shared representations, language- and function-specific decoding, and experimentally steerable latent circuits.
1. Mathematical Formulation and Architecture
The canonical CLT architecture consists of layer-specific encoders and decoders that map a model’s residual-stream activations into a shared latent feature space and reconstruct feedforward (MLP) outputs downstream.
Let the residual-stream activation at layer $\ell$ and position $t$ be $x_t^{\ell} \in \mathbb{R}^{d}$. For each layer $\ell$, the CLT factorizes the transformation into:
- Encoder: $z_t^{\ell} = \sigma\big(W_{\text{enc}}^{\ell}\, x_t^{\ell}\big)$
where $z_t^{\ell}$ is the sparse latent representation and $\sigma$ a sparsifying nonlinearity (e.g., ReLU or TopK).
- Decoder: $\hat{y}_t^{\ell'} = \sum_{\ell \le \ell'} W_{\text{dec}}^{\ell \to \ell'}\, z_t^{\ell}$
reconstructing the MLP output at each downstream layer $\ell' \ge \ell$.
In ProtoMech for protein LMs (Tsui et al., 12 Feb 2026), a “joint” CLT variant encodes the full set of layerwise activations $\{x^{\ell}\}_{\ell=1}^{L}$, aggregates latents across all layers, and uses a TopK nonlinearity to enforce exact sparsity.
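As a concrete sketch, the per-layer encoder/decoder factorization with a TopK nonlinearity can be written in a few lines of numpy. The class, dimensions, and initialization below are illustrative assumptions, not the published implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_relu(z, k):
    """Keep the k largest post-ReLU latents, zero the rest (exact sparsity)."""
    z = np.maximum(z, 0.0)
    if k < z.shape[-1]:
        thresh = np.partition(z, -k, axis=-1)[..., -k][..., None]
        z = np.where(z >= thresh, z, 0.0)
    return z

class CrossLayerTranscoder:
    """Per-layer encoders; each layer's latents decode into the MLP output
    of every downstream layer (the defining cross-layer property)."""
    def __init__(self, n_layers, d_model, d_latent, k):
        self.k = k
        self.W_enc = [rng.normal(0, 0.02, (d_latent, d_model)) for _ in range(n_layers)]
        # W_dec[l][l2 - l] reconstructs the layer-l2 MLP output from layer-l latents
        self.W_dec = [[rng.normal(0, 0.02, (d_model, d_latent))
                       for _ in range(n_layers - l)] for l in range(n_layers)]

    def forward(self, resid):
        """resid: list of residual-stream activations x^l, each of shape (d_model,)."""
        n = len(resid)
        latents = [topk_relu(self.W_enc[l] @ resid[l], self.k) for l in range(n)]
        # MLP output at layer l2 is the summed decoder projection from all l <= l2
        recon = [sum(self.W_dec[l][l2 - l] @ latents[l] for l in range(l2 + 1))
                 for l2 in range(n)]
        return latents, recon
```

Each latent vector carries at most `k` nonzero entries, and the reconstruction at a given layer pools contributions from all earlier layers, which is what makes cross-layer attribution tractable.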
2. Training Objectives and Sparsity Constraints
CLTs are trained to accurately reconstruct target MLP outputs from earlier activations, with additional regularization to yield interpretable, sparse feature sets. The full loss can be expressed as:
- Reconstruction loss (MSE): $\mathcal{L}_{\text{recon}} = \sum_{\ell'} \big\lVert \hat{y}^{\ell'} - y^{\ell'} \big\rVert_2^2$
- Feature-count sparsity: $\mathcal{L}_{\text{sparse}} = \lambda \sum_{\ell} \lVert z^{\ell} \rVert_0$
- Dead-feature penalty: an auxiliary term $\mathcal{L}_{\text{dead}}$ penalizing latents that fail to activate over a training window.
The total loss is: $\mathcal{L} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{sparse}} + \mathcal{L}_{\text{dead}}$
ProtoMech also incorporates an auxiliary error-decoding loss and optional norm regularization on the encoder weights. Sparsity is typically enforced by a per-layer TopK nonlinearity.
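A minimal numpy sketch of this objective follows; the coefficients `lam` and `mu` and the particular dead-feature count are illustrative assumptions, not values from either paper:

```python
import numpy as np

def clt_loss(recon, target, latents, usage_counts, lam=1e-3, mu=1e-4):
    """Total CLT objective: per-layer reconstruction MSE, a feature-count
    (L0) sparsity term, and a penalty on dead latents (latents with zero
    activations over the tracked usage window)."""
    mse = sum(float(np.sum((r - t) ** 2)) for r, t in zip(recon, target))
    l0 = sum(int(np.count_nonzero(z)) for z in latents)      # active features
    dead = sum(int(np.sum(c == 0)) for c in usage_counts)    # never-firing latents
    return mse + lam * l0 + mu * dead
```

In practice the L0 term is realized implicitly by the TopK nonlinearity (which fixes the active-feature count per layer), so gradient-based training only needs the differentiable reconstruction and dead-feature terms.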
3. Cross-Language and Cross-Layer Alignment
CLTs enable systematic measurement of cross-language and cross-layer alignment through the following methods (Harrasse et al., 13 Nov 2025):
- Cosine Similarity: For activations $x_A, x_B$ of two languages at a given layer, $\mathrm{sim}(x_A, x_B) = \frac{x_A \cdot x_B}{\lVert x_A \rVert \, \lVert x_B \rVert}$.
- CCA Similarity: Finds linear projections $u, v$ that maximize $\mathrm{corr}\big(u^{\top} x_A,\, v^{\top} x_B\big)$, expressing the aggregate alignment of two languages at a given layer.
- Multilingual Score ($H$): For feature $f$, count activation rates per language $L$, form probabilities $p_L$, and compute the entropy $H(f) = -\sum_{L} p_L \log p_L$.
High $H$ denotes a multilingual feature; low $H$, language specificity.
Layerwise averaging of $H$ reveals a characteristic inverted-U profile: entropy (cross-language sharing) is minimal in the early and late layers, reflecting language-specific encoding or decoding features, and maximal in the middle layers (the shared pivot representation).
Statistical tests confirm that middle-layer alignments between all language pairs are nearly identical (Wilcoxon signed-rank test), implicating a single pivot representational space.
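The cosine similarity and multilingual score above reduce to a few lines of numpy; this sketch uses function names of my own choosing:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two activation vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def multilingual_score(rates):
    """Entropy H of a feature's activation rates normalized across languages.
    Uniform rates give maximal H (shared feature); a one-hot profile gives 0."""
    p = np.asarray(rates, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                       # convention: 0 * log 0 = 0
    return float(-(p * np.log(p)).sum())
```

For four languages, a feature firing uniformly in all of them scores $H = \ln 4 \approx 1.39$, while a feature firing in only one language scores $0$, so $H$ directly separates shared pivot features from language-specific ones.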
4. Attribution Graphs and Circuit Extraction
Cross-layer attribution quantifies the contribution of features across layers to targeted outputs. The pairwise attribution score from a source feature $s$ to a target node $t$ is given by (Harrasse et al., 13 Nov 2025):
$A_{s \to t} = a_s \, \big(w_s^{\text{dec}}\big)^{\top} J^{\text{sg}} \, w_t$
where $w_s^{\text{dec}}$ (decoder row), $J^{\text{sg}}$ (stop-gradient Jacobian), and $a_s$ (encoder activation) quantify the source's effect on the target logit, and $w_t$ is the target node's input direction.
Pruning the resulting graph to the minimal edge set explaining ≥95% of the logit effect yields sparse attribution circuits spanning multiple layers.
ProtoMech extracts functional “circuits” by greedily selecting latent channels with highest attribution (gradients of probe scores w.r.t. latent activations), iteratively accumulating those that recover ≥70% of baseline performance (Tsui et al., 12 Feb 2026).
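The greedy accumulation step can be sketched as follows; the `perf_fn` interface (evaluating the probe with only the selected latents active) and all names are hypothetical stand-ins:

```python
import numpy as np

def greedy_circuit(attributions, perf_fn, baseline, target=0.70):
    """Add latent channels in descending order of |attribution| until the
    selected circuit recovers `target` of baseline probe performance."""
    order = np.argsort(-np.abs(np.asarray(attributions, dtype=float)))
    circuit = []
    for i in order:
        circuit.append(int(i))
        if perf_fn(circuit) >= target * baseline:
            break
    return circuit
```

Because channels are ranked once by attribution and added until the recovery threshold is met, the resulting circuit is small by construction whenever attribution mass is concentrated in a few latents.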
| Framework | Circuit Size | Recovery |
|---|---|---|
| ProtoMech (pLM, ESM2) | 1% of latents (150) | 74–79% |
| LLM CLT (GPT-2, 12L) | 4–6 features for language ID | 90% swap |
5. Intervention and Steering
CLT representations allow for targeted interventions on model computations. In multilingual LLMs, a small set of “language features” in the final layers have activation rates of 50–100% on tokens of a given language (Harrasse et al., 13 Nov 2025). By zeroing out the features associated with a source language and activating those of a target language, model output can be switched to a different language with high reliability.
The intervention formula is:
$z_i \leftarrow 0 \;\; \forall i \in \mathcal{F}_{\text{src}}, \qquad z_j \leftarrow c \;\; \forall j \in \mathcal{F}_{\text{tgt}}$
where $\mathcal{F}_{\text{src}}$ and $\mathcal{F}_{\text{tgt}}$ are the source- and target-language feature sets and $c$ a positive clamp value. Logit changes are predominantly linear, with language swaps raising the target token from beyond rank 100 into the top 5, and high success rates with only 4–6 features manipulated.
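A minimal sketch of the swap intervention on a latent vector, assuming known source- and target-language feature indices and an illustrative clamp value:

```python
import numpy as np

def swap_language(latents, src_features, tgt_features, clamp=1.0):
    """Zero the source-language features and clamp the target-language
    features to a fixed positive activation, leaving all others intact."""
    z = np.array(latents, dtype=float)
    z[list(src_features)] = 0.0
    z[list(tgt_features)] = clamp
    return z
```

The edited latent vector is then decoded as usual; since only a handful of language features are touched, the rest of the computation (and hence the semantic content of the output) is left largely undisturbed.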
In protein LMs, steering along extracted circuits by clamping relevant latents yields variant sequences with superior design fitness; for instance, on 7 deep mutational scanning (DMS) assays, circuit steering produced the single highest-fitness variant in 71% of cases (mean ProtoMech score 4.17 vs. CAA 2.93 and random 2.74 for GFP_AEQVI) (Tsui et al., 12 Feb 2026).
6. Empirical Results and Benchmarks
CLT-based analysis provides high-fidelity and compressible mechanistic models:
- For protein classification (InterPro, ESM2), the CLT replicates 82–89% of the original model's $F_1$; ProtoMech circuits (0.8% of latents) recover 79% (Tsui et al., 12 Feb 2026).
- In LLMs, linear language-ID probes on early-layer features achieve 98% accuracy by layer 2 (Harrasse et al., 13 Nov 2025).
- The multilingual entropy peak and cross-language alignment maxima are observed in the middle layers, supporting the pivot-representation hypothesis (cosine-CCA similarity peaks at $0.93$ in GPT-2 layers 5–8; entropy $H$ spans minima of $0.2$ to maxima of $1.4$).
- Sparse CLT interventions induce robust changes despite extreme compression: in pLMs, using 1% of the latent space retains 75–80% of baseline task performance.
7. Interpretability, Mechanistic Insights, and Implications
CLT analysis enables a compositional, circuit-level view of deep model computations. In multilingual LLMs, shared pivot-layer representations are evident, with language identity encoded and decodable from early features and final outputs determined by high-frequency language-specific activations. In pLMs, CLTs recover circuits that map onto known structural and functional protein motifs (e.g., catalytic loops, binding pockets).
A notable finding is that architectural and tokenization design influence downstream feature activation and language-specific decoding rather than the formation of the shared pivot itself. For non-English languages, failures correlate with weaker activation of late-layer language features and increased sub-token fragmentation (embedding-to-feature edge strengths correlate with the average subword count per token in LLMs).
This suggests that CLTs bridge the methodological gap between solely layerwise interpretability (as in sparse autoencoders) and end-to-end circuit tracing, producing interpretable surrogates that maintain high fidelity to the original model’s computations.
References
- "Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders" (Harrasse et al., 13 Nov 2025)
- "Protein Circuit Tracing via Cross-layer Transcoders" (Tsui et al., 12 Feb 2026)