
Cross-Layer Transcoder Analysis

Updated 23 February 2026
  • CLT analysis is a mechanistic interpretability paradigm that explicitly traces representational transformations and information flow across deep neural network layers.
  • It factorizes layer transformations into dedicated encoders and decoders to compress activations into sparse features and reconstruct downstream computations accurately.
  • CLTs enable targeted interventions across languages and domains by extracting functional circuits, improving model interpretability and steering capabilities.

Cross-Layer Transcoder (CLT) analysis is a mechanistic interpretability paradigm for deep neural networks, designed to explicitly trace representational transformations and information flow across multiple network layers. The CLT framework factorizes the mapping between layers, compresses activations into sparse, interpretable features, and reconstructs downstream computations via dedicated decoder projections. This approach has been applied to LLMs for multilingual interpretability (Harrasse et al., 13 Nov 2025) and to protein language models (pLMs) for computational circuit tracing (Tsui et al., 12 Feb 2026), providing insights into shared representations, language- and function-specific decoding, and experimentally steerable latent circuits.

1. Mathematical Formulation and Architecture

The canonical CLT architecture consists of layer-specific encoders and decoders that map a model’s residual-stream activations into a shared latent feature space and reconstruct feedforward (MLP) outputs downstream.

Let the residual-stream activation at layer $\ell$ and position $k$ be $\mathbf{h}_\ell^{(k)} \in \mathbb{R}^{d_\mathrm{model}}$. For each $\ell$, the CLT factorizes the transformation into:

  • Encoder:

$$\mathbf{z}_\ell^{(k)} = \mathrm{ReLU}(W_\mathrm{enc}^\ell \mathbf{h}_\ell^{(k)} + b_\mathrm{enc}^\ell)$$

where $\mathbf{z}_\ell^{(k)} \in \mathbb{R}^{d_\mathrm{feat}}$ is the sparse latent representation.

  • Decoder:

$$\hat{\mathbf{m}}_{\ell'}^{(k)} = \sum_{\ell \leq \ell'} \left(W_\mathrm{dec}^{\ell \to \ell'} \mathbf{z}_\ell^{(k)} + b_\mathrm{dec}^{\ell \to \ell'}\right)$$

reconstructing the MLP output at downstream layer $\ell'$.

In ProtoMech for protein LMs (Tsui et al., 12 Feb 2026), a “joint” CLT variant encodes the full set of layerwise activations $X^{(\ell)}$, aggregates latents across all layers, and uses a TopK nonlinearity to enforce exact sparsity.
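The per-layer encoder and cross-layer decoder above can be sketched in a few lines (a minimal NumPy illustration with arbitrary dimensions and random weights, not the published implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_clt(n_layers, d_model, d_feat):
    """Random CLT parameters: one encoder per layer, one decoder per (l, l') pair."""
    return {
        "W_enc": [rng.normal(0, 0.02, (d_feat, d_model)) for _ in range(n_layers)],
        "b_enc": [np.zeros(d_feat) for _ in range(n_layers)],
        "W_dec": {(l, lp): rng.normal(0, 0.02, (d_model, d_feat))
                  for l in range(n_layers) for lp in range(l, n_layers)},
        "b_dec": {(l, lp): np.zeros(d_model)
                  for l in range(n_layers) for lp in range(l, n_layers)},
    }

def encode(clt, h):
    """h[l]: residual-stream activation at layer l -> sparse latent z[l] (ReLU)."""
    return [np.maximum(0.0, clt["W_enc"][l] @ h[l] + clt["b_enc"][l])
            for l in range(len(h))]

def decode(clt, z, lp):
    """Reconstruct the MLP output at layer lp by summing over source layers l <= lp."""
    return sum(clt["W_dec"][(l, lp)] @ z[l] + clt["b_dec"][(l, lp)]
               for l in range(lp + 1))
```

Training would fit these weights so that `decode(clt, z, lp)` matches the model’s actual MLP outputs; the joint TopK variant would replace the ReLU with a per-layer top-k mask.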

2. Training Objectives and Sparsity Constraints

CLTs are trained to accurately reconstruct target MLP outputs from earlier activations, with additional regularization to yield interpretable, sparse feature sets. The full loss can be expressed as:

  • Reconstruction loss (MSE):

$$\mathcal{L}_\mathrm{rec} = \sum_{\ell'} \sum_k \left\|\hat{\mathbf{m}}_{\ell'}^{(k)} - \mathbf{m}_{\ell'}^{(k)}\right\|_2^2$$

  • $L_0$ feature-count sparsity:

$$\mathcal{L}_{L_0} = \lambda_0 \sum_{\ell,k} \tanh\!\left(C\,\big|\mathbf{z}_\ell^{(k)}\big| \odot \big\|W_\mathrm{dec}^{\ell}\big\|\right)$$

  • Dead-feature penalty:

$$\mathcal{L}_\mathrm{df} = \lambda_\mathrm{df} \sum_{\ell,k} \mathrm{ReLU}\!\left(\exp(\tau) - \big\|\mathbf{h}_\ell^{(k),\mathrm{pre}}\big\|\,\big\|W_\mathrm{dec}^\ell\big\|\right)$$

The total loss is:

$$\mathcal{L} = \mathcal{L}_\mathrm{rec} + \mathcal{L}_{L_0} + \mathcal{L}_\mathrm{df}$$

ProtoMech also incorporates an auxiliary error-decoding loss and optionally $L_1$ regularization on encoder weights. Sparsity is typically enforced by a TopK nonlinearity per layer.
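The combined objective might be computed as follows (a hedged sketch: the shapes, norm placements, and hyperparameter values here are illustrative assumptions, not the papers’ exact settings):

```python
import numpy as np

def clt_loss(m_hat, m, z, z_pre, w_dec_norms,
             lam0=1e-3, lam_df=1e-4, C=1.0, tau=0.0):
    """Sketch of the CLT training loss.

    m_hat, m:    lists over target layers of [n_pos, d_model] arrays
    z:           list over layers of [n_pos, d_feat] sparse latents
    z_pre:       list over layers of [n_pos, d_feat] encoder pre-activations
    w_dec_norms: list over layers of [d_feat] decoder-row norms
    """
    # Reconstruction MSE, summed over target layers and positions
    l_rec = sum(np.sum((mh - mt) ** 2) for mh, mt in zip(m_hat, m))
    # tanh-saturated L0 surrogate, weighted by decoder norms
    l_l0 = lam0 * sum(np.sum(np.tanh(C * np.abs(zl) * wn))
                      for zl, wn in zip(z, w_dec_norms))
    # Dead-feature penalty: push pre-activation * decoder norm above exp(tau)
    l_df = lam_df * sum(np.sum(np.maximum(0.0, np.exp(tau) - np.abs(zp) * wn))
                        for zp, wn in zip(z_pre, w_dec_norms))
    return l_rec + l_l0 + l_df
```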

3. Cross-Language and Cross-Layer Alignment

CLTs enable systematic measurement of cross-language and cross-layer alignment through the following methods (Harrasse et al., 13 Nov 2025):

  • Cosine Similarity: For activations $X, Y \in \mathbb{R}^{N\times d}$,

$$\mathrm{cosine}(X,Y) = \frac{1}{N} \sum_{i=1}^N \frac{X_i \cdot Y_i}{\|X_i\|\,\|Y_i\|}$$

  • CCA Similarity: Finds linear projections $w_1, w_2$ that maximize

$$\rho = \mathrm{corr}(H^{(1)}w_1, H^{(2)}w_2)$$

expressing the aggregate alignment of two languages at a given layer.

  • Multilingual Score ($H(f)$): For feature $f$, count activation rates $A_l(f)$ per language $l$, forming probabilities $p_l(f)$ and entropy

$$H(f) = -\sum_{l=1}^L p_l(f) \log p_l(f)$$

High $H(f)$ denotes a multilingual feature; low $H(f)$ denotes language specificity.

Layerwise averaging of $H(f)$ reveals a characteristic “U-shape”: entropy (cross-language sharing) is minimal in early/late layers, reflecting language-specific or decoding features, and maximal in the middle layers (a shared pivot representation).

Statistical tests confirm that middle-layer alignments between all language pairs are nearly identical ($p \gg 0.05$, Wilcoxon signed-rank test), implicating a single pivot representational space.
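Both the cosine-alignment score and the multilingual entropy $H(f)$ are straightforward to compute (a minimal sketch; the input layouts are assumptions):

```python
import numpy as np

def mean_cosine(X, Y):
    """Average row-wise cosine similarity over N paired activations [N, d]."""
    num = np.sum(X * Y, axis=1)
    den = np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1)
    return float(np.mean(num / den))

def multilingual_score(act_rates):
    """Entropy H(f) of one feature's activation rates A_l(f) across L languages."""
    p = act_rates / act_rates.sum()
    p = p[p > 0]          # convention: 0 * log 0 = 0
    return float(-np.sum(p * np.log(p)))
```

A uniform rate vector gives the maximal score $\log L$ (multilingual feature); a one-hot vector gives 0 (language-specific).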

4. Attribution Graphs and Circuit Extraction

Cross-layer attribution quantifies the contribution of features across layers to targeted outputs. Pairwise attribution scores from source node $(\ell,k,n)$ to target $(\ell',k',n')$ are given by (Harrasse et al., 13 Nov 2025):

$$a_{\ell,k,n}^{\;\ell',k',n'} = \sum_{s=\ell}^{\ell'} f_{k,n}^{\;\ell\to s}\, J_{\ell,k}^{\;\ell',k'}\, g_{k',n'}^{\,\ell'}$$

where $f_{k,n}^{\,\ell\to s}$ (decoder row), $J_{\ell,k}^{\;\ell',k'}$ (stop-gradient Jacobian), and $g_{k',n'}^{\,\ell'}$ (encoder activation) together quantify the effect on the logit.

Pruning the resulting graph to the minimal edge set explaining ≥95% of the logit effect yields sparse attribution circuits spanning multiple layers.

ProtoMech extracts functional “circuits” by greedily selecting latent channels with highest attribution (gradients of probe scores w.r.t. latent activations), iteratively accumulating those that recover ≥70% of baseline performance (Tsui et al., 12 Feb 2026).

| Framework | Circuit Size ($\gamma$) | Recovery |
| --- | --- | --- |
| ProtoMech (pLM, ESM2) | $<$1% of latents ($\sim$150) | 74–79% |
| LLM CLT (GPT-2, 12L) | 4–6 features for language ID | $>$90% swap |
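ProtoMech’s greedy accumulation loop can be sketched as follows (the `perf_of_subset` evaluator is a hypothetical placeholder for re-running the probe on a restricted latent set, not ProtoMech’s actual API):

```python
import numpy as np

def greedy_circuit(attributions, perf_of_subset, baseline, target=0.70):
    """Greedily add highest-attribution latents until the subset
    recovers >= target * baseline performance.

    attributions:   array [d_feat] of |gradient of probe score w.r.t. latent|
    perf_of_subset: callable(list[int]) -> performance with only those latents
    """
    order = np.argsort(attributions)[::-1]  # highest attribution first
    circuit = []
    for idx in order:
        circuit.append(int(idx))
        if perf_of_subset(circuit) >= target * baseline:
            break
    return circuit
```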

5. Intervention and Steering

CLT representations allow for targeted interventions on model computations. In multilingual LLMs, a small set of “language features” $F_L$ in the final layers have activation rates of roughly 50–100% on tokens of language $L$ (Harrasse et al., 13 Nov 2025). By zeroing out features associated with a source language ($F_S$) and activating those of a target language ($F_T$), model output can be switched to a different language with high reliability.

The intervention formula is:

$$\tilde{\mathbf{z}}_\ell = \mathbf{z}_\ell + \sum_{f \in F_T} \alpha_f \mathbf{e}_f - \sum_{f \in F_S} \beta_f \mathbf{e}_f$$

Logit changes are predominantly linear, with language swaps raising the target token rank from $\sim$100 to top-5, and $>$90% success rates with only 4–6 features manipulated.
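A minimal sketch of the language-swap intervention at one layer, assuming features correspond to canonical basis directions $\mathbf{e}_f$ (the clamp value `alpha`, and the choice to zero the source features rather than subtract a tuned $\beta_f$, are simplifying assumptions):

```python
import numpy as np

def steer_language(z, source_feats, target_feats, alpha=5.0):
    """Language-swap intervention on a CLT latent vector.

    z:            [d_feat] latent activations at the intervened layer
    source_feats: indices of source-language features F_S (suppressed)
    target_feats: indices of target-language features F_T (clamped up)
    """
    z = z.copy()
    z[list(source_feats)] = 0.0      # suppress F_S
    z[list(target_feats)] += alpha   # activate F_T along basis directions e_f
    return z
```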

In protein LMs, steering along extracted circuits by clamping relevant latents yields variant sequences with superior design fitness; for instance, on 7 deep mutational scanning (DMS) assays, circuit steering produced the single highest-fitness variant in 71% of cases (mean ProtoMech score 4.17 vs. CAA 2.93 and random 2.74 for GFP_AEQVI) (Tsui et al., 12 Feb 2026).

6. Empirical Results and Benchmarks

CLT-based analysis provides high-fidelity and compressible mechanistic models:

  • For protein classification (InterPro, ESM2), the CLT replicates 82–89% of the original model’s $F_1$; ProtoMech circuits ($\gamma \approx 0.8\%$) recover 79% (Tsui et al., 12 Feb 2026).
  • In LLMs, linear language-ID probes on early-layer features achieve $\sim$98% accuracy by layer 2 (Harrasse et al., 13 Nov 2025).
  • The multilingual entropy peak and cross-language alignment maxima are observed in the mid-layers, supporting the pivot-representation hypothesis (cosine-CCA peaks at $0.93$ in GPT-2 layers 5–8, with minima $0.2$ and maxima $1.4$ for $H(f)$).
  • Sparse CLT interventions induce robust changes despite extreme compression: in pLMs, using $<$1% of the latent space retains $>$75–80% of baseline task performance at $\gamma \leq 1\%$.

7. Interpretability, Mechanistic Insights, and Implications

CLT analysis enables a compositional, circuit-level view of deep model computations. In multilingual LLMs, shared pivot-layer representations are evident, with language identity encoded and decodable from early features and final outputs determined by high-frequency language-specific activations. In pLMs, CLTs recover circuits that map onto known structural and functional protein motifs (e.g., catalytic loops, binding pockets).

A notable finding is that architectural or tokenization design influences downstream feature activation and language-specific decoding, rather than the formation of the shared pivot. For non-English languages, failures correlate with weaker activation of late-layer language features and increased sub-token fragmentation (embedding-to-feature edge strengths correlate at $r=-0.82$ with average subword count per token in LLMs).

This suggests that CLTs bridge the methodological gap between solely layerwise interpretability (as in sparse autoencoders) and end-to-end circuit tracing, producing interpretable surrogates that maintain high fidelity to the original model’s computations.
