
Cross-Layer Transcoders (CLTs)

Updated 27 March 2026
  • CLTs are architectural modules that reformat and share sparse feature codes across layers for improved interpretability.
  • They employ dedicated encoders and decoders to reconstruct activations, reducing attribution graph nodes by around 3×.
  • Scalable techniques like activation caching and feature sharding enable efficient training and causal interventions in large models.

Cross-Layer Transcoders (CLTs) are architectural modules for analyzing and reconstructing the computation within deep neural networks, particularly transformer-based models. CLTs are designed to enable mechanistic interpretability by representing the distributed activation dynamics of multilayer perceptrons (MLPs) as sparse, interpretable features that are shared across network layers but decoded in a layer-specific manner. This methodology provides significant compression of attribution graphs, increases interpretability, and enables causal interventions for probing model behavior across a range of modalities, including language and protein models (Draye et al., 22 Mar 2026, Harrasse et al., 13 Nov 2025, Tsui et al., 12 Feb 2026).

1. Formal Definition and Objectives

A Cross-Layer Transcoder (CLT) consists of layer-wise encoders and cross-layer decoders. For a transformer MLP at layer $\ell$ with input activation $h_\ell \in \mathbb{R}^{d}$, a standard single-layer transcoder computes

$$z_\ell = \sigma(W^{\ell}_{enc} h_\ell + b^{\ell}_{enc}), \quad \hat{m}_\ell = W^{\ell}_{dec} z_\ell + b^{\ell}_{dec},$$

where $\sigma$ is a sparsifying nonlinearity (e.g., JumpReLU or ReLU).
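To make the mapping concrete, here is a minimal NumPy sketch of the single-layer transcoder above; the shapes and the JumpReLU threshold are illustrative assumptions, not values from the papers:

```python
import numpy as np

def jump_relu(x, theta=0.1):
    # JumpReLU: keep an activation only where it exceeds the threshold theta
    return np.where(x > theta, x, 0.0)

def transcoder_forward(h, W_enc, b_enc, W_dec, b_dec):
    # z: sparse feature code; m_hat: reconstructed MLP output
    z = jump_relu(W_enc @ h + b_enc)
    m_hat = W_dec @ z + b_dec
    return z, m_hat

# Illustrative sizes: model width d, feature count e
d, e = 8, 32
rng = np.random.default_rng(0)
h = rng.normal(size=d)
W_enc = rng.normal(scale=0.1, size=(e, d)); b_enc = np.zeros(e)
W_dec = rng.normal(scale=0.1, size=(d, e)); b_dec = np.zeros(d)
z, m_hat = transcoder_forward(h, W_enc, b_enc, W_dec, b_dec)
```

The decoder matrix has shape $d \times e$, matching the definition of $W^{\ell \to \ell'}_{dec}$ below.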

The distinguishing feature of a CLT is that the sparse feature code $z_\ell$ for each layer $\ell$ is not discarded after reconstructing $m_\ell$ but shared with subsequent layers. The output at any layer $\ell'$ is reconstructed as

$$\hat{m}_{\ell'} = \sum_{\ell=1}^{\ell'} W^{\ell \to \ell'}_{dec} z_\ell + b^{\ell'}_{dec},$$

with $W^{\ell \to \ell'}_{dec} \in \mathbb{R}^{d \times e}$ specifying a unique decoder for each cross-layer pair (Draye et al., 22 Mar 2026, Harrasse et al., 13 Nov 2025). The objective minimized during training is

$$\mathcal{L} = \sum_{\ell'} \| \hat{m}_{\ell'} - m_{\ell'} \|_2^2 + \lambda_0 \sum_\ell \tanh\left(C \, (z_\ell \odot \|W^{\ell}_{dec}\|)\right) + \lambda_1 \sum_\ell \mathrm{ReLU}\left(e^{\tau} - h^{pre}_\ell \right) \|W^{\ell}_{dec}\|,$$

where the first term is mean-squared error (MSE) reconstruction, the second term enforces sparsity, and the third is a “dead-feature” penalty (Draye et al., 22 Mar 2026).
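The cross-layer reconstruction and a simplified version of this objective can be sketched as follows; the tanh-weighted sparsity and dead-feature terms are replaced by a plain L1 surrogate for brevity, and all sizes are illustrative:

```python
import numpy as np

def clt_reconstruct(zs, W_dec, b_dec):
    """Reconstruct each layer's MLP output from the codes of that layer and
    all earlier layers: m_hat[lp] = sum over l <= lp of W_dec[(l, lp)] @ zs[l] + b."""
    n_layers = len(zs)
    return [sum(W_dec[(l, lp)] @ zs[l] for l in range(lp + 1)) + b_dec[lp]
            for lp in range(n_layers)]

def clt_loss(m_hats, ms, zs, lam0=1e-3):
    # MSE reconstruction plus an L1 sparsity surrogate (the papers' tanh-weighted
    # penalty and dead-feature term are omitted here)
    mse = sum(np.sum((mh - m) ** 2) for mh, m in zip(m_hats, ms))
    return mse + lam0 * sum(np.abs(z).sum() for z in zs)

# Illustrative sizes: 3 layers, width d=4, e=16 features per layer
rng = np.random.default_rng(0)
d, e, L = 4, 16, 3
zs = [np.abs(rng.normal(size=e)) for _ in range(L)]
W_dec = {(l, lp): rng.normal(scale=0.1, size=(d, e))
         for lp in range(L) for l in range(lp + 1)}
b_dec = [np.zeros(d) for _ in range(L)]
m_hats = clt_reconstruct(zs, W_dec, b_dec)
```

Note that the decoder dictionary holds one $d \times e$ matrix per ordered layer pair $(\ell, \ell')$ with $\ell \leq \ell'$, which is where the cross-layer parameter cost comes from.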

In protein models, ProtoMech extends this framework by forming a global sparse latent $z$ as the concatenation of all layerwise features, enabling joint reconstruction and task-specific prediction (Tsui et al., 12 Feb 2026).

2. Cross-Layer Feature Sharing and Attribution Graphs

CLTs introduce a cross-layer sum that preserves layer specificity via unique decoders but compacts the overall representation by sharing feature bases across layers. This is expressed as

$$\hat{m}_{\ell'} = \sum_{\ell \leq \ell'} W^{\ell \to \ell'}_{dec} z_\ell$$

for the language modeling context (Draye et al., 22 Mar 2026, Harrasse et al., 13 Nov 2025), and as

$$\hat{h}_\ell = \sum_{\ell'=1}^{\ell} W^{\ell' \to \ell} a_{\ell'}$$

in the protein context, where $a_{\ell'}$ are layerwise TopK-activated features (Tsui et al., 12 Feb 2026). This architecture yields attribution graphs where nodes are features and edges represent the causal influence of one feature (or set of features) in an upstream layer on features in a downstream layer via the chain rule and frozen-nonlinearity Jacobians.
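Under the simplifying assumption that influence flows only through the direct decoder-to-encoder path (ignoring attention paths and the frozen-nonlinearity Jacobians the full pipelines use), an edge-weight computation can be sketched as:

```python
import numpy as np

def feature_edge_weights(z_src, W_dec_src_to_dst, W_enc_dst, thresh=1e-3):
    # Virtual weight from source feature n to destination feature n':
    #   W_enc_dst[n', :] @ W_dec_src_to_dst[:, n],
    # scaled by the source activation z_src[n]; weak edges are pruned.
    virtual = W_enc_dst @ W_dec_src_to_dst       # shape (e_dst, e_src)
    edges = virtual * z_src[None, :]
    return np.where(np.abs(edges) < thresh, 0.0, edges)

# Illustrative sizes: width d=8, 16 features in each of the two layers
rng = np.random.default_rng(0)
d, e_src, e_dst = 8, 16, 16
z_src = np.abs(rng.normal(size=e_src))
edges = feature_edge_weights(z_src,
                             rng.normal(scale=0.1, size=(d, e_src)),
                             rng.normal(scale=0.1, size=(e_dst, d)))
```

The resulting matrix is indexed by (destination feature, source feature); thresholding it is the pruning step that keeps attribution graphs tractable.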

A key metric is redundancy reduction: CLTs yield attribution graphs with approximately 3× fewer nodes versus traditional per-layer transcoders, yielding more tractable and interpretable representations of feature-level computation (Draye et al., 22 Mar 2026).

3. Training Pipelines, Sparsity Controls, and Scalability

CLT-Forge and ProtoMech frameworks have introduced scalable and distributed training approaches necessary for large models:

  • Feature-wise sharding: CLT parameters and activations are partitioned across devices by feature index, enabling training with millions of features and large batch sizes (e.g., $e=48$ features, 1.5M total), using 8 × 80GB GPUs (Draye et al., 22 Mar 2026).
  • Compressed activation caching: Precomputing, quantizing, and chunking activations reduces storage requirements by 4–12× (e.g., 300M tokens from 20TB to 4TB), with minimal loss in reconstruction accuracy (≈2–3%) (Draye et al., 22 Mar 2026).
  • Optimization: CLTs are trained with the Adam optimizer (β₁=0.9, β₂=0.999), using mixed precision, JumpReLU or TopK nonlinearity, L₀ target scheduling for sparsity, and explicit penalties for dead features (Draye et al., 22 Mar 2026, Harrasse et al., 13 Nov 2025, Tsui et al., 12 Feb 2026).
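The caching step above rests on lossy quantization of stored activations. A minimal sketch of one such scheme, symmetric per-chunk int8 (an illustrative choice, not necessarily the exact scheme used in CLT-Forge):

```python
import numpy as np

def quantize_chunk(acts):
    # Symmetric int8 quantization with one scale per chunk
    scale = np.abs(acts).max() / 127.0 + 1e-12
    q = np.clip(np.round(acts / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_chunk(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
acts = rng.normal(size=(1024, 64)).astype(np.float32)   # one cached chunk
q, scale = quantize_chunk(acts)
recon = dequantize_chunk(q, scale)
rel_err = np.linalg.norm(recon - acts) / np.linalg.norm(acts)
# int8 storage is 4x smaller than float32, at the cost of a small
# relative reconstruction error
```

Finer-grained scales (per row or per feature) trade a little extra metadata for lower error, which is one way the reported 4-12× savings with ≈2-3% loss can be reached.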

In ProtoMech, an auxiliary linear probe loss can be combined with the reconstruction objective to select sparse circuits relevant for a downstream task, with explicit $\ell_1$ regularization on $z$ for compression (Tsui et al., 12 Feb 2026).
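A sketch of such a combined objective, using a hypothetical logistic probe on the global code $z$ (the probe form and weights are illustrative, not ProtoMech's exact formulation):

```python
import numpy as np

def probe_objective(z, W_probe, y, lam=1e-2):
    # Logistic probe loss on the global sparse code z, plus an L1 penalty
    # on z that encourages a small task-relevant circuit
    logits = W_probe @ z
    p = 1.0 / (1.0 + np.exp(-logits))
    probe_loss = -(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9)).sum()
    return probe_loss + lam * np.abs(z).sum()

rng = np.random.default_rng(0)
z = np.abs(rng.normal(size=10))      # global sparse code (illustrative size)
W_probe = rng.normal(size=(1, 10))
loss = probe_objective(z, W_probe, y=np.array([1.0]))
```

Raising `lam` shrinks the set of latents that survive, which is the compression knob behind the sub-1% circuits reported below.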

4. Automated Interpretability and Causal Interventions

Unified automated interpretability pipelines are central to the CLT approach:

  • Feature analysis: Identify top-$K$ sequences/tokens per feature and compute summary statistics across large corpora (e.g., 10M tokens) (Draye et al., 22 Mar 2026).
  • LLM-prompted explanations: Automatically generate natural-language descriptions for features by prompting an LLM with representative activations, storing explanations in a unified database (Draye et al., 22 Mar 2026).
  • Attribution graph computation (Circuit-Tracer): For all feature pairs $(\ell, n) \to (\ell', n')$ and token positions, compute edge weights using decoder and encoder vectors and Jacobians. Edges below a threshold are pruned, and clustering can further simplify the graph (Draye et al., 22 Mar 2026).
  • Direct interventions: In multilingual LLMs, manipulation of late-layer language-specific features (e.g., suppressing English, boosting French) demonstrably causes the model to switch output language, validating the causal role of these features (Harrasse et al., 13 Nov 2025). In protein models, clamping circuit-relevant latents can be used for targeted protein design, outperforming dense and per-layer baselines (Tsui et al., 12 Feb 2026).
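Mechanically, both kinds of intervention reduce to overwriting latents in the sparse code before decoding. A minimal sketch (feature indices and clamp values are arbitrary):

```python
import numpy as np

def clamp_feature(z, index, value):
    # Overwrite one latent: value=0.0 suppresses the feature,
    # a large positive value boosts it
    z = z.copy()
    z[index] = value
    return z

# Suppress one feature and boost another, then compare decoded outputs
rng = np.random.default_rng(0)
d, e = 8, 32
W_dec = rng.normal(scale=0.1, size=(d, e))
z = np.abs(rng.normal(size=e))
z_edit = clamp_feature(clamp_feature(z, 3, 0.0), 7, 5.0)
delta = W_dec @ (z_edit - z)   # change induced in the reconstructed output
```

In the multilingual experiments, the suppressed and boosted indices would correspond to language-specific late-layer features; `delta` propagates into the residual stream and shifts the model's output distribution.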

5. Empirical Findings Across Modalities

CLTs have been quantitatively and qualitatively evaluated in both language and biological sequence settings:

  • LLMs: On GPT-2, CLTs achieve explained variance ≈0.8, replacement score ≈0.8, and attribution graph completeness ≈0.95 under standard prompts (Draye et al., 22 Mar 2026). They reproduce known circuits such as antonym (“The opposite of…”) pathways and enable direct tracing of multilingual representations—including identifying “pivot” hidden states in middle layers and language-specific decoding in final layers (Harrasse et al., 13 Nov 2025).
  • Protein LLMs: In ProtoMech, CLTs reconstruct 82–89% of full model performance for protein family classification and function prediction, while compressed (<1%) circuits retain up to 79% of task performance. Compared to per-layer transcoders (PLTs), CLT-based circuits recover much more of the original model's accuracy (e.g., F1 of 0.82±0.19 for CLT vs. 0.50±0.34 for PLT) (Tsui et al., 12 Feb 2026).
  • Compression and Interpretability: Sparse feature sharing drastically reduces attribution graph nodes, aligns individual dimensions with known functional motifs in both protein and LLMs, and allows for steerable interventions validated by downstream task metrics (Tsui et al., 12 Feb 2026, Harrasse et al., 13 Nov 2025).
  • Scalability and Resource Utilization: Activation caching and distributed sharding enable end-to-end pipelines at trillion-token scales and with models exceeding 1B parameters (Draye et al., 22 Mar 2026).

6. Implementation Frameworks and API Functionality

The CLT-Forge library is the reference implementation for CLTs in LLMs, providing:

  • Installation: pip install clt-forge
  • Activation caching: Interfaces to load large pre-trained transformer models, generate, quantize, and store activations efficiently.
  • Training: Modular runners for distributed, sharded training of CLTs, including support for low-rank finetuning (Draye et al., 22 Mar 2026).
  • Interpretability: Automated feature extraction, LLM-based explanations, and direct computation of attribution graphs.
  • Visualization: Dash-based UI for interactive exploration of circuits and feature graphs.
  • Case study runner: API for evaluating interventions (e.g., testing the antonym circuit or multilingual feature tracing).

Table: API workflow components in CLT-Forge (Draye et al., 22 Mar 2026)

| Component | Description | Example Call |
|---|---|---|
| Activation Store | Loads/generates compressed activations | store.generate_and_save_activations() |
| CLT Training | Trains CLTs with distributed sharding, L₀ scheduling | trainer.run() |
| AutoInterp | Runs feature analysis and LLM-based natural-language explanations | autointerp.run("path/to/features") |
| Attribution Runner | Computes and visualizes feature attribution graphs | runner.run("prompt") |

7. Implications and Research Applications

CLTs provide a tractable and principled framework for mechanistically probing neural representations:

  • In LLMs, CLTs have revealed U-shaped multilingual entropy patterns, existence of “pivot” representations, and the emergence of sparse, causally decisive language features in late layers (Harrasse et al., 13 Nov 2025).
  • In protein models, CLTs recover compositional circuits aligned with structural/functional biological motifs and enable circuit-based design by direct manipulation of sparse latents (Tsui et al., 12 Feb 2026).
  • Efficient scaling via sharding and compression, as well as integration with interpretability workflows, advances the practical applicability of mechanistic feature-based understanding to large, real-world models (Draye et al., 22 Mar 2026).
  • CLTs support both direct causal validation and empirical discovery of circuits, informing theory and design across domains where multilayer, distributed representations dominate.

A plausible implication is that CLT-based approaches can be generalized to other model modalities and may help bridge the gap between interpretable sparse coding and the black-box statistics of modern large-scale neural networks.
