Central Dogma Transformer (CDT)

Updated 10 January 2026
  • Central Dogma Transformer (CDT) is a multi-modal model that aligns with the directional flow of DNA→RNA→Protein to integrate genomic, transcriptomic, and proteomic data.
  • The architecture unifies three frozen, pre-trained language models via directional cross-attention, producing a Virtual Cell Embedding for enhanced predictive accuracy.
  • CDT demonstrates robust performance on enhancer perturbation tasks and offers mechanistic insights through attention and gradient attribution analyses.

The Central Dogma Transformer (CDT) is a transformer-based architecture designed to model integrated cellular processes by aligning its structure with the directional information flow of the Central Dogma of molecular biology (DNA → RNA → Protein). CDT unifies three frozen, pre-trained language models, one per molecular modality, through directional cross-attention modules, resulting in a Virtual Cell Embedding that supports both predictive accuracy and mechanistic interpretability in genomics and cellular biology applications (Ota, 3 Jan 2026).

1. Architectural Design and Information Flow

CDT is composed of three frozen, pre-trained LMs, each embedding a distinct molecular modality into a common 768-dimensional latent space:

  • DNA LM: Utilizes Enformer, encoding 114 kb genomic windows around each enhancer into 896 positional embeddings of dimension 3072.
  • RNA LM: Employs scGPT, producing static 512-dimensional gene-token embeddings for approximately 2360 genes based on co-expression across 33 million single cells.
  • Protein LM: Incorporates ESM-C / ProteomeLM, yielding 768-dimensional protein embeddings informed by primary sequence and protein–protein context.

The workflow projects each raw modality embedding $X_m \in \mathbb{R}^{n_m \times d_m}$ via a linear transformation into the shared space:

$$H_m = \mathrm{LayerNorm}(X_m W_m + b_m), \quad W_m \in \mathbb{R}^{d_m \times d}.$$

Each modality then undergoes intra-modal multi-head self-attention (8 heads), refining within-modality dependencies:

$$\mathrm{SelfAttn}(H_m) = \mathrm{softmax}\!\left(\frac{Q_m K_m^T}{\sqrt{d_k}}\right) V_m$$

where $Q_m$, $K_m$, $V_m$ are linear projections of $H_m$.
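A minimal PyTorch sketch of this projection-plus-self-attention stage may clarify the shapes involved. The module and variable names are illustrative, and the residual connection around the self-attention is an assumption rather than a detail given in the text:

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Project one modality into the shared d-dimensional space, then self-attend."""
    def __init__(self, d_in: int, d: int = 768, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(d_in, d)             # X_m W_m + b_m
        self.norm = nn.LayerNorm(d)                # H_m = LayerNorm(X_m W_m + b_m)
        self.self_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, x_m: torch.Tensor) -> torch.Tensor:
        h_m = self.norm(self.proj(x_m))
        attn_out, _ = self.self_attn(h_m, h_m, h_m)  # intra-modal self-attention (8 heads)
        return h_m + attn_out                        # residual refinement (assumed)

# Illustrative shapes from the text: 896 DNA bins of dim 3072, ~2360 gene tokens of dim 512.
h_dna = ModalityEncoder(d_in=3072)(torch.randn(1, 896, 3072))
h_rna = ModalityEncoder(d_in=512)(torch.randn(1, 2360, 512))
print(h_dna.shape, h_rna.shape)  # torch.Size([1, 896, 768]) torch.Size([1, 2360, 768])
```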

CDT enforces the unidirectional, biologically-motivated information flow of the Central Dogma through two distinct cross-attention stages:

  1. DNA→RNA (Transcription cross-attention): Queries come from RNA embeddings, keys/values from DNA. The output, RNA$_\text{fused}$, integrates regulatory signals:

$$H_{\mathrm{RNA},\text{fused}} = H_{\mathrm{RNA}} + \mathrm{softmax}\!\left(\frac{Q_{\mathrm{RNA}} K_{\mathrm{DNA}}^T}{\sqrt{d_k}}\right) V_{\mathrm{DNA}}$$

The attention weight matrix $A^{\mathrm{DNA} \to \mathrm{RNA}} \in \mathbb{R}^{n_\text{genes} \times 896}$ is interpretable as gene–locus relevance.

  2. RNA→Protein (Translation cross-attention): Queries come from protein embeddings, keys/values from RNA$_\text{fused}$; the output, Protein$_\text{fused}$, integrates transcriptomic and genomic context:

$$H_{\mathrm{Prot},\text{fused}} = H_{\mathrm{Prot}} + \mathrm{softmax}\!\left(\frac{Q_{\mathrm{Prot}} K_{\mathrm{RNA},\text{fused}}^T}{\sqrt{d_k}}\right) V_{\mathrm{RNA},\text{fused}}$$

The attention matrix $A^{\mathrm{RNA} \to \mathrm{Prot}} \in \mathbb{R}^{n_\text{proteins} \times n_\text{genes}}$ encodes transcript–protein associations.

Reverse attention (RNA→DNA, Prot→RNA) is disallowed, ensuring biological interpretability of attention weights.
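The two directional stages can be sketched as below. The residual fusion mirrors the equations above; the class name, batching, and shapes are illustrative rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class DirectionalCrossAttention(nn.Module):
    """One cross-attention stage: queries from the downstream modality,
    keys/values from the upstream modality (DNA->RNA or RNA->Protein)."""
    def __init__(self, d: int = 768, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, h_down: torch.Tensor, h_up: torch.Tensor):
        # H_down_fused = H_down + Attn(Q = down, K = V = up)
        fused, weights = self.attn(h_down, h_up, h_up)
        return h_down + fused, weights  # weights: (batch, queries, keys), averaged over heads

d = 768
h_dna  = torch.randn(1, 896, d)    # 896 genomic bins
h_rna  = torch.randn(1, 2360, d)   # gene tokens
h_prot = torch.randn(1, 2360, d)   # protein tokens

transcription = DirectionalCrossAttention(d)   # DNA -> RNA
translation   = DirectionalCrossAttention(d)   # RNA -> Protein

h_rna_fused, a_dna_to_rna   = transcription(h_rna, h_dna)        # attention: genes x 896
h_prot_fused, a_rna_to_prot = translation(h_prot, h_rna_fused)   # attention: proteins x genes
# No RNA->DNA or Protein->RNA stage exists, so information flows only along the Central Dogma.
```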

2. Virtual Cell Embedding and Pooling

Following cross-attention, each modality yields a refined representation:

  • DNA (self-attended), RNA$_\text{fused}$, and Protein$_\text{fused}$.

Three learned pooling queries ($q_\text{DNA}, q_\text{RNA}, q_\text{Prot} \in \mathbb{R}^d$) aggregate these modality representations via

$$z_m = \mathrm{softmax}\!\left(\frac{q_m H_m^T}{\sqrt{d_k}}\right) H_m, \quad z_m \in \mathbb{R}^d.$$

Their concatenation $[z_\text{DNA}; z_\text{RNA}; z_\text{Prot}] \in \mathbb{R}^{3d}$ is processed by a two-layer GELU MLP, producing the final 768-dimensional Virtual Cell Embedding $h_\text{VCE}$. This unified embedding encodes cell state as a function of genomic, transcriptomic, and proteomic context.
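A sketch of this pooling-and-MLP head follows; it implements the pooling equation above, with the MLP hidden width and the query initialization as assumptions:

```python
import torch
import torch.nn as nn

class PooledVirtualCell(nn.Module):
    """Learned pooling query per modality, then concatenation and a 2-layer GELU MLP."""
    def __init__(self, d: int = 768):
        super().__init__()
        self.queries = nn.ParameterDict(
            {m: nn.Parameter(torch.randn(d)) for m in ("dna", "rna", "prot")}
        )
        self.mlp = nn.Sequential(nn.Linear(3 * d, d), nn.GELU(), nn.Linear(d, d))
        self.d_k = d

    def pool(self, q: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # z_m = softmax(q_m H_m^T / sqrt(d_k)) H_m
        scores = h @ q / self.d_k ** 0.5                                     # (batch, tokens)
        return (torch.softmax(scores, dim=-1).unsqueeze(1) @ h).squeeze(1)   # (batch, d)

    def forward(self, h_dna, h_rna_fused, h_prot_fused):
        z = [self.pool(self.queries[m], h)
             for m, h in zip(("dna", "rna", "prot"), (h_dna, h_rna_fused, h_prot_fused))]
        return self.mlp(torch.cat(z, dim=-1))            # 768-dim Virtual Cell Embedding

vce = PooledVirtualCell()(
    torch.randn(2, 896, 768), torch.randn(2, 2360, 768), torch.randn(2, 2360, 768)
)
print(vce.shape)  # torch.Size([2, 768])
```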

3. Training Procedures and Datasets

CDT leverages pre-trained weights for its DNA, RNA, and protein LMs:

  • DNA: Enformer, trained on ∼20,000 Mb of human and mouse chromatin data (ATAC, histone ChIP-seq)
  • RNA: scGPT, ~33 million single-cell transcriptomes across diverse tissues
  • Protein: ProteomeLM, 32,000 proteomes plus protein–protein interaction graphs

During fine-tuning, LM weights remain frozen. CDT introduces ∼60 million trainable parameters (<8% of total), preserving prior knowledge while constraining optimization.
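The freezing pattern itself is simple to express; the backbone below is a generic stand-in (not Enformer, scGPT, or ProteomeLM), used only to show how frozen and trainable parameters are separated and counted:

```python
import torch.nn as nn

# Stand-in for a frozen pre-trained LM; in CDT these are the DNA/RNA/protein backbones.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True), num_layers=6
)
# Stand-in for the newly introduced CDT modules (projections, cross-attention, pooling, MLP).
cdt_modules = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 1))

for p in backbone.parameters():          # pre-trained weights stay frozen during fine-tuning
    p.requires_grad = False

trainable = sum(p.numel() for p in cdt_modules.parameters() if p.requires_grad)
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable: {trainable:,} / total: {total:,} ({100 * trainable / total:.1f}%)")
```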

The downstream task is supervised regression on K562 CRISPRi enhancer perturbation data (GSE120861), with β (the natural-log fold-change) for enhancer–gene pairs as the target. The model uses a Huber loss (δ = 1.0) for regression robustness:

$$\mathcal{L}_\text{Huber}(y, \beta) = \begin{cases} \tfrac{1}{2} (y - \beta)^2, & |y - \beta| \leq \delta \\ \delta \left(|y - \beta| - \tfrac{1}{2}\delta\right), & \text{otherwise} \end{cases}$$

Optimization employs AdamW (learning rate 1e-4, weight decay 1e-5), a batch size of 64, and early stopping.
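The loss and optimizer setup correspond to standard PyTorch components; the regression head below is a stand-in, and only the hyperparameters (δ, learning rate, weight decay, batch size) come from the text:

```python
import torch
import torch.nn as nn

huber = nn.HuberLoss(delta=1.0)          # robust regression loss with delta = 1.0

# Stand-in head mapping a Virtual Cell Embedding to a predicted beta (log fold-change).
head = nn.Sequential(nn.Linear(768, 256), nn.GELU(), nn.Linear(256, 1))
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4, weight_decay=1e-5)

# One illustrative training step on random tensors standing in for a batch of 64 pairs.
vce, beta = torch.randn(64, 768), torch.randn(64, 1)
optimizer.zero_grad()
loss = huber(head(vce), beta)
loss.backward()
optimizer.step()
```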

4. Predictive Accuracy and Ablation Studies

On held-out validation from the Gasperini et al. K562 CRISPRi screen, CDT achieves:

  • Pearson correlation $r = 0.503$ ($p < 10^{-64}$) between predicted and experimental β, capturing 63% of the maximal (reproducibility) ceiling ($r_\text{ceiling} = 0.797$); the ratio is worked out below.
  • Explained variance: $R^2 \approx 0.25$ for β values.
  • Ablations: removing the DNA self-attention or cross-attention layers reduces performance by $\sim 0.08$ in $r$; decreasing the projection dimension below 512 is deleterious.
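The quoted fraction of the reproducibility ceiling follows directly from the two correlations:

$$\frac{r}{r_\text{ceiling}} = \frac{0.503}{0.797} \approx 0.63$$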

These results indicate that CDT’s architectural alignment with biological information flow is necessary for optimal predictive power.

5. Interpretability and Mechanistic Insights

CDT delivers two complementary interpretative frameworks:

  • Attention-based analysis (forward interpretation): DNA→RNA attention typically peaks within ±50 kb of enhancers (mean |Δ| ≃ 30 kb, 82% within 50 kb), recapitulating biological enhancer–promoter distances. RNA→Protein attention recovers transcript–protein modules. Self-attention heads demonstrate heterogeneous locality/globality.
  • Gradient-based attribution (reverse tracing): Gradients $\nabla_{x_p}\hat{y}$ of the predicted β with respect to the DNA embeddings $x_p$ pinpoint loci whose perturbation most affects predictions. Across 100 samples, the overlap between the top-20 bins by attention and by gradient is ~10%, suggesting distinct “what the model attends to” vs. “what drives the prediction” axes; a comparison sketch follows below.
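A sketch of the attention-versus-gradient comparison behind the ~10% figure, assuming per-bin attention mass and gradient magnitudes have already been extracted for one sample (all names are illustrative):

```python
import torch

def topk_overlap(attn_per_bin: torch.Tensor, grad_per_bin: torch.Tensor, k: int = 20) -> float:
    """Fraction of bins shared between the top-k by attention and the top-k by |gradient|."""
    top_attn = set(attn_per_bin.topk(k).indices.tolist())
    top_grad = set(grad_per_bin.abs().topk(k).indices.tolist())
    return len(top_attn & top_grad) / k

# Illustrative inputs: attention mass and gradient magnitude over the 896 DNA bins of one sample.
attn_mass = torch.rand(896)
grad_mag  = torch.rand(896)
print(topk_overlap(attn_mass, grad_mag))
# The paper reports ~10% mean overlap across 100 samples; random inputs here will typically be lower.
```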

Case Study: FNDC5 Enhancer

  • The FNDC5 enhancer exhibited the strongest β (−1.31). Attention peaked at +47.4 kb—an ENCODE candidate enhancer and CTCF site; gradient attribution identified −56.7 kb, matching a catalogued CTCF binding site (E1308103).
  • Hi-C maps revealed physical proximity between enhancer and FNDC5 promoter (~726 kb apart within the same TAD, A compartment), supporting a chromatin-looping mechanism for long-range regulation.

A hypothesized workflow emerges: use attention to flag putative regulatory elements, refine via gradient attribution, then validate with external genomics resources (e.g., ENCODE annotations, Hi-C).

6. Modular Extension and Significance

CDT's modular design allows any of its DNA, RNA, or protein foundation models to be swapped for improved ones without extensive retraining. The architecture acts as a template for “mechanism-oriented AI,” where computational graphs are intentionally aligned to biological directionality. This alignment not only enables high predictive accuracy on tasks such as enhancer perturbation but also supports mechanistic inference, allowing scientific hypothesis generation from model internals (Ota, 3 Jan 2026). A plausible implication is that similar architectural principles may generalize to other multi-modal biological data integration challenges.

7. Context and Implications

CDT demonstrates that transformer-based architectures reflecting molecular biology’s directional logic can bridge previously siloed modalities, achieving both interpretability and robust prediction. Avoiding reverse attention ensures that attention weights admit a single, biologically grounded interpretation. As foundation models for the underlying modalities improve, CDT’s architecture is positioned to absorb these advances, maintain interpretability, and serve as a foundation for further mechanism-oriented work in computational biology.
