
HELM: Hierarchical mRNA Language Modeling

Updated 7 January 2026
  • The paper introduces HELM, which employs codon-level hierarchical encoding and a modulated loss function to better capture biochemical relationships in mRNA sequences.
  • HELM systematically outperforms standard cross-entropy pre-training, achieving an average improvement of roughly 8% in predictive accuracy and tighter clustering of synonymous codon variants.
  • The framework’s extension into hyperbolic geometry (HyperHELM) further improves multi-species property prediction and robustness to out-of-distribution sequence lengths and base compositions, supporting diverse biological tasks.

Hierarchical Encoding for mRNA Language Modeling (HELM) is a methodological framework for pre-training LLMs on mRNA sequences that explicitly incorporates the biological hierarchy of codon structure. Standard language modeling approaches applied to biological sequences treat nucleotide or k-mer tokens without regard to codon synonymity, failing to reflect the biochemical relationships underpinning protein synthesis. HELM addresses this limitation by encoding mRNA at the codon level and introducing a hierarchy-modulated loss function that aligns the model’s learning process with the biological roles of mRNA sequences. This strategy results in systematic improvements on both property prediction and generative synthesis of mRNA, especially for tasks sensitive to synonymous codon usage (Yazdani-Jahromi et al., 2024).

1. Codon-Level Hierarchical Representation

mRNA sequences are most fundamentally interpreted by the translational machinery in triplets of nucleotides, termed codons, corresponding to 64 possible tokens (including start, stop, and special symbols). HELM formalizes the biological hierarchy as a tree $H=(V,E)$. The root node represents all codons, splitting primarily into “non-coding” (start, stop) and “coding” partitions. Each coding node further branches into child amino acid nodes, with each amino acid $A_j$ associated with a set of synonymous codon leaves $\mathrm{Codons}(A_j) = \{C_{j,1}, \dots, C_{j,m}\}$.
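To make the hierarchy concrete, below is a minimal Python sketch of the codon → amino acid → partition → root structure. The mapping follows the standard genetic code but is truncated for brevity; the helper name and the simplified handling of the start codon are illustrative assumptions, not details from the paper.

```python
# Sketch of the codon hierarchy H = (V, E): leaves are codons, their parents are
# amino acids (or the stop symbol), grouped into coding / non-coding partitions
# under a single root. Only a few entries of the standard genetic code are shown.
CODON_TO_PARENT = {
    "GCU": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",  # synonymous codons for Alanine
    "AAA": "Lys", "AAG": "Lys",                              # synonymous codons for Lysine
    "AUG": "Met",                                            # Methionine (also the start codon)
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",             # stop codons
    # ... remaining codons of the standard genetic code
}

def path_to_root(codon: str) -> list[str]:
    """Return the path from a codon leaf up to the root of the hierarchy."""
    parent = CODON_TO_PARENT[codon]
    # Simplification: stop codons go to the "non-coding" partition, all others to "coding".
    partition = "non-coding" if parent == "STOP" else "coding"
    return [codon, parent, partition, "ROOT"]

print(path_to_root("GCC"))  # ['GCC', 'Ala', 'coding', 'ROOT']
```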

Tokenization experiments compared nucleotide-level (vocabulary = 4), 6-mer (vocabulary ≈ 4⁶), and codon-level (vocabulary = 64) representations. Codon-level tokenization outperformed others on 5/6 evaluated tasks, confirming the codon as the optimal modeling unit for biological property prediction and generative modeling of mRNA (Yazdani-Jahromi et al., 2024).
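As a small illustration of the codon-level representation, a hedged tokenizer sketch (the function name and DNA-to-RNA normalization are assumptions, not the paper's tokenizer):

```python
def tokenize_codons(mrna: str) -> list[str]:
    """Split an mRNA sequence into codon tokens (non-overlapping nucleotide triplets)."""
    mrna = mrna.upper().replace("T", "U")  # accept DNA-style input as well
    if len(mrna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3 for codon tokenization")
    return [mrna[i:i + 3] for i in range(0, len(mrna), 3)]

print(tokenize_codons("AUGGCCAAAUAA"))  # ['AUG', 'GCC', 'AAA', 'UAA']
```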

2. Model Architectures and Hierarchy Integration

All HELM implementations employ models with 50 million parameters using codon tokenization. Three architectural baselines were evaluated: a Transformer (GPT-2 style: 10 layers, hidden size 640, sequence length up to 2048 codons), Hyena (7 layers, hidden size 768), and Mamba (40 layers, hidden size 256). The Transformer architecture was selected due to superior downstream performance.
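For orientation, the Transformer baseline's stated dimensions can be expressed with the Hugging Face GPT2Config as in the sketch below; the number of attention heads and the count of special tokens are not given in the source and are assumptions here.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# HELM Transformer baseline dimensions: 10 layers, hidden size 640, contexts of
# up to 2048 codons, a 64-codon vocabulary plus a handful of special tokens.
config = GPT2Config(
    vocab_size=64 + 5,   # 64 codons + assumed special tokens (pad, bos, eos, mask, unk)
    n_positions=2048,    # maximum sequence length in codons
    n_embd=640,          # hidden size
    n_layer=10,          # transformer layers
    n_head=10,           # assumed; must divide n_embd (640 / 10 = 64 per head)
)
model = GPT2LMHeadModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```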

Crucially, the biological hierarchy is not embedded directly within network layers. Instead, HELM introduces hierarchy through a modulated loss function that dynamically weights errors according to the codon synonymity structure, thereby externalizing the biological inductive bias to the training objective (Yazdani-Jahromi et al., 2024).

3. Hierarchically Modulated Loss Function

HELM’s core innovation is the hierarchical cross-entropy (HXE) loss, designed to respect codon–amino acid relationships. For a sequence $x=(x_1,\dots,x_T)$ with codon $c_t$ at position $t$, the model computes probabilities along the path from $c_t$ (leaf) up to the root. The HXE loss at $c_t$ is

$$L_{\mathrm{HXE}}(c_t) = - \sum_{l=0}^{h(c_t)-1} \lambda(C^{(l)}_t)\, \log p(C^{(l)}_t \mid C^{(l+1)}_t)$$

where $C^{(0)}_t = c_t$ and $C^{(h)}_t = R$ is the root. The weighting factor is $\lambda(C) = \exp(-\alpha\, h(C))$ with $\alpha \approx 0.2$. A flattened form simplifies to

$$L = \sum_{t=1}^{T} w(c_t)\, \ell(\hat p_t, c_t)$$

with $w(c_t)=\lambda(c_t)$ encoding codon-synonymity bias and $\ell$ the standard cross-entropy loss.

This hierarchical weighting penalizes mispredictions according to their biological severity (e.g., confusing synonymous codons is penalized less than assigning an incorrect amino acid), aligning the learning objective with translational fidelity (Yazdani-Jahromi et al., 2024).
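The sketch below illustrates the path-probability idea for a simplified two-level hierarchy (codon → amino acid → root); the coding/non-coding level is omitted, and the height function and $\alpha$ value are illustrative assumptions consistent with the description above, not the authors' implementation.

```python
import math
import torch
import torch.nn.functional as F

ALPHA = 0.2  # hierarchy weighting, lambda(C) = exp(-alpha * h(C))

def hxe_loss(logits, target_codons, codon_to_aa, num_aa):
    """Hierarchical cross-entropy over a two-level codon -> amino acid -> root tree.

    logits:        (batch, vocab) unnormalised codon scores from the model
    target_codons: (batch,) ground-truth codon indices
    codon_to_aa:   (vocab,) long tensor mapping each codon index to its amino-acid index
    """
    log_p_codon = F.log_softmax(logits, dim=-1)
    p_codon = log_p_codon.exp()

    # p(amino acid) = sum of the probabilities of its synonymous codons
    batch = logits.size(0)
    p_aa = torch.zeros(batch, num_aa, device=logits.device)
    p_aa.scatter_add_(1, codon_to_aa.unsqueeze(0).expand(batch, -1), p_codon)
    log_p_aa = (p_aa + 1e-12).log()

    target_aa = codon_to_aa[target_codons]
    lp_codon = log_p_codon.gather(1, target_codons[:, None]).squeeze(1)
    lp_aa = log_p_aa.gather(1, target_aa[:, None]).squeeze(1)

    lp_codon_given_aa = lp_codon - lp_aa   # level 0: log p(codon | amino acid)
    lp_aa_given_root = lp_aa               # level 1: log p(amino acid | root)

    w0, w1 = math.exp(-ALPHA * 0), math.exp(-ALPHA * 1)  # lambda at heights 0 and 1
    return -(w0 * lp_codon_given_aa + w1 * lp_aa_given_root).mean()
```

Because the level-0 term conditions on the amino acid, confusing one synonymous codon for another only hurts that term, whereas predicting the wrong amino acid is penalized at both levels, which is the behaviour the hierarchy-modulated loss is designed to encode.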

4. Pre-training Data, Curation, and Optimization

HELM is pre-trained using mRNA sequences from the Observed Antibody Space (OAS), comprising approximately 15.3 million curated mRNA sequences (7.7M heavy chain, 7.6M light chain). Rigorous data curation includes ANARCI-based filtering, restriction to productive and complete VDJ recombination events, redundancy removal (≤50% sequence identity), and class balance maintenance. Training proceeds for 40 epochs on 8 × A100 GPUs using both masked (MLM) and causal (CLM) language modeling objectives, with AdamW optimization (learning rates: 1e-3 for XE, 1e-4 for HXE) and linear warmup plus cosine decay (Yazdani-Jahromi et al., 2024).
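A sketch of the stated optimization recipe in PyTorch (AdamW with linear warmup followed by cosine decay); the warmup length and total step count are placeholders rather than values reported in the source.

```python
import math
import torch

def make_optimizer_and_scheduler(model, lr=1e-4, warmup_steps=1_000, total_steps=100_000):
    """AdamW + linear warmup + cosine decay, as described for HELM pre-training."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # 1e-3 for XE, 1e-4 for HXE

    def lr_lambda(step):
        if step < warmup_steps:                                # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))      # cosine decay to zero

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```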

5. Downstream Performance and Quantitative Results

Evaluation includes property prediction (via frozen-backbone TextCNN probe, reporting Spearman’s $\rho$) and region annotation (prediction accuracy on antibody boundary detection). On seven property prediction benchmarks and an antibody annotation task, HELM achieves an average 8% improvement over standard cross-entropy pre-training.
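A hedged sketch of the frozen-backbone probing protocol: the backbone is kept frozen, a small probe maps pooled embeddings to the property, and Spearman’s $\rho$ is reported. The linear-style probe and the pooled-embedding interface here stand in for the TextCNN probe used in the paper.

```python
import torch
from scipy.stats import spearmanr

def evaluate_probe(backbone, probe, sequences, targets):
    """Score a trained probe on top of frozen backbone embeddings with Spearman's rho."""
    backbone.eval()
    with torch.no_grad():                 # the pre-trained backbone stays frozen
        feats = backbone(sequences)       # assumed to return (n, d) pooled embeddings
        preds = probe(feats).squeeze(-1)  # (n,) predicted property values
    rho, _ = spearmanr(preds.cpu().numpy(), targets.cpu().numpy())
    return rho
```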

Selected quantitative results are presented below:

| Model | Ab1 (ρ) | MLOS (ρ) | Tc-Rb (ρ) | mRFP (ρ) | Annotation (MLM, accuracy) |
|---|---|---|---|---|---|
| Transformer XE (MLM) | 0.748 | 0.653 | 0.569 | 0.753 | 78.68% |
| Transformer HELM (MLM) | 0.767 | 0.701 | 0.626 | 0.822 | 83.39% |

Gains are disproportionately higher on datasets exhibiting strong codon-usage bias (MLOS, Tc-Riboswitch, mRFP). In generative tasks such as context-conditional generation (quantified by Fréchet Biological Distance), HELM produces codon distributions that more closely match empirical data and lowers the mean squared error of property prediction by 2–31% relative to standard XE, especially for codon-biased targets (Yazdani-Jahromi et al., 2024).

6. Model and Training Ablations: Insights and Robustness

Ablation studies reveal that codon-level representation offers superior performance over nucleotide and 6-mer tokenizations in 5/6 property prediction tasks. The weighting hyperparameter $\alpha$ controls the trade-off between penalizing within- versus across-amino-acid errors; $\alpha = 0.2$–$0.4$ yields optimal generalization, while higher values degrade cross-amino-acid discrimination.

Scaling experiments show diminishing returns beyond 50 million parameters. Clustering of sequence representations (measured via k-means Silhouette scores) indicates that HELM embeddings cluster synonymous codon variants more tightly than XE baselines (e.g., XE-MLM 0.74 → HELM-MLM 0.91), demonstrating that hierarchy-aware loss enhances internalization of codon synonymity (Yazdani-Jahromi et al., 2024).
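For reference, the clustering analysis can be reproduced in outline with scikit-learn (a sketch; the embeddings, cluster count, and pooling are placeholders, not the authors' exact protocol):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def clustering_quality(embeddings: np.ndarray, n_clusters: int = 20) -> float:
    """k-means Silhouette score of model embeddings (higher = tighter, better-separated clusters)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    return silhouette_score(embeddings, labels)
```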

7. Extensions and Comparative Methods

HyperHELM generalizes these ideas by embedding codon hierarchies within hyperbolic geometry, employing the Poincaré ball model with codon prototypes that explicitly reflect the tree structure. HyperHELM outperforms Euclidean HELM by an average of 10% on multi-species property prediction tasks and shows greater robustness to out-of-distribution shifts in sequence length and base composition. Annotation accuracy also increases by 3 points relative to HELM (Spengler et al., 29 Sep 2025).
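For intuition, the sketch below computes distances in the Poincaré ball model (curvature -1); it is illustrative only and not the HyperHELM implementation.

```python
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_dist = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_dist / denom))

root = np.zeros(2)                    # hierarchy root placed near the origin
leaf = np.array([0.0, 0.95])          # leaves pushed towards the boundary
print(poincare_distance(root, leaf))  # ~3.66, far larger than the Euclidean gap of 0.95
```

Points near the boundary become exponentially far apart, which is why tree-like structures such as the codon hierarchy can embed with low distortion in hyperbolic space.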

A plausible implication is that the hyperbolic geometric inductive bias allows for more faithful modeling of biological sequence hierarchies, especially when data are tree-structured or otherwise hierarchically organized. Current limitations include fixed prototype embeddings and the restriction to Euclidean or single-curvature hyperbolic manifolds; extending to learnable or adaptive hierarchy representations and applying hierarchy encoding to other biomolecular domains may yield additional advances (Spengler et al., 29 Sep 2025).


In summary, HELM provides a codon-aware, hierarchy-modulated pre-training approach for mRNA language modeling that achieves consistent improvements on biologically motivated predictive and generative tasks, especially in codon-usage biased contexts, and establishes a foundation for future work leveraging advanced geometric and hierarchical priors in biological sequence modeling (Yazdani-Jahromi et al., 2024, Spengler et al., 29 Sep 2025).
