HELM: Hierarchical mRNA Language Modeling
- The paper introduces HELM, which employs codon-level hierarchical encoding and a modulated loss function to better capture biochemical relationships in mRNA sequences.
- HELM systematically outperforms standard cross-entropy approaches, delivering an average 8% improvement in predictive accuracy and tighter clustering of synonymous codon variants.
- The framework’s extension into hyperbolic geometry (HyperHELM) further enhances modeling robustness and fidelity, supporting diverse biological tasks and sequence properties.
Hierarchical Encoding for mRNA Language Modeling (HELM) is a methodological framework for pre-training LLMs on mRNA sequences that explicitly incorporates the biological hierarchy of codon structure. Standard language modeling approaches applied to biological sequences treat nucleotide or k-mer tokens without regard to codon synonymity, failing to reflect the biochemical relationships underpinning protein synthesis. HELM addresses this limitation by encoding mRNA at the codon level and introducing a hierarchy-modulated loss function that aligns the model’s learning process with the biological roles of mRNA sequences. This strategy results in systematic improvements on both property prediction and generative synthesis of mRNA, especially for tasks sensitive to synonymous codon usage (Yazdani-Jahromi et al., 2024).
1. Codon-Level Hierarchical Representation
mRNA sequences are most fundamentally interpreted by the translational machinery in triplets of nucleotides, termed codons, corresponding to 64 possible tokens (including start, stop, and special symbols). HELM formalizes this biological hierarchy as a tree. The root node represents all codons and splits first into "non-coding" (start, stop) and "coding" partitions. The coding partition further branches into amino acid nodes, each associated with a set of synonymous codon leaves.
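The hierarchy described above can be materialized directly from the standard genetic code. The sketch below is an illustrative construction, not code from the paper; the compact codon table and the handling of AUG (which encodes both methionine and the canonical start) are assumptions made for clarity.

```python
# Minimal sketch of the codon -> amino acid -> {coding, non-coding} -> root hierarchy.
# The genetic-code table and the treatment of AUG are illustrative choices only.
from collections import defaultdict

BASES = "UCAG"
# Standard genetic code, one letter per codon in U/C/A/G x U/C/A/G x U/C/A/G order; '*' = stop.
AA_STRING = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TO_AA = {
    a + b + c: AA_STRING[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def build_codon_hierarchy():
    """Group the 64 codons into the two-level tree used by the hierarchy-aware loss."""
    tree = {"coding": defaultdict(list), "non-coding": defaultdict(list)}
    for codon, aa in CODON_TO_AA.items():
        if aa == "*":
            tree["non-coding"]["stop"].append(codon)   # UAA, UAG, UGA
        else:
            tree["coding"][aa].append(codon)           # synonymous codon leaves per amino acid
    tree["non-coding"]["start"].append("AUG")          # AUG also encodes methionine
    return tree

hierarchy = build_codon_hierarchy()
print(hierarchy["coding"]["L"])    # the six synonymous leucine codons
```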
Tokenization experiments compared nucleotide-level (vocabulary = 4), 6-mer (vocabulary ≈ 4⁶), and codon-level (vocabulary = 64) representations. Codon-level tokenization outperformed others on 5/6 evaluated tasks, confirming the codon as the optimal modeling unit for biological property prediction and generative modeling of mRNA (Yazdani-Jahromi et al., 2024).
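For concreteness, a codon-level tokenizer along these lines can be written in a few lines. The sketch below is illustrative; the special-token inventory and names are assumptions rather than HELM's released vocabulary.

```python
# Illustrative codon-level tokenizer: 64 codon tokens plus assumed special symbols.
BASES = "UCAG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]   # 64 codons
SPECIAL = ["<pad>", "<unk>", "<bos>", "<eos>"]                       # assumed special tokens
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + CODONS)}

def tokenize_codons(mrna: str) -> list[int]:
    """Split an mRNA string into codon triplets and map them to vocabulary ids."""
    seq = mrna.upper().replace("T", "U")                             # accept DNA-style input
    ids = [VOCAB["<bos>"]]
    for i in range(0, len(seq) - len(seq) % 3, 3):
        ids.append(VOCAB.get(seq[i:i + 3], VOCAB["<unk>"]))
    ids.append(VOCAB["<eos>"])
    return ids

print(tokenize_codons("AUGGCUUAA"))    # <bos>, AUG, GCU, UAA, <eos>
```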
2. Model Architectures and Hierarchy Integration
All HELM implementations employ models with 50 million parameters using codon tokenization. Three architectural baselines were evaluated: a Transformer (GPT-2 style: 10 layers, hidden size 640, sequence length up to 2048 codons), Hyena (7 layers, hidden size 768), and Mamba (40 layers, hidden size 256). The Transformer architecture was selected due to superior downstream performance.
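A model in this regime can be approximated with Hugging Face's GPT2Config. In the sketch below, only the depth, width, and context length follow the numbers reported above; the head count and vocabulary size are assumptions, and the resulting parameter count is only roughly in the 50M range.

```python
# Rough sketch of the 50M-parameter Transformer baseline using Hugging Face's GPT-2 classes.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=68,      # 64 codons + assumed special tokens
    n_positions=2048,   # up to 2048 codons of context
    n_embd=640,         # hidden size
    n_layer=10,         # Transformer layers
    n_head=10,          # assumption: 64-dimensional attention heads
)
model = GPT2LMHeadModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```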
Crucially, the biological hierarchy is not embedded directly within network layers. Instead, HELM introduces hierarchy through a modulated loss function that dynamically weights errors according to the codon synonymity structure, thereby externalizing the biological inductive bias to the training objective (Yazdani-Jahromi et al., 2024).
3. Hierarchically Modulated Loss Function
HELM’s core innovation is the hierarchical cross-entropy (HXE) loss, designed to respect codon–amino acid relationships. For a sequence with codon $c_t$ at position $t$, the model computes probabilities along the path $c_t = C^{(0)}, C^{(1)}, \ldots, C^{(h)} = R$ from $c_t$ (leaf) up to the root $R$. The HXE loss at position $t$ is

$$\mathcal{L}_{\mathrm{HXE}}(c_t) = -\sum_{l=0}^{h-1} \lambda\!\left(C^{(l)}\right)\, \log p\!\left(C^{(l)} \mid C^{(l+1)}\right),$$

where $p(C^{(l)} \mid C^{(l+1)})$ is the probability of node $C^{(l)}$ conditioned on its parent and $\lambda(\cdot)$ weights each level of the hierarchy. The weighting factor is $\lambda(C) = \exp\!\left(-\alpha\, h(C)\right)$, with $h(C)$ the height of node $C$ in the tree and $\alpha > 0$. A flattened form simplifies to

$$\mathcal{L}_{\mathrm{HXE}} = \sum_{l=0}^{h-1} \omega_l\, \mathcal{L}^{(l)}_{\mathrm{XE}},$$

with the level weights $\omega_l$ encoding codon-synonymity bias and $\mathcal{L}^{(0)}_{\mathrm{XE}}$ the standard cross-entropy loss over codon leaves (higher-level terms are cross-entropies over aggregated node probabilities).
This hierarchical weighting allows the model to penalize mispredictions proportionally to their biological nonsynonymity (e.g., synonym misclassifications are penalized less than incorrect amino acid assignments), aligning the learning objective with translational fidelity (Yazdani-Jahromi et al., 2024).
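Under the formulation above, a two-level version of the loss (codon leaves under amino-acid parents) can be sketched in PyTorch as follows. This is an illustrative implementation, not the authors' released code; setting $\alpha = 0$ recovers standard cross-entropy, and the full HELM tree additionally includes the coding/non-coding level.

```python
# Two-level hierarchical cross-entropy sketch: weight exp(-alpha * height) per level,
# with codon leaves at height 0 and amino-acid parents at height 1. Illustrative only.
import math
import torch
import torch.nn.functional as F

def hxe_loss(logits, targets, codon_to_aa, num_aa, alpha=0.3):
    """
    logits:      (batch, 64) unnormalized scores over the codon vocabulary
    targets:     (batch,) gold codon indices (long)
    codon_to_aa: (64,) long tensor mapping codon index -> amino-acid/group index
    """
    log_p_codon = F.log_softmax(logits, dim=-1)
    p_codon = log_p_codon.exp()

    # p(aa) = sum of the probabilities of its synonymous codons
    p_aa = torch.zeros(logits.size(0), num_aa, dtype=p_codon.dtype, device=logits.device)
    p_aa.scatter_add_(1, codon_to_aa.expand_as(p_codon), p_codon)

    aa_targets = codon_to_aa[targets]
    log_p_aa = (p_aa.gather(1, aa_targets.unsqueeze(1)).squeeze(1) + 1e-12).log()
    # p(codon | aa) = p(codon) / p(aa)  ->  log p(codon | aa) = log p(codon) - log p(aa)
    log_p_codon_given_aa = log_p_codon.gather(1, targets.unsqueeze(1)).squeeze(1) - log_p_aa

    w_leaf = math.exp(-alpha * 0)   # lambda at the codon leaves (height 0)
    w_aa = math.exp(-alpha * 1)     # lambda at the amino-acid level (height 1)
    return -(w_leaf * log_p_codon_given_aa + w_aa * log_p_aa).mean()

# Toy usage: 4 positions, 64-codon vocabulary grouped into 21 classes (20 amino acids + stop).
logits = torch.randn(4, 64)
targets = torch.randint(0, 64, (4,))
codon_to_aa = torch.randint(0, 21, (64,))
print(hxe_loss(logits, targets, codon_to_aa, num_aa=21))
```

Because a cross-amino-acid misprediction hurts both conditional terms while a synonym misprediction hurts only the leaf term, the former is always penalized more; larger $\alpha$ shrinks that extra penalty.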
4. Pre-training Data, Curation, and Optimization
HELM is pre-trained using mRNA sequences from the Observed Antibody Space (OAS), comprising approximately 15.3 million curated mRNA sequences (7.7M heavy chain, 7.6M light chain). Rigorous data curation includes ANARCI-based filtering, restriction to productive and complete VDJ recombination events, redundancy removal (≤50% sequence identity), and class balance maintenance. Training proceeds for 40 epochs on 8 × A100 GPUs using both masked (MLM) and causal (CLM) language modeling objectives, with AdamW optimization (learning rates: 1e-3 for XE, 1e-4 for HXE) and linear warmup plus cosine decay (Yazdani-Jahromi et al., 2024).
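The optimization recipe described above (AdamW with linear warmup into cosine decay) can be sketched as follows; the warmup length and total step count are placeholders, not values reported for HELM.

```python
# Sketch of the reported optimization recipe: AdamW, linear warmup, then cosine decay.
import math
import torch

def build_optimizer_and_scheduler(model, lr=1e-4, warmup_steps=1000, total_steps=100_000):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    def lr_lambda(step):
        if step < warmup_steps:                       # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay to zero

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

In a training loop, `scheduler.step()` is called once per optimizer step; the base learning rate would be 1e-3 for XE or 1e-4 for HXE as reported above.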
5. Downstream Performance and Quantitative Results
Evaluation includes property prediction (via a frozen-backbone TextCNN probe, reporting Spearman’s ρ) and region annotation (prediction accuracy on antibody boundary detection); a sketch of this probing setup follows the results below. Across seven property prediction benchmarks and an antibody annotation task, HELM achieves an average 8% improvement over standard cross-entropy pre-training.
Selected quantitative results are presented below:
| Model | Ab1 (ρ) | MLOS (ρ) | Tc-Riboswitch (ρ) | mRFP (ρ) | Annotation accuracy (MLM) |
|---|---|---|---|---|---|
| Transformer XE (MLM) | 0.748 | 0.653 | 0.569 | 0.753 | 78.68% |
| Transformer HELM (MLM) | 0.767 | 0.701 | 0.626 | 0.822 | 83.39% |
Gains are disproportionately higher on datasets exhibiting strong codon-usage bias (MLOS, Tc-Riboswitch, mRFP). In generative tasks, such as context-conditional generation (quantified by Fréchet Biological Distance), HELM produces codon distributions more closely matching empirical data and lowers mean squared error in property prediction (by 2–31% relative to standard XE), especially for codon-biased targets (Yazdani-Jahromi et al., 2024).
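As a concrete illustration of the probing protocol, the sketch below evaluates a small convolutional head on frozen backbone embeddings and scores it with Spearman's ρ. The probe's hyperparameters (channel count, kernel sizes) are assumptions, not the paper's exact TextCNN.

```python
# Hedged sketch of frozen-backbone probing: embeddings from the frozen pre-trained model
# feed a small TextCNN head; property prediction is reported as Spearman's rho.
import torch
import torch.nn as nn
from scipy.stats import spearmanr

class TextCNNProbe(nn.Module):
    def __init__(self, embed_dim=640, channels=128, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, channels, k, padding=k // 2) for k in kernel_sizes
        )
        self.head = nn.Linear(channels * len(kernel_sizes), 1)

    def forward(self, embeddings):                    # (batch, seq_len, embed_dim)
        x = embeddings.transpose(1, 2)                # Conv1d expects (batch, channels, seq_len)
        pooled = [conv(x).relu().amax(dim=-1) for conv in self.convs]   # max-pool over positions
        return self.head(torch.cat(pooled, dim=-1)).squeeze(-1)

def evaluate_probe(probe, frozen_embeddings, targets):
    """Spearman correlation between probe predictions and measured mRNA properties."""
    with torch.no_grad():
        preds = probe(frozen_embeddings)
    return spearmanr(preds.cpu().numpy(), targets.cpu().numpy()).correlation

# Toy usage with random data in place of real frozen embeddings and property labels.
probe = TextCNNProbe(embed_dim=640)
emb = torch.randn(8, 128, 640)   # 8 sequences x 128 codons x 640-dim embeddings
y = torch.randn(8)
print(evaluate_probe(probe, emb, y))
```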
6. Model and Training Ablations: Insights and Robustness
Ablation studies confirm that codon-level representation outperforms nucleotide and 6-mer tokenizations on 5/6 property prediction tasks. The weighting hyperparameter $\alpha$ controls the trade-off between penalizing within- versus across-amino-acid errors; values of $\alpha$ up to approximately $0.4$ yield optimal generalization, while higher values degrade cross-amino-acid discrimination.
Scaling experiments show diminishing returns beyond 50 million parameters. Clustering of sequence representations (measured via k-means Silhouette scores) indicates that HELM embeddings cluster synonymous codon variants more tightly than XE baselines (e.g., XE-MLM 0.74 → HELM-MLM 0.91), demonstrating that hierarchy-aware loss enhances internalization of codon synonymity (Yazdani-Jahromi et al., 2024).
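The clustering analysis can be reproduced in outline with scikit-learn, as sketched below; how sequences are grouped into synonymous-variant sets and the choice of the number of clusters are assumptions of this sketch.

```python
# Sketch of the synonymous-variant clustering analysis: embed sequences, cluster with k-means,
# and score cluster tightness with the silhouette coefficient (higher = tighter clusters).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def synonym_clustering_score(embeddings: np.ndarray, n_groups: int, seed: int = 0) -> float:
    """
    embeddings: (num_sequences, dim) pooled representations of synonymous sequence variants.
    n_groups:   number of underlying proteins / variant groups to recover.
    """
    labels = KMeans(n_clusters=n_groups, n_init=10, random_state=seed).fit_predict(embeddings)
    return silhouette_score(embeddings, labels)
```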
7. Extensions and Comparative Methods
HyperHELM generalizes these ideas by embedding the codon hierarchy in hyperbolic geometry, employing the Poincaré ball model with codon prototypes that explicitly reflect the tree structure. HyperHELM outperforms (Euclidean) HELM by an average of 10% on multi-species property prediction tasks and exhibits greater robustness to out-of-distribution shifts in sequence length and base composition. Annotation accuracy also increases by 3 points relative to HELM (Spengler et al., 29 Sep 2025).
A plausible implication is that the hyperbolic geometric inductive bias allows for more faithful modeling of biological sequence hierarchies, especially when data are tree-structured or otherwise hierarchically organized. Current limitations include fixed prototype embeddings and restriction to Euclidean or single-curvature hyperbolic manifolds; further extending learnable or adaptive hierarchy representations and applying hierarchy encoding across biomolecular domains may yield additional advances (Spengler et al., 29 Sep 2025).
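To make the geometric component concrete, the sketch below computes Poincaré-ball distances from sequence embeddings to fixed codon prototypes, the kind of hyperbolic prototype matching described above. The distance formula is the standard Poincaré metric at curvature −1; the prototype handling is illustrative, not HyperHELM's implementation.

```python
# Sketch of nearest-prototype assignment on the Poincaré ball (curvature -1).
import torch

def poincare_distance(u, v, eps=1e-6):
    """Geodesic distance between points u, v inside the unit Poincaré ball."""
    sq_diff = (u - v).pow(2).sum(dim=-1)
    denom = (1 - u.pow(2).sum(dim=-1)).clamp_min(eps) * (1 - v.pow(2).sum(dim=-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * sq_diff / denom)

def nearest_prototype(embeddings, prototypes):
    """Assign each embedding to the codon prototype with the smallest hyperbolic distance."""
    d = poincare_distance(embeddings.unsqueeze(1), prototypes.unsqueeze(0))  # (batch, num_codons)
    return d.argmin(dim=-1)

# Toy usage: 8 embeddings and 64 codon prototypes, scaled to lie inside the unit ball.
emb = torch.randn(8, 16) * 0.1
protos = torch.randn(64, 16) * 0.1
print(nearest_prototype(emb, protos))
```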
In summary, HELM provides a codon-aware, hierarchy-modulated pre-training approach for mRNA language modeling that achieves consistent improvements on biologically motivated predictive and generative tasks, especially in codon-usage biased contexts, and establishes a foundation for future work leveraging advanced geometric and hierarchical priors in biological sequence modeling (Yazdani-Jahromi et al., 2024, Spengler et al., 29 Sep 2025).