DNACHUNKER: Dynamic DNA Tokenization
- DNACHUNKER is a dynamic, learnable tokenization mechanism that segments DNA sequences into variable-length, content-adaptive chunks aligned with biological features.
- It utilizes a two-stage dynamic chunking router integrated with a Transformer backbone, enhancing efficiency and robustness against mutations.
- Empirical evaluations show improved performance on nucleotide benchmarks and variant prediction tasks, emphasizing its practical utility in genomics.
DNACHUNKER is a learnable, dynamic tokenization mechanism designed specifically for DNA LLMs, integrating biologically aware segmentation directly into the language modeling pipeline. It segments DNA sequences into variable-length, content-adaptive "chunks," in contrast with static, fixed-length k-mer or BPE-based approaches. This approach enables more efficient encoding of biological information, is robust to sequence mutations, and produces tokenizations that align with functionally significant genomic features (Kim et al., 6 Jan 2026).
1. Dynamic Chunking Mechanism
DNACHUNKER employs a two-stage dynamic chunking procedure inspired by H-Net, but adapted for bidirectional masked language modeling and tailored to biological sequence properties. For each stage $s \in \{0, 1\}$, the model computes, at every position $t$, a boundary probability $p_t^{(s)}$, applies a threshold $\tau$, and selects the positions where $p_t^{(s)} \ge \tau$ to delineate chunk boundaries. Following H-Net's routing module, the boundary probability is derived from the dissimilarity of adjacent projected states,

$$p_t^{(s)} = \tfrac{1}{2}\Big(1 - \cos\big(q_t^{(s)}, k_{t-1}^{(s)}\big)\Big), \qquad q_t^{(s)} = W_q^{(s)} x_t^{(s)}, \quad k_t^{(s)} = W_k^{(s)} x_t^{(s)},$$

with the boundary indicator $b_t^{(s)} = \mathbf{1}\!\left[p_t^{(s)} \ge \tau\right]$.
To enforce correct supervision on masked tokens, chunk boundaries are forced immediately before and after every masked base. The chunked representations are compressed through two consecutive stages, each controlled by an auxiliary "ratio loss" $\mathcal{L}_{\text{ratio}}^{(s)}$ that maintains the desired compression ratio $r^{(s)}$: it compares the fraction of selected boundary positions $F^{(s)}$ and the mean boundary probability $G^{(s)}$ against the target, and is minimized when both equal $r^{(s)}$. In the prescribed configuration, each stage targets $r^{(s)} = 0.5$, halving the sequence length twice before the main network.
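The sketch below illustrates one such chunking stage under the H-Net-style formulation above. It is a minimal sketch, not the authors' implementation: the projection names (`W_q`, `W_k`), the default threshold of 0.5, and the squared-error ratio penalty are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicChunkingRouter(nn.Module):
    """One stage of boundary prediction (sketch).

    Assumes H-Net-style cosine routing: p_t = 0.5 * (1 - cos(q_t, k_{t-1})).
    The squared-error ratio penalty below is a plausible surrogate for the
    paper's ratio loss, not the published formulation.
    """

    def __init__(self, dim: int, target_ratio: float = 0.5, threshold: float = 0.5):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.target_ratio = target_ratio   # desired fraction of positions kept as chunk starts
        self.threshold = threshold

    def forward(self, x, mask_positions=None):
        # x: (B, L, D) encoder states; mask_positions: (B, L) bool marking masked bases.
        q, k = self.W_q(x), self.W_k(x)
        cos = F.cosine_similarity(q[:, 1:], k[:, :-1], dim=-1)      # (B, L-1)
        p = torch.cat([torch.ones_like(cos[:, :1]),                 # position 0 is always a boundary
                       0.5 * (1.0 - cos)], dim=1)                   # (B, L)
        b = p >= self.threshold
        if mask_positions is not None:
            # Force chunk boundaries before and after every masked base.
            b = b | mask_positions                                  # boundary at the masked base
            b[:, 1:] = b[:, 1:] | mask_positions[:, :-1]            # boundary right after it
        # Surrogate ratio loss: push the selected fraction F and the mean
        # boundary probability G toward the target compression ratio.
        F_frac, G_mean = b.float().mean(), p.mean()
        ratio_loss = (F_frac - self.target_ratio) ** 2 + (G_mean - self.target_ratio) ** 2
        return p, b, ratio_loss
```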
2. Model Architecture and Training Objective
DNACHUNKER's architecture is partitioned into encoder, main network, and decoder components:
- Encoder: Two stages of 2-layer bidirectional Caduceus (BiMamba) blocks, each followed by the dynamic chunking router. Output embeddings are iteratively downsampled by selection of chunk boundaries.
- Main Network: An 8-layer Transformer (Pre-LayerNorm, 8 attention heads, Rotary Position Embeddings, hidden dimension 4096) processes the compressed representations.
- Decoder: Two cross-attention upsampler stages invert the compression, using the saved embeddings as queries, before a final 2-layer BiMamba and linear head to predict the masked base at each position.
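The overall composition can be sketched as follows, reusing the `DynamicChunkingRouter` above. This is a structural sketch only: the BiMamba/Caduceus blocks are replaced by a bidirectional GRU stand-in, RoPE is omitted, and batch size 1 is assumed because chunk selection yields variable-length sequences.

```python
import torch
import torch.nn as nn

class BiSSMBlock(nn.Module):
    """Stand-in for a 2-layer bidirectional Caduceus/BiMamba block (here: a bi-GRU)."""
    def __init__(self, dim: int):
        super().__init__()
        self.rnn = nn.GRU(dim, dim // 2, num_layers=2, batch_first=True, bidirectional=True)

    def forward(self, x):
        out, _ = self.rnn(x)
        return out

class CrossAttentionUpsampler(nn.Module):
    """Recover finer-resolution states by querying chunk states with the saved fine states."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, fine_states, chunk_states):
        out, _ = self.attn(query=fine_states, key=chunk_states, value=chunk_states)
        return out

class DNAChunkerSketch(nn.Module):
    """Encoder -> two chunking stages -> Transformer main network -> two upsamplers -> decoder."""
    def __init__(self, vocab_size=16, dim=1024, n_layers=8, n_heads=8, ffn_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.enc = nn.ModuleList([BiSSMBlock(dim) for _ in range(2)])
        self.routers = nn.ModuleList([DynamicChunkingRouter(dim) for _ in range(2)])
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim_feedforward=ffn_dim,
                                           batch_first=True, norm_first=True)  # Pre-LN; RoPE omitted
        self.main = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.up = nn.ModuleList([CrossAttentionUpsampler(dim, n_heads) for _ in range(2)])
        self.dec = BiSSMBlock(dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens, mask_positions):
        assert tokens.size(0) == 1, "sketch handles batch size 1 only"
        x = self.embed(tokens)                                # (1, L, D)
        saved, ratio_losses = [], []
        for enc, router in zip(self.enc, self.routers):
            x = enc(x)
            saved.append(x)                                   # keep fine states for upsampling
            p, b, r_loss = router(x, mask_positions)
            ratio_losses.append(r_loss)
            x = x[b].unsqueeze(0)                             # keep chunk-start positions only
            mask_positions = mask_positions[b].unsqueeze(0)   # carry mask flags to the next stage
        x = self.main(x)
        for up, fine in zip(self.up, reversed(saved)):
            x = up(fine, x)                                   # stage-1 resolution, then base resolution
        logits = self.head(self.dec(x))                       # (1, L, vocab_size)
        return logits, sum(ratio_losses)
```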
Training follows masked language modeling (MLM), masking 15% of bases according to standard BERT-style strategies and down-weighting repetitive regions. The total objective combines the MLM loss with the auxiliary ratio losses,

$$\mathcal{L} = \mathcal{L}_{\text{MLM}} + \alpha \sum_{s} \mathcal{L}_{\text{ratio}}^{(s)},$$

with the coefficient $\alpha$ effectively balancing chunking and prediction.
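A minimal sketch of this combined objective, assuming `-100` as the ignore index, a hypothetical `repeat_weight` tensor for repeat down-weighting, and an illustrative `alpha` value:

```python
import torch
import torch.nn.functional as F

def dnachunker_loss(logits, targets, mask_positions, ratio_losses, repeat_weight, alpha=0.01):
    """MLM cross-entropy on masked bases, down-weighted in repetitive regions,
    plus the auxiliary ratio losses (alpha is an assumed coefficient)."""
    # logits: (B, L, V); targets: (B, L) with -100 at unselected positions;
    # repeat_weight: (B, L) per-position weights (<1 inside repeat-masked regions).
    per_pos = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # (B, L)
    mlm = (per_pos * repeat_weight)[mask_positions].mean()
    return mlm + alpha * sum(ratio_losses)
```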
3. Implementation and Hyperparameters
DNACHUNKER is pretrained on the human reference genome (HG38), split into non-overlapping 1 Mb segments from which fixed-length training windows are drawn under a fixed batch token budget. Training uses the Adam optimizer with an initial learning rate decayed on a cosine schedule. Key model statistics:
| Parameter | Value |
|---|---|
| Total parameters | 156M |
| Embedding dimension | 1024 |
| Encoder/Decoder | 2-layer BiMamba |
| Transformer blocks | 8 |
| Compression targets | 0.5, 0.5 (stages 0,1) |
Masking regime is 80% [MASK], 10% random base, 10% unchanged. The vocabulary comprises 16 symbols (A, C, G, T, plus special tokens).
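A minimal sketch of this 80/10/10 masking regime; the token ids (`MASK_ID = 4`, bases at ids 0–3) are assumptions about the 16-symbol vocabulary, not the released mapping.

```python
import torch

MASK_ID = 4          # assumed id of the mask symbol in the 16-token vocabulary
NUM_BASES = 4        # A, C, G, T assumed to occupy ids 0-3

def mask_dna(tokens, mask_rate=0.15):
    """BERT-style masking: of the selected 15% of bases,
    80% become the mask token, 10% a random base, 10% stay unchanged."""
    tokens, targets = tokens.clone(), tokens.clone()
    selected = torch.rand(tokens.shape) < mask_rate
    targets[~selected] = -100                                  # ignore index for the loss
    roll = torch.rand(tokens.shape)
    to_mask = selected & (roll < 0.8)
    to_random = selected & (roll >= 0.8) & (roll < 0.9)        # remaining 10% stay unchanged
    tokens[to_mask] = MASK_ID
    tokens[to_random] = torch.randint(0, NUM_BASES, (int(to_random.sum()),))
    return tokens, targets, selected
```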
4. Empirical Evaluation
DNACHUNKER demonstrates competitive or superior performance across biological sequence benchmarks:
- Nucleotide Transformer Benchmark (18 tasks, MCC, 10-fold CV):
- Histone markers: 0.701 (vs 0.625, Generator 1.2B)
- Regulatory: 0.796 (vs 0.786)
- Splice sites: 0.965 (vs 0.979)
- Total avg. MCC: 0.772 (vs 0.728)
- Total avg. rank: 1.67 (best)
- Genomic Benchmarks (8 tasks):
- Avg. accuracy: 0.879 (second best)
- Avg. rank: 2.19 (best)
- Robustness to Mutations (ClinVar, similarity metric):
| Mutation | BPE tokenizer | DNACHUNKER stage 1 | DNACHUNKER stage 2 |
|---|---|---|---|
| SNV (benign) | 0.9993 | 0.9987 | 0.9940 |
| InDel | 0.7506 | 0.8512 | 0.7932 |
DNACHUNKER exhibits pronounced robustness under insertions and deletions, attributed to the fact that such edits perturb the dynamic chunk boundaries only locally rather than shifting the entire downstream tokenization.
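The summary does not spell out the similarity metric; the sketch below assumes cosine similarity between mean-pooled hidden states of the reference and mutated windows, with `embed_sequence` as a hypothetical hook returning per-base embeddings.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mutation_similarity(model, ref_tokens, alt_tokens):
    """Assumed protocol: cosine similarity of mean-pooled embeddings of the
    reference window and the same window carrying the ClinVar variant."""
    h_ref = model.embed_sequence(ref_tokens).mean(dim=1)   # embed_sequence: hypothetical hook, (B, L, D) -> (B, D) after pooling
    h_alt = model.embed_sequence(alt_tokens).mean(dim=1)
    return F.cosine_similarity(h_ref, h_alt, dim=-1)       # values near 1.0 indicate a stable representation
```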
5. Tokenization Behavior and Biological Relevance
Ablation and qualitative analysis reveal that DNACHUNKER's learned token boundaries correlate with biologically significant sequence features:
- Stage 1 chunk lengths: 4–320 bp; Stage 2: 16–1024 bp.
- Functional regions (promoters, exons, splice sites): tokenized into small chunks (≈10–20 bp).
- Repetitive/non-functional regions: larger chunks (≈100–300 bp).
- Compared to uniform BPE segmentation (10–12 bp), the adaptivity preserves detail where needed and compresses redundancy elsewhere.
DNACHUNKER's dynamic routing ensures that key regulatory motifs and high-information elements are processed at higher resolution, aligning well with known regulatory and structural "grammar" of the genome.
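Chunk-length statistics of this kind can be recovered directly from the boundary indicators; a small helper sketch, with boundaries in the boolean format used by the router sketch above:

```python
import torch

def chunk_lengths(boundaries):
    """Chunk lengths from a boolean boundary mask (True marks a chunk start).

    Example: [1,0,0,1,0,1] over a length-6 sequence -> chunks of length 3, 2, 1.
    """
    lengths = []
    for b in boundaries:                                        # b: (L,) bool, b[0] expected True
        starts = torch.nonzero(b, as_tuple=False).flatten()
        ends = torch.cat([starts[1:], torch.tensor([b.numel()])])
        lengths.append((ends - starts).tolist())
    return lengths
```

Intersecting these per-chunk intervals with genome annotations (promoters, exons, repeats) is one way to reproduce the small-chunk/large-chunk pattern reported above.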
6. Biological and Practical Implications
Adaptive chunking in DNACHUNKER yields several advantages:
- Biological interpretability: Chunks tend to align with known genetic elements, facilitating extraction of biologically relevant motifs and enhancing model explainability.
- Robustness: Localized impact of single-base insertions or deletions limits spurious merging/splitting of semantic units, ensuring model stability.
- Downstream tasks: Improved accuracy and consistency on variant effect prediction, long-range regulatory modeling, and generative tasks such as CRISPR guide design.
- Cross-species applications: Possible transfer to comparative genomics, where model-learned tokens may highlight conserved "words" across genomes.
A plausible implication is that learnable tokenization may advance unsupervised discovery of DNA grammar and contribute to hypothesis generation regarding the sequence-function relationship (Kim et al., 6 Jan 2026).
7. Implementation Considerations and Reproducibility
A complete DNACHUNKER system requires:
- Two dynamic chunking routers implementing the boundary equations and auxiliary losses.
- Bidirectional Caduceus (BiMamba) encoder and corresponding decoder.
- 8-layer Transformer backbone with RoPE positional encoding and cross-attention upsamplers.
- Combined masked language modeling plus compression ratio supervision.
All code, hyperparameter tables, and data preprocessing details are released in the authors' repository. The system—as reported—offers a compact 156M-parameter LLM, with empirical tokenization and benchmark behavior reproducible on standard genomic datasets (Kim et al., 6 Jan 2026).
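As a rough orientation for reproduction, a single training step wiring the earlier sketches together; the composition and the `alpha` value are assumptions, not the authors' training loop.

```python
def train_step(model, batch_tokens, repeat_weight, optimizer):
    """One MLM step combining the sketches above: mask, forward through the
    chunked model, apply the combined loss, and update the parameters."""
    tokens, targets, mask_positions = mask_dna(batch_tokens)
    logits, ratio_loss_sum = model(tokens, mask_positions)
    loss = dnachunker_loss(logits, targets, mask_positions,
                           [ratio_loss_sum], repeat_weight, alpha=0.01)  # assumed coefficient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```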