
Species-Aware DNABERT-S

Updated 26 February 2026
  • The paper introduces a framework that develops species-aware DNA embeddings through a two-phase curriculum contrastive learning strategy.
  • It integrates Weighted SimCLR and MI-Mix to interpolate latent representations and create challenging virtual anchors for fine-grained species differentiation.
  • Empirical results demonstrate state-of-the-art performance in unsupervised clustering, few-shot classification, and metagenomic binning across diverse genomic benchmarks.

Species-Aware DNABERT-S is a genome modeling framework designed to develop species-aware embeddings that facilitate natural clustering and segregation of DNA sequences from different species in embedding space. Built on the DNABERT-2 genome foundation model, DNABERT-S incorporates two key contrastive learning strategies—Weighted SimCLR and Manifold Instance Mixup (MI-Mix)—organized in a curriculum contrastive learning regimen. This configuration enables unsupervised differentiation of species in genomic datasets, particularly addressing challenges posed by unknown or uncharacterized species for which reference genomes are unavailable. The model demonstrates empirical superiority in unsupervised clustering, few-shot classification, and metagenomic binning, achieving state-of-the-art results across highly diverse species benchmarks (Zhou et al., 2024).

1. Model Architecture and Input Representation

DNABERT-S adopts the DNABERT-2 Transformer-based encoder as its backbone. This encoder is pre-trained on large-scale, unlabeled DNA corpora. Each DNA input sequence of length $L$ is split into tokens by DNABERT-2's Byte Pair Encoding (BPE) tokenizer, which replaced the overlapping $k$-mer scheme of the original DNABERT ($k=6$ typical, yielding $L-k+1$ tokens). Each token is embedded into a $d$-dimensional space ($d=768$) and processed by $M=12$ self-attention layers. Whereas DNABERT-2 is pre-trained with a masked language modeling head, DNABERT-S removes this head and instead appends a mean-pooling layer over the final token embeddings. The output $f(x) \in \mathbb{R}^d$, the mean of the last-layer hidden states, serves as the fixed-size, species-aware embedding for downstream tasks.
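The mean-pooling step can be illustrated with a minimal sketch. It assumes the last-layer hidden states are available as a plain list of token vectors; the function name `mean_pool` and the toy dimensions are illustrative and not taken from the DNABERT-S code base. Padding masks, which a batched implementation would need, are omitted.

```python
from typing import List


def mean_pool(hidden_states: List[List[float]]) -> List[float]:
    """Average token vectors of shape [T, d] into one d-dimensional embedding."""
    if not hidden_states:
        raise ValueError("need at least one token vector")
    num_tokens = len(hidden_states)
    dim = len(hidden_states[0])
    # Average each coordinate over all tokens of the sequence.
    return [sum(tok[j] for tok in hidden_states) / num_tokens for j in range(dim)]


# Toy example: 3 tokens with d = 4 (the model described above uses d = 768).
last_layer = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
embedding = mean_pool(last_layer)
print(embedding)  # [2.0, 1.0, 2.0, 2.0]
```

Because the average is taken over the token axis, the embedding has the same size regardless of how many tokens the input sequence produces.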

2. Manifold Instance Mixup (MI-Mix)

MI-Mix is central to the learning curriculum’s second phase, generating more challenging virtual anchors by interpolating the latent representations of DNA sequences at randomly selected transformer layers. Given a batch $\{x_i\}$, for each instance:

  1. A layer $m$ is sampled from the set of eligible transformer layers $S$.
  2. Hidden states $H_i = g_m(x_i)$ are computed after layer $m$.
  3. The batch is randomly permuted to produce paired representations $\widehat{H}_i$ and corresponding one-hot virtual labels $\widehat{v}_i$, and mixing coefficients $\lambda_i \sim \mathrm{Beta}(\alpha, \alpha)$ are sampled.
  4. Mixed hidden states $h_i^m = \lambda_i H_i + (1-\lambda_i)\widehat{H}_i$ and label mixtures $v_i^{mix} = \lambda_i v_i + (1-\lambda_i)\widehat{v}_i$ are formed.
  5. Final embeddings $z_i = f_m(h_i^m)$ are computed using the remaining transformer layers.
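Steps 3 and 4 can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the authors' implementation: each hidden state is flattened to a single vector (in the model it would be a token-by-dimension matrix), and the names `mi_mix` and `lerp` are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Vector = List[float]


def lerp(a: Vector, b: Vector, lam: float) -> Vector:
    """Return lam * a + (1 - lam) * b, element by element."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def mi_mix(hidden: List[Vector], alpha: float = 1.0,
           rng: Optional[random.Random] = None) -> Tuple[List[Vector], List[Vector]]:
    """Mix each hidden state and its one-hot label with a randomly paired instance."""
    rng = rng or random.Random()
    batch = len(hidden)
    perm = list(range(batch))
    rng.shuffle(perm)  # random pairing of instances within the batch
    mixed_hidden, mixed_labels = [], []
    for i in range(batch):
        partner = perm[i]
        lam = rng.betavariate(alpha, alpha)  # lambda_i ~ Beta(alpha, alpha)
        mixed_hidden.append(lerp(hidden[i], hidden[partner], lam))
        one_hot_i = [1.0 if n == i else 0.0 for n in range(batch)]
        one_hot_p = [1.0 if n == partner else 0.0 for n in range(batch)]
        # The label is mixed with the same coefficient as the hidden state.
        mixed_labels.append(lerp(one_hot_i, one_hot_p, lam))
    return mixed_hidden, mixed_labels


hidden_batch = [[0.0, 0.0], [1.0, 1.0], [2.0, 4.0]]
mixed_h, mixed_v = mi_mix(hidden_batch, alpha=1.0, rng=random.Random(0))
```

Each mixed label still sums to 1, so it can be read as a distribution over which instances contributed to the virtual anchor.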

The objective for each mixed anchor applies a weighted contrastive loss:

$$\ell(z_i, v_i^{mix}) = -\sum_n v_i^{mix}[n] \cdot \log \frac{\exp(s(z_i, f(x_{n^+}))/\tau)}{\sum_j \alpha_{ij^+} \exp(s(z_i, f(x_{j^+}))/\tau)}$$

Here, $s(\cdot, \cdot)$ denotes cosine similarity, $\tau$ is a temperature parameter, and $\alpha_{ij^+}$ is a hard-negative sampling weight. MI-Mix encourages the model to accurately recognize and differentiate “partial” mixtures of species-specific features, leading to sharper discrimination in the embedding space.
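To make the loss concrete, the following sketch evaluates it for a single mixed anchor. It is a simplified reading of the formula above: the weights default to 1 because the source does not spell out how the hard-negative weights are computed, and the function names are illustrative.

```python
import math
from typing import List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity s(a, b)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def mixed_contrastive_loss(z: List[float], positives: List[List[float]],
                           v_mix: List[float], tau: float = 0.05,
                           weights: Optional[List[float]] = None) -> float:
    """Loss for one anchor z against the embeddings f(x_n^+) with soft labels v_mix."""
    if weights is None:
        weights = [1.0] * len(positives)  # assumption: uniform weights
    logits = [cosine(z, p) / tau for p in positives]
    # Weighted log-sum-exp for the denominator, shifted for numerical stability.
    top = max(logits)
    log_denom = top + math.log(sum(w * math.exp(l - top) for w, l in zip(weights, logits)))
    return -sum(v * (l - log_denom) for v, l in zip(v_mix, logits))


positives = [[1.0, 0.0], [0.0, 1.0]]
z = [0.7, 0.3]  # anchor lying between the two positives
loss_first = mixed_contrastive_loss(z, positives, [1.0, 0.0])
loss_second = mixed_contrastive_loss(z, positives, [0.0, 1.0])
loss_mixed = mixed_contrastive_loss(z, positives, [0.7, 0.3])
```

The loss is linear in the label vector, so the mixed-label loss equals the same convex combination of the two one-hot losses.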

Pseudocode sketch:

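for each batch {(x_i, x_i^+)}:
    sample layer m from S
    H_i = g_m(x_i)                      # anchor hidden states after layer m
    H_i^+ = g_m(x_i^+)                  # positive-partner hidden states after layer m
    create one-hot labels v_i; permute the batch to get Ĥ_i, v̂_i
    λ_i ~ Beta(α, α)
    h_i^m = λ_i*H_i + (1-λ_i)*Ĥ_i       # mix hidden states
    v_i^{mix} = λ_i*v_i + (1-λ_i)*v̂_i   # mix labels with the same λ_i
    z_i = f_m(h_i^m)                    # run the remaining layers
    compute ℓ(z_i, v_i^{mix}) with weighted negatives
    backpropagate L_{MI-Mix}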

3. Curriculum Contrastive Learning (C²LR)

Curriculum Contrastive Learning (C²LR) structures training as two sequential phases of increasing difficulty:

  • Phase I: Weighted SimCLR (Easy Anchors). Applies standard contrastive learning to each $x_i$ and its non-overlapping partner $x_i^+$ as positives, with all other $2B-2$ in-batch sequences as negatives. The weighted loss function is:

$$\ell(f(x_i), v_i) = -\sum_n v_i[n] \log \frac{\exp(s(f(x_i), f(x_n))/\tau)}{\sum_{j\neq i} \alpha_{ij} \exp(s(f(x_i), f(x_j))/\tau)}$$

The batch loss is $L_{WSimCLR} = \frac{1}{2B}\sum_i [\ell(f(x_i), v_i) + \ell(f(x_i^+), v_i)]$.
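As a worked simplification, which the source does not state explicitly: if $v_i$ is one-hot with its single nonzero entry at the positive partner $x_i^+$, the sum over $n$ collapses to one term. If, in addition, every weight is set to $\alpha_{ij} = 1$, the expression reduces to the familiar unweighted SimCLR (NT-Xent) loss, whose denominator runs over the positive and the $2B-2$ negatives.

```latex
% Step 1: one-hot v_i keeps only the term n = i^+
\ell(f(x_i), v_i)
  = -\log \frac{\exp\big(s(f(x_i), f(x_i^+))/\tau\big)}
               {\sum_{j \neq i} \alpha_{ij}\, \exp\big(s(f(x_i), f(x_j))/\tau\big)}

% Step 2: uniform weights \alpha_{ij} = 1 give the unweighted contrastive loss
\ell(f(x_i), v_i)\Big|_{\alpha_{ij}=1}
  = -\log \frac{\exp\big(s(f(x_i), f(x_i^+))/\tau\big)}
               {\sum_{j \neq i} \exp\big(s(f(x_i), f(x_j))/\tau\big)}
```

Weights larger than 1 on the negatives most similar to the anchor enlarge the denominator and therefore penalize hard negatives more strongly than the unweighted loss does.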

  • Phase II: MI-Mix (Hard Anchors). The MI-Mix loss $L_{MI-Mix}$ is applied as described above.

Scheduling consists of one epoch of Weighted SimCLR, followed by two epochs of MI-Mix, using $\tau = 0.05$ and $\alpha = 1.0$ as temperature and mixing parameters, respectively.
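The schedule amounts to a simple mapping from epoch index to objective. The sketch below encodes it with hypothetical names (`phase_for_epoch`, `TAU`, `BETA_ALPHA`); only the epoch counts and the two hyperparameter values come from the text above.

```python
TAU = 0.05        # temperature used in both phases
BETA_ALPHA = 1.0  # Beta(alpha, alpha) mixing parameter for MI-Mix


def phase_for_epoch(epoch: int, simclr_epochs: int = 1, mimix_epochs: int = 2) -> str:
    """Return which objective is active in a given zero-based epoch."""
    if not 0 <= epoch < simclr_epochs + mimix_epochs:
        raise ValueError("epoch outside the curriculum")
    # Easy anchors first, hard (mixed) anchors afterwards.
    return "weighted_simclr" if epoch < simclr_epochs else "mi_mix"


schedule = [phase_for_epoch(e) for e in range(3)]
print(schedule)  # ['weighted_simclr', 'mi_mix', 'mi_mix']
```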

4. Training Protocol and Implementation

DNABERT-S is trained on 2 million pairs of non-overlapping 10 kbp DNA sequences drawn from GenBank genomes of 6,402 bacteria, 5,011 fungi, and 17,636 viruses (29,049 genomes in total). The transformer backbone is initialized from the DNABERT-2 checkpoint (available via Hugging Face). Optimization is performed with Adam (learning rate $3 \times 10^{-6}$, batch size 48), using mean-pooled last-layer token states as output embeddings. Training uses 8 NVIDIA A100 GPUs for approximately 48 hours. Model checkpoints are saved every 10,000 steps, and the best one is selected by validation loss.
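For a rough sense of scale, the figures above imply the step counts computed below. The calculation rests on assumptions the source does not confirm: that the batch size of 48 counts sequence pairs across all GPUs, and that training runs for three epochs in total (one of Weighted SimCLR plus two of MI-Mix).

```python
import math

NUM_PAIRS = 2_000_000      # training pairs, from the text
BATCH_SIZE = 48            # assumed to be the global batch size, in pairs
EPOCHS = 3                 # assumed: 1 Weighted SimCLR + 2 MI-Mix
CHECKPOINT_EVERY = 10_000  # steps between checkpoints, from the text

steps_per_epoch = math.ceil(NUM_PAIRS / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
num_checkpoints = total_steps // CHECKPOINT_EVERY

print(steps_per_epoch, total_steps, num_checkpoints)  # 41667 125001 12
```

Under these assumptions, about a dozen checkpoints would be compared on validation loss.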

5. Empirical Evaluation

DNABERT-S is evaluated on 18 datasets spanning three primary tasks: unsupervised clustering, few-shot species classification, and metagenomic binning. Benchmark baselines include TNF, TNF-K, TNF-VAE, DNA2Vec, DNABERT-2, HyenaDNA, Nucleotide Transformer, and contrastive learning variants.

Summary of empirical findings:

| Task | Baseline Performance | DNABERT-S Performance |
|---|---|---|
| Clustering (ARI) | TNF/TNF-K ~26, DNABERT-2 ~14 | ~54 (2x best baseline) |
| Few-Shot Macro F1 | 10-shot baseline < DNABERT-S (2-shot) | 2-shot outperforms 10-shot baseline |
| Synthetic Few-Shot | - | 5-shot F1 > 0.8 (200 classes) |
| Binning (F1 > 0.5) | Best baseline: X species | ~2x species |
| Error-free binning | - | >80% species recovered |
| Noisy real data | - | ~40% species recovered |

These results indicate that DNABERT-S embeddings substantially improve the separability of species in unsupervised and semi-supervised settings, especially under label-scarce conditions.
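The clustering results are reported as Adjusted Rand Index (ARI); the values in the summary are presumably ARI multiplied by 100. As a reference for how the metric behaves, here is a small self-contained sketch of the standard pair-counting formula. It is not the evaluation code used in the paper.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(true_labels: Sequence[Hashable],
                        pred_labels: Sequence[Hashable]) -> float:
    """ARI between two labelings of the same items (1.0 means identical partitions)."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("labelings must have the same length")
    n = len(true_labels)
    if n < 2:
        raise ValueError("need at least two items")
    # Pairs that fall in the same cluster under both, and under each, labeling.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)


species = ["a", "a", "b", "b"]
ari_perfect = adjusted_rand_index(species, [0, 0, 1, 1])   # same partition
ari_crossed = adjusted_rand_index(species, [0, 1, 0, 1])   # every cluster split
```

An ARI near 0 corresponds to chance-level agreement, and negative values indicate clusterings that agree with the true species labels less often than chance.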

6. Applications, Limitations, and Future Directions

Species-aware embeddings from DNABERT-S facilitate accurate unsupervised species clustering, dramatically reduce labeled data requirements in few-shot classification, and enable efficient metagenomic binning—overcoming major challenges in genomics where reference genomes are incomplete or absent. The MI-Mix objective compels the model to capture compositional latent spaces, improving its ability to distinguish fine-grained species-specific features. The C²LR curriculum prevents overfitting to trivial negatives and enables smooth progression from simple to complex contrastive discrimination.

Limitations include the model’s specialized optimization for species differentiation; it does not yield automatic improvements on other genomics tasks, such as human promoter prediction. Prospective research avenues involve adaptive, layer-wise mixing schedules, integration with multi-modal biological data (e.g., gene expression), and extension of the C²LR paradigm to other biological representation learning scenarios (Zhou et al., 2024).
