
DeepProphet2: Transformer Gene Recommender

Updated 2 March 2026
  • DeepProphet2 is a transformer-based gene recommendation engine that learns semantic relationships from large-scale biomedical literature without using curated pathway labels.
  • It employs a 6-layer, 8-head encoder to generate 64-dimensional embeddings for genes and diseases, achieving state-of-the-art ROC AUCs in benchmark tests.
  • The model supports practical applications such as hypothesis generation, rare disease gene nomination, and pathway completion by delivering actionable gene rankings.

DeepProphet2 (DP2) is a transformer-based gene recommendation engine designed to identify and rank genes likely to be functionally related, given a user-specified set of genes and diseases, using large-scale biomedical literature co-citation data. The model maps human genes and diseases into a low-dimensional metric space via learned embeddings and processes co-occurrence sequences to derive semantic proximities, offering gene prioritization without reliance on curated pathway labels or interaction databases (Brambilla et al., 2022).

1. Model Structure and Embedding Framework

DP2 represents each distinct human gene (approximately 25,052) and disease term (approximately 11,655) as a discrete token embedded in ℝ⁶⁴, parameterized by embedding matrices $E_g \in \mathbb{R}^{|G| \times 64}$ and $E_d \in \mathbb{R}^{|D| \times 64}$, respectively. Genes and diseases are referenced using NCBI gene IDs and MeSH terms; no sub-tokenization is applied. The architectural core consists of $L=6$ identical transformer encoder blocks, each with $H=8$ attention heads. Each input sequence, comprising up to 30 genes and the associated diseases extracted from one PubMed article, passes through the embeddings and the transformer stack. Final sequence representations are scored against all gene embeddings via matrix multiplication, and a softmax over the entire gene vocabulary produces the ranking (no negative sampling).
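As a quick arithmetic check on the sizes quoted above, the following sketch counts the entries in the two embedding tables and the per-head width. It assumes the approximate vocabulary sizes are exact and that the 64-dimensional model width is split evenly across the 8 heads; neither assumption is confirmed by the source beyond the figures themselves.

```python
# Figures quoted in the text (both vocabulary sizes are stated as approximate).
N_GENES = 25_052
N_DISEASES = 11_655
D_MODEL = 64
N_HEADS = 8

gene_table = N_GENES * D_MODEL        # entries in E_g
disease_table = N_DISEASES * D_MODEL  # entries in E_d
embedding_total = gene_table + disease_table

# Assumption: the model width is divided evenly across attention heads.
head_dim = D_MODEL // N_HEADS

print(gene_table, disease_table, embedding_total, head_dim)
# prints: 1603328 745920 2349248 8
```

Under these assumptions the embedding tables alone hold about 2.35 million values, which suggests that most of the model's parameters sit in the embeddings rather than in the transformer blocks.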

Mathematically, the input $X \in \mathbb{R}^{(N_g + N_d - 1) \times 64}$ is projected via self-attention and feedforward operations within each block. For head $h$, queries, keys, and values are computed as $Q_h = X W^Q_h$, $K_h = X W^K_h$, $V_h = X W^V_h$, with scaled dot-product attention given by:

$$\mathrm{Attention}(Q_h, K_h, V_h) = \mathrm{softmax}\left( \frac{Q_h K_h^T}{\sqrt{64/H}} \right) V_h$$

Blocks produce contextual representations through concatenation, projection, residual connections, and layer normalization. The final context-aware sequence embedding is projected onto the gene vocabulary via:

$$s = X^{(L)} E_g^T$$

where $s_j$ denotes the unnormalized logit for gene $g_j$. A softmax yields $p(g_j \mid \text{context})$ for ranking.
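The attention and scoring steps can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the weights are random, the helper names (`attention_head`, `rank_genes`) are hypothetical, a toy vocabulary of 20 genes stands in for the real one, and masking, head concatenation, residual connections, and layer normalization are all omitted. The last input row is used as a stand-in for the final context vector.

```python
import math
import random

D_MODEL, N_HEADS = 64, 8
D_HEAD = D_MODEL // N_HEADS  # 8, so sqrt(D_HEAD) matches the sqrt(64/H) scaling

def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product attention for one head (no mask applied)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    k_t = [list(col) for col in zip(*k)]
    scale = math.sqrt(D_HEAD)
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def rank_genes(context_vec, gene_embeddings):
    """Score a context vector against every gene embedding, then softmax."""
    logits = [sum(c * e for c, e in zip(context_vec, emb)) for emb in gene_embeddings]
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return order, probs

rng = random.Random(0)

def rand_matrix(rows, cols):
    """Random stand-in for learned weights or embeddings."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

x = rand_matrix(5, D_MODEL)  # toy sequence of 5 embedded tokens
head_out = attention_head(
    x,
    rand_matrix(D_MODEL, D_HEAD),
    rand_matrix(D_MODEL, D_HEAD),
    rand_matrix(D_MODEL, D_HEAD),
)
toy_gene_embeddings = rand_matrix(20, D_MODEL)  # toy vocabulary of 20 genes
order, probs = rank_genes(x[-1], toy_gene_embeddings)
```

Here `head_out` has one 8-dimensional row per input token, and `order` lists the toy gene indices from most to least probable.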

2. Training Procedure and Optimization

Training is self-supervised, with each example comprising the gene and disease mentions from a single PubMed article annotated via PubTator/TaggerOne. The model randomly orders all gene mentions $(g_1, \ldots, g_{N_g})$, masks the final gene, and predicts it based on the preceding genes and all diseases. This causal next-gene prediction uses the categorical cross-entropy (CE) loss over the full gene vocabulary:

$$\mathcal{L}_{CE} = -\log p(g_t^* \mid \{g_1, \ldots, g_{t-1}\}, \{\text{diseases}\}) = -\log \frac{\exp(s_{t, g_t^*})}{\sum_{j=1}^{|G|} \exp(s_{t,j})}$$
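To make the loss concrete, here is a small sketch that evaluates the cross-entropy for one prediction from a list of logits. The function name `cross_entropy` and the example logits are illustrative only; the vocabulary size of 25,052 is the approximate gene count quoted earlier.

```python
import math

def cross_entropy(logits, target_index):
    """-log softmax(logits)[target_index], using the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(s - m) for s in logits))
    return log_z - logits[target_index]

# An uninformed model with equal logits over the whole gene vocabulary
# pays ln(25052), which is about 10.13 nats.
uniform_loss = cross_entropy([0.0] * 25_052, 0)

# A model that strongly favors the correct gene pays a loss close to zero.
confident_loss = cross_entropy([8.0, 0.0, 0.0, 0.0], 0)
```

The uniform case gives a useful reference point: a training loss well below ln |G| indicates that the context carries real information about the masked gene.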

Causal masking ensures each sequence position attends only to current and preceding tokens, and sliding this mask across sequence prefixes augments the dataset, generating multiple training instances per article analogously to GPT-style pretraining. Optimization uses Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$), with learning rates in $[5 \times 10^{-5}, 10^{-4}]$ selected via grid search. Early stopping and regularization via layer normalization are incorporated. The parameter count totals ≈2.5 million, with ≈300,000 in the transformer stack and the remainder in embeddings.
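The prefix-based augmentation and the causal mask can be sketched as follows. This is an illustration under assumptions rather than the published pipeline: the function names, the dictionary layout, and the placeholder identifiers are hypothetical, and how disease tokens are positioned relative to the mask is not specified in the text, so the mask below is the generic lower-triangular form.

```python
import random

def prefix_instances(genes, diseases, seed=0):
    """Shuffle an article's genes, then emit one (context, target) pair per prefix."""
    rng = random.Random(seed)
    order = list(genes)
    rng.shuffle(order)
    return [
        {"context_genes": order[:t], "diseases": list(diseases), "target": order[t]}
        for t in range(1, len(order))
    ]

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

# Placeholder identifiers, not real NCBI or MeSH IDs.
instances = prefix_instances(["GENE_A", "GENE_B", "GENE_C", "GENE_D"], ["DISEASE_X"])
mask = causal_mask(3)
```

An article with four genes yields three training instances here, each predicting the next gene from the genes before it plus all diseases.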

3. Data Acquisition, Preprocessing, and Filtering

DP2 was pre-trained on approximately 7.23 million PubTator Central articles. NLP tagging achieved F1 ≈ 0.87 (genes) and 0.84 (diseases). To ensure information density and manageable context length, only articles with 2–30 gene citations were retained. Where full text was unavailable, abstracts were used. Gene and disease mentions are mapped to standard identifiers (NCBI and MeSH), with each entity treated as an atomic token. This explicit curation and entity-level vocabulary mitigate ambiguity and noise from sub-token or character representations.
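The gene-count filter described above amounts to a simple predicate over each article's annotations. In this sketch the function name `keep_article` and the article records are hypothetical, and counting distinct gene identifiers (rather than every mention) is an assumption the text does not settle.

```python
def keep_article(gene_ids, min_genes=2, max_genes=30):
    """Keep an article only if it cites between min_genes and max_genes genes."""
    n = len(set(gene_ids))  # assumption: count distinct identifiers
    return min_genes <= n <= max_genes

# Toy records with placeholder identifiers.
articles = {
    "article_1": ["G1"],                          # too few genes
    "article_2": ["G1", "G2", "G3"],              # kept
    "article_3": [f"G{i}" for i in range(31)],    # too many genes
}
retained = [name for name, genes in articles.items() if keep_article(genes)]
print(retained)  # prints: ['article_2']
```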

4. Quantitative Evaluation and Benchmarking

Performance was assessed with leave-one-out cross-validation (LOO-CV) on benchmark gene sets, including Reactome pathways, Gene Ontology (GO) categories, DISEASES data (with experimental, curated, and text-mined subsets), and high-confidence STITCH chemical–protein interaction data. For each test set $S = \{g_1, \ldots, g_n\}$, each $g_i$ is held out in turn, the remaining genes are given to DP2 as input, and the model is scored on its ability to rank $g_i$ highly. The area under the receiver operating characteristic curve (ROC AUC) is computed for each benchmark.
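The leave-one-out protocol can be sketched in a few lines. With a single held-out positive, the ROC AUC reduces to the fraction of negative candidates scored below the held-out gene (ties counted as one half). The sketch assumes that input genes are excluded from the candidate pool and that per-gene AUCs are averaged; the source does not state either detail, and `score_fn` is a hypothetical stand-in for the model.

```python
def held_out_auc(scores, held_out, inputs):
    """AUC for one positive: share of negatives ranked below it (ties count 0.5)."""
    pos = scores[held_out]
    negatives = [g for g in scores if g != held_out and g not in inputs]
    wins = sum(
        1.0 if scores[g] < pos else 0.5 if scores[g] == pos else 0.0
        for g in negatives
    )
    return wins / len(negatives)

def leave_one_out_auc(gene_set, score_fn):
    """Hold out each gene in turn, score the rest, and average the AUCs."""
    aucs = []
    for g in gene_set:
        rest = [x for x in gene_set if x != g]
        aucs.append(held_out_auc(score_fn(rest), g, set(rest)))
    return sum(aucs) / len(aucs)

VOCAB = ["A", "B", "C", "X", "Y", "Z"]  # toy vocabulary

def perfect_scorer(inputs):
    """Toy scorer that always favors the members of the set {A, B, C}."""
    return {g: 1.0 if g in "ABC" else 0.0 for g in VOCAB}

def constant_scorer(inputs):
    """Toy scorer with no information: every gene gets the same score."""
    return {g: 0.0 for g in VOCAB}

perfect_auc = leave_one_out_auc(["A", "B", "C"], perfect_scorer)
chance_auc = leave_one_out_auc(["A", "B", "C"], constant_scorer)
print(perfect_auc, chance_auc)  # prints: 1.0 0.5
```

The constant scorer lands at 0.5, which is the reference point for the random-selection baseline.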

Dataset                      ROC AUC
DISEASES (experimental)      0.793
DISEASES (curated)           0.962
DISEASES (text-mined)        0.852
GO biological process        0.923
GO cellular component        0.950
GO molecular function        0.953
Reactome pathways            0.982
STITCH chemicals             0.972

Baseline comparisons include random selection (AUC ≈ 0.50), most-frequent-gene ranking by global citation count (AUC ≈ 0.66–0.80), and nearest-neighbor retrieval with BioBERT, SapBERT, and KRISSBERT embeddings (AUCs of 0.58–0.88). DP2 outperforms all baselines, in most cases by at least 20% in relative terms. DP2 is unique in never incorporating explicit pathway or disease labels during training; instead, functional relationships emerge solely from gene and disease co-citation patterns.

5. Embedding Space Geometry and Visualization

DP2’s embedding space was probed using Uniform Manifold Approximation & Projection (UMAP) to reduce the 64-dimensional gene vectors to two-dimensional visualizations. UMAP constructs fuzzy simplicial sets and minimizes cross-entropy between high- and low-dimensional affinities:

$$C = \sum_{i<j} \left[ w_{ij} \log\frac{w_{ij}}{\tilde{w}_{ij}} + (1-w_{ij}) \log\frac{1-w_{ij}}{1-\tilde{w}_{ij}} \right]$$
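For intuition, the objective above can be evaluated directly on small affinity matrices. This sketch only computes $C$ for given high-dimensional affinities `w` and low-dimensional affinities `w_tilde`; it does not build the fuzzy simplicial sets or run the optimization that a UMAP implementation performs. The function name and the clipping constant `eps` are choices made for this example.

```python
import math

def fuzzy_cross_entropy(w, w_tilde, eps=1e-12):
    """Sum the cross-entropy term over all pairs i < j of symmetric affinities."""
    total = 0.0
    n = len(w)
    for i in range(n):
        for j in range(i + 1, n):
            # Clip to (0, 1) so the logarithms stay finite.
            p = min(max(w[i][j], eps), 1.0 - eps)
            q = min(max(w_tilde[i][j], eps), 1.0 - eps)
            total += p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total

w = [[0.0, 0.9, 0.1],
     [0.9, 0.0, 0.2],
     [0.1, 0.2, 0.0]]
good_layout = [row[:] for row in w]   # low-dimensional affinities match exactly
poor_layout = [[0.0, 0.1, 0.9],
               [0.1, 0.0, 0.2],
               [0.9, 0.2, 0.0]]

cost_good = fuzzy_cross_entropy(w, good_layout)
cost_poor = fuzzy_cross_entropy(w, poor_layout)
```

The cost is zero when the two sets of affinities agree and grows as neighbors in the 64-dimensional space are pulled apart in the 2-D layout.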

Case studies include the co-localization of keratinization genes (DSG, PKP, DSC, DSP, JUP) with related methylation genes (e.g., CALM2, due to shared Ca²⁺ binding; GO:0005509, $p < 7 \times 10^{-10}$), accurate projection of mitotic prometaphase pathway subcomplexes (α/β-tubulin, PP2A, nuclear pore, centromere/kinetochore), and clear grouping by pathway (organic anion/cation transport, mRNA editing). These findings indicate that DP2’s embeddings encode axes of functional similarity, subcellular localization, and biological role without direct supervision, a result validated by gene set enrichment within clusters.

6. Applications and Practical Impact

DP2 provides utility across diverse research and translational settings:

  • Scientific Hypothesis Generation: Recommends candidate genes for stalled projects, yielding suggestions not easily obtained by human curation and broadening hypothesis discovery.
  • Rare Disease Gene Nomination: For conditions such as facioscapulohumeral dystrophy (FSHD), inputting known disease genes produced a top-30 list in which seven suggestions corresponded to hits that were later confirmed experimentally, narrowing the candidates ahead of costly wet-lab validation.
  • Pathway Completion: Retrospective prediction of Reactome pathway revisions showed that 47% of revised pathways contained at least one DP2-predicted gene, with 9% of all newly added genes having been suggested by prior DP2 runs.
  • Grant Preparation: Enables rapid identification of additional pathway- or disease-relevant genes for non-specialists drafting proposals based on “anchor” genes.
  • Prioritization for Expression Studies: Intersecting DP2 scores with differentially expressed gene lists supports focused experimental validation of putatively biologically salient targets.
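The last use case, intersecting recommendation scores with a differential expression list, can be sketched as below. The function `prioritize`, the score dictionary, and the placeholder gene names are hypothetical; the sketch assumes the scores have already been exported as a gene-to-score mapping and makes no claim about how the tool exposes them.

```python
def prioritize(dp2_scores, de_genes, top_k=10):
    """Keep genes present in both inputs, ordered by recommendation score."""
    shared = [(gene, score) for gene, score in dp2_scores.items() if gene in de_genes]
    shared.sort(key=lambda pair: pair[1], reverse=True)
    return shared[:top_k]

# Placeholder data, not real model output.
dp2_scores = {"GENE1": 0.91, "GENE2": 0.40, "GENE3": 0.75, "GENE4": 0.10}
de_genes = {"GENE2", "GENE3", "GENE5"}

shortlist = prioritize(dp2_scores, de_genes, top_k=2)
print(shortlist)  # prints: [('GENE3', 0.75), ('GENE2', 0.4)]
```

Genes that score highly but are not differentially expressed (GENE1 here) drop out, as do differentially expressed genes the model did not score.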

DP2 is openly accessible at https://www.generecommender.com, with documentation and practical tools for gene set exploration and prioritization.


DeepProphet2 is a 6-layer, 8-head transformer encoder trained with self-supervised masked gene prediction on 7.2 million PubMed/PubTator articles, with ≈2.5 million parameters in total. By exploiting large-scale literature context without curated labels, it achieves state-of-the-art gene set completion AUCs of 0.92–0.98 on six of the eight benchmarks listed above (0.79–0.85 on the DISEASES experimental and text-mined subsets), outperforms biomedical BERT embeddings and frequency-based heuristics, and provides actionable recommendations in both exploratory and applied genomics contexts (Brambilla et al., 2022).

