
DSRPGO: Multimodal Protein Function Prediction

Updated 10 November 2025
  • The paper introduces DSRPGO, a dual-stage deep learning framework that integrates spatial and sequence modalities for improved multi-label protein function prediction.
  • The method employs reconstructive pre-training, bidirectional attention, and dynamic selection to robustly fuse heterogeneous protein data from various sources.
  • Evaluation on GO ontologies shows significant gains over baselines, validating DSRPGO’s potential to advance functional genomics and proteomics.

DSRPGO (Dynamic Selection and Reconstructive Pre-training with Genetic Optimization) is a dual-stage, dual-branch deep learning framework for multimodal protein function prediction from heterogeneous biological data. It integrates spatial, sequence, and functional protein data through reconstructive pre-training, cross-modal bidirectional attention, and an adaptive dynamic selection mechanism. The method reports state-of-the-art results on hierarchical, multi-label classification of protein function across the three Gene Ontology (GO) sub-ontologies: Biological Process (BPO), Molecular Function (MFO), and Cellular Component (CCO) (Luo et al., 6 Nov 2025).

1. Model Architecture and Modalities

DSRPGO uses a two-stage pipeline. In the first (pre-training) stage, separate encoder–decoder pairs process (a) protein spatial structural information (PSSI), encoding both protein–protein interaction (PPI) networks and bag-of-words features for subcellular localization and domains, and (b) protein sequence information (PSeI), using pretrained token embeddings from ProtT5. Each branch learns a fine-grained, low-semantic feature space via reconstruction losses. In the second (fine-tuning) stage, these encoders initialize a dual-branch classification system with a Multimodal Shared Learning (MSL) branch that aggregates all modalities, and a Multimodal Interactive Learning (MIL) branch containing the Bidirectional Interaction Module (BInM) for explicit cross-modal attention between sequence and spatial features.

Each branch produces three channel outputs—PPI, attribute, and sequence feature vectors—totaling six feature vectors per protein for downstream fusion. This dual-branch design enables both global integration (through shared learning) and detailed, bidirectional cross-modal exchange.

2. Reconstructive Pre-training and Encoders

Reconstructive pre-training is central to DSRPGO. The goal is to learn encoders that extract more informative and fine-grained features from both spatial and sequence modalities:

a) PSSI encoder–decoder:

  • Input: $x_i^{h(k)} \in \mathbb{R}^{H_i^k}$, with $k=1$ for the flattened PPI adjacency matrix and $k=2$ for the concatenated bag-of-words localization and domain features.
  • Internals: Uses BiMamba blocks built from state-space models (SSM) and selective scanning, combining forward and backward scan layers, linear mapping, and FiLM-like gating.
  • Output: Encoded vectors $x_i^{d(k)}$, decoded to reconstruct the input via a mirrored decoder.
  • Loss: Binary cross-entropy on reconstruction,

$$\mathcal{L}_{sp} = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \sum_{j=1}^{H_i^k} \Bigl[ -x_{ij}^{h(k)} \log \bar{x}_{ij}^{h(k)} - \bigl(1 - x_{ij}^{h(k)}\bigr) \log\bigl(1 - \bar{x}_{ij}^{h(k)}\bigr) \Bigr]$$
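As a concrete reading of $\mathcal{L}_{sp}$, the following minimal Python sketch evaluates the binary cross-entropy summed over modalities and features and averaged over proteins. The function name `bce_reconstruction_loss`, the nested-list data layout, and the clipping constant `eps` are illustrative assumptions, not part of the published implementation.

```python
import math


def bce_reconstruction_loss(inputs, reconstructions, eps=1e-7):
    """Binary cross-entropy reconstruction loss, averaged over proteins.

    inputs[i][k][j] is feature j of modality k for protein i (values in [0, 1]);
    reconstructions has the same layout and holds the decoder outputs.
    Both the layout and eps are assumptions made for this sketch.
    """
    total = 0.0
    for x_i, xbar_i in zip(inputs, reconstructions):        # proteins i = 1..N
        for x_k, xbar_k in zip(x_i, xbar_i):                 # modalities k = 1..K
            for x, xbar in zip(x_k, xbar_k):                 # features j = 1..H_i^k
                xbar = min(max(xbar, eps), 1.0 - eps)        # keep log() finite
                total += -x * math.log(xbar) - (1.0 - x) * math.log(1.0 - xbar)
    return total / len(inputs)


# Two toy proteins, each with a 3-dim "PPI" vector (k=1) and a 2-dim attribute vector (k=2).
toy_inputs = [[[1, 0, 1], [0, 1]], [[0, 0, 1], [1, 1]]]
uninformative = [[[0.5, 0.5, 0.5], [0.5, 0.5]] for _ in toy_inputs]
toy_loss = bce_reconstruction_loss(toy_inputs, uninformative)
```

With every reconstructed value at 0.5, each of the five features per protein contributes $\ln 2$, so the loss equals $5 \ln 2$; a decoder that reproduces its input drives the loss toward zero.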

b) PSeI encoder–decoder:

  • Input: ProtT5-embedded sequence $s_i^h$.
  • Architecture: MLP followed by six-layer Transformer encoder, symmetric decoder.
  • Loss: Binary cross-entropy,

$$\mathcal{L}_{se} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{H_i} \Bigl[ -s_{ij}^{h} \log \bar{s}_{ij}^{h} - \bigl(1 - s_{ij}^{h}\bigr) \log\bigl(1 - \bar{s}_{ij}^{h}\bigr) \Bigr]$$

This reconstructive formulation ensures both modalities are deeply encoded before fine-tuning on functional labels.

3. Bidirectional Interaction and Dynamic Selection

Bidirectional Interaction Module (BInM):

Enables cross-attention between spatial and sequence representations in the MIL-Branch. After initial MLP mapping and splitting into multi-head attention space, BInM realizes bidirectional, pairwise attention, projecting attended outputs back to their initial feature spaces. For input sets $\widetilde{x}_i^B$ (PPI + attribute) and $\overline{x}_i^B$ (sequence), BInM models both directions:

$$F_c^1 = \mathrm{softmax}\bigl(Q^1(F_b^1)\,K^2(F_b^2)^\top\bigr)\,V^2(F_b^2), \quad F_c^2 = \mathrm{softmax}\bigl(Q^2(F_b^2)\,K^1(F_b^1)^\top\bigr)\,V^1(F_b^1)$$
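The two attention directions can be illustrated with a small, dependency-free Python sketch. It follows the formula as written (single head, no scaling factor, no output projection), and treats $Q$, $K$, and $V$ as plain linear maps; the helper names, the dictionary layout of the projection weights, and the toy inputs are assumptions made for illustration only.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out


def cross_attention(f_query, f_context, w_q, w_k, w_v):
    """softmax(Q(f_query) K(f_context)^T) V(f_context), unscaled, single head."""
    q = matmul(f_query, w_q)
    k = matmul(f_context, w_k)
    v = matmul(f_context, w_v)
    return matmul(softmax_rows(matmul(q, transpose(k))), v)


def bidirectional_interaction(f_b1, f_b2, params1, params2):
    """Return (F_c^1, F_c^2): stream 1 attends to stream 2 and vice versa."""
    f_c1 = cross_attention(f_b1, f_b2, params1["q"], params2["k"], params2["v"])
    f_c2 = cross_attention(f_b2, f_b1, params2["q"], params1["k"], params1["v"])
    return f_c1, f_c2


# Toy example: two spatial tokens, one sequence token, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = {"q": identity, "k": identity, "v": identity}
spatial = [[1.0, 0.0], [0.0, 1.0]]
sequence = [[0.5, -1.0]]
f_c1, f_c2 = bidirectional_interaction(spatial, sequence, weights, weights)
```

In this toy setup each spatial token has only one sequence token to attend to, so both rows of `f_c1` equal that token's value vector, while `f_c2` is a convex combination of the two spatial tokens. A practical implementation would add the multi-head split and output projections described in the text.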

Dynamic Selection Module (DSM):

Fuses the six output channels (three from MSL, three from MIL) using adaptive weighting. For each protein, DSM computes “expert confidence” for each channel by MLP + softmax:

$$\hat{p} = \mathrm{softmax}\bigl(\mathrm{MLP}(X_{dsm})\bigr)$$

Channels with confidence above a threshold $t$ are selected, renormalized, and passed through “expert” layers. The concatenated output passes to the classifier, providing flexibility and adaptivity to the functional prediction process.
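The select-and-renormalize step can be sketched in a few lines of Python. Here `channel_logits` stands in for the MLP output over the six channels, and the fallback to the single most confident channel when nothing clears the threshold is an assumption of this sketch; the source does not say how that case is handled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    mx = max(logits)
    exps = [math.exp(v - mx) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def dynamic_select(channel_logits, threshold):
    """Return {channel index: renormalized confidence} for the selected channels."""
    conf = softmax(channel_logits)
    selected = [i for i, p in enumerate(conf) if p > threshold]
    if not selected:
        # Assumed fallback (not from the source): keep the most confident channel.
        selected = [max(range(len(conf)), key=conf.__getitem__)]
    mass = sum(conf[i] for i in selected)
    return {i: conf[i] / mass for i in selected}


# Six hypothetical channel logits: two strong channels, four weak ones.
example_weights = dynamic_select([2.0, 2.0, 0.0, 0.0, 0.0, 0.0], threshold=0.2)
```

In the example, only the first two channels exceed the threshold, and after renormalization they share the weight equally. The selected channels would then be passed through their expert layers and concatenated for the classifier.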

4. Hierarchical Multi-Label Classification and Losses

DSRPGO treats each GO ontology (BPO, MFO, CCO) as an independent multi-label classification task. The classifier is trained on the DSM-fused features with a focal-style asymmetric cross-entropy loss to accommodate class imbalances:

$$\mathcal{L} = \frac{1}{N M} \sum_{i=1}^{N} \sum_{m=1}^{M} \Bigl[ -y_i^m \,(1 - p_i^m)^{\gamma^+} \log p_i^m - (1 - y_i^m)\,(p_i^m)^{\gamma^-} \log(1 - p_i^m) \Bigr]$$

There is no explicit regularization for hierarchy-awareness; each ontology is handled independently.
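A direct transcription of this loss into Python makes the role of the two focusing exponents visible. The function name, the clipping constant `eps`, and the example values of $\gamma^+$ and $\gamma^-$ are assumptions chosen for illustration; the source does not state the values used in training.

```python
import math


def asymmetric_focal_loss(y_true, y_prob, gamma_pos, gamma_neg, eps=1e-7):
    """Focal-style asymmetric cross-entropy averaged over N proteins and M labels.

    y_true[i][m] is 0 or 1; y_prob[i][m] is the predicted probability p_i^m.
    """
    n, m = len(y_true), len(y_true[0])
    total = 0.0
    for labels, probs in zip(y_true, y_prob):
        for y, p in zip(labels, probs):
            p = min(max(p, eps), 1.0 - eps)                       # keep log() finite
            total += -y * (1.0 - p) ** gamma_pos * math.log(p)    # positive term
            total += -(1.0 - y) * p ** gamma_neg * math.log(1.0 - p)  # negative term
    return total / (n * m)


# One protein, one positive and one negative label, both predicted at 0.5.
labels = [[1, 0]]
probs = [[0.5, 0.5]]
plain_bce = asymmetric_focal_loss(labels, probs, gamma_pos=0.0, gamma_neg=0.0)
down_weighted = asymmetric_focal_loss(labels, probs, gamma_pos=0.0, gamma_neg=2.0)
```

With both exponents at zero the expression reduces to ordinary binary cross-entropy ($\ln 2$ here). Raising $\gamma^-$ to 2 scales the negative term by $0.5^2$, giving $0.625 \ln 2$, which shows how the loss reduces the contribution of easy negatives relative to positives.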

5. Training, Implementation, and Hyperparameters

Data Preparation

  • Pretraining: 19,385 human proteins (PPI from STRING v11.5; sequence, localization, domain from UniProt v3.5.175)
  • Fine-tuning: Split by timestamp into train/val/test for each ontology (e.g., BPO: 3,197/304/182)

Hyperparameters

  • Pre-training: AdamW, 5,000 epochs, learning rate $1 \times 10^{-5}$ then $1 \times 10^{-6}$, dropout 0.1
  • Fine-tuning: AdamW, 100 epochs, learning rate $1 \times 10^{-3}$ then $1 \times 10^{-4}$, dropout 0.3
  • Hardware: NVIDIA RTX 4090 or better (≥16 GB VRAM)
  • Training time: ~24–48 hours (pre-training), ~1–3 hours per ontology (fine-tuning)

Implementation Notes

  • Frozen ProtT5 for sequence, random (Xavier) init for MLPs, BiMamba, Transformers
  • The full process is encapsulated in reproducible pseudocode for pre-training, fine-tuning, and inference as described in (Luo et al., 6 Nov 2025).

6. Evaluation and Comparative Performance

Performance is assessed on held-out test sets for each GO ontology using $F_{\max}$, micro/macro AUPR, and accuracy:

| Ontology | CFAGO (best baseline) $F_{\max}$ | DSRPGO $F_{\max}$ |
| --- | --- | --- |
| BPO | 0.439 ± 0.007 | 0.458 ± 0.006 |
| MFO | 0.236 ± 0.004 | 0.254 ± 0.022 |
| CCO | 0.366 ± 0.018 | 0.452 ± 0.019 |
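The text does not spell out how $F_{\max}$ is computed. The sketch below assumes the protein-centric definition commonly used in CAFA-style evaluations: at each decision threshold, precision is averaged over proteins with at least one predicted term, recall is averaged over all annotated proteins, and $F_{\max}$ is the largest resulting F-score. The function name and the default threshold grid are illustrative assumptions.

```python
def fmax(y_true, y_score, thresholds=None):
    """Protein-centric F-max over a grid of decision thresholds (assumed definition)."""
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 100)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for labels, scores in zip(y_true, y_score):
            actual = {j for j, y in enumerate(labels) if y == 1}
            if not actual:
                continue  # proteins without annotations are skipped
            predicted = {j for j, s in enumerate(scores) if s >= t}
            true_pos = len(predicted & actual)
            recalls.append(true_pos / len(actual))
            if predicted:  # precision only counts proteins with a prediction
                precisions.append(true_pos / len(predicted))
        if not precisions or not recalls:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


# Toy data: two proteins, three GO terms each.
truth = [[1, 0, 1], [0, 1, 0]]
good_scores = [[0.9, 0.2, 0.8], [0.1, 0.7, 0.3]]
poor_scores = [[0.1, 0.9, 0.2], [0.9, 0.1, 0.8]]
```

For `good_scores` a threshold of 0.5 separates the labels perfectly, so `fmax` returns 1.0. For `poor_scores` the best achievable F-score comes from predicting every term for every protein, which gives precision 0.5, recall 1.0, and therefore 2/3.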

Ablation studies reveal that omitting reconstructive pre-training, the bidirectional interaction module, or the dynamic selection module reduces $F_{\max}$ substantially; for example, "no pre-training" scores 0.297/0.167/0.356 (BPO/MFO/CCO), and BInM/DSM removal yields drops of up to 0.07 on MFO and CCO. Sequence-only or spatial-only baselines underperform dual-modality models.

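Standard deviations are reported across five runs. Paired t-tests are not reported, so the statistical significance of the gains is not established: the BPO and CCO improvements (0.019 and 0.086) exceed the reported standard deviations, whereas the MFO improvement (0.018) is smaller than DSRPGO's own standard deviation on that ontology (0.022).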

7. Significance, Context, and Implications

DSRPGO combines reconstructive pre-training, bidirectional cross-modal feature interaction, and adaptive channel selection for multimodal protein function prediction. By learning modality-specific representations and fusing them selectively, it improves on the strongest reported baseline (CFAGO) in $F_{\max}$ on all three GO ontologies.

The methodology is particularly applicable to eukaryotic function prediction tasks where context-rich, heterogeneous protein attributes are available. The reconstructive pre-training strategy anchors the framework’s generalizability, while DSM ensures context-aware inference. Comparison with generative LLMs for protein QA (Xiao et al., 21 Aug 2024) and contrastive/modal alignment paradigms (Wang et al., 24 May 2024) highlights DSRPGO’s focus on direct discriminative function label prediction under complex multi-label, multi-ontology settings.

A plausible implication is that further extensions—such as joint optimization with domain-embedding or family-aware graph modalities—may yield additional functional gains, particularly in settings with extreme class imbalance or limited experimental annotations. The modular pipeline and explicit channel weighting in DSRPGO provide routes for continual learning, hard-negative mining, and integration with knowledge graph-enhanced ontological priors.
