DSRPGO: Multimodal Protein Function Prediction
- The paper introduces DSRPGO, a dual-stage deep learning framework that integrates spatial and sequence modalities for improved multi-label protein function prediction.
- The method employs reconstructive pre-training, bidirectional attention, and dynamic selection to robustly fuse heterogeneous protein data from various sources.
- Evaluation on GO ontologies shows significant gains over baselines, validating DSRPGO’s potential to advance functional genomics and proteomics.
DSRPGO (Dynamic Selection and Reconstructive Pre-training with Genetic Optimization) is a dual-stage, dual-branch deep learning framework for multimodal protein function prediction, designed to address the challenge of predicting function from heterogeneous biological data modalities. DSRPGO integrates spatial, sequence, and functional protein data through reconstructive pre-training, cross-modal bidirectional attention, and an adaptive dynamic selection mechanism. This approach achieves state-of-the-art results on hierarchical, multi-label classification of protein function across the Gene Ontology (GO) ontologies: Biological Process (BPO), Molecular Function (MFO), and Cellular Component (CCO), modernizing the predictive pipeline for functional genomics and proteomics (Luo et al., 6 Nov 2025).
1. Model Architecture and Modalities
DSRPGO uses a two-stage pipeline. In the first (pre-training) stage, separate encoder–decoder pairs process (a) protein spatial structural information (PSSI), encoding both protein–protein interaction (PPI) networks and bag-of-words features for subcellular localization and domains, and (b) protein sequence information (PSeI), using pretrained token embeddings from ProtT5. Each branch learns a fine-grained, low-semantic feature space via reconstruction losses. In the second (fine-tuning) stage, these encoders initialize a dual-branch classification system with a Multimodal Shared Learning (MSL) branch that aggregates all modalities, and a Multimodal Interactive Learning (MIL) branch containing the Bidirectional Interaction Module (BInM) for explicit cross-modal attention between sequence and spatial features.
Each branch produces three channel outputs—PPI, attribute, and sequence feature vectors—totaling six feature vectors per protein for downstream fusion. This dual-branch design enables both global integration (through shared learning) and detailed, bidirectional cross-modal exchange.
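The dual-branch, six-channel layout can be sketched minimally in NumPy; the feature width, the random linear maps standing in for the branch sub-networks, and the function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical shared feature width

# One random linear map per input channel (illustrative stand-ins for the
# branch sub-networks; sizes and names are assumptions).
maps = {k: rng.normal(scale=0.1, size=(d, d)) for k in ("ppi", "attr", "seq")}

def branch(ppi, attr, seq):
    # Each branch emits three channel vectors: PPI, attribute, and sequence.
    return [np.tanh(ppi @ maps["ppi"]),
            np.tanh(attr @ maps["attr"]),
            np.tanh(seq @ maps["seq"])]

ppi, attr, seq = (rng.normal(size=d) for _ in range(3))
msl_out = branch(ppi, attr, seq)   # Multimodal Shared Learning branch
mil_out = branch(ppi, attr, seq)   # Multimodal Interactive Learning branch
channels = msl_out + mil_out       # six feature vectors per protein
```

In the real model the two branches differ (the MIL branch inserts BInM attention between modalities); the sketch only shows how six per-protein vectors arise for downstream fusion.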
2. Reconstructive Pre-training and Encoders
Reconstructive pre-training is central to DSRPGO. The goal is to learn encoders that extract more informative and fine-grained features from both spatial and sequence modalities:
a) PSSI encoder–decoder:
- Input: concatenated spatial feature vector x_s = [x_PPI; x_attr], with x_PPI the flattened PPI adjacency matrix and x_attr the concatenated bag-of-words localization and domain features.
- Internals: Uses BiMamba blocks built from state-space models (SSM) and selective scanning, combining forward and backward scan layers, linear mapping, and FiLM-like gating.
- Output: encoded latent vector z_s, decoded to reconstruct the input via a mirrored decoder.
- Loss: binary cross-entropy on the reconstruction, L_PSSI = −Σ_i [x_i log x̂_i + (1 − x_i) log(1 − x̂_i)].
b) PSeI encoder–decoder:
- Input: ProtT5-embedded sequence representation x_seq.
- Architecture: MLP followed by six-layer Transformer encoder, symmetric decoder.
- Loss: binary cross-entropy on the reconstruction, L_PSeI, of the same form as in the PSSI branch.
This reconstructive formulation ensures both modalities are deeply encoded before fine-tuning on functional labels.
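As a concrete illustration, here is a minimal NumPy sketch of one reconstructive encoder–decoder with a BCE objective; the single linear encoder/decoder and all dimensions are assumptions standing in for the BiMamba and Transformer stacks:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(x, x_hat, eps=1e-8):
    # Binary cross-entropy between input features and their reconstruction.
    return -np.mean(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))

d_in, d_z = 64, 16  # hypothetical input / latent widths
W_enc = rng.normal(scale=0.1, size=(d_in, d_z))
W_dec = rng.normal(scale=0.1, size=(d_z, d_in))  # mirrored decoder

# Binary spatial features (flattened PPI row + bag-of-words attributes).
x = (rng.random((8, d_in)) > 0.5).astype(float)
z = np.tanh(x @ W_enc)        # encode to the latent space
x_hat = sigmoid(z @ W_dec)    # decode / reconstruct the input
loss = bce(x, x_hat)
```

Minimizing this reconstruction loss is what forces the encoder to retain fine-grained feature information before any function labels are seen.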
3. Bidirectional Interaction and Dynamic Selection
Bidirectional Interaction Module (BInM):
Enables cross-attention between spatial and sequence representations in the MIL-Branch. After initial MLP mapping and splitting into multi-head attention space, BInM performs bidirectional, pairwise attention and projects the attended outputs back to their original feature spaces. For input sets S (PPI + attribute) and Q (sequence), BInM models both directions via scaled dot-product cross-attention: S attends to Q, and Q attends to S.
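The two attention directions can be sketched with single-head scaled dot-product attention in NumPy; this simplifies the multi-head BInM, and all sizes and weight matrices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attn(q_src, kv_src, Wq, Wk, Wv):
    # Queries come from one modality; keys and values from the other.
    Q, K, V = q_src @ Wq, kv_src @ Wk, kv_src @ Wv
    scores = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return scores @ V

d = 32
S = rng.normal(size=(4, d))   # spatial tokens (PPI + attribute)
Qs = rng.normal(size=(6, d))  # sequence tokens
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))

s_att = cross_attn(S, Qs, Wq, Wk, Wv)  # spatial attends to sequence
q_att = cross_attn(Qs, S, Wq, Wk, Wv)  # sequence attends to spatial
```

Each attended output keeps the token count of its query side, so it can be projected back into that modality's original feature space.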
Dynamic Selection Module (DSM):
Fuses the six output channels (three from MSL, three from MIL) using adaptive weighting. For each protein, DSM computes an "expert confidence" per channel via an MLP followed by a softmax over channels. Channels with confidence above a threshold are selected, their weights renormalized, and each surviving channel is passed through an "expert" layer. The concatenated output feeds the classifier, making the fusion adaptive on a per-protein basis.
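The confidence-gated fusion admits a minimal NumPy sketch, assuming a linear gate in place of the MLP, identity "expert" layers, and a threshold of 0.10 (all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def dynamic_select(channels, w_gate, tau=0.10):
    # channels: six per-protein feature vectors (three from MSL, three from MIL).
    feats = np.stack(channels)         # (n_channels, d)
    conf = softmax(feats @ w_gate)     # one "expert confidence" per channel
    keep = conf >= tau                 # drop low-confidence channels
    w = np.where(keep, conf, 0.0)
    w = w / w.sum()                    # renormalize over surviving channels
    return np.concatenate(w[:, None] * feats)  # fixed-size fused vector

d = 8
channels = [rng.normal(size=d) for _ in range(6)]
w_gate = rng.normal(scale=0.1, size=d)   # stand-in for the gating MLP
fused = dynamic_select(channels, w_gate)
```

With six channels the softmax maximum is always at least 1/6, so at least one channel survives any threshold below that value and the renormalization is well defined.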
4. Hierarchical Multi-Label Classification and Losses
DSRPGO treats each GO ontology (BPO, MFO, CCO) as an independent multi-label classification task. The classifier is trained on the DSM-fused features with a focal-style asymmetric cross-entropy loss that down-weights easy negatives to accommodate the severe class imbalance of GO labels. There is no explicit hierarchy-aware regularization; each ontology is handled independently.
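One common asymmetric focal formulation is sketched below; the separate focusing exponents for positive and negative labels, and the specific values, are assumptions, since the paper's exact loss is not reproduced here:

```python
import numpy as np

def asymmetric_focal_bce(y, p, gamma_pos=0.0, gamma_neg=2.0, eps=1e-8):
    # Focal-style BCE with separate focusing exponents for positive and
    # negative labels; p**gamma_neg shrinks the abundant easy negatives.
    pos = y * (1.0 - p) ** gamma_pos * np.log(p + eps)
    neg = (1.0 - y) * p ** gamma_neg * np.log(1.0 - p + eps)
    return -np.mean(pos + neg)

y = np.array([1.0, 0.0, 0.0, 1.0])   # toy multi-label targets
p = np.array([0.9, 0.1, 0.4, 0.6])   # predicted probabilities
loss = asymmetric_focal_bce(y, p)
```

Setting gamma_neg > gamma_pos encodes the asymmetry: confident negatives contribute almost nothing, so the gradient budget is spent on the rare positive GO labels.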
5. Training, Implementation, and Hyperparameters
Data Preparation
- Pretraining: 19,385 human proteins (PPI from STRING v11.5; sequence, localization, domain from UniProt v3.5.175)
- Fine-tuning: Split by timestamp into train/val/test for each ontology (e.g., BPO: 3,197/304/182)
Hyperparameters
- Pre-training: AdamW, 5,000 epochs, stepped learning-rate schedule, dropout 0.1
- Fine-tuning: AdamW, 100 epochs, stepped learning-rate schedule, dropout 0.3
- Hardware: NVIDIA RTX 4090 or better (≥16 GB VRAM)
- Training time: ~24–48 hours (pre-training), ~1–3 hours per ontology (fine-tuning)
Implementation Notes
- Frozen ProtT5 for sequence, random (Xavier) init for MLPs, BiMamba, Transformers
- The full process is encapsulated in reproducible pseudocode for pre-training, fine-tuning, and inference as described in (Luo et al., 6 Nov 2025).
6. Evaluation and Comparative Performance
Performance is assessed on held-out test sets for each GO ontology using Fmax, micro/macro AUPR, and accuracy:

| Ontology | CFAGO (best baseline) | DSRPGO |
|---|---|---|
| BPO Fmax | 0.439 ± 0.007 | 0.458 ± 0.006 |
| MFO Fmax | 0.236 ± 0.004 | 0.254 ± 0.022 |
| CCO Fmax | 0.366 ± 0.018 | 0.452 ± 0.019 |
Ablation studies reveal that omitting reconstructive pre-training, the Bidirectional Interaction Module, or the Dynamic Selection Module substantially reduces Fmax; for example, "no pre-training" scores 0.297/0.167/0.356 (BPO/MFO/CCO), and removing BInM or DSM yields drops of up to 0.07 on MFO and CCO. Sequence-only and spatial-only baselines underperform the dual-modality model.
Standard deviations across five runs indicate stable training; paired t-tests are not reported, but the improvements are consistent across ontologies and runs.
7. Significance, Context, and Implications
DSRPGO establishes a new paradigm for multimodal protein function prediction by coupling the strengths of reconstructive pre-training, dynamic cross-modal feature interaction, and adaptive channel selection. By explicitly disentangling modality-specific representations and enabling selective fusion, DSRPGO achieves consistent improvement over strong multimodal and unimodal baselines in hierarchical GO label prediction.
This methodology is applicable particularly to eukaryotic function prediction tasks where context-rich, heterogeneous protein attributes are available. The reconstructive pre-training strategy anchors the framework’s generalizability, while DSM ensures context-aware inference. Comparison with generative LLM models for protein QA (Xiao et al., 21 Aug 2024) and contrastive/modal alignment paradigms (Wang et al., 24 May 2024) highlights DSRPGO’s focus on direct discriminative function label prediction under complex multi-label, multi-ontology settings.
A plausible implication is that further extensions—such as joint optimization with domain-embedding or family-aware graph modalities—may yield additional functional gains, particularly in settings with extreme class imbalance or limited experimental annotations. The modular pipeline and explicit channel weighting in DSRPGO provide routes for continual learning, hard-negative mining, and integration with knowledge graph-enhanced ontological priors.