Multi-Omics Integration with MOTGNN
- The paper introduces a novel framework that uses XGBoost-generated supervised graphs and modality-specific GNNs for precise disease classification.
- It reduces dimensionality and captures hierarchical interdependencies in high-dimensional omics data while effectively handling class imbalance.
- The framework achieves superior accuracy and robust performance, offering scalable interpretability and actionable biomarker discovery for precision medicine.
Integrating multi-omics data is a central challenge in computational biology and precision medicine, demanding methodologies that can model complex, nonlinear interdependencies and facilitate interpretability alongside predictive performance. Multi-Omics Integration with Tree-generated Graph Neural Network (MOTGNN) is an interpretable deep learning framework that leverages omics-specific tree-based supervised graph construction and graph neural networks to achieve robust, efficient, and biologically meaningful disease classification (Yang et al., 10 Aug 2025). MOTGNN addresses the hurdles posed by high-dimensional data, complex interaction structures, class imbalance, and the need for feature-level and modality-level interpretability.
1. MOTGNN Framework Architecture
MOTGNN is a modular pipeline comprising three principal components:
- XGBoost-based Tree-generated Graph Construction: For each omics modality (e.g., DNA methylation, mRNA, miRNA), an independent eXtreme Gradient Boosting (XGBoost) classifier is fitted to the labeled data. Each fitted model yields an ensemble of decision trees, each tree capturing hierarchical, supervised relationships among the input features. The union of tree-split features defines a reduced set of informative features X* of size p* (with p* ≪ p, the original feature count).
- Nodes: Individual retained features.
- Edges: An undirected connection is drawn between features co-occurring along a decision path (i.e., parent–child relationships within the tree splits).
- This process results in a sparse, supervised, modality-specific feature graph G = (V, E), with each graph reflecting the task-driven inter-feature structure deemed informative by the ensemble.
- Modality-specific Graph Neural Networks: Each reduced omics feature set X*_m, together with its corresponding graph G_m, is paired with an independent GNN implemented as a graph-embedded deep feedforward network (GEDFN). GEDFN layers are mathematically formalized as:

h = σ((W ⊙ A)ᵀ x + b)

where W is the parameter matrix, A is the adjacency matrix (with self-loops), ⊙ denotes element-wise multiplication, and σ is an activation function (e.g., ReLU). This design enables hierarchical, locality-aware representation learning: the GNN learns latent features that honor the tree-guided graph structure, capturing both low-level and higher-order relationships.
- Deep Feedforward Network for Cross-Omics Integration: Modality-specific embeddings (h_meth, h_mRNA, h_miRNA) are concatenated to form a unified latent representation:

z = [h_meth; h_mRNA; h_miRNA]

This vector is passed through a deep feedforward network (DFN), which models cross-modality interactions and produces the final prediction ŷ. The DFN is trained using standard binary cross-entropy:

L = −(1/N) Σᵢ [yᵢ log ŷᵢ + (1 − yᵢ) log(1 − ŷᵢ)]
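The graph-embedded layer at the heart of each modality-specific GNN can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the key operation is masking the weight matrix element-wise by the adjacency matrix (with self-loops), so each hidden unit receives input only from its own feature and its graph neighbors. All shapes and names here are illustrative assumptions.

```python
import numpy as np

def gedfn_layer(x, W, A, b):
    """One graph-embedded layer: h = ReLU((W ⊙ A)ᵀ x + b).

    A is the feature adjacency matrix with self-loops; masking W by A
    zeroes out every connection not supported by the tree-generated graph.
    """
    masked = W * A               # keep only graph-supported weights
    h = masked.T @ x + b
    return np.maximum(h, 0.0)    # ReLU activation

# Toy example: 4 retained features, chain graph 0-1-2-3 plus self-loops.
A = np.eye(4)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
x = rng.normal(size=4)
b = np.zeros(4)

h = gedfn_layer(x, W, A, b)
```

Because of the mask, perturbing a weight between two features that share no edge (e.g., features 0 and 3 above) leaves the output unchanged, which is exactly what makes the learned representation graph-aware.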
2. Tree-generated Supervised Graph Construction
Distinct from generic or similarity-driven networks, MOTGNN’s feature graphs capture task-relevant, hierarchical relationships by exploiting the structure of supervised decision tree ensembles:
- Each tree selects a split subset of features, implicitly encoding paths critical for classification.
- Parent-child feature co-occurrence in a tree is mapped to an undirected edge; the union over the ensemble spans the graph.
- Only features involved as splits in any tree are retained, markedly reducing dimensionality (p* ≪ p).
- The resulting graphs are typically sparse: empirical data show an average edge-to-node ratio in the range of 2.1–2.8.
- This construction is repeated independently for each omics modality, preserving biological specificity and enabling interpretability at the input feature level.
This paradigm achieves dimensionality reduction, denoising, and network induction in a single data-driven step, while reflecting the “decision logic” of tree models in the constructed graph topology.
3. Hierarchical and Cross-Modal Representation Learning
The embedding pipeline of MOTGNN ensures separation of local (within-omics) and global (cross-omics) structure:
- Within modality: GNNs process each reduced-modal dataset and its supervised feature graph, learning hierarchical latent variables that reflect both feature values and the XGBoost-inferred structure.
- Across modalities: The integration via DFN merges the independent representations, modeling nonlinear interactions between omics types.
Together, these design choices facilitate:
- Propagation and combination of information across multiple biological scales (from individual genes or CpGs to their multi-omics interactions).
- Flexible handling of missing or dropped modalities: each GNN operates on its specific omics channel, enabling robust operation in multi-view fusion settings.
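The cross-omics fusion step can be sketched as follows. The embedding dimensions and the two-layer DFN are illustrative assumptions, not the paper's configuration; the sketch shows concatenation of per-modality embeddings, a nonlinear fusion network, and the binary cross-entropy loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse_and_predict(h_meth, h_mrna, h_mirna, W1, b1, w2, b2):
    """Concatenate modality embeddings into z = [h_meth; h_mRNA; h_miRNA],
    pass through a small DFN, and return the predicted probability."""
    z = np.concatenate([h_meth, h_mrna, h_mirna])  # unified latent vector
    hidden = np.maximum(W1 @ z + b1, 0.0)          # ReLU hidden layer
    return sigmoid(w2 @ hidden + b2)

def bce(y, p, eps=1e-12):
    """Binary cross-entropy for a single example."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy dimensions: three 8-dim embeddings, one 16-unit hidden layer.
rng = np.random.default_rng(1)
h_meth, h_mrna, h_mirna = (rng.normal(size=8) for _ in range(3))
W1, b1 = rng.normal(size=(16, 24)), np.zeros(16)
w2, b2 = rng.normal(size=16), 0.0

p = fuse_and_predict(h_meth, h_mrna, h_mirna, W1, b1, w2, b2)
loss = bce(1.0, p)
```

Because each modality keeps its own GNN up to this concatenation point, the per-modality branches can be trained or evaluated in parallel, which is where the framework's separability comes from.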
4. Performance and Computational Efficiency
Empirical evaluation on three real-world cancer datasets (COADREAD, LGG, STAD) shows that MOTGNN outperforms competing methods across standard metrics:
- Accuracy: Achieves improvements of 5–10% over strong baselines (e.g., XGBoost, RF, DFN, GCN), with values up to 93.9%.
- ROC-AUC: Consistently high (~96.9%), and robust even on imbalanced datasets.
- F1-score: Substantially elevated in minority/rare class detection (e.g., 87.2% in COADREAD versus 33.4% for Random Forest).
- Efficiency: The reduced feature set (p* ≪ p) and sparse graphs keep memory and computational costs low (training converges within 1–2.5 minutes). The separable design allows parallelization across modalities.
This substantiates the claim that tree-generated, supervision-guided feature graphs can confer both predictive and computational benefits as compared to classical GNN integration strategies.
5. Interpretability and Biomarker Identification
MOTGNN directly supports fine-grained interpretability at both the feature and modality levels, a key demand in translational applications:
- Feature-level weights: Using a variant of the Olden and Jackson connection weights algorithm, feature importances are assigned as

s_j = Σ_h W_jh · 1[A_jh = 1] · v_h

where W is the first-layer weight matrix, v the downstream weights, and 1[·] denotes the indicator function for graph connectivity. Only weights supported by the learned graph structure contribute, enhancing the biological interpretability of the rankings.
- Omics-level weights: The relative importance of each omics modality m is quantified by the L1 norm of the corresponding block of the feedforward network's input weight matrix:

ω_m = ‖W_in^(m)‖₁

allowing direct comparison of the contributions of, for example, DNA methylation, mRNA, and miRNA.
- Biomarker discovery: Case studies confirm that the highest-scoring features correspond to biologically validated disease markers. For example, SFRP4 was robustly identified as a key discriminating feature for colorectal cancer in COADREAD, consistent with literature.
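Both interpretability computations reduce to simple weight aggregations. The sketch below is a hedged illustration under assumed shapes: `W` and `v` stand for a GNN's masked first-layer and downstream weights, and `W_in` for the DFN input matrix with one hypothetical column block per modality; the exact variant used in the paper may differ.

```python
import numpy as np

def feature_importance(W, A, v):
    """Graph-restricted connection-weights importance:
    s_j = Σ_h W[j, h] · 1[A[j, h] = 1] · v[h]."""
    return (W * A) @ v

def omics_importance(W_in, blocks):
    """L1 norm of each modality's column block of the DFN input weight
    matrix; a larger norm indicates a larger modality contribution."""
    return {m: float(np.abs(W_in[:, sl]).sum()) for m, sl in blocks.items()}

rng = np.random.default_rng(2)
A = np.eye(4)
A[0, 1] = A[1, 0] = 1.0            # one feature-feature edge plus self-loops
W = rng.normal(size=(4, 4))        # masked first-layer weights
v = rng.normal(size=4)             # downstream (hidden-to-output) weights
scores = feature_importance(W, A, v)

# Hypothetical layout: each modality contributes an 8-dim embedding.
W_in = rng.normal(size=(16, 24))
blocks = {"meth": slice(0, 8), "mrna": slice(8, 16), "mirna": slice(16, 24)}
omics = omics_importance(W_in, blocks)
```

Ranking features by `scores` (and modalities by `omics`) is what surfaces candidate biomarkers such as the SFRP4 example cited above.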
6. Robustness to Class Imbalance
MOTGNN exhibits high robustness to severe class imbalance, a typical scenario in clinical omics datasets. For example, F1-score is improved from 33.4% with Random Forests to 87.2% with MOTGNN under substantial imbalance, without requiring explicit resampling or class weighting strategies.
7. Scientific Context and Extensions
MOTGNN is distinguished from earlier frameworks in several dimensions:
- Combination of Tree Guidance and Graph Learning: The supervised, modality-specific, tree-generated graphs sharply contrast with conventional approaches that use a priori or unsupervised similarity networks for omics data integration.
- Modular Graph Fusion: The explicit, parallel modality-specific GNNs furnish flexibility not inherent in early or late concatenation schemes.
- Biological Interpretability: The dual-level interpretability (feature-level and omics-level) facilitates actionable insights directly from the trained model, supporting biological hypothesis generation.
- Computational Scalability: The reliance on sparse, supervised graphs markedly reduces the memory and time complexity for high-dimensional omics settings.
A plausible implication is that the MOTGNN architecture serves as a scalable template for future multi-omics integrative frameworks that require not only state-of-the-art predictive ability but also inherent interpretability constrained by biological structures.
Table: Model Components and Outcomes in MOTGNN

| Component | Approach | Outcome/Advantage |
|---|---|---|
| Graph Construction | XGBoost-derived tree feature graphs | Sparsity, dimension reduction, biological structure |
| Embedding Learning | Modality-specific GNNs (GEDFN) | Local/hierarchical representation |
| Cross-Omics Fusion | Deep Feedforward Network (DFN) | Nonlinear, global integration |
| Feature Importance | Weight sum over graph connections | Biomarker discovery, interpretability |
| Omics Contribution | L1-normed fusion weight per modality | Multi-level interpretability |
| Evaluation Metrics | Accuracy, ROC-AUC, F1, computational analysis | Robustness to imbalance, scalability, efficiency |
Summary
MOTGNN represents a comprehensive, interpretable, and efficient approach to multi-omics integration for disease modeling. By combining tree-generated, supervised feature graphs per omics with dedicated modality-specific GNNs and a cross-omics integration network, it achieves superior accuracy, robust handling of class imbalance, and detailed interpretability. The framework’s modular structure, computational scalability, and built-in biological plausibility establish MOTGNN as a significant advance in integrative biomedical machine learning for precision medicine (Yang et al., 10 Aug 2025).