Multi-Omics Integration with MOTGNN
- The paper introduces a novel framework that uses XGBoost-generated supervised graphs and modality-specific GNNs for precise disease classification.
- It reduces dimensionality and captures hierarchical interdependencies in high-dimensional omics data while effectively handling class imbalance.
- The framework achieves superior accuracy and robust performance, offering scalable interpretability and actionable biomarker discovery for precision medicine.
Integrating multi-omics data is a central challenge in computational biology and precision medicine, demanding methodologies that can model complex, nonlinear interdependencies and facilitate interpretability alongside predictive performance. Multi-Omics Integration with Tree-generated Graph Neural Network (MOTGNN) is an interpretable deep learning framework that leverages omics-specific tree-based supervised graph construction and graph neural networks to achieve robust, efficient, and biologically meaningful disease classification (Yang et al., 10 Aug 2025). MOTGNN addresses the hurdles posed by high-dimensional data, complex interaction structures, class imbalance, and the need for feature-level and modality-level interpretability.
1. MOTGNN Framework Architecture
MOTGNN is a modular pipeline comprising three principal components:
- XGBoost-based Tree-generated Graph Construction: For each omics modality (e.g., DNA methylation, mRNA, miRNA), an independent eXtreme Gradient Boosting (XGBoost) classifier is fitted to the labeled data. Each fitted model yields an ensemble of decision trees, each tree capturing hierarchical, supervised relationships among the input features. The union of tree-split features defines a reduced set of informative features X* of size p* (with p* ≪ p, the original feature count).
- Nodes: Individual retained features.
- Edges: An undirected connection is drawn between features co-occurring along a decision path (i.e., parent–child relationships within the tree splits).
- This process results in a sparse, supervised, modality-specific feature graph G = (V, E), with each graph reflecting the task-driven inter-feature structure deemed informative by the ensemble.
- Modality-specific Graph Neural Networks: Each reduced omics feature set X*_m, together with its corresponding graph G_m, is paired with an independent GNN implemented as a graph-embedded deep feedforward network (GEDFN). GEDFN layers are mathematically formalized as:

h = σ((W ⊙ A)ᵀ x + b)

where W is the parameter matrix, A is the adjacency matrix (with self-loops), ⊙ denotes element-wise multiplication, and σ is an activation function (e.g., ReLU). This design enables hierarchical, locality-aware representation learning: the GNN learns latent features that honor the tree-guided graph structure, capturing both low-level and higher-order relationships.
- Deep Feedforward Network for Cross-Omics Integration: Modality-specific embeddings (h_meth, h_mRNA, h_miRNA) are concatenated to form a unified latent representation:

z = [h_meth; h_mRNA; h_miRNA]

This vector is passed through a deep feedforward network (DFN), which models cross-modality interactions and produces the final prediction ŷ. The DFN is trained using standard binary cross-entropy:

L = −(1/N) Σᵢ [yᵢ log ŷᵢ + (1 − yᵢ) log(1 − ŷᵢ)]
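The graph-embedded layer at the heart of each modality-specific GNN can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the key operation is masking the weight matrix element-wise by the adjacency matrix (with self-loops), so each hidden unit receives input only from its own feature and its graph neighbors. All shapes and names here are illustrative assumptions.

```python
import numpy as np

def gedfn_layer(x, W, A, b):
    """One graph-embedded layer: h = ReLU((W ⊙ A)ᵀ x + b).

    A is the feature adjacency matrix with self-loops; masking W by A
    zeroes out every connection not supported by the tree-generated graph.
    """
    masked = W * A               # keep only graph-supported weights
    h = masked.T @ x + b
    return np.maximum(h, 0.0)    # ReLU activation

# Toy example: 4 retained features, chain graph 0-1-2-3 plus self-loops.
A = np.eye(4)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
x = rng.normal(size=4)
b = np.zeros(4)

h = gedfn_layer(x, W, A, b)
```

Because of the mask, perturbing a weight between two features that share no edge (e.g., features 0 and 3 above) leaves the output unchanged, which is exactly what makes the learned representation graph-aware.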
2. Tree-generated Supervised Graph Construction
Distinct from generic or similarity-driven networks, MOTGNN’s feature graphs capture task-relevant, hierarchical relationships by exploiting the structure of supervised decision tree ensembles:
- Each tree selects a split subset of features, implicitly encoding paths critical for classification.
- Parent-child feature co-occurrence in a tree is mapped to an undirected edge; the union over the ensemble spans the graph.
- Only features involved as splits in any tree are retained, markedly reducing dimensionality (p* ≪ p).
- The resulting graphs are typically sparse: empirical data show an average edge-to-node ratio in the range of 2.1–2.8.
- This construction is repeated independently for each omics modality, preserving biological specificity and enabling interpretability at the input feature level.
This paradigm achieves dimensionality reduction, denoising, and network induction in a single data-driven step, while reflecting the “decision logic” of tree models in the constructed graph topology.
3. Hierarchical and Cross-Modal Representation Learning
The embedding pipeline of MOTGNN ensures separation of local (within-omics) and global (cross-omics) structure:
- Within modality: GNNs process each reduced-modal dataset and its supervised feature graph, learning hierarchical latent variables that reflect both feature values and the XGBoost-inferred structure.
- Across modalities: The integration via DFN merges the independent representations, modeling nonlinear interactions between omics types.
Together, these design choices facilitate:
- Propagation and combination of information across multiple biological scales (from individual genes or CpGs to their multi-omics interactions).
- Flexible handling of missing or dropped modalities: each GNN operates on its specific omics channel, enabling robust operation in multi-view fusion settings.
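The cross-omics fusion step can be sketched as follows. The embedding dimensions and the two-layer DFN are illustrative assumptions, not the paper's configuration; the sketch shows concatenation of per-modality embeddings, a nonlinear fusion network, and the binary cross-entropy loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse_and_predict(h_meth, h_mrna, h_mirna, W1, b1, w2, b2):
    """Concatenate modality embeddings into z = [h_meth; h_mRNA; h_miRNA],
    pass through a small DFN, and return the predicted probability."""
    z = np.concatenate([h_meth, h_mrna, h_mirna])  # unified latent vector
    hidden = np.maximum(W1 @ z + b1, 0.0)          # ReLU hidden layer
    return sigmoid(w2 @ hidden + b2)

def bce(y, p, eps=1e-12):
    """Binary cross-entropy for a single example."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy dimensions: three 8-dim embeddings, one 16-unit hidden layer.
rng = np.random.default_rng(1)
h_meth, h_mrna, h_mirna = (rng.normal(size=8) for _ in range(3))
W1, b1 = rng.normal(size=(16, 24)), np.zeros(16)
w2, b2 = rng.normal(size=16), 0.0

p = fuse_and_predict(h_meth, h_mrna, h_mirna, W1, b1, w2, b2)
loss = bce(1.0, p)
```

Because each modality keeps its own GNN up to this concatenation point, the per-modality branches can be trained or evaluated in parallel, which is where the framework's separability comes from.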
4. Performance and Computational Efficiency
Empirical evaluation on three real-world cancer datasets (COADREAD, LGG, STAD) shows that MOTGNN outperforms competing methods across standard metrics:
- Accuracy: Achieves improvements of 5–10% over strong baselines (e.g., XGBoost, RF, DFN, GCN), with values up to 93.9%.
- ROC-AUC: Consistently high (~96.9%), and robust even on imbalanced datasets.
- F1-score: Substantially elevated in minority/rare class detection (e.g., 87.2% in COADREAD versus 33.4% for Random Forest).
- Efficiency: The reduced feature set (p* ≪ p) and sparse graphs keep memory and computational costs low (training converges within 1–2.5 minutes). The separable design allows parallelization across modalities.
This substantiates the claim that tree-generated, supervision-guided feature graphs can confer both predictive and computational benefits as compared to classical GNN integration strategies.
5. Interpretability and Biomarker Identification
MOTGNN directly supports fine-grained interpretability at both the feature and modality levels, a key demand in translational applications:
- Feature-level weights: Using a variant of the Olden and Jackson connection weights algorithm, feature importances are assigned as

s_j = Σ_h W_jh · 1[A_jh = 1] · v_h

where W is the first-layer weight matrix, v the downstream weights, and 1[·] denotes the indicator function for graph connectivity. Only weights supported by the learned graph structure contribute, enhancing the biological interpretability of the rankings.
- Omics-level weights: The relative importance of each omics modality m is quantified by the L1 norm of the corresponding block of the feedforward network's input weight matrix:

ω_m = ‖W_in^(m)‖₁

allowing direct comparison of the contributions of, for example, DNA methylation, mRNA, and miRNA.
- Biomarker discovery: Case studies confirm that the highest-scoring features correspond to biologically validated disease markers. For example, SFRP4 was robustly identified as a key discriminating feature for colorectal cancer in COADREAD, consistent with literature.
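Both interpretability computations reduce to simple weight aggregations. The sketch below is a hedged illustration under assumed shapes: `W` and `v` stand for a GNN's masked first-layer and downstream weights, and `W_in` for the DFN input matrix with one hypothetical column block per modality; the exact variant used in the paper may differ.

```python
import numpy as np

def feature_importance(W, A, v):
    """Graph-restricted connection-weights importance:
    s_j = Σ_h W[j, h] · 1[A[j, h] = 1] · v[h]."""
    return (W * A) @ v

def omics_importance(W_in, blocks):
    """L1 norm of each modality's column block of the DFN input weight
    matrix; a larger norm indicates a larger modality contribution."""
    return {m: float(np.abs(W_in[:, sl]).sum()) for m, sl in blocks.items()}

rng = np.random.default_rng(2)
A = np.eye(4)
A[0, 1] = A[1, 0] = 1.0            # one feature-feature edge plus self-loops
W = rng.normal(size=(4, 4))        # masked first-layer weights
v = rng.normal(size=4)             # downstream (hidden-to-output) weights
scores = feature_importance(W, A, v)

# Hypothetical layout: each modality contributes an 8-dim embedding.
W_in = rng.normal(size=(16, 24))
blocks = {"meth": slice(0, 8), "mrna": slice(8, 16), "mirna": slice(16, 24)}
omics = omics_importance(W_in, blocks)
```

Ranking features by `scores` (and modalities by `omics`) is what surfaces candidate biomarkers such as the SFRP4 example cited above.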
6. Robustness to Class Imbalance
MOTGNN exhibits high robustness to severe class imbalance, a typical scenario in clinical omics datasets. For example, F1-score is improved from 33.4% with Random Forests to 87.2% with MOTGNN under substantial imbalance, without requiring explicit resampling or class weighting strategies.
7. Scientific Context and Extensions
MOTGNN is distinguished from earlier frameworks in several dimensions:
- Combination of Tree Guidance and Graph Learning: The supervised, modality-specific, tree-generated graphs sharply contrast with conventional approaches that use a priori or unsupervised similarity networks for omics data integration.
- Modular Graph Fusion: The explicit, parallel modality-specific GNNs furnish flexibility not inherent in early or late concatenation schemes.
- Biological Interpretability: The dual-level interpretability (feature-level and omics-level) facilitates actionable insights directly from the trained model, supporting biological hypothesis generation.
- Computational Scalability: The reliance on sparse, supervised graphs markedly reduces the memory and time complexity for high-dimensional omics settings.
A plausible implication is that the MOTGNN architecture serves as a scalable template for future multi-omics integrative frameworks that require not only state-of-the-art predictive ability but also inherent interpretability constrained by biological structures.
Table: Model Components and Outcomes in MOTGNN

| Component | Approach | Outcome/Advantage |
|---|---|---|
| Graph Construction | XGBoost-derived tree feature graphs | Sparsity, dimension reduction, biological structure |
| Embedding Learning | Modality-specific GNNs (GEDFN) | Local/hierarchical representation |
| Cross-Omics Fusion | Deep Feedforward Network (DFN) | Nonlinear, global integration |
| Feature Importance | Weight sum over graph connections | Biomarker discovery, interpretability |
| Omics Contribution | L1-normed fusion weight per modality | Multi-level interpretability |
| Evaluation Metrics | Accuracy, ROC-AUC, F1, computational analysis | Robustness to imbalance, scalability, efficiency |
Summary
MOTGNN represents a comprehensive, interpretable, and efficient approach to multi-omics integration for disease modeling. By combining tree-generated, supervised feature graphs per omics with dedicated modality-specific GNNs and a cross-omics integration network, it achieves superior accuracy, robust handling of class imbalance, and detailed interpretability. The framework’s modular structure, computational scalability, and built-in biological plausibility establish MOTGNN as a significant advance in integrative biomedical machine learning for precision medicine (Yang et al., 10 Aug 2025).