RGE-GCN: Recursive Gene Elimination with GCN
- RGE-GCN is a framework that fuses recursive gene elimination with graph convolutional networks to select concise, biologically interpretable gene panels from high-dimensional RNA-seq data.
- It constructs sample–sample graphs using Pearson correlation and uses Integrated Gradients for principled, robust gene ranking during iterative elimination.
- The approach reports higher accuracy and F1-scores than established DEG tools in early cancer detection, converging on compact biomarker panels that are benchmarked against DEG pipelines and machine learning classifiers.
RGE-GCN (Recursive Gene Elimination with Graph Convolutional Networks) is a data-driven framework designed for joint feature selection and classification from high-dimensional RNA-seq expression data, with direct application to early cancer detection and biomarker discovery. RGE-GCN integrates a recursive gene elimination scheme with a graph convolutional network (GCN) classifier and utilizes Integrated Gradients (IG) for principled feature attribution. By iteratively removing genes deemed uninformative for classification, the approach converges on a concise and biologically interpretable gene panel, while maintaining or improving predictive performance over standard differentially expressed gene (DEG) methods (Shende et al., 3 Dec 2025).
1. Data Preprocessing and Recursive Elimination Protocol
RGE-GCN begins with an RNA-seq count or normalized expression matrix $X$ (samples by genes), paired with class labels $y$. The pipeline applies a two-tiered data split:
- Outer Split: Randomly hold out a fraction of the samples as a test set; the remainder forms a “train–validation pool” for recursive selection.
- Recursive Gene Elimination (RGE) Loop: On each iteration $t$ with the current gene set $G_t$, perform a $75/25$ split on the train–validation pool. Operations within each loop include (a) sample–sample graph construction, (b) GCN training and classification, (c) feature attribution using Integrated Gradients, and (d) elimination of the bottom-ranked fraction of genes by aggregate IG score. The process continues until $|G_t|$ reaches a minimum threshold (default $5\%$ of the initial gene count), with the best-performing gene subset (by validation accuracy) retained.
The final model is retrained on the full pool using this optimal gene set, and its accuracy and F1-score are reported on the original held-out test set (Shende et al., 3 Dec 2025).
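The elimination loop described above can be sketched as a generic backward-selection routine. This is a minimal illustration, not the authors' implementation: `fit_and_score` and `rank_genes` are hypothetical callbacks standing in for GCN training and IG-based ranking, `drop_frac` is a placeholder because the per-iteration elimination fraction is not stated here, and the split is unstratified for simplicity.

```python
import numpy as np

def recursive_gene_elimination(X, y, fit_and_score, rank_genes,
                               drop_frac=0.1, min_frac=0.05, seed=0):
    """Greedy backward elimination over the gene columns of X (samples x genes).

    fit_and_score(X_tr, y_tr, X_va, y_va) -> (model, validation_accuracy)
    rank_genes(model, X_tr) -> one importance score per current gene
    drop_frac is a hypothetical value; min_frac=0.05 mirrors the 5% default.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_genes = X.shape
    genes = np.arange(n_genes)                    # indices of surviving genes
    min_genes = max(1, int(np.ceil(min_frac * n_genes)))
    best_acc, best_genes = -np.inf, genes.copy()

    while True:
        # Fresh 75/25 train/validation split of the pool on every iteration.
        perm = rng.permutation(n_samples)
        n_train = int(round(0.75 * n_samples))
        tr, va = perm[:n_train], perm[n_train:]

        model, acc = fit_and_score(X[np.ix_(tr, genes)], y[tr],
                                   X[np.ix_(va, genes)], y[va])
        if acc > best_acc:                        # keep the best subset seen so far
            best_acc, best_genes = acc, genes.copy()

        if len(genes) <= min_genes:               # stopping rule
            break

        scores = np.asarray(rank_genes(model, X[np.ix_(tr, genes)]))
        n_drop = max(1, int(np.floor(drop_frac * len(genes))))
        n_drop = min(n_drop, len(genes) - min_genes)
        keep = np.sort(np.argsort(scores)[n_drop:])  # discard the lowest-scoring genes
        genes = genes[keep]

    return best_genes, best_acc
```

Passing the model and the ranking function as callbacks keeps the loop independent of any particular classifier, so the same skeleton could be reused with a different attribution method.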
2. Sample–Sample Graph Construction
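A Pearson correlation-based, sample–sample graph underpins the GCN classification for each training subset. For each pair of samples $i, j$:
- Graph Weights: $w_{ij} = \mathrm{corr}(x_i, x_j)$, the Pearson correlation, where $x_i, x_j$ are expression vectors for samples $i, j$.
- Adjacency Matrix: $A_{ij} = 1$ if $w_{ij}$ exceeds a threshold $\tau$, otherwise $0$ (with the same $\tau$ in all experiments).
- Self-loops: $\tilde{A} = A + I$ ensures the stability of spectral methods during GCN propagation.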
This construction captures expression profile similarity and encodes it as a binary graph suitable for convolutional processing.
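A minimal NumPy sketch of this construction follows. The threshold value `tau=0.5` is a placeholder chosen for illustration only, and the strict inequality is an assumption; neither is taken from the source.

```python
import numpy as np

def build_sample_graph(X, tau=0.5):
    """Build a binary sample-sample graph from X (samples x genes).

    tau is a placeholder threshold, not the value used in the paper.
    Samples with constant expression would yield NaN correlations.
    """
    W = np.corrcoef(X)                   # rows are samples, so W is (n_samples, n_samples)
    A = (W > tau).astype(float)          # binarize by thresholding the correlation
    np.fill_diagonal(A, 0.0)             # remove the trivial self-correlation entries
    A_tilde = A + np.eye(A.shape[0])     # add self-loops explicitly
    return W, A, A_tilde
```

For example, two samples with proportional expression profiles are connected, while an anti-correlated sample remains isolated apart from its self-loop.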
3. Graph Convolutional Network Architecture
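RGE-GCN employs a standard spectral GCN as introduced by Kipf & Welling (2016), with the following propagation rule:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$
where $\tilde{A} = A + I$ is the adjacency with self-loops, $\tilde{D}$ its degree matrix, $H^{(l)}$ the feature matrix at layer $l$, $W^{(l)}$ learnable weights, and $\sigma$ a nonlinearity (ReLU).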
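A single Kipf & Welling propagation step, with symmetric degree normalization and ReLU, can be written in a few lines of NumPy. This is a sketch of the forward pass only (no training, BatchNorm, or dropout); the function name `gcn_layer` is illustrative.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN forward step: ReLU(D^-1/2 (A + I) D^-1/2 H W).

    A: (n, n) adjacency without self-loops
    H: (n, d_in) node features, W: (d_in, d_out) weights
    """
    A_tilde = A + np.eye(A.shape[0])           # adjacency with self-loops
    deg = A_tilde.sum(axis=1)                  # node degrees, always >= 1
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)      # ReLU nonlinearity
```

Because every node has a self-loop, each degree is at least one and the inverse square root is always defined.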
Architecture Details:
| Layer | Input Dim | Output Dim | Operations |
|---|---|---|---|
| 1 | $|G_t|$ (current gene count) | 64 | BatchNorm, ReLU, Dropout(0.4) |
| 2 | 64 | 32 | BatchNorm, ReLU, Dropout(0.4) |
| 3 | 32 | 16 | ReLU, Dropout(0.4) |
| 4 | 16 | $C$ (number of classes) | Linear projection to class logits |
Loss is weighted cross-entropy (class weights = inverse frequency). The AdamW optimizer is used (learning rate 0.01, weight decay 1e-3), with 200 epochs per training fold (Shende et al., 3 Dec 2025).
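The inverse-frequency weighting can be illustrated with a small NumPy sketch. The scaling $N / (C \cdot n_c)$ and the normalization by the sum of sample weights are assumptions borrowed from common library conventions; the source only states that class weights are the inverse of class frequency.

```python
import numpy as np

def inverse_frequency_weights(y, n_classes):
    """Class weight proportional to 1 / class count (assumes no empty class)."""
    counts = np.bincount(y, minlength=n_classes).astype(float)
    return len(y) / (n_classes * counts)

def weighted_cross_entropy(logits, y, class_weights):
    """Weighted mean of per-sample negative log-likelihoods."""
    z = logits - logits.max(axis=1, keepdims=True)             # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    w = class_weights[y]                                       # weight of each sample's class
    nll = -log_probs[np.arange(len(y)), y]
    return float((w * nll).sum() / w.sum())
```

With three samples of class 0 and one of class 1, the minority class receives three times the weight of the majority class, so its errors contribute equally to the loss in aggregate.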
4. Integrated Gradients for Gene Ranking
Feature attribution leverages Integrated Gradients (IG) computed for each gene $i$ over the trained GCN:
$$\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F\left(x' + \alpha (x - x')\right)}{\partial x_i} \, d\alpha$$
with $F$ as the GCN output, baseline $x'$, and integration approximated using 50 Riemann steps. For multiclass settings, the per-gene IG scores are aggregated over all classes into a single score per gene.
Genes are sorted by their aggregate score; the lowest-ranked are eliminated per iteration. Because IG satisfies established attribution axioms (sensitivity, completeness), the resulting ranking has a principled interpretation.
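The Riemann approximation of IG can be sketched on a toy differentiable model. Everything below is illustrative: a logistic unit with an analytic gradient stands in for one class output of the trained GCN, the all-zero baseline and the midpoint rule are assumptions, and ranking by absolute attribution is one plausible aggregation rather than the paper's exact formula.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate IG with a midpoint Riemann sum along the straight path
    from baseline to x. grad_fn(p) must return dF/dp at point p."""
    alphas = (np.arange(steps) + 0.5) / steps            # midpoints in (0, 1)
    path = baseline + alphas[:, None] * (x - baseline)   # shape (steps, n_genes)
    grads = np.stack([grad_fn(p) for p in path])         # gradient at each path point
    return (x - baseline) * grads.mean(axis=0)           # one attribution per gene

# Toy stand-in for one class output of a trained model: a logistic unit.
w = np.array([2.0, -1.0, 0.0])

def toy_output(x):
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def toy_gradient(x):
    s = toy_output(x)
    return s * (1.0 - s) * w

x = np.array([1.0, 0.5, 3.0])
baseline = np.zeros_like(x)          # all-zero baseline is an assumption
ig = integrated_gradients(toy_gradient, x, baseline)

# Completeness: attributions sum (approximately) to F(x) - F(baseline).
gap = abs(ig.sum() - (toy_output(x) - toy_output(baseline)))
ranking = np.argsort(np.abs(ig))     # least important gene first
```

The third gene has zero weight in the toy model, so it receives zero attribution and appears first in `ranking`, which is the position from which an elimination step would remove genes.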
5. Experimental Setup, Results, and Biological Interpretation
Datasets and Baselines
Experiments were conducted on:
- Synthetic cohorts (1000 genes, Negative Binomial noise, varying sample size and DEG fraction)
- Cervical cancer miRNA data (58 paired tumor/control, 714 miRNAs)
- TCGA LUAD/LUSC (1128 samples, 20,531 genes)
- TCGA kidney cohorts (1020 samples, 20,531 genes)
Baselines included DEG tools (DESeq2, edgeR, limma-voom) and machine learning classifiers (RF, SVM, MLP, GCN) re-trained on the selected gene panels. Accuracy and (macro-)F1-scores were averaged over 5 random splits.
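For reference, macro-F1 is the unweighted mean of per-class F1 scores, so minority classes count as much as majority ones. A minimal sketch, assuming integer class labels and assigning an F1 of zero to classes with no true or predicted members:

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))
```

Averaging this value over several random splits, as in the protocol above, reduces the dependence of the reported score on any single partition.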
Key Results
| Dataset | RGE-GCN Accuracy | limma-voom Accuracy | edgeR Accuracy |
|---|---|---|---|
| Cervical | | | |
| Lung (LUAD) | | | |
| Kidney | | | |
On synthetic data, RGE-GCN selected 50–90 genes (out of 1000) and achieved near-perfect scores, outperforming or matching other classifiers on both selected panels and the full DEG sets.
Biomarker Panels and Pathway Enrichment
- Lung cancer: CEACAM3, CEACAM6, SUMO4, FOLR2, OR52I1, OR10A3, ADAM6; pathway enrichment indicated purine metabolism (p=0.01596, OR=11.16), one-carbon pool by folate (p=0.0296, OR=36.2), pentose phosphate, cytokine signaling, and IgA network.
- Cervical cancer: miR-374b (linked to FOXM1), miR-133a (LAMB3–PI3K/AKT), miR-486-5p (PTEN/PI3K–AKT), and miR-489.
- Kidney cancer: SLC26A8, C14orf19, MGC34034 (LINC01187), MST1P2, GNRH2.
Selected gene sets overlapped with established cancer pathways, notably PI3K–AKT, MAPK, SUMOylation, and immune modulation axes (Shende et al., 3 Dec 2025).
6. Interpretability, Computational Complexity, and Limitations
IG-based attributions satisfy sensitivity and completeness, ensuring a transparent link between the selected gene panel and the prediction. The final gene set (default 5% of initial genes) is validated via pathway analysis and literature.
Computationally, each RGE iteration re-trains a 3-layer GCN (200 epochs) and recomputes IG for every sample (50 steps each), so the total cost scales with the number of iterations and the network size; runtime can reach several hours on commodity GPUs.
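To see how the iteration count drives cost, the number of training runs can be counted for a given elimination schedule. The sketch below is a back-of-the-envelope aid: `drop_frac` is a hypothetical per-iteration elimination fraction, since the actual value is not stated here, and `min_frac=0.05` mirrors the 5% default.

```python
import math

def count_rge_iterations(n_genes, drop_frac, min_frac=0.05):
    """Count GCN training runs until the gene set shrinks to min_frac * n_genes.

    drop_frac is a hypothetical elimination fraction per iteration.
    """
    min_genes = max(1, math.ceil(min_frac * n_genes))
    remaining, runs = n_genes, 1                  # the first run uses all genes
    while remaining > min_genes:
        n_drop = max(1, math.floor(drop_frac * remaining))
        n_drop = min(n_drop, remaining - min_genes)
        remaining -= n_drop
        runs += 1                                 # one more training and IG pass
    return runs
```

Each counted run corresponds to one 200-epoch training plus one IG pass over all samples, so halving `drop_frac` roughly doubles the total cost.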
Notably, the greedy elimination procedure is not guaranteed to find a globally optimal gene subset, since optimal subset selection is NP-hard. Ablation studies with different stopping criteria (5%, 10%, 20% minima) found that a 5% minimum gave the best balance between panel size and accuracy. Prospective remedies for computational efficiency include warm starts, dimensionality reduction (e.g., PCA), or alternative stopping rules. Currently, RGE-GCN is unimodal; extension to multi-omic or heterogeneous data is considered feasible by augmenting node features or GNN stacking (Shende et al., 3 Dec 2025).
7. Conclusion
RGE-GCN demonstrates a robust, interpretable, and general approach for biomarker discovery and early cancer detection using RNA-seq data. By integrating GCN-based classification with recursive elimination directed by IG attributions, it refines the gene panel to a subset that delivers superior accuracy and F1-score relative to established DEG pipelines. The resultant biomarkers are not only predictive but also biologically meaningful, with direct mapping to known oncogenic pathways and the identification of novel candidate genes for downstream validation.