
RGE-GCN: Recursive Gene Elimination with GCN

Updated 8 January 2026
  • RGE-GCN is a framework that fuses recursive gene elimination with graph convolutional networks to select concise, biologically interpretable gene panels from high-dimensional RNA-seq data.
  • It constructs sample–sample graphs using Pearson correlation and uses Integrated Gradients for principled, robust gene ranking during iterative elimination.
  • The approach achieves superior accuracy and F1-scores in early cancer detection by converging on critical biomarkers validated against established DEG and machine learning classifiers.

RGE-GCN (Recursive Gene Elimination with Graph Convolutional Networks) is a data-driven framework designed for joint feature selection and classification from high-dimensional RNA-seq expression data, with direct application to early cancer detection and biomarker discovery. RGE-GCN integrates a recursive gene elimination scheme with a graph convolutional network (GCN) classifier and utilizes Integrated Gradients (IG) for principled feature attribution. By iteratively removing genes deemed uninformative for classification, the approach converges on a concise and biologically interpretable gene panel, while maintaining or improving predictive performance over standard differentially expressed gene (DEG) methods (Shende et al., 3 Dec 2025).

1. Data Preprocessing and Recursive Elimination Protocol

RGE-GCN begins with an RNA-seq count or normalized expression matrix $X \in \mathbb{R}^{N \times G}$, paired with class labels $y \in \{1, \dots, K\}^N$. The pipeline applies a two-tiered data split:

  • Outer Split: Randomly hold out $20\%$ of the samples as a test set; the remaining $80\%$ forms a “train–validation pool” for recursive selection.
  • Recursive Gene Elimination (RGE) Loop: On each iteration with the current gene set $G'$, perform a $75/25$ split on the train–validation pool. Operations within each loop include (a) sample–sample graph construction, (b) GCN training and classification, (c) feature attribution using Integrated Gradients, and (d) elimination of the bottom $p\%$ of genes ranked by aggregate IG score. The process continues until $|G'|$ reaches a minimum threshold (default $5\%$ of $G$), with the best-performing gene subset (by validation accuracy) retained.

The final model is retrained on the full $80\%$ pool using this optimal gene set, and its accuracy and F1-score are reported on the original $20\%$ held-out test set (Shende et al., 3 Dec 2025).
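The protocol above can be sketched as a plain loop. This is a minimal illustration under stated assumptions, not the authors' code: `train_and_score` is a hypothetical callback standing in for steps (a) to (c) (graph construction, GCN training, IG ranking), and `drop_frac` (the $p\%$ of the text) and `seed` are assumed values.

```python
import math
import random

def recursive_gene_elimination(genes, pool_ids, train_and_score,
                               drop_frac=0.1, min_frac=0.05, seed=0):
    """Greedy backward elimination; returns (best_gene_set, best_val_accuracy).

    train_and_score(genes, train_ids, val_ids) is a hypothetical callback that
    returns (val_accuracy, {gene: importance}).
    """
    rng = random.Random(seed)
    current = list(genes)
    floor = max(1, math.ceil(min_frac * len(genes)))  # default floor: 5% of G
    best_genes, best_acc = list(current), float("-inf")

    while True:
        # Fresh 75/25 train/validation split of the pool on every iteration.
        ids = list(pool_ids)
        rng.shuffle(ids)
        cut = int(0.75 * len(ids))
        val_acc, scores = train_and_score(current, ids[:cut], ids[cut:])

        if val_acc > best_acc:  # remember the best-performing subset so far
            best_genes, best_acc = list(current), val_acc
        if len(current) <= floor:
            break

        # Drop the lowest-scoring genes, but never go below the floor.
        n_drop = max(1, int(drop_frac * len(current)))
        n_keep = max(floor, len(current) - n_drop)
        current = sorted(current, key=lambda g: scores[g], reverse=True)[:n_keep]

    return best_genes, best_acc
```

The outer 20% test set never enters this function; only the train–validation pool is passed as `pool_ids`, which keeps the final test estimate free of selection bias.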

2. Sample–Sample Graph Construction

A Pearson correlation-based, sample–sample graph underpins the GCN classification for each training subset. For $X_{\text{train}} \in \mathbb{R}^{N' \times G'}$:

  • Graph Weights: $w_{ij} = \text{cov}(x_i, x_j) / (\sigma_{x_i} \sigma_{x_j})$, where $x_i, x_j$ are expression vectors for samples $i, j$.
  • Adjacency Matrix: $A_{ij} = 1$ if $|w_{ij}| \geq \tau$, otherwise $0$ (with $\tau = 0.7$ in all experiments).
  • Self-loops: $\hat{A} = A + I$, which keeps each sample's own features in the neighborhood aggregation and guarantees nonzero degrees for the symmetric normalization used in GCN propagation.

This construction captures expression profile similarity and encodes it as a binary graph suitable for convolutional processing.
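The three steps above map onto a few NumPy calls. A minimal sketch, assuming rows of `X` are samples and columns are genes; `build_sample_graph` is an illustrative name, and zeroing the diagonal before adding $I$ is an assumption made here so that each self-loop is counted once (since $w_{ii} = 1 \geq \tau$ by definition).

```python
import numpy as np

def build_sample_graph(X, tau=0.7):
    """Return A_hat = A + I for a sample-by-gene expression matrix X (N' x G')."""
    W = np.corrcoef(X)                      # rows are samples: N' x N' Pearson matrix
    A = (np.abs(W) >= tau).astype(float)    # binary adjacency: |w_ij| >= tau
    np.fill_diagonal(A, 0.0)                # assumption: remove the trivial w_ii = 1 entries
    return A + np.eye(X.shape[0])           # add self-loops exactly once

# Toy data: samples 0 and 1 are strongly correlated, sample 2 is anti-correlated
# with both, and sample 3 is only weakly correlated with the rest.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.5],
              [4.0, 3.0, 2.0, 1.0],
              [1.0, 3.0, 1.0, 3.0]])
A_hat = build_sample_graph(X)
```

Because the threshold is applied to $|w_{ij}|$, strongly anti-correlated samples are linked as well, so sample 2 joins samples 0 and 1 while sample 3 keeps only its self-loop.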

3. Graph Convolutional Network Architecture

RGE-GCN employs a standard spectral GCN as introduced by Kipf & Welling (2016), with the following propagation rule:

$$H^{(\ell+1)} = \sigma\left( \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(\ell)} W^{(\ell)} \right)$$

where $\hat{A}$ is the adjacency with self-loops, $\hat{D}$ its degree matrix, $H^{(\ell)}$ the feature matrix at layer $\ell$, $W^{(\ell)}$ learnable weights, and $\sigma$ a nonlinearity (ReLU).
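As a sketch of the propagation rule for a single layer, the following NumPy function applies the symmetric normalization and a ReLU. The name `gcn_layer` is illustrative, and the batch normalization and dropout used in the full architecture are deliberately left out.

```python
import numpy as np

def gcn_layer(A_hat, H, W):
    """One propagation step: ReLU(D^-1/2 A_hat D^-1/2 H W)."""
    deg = A_hat.sum(axis=1)                 # degrees include the self-loop, so deg >= 1
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Scale row i by d_i^-1/2 and column j by d_j^-1/2.
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)  # ReLU nonlinearity
```

For two fully connected samples, every entry of the normalized adjacency is 0.5, so each output row is the average of the two input feature rows (before the weight matrix and ReLU are applied).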

Architecture Details:

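| Layer | Input Dim | Output Dim | Operations |
|-------|-----------|------------|------------|
| 1 | $G'$ | 64 | BatchNorm, ReLU, Dropout(0.4) |
| 2 | 64 | 32 | BatchNorm, ReLU, Dropout(0.4) |
| 3 | 32 | 16 | ReLU, Dropout(0.4) |
| 4 | 16 | $K$ | Linear projection to class logits |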

Loss is weighted cross-entropy (class weights = inverse frequency). The AdamW optimizer is used (learning rate 0.01, weight decay 1e-3), with 200 epochs per training fold (Shende et al., 3 Dec 2025).
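The inverse-frequency weighting can be illustrated as follows. This is a sketch under assumptions: labels are taken as 0-based integers (the text uses $1 \dots K$), every class is assumed to occur at least once, and the normalization $N / (K \cdot n_c)$ is one common convention that is not confirmed by the source. Function names are illustrative.

```python
import numpy as np

def inverse_frequency_weights(y, n_classes):
    """Class weight proportional to 1 / class count (assumed scaling: N / (K * n_c))."""
    counts = np.bincount(y, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * counts)

def weighted_cross_entropy(logits, y, class_weights):
    """Weighted mean of per-sample negative log-likelihoods."""
    z = logits - logits.max(axis=1, keepdims=True)            # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    w = class_weights[y]                                       # weight of each sample's true class
    nll = -log_probs[np.arange(len(y)), y]
    return float((w * nll).sum() / w.sum())

y = np.array([0, 0, 0, 1])                 # imbalanced toy labels
weights = inverse_frequency_weights(y, n_classes=2)
loss = weighted_cross_entropy(np.zeros((4, 2)), y, weights)
```

With three samples of class 0 and one of class 1, the minority class receives three times the weight of the majority class, so its errors count equally in aggregate.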

4. Integrated Gradients for Gene Ranking

Feature attribution leverages Integrated Gradients (IG) computed for each gene over the trained GCN:

$$IG_i(x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial F(x' + \alpha(x - x'))}{\partial x_i} \, d\alpha$$

with $F(\cdot)$ as the GCN output, baseline $x' = 0$, and the integral approximated by a 50-step Riemann sum. For multiclass settings, the per-gene IG scores are aggregated over all $K$ classes:

$$\text{score}_i = \sum_{c=1}^{K} |IG_i^c|$$

Genes are sorted by their aggregate score; the lowest-ranked $p\%$ are eliminated per iteration. Because IG satisfies the sensitivity and completeness axioms, the ranking rests on attributions that sum to the change in model output relative to the baseline.
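The attribution and aggregation steps can be sketched on a toy model. The block below is an assumption-laden illustration: it replaces the trained GCN with a hypothetical stand-in $F(x) = \text{softmax}(Wx)$ whose gradient is available in closed form, uses a midpoint Riemann rule (the source does not say which rule is used), and the function names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def integrated_gradients(x, W, steps=50):
    """IG of each class probability of the toy model F(x) = softmax(W x), baseline x' = 0.

    Returns an array of shape (K, G): row c holds IG_i^c for every gene i.
    """
    baseline = np.zeros_like(x)
    alphas = (np.arange(steps) + 0.5) / steps        # midpoint Riemann rule on [0, 1]
    grad_sum = np.zeros_like(W, dtype=float)
    for a in alphas:
        p = softmax(W @ (baseline + a * (x - baseline)))
        # Analytic Jacobian of softmax(W x): dF_c/dx = p_c * (W_c - sum_k p_k W_k)
        grad_sum += p[:, None] * (W - p @ W)
    return (x - baseline) * grad_sum / steps

def gene_scores(ig):
    """score_i = sum over classes of |IG_i^c|."""
    return np.abs(ig).sum(axis=0)

W = np.array([[1.0, -0.5, 0.0],
              [-1.0, 0.5, 0.0]])           # gene 2 has no influence on any class
x = np.array([2.0, 1.0, 3.0])
ig = integrated_gradients(x, W)
scores = gene_scores(ig)
```

Completeness can be checked numerically: each row of `ig` sums, up to discretization error, to `softmax(W @ x) - softmax(0)` for that class, and the uninformative gene receives a score of zero, so it would be eliminated first.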

5. Experimental Setup, Results, and Biological Interpretation

Datasets and Baselines

Experiments were conducted on:

  • Synthetic cohorts (1000 genes, Negative Binomial noise, varying sample size and DEG fraction)
  • Cervical cancer miRNA data (58 paired tumor/control, 714 miRNAs)
  • TCGA LUAD/LUSC (1128 samples, 20,531 genes)
  • TCGA kidney cohorts (1020 samples, 20,531 genes)

Baselines included DEG tools (DESeq2, edgeR, limma-voom) and machine learning classifiers (RF, SVM, MLP, GCN) re-trained on the selected gene panels. Accuracy and (macro-)F1-scores were averaged over 5 random splits.

Key Results

| Dataset | RGE-GCN Accuracy | limma-voom Accuracy | edgeR Accuracy |
|---------|------------------|---------------------|----------------|
| Cervical | $0.900 \pm 0.042$ | $0.893 \pm 0.055$ | $0.883 \pm 0.069$ |
| Lung (LUAD) | $0.942 \pm 0.016$ | $0.922 \pm 0.029$ | $0.913 \pm 0.037$ |
| Kidney | $0.942 \pm 0.007$ | $0.892 \pm 0.022$ | $0.889 \pm 0.033$ |

On synthetic data, RGE-GCN selected approximately 50–90 genes (out of 1000) and achieved near-perfect scores, outperforming or matching other classifiers on both selected panels and the full DEG sets.

Biomarker Panels and Pathway Enrichment

  • Lung cancer: CEACAM3, CEACAM6, SUMO4, FOLR2, OR52I1, OR10A3, ADAM6; pathway enrichment indicated purine metabolism (p=0.01596, OR=11.16), one-carbon pool by folate (p=0.0296, OR=36.2), pentose phosphate, cytokine signaling, and IgA network.
  • Cervical cancer: miR-374b (linked to FOXM1), miR-133a (LAMB3–PI3K/AKT), miR-486-5p (PTEN/PI3K–AKT), and miR-489.
  • Kidney cancer: SLC26A8, C14orf19, MGC34034 (LINC01187), MST1P2, GNRH2.

Selected gene sets overlapped with established cancer pathways, notably PI3K–AKT, MAPK, SUMOylation, and immune modulation axes (Shende et al., 3 Dec 2025).

6. Interpretability, Computational Complexity, and Limitations

IG-based attributions satisfy the sensitivity and completeness axioms, so each gene's score can be traced to its contribution to the prediction. The final gene set (the best-performing subset found before elimination reaches its floor, by default 5% of the initial genes) is validated via pathway analysis and literature.

Computationally, each RGE iteration re-trains a 3-layer GCN (200 epochs) and recomputes IG for every sample (50 steps each), leading to a total cost proportional to the number of iterations and network size (e.g., for $G' = 20\,000$, runtime is several hours on commodity GPUs).
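To make the dependence on the number of iterations concrete, the following sketch counts how many elimination rounds are needed to shrink the gene set to the floor. The per-round drop fraction is not stated above, so the values used here (10%, 20%, 50%) are assumptions, and integer rounding of gene counts is ignored.

```python
import math

def elimination_rounds(drop_frac, min_frac=0.05):
    """Smallest n with (1 - drop_frac)**n <= min_frac, ignoring integer rounding."""
    return math.ceil(math.log(min_frac) / math.log(1.0 - drop_frac))

rounds_10 = elimination_rounds(0.10)   # 29 rounds
rounds_20 = elimination_rounds(0.20)   # 14 rounds
rounds_50 = elimination_rounds(0.50)   # 5 rounds
```

Each round implies one full GCN training run plus one IG pass, so under these assumptions a 10% drop rate costs roughly twice as many training runs as a 20% drop rate.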

Notably, greedy elimination is not guaranteed to find the globally optimal gene subset: exact subset selection is NP-hard, so the procedure trades optimality for tractability. Ablation studies with different stopping criteria (5%, 10%, 20% minima) found that a 5% minimum best balanced panel size and accuracy. Possible remedies for the computational cost include warm starts, dimensionality reduction (e.g., PCA), or alternative stopping rules. Currently, RGE-GCN is unimodal; extension to multi-omic or heterogeneous data is considered feasible by augmenting node features or GNN stacking (Shende et al., 3 Dec 2025).

7. Conclusion

RGE-GCN demonstrates a robust, interpretable, and general approach for biomarker discovery and early cancer detection using RNA-seq data. By integrating GCN-based classification with recursive elimination directed by IG attributions, it refines the gene panel to a subset that delivers superior accuracy and F1-score relative to established DEG pipelines. The resultant biomarkers are not only predictive but also biologically meaningful, with direct mapping to known oncogenic pathways and the identification of novel candidate genes for downstream validation.
