Relevance of Genes Selected by ML Explainability in Expression-Based Phenotype Classification

Determine the biological relevance of gene sets selected by explainability methods applied to machine-learning classifiers trained on gene expression data, such as integrated gradients applied to logistic regression, multilayer perceptron, and graph neural network models. Specifically, assess whether these top-ranked genes consistently correspond to phenotype-associated biomarkers and established biological processes, for example via over-representation analysis against MSigDB collections.
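To make the attribution step concrete, here is a minimal sketch (not the paper's pipeline) of integrated gradients for a logistic-regression classifier: because the logit is linear, the path integral is exact and the attribution for gene i reduces to (x_i − baseline_i) · w_i. All names and numbers below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 5
w = rng.normal(size=n_genes)      # hypothetical fitted logistic-regression weights
x = rng.normal(size=n_genes)      # one sample's expression profile
baseline = np.zeros(n_genes)      # all-zero reference input

# Riemann approximation of the integrated-gradients path integral:
# average the gradient along the straight line from baseline to x.
steps = 50
grads = np.tile(w, (steps, 1))    # gradient of a linear logit is constant: w
ig_numeric = (x - baseline) * grads.mean(axis=0)

ig_exact = (x - baseline) * w     # closed form for a linear model
assert np.allclose(ig_numeric, ig_exact)

# Completeness axiom: attributions sum to f(x) - f(baseline) for the logit.
assert np.isclose(ig_exact.sum(), w @ x - w @ baseline)

# Rank genes by absolute attribution, largest first.
ranking = np.argsort(-np.abs(ig_exact))
print(ranking)
```

For the multilayer perceptron and graph neural network models the gradient is no longer constant, so the path average must actually be computed (e.g. with autograd), but the ranking step is the same.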

Background

Many machine learning classifiers (logistic regression, multilayer perceptron, graph neural networks) can accurately predict phenotypes from gene expression data and can be accompanied by explainability methods that rank genes by importance. However, these rankings differ markedly across explainability and statistical methods, and similar performance can be achieved using many alternative sets of genes, raising concerns about the completeness and stability of the selected biomarkers.
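The instability noted above can be quantified by comparing the top-k gene sets produced by different methods, for instance with a Jaccard overlap. A minimal sketch with hypothetical gene rankings:

```python
def topk_jaccard(rank_a, rank_b, k):
    """Jaccard overlap between the top-k genes of two importance rankings."""
    a, b = set(rank_a[:k]), set(rank_b[:k])
    return len(a & b) / len(a | b)

# Hypothetical rankings from two selection methods over the same genes.
ig_rank = ["TP53", "BRCA1", "MYC", "EGFR", "KRAS"]
deseq_rank = ["MYC", "TP53", "CDK4", "KRAS", "EGFR"]

# Top-3 sets share {TP53, MYC} out of 4 distinct genes -> 0.5
print(topk_jaccard(ig_rank, deseq_rank, 3))  # -> 0.5
```

Low overlap at a fixed k across explainability and statistical methods is exactly the kind of divergence that motivates the open question.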

The paper compares explainability-derived rankings (integrated gradients) with classical statistical feature selection (DESeq2, EdgeR, mutual information) across multiple datasets (TCGA, GTEx, TARGET) and examines biological relevance via over-representation analysis. Despite strong predictive performance, the divergence among top-ranked genes leaves unresolved whether explainability-selected genes reflect true biological mechanisms relevant to the phenotypes.
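The over-representation analysis mentioned above boils down to a hypergeometric tail test: given k selected genes that fall in a pathway, how surprising is that overlap? A self-contained sketch (function name and example counts are hypothetical, not taken from the paper):

```python
from math import comb

def ora_pvalue(k, n_selected, n_in_set, n_genes):
    """Hypergeometric over-representation p-value: probability of drawing
    at least k genes from a pathway of size n_in_set when n_selected genes
    are picked at random from a universe of n_genes."""
    return sum(
        comb(n_in_set, i) * comb(n_genes - n_in_set, n_selected - i)
        for i in range(k, min(n_selected, n_in_set) + 1)
    ) / comb(n_genes, n_selected)

# Hypothetical numbers: 8 of the top 50 genes fall in a 100-gene MSigDB
# set, drawn from a ~20,000-gene universe; expected overlap is only 0.25,
# so the p-value is tiny.
p = ora_pvalue(8, 50, 100, 20000)
print(p)
```

In practice one would run this against each MSigDB collection and correct for multiple testing; a gene set that scores well on such tests is the kind of biological grounding the open question asks about.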

References

Still, the question of the relevance of genes selected through ML model explainability remains unresolved.