Feasibility of overcoming spurious “causal” feature identification in ML-based whole-genome phenotype prediction

Determine whether machine learning pipelines for bacterial whole-genome phenotype prediction can reliably avoid falsely identifying features as "causal". In other words, can models trained on high-dimensional genomic data be interpreted in a way that distinguishes truly causal genetic variants from spurious associations?

Background

The paper argues that current machine learning models trained on bacterial whole-genome data often achieve high predictive accuracy yet cannot be interpreted reliably: the features they flag as "causal" are frequently spurious, a consequence of correlations among high-dimensional features and of genome-wide linkage disequilibrium. This undermines trust in causal insights derived from such models. The authors stress the need to establish whether this hurdle can be overcome, which motivates explicit open problems on developing pipelines and representations that separate causal signals from confounding associations.

The uncertainty is raised at the outset of the work, where it motivates the subsequent formalization of genotype-to-phenotype mapping challenges and of the requirements for causal fine mapping. It highlights the gap between predictive performance and causal understanding in bacterial genomics.
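
The way linkage can make a non-causal variant look causal can be illustrated with a small simulation. The sketch below is not taken from the paper. It is a toy example under stated assumptions: three binary variants, a phenotype determined entirely by the first one, and a second variant that copies the first except for a small fraction of flipped genomes (a crude stand-in for linkage disequilibrium). The function names, `flip_rate`, and the use of absolute Pearson correlation as an association score are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate_linked_variants(n_genomes=2000, flip_rate=0.05, seed=0):
    """Simulate three binary variants and a phenotype (toy assumptions only).

    Column 0: the causal variant (the phenotype is an exact copy of it).
    Column 1: a non-causal variant in strong linkage with column 0.
    Column 2: a non-causal variant drawn independently of everything else.
    """
    rng = np.random.default_rng(seed)
    causal = rng.integers(0, 2, size=n_genomes)
    # The linked variant differs from the causal one in ~flip_rate of genomes.
    flips = rng.random(n_genomes) < flip_rate
    linked = np.where(flips, 1 - causal, causal)
    unlinked = rng.integers(0, 2, size=n_genomes)
    phenotype = causal.copy()
    X = np.column_stack([causal, linked, unlinked])
    return X, phenotype


def marginal_association(X, y):
    """Absolute Pearson correlation of each variant with the phenotype."""
    return np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    )


X, y = simulate_linked_variants()
scores = marginal_association(X, y)
for name, score in zip(["causal", "linked (non-causal)", "unlinked"], scores):
    print(f"{name:>20}: |r| = {score:.3f}")
```

In this toy setting the linked variant scores almost as highly as the causal one, although it has no effect on the phenotype. A model ranking features by association alone would therefore struggle to tell the two apart.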

References

Though it is not yet clear whether we can overcome this hurdle, significant efforts are being made towards discovering potential high-risk bacterial genetic variants.

— Whole-Genome Phenotype Prediction with Machine Learning: Open Problems in Bacterial Genomics (2502.07749 - James et al., 11 Feb 2025) in Abstract
