Quantify the Effect of Training Data Biases on Downstream Genomics Tasks

Determine how strongly experiment-specific and technology-specific biases and batch effects in functional genomics datasets affect downstream applications, specifically enhancer sequence prediction and genetic variant effect prediction with deep neural network models trained on DNA sequence data.

Background

The paper notes that genomics data are affected by various experimental and technological biases, as well as strong batch effects, which can obscure biological signals and influence model training. Recent work indicates that deep learning models may rely on such biases in addition to genuine biological features, which raises concerns about downstream analyses.

The authors propose a disentangled representation learning approach (Metadata-guided Feature Disentanglement) to separate biological signals from technical factors. However, they explicitly state that it is unclear how strongly these biases affect downstream tasks such as enhancer sequence prediction and genetic variant effect prediction, which motivates quantifying this impact.

References

It is unclear how strongly this affects downstream applications, such as enhancer sequence or genetic variant effect prediction.

— Metadata-guided Feature Disentanglement for Functional Genomics (2405.19057 - Rakowski et al., 29 May 2024) in Section: Introduction
