Metadata-guided Feature Disentanglement for Functional Genomics (2405.19057v1)

Published 29 May 2024 in q-bio.GN

Abstract: With the development of high-throughput technologies, genomics datasets, including functional genomics data, are growing rapidly in size. This has allowed the training of large Deep Learning (DL) models to predict epigenetic readouts, such as protein binding or histone modifications, from genome sequences. However, large dataset sizes come at the price of data consistency, since such datasets often aggregate results from a large number of studies conducted under varying experimental conditions. While data from large-scale consortia are useful because they allow studying the effects of different biological conditions, they can also contain unwanted biases from confounding experimental factors. Here, we introduce Metadata-guided Feature Disentanglement (MFD), an approach that allows disentangling biologically relevant features from potential technical biases. MFD incorporates target metadata into model training by conditioning the weights of the model output layer on different experimental factors. It then separates the factors into disjoint groups and enforces independence of the corresponding feature subspaces with an adversarially learned penalty. We show that the metadata-driven disentanglement approach allows for better model introspection by connecting latent features to experimental factors, while maintaining or even improving performance in downstream tasks such as enhancer prediction or genetic variant discovery. The code for our implementation is available at https://github.com/HealthML/MFD
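
The idea of conditioning output-layer weights on experimental factors can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository above for that); it assumes, for illustration only, that each target has one biological and one technical metadata value, that each value maps to a learned weight block, and that the two blocks act on disjoint parts of the feature vector. All names (`make_embeddings`, `target_weights`, `lab_A`, and so on) are hypothetical, and the adversarially learned independence penalty is omitted.

```python
import math
import random


def make_embeddings(names, dim, rng):
    """Hypothetical lookup table: one weight block per metadata value."""
    return {name: [rng.gauss(0.0, 0.1) for _ in range(dim)] for name in names}


def target_weights(meta, bio_emb, tech_emb):
    """Assemble one target's output weights from its metadata.

    The first block multiplies only the 'biological' feature subspace,
    the second block only the 'technical' one (an assumption of this sketch).
    """
    return bio_emb[meta["bio"]] + tech_emb[meta["tech"]]


def predict(features, weights):
    """Sigmoid of the dot product between features and target weights."""
    logit = sum(f * w for f, w in zip(features, weights))
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)
BIO_DIM, TECH_DIM = 4, 2  # sizes of the two feature subspaces (arbitrary)

bio_emb = make_embeddings(["H3K27ac", "CTCF"], BIO_DIM, rng)
tech_emb = make_embeddings(["lab_A", "lab_B"], TECH_DIM, rng)

# Two targets measuring the same biology under different technical conditions.
targets = [
    {"bio": "CTCF", "tech": "lab_A"},
    {"bio": "CTCF", "tech": "lab_B"},
]

# Stand-in for the latent features a sequence model would produce.
features = [rng.gauss(0.0, 1.0) for _ in range(BIO_DIM + TECH_DIM)]

weights = [target_weights(t, bio_emb, tech_emb) for t in targets]
preds = [predict(features, w) for w in weights]
```

In this sketch the two targets share the biological weight block and differ only in the technical block, which is what would let latent features be traced back to a specific group of experimental factors.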
