
Splicing Up Your Predictions with RNA Contrastive Learning (2310.08738v2)

Published 12 Oct 2023 in cs.LG and q-bio.GN

Abstract: In the face of rapidly accumulating genomic data, our understanding of the RNA regulatory code remains incomplete. Recent self-supervised methods in other domains have demonstrated the ability to learn rules underlying the data-generating process such as sentence structure in language. Inspired by this, we extend contrastive learning techniques to genomic data by utilizing functional similarities between sequences generated through alternative splicing and gene duplication. Our novel dataset and contrastive objective enable the learning of generalized RNA isoform representations. We validate their utility on downstream tasks such as RNA half-life and mean ribosome load prediction. Our pre-training strategy yields competitive results using linear probing on both tasks, along with up to a two-fold increase in Pearson correlation in low-data conditions. Importantly, our exploration of the learned latent space reveals that our contrastive objective yields semantically meaningful representations, underscoring its potential as a valuable initialization technique for RNA property prediction.
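The contrastive idea described in the abstract can be illustrated with a minimal sketch. It assumes a generic InfoNCE-style loss, which is not necessarily the objective used in the paper. Two isoform embeddings from the same gene form a positive pair, and the other embeddings in the batch act as negatives. The function name `info_nce_loss`, the `temperature` value, and the toy 2-D embeddings are hypothetical and chosen only for illustration.

```python
import math


def _normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    # Assumes v is not the zero vector.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce_loss(anchors, positives, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's exact objective).

    anchors[i] and positives[i] are embeddings of two sequences treated as
    functionally similar (for example, two isoforms of one gene).
    positives[j] with j != i serve as negatives for anchors[i].
    """
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching positive as the target class.
        total += log_denom - logits[i]
    return total / len(a)


# Toy embeddings (hypothetical): each anchor is close to its own positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]
swapped = [matched[1], matched[0]]  # deliberately mismatched pairs

loss_matched = info_nce_loss(anchors, matched)
loss_swapped = info_nce_loss(anchors, swapped)
```

In this toy setup the loss is lower when each anchor is paired with its true partner than when the pairs are swapped, which is the pressure that pulls functionally related sequences together in the embedding space.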

