Predicting Transcription Factor Binding Sites using Transformer based Capsule Network (2310.15202v2)
Abstract: Predicting transcription factor binding sites is important for understanding how transcription factors regulate gene expression and how this regulation can be modulated for therapeutic purposes. Although significant work has addressed this problem in the past few years, there is still room for improvement. To this end, this work proposes a transformer-based capsule network, DNABERT-Cap, that predicts transcription factor binding sites by mining ChIP-seq datasets. DNABERT-Cap is a bidirectional encoder pre-trained on a large number of genomic DNA sequences, augmented with a capsule layer responsible for the final prediction. The proposed model builds a predictor for transcription factor binding sites by jointly optimising features from the bidirectional encoder and the capsule layer, along with convolutional and bidirectional long short-term memory layers. To evaluate the effectiveness of the proposed approach, we use benchmark ChIP-seq datasets for five cell lines, namely A549, GM12878, Hep-G2, H1-hESC and HeLa, available in the ENCODE repository. The results show that the average area under the receiver operating characteristic curve exceeds 0.91 for all five cell lines. DNABERT-Cap is also compared with existing state-of-the-art deep learning based predictors, namely DeepARC, DeepTF, CNN-Zeng and DeepBind, and is seen to outperform them.
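The evaluation metric named in the abstract, the area under the receiver operating characteristic curve (AUC) averaged over datasets, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `roc_auc` and `mean_auc` are hypothetical, labels are assumed to be 0 (unbound) or 1 (bound), and the scores in the usage note are toy values rather than results from the paper. The sketch uses the rank-sum identity, in which AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative.

```python
from typing import List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via the rank-sum (Mann-Whitney U) identity.

    Assumes binary labels (1 = bound, 0 = unbound). Tied scores
    receive their average rank, so a tie counts as half a win.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    # Sort example indices by score, then assign 1-based average ranks.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks: List[float] = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_stat = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


def mean_auc(datasets: Sequence[Tuple[Sequence[int], Sequence[float]]]) -> float:
    """Average AUC over several (labels, scores) datasets."""
    aucs = [roc_auc(labels, scores) for labels, scores in datasets]
    return sum(aucs) / len(aucs)
```

For example, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` evaluates to 0.75, because three of the four positive-negative pairs are ranked correctly. In practice a library routine such as scikit-learn's `roc_auc_score` would normally be used instead of a hand-written version.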