qPMS Sigma -- An Efficient and Exact Parallel Algorithm for the Planted $(l, d)$ Motif Search Problem (2403.00306v1)
Abstract: Motif finding is an important step in detecting rare events in a set of DNA or protein sequences, and extracting information about these events can lead to new biological discoveries. Motifs are recurring patterns with numerous applications, including the identification of transcription factors and their binding sites, composite regulatory patterns, and similarity between families of proteins. Although several flavors of motif search have been studied in the literature, we study the version known as $(l, d)$-motif search, or Planted Motif Search (PMS). In PMS, given two integers $l$ and $d$ and $n$ input sequences, the goal is to find all patterns of length $l$ that occur in each of the $n$ input sequences with at most $d$ mismatches. We also discuss the quorum version of PMS (qPMS), which finds motifs that are planted in at least $q$ of the $n$ sequences rather than in all of them. Our algorithm builds mainly on the algorithms qPMSPrune, qPMS7, TraverStringRef, and PMS8. We introduce techniques to compress the input strings and to speed up string comparison with bitwise operations. Our algorithm performs slightly better than existing exact algorithms for the qPMS problem on DNA sequences. We also propose an approach for a parallel implementation of our algorithm.
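To make the problem statement and the bitwise-comparison idea concrete, the following is a minimal Python sketch. It is not the authors' algorithm: it assumes a 2-bit encoding of the DNA alphabet (A=0, C=1, G=2, T=3), computes Hamming distance with XOR on the packed integers, and solves qPMS by brute-force enumeration of all $4^l$ candidates, which is practical only for very small $l$. All function names (`pack`, `hamming_packed`, `brute_force_qpms`) are hypothetical and chosen for illustration.

```python
from itertools import product

# Assumed 2-bit encoding of the DNA alphabet (illustrative choice).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(s):
    """Pack a DNA string (or sequence of bases) into an int, 2 bits per base."""
    v = 0
    for ch in s:
        v = (v << 2) | CODE[ch]
    return v


def hamming_packed(x, y, l):
    """Hamming distance between two packed l-mers (l >= 1)."""
    diff = x ^ y                          # a 2-bit pair is nonzero iff the bases differ
    low_mask = int("01" * l, 2)           # selects the low bit of each 2-bit pair
    mism = (diff | (diff >> 1)) & low_mask  # fold each pair onto its low bit
    return bin(mism).count("1")           # count mismatching positions


def kmers(seq, l):
    """All packed l-mers of a sequence, in order of occurrence."""
    return [pack(seq[i:i + l]) for i in range(len(seq) - l + 1)]


def occurs(motif, packed_kmers, l, d):
    """True if the motif matches some l-mer with at most d mismatches."""
    return any(hamming_packed(motif, k, l) <= d for k in packed_kmers)


def brute_force_qpms(seqs, l, d, q):
    """Return all l-mers occurring, with <= d mismatches, in at least q sequences.

    Setting q = len(seqs) gives the plain PMS problem.
    """
    packed = [kmers(s, l) for s in seqs]
    found = []
    for cand in product("ACGT", repeat=l):  # exhaustive: 4**l candidates
        m = pack(cand)
        hits = sum(occurs(m, ks, l, d) for ks in packed)
        if hits >= q:
            found.append("".join(cand))
    return found
```

For example, with `seqs = ["ACGTAC", "TTGTAA", "CCGTTC"]`, `l = 3`, `d = 1`, and `q = 3`, the candidate `GTA` qualifies: it occurs exactly in the first two sequences and within one mismatch (`GTT`) in the third. Exact algorithms of the kind named in the abstract avoid this exhaustive enumeration by pruning the candidate space; the sketch only pins down what a correct answer looks like.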
- H. Dinh, S. Rajasekaran, and J. Davila, “qpms7: A fast algorithm for finding (l, d)-motifs in dna and protein sequences,” PloS one, vol. 7, no. 7, p. e41425, 2012.
- L. Duret and P. Bucher, “Searching for regulatory elements in human noncoding sequences,” Current opinion in structural biology, vol. 7, no. 3, pp. 399–406, 1997.
- S. Pal, P. Xiao, and S. Rajasekaran, “Efficient sequential and parallel algorithms for finding edit distance based motifs,” BMC genomics, vol. 17, no. 4, p. 465, 2016.
- S. Rajasekaran, “Computational techniques for motif search,” Frontiers in Bioscience, vol. 14, pp. 5052–5065, 2009.
- M. Frances and A. Litman, “On covering problems of codes,” Theory of Computing Systems, vol. 30, no. 2, pp. 113–119, 1997.
- M. Nicolae and S. Rajasekaran, “qpms9: an efficient algorithm for quorum planted motif search,” Scientific reports, vol. 5, 2015.
- M. Nicolae and S. Rajasekaran, “Efficient sequential and parallel algorithms for planted motif search,” BMC bioinformatics, vol. 15, no. 1, p. 34, 2014.
- J. Davila, S. Balla, and S. Rajasekaran, “Pampa: An improved branch and bound algorithm for planted (l, d) motif search,” in Tech. rep, 2007.
- J. Davila, S. Balla, and S. Rajasekaran, “Fast and practical algorithms for planted (l, d) motif search,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 4, no. 4, pp. 544–552, 2007.
- S. Bandyopadhyay, S. Sahni, and S. Rajasekaran, “Pms6mc: A multicore algorithm for motif discovery,” Algorithms, vol. 6, no. 4, pp. 805–823, 2013.
- F. Y. Chin and H. C. Leung, “Voting algorithms for discovering long motifs,” in APBC, pp. 261–271, 2005.
- N. Pisanti, A. M. Carvalho, L. Marsan, and M.-F. Sagot, “Risotto: Fast extraction of motifs with mismatches,” in Latin American Symposium on Theoretical Informatics, pp. 757–768, Springer, 2006.
- T. L. Bailey, C. Elkan, et al., “Fitting a mixture model by expectation maximization to discover motifs in biopolymers,” 1994.
- J. Buhler and M. Tompa, “Finding motifs using random projections,” Journal of computational biology, vol. 9, no. 2, pp. 225–242, 2002.
- C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, J. C. Wootton, et al., “Detecting subtle sequence signals: a gibbs sampling strategy for multiple alignment,” Science, vol. 262, pp. 208–214, 1993.
- P. A. Pevzner, S.-H. Sze, et al., “Combinatorial approaches to finding subtle signals in dna sequences,” in ISMB, vol. 8, pp. 269–278, 2000.
- E. Rocke and M. Tompa, “An algorithm for finding novel gapped motifs in dna sequences,” in Proceedings of the second annual international conference on Computational molecular biology, pp. 228–233, ACM, 1998.
- U. Keich and P. A. Pevzner, “Finding motifs in the twilight zone,” in Proceedings of the sixth annual international conference on Computational biology, pp. 195–204, ACM, 2002.
- A. Price, S. Ramabhadran, and P. A. Pevzner, “Finding subtle motifs by branching from sample strings,” Bioinformatics, vol. 19, no. suppl 2, pp. ii149–ii155, 2003.
- G. Z. Hertz and G. D. Stormo, “Identifying dna and protein patterns with statistically significant alignments of multiple sequences,” Bioinformatics, vol. 15, no. 7, pp. 563–577, 1999.
- N. C. Jones and P. Pevzner, “An introduction to bioinformatics algorithms,” 2004.
- Totowa, NJ: Humana Press, 2010.
- W. Wei and X.-D. Yu, “Comparative analysis of regulatory motif discovery tools for transcription factor binding sites,” Genomics, proteomics & bioinformatics, vol. 5, no. 2, pp. 131–142, 2007.
- S. Tanaka, “Improved exact enumerative algorithms for the planted ($l$, $d$)-motif search problem,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 11, no. 2, pp. 361–374, 2014.