Improving spliced alignment by modeling splice sites with deep learning (2506.12986v1)

Published 15 Jun 2025 in q-bio.GN

Abstract: Motivation: Spliced alignment refers to the alignment of messenger RNA (mRNA) or protein sequences to eukaryotic genomes. It plays a critical role in gene annotation and the study of gene functions. Accurate spliced alignment demands sophisticated modeling of splice sites, but current aligners use simple models, which may affect their accuracy given dissimilar sequences. Results: We implemented minisplice to learn splice signals with a one-dimensional convolutional neural network (1D-CNN) and trained a model with 7,026 parameters for vertebrate and insect genomes. It captures conserved splice signals across phyla and reveals GC-rich introns specific to mammals and birds. We used this model to estimate the empirical splicing probability for every GT and AG in genomes, and modified minimap2 and miniprot to leverage pre-computed splicing probability during alignment. Evaluation on human long-read RNA-seq data and cross-species protein datasets showed our method greatly improves the junction accuracy especially for noisy long RNA-seq reads and proteins of distant homology. Availability and implementation: https://github.com/lh3/minisplice

Summary

The paper introduces minisplice, a 1D-CNN model that enhances spliced alignment by reducing unannotated junction rates from 14.01% to 4.37%.
The paper leverages deep learning to capture conserved splice site signals across species, demonstrating robust performance with noisy datasets.
The integration of minisplice into aligners like minimap2 and miniprot offers researchers a more adaptive and accurate tool for genomic annotation.

Improving Spliced Alignment by Modeling Splice Sites with Deep Learning

The paper "Improving Spliced Alignment by Modeling Splice Sites with Deep Learning" introduces minisplice, a tool that scores candidate splice sites with a small neural network and supplies those scores to two widely used aligners, minimap2 and miniprot, to improve spliced alignment accuracy. Spliced alignment, the mapping of mRNA or protein sequences to eukaryotic genomes, plays a central role in gene annotation and the study of gene function. The paper addresses the limitations of the simple splice site models that current aligners use by learning splice signals from data instead.

Methodology

The authors trained a 1D-CNN with 7,026 parameters to capture splice signals conserved across vertebrate and insect genomes. The model recognizes the canonical GT..AG splice signal and also reveals GC-rich introns specific to mammals and birds. The method has three stages: training the model, predicting a splicing probability for every GT and AG in a genome, and passing these pre-computed probabilities to the spliced aligners. Training data are generated from public and curated genome annotations, from which the model learns to distinguish genuine from spurious splice sites based on sequence patterns.
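As a rough illustration of the prediction stage, the sketch below enumerates every forward-strand GT and AG in a sequence and one-hot encodes a window around each one, which is the kind of input a 1D-CNN typically consumes. It is not the minisplice implementation: the function names, the `flank` parameter, and the window size are assumptions made for this example, and the reverse strand is ignored for brevity.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (len, 4) matrix; non-ACGT bases become all-zero rows."""
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            mat[i, j] = 1.0
    return mat

def candidate_sites(seq, flank=5):
    """Yield (kind, pos, window) for each forward-strand GT (donor candidate)
    or AG (acceptor candidate) that has `flank` bases on both sides.
    `flank` is a hypothetical parameter, not a value taken from the paper."""
    seq = seq.upper()
    for pos in range(flank, len(seq) - flank - 1):
        dinuc = seq[pos:pos + 2]
        if dinuc == "GT":
            kind = "donor"
        elif dinuc == "AG":
            kind = "acceptor"
        else:
            continue
        # The window spans the flank, the dinucleotide, and the flank again.
        yield kind, pos, seq[pos - flank:pos + 2 + flank]

if __name__ == "__main__":
    for kind, pos, window in candidate_sites("CCGTCCAGCC", flank=2):
        print(kind, pos, window, one_hot(window).shape)
```

Each encoded window would then be scored by the trained model, and the resulting per-site probabilities stored for the aligner to read.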

Results and Evaluation

The authors evaluated minisplice on human long-read RNA-seq data and on cross-species protein datasets. They report a large improvement in junction accuracy, especially for noisy long RNA-seq reads and proteins of distant homology, where existing aligners struggle. Quantitatively, integrating minisplice reduced the unannotated junction rate from 14.01% with conventional methods to 4.37% in cross-species assessments.
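To make the metric concrete, the following sketch computes an unannotated junction rate as the percentage of aligned junctions that are absent from a reference annotation, and then the relative reduction implied by the two reported figures. The junction representation (a tuple of chromosome, intron start, intron end, strand) and the choice to count every aligned junction occurrence are assumptions for illustration; the paper's exact counting procedure may differ.

```python
def unannotated_junction_rate(aligned, annotated):
    """Percentage of aligned junctions that do not appear in the annotation.
    Junctions are hashable tuples such as (chrom, start, end, strand)."""
    aligned = list(aligned)
    if not aligned:
        return 0.0
    known = set(annotated)
    missing = sum(1 for junction in aligned if junction not in known)
    return 100.0 * missing / len(aligned)

def relative_reduction(before, after):
    """Fraction by which a rate dropped, relative to its starting value."""
    return (before - after) / before

if __name__ == "__main__":
    annotation = {("chr1", 100, 200, "+"), ("chr1", 300, 400, "+")}
    alignments = [
        ("chr1", 100, 200, "+"),
        ("chr1", 300, 400, "+"),
        ("chr1", 300, 410, "+"),  # not in the annotation
        ("chr1", 100, 200, "+"),
    ]
    print(unannotated_junction_rate(alignments, annotation))  # 25.0
    # Reported rates: 14.01% before, 4.37% after.
    print(round(relative_reduction(14.01, 4.37), 3))  # 0.688
```

Under this reading, the reported drop from 14.01% to 4.37% corresponds to roughly a 69% relative reduction in unannotated junctions.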

Theoretical and Practical Implications

In theoretical terms, the work shows that small, targeted deep learning models can improve genomic annotation. It illustrates a shift from fixed splice site models to data-driven ones that generalize across evolutionary contexts. In practice, integrating minisplice into existing aligners gives researchers working with diverse datasets more accurate alignments without the overhead of larger machine learning architectures.

Future Directions and Speculation

Future developments could extend the model to splice site variants beyond the dominant GT..AG, giving a more comprehensive annotation tool. Another avenue is refining the model to account for genetic variation during alignment, thereby accommodating mutation-induced signal changes. Moreover, while the current model is optimized for vertebrates and insects, expanding the training data to a broader range of organisms, such as plants, could widen its applicability.

In conclusion, by combining a small neural network with spliced alignment, this work moves toward more adaptive and accurate tools for genome analysis. The integration of minisplice into minimap2 and miniprot offers a foundation for further methods of this kind.

GitHub

GitHub - lh3/minisplice: Scoring GT/AG sites for improving spliced alignment
