Fast sequence to graph alignment using the graph wavefront algorithm (2206.13574v1)

Published 27 Jun 2022 in q-bio.GN and q-bio.QM

Abstract: Motivation: A pan-genome graph represents a collection of genomes and encodes sequence variations between them. It is a powerful data structure for studying multiple similar genomes. Sequence-to-graph alignment is an essential step for the construction and the analysis of pan-genome graphs. However, existing algorithms incur runtime proportional to the product of sequence length and graph size, making them inefficient for aligning long sequences against large graphs. Results: We propose the graph wavefront alignment algorithm (Gwfa), a new method for aligning a sequence to a sequence graph. Although the worst-case time complexity of Gwfa is the same as the existing algorithms, it is designed to run faster for closely matching sequences, and its runtime in practice often increases only moderately with the edit distance of the optimal alignment. On four real datasets, Gwfa is up to four orders of magnitude faster than other exact sequence-to-graph alignment algorithms. We also propose a graph pruning heuristic on top of Gwfa, which can achieve an additional $\sim$10-fold speedup on large graphs. Availability: Gwfa code is accessible at https://github.com/lh3/gwfa.

Summary

The paper presents the Graph Wavefront Alignment (Gwfa) algorithm for efficient sequence-to-graph alignment, crucial for analyzing pan-genome graphs.
Gwfa uses a diagonal recurrence relation and graph wavefronts, incorporating a pruning heuristic to speed up alignment, especially for closely matching sequences.
Empirical results show Gwfa provides significant runtime speedups (up to four orders of magnitude) and lower memory usage compared to existing methods, enabling practical pan-genomic analysis.

Fast Sequence-to-Graph Alignment Using the Graph Wavefront Algorithm

The paper presents the Graph Wavefront Alignment (Gwfa) algorithm, a method for aligning a sequence to a pan-genome graph at lower computational cost than existing exact methods. Pan-genome graphs represent collections of genomes and encode the sequence variations between them, which makes them well suited to studying multiple similar genomes. The main contribution is an efficient algorithm for sequence-to-graph alignment, a step that is essential for both constructing and analyzing these graphs.

Motivation and Problem Formulation

Existing sequence-to-graph alignment algorithms take time proportional to the product of sequence length and graph size, which makes them inefficient for aligning long sequences against large graphs. Gwfa targets the common case in which the query closely matches a path in the graph. It has the same worst-case time complexity as earlier methods, but in practice its runtime often grows only moderately with the edit distance of the optimal alignment.

The problem is formulated on a sequence graph: a directed graph in which each vertex is labeled with a string. Given a query sequence, the task is to find a walk through the graph whose spelled-out sequence has the minimum edit distance to the query.
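To make the formulation concrete, here is a minimal baseline sketch in Python. It is not the paper's algorithm: it runs Dijkstra's shortest-path search over states (vertex, offset in the vertex label, offset in the query), so its cost can grow with the product of query length and graph size. The function name, the dictionary-based graph encoding, and the convention that the walk starts at the first base of a chosen vertex and may end anywhere are assumptions made for illustration.

```python
import heapq


def graph_edit_distance(labels, edges, start, query):
    """Minimum edit distance between `query` and any walk in the graph.

    Assumed conventions (illustrative, not taken from the paper):
    labels maps an integer vertex id to its string label, edges maps a
    vertex id to a list of successor ids, the walk begins at the first
    base of `start`, and it may end at any position in the graph.
    """
    n = len(query)
    # State: (vertex, label bases consumed, query bases consumed).
    dist = {(start, 0, 0): 0}
    heap = [(0, start, 0, 0)]
    while heap:
        d, v, i, j = heapq.heappop(heap)
        if d > dist[(v, i, j)]:
            continue  # stale heap entry
        if j == n:
            return d  # the whole query has been aligned
        label = labels[v]
        moves = [(d + 1, v, i, j + 1)]  # insertion: consume a query base
        if i < len(label):
            cost = 0 if label[i] == query[j] else 1
            moves.append((d + cost, v, i + 1, j + 1))  # match or mismatch
            moves.append((d + 1, v, i + 1, j))  # deletion: skip a graph base
        else:
            # End of the label: step to each successor at no cost.
            moves.extend((d, w, 0, j) for w in edges.get(v, ()))
        for nd, nv, ni, nj in moves:
            if nd < dist.get((nv, ni, nj), float("inf")):
                dist[(nv, ni, nj)] = nd
                heapq.heappush(heap, (nd, nv, ni, nj))
    raise AssertionError("unreachable: insertions can always consume the query")
```

For example, with labels `{0: "AC", 1: "G", 2: "T", 3: "AA"}` and edges `{0: [1, 2], 1: [3], 2: [3]}`, the query `"ACTAA"` follows the walk 0, 2, 3 exactly, while `"ACCAA"` needs one substitution on either branch.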

Methodology

Gwfa builds on the wavefront alignment algorithm (WFA), which was previously applied to aligning pairs of linear sequences. It defines a diagonal recurrence adapted to graph-structured data and updates graph wavefronts iteratively. Each wavefront is extended along its diagonals for as long as bases match; when an extension reaches the end of a vertex label, it continues into the neighboring vertices.
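The wavefront idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Gwfa implementation: the function name, the graph encoding (integer vertex ids, a `labels` dictionary and an `edges` dictionary), the start-at-first-base convention, and the boundary rule applied at the end of a label are all choices made here. It returns only the edit distance and does no traceback. For each (vertex, diagonal) pair it stores the furthest label offset reached, and it processes candidates in order of increasing edit distance.

```python
def wavefront_graph_edit_distance(labels, edges, start, query):
    """Edit distance from `query` to the best walk starting at `start`.

    Illustrative sketch: the diagonal of state (v, i, j) is k = j - i,
    where i is the offset in the label of v and j the offset in the query.
    """
    n = len(query)
    far = {}    # (vertex, diagonal) -> furthest label offset reached so far
    level = []  # furthest-reaching states found at the current distance

    def extend(v, i, j):
        """Slide along matches and hop to successors at label ends.
        Returns True once the whole query has been consumed."""
        stack = [(v, i, j)]
        while stack:
            v, i, j = stack.pop()
            if far.get((v, j - i), -1) >= i:
                continue  # a state at least as far on this diagonal exists
            label = labels[v]
            while i < len(label) and j < n and label[i] == query[j]:
                i += 1
                j += 1
            far[(v, j - i)] = i
            if j == n:
                return True
            level.append((v, i, j))
            if i == len(label):
                # End of the label: continue into successors at no cost.
                stack.extend((w, 0, j) for w in edges.get(v, ()))
        return False

    if extend(start, 0, 0):
        return 0
    d = 0
    while True:
        d += 1
        frontier, level = level, []
        for v, i, j in frontier:
            size = len(labels[v])
            candidates = [(v, i, j + 1)]  # insertion (j < n for frontier states)
            if i < size:
                candidates.append((v, i + 1, j + 1))  # mismatch
                candidates.append((v, i + 1, j))      # deletion
            elif size > 0 and j > 0:
                # Assumed boundary rule: at the end of a label, the state one
                # step back on this diagonal is also within distance d - 1,
                # so deleting the last label base from it costs at most d.
                candidates.append((v, i, j - 1))
            for cv, ci, cj in candidates:
                if extend(cv, ci, cj):
                    return d
```

When the query matches a walk closely, the outer loop runs for only a few values of `d`, and most of the work is the match-sliding loop inside `extend`. This is the intuition behind a runtime that depends on the edit distance rather than on the full product of query length and graph size.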

On top of the exact algorithm, the authors propose a pruning heuristic that discards unpromising wavefronts during alignment. The heuristic reduces runtime further (by about 10-fold on large graphs, according to the abstract), at the potential cost of missing the optimal alignment.
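As a rough illustration of what such pruning can look like, the sketch below drops wavefront states that lag too far behind the most advanced one in the query. The lag-based rule, the function name, and the `max_lag` parameter are assumptions chosen for illustration; the summary above does not specify the criterion the paper uses.

```python
def prune_frontier(frontier, max_lag):
    """Keep states whose query offset is within `max_lag` of the best one.

    `frontier` is a list of (vertex, label_offset, query_offset) tuples.
    This lag-based rule is a hypothetical heuristic: it can discard the
    state that leads to the optimal alignment, so results become inexact.
    """
    if not frontier:
        return []
    best_j = max(j for _, _, j in frontier)
    return [(v, i, j) for v, i, j in frontier if best_j - j <= max_lag]
```

A smaller `max_lag` prunes more aggressively and runs faster, while a larger value keeps more candidates and stays closer to the exact result.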

Results

On datasets derived from biologically significant genomic regions, Gwfa ran substantially faster than existing exact sequence-to-graph alignment algorithms, with speedups of up to four orders of magnitude. The gains were most pronounced where the query closely matched a path in the graph, because Gwfa could then skip large portions of the graph that other methods must traverse.

Additionally, Gwfa's memory footprint was consistently lower than that of comparator algorithms, underscoring its efficiency in resource use alongside computation.

Implications and Future Work

Gwfa is a significant step towards practical sequence-to-graph alignment, which matters increasingly as the field moves from a single reference genome to pan-genomic representations. It holds promise for applications such as variant calling and read mapping in pangenomics.

Future work involves further refining pruning heuristics to balance the trade-off between speed and alignment optimality. Additionally, adapting Gwfa for broader alignment scenarios—including those allowing gaps at various positions in the graph—could extend its applicability in genomic analysis.

This research opens avenues for optimizing alignment processes in more complex, real-world genomic datasets, fostering advancements in both computational genomics and the broader bioinformatics infrastructure.
