KMC 2: Fast and resource-frugal $k$-mer counting (1407.1507v1)

Published 6 Jul 2014 in cs.DS, cs.CE, and q-bio.GN

Abstract: Motivation: Building the histogram of occurrences of every $k$-symbol long substring of nucleotide data is a standard step in many bioinformatics applications, known under the name of $k$-mer counting. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. The tremendous amounts of NGS data require fast algorithms for $k$-mer counting, preferably using moderate amounts of memory. Results: We present a novel method for $k$-mer counting, on large datasets at least twice faster than the strongest competitors (Jellyfish 2, KMC 1), using about 12 GB (or less) of RAM memory. Our disk-based method bears some resemblance to MSPKmerCounter, yet replacing the original minimizers with signatures (a carefully selected subset of all minimizers) and using $(k, x)$-mers allows to significantly reduce the I/O, and a highly parallel overall architecture allows to achieve unprecedented processing speeds. For example, KMC 2 allows to count the 28-mers of a human reads collection with 44-fold coverage (106 GB of compressed size) in about 20 minutes, on a 6-core Intel i7 PC with an SSD. Availability: KMC 2 is freely available at http://sun.aei.polsl.pl/kmc. Contact: [email protected]

Summary

The paper presents KMC 2, a disk-based tool that, on large datasets, counts $k$-mers at least twice as fast as its strongest competitors (Jellyfish 2, KMC 1) while using about 12 GB of RAM or less.
It replaces the original minimizers with signatures and works with $(k, x)$-mers, which together significantly reduce disk I/O.
Its highly parallel architecture processes large NGS datasets quickly: counting the 28-mers of a human read collection with 44-fold coverage takes about 20 minutes on a 6-core Intel i7 PC with an SSD.

KMC 2: Fast and Resource-Frugal $k$-mer Counting

The paper "KMC 2: Fast and Resource-frugal $k$-mer Counting" addresses the need for efficient $k$-mer counting in bioinformatics, particularly when processing the large datasets generated by next-generation sequencing (NGS) technologies. The procedure underlies many bioinformatics applications, including de Bruijn graph genome assembly, multiple sequence alignment, and repeat detection. The authors present KMC 2, a tool that improves on its predecessor and on contemporary tools in both speed and memory use.
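
To make the task concrete, the following is a minimal sketch of naive, fully in-memory $k$-mer counting. It is not the method of KMC 2, only the baseline problem that disk-based tools are designed to scale up. The function name `count_kmers` is hypothetical, and the sketch ignores reverse complements and non-ACGT symbols.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-symbol substring across a collection of reads.

    Illustrative baseline only: all counts are held in one in-memory
    dictionary, which is exactly what becomes infeasible for large
    NGS datasets.
    """
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


if __name__ == "__main__":
    counts = count_kmers(["ACGTACGT"], 4)
    # The read contains ACGT, CGTA, GTAC, TACG, ACGT.
    print(counts["ACGT"])  # prints 2
```

Memory use here grows with the number of distinct $k$-mers, which is why tools for large read collections move intermediate data to disk.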

Key Contributions and Methodological Advances

KMC 2 stands out for its disk-based method, which counts $k$-mers at high speed while keeping memory use moderate. This matters for datasets of the size typically produced by NGS. The authors report that, on large datasets, KMC 2 is at least twice as fast as the strongest competitors (Jellyfish 2 and KMC 1) while using about 12 GB of RAM or less. The paper introduces several methodological innovations behind these gains:

Signature-Based Approach: Instead of the original minimizers used by MSPKmerCounter, KMC 2 uses signatures, a carefully selected subset of all minimizers. This reduces disk usage and gives a better distribution of data across the parts processed separately.
$(k, x)$-mers: Rather than sorting individual $k$-mers, KMC 2 sorts $(k, x)$-mers, which reduces the amount of data to sort and the memory needed during processing.
Parallel Architecture: A highly parallel architecture exploits contemporary multi-core systems. For example, KMC 2 counts the 28-mers of a human read collection with 44-fold coverage in about 20 minutes on a 6-core Intel i7 PC equipped with an SSD.
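
The signature idea can be illustrated with a small sketch. It assigns each $k$-mer to a bin by the smallest allowed $m$-mer it contains, groups consecutive $k$-mers with the same signature into longer substrings, and then counts each bin independently. Several things here are assumptions for illustration rather than confirmed details of KMC 2: the concrete filter in `is_allowed` (no `AAA` or `ACA` prefix, no `AA` except at the start), the `"N" * m` fallback bin, all function names, and the omission of reverse complements. Bins are kept in memory, whereas a disk-based tool would write them to files.

```python
from collections import Counter, defaultdict


def is_allowed(mmer):
    """Assumed signature filter (illustrative): reject m-mers that start
    with AAA or ACA, or contain AA anywhere except at the beginning."""
    if mmer.startswith(("AAA", "ACA")):
        return False
    return "AA" not in mmer[1:]


def signature(kmer, m):
    """Smallest allowed m-mer inside the k-mer (reverse complements ignored)."""
    candidates = [kmer[i:i + m] for i in range(len(kmer) - m + 1)]
    allowed = [c for c in candidates if is_allowed(c)]
    # Hypothetical fallback bin for k-mers with no allowed m-mer.
    return min(allowed) if allowed else "N" * m


def super_kmers(read, k, m):
    """Split a read (len >= k) into maximal runs of consecutive k-mers
    that share one signature; each run is stored as a single substring."""
    runs = []
    start = 0
    current = signature(read[:k], m)
    for i in range(1, len(read) - k + 1):
        sig = signature(read[i:i + k], m)
        if sig != current:
            # k-mers start..i-1 span read[start : i - 1 + k].
            runs.append((current, read[start:i - 1 + k]))
            start, current = i, sig
    runs.append((current, read[start:]))
    return runs


def count_with_bins(reads, k, m):
    """Distribute super-k-mers into bins by signature, then count per bin."""
    bins = defaultdict(list)
    for read in reads:
        if len(read) < k:
            continue
        for sig, chunk in super_kmers(read, k, m):
            bins[sig].append(chunk)

    counts = Counter()
    # Each bin is independent, so bins could be processed in parallel.
    for chunks in bins.values():
        for chunk in chunks:
            for i in range(len(chunk) - k + 1):
                counts[chunk[i:i + k]] += 1
    return counts, bins


if __name__ == "__main__":
    counts, bins = count_with_bins(["ACGTACGTAC", "TTACGTAAGG"], k=4, m=2)
    print(sorted(bins))
    print(counts.most_common(3))
```

Because every occurrence of a given $k$-mer has the same signature, all of its occurrences land in the same bin, so per-bin counts are final and bins never need to be merged. Storing runs of overlapping $k$-mers as one substring is what reduces the volume of intermediate data.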

Results and Implications

The paper reports benchmarks supporting these claims. On a 6-core Intel i7 PC with an SSD, KMC 2 counts the 28-mers of a human read collection with 44-fold coverage (106 GB compressed) in about 20 minutes, and on large datasets it runs at least twice as fast as Jellyfish 2 and KMC 1 while using about 12 GB of RAM or less. Performance was benchmarked across several datasets, including large-scale human genome data, showing that the tool can handle extensive bioinformatics workloads.

The successful implementation of KMC 2 could have several implications:

Enhanced Data Handling: Scientists working with high-throughput sequencing data stand to benefit from KMC 2, since it processes large datasets quickly with limited computational resources.
Facilitation of Downstream Analyses: By speeding up preprocessing through efficient $k$-mer counting, KMC 2 can accelerate downstream bioinformatics workflows.
Basis for Further Research: The methods introduced in KMC 2 could inform future bioinformatics tools for large-scale data processing.

Speculation on Future Developments

Looking forward, KMC 2's architecture may inspire further research into parallelized and memory-efficient algorithms for other computationally intensive tasks in bioinformatics. The signature-based and $(k, x)$-mer strategies might be explored further to refine data processing frameworks. Additionally, as hardware advances, tools like KMC 2 could be adapted to emerging technologies such as distributed computing environments, enabling even larger datasets to be processed more quickly.

In conclusion, KMC 2 represents a significant step forward in the efficient processing of genomic data. While addressing present challenges, it also raises new questions and opportunities for algorithm design in bioinformatics and computational biology.
