
Finding low-complexity DNA sequences with longdust (2509.07357v1)

Published 9 Sep 2025 in q-bio.GN

Abstract: Motivation: Low-complexity (LC) DNA sequences are compositionally repetitive sequences that are often associated with increased variant density and variant calling artifacts. While algorithms for identifying LC sequences exist, they either lack rigorous mathematical foundation or are inefficient with long context windows. Results: Longdust is a new algorithm that efficiently identifies long LC sequences including centromeric satellite and tandem repeats with moderately long motifs. It defines string complexity by statistically modeling the k-mer count distribution with the parameters: the k-mer length, the context window size and a threshold on complexity. Longdust exhibits high performance on real data and high consistency with existing methods. Availability and implementation: https://github.com/lh3/longdust

Summary

  • The paper presents longdust, a novel algorithm that models k-mer count distributions to accurately identify low-complexity DNA regions.
  • It employs a two-pass approach with precomputed scoring functions to achieve efficient, linear-time detection, outperforming heuristic methods.
  • Empirical results on human and gorilla genomes demonstrate longdust’s high concordance with existing tools and superior scalability in genomic analyses.

Efficient Identification of Low-Complexity DNA Sequences with longdust

Introduction

The paper presents longdust, a novel algorithm for the identification of low-complexity (LC) DNA sequences, which are compositionally repetitive regions often associated with increased variant density and variant-calling artifacts. Existing methods for LC detection, such as TRF, TANTAN, ULTRA, and pytrf, rely on heuristic approaches and are limited in their ability to rigorously define string complexity, particularly for long context windows or non-tandem repetitive structures. SDUST, while mathematically grounded, suffers from prohibitive computational complexity ($O(w^3L)$) for large window sizes. The longdust algorithm addresses these limitations by introducing a statistical model for $k$-mer count distributions and an efficient scoring function, enabling practical detection of LC regions with long motifs.

Theoretical Framework

Statistical Modeling of $k$-mer Complexity

Longdust models the $k$-mer count distribution in DNA strings under the assumption of equal base frequencies, treating $k$-mer occurrences as Poisson-distributed with parameter $\lambda = \ell(x)/4^k$, where $\ell(x)$ is the number of $k$-mers in the string. The composite probability of a string is given by:

$$P(\vec{c}_x) = \prod_{t \in \Sigma^k} p(c_x(t) \mid \lambda)$$

where $p(n \mid \lambda)$ is the Poisson PMF. The log-probability is:

$$\log P(\vec{c}_x) = 4^k\lambda(\log\lambda - 1) - \sum_t \log c_x(t)!$$
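This expression follows from summing the Poisson log-PMF over all $4^k$ possible $k$-mers and using the identity $\sum_t c_x(t) = \ell(x) = 4^k\lambda$:

```latex
\log P(\vec{c}_x)
  = \sum_{t \in \Sigma^k} \bigl[ c_x(t)\log\lambda - \lambda - \log c_x(t)! \bigr]
  = 4^k\lambda\log\lambda - 4^k\lambda - \sum_t \log c_x(t)!
  = 4^k\lambda(\log\lambda - 1) - \sum_t \log c_x(t)!
```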

To normalize for string length and facilitate comparison across intervals, the authors introduce a scaled complexity score:

$$Q(\vec{c}_x) = H(\lambda) - \log P(\vec{c}_x) = \sum_t \log c_x(t)! - f\!\left(\frac{\ell(x)}{4^k}\right)$$

where $H(\lambda)$ is the negative entropy and $f(\lambda)$ is a precomputed correction term.

Scoring and Detection of LC Intervals

The final scoring function for LC detection is:

$$S(\vec{c}_x) = Q(\vec{c}_x) - T \cdot \ell(x)$$

with the threshold $T$ controlling sensitivity. An interval is considered LC if $S(\vec{c}_x) > 0$. The algorithm distinguishes between "perfect" and "good" LC intervals, with the latter defined by the absence of higher-scoring prefixes or suffixes, enabling efficient linear-time detection.
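As a concrete illustration (not the paper's implementation), the data-dependent term $\sum_t \log c_x(t)!$ can be computed with `lgamma`, since $\log n! = \mathrm{lgamma}(n+1)$. Because $f(\ell(x)/4^k)$ depends only on the interval length, this sketch simply folds it into the threshold $T$ for a fixed-length window; the function names and the threshold value below are hypothetical choices for demonstration:

```python
import math
from collections import Counter

def kmer_counts(s, k):
    """Count occurrences of each k-mer in string s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def lc_score(s, k, T):
    """Simplified LC score: sum_t log c(t)! minus a length-proportional
    penalty. The length-only correction f(l/4^k) is folded into T here
    (an assumption for this sketch, not the paper's exact formulation)."""
    l = len(s) - k + 1                      # number of k-mers, l(x)
    data_term = sum(math.lgamma(c + 1)      # lgamma(c+1) = log(c!)
                    for c in kmer_counts(s, k).values())
    return data_term - T * l

# A highly repetitive window scores positive; a diverse one scores negative.
print(lc_score("AT" * 10, 3, 0.5) > 0)                # True
print(lc_score("ACGTTGCAATCGGATCCTGA", 3, 0.5) > 0)   # False
```

Repeated $k$-mers inflate $\log c_x(t)!$ super-linearly, which is why the factorial term separates repetitive from complex windows; a production implementation would precompute the log-factorial table and the $f(\lambda)$ correction as the paper describes.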

Algorithmic Implementation

Longdust employs a two-pass approach for each genomic position:

  1. Backward Pass: Scans from the current position $j$ backward up to the window size $w$, collecting candidate start positions for LC intervals based on suffix scores.
  2. Forward Pass: For each candidate start, scans forward to $j$ to identify the maximal scoring interval.

The algorithm leverages precomputed $f(\lambda)$ values and incremental updates to $U(i, j)$ for efficient computation. Heuristics such as skipping unique $k$-mers and X-drop are used to further optimize runtime. The overall complexity is $O(wL)$, a substantial improvement over SDUST's $O(w^3L)$.

Pseudocode Overview

def find_lc_intervals(sequence, k, w, T):
    # For every end position j, test each candidate start i with a
    # forward re-scan; report (i, j) only when j is the optimal end
    # point for that start (a "good" LC interval).
    for j in range(len(sequence)):
        # backward pass: candidate starts within window w, by suffix score
        candidates = backward_pass(sequence, k, w, T, j)
        for i in candidates:
            # forward pass: best-scoring end point for start i
            j_prime = forward_pass(sequence, k, T, i, j)
            if j_prime == j:
                mark_lc_interval(i, j)

Empirical Results

Human Genome (T2T-CHM13)

Longdust identified 277.1Mb of LC regions, with 224.3Mb overlapping centromeric satellite annotations. Of the remaining 52.7Mb, 34.1Mb overlapped with TRF and 15.4Mb with SDUST, leaving only 3.2Mb unique to longdust. This demonstrates high concordance with established methods for tandem repeats and centromeric regions.

Longdust recovers nearly all tandem repeats with $\ge 4$ copies detected by TRF (97.9%), but misses low-copy repeats, consistent with its theoretical detection threshold. Resource usage is favorable, with longdust completing the T2T-CHM13 analysis in 1h3m and a minimal memory footprint (0.47GB), outperforming TRF and ULTRA in runtime and scalability.

Gorilla Genome

Longdust detected 656.8Mb of LC regions in the gorilla genome, 379.7Mb more than in human, accounting for most of the genome size difference. 99.7% of gorilla-specific regions (lacking 51-mer matches to human) were marked as LC, predominantly near telomeres. TRF failed to complete analysis within 30 hours, underscoring longdust's scalability.

Comparison to Existing Methods

SDUST's complexity score grows linearly with interval length, biasing detection toward longer regions and limiting window size to 64bp. Longdust's logarithmic scaling and efficient interval detection enable analysis of long motifs and large windows. Shannon entropy-based scoring was tested but found to be less efficient in practice. Longdust rarely identifies LC regions not found by TRF or SDUST, but provides more comprehensive recovery of tandem repeats with higher copy numbers.

Limitations and Future Directions

Longdust's primary limitation is the fixed window size, which restricts detection of extremely long repeat units (e.g., 12kb satellites in Woodhouse's scrub jays). Increasing the window size incurs significant runtime penalties due to the $O(wL)$ complexity and reduced effectiveness of the speedup heuristics. Future work may focus on exact $O(wL)$ algorithms, alternative formulations that eliminate the window-size dependence, and improved handling of dependencies between genomic positions.

Conclusion

Longdust introduces a statistically principled and computationally efficient approach for the identification of low-complexity DNA sequences, outperforming existing methods in scalability and coverage of long motifs. Its implementation as a lightweight C library and command-line tool facilitates integration into genomic analysis pipelines. The algorithm's theoretical foundation and empirical performance suggest its utility for large-scale genome annotation and comparative genomics, with potential for further optimization and extension to broader classes of repetitive sequences.
