DNA sequence dataset cleaning remains unresolved

Develop robust, domain-appropriate methodologies for cleaning DNA sequence datasets that address the constraints of the four-symbol nucleotide alphabet and sequencing noise, thereby improving the reliability and accuracy of downstream computational analyses.

Background

The authors note that DNA's four-symbol alphabet makes data preprocessing uniquely challenging compared with natural language, where richer symbol sets can aid error detection and correction. They argue that dataset quality and preprocessing significantly limit the performance of methylation prediction models.

They propose a linguistic mapping strategy to aid rule-based information extraction and manual cleaning, but explicitly acknowledge that DNA data cleaning remains a pressing, unresolved issue in bioinformatics.
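
To make the idea of rule-based cleaning concrete, the following is a minimal sketch of a generic filter and is not the authors' linguistic mapping strategy. The function name clean_sequences, its parameters, and the decision to treat any symbol outside A, C, G, T as noise are assumptions made only for illustration.

```python
from typing import Iterable, List

# Assumption: only the four canonical bases are valid; N and other
# ambiguity codes are counted as noise in this sketch.
VALID_BASES = frozenset("ACGT")


def clean_sequences(
    seqs: Iterable[str],
    max_ambiguous_frac: float = 0.0,
    min_length: int = 1,
) -> List[str]:
    """Normalize case, drop short or overly ambiguous sequences,
    and remove exact duplicates while preserving input order."""
    seen = set()
    cleaned = []
    for raw in seqs:
        seq = raw.strip().upper()
        # Skip empty sequences and those below the length threshold.
        if not seq or len(seq) < min_length:
            continue
        # Fraction of symbols outside the four-letter alphabet.
        ambiguous = sum(1 for base in seq if base not in VALID_BASES)
        if ambiguous / len(seq) > max_ambiguous_frac:
            continue
        # Drop exact duplicates (after normalization).
        if seq in seen:
            continue
        seen.add(seq)
        cleaned.append(seq)
    return cleaned


if __name__ == "__main__":
    reads = ["acgt", "ACGT", "ACNT", " GGCC\n", ""]
    print(clean_sequences(reads))
```

A filter of this kind only removes symbols outside the alphabet and exact duplicates. It cannot flag a read in which one valid base was substituted for another, because the result is still a well-formed string over the same four symbols, which illustrates the difficulty described above.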

References

Unlike natural languages, DNA consists of only four symbols, which poses unique challenges in data cleaning—a pressing issue in bioinformatics that remains unresolved.

DNA and Human Language: Epigenetic Memory and Redundancy in Linear Sequence (2503.23494 - Yang et al., 30 Mar 2025) in Introduction