Challenges in structural variant calling in low-complexity regions (2509.23057v1)

Published 27 Sep 2025 in q-bio.GN

Abstract: Background: Structural variants (SVs) are genomic differences $\ge$50 bp in length. They remain challenging to detect even with long sequence reads, and the sources of these difficulties are not well quantified. Results: We identified 35.4 Mb of low-complexity regions (LCRs) in GRCh38. Although these regions cover only 1.2% of the genome, they contain 69.1% of confident SVs in sample HG002. Across long-read SV callers, 77.3-91.3% of erroneous SV calls occur within LCRs, with error rates increasing with LCR length. Conclusion: SVs are enriched and difficult to call in LCRs. Special care needs to be taken for calling and analyzing these variants.

Summary

The paper quantifies that although low-complexity regions make up only 1.2% of GRCh38, they harbor over 69% of confident SVs, underscoring significant detection issues.
It benchmarks 11 long-read SV callers against the older and newer GIAB HG002 benchmarks and finds that 77.3–91.3% of erroneous SV calls fall within LCRs, with error rates rising as LCRs get longer.
The study recommends advanced haplotype-aware multi-sequence realignment and pangenome-based approaches to improve SV calling accuracy in challenging genomic regions.

Structural Variant Calling in Low-Complexity Regions: Quantitative Challenges and Implications

Introduction

This paper provides a systematic quantification of the challenges associated with structural variant (SV) calling in low-complexity regions (LCRs) of the human genome. SVs, defined as genomic differences of at least 50 bp, are known to have significant functional impacts. Despite advances in long-read sequencing technologies and assembly-based approaches, accurate detection of SVs in LCRs remains problematic. The authors present a comprehensive analysis of LCRs in GRCh38 and T2T-CHM13, their overlap with SVs, and the error profiles of state-of-the-art SV callers, highlighting the disproportionate enrichment of SVs and errors within LCRs.

Identification and Annotation of Low-Complexity Regions

The paper employs longdust to identify LCRs in GRCh38, filtering out centromeric repeats using dna-brnn, resulting in 34.4 Mb of LCRs (≥50 bp). To account for polymorphic LCRs absent from GRCh38, the authors analyze 462 assemblies from the Human Pangenome Reference Consortium (HPRC), annotating variant bubbles in the minigraph graph. LCRs are prioritized below segmental duplications (SegDup) to avoid misclassification. The final merged annotation covers 35.4 Mb in GRCh38, with 16.2% intersecting SegDup regions. In T2T-CHM13, LCRs span 79.6 Mb, primarily due to additional centromeric satellites.
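The coverage totals quoted above (for example, 35.4 Mb in GRCh38) are sums over a merged interval annotation. A minimal sketch of that bookkeeping, assuming BED-style half-open intervals held in memory: the function names and toy coordinates are hypothetical, and applying the ≥50 bp length filter after merging is an assumption, not a documented step of the authors' pipeline.

```python
from typing import Iterable, List, Tuple

# (chrom, start, end) with BED-style 0-based, half-open coordinates
Interval = Tuple[str, int, int]


def merge_intervals(records: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals on each chromosome."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(records):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


def total_bp(records: Iterable[Interval], min_len: int = 50) -> int:
    """Sum the lengths of merged intervals that are at least min_len long."""
    return sum(e - s for _, s, e in merge_intervals(records) if e - s >= min_len)


# Toy annotation (invented coordinates, for illustration only)
toy = [("chr1", 100, 180), ("chr1", 150, 260), ("chr1", 400, 420), ("chr2", 10, 90)]
print(merge_intervals(toy))
print(total_bp(toy))
```

On real data the same merge-then-sum step is usually delegated to an interval toolkit; the sketch only makes the arithmetic explicit.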

Benchmarking and SV Calling Methodology

The analysis leverages the latest GIAB SV benchmark (HG002-Q100 v1.1), which contains 28,188 SVs in 2.76 Gb of confident regions, contrasting with the older HG002-SV v0.6 benchmark (9,705 SVs in 2.66 Gb). The increased SV count in the newer benchmark is attributed to improved allele resolution and inclusion of LCRs. SVs are called using 11 long-read SV callers on PacBio HiFi data aligned to GRCh38, with accuracy evaluated against both benchmarks using truvari. Callers include cuteSV, DeBreak, Delly, longcallD, pbsv, Sawfish, Severus, Sniffles2, SVDSS, SVIM, and SVision-pro, with genotyping performed via kanpig for SVDSS calls.

Quantitative Enrichment and Error Profiles in LCRs

The results demonstrate that although LCRs constitute only 1.2% of GRCh38, they contain 69.1% of confident SVs in HG002. Across SV callers, 59.4–67.7% of SV calls overlap LCRs. Critically, 77.3–91.3% of erroneous SV calls are localized to LCRs, with error rates increasing as LCR length increases. SV callers that perform haplotype-resolved calling (e.g., longcallD, SVDSS) detect more SVs in LCRs and SegDup regions, but even these advanced methods struggle with long LCRs. The majority of errors are attributed to inconsistent read alignment in LCRs, particularly for events exceeding 2 kb, where some callers miss up to half of SVs despite sufficient read length.
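To put the headline numbers in perspective, the enrichment can be worked out directly from the two reported fractions. The sketch below is a back-of-the-envelope calculation, not a figure from the paper. It treats the 1.2% genome fraction and the 69.1% SV fraction as if they shared one denominator, which is an approximation because the SV fraction is measured within benchmark confident regions.

```python
def fold_enrichment(frac_svs_in_region: float, frac_genome_in_region: float) -> float:
    """SV share divided by genome share for a region class."""
    return frac_svs_in_region / frac_genome_in_region


def density_ratio(frac_svs_in_region: float, frac_genome_in_region: float) -> float:
    """SV density inside the region class relative to the density outside it."""
    inside = frac_svs_in_region / frac_genome_in_region
    outside = (1.0 - frac_svs_in_region) / (1.0 - frac_genome_in_region)
    return inside / outside


# Reported values: LCRs cover 1.2% of GRCh38 and hold 69.1% of confident SVs
print(round(fold_enrichment(0.691, 0.012), 1))  # 57.6
print(round(density_ratio(0.691, 0.012)))       # 184
```

Under these assumptions, LCRs carry roughly 58 times their proportional share of SVs, and the per-base SV density is on the order of 180 times higher inside LCRs than outside.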

Algorithmic Considerations and Practical Recommendations

The analysis reveals that simple SV calling algorithms lacking realignment or local reassembly are inadequate for LCRs. Haplotype-aware multi-sequence realignment, as implemented in longcallD, substantially reduces error rates. The authors recommend stratifying SVs by LCR for downstream analyses, given their distinct error profiles and biological relevance. Notably, LCRs may overlap coding exons and regulatory elements, precluding blanket filtering of LCR-overlapping SVs.
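The recommended stratification needs a rule for deciding when an SV counts as "in an LCR". The sketch below uses a padded-overlap rule with a 5 bp pad and a 70% threshold, matching the cutoffs listed in the Knowledge Gaps section of this page. How those cutoffs are applied here (padding each LCR on both sides, measuring the covered fraction of the SV's reference span, giving insertions a 1 bp footprint) is an assumption made for illustration, and the function names are hypothetical.

```python
from typing import List, Tuple


def covered_fraction(sv_start: int, sv_end: int,
                     lcrs: List[Tuple[int, int]], pad: int = 5) -> float:
    """Fraction of [sv_start, sv_end) covered by padded LCRs on one chromosome."""
    sv_end = max(sv_end, sv_start + 1)  # assumption: insertions get a 1 bp footprint
    padded = sorted((max(0, s - pad), e + pad) for s, e in lcrs)
    covered = 0
    cursor = sv_start  # leftmost SV position not yet counted, avoids double counting
    for s, e in padded:
        lo, hi = max(s, cursor), min(e, sv_end)
        if hi > lo:
            covered += hi - lo
            cursor = hi
    return covered / (sv_end - sv_start)


def is_lcr_sv(sv_start: int, sv_end: int, lcrs: List[Tuple[int, int]],
              pad: int = 5, min_frac: float = 0.7) -> bool:
    """Label an SV as LCR-associated when enough of its span lies in padded LCRs."""
    return covered_fraction(sv_start, sv_end, lcrs, pad) >= min_frac


lcrs = [(1000, 1200)]                 # toy LCR, invented coordinates
print(is_lcr_sv(1000, 1150, lcrs))    # deletion fully inside the LCR
print(is_lcr_sv(1100, 1300, lcrs))    # deletion only partly inside
print(is_lcr_sv(1203, 1203, lcrs))    # insertion just past the LCR end, within the pad
```

Keeping the label as a separate annotation, rather than a filter, is in line with the recommendation not to discard LCR-overlapping SVs wholesale.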

For high-coverage long-read datasets, haplotype-resolved assembly followed by assembly-to-reference alignment offers superior accuracy, as evidenced by the derivation of the HG002-Q100 truth set. However, merging SV calls across samples remains challenging, especially in LCRs. Pangenome-based approaches utilizing multi-sequence alignment across samples (e.g., minigraph-based methods) are preferable for consistent SV representation, though highly variable LCRs still pose difficulties.

Implications and Future Directions

The findings underscore the necessity for specialized algorithmic strategies in SV calling within LCRs. The disproportionate enrichment of SVs and errors in these regions has direct implications for population-scale variant discovery, clinical genomics, and functional genomics. Future developments should focus on improving realignment and assembly algorithms, integrating pangenome graphs, and refining benchmarking standards to better capture the complexity of LCRs. The expansion of reference genomes (e.g., T2T-CHM13) and increased sample diversity will further elucidate the landscape of LCRs and their impact on SV calling.

Conclusion

This paper provides a rigorous quantification of the challenges in SV calling within low-complexity regions, demonstrating that LCRs are both highly enriched for SVs and disproportionately responsible for calling errors. Advanced haplotype-aware algorithms and assembly-based approaches are essential for accurate SV detection in these regions. The results have significant implications for the design of SV calling pipelines, benchmarking standards, and the interpretation of SVs in functional and clinical genomics. Continued methodological innovation and comprehensive reference resources will be critical for addressing the persistent challenges posed by LCRs in SV analysis.

Explain it Like I'm 14

Overview

This paper looks at a tricky part of genetics: finding big DNA changes (called structural variants) in parts of the genome that have lots of repeats. These repeat-heavy areas are called low-complexity regions (LCRs). The main point is that most of these big DNA changes, and most of the mistakes tools make when looking for them, are found in LCRs.

What questions were the researchers asking?

The researchers wanted to answer simple but important questions:

Where in the genome do most structural variants (SVs) occur?
Why do computers make more mistakes when finding SVs in certain places?
How much do repeat-heavy regions (LCRs) affect both the number of SVs found and the error rates?
Which methods or tools work better in these hard regions?

How did they study it?

Think of the genome as a giant book of letters (A, T, C, G). Sometimes big chunks of this book are added, removed, or rearranged—these are structural variants (SVs). LCRs are like pages with the same word or pattern printed over and over (e.g., “ATATAT…”). It’s harder to tell exactly where changes happen in such repetitive pages.

Here’s what they did, in everyday terms:

They used a computer program (longdust) to find the “repeat-heavy pages” (LCRs) in the human reference genome (GRCh38) and in many other human genome assemblies from a big project called HPRC.
They combined LCRs found in the reference and in many real people’s genomes to build a detailed map of where LCRs commonly appear.
They tested 11 different SV-finding tools on high-quality, long DNA reads (PacBio HiFi reads), aligning the reads to the reference with minimap2, and then asked: which SV calls match a trusted set of answers (the GIAB benchmark for sample HG002)?
They measured two types of mistakes:
- False discovery rate (FDR): how many called SVs are incorrect (false positives).
- False negative rate (FNR): how many true SVs were missed.
They compared results in LCRs vs. other parts of the genome and checked how errors change when LCRs get longer.
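The two error measures in the list above are simple ratios of counts. A minimal sketch, with hypothetical function names and invented toy counts (these are not results from the paper):

```python
def false_discovery_rate(true_positive_calls: int, false_positives: int) -> float:
    """Share of reported SVs that are wrong: FP / (TP + FP)."""
    total_calls = true_positive_calls + false_positives
    return false_positives / total_calls if total_calls else 0.0


def false_negative_rate(true_positives_found: int, false_negatives: int) -> float:
    """Share of true SVs that were missed: FN / (TP + FN)."""
    total_truth = true_positives_found + false_negatives
    return false_negatives / total_truth if total_truth else 0.0


# Toy counts: 90 correct calls, 10 wrong calls, 20 true SVs missed out of 100
print(false_discovery_rate(90, 10))  # 0.1
print(false_negative_rate(80, 20))   # 0.2
```

A low FDR means few false alarms, and a low FNR means few true SVs were overlooked. A tool can do well on one and poorly on the other, which is why both are reported.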

Technical terms explained simply:

Structural variant (SV): a big DNA change (50 or more letters long), like a large insertion (new text added) or deletion (text removed).
Low-complexity region (LCR): a stretch of DNA with lots of repeated patterns; this confuses matching and alignment.
Alignment: lining up DNA reads to the reference “book” to see differences.
Haplotype: the version of DNA from one parent; every person has two haplotypes for most regions.
Benchmark (truth set): a trusted list of SVs used to check how well tools perform.

What did they find?

Here are the key findings, explained simply:

LCRs are small but packed: They cover only about 1.2% of the genome, but hold about 69% of the confident SVs in the tested sample (HG002). That means most big changes are hiding in the hardest spots.
Most errors happen in LCRs: Depending on the tool, 77–91% of SV mistakes occurred inside LCRs. In non-repetitive regions, tools were much more accurate.
Longer LCRs cause more trouble: As the repeat regions get longer, errors increase. Some tools missed about half of the SVs in LCRs that were 2,000 letters or longer.
Smarter alignment helps: A tool that realigns reads while paying attention to haplotypes (longcallD) had the lowest error rates in LCRs. It effectively “looks at the puzzle pieces together” and aligns them more consistently.
Older benchmarks missed many LCR SVs: An older trusted dataset (HG002-SV v0.6) included far fewer SVs—largely because it excluded many LCRs and combined similar changes from the two haplotypes into one, reducing counts. The newer benchmark (HG002-Q100 v1.1) is more detailed and includes many LCR regions.

Why is this important?

Many important genetic differences live in LCRs, which can affect genes and how they work. If we ignore or filter them out just because they’re hard, we might miss meaningful biology.
Researchers and clinicians need to treat LCR SVs carefully. These regions demand better algorithms—especially those that realign reads or locally reassemble DNA—to avoid mistakes.
Building better reference maps and using pangenome approaches (comparing many genomes together) can help keep SV calls consistent across people, even in repeat-heavy places.
The paper shows that with high-quality data and smarter methods, calling in LCRs becomes much more accurate, although long LCRs remain difficult even for the best tools.

Takeaway in plain words

Most big DNA changes are hiding in the most confusing parts of the genome—the parts with lots of repeats. Because of that, many tools make mistakes there. Using methods that consider both parental copies (haplotypes) and re-align reads carefully can greatly improve accuracy. Scientists should separate and analyze LCR-related SVs differently and use advanced strategies when possible.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The following list highlights concrete gaps, uncertainties, and unexplored aspects that remain after this paper and can guide future research:

Generalizability beyond a single individual: results are based only on HG002; performance and LCR-related error profiles across diverse ancestries, admixed samples, and trios remain unquantified.
Sequencing technology scope: evaluation used PacBio HiFi reads only; the impact of ONT (including duplex/Q20+), CLR, hybrid datasets, and varying read length/error profiles on LCR SV calling is unknown.
Coverage sensitivity: accuracy in LCRs at lower or uneven coverage, and the coverage thresholds needed for reliable haplotype-aware calling and realignment, were not investigated.
Aligner dependence: only minimap2 was used; how repeat-aware aligners (e.g., winnowmap, lra), parameter tuning, or graph/pangenome aligners change LCR SV accuracy and alignment consistency remains unexplored.
Reference choice effects: SV calling was evaluated against GRCh38; whether mapping and evaluation against T2T-CHM13 (especially non-centromeric LCRs) improves performance is untested.
LCR definition sensitivity: the chosen thresholds (≥50 bp LCR length, +5 bp padding, ≥70% overlap to classify SVs as LCR, SegDup priority over LCR) were not subjected to sensitivity analyses to quantify how conclusions change with different cutoffs.
Completeness and precision of LCR annotation: the false-positive/false-negative rates of longdust-derived LCRs (including missed short/complex repeats and misclassified SegDup-overlapping repeats) were not benchmarked.
Satellite repeats handling: alpha and HSAT2/3 satellites were excluded; the effect of including other satellite classes on SV calling and evaluation (both in GRCh38 and CHM13) is not assessed.
Truth set uncertainty in LCRs: reliance on HG002-Q100 assemblies as ground truth may include residual errors/representation choices in LCRs; quantifying truth inaccuracies and their effect on measured FDR/FNR is missing.
Evaluation strictness: truvari “refine” settings can count non-exact allele-length matches as correct; the prevalence and magnitude of allele-length mismatches and breakpoint offsets among “true positives” in LCRs are not reported.
Error typology: a systematic breakdown of error modes in LCRs (length misestimation, breakpoint shift, allele sequence errors, genotype/phasing errors, spurious calls) is lacking.
Variant-type stratification: accuracy by SV class (insertions vs deletions vs duplications vs inversions vs complex events) in LCRs is not analyzed; duplications were converted to insertions for evaluation without quantifying the impact of this choice.
Repeat composition factors: only LCR length was considered; how repeat unit size, homopolymer content, motif degeneracy, and divergence affect callability and error rates is unknown.
SegDup–LCR interplay: prioritizing SegDup over LCR during annotation may undercount LCR-associated SVs within long polymorphic duplications; best practices to disentangle and evaluate such regions are not defined.
Caller parameterization: most tools were run with defaults; whether targeted parameter tuning (e.g., tandem repeat hints for Sniffles2, realignment windows, clustering thresholds) improves LCR performance is untested.
Excluded callers and formats: Severus (fewer LCR calls) and SVision-pro (no allele sequences) were excluded; converting outputs or adjusting settings to enable fair comparison could change conclusions.
Haplotype-aware algorithm requirements: beyond longcallD and SVDSS, the minimal algorithmic features (e.g., multi-read realignment vs local reassembly vs haplotype phasing) necessary to achieve reliable LCR SV calls are not delineated.
Multi-sample calling/merging: the reported difficulty of cross-sample merging in LCRs is not quantified; direct comparisons of traditional merging vs graph/pangenome-based joint calling on cohorts are missing.
Precision–recall trade-offs: only FDR and FNR were reported; full PR curves, thresholded performance, and calibration analyses (e.g., quality scores vs correctness in LCRs) are absent.
Breakpoint/allele-sequence fidelity: base-level sequence concordance and breakpoint accuracy (e.g., in bp) for matched calls in LCRs were not measured.
Computational cost and scalability: runtime/memory impacts of realignment/reassembly in long LCRs (and their feasibility at cohort scale) were not reported.
Coverage and mapping biases in LCRs: potential read depth anomalies, alignment pile-up artifacts, and strand/motif-specific biases in LCRs were not characterized.
Rare polymorphic LCRs: annotation favored common polymorphic LCRs (≥5 assemblies); the callability and error rates for rare/individual-specific LCR variants were not evaluated.
Functional context: while LCRs can overlap coding exons/regulatory elements, the fraction and characteristics of functionally relevant LCR SVs and recommended handling/validation strategies were not analyzed.
Cross-species applicability: whether the proposed LCR annotation and SV-calling insights translate to other genomes (e.g., mouse, plant) remains an open question.
Public resource metadata: provided LCR BED files lack per-region confidence scores, allele-length distributions, motif annotations, and population frequency estimates that would aid downstream stratified analyses.

Practical Applications

Overview

This paper quantifies how low-complexity regions (LCRs) in the human genome disproportionately harbor structural variants (SVs) and drive errors in long-read SV calling. It provides concrete LCR annotations for GRCh38 and T2T-CHM13, shows error rates rise with LCR length, demonstrates that haplotype-aware multi-sequence realignment substantially improves accuracy (e.g., longcallD), and recommends LCR-aware stratification and realignment/reassembly for robust SV analysis. Below are practical applications—immediate and long-term—across industry, academia, policy, and daily practice, with sectors, tools/workflows, and assumptions/dependencies noted.

Immediate Applications

LCR-aware SV annotation in existing pipelines
- Sectors: healthcare (clinical genomics), academia, software
- Tools/Workflows: integrate hg38.lcr-v4.bed.gz / chm13v2.lcr-v4.bed.gz (Zenodo) into pipelines (e.g., Nextflow/WDL), annotate SVs via bedtools or truvari stratification, flag calls overlapping LCR
- Assumptions/Dependencies: genome build alignment (GRCh38 or CHM13; liftOver required for GRCh37), consistent handling of satellite repeat exclusions
Stratified benchmarking and caller selection/tuning
- Sectors: academia, bioinformatics vendors, core facilities
- Tools/Workflows: evaluate callers with truvari using bench --passonly --pick ac --dup-to-ins then refine --use-original-vcfs; report FDR/FNR stratified by LCR/SegDup and LCR length bins; prioritize haplotype-aware callers (e.g., longcallD, SVDSS+kanpig)
- Assumptions/Dependencies: access to HG002-Q100 v1.1 truth set; consistent evaluation settings; sufficient compute
Haplotype-aware realignment for LCR hot spots
- Sectors: healthcare, academia, software
- Tools/Workflows: adopt longcallD (haplotype-aware multi-sequence realignment) or local reassembly/haplotype-resolved strategies; re-run LCR-overlapping calls with these methods; use IGV for manual triage when needed
- Assumptions/Dependencies: long-read data (PacBio HiFi/ONT), adequate coverage, compute resources, team expertise
Reporting SOPs to flag LCR SVs for confirmation
- Sectors: policy within clinical labs; healthcare delivery
- Tools/Workflows: add “LCR-overlap” confidence attribute in LIMS and clinical reports; require orthogonal validation (e.g., local assembly, alternative platform) for LCR SVs; document lower confidence tiers or “special handling”
- Assumptions/Dependencies: lab accreditation practices (CAP/CLIA), stakeholder buy-in, turnaround times
Targeted validation workflows for LCR SVs
- Sectors: healthcare, academia
- Tools/Workflows: PCR-free long-read enrichment/capture, ONT duplex for error reduction, haplotype-resolved local assembly (e.g., hifiasm) and assembly-to-reference SV calling, IGV inspection
- Assumptions/Dependencies: sample availability, cost/coverage budgets, specialized lab capability
Research analyses that retain but stratify LCR SVs
- Sectors: academia
- Tools/Workflows: avoid blanket filtering of LCR-overlapping SVs; stratify analyses by LCR status; prioritize functional follow-up for LCR SVs in coding exons or regulatory loci
- Assumptions/Dependencies: recognition of LCRs’ potential functional impacts; appropriate statistical modeling of error enrichment
Pangenome and genome browser annotation updates
- Sectors: academia, software
- Tools/Workflows: add LCR tracks to UCSC/Ensembl; annotate pangenome graphs (e.g., minigraph-derived bubbles) with LCR calls; publish track hubs
- Assumptions/Dependencies: curator bandwidth; versioning and metadata standards
CRISPR design risk management in LCRs
- Sectors: biotech/pharma
- Tools/Workflows: update guide RNA design pipelines to penalize/avoid LCR loci; LCR-aware off-target assessment
- Assumptions/Dependencies: tool updates; understanding of LCR-induced mapping ambiguity
Consumer genomics QA/communications
- Sectors: consumer genomics companies
- Tools/Workflows: automatically flag DTC SV calls that overlap LCR; add disclaimers and offer confirmatory lab testing options
- Assumptions/Dependencies: product policy changes; customer education
Training and capacity building for analysts and pathologists
- Sectors: education, healthcare
- Tools/Workflows: case studies demonstrating LCR misalignment and allele-resolution pitfalls; hands-on exercises with truvari stratification and IGV review
- Assumptions/Dependencies: curricular integration; access to datasets/tools
Coverage planning and procurement for LCR-heavy projects
- Sectors: healthcare, research operations
- Tools/Workflows: coverage calculators emphasizing higher HiFi/duplex coverage for LCR resolution; procurement strategies prioritizing long-read platforms for SV studies
- Assumptions/Dependencies: budget constraints; platform availability
Data sharing norms that include LCR stratification metrics
- Sectors: academia, consortia
- Tools/Workflows: include LCR overlap and LCR-length-stratified accuracy metrics in published SV callsets; provide accompanying BED tracks
- Assumptions/Dependencies: journal and repository policies; community expectations
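
Several items above call for accuracy metrics stratified by LCR length. One way this might look in practice is sketched below, assuming each evaluated call has already been annotated with the length of the LCR it overlaps and with whether it matched the truth set. The bin edges, function names, and toy records are hypothetical and are not the bins used in the paper.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable, Tuple

BIN_EDGES = [0, 100, 500, 2000]  # hypothetical LCR-length bin edges, in bp


def bin_label(lcr_len: int) -> str:
    """Map a non-negative LCR length to a label such as '100-500' or '>=2000'."""
    i = bisect_right(BIN_EDGES, lcr_len) - 1
    lo = BIN_EDGES[i]
    if i + 1 < len(BIN_EDGES):
        return f"{lo}-{BIN_EDGES[i + 1]}"
    return f">={lo}"


def fdr_by_lcr_length(calls: Iterable[Tuple[int, bool]]) -> Dict[str, float]:
    """Per-bin FDR from (lcr_length, call_is_correct) records."""
    totals: Dict[str, int] = defaultdict(int)
    wrong: Dict[str, int] = defaultdict(int)
    for lcr_len, is_correct in calls:
        label = bin_label(lcr_len)
        totals[label] += 1
        if not is_correct:
            wrong[label] += 1
    return {label: wrong[label] / totals[label] for label in totals}


# Toy records (invented): errors concentrate in the longest LCR bin
toy_calls = [(50, True), (80, True), (300, True), (300, False),
             (2500, False), (2500, False), (2500, True)]
print(fdr_by_lcr_length(toy_calls))
```

Publishing a table of this shape next to a callset would let readers see at a glance whether errors concentrate in long LCRs, as the paper reports for the callers it tested.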

Long-Term Applications

Next-generation SV callers specialized for LCRs
- Sectors: software, academia
- Tools/Workflows: expand haplotype-aware multi-sequence realignment, hybrid realignment–reassembly approaches, allele-precise matching, GPU acceleration; ML scoring of alignment consistency
- Assumptions/Dependencies: funding, access to long-read training data, community benchmarks
LCR-aware benchmarking standards and clinical reporting guidance
- Sectors: policy, academia, industry
- Tools/Workflows: GIAB expansions that enforce allele-level matching and LCR-length stratification; CAP/CLIA guidelines for minimum coverage and validation in LCRs; standardized FDR/FNR reporting by LCR category
- Assumptions/Dependencies: multi-stakeholder consensus; pilot studies; regulatory processes
Routine clinical use of haplotype-resolved assemblies
- Sectors: healthcare
- Tools/Workflows: clinical-grade hifiasm or similar assembly workflows; assembly-to-reference SV calling; integrated haplotype phasing and graph alignment
- Assumptions/Dependencies: reduced cost/turnaround, reimbursement models, validated pipelines, sample phasing (e.g., trio data or long-range phasing)
Pangenome-based multi-sample SV representation and merging
- Sectors: academia, software, healthcare informatics
- Tools/Workflows: graph genome infrastructures (e.g., minigraph, VG/giraffe), cross-sample multi-sequence alignment, stable SV identifiers, LCR-aware normalization
- Assumptions/Dependencies: production-grade graph tooling, training, interoperability with EHR/LIMS
LCR-aware QC dashboards and visualization products
- Sectors: software, healthcare
- Tools/Workflows: IGV plugins and web dashboards showing LCR overlap, LCR-length-specific FDR/FNR, automated triage routing to reassembly
- Assumptions/Dependencies: product development resources; integration with existing lab informatics
Sequencing chemistry and platform innovations for LCR resolution
- Sectors: sequencing industry
- Tools/Workflows: longer, more accurate HiFi or duplex reads; library prep improvements that maintain repeat context; error models tuned for LCRs
- Assumptions/Dependencies: R&D investment, market demand, validation across genomes
ML calibrators for LCR misalignment risk
- Sectors: software, academia
- Tools/Workflows: train models to predict misalignment/error probability from features (LCR length, read depth, alignment entropy), integrate with callers to adjust confidence
- Assumptions/Dependencies: labeled ground truth across LCR spectra; privacy-compliant datasets
Population catalogs of LCR SV alleles and disease/regulatory associations
- Sectors: academia, healthcare, pharma
- Tools/Workflows: large cohort long-read sequencing; pangenome graph integration; association studies (GWAS/eQTL) stratified by LCR; functional assays of LCR-mediated regulation
- Assumptions/Dependencies: cohorts and consent, compute/storage, harmonized pipelines
Regulatory updates for diagnostic labs and DTC services
- Sectors: policy
- Tools/Workflows: codify minimum validation for LCR SVs, coverage thresholds, recommended orthogonal methods, standardized reporting language
- Assumptions/Dependencies: evidence base, stakeholder agreement, implementation timelines
Professional education and certification modules
- Sectors: education, professional societies
- Tools/Workflows: formal curricula on LCR-aware genomics, certification/CE credits, competency frameworks
- Assumptions/Dependencies: institutional buy-in, funding
Interoperability standards for LCR tracks and metadata
- Sectors: software, academia
- Tools/Workflows: standardize BED schemas for LCR annotations (build, version, satellite inclusion rules), registries for reference tracks, provenance tracking in pipelines
- Assumptions/Dependencies: standards bodies participation; adoption across tools
Therapeutic strategies leveraging LCR-mediated regulation
- Sectors: pharma/biotech
- Tools/Workflows: map regulatory functions of LCRs; design therapies that modulate LCR-linked regulatory elements or avoid destabilizing repeats; gene therapy designs aware of SV risks in LCR contexts
- Assumptions/Dependencies: deep functional characterization, translational models, safety considerations

Notes on dependencies across applications:

High-quality long-read sequencing (coverage, read length, base accuracy) is often required to realize improvements in LCRs.
Genome build consistency matters (GRCh38, CHM13; liftOver for GRCh37); satellite repeat handling shapes LCR scope.
Community consensus on benchmarks and allele-resolution criteria will affect comparability and clinical acceptance.
Graph/pangenome infrastructures are powerful but require new tooling, training, and standards to be production-ready.

Glossary

allele: One of multiple versions of a genetic variant at a specific genomic position. "943 of them have “*” as alternate alleles."
allele resolution: The precision with which distinct alleles (often per haplotype) are represented as separate variants. "The difference is caused by the allele resolution."
alpha repeats: Alpha-satellite DNA sequences that form part of centromeric regions in the genome. "We filtered about half of them that overlap with alpha and HSAT2/3 centromeric repeats found by dna-brnn."
assembly gaps: Regions in a reference genome where sequence is missing or not assembled. "“NotConf” denotes not-confident regions in the HG002-Q100 v1.1 benchmark, excluding assembly gaps in GRCh38."
assembly-to-reference alignment: Aligning assembled contigs or haplotypes back to a reference genome to identify variants. "and call variants from assembly-to-reference alignment"
BED file: A tab-delimited genomic interval format commonly used to store coordinate-based annotations. "This resulted in a BED file with 111,067 records, covering 35.4Mb of GRCh38."
centromeric satellites: Highly repetitive DNA arrays located at centromeres, often composed of specific satellite families. "Most of the additional regions came from centromeric satellites that are not HSAT2/3 or alpha repeats."
confident regions: Genome intervals designated as reliable for benchmarking variant calls. "There are 29,131 SVs of ≥50bp in length contained in the confident regions in the new HG002-Q100 v1.1 benchmark."
dna-brnn: A computational tool for classifying or detecting repetitive DNA elements using neural networks. "We filtered about half of them that overlap with alpha and HSAT2/3 centromeric repeats found by dna-brnn."
false discovery rate (FDR): The proportion of called variants that are incorrect among all calls deemed positive. "False discovery rate (FDR) of SVs in the HG002-Q100 confident regions, measured by truvari in the “refine” mode."
false negative rate (FNR): The proportion of true variants that are missed by the caller. "False negative rate (FNR) of SVs in HG002-Q100."
Genome-In-A-Bottle (GIAB): A consortium that produces high-quality benchmark datasets for variant calling. "Constructed by the Genome-In-A-Bottle (GIAB) group, the latest SV benchmark HG002-Q100 v1.1 contains 28,188 SVs in 2.76Gb of confident regions, consistent with the recent counts."
genotyping: Determining the zygosity or copy state of a variant for a sample’s haplotypes. "We used kanpig v1.1.0 for genotyping SVs called by SVDSS as is suggested in the documentation."
GRCh38: A widely used human reference genome build (Genome Reference Consortium Human Build 38). "We applied longdust to GRCh38 and identified 115.4Mb of LCRs on assembled chromosomes."
haplotype: A set of variants inherited together on the same chromosome copy. "Suppose both haplotypes in HG002 harbor a 6kb insertion to the same location of the reference genome."
haplotype-resolved assemblers: Genome assemblers that separately reconstruct sequences for each haplotype. "Given accurate long reads at high coverage, we may also assemble the reads with haplotype-resolved assemblers"
heterozygous: Having different alleles on the two haplotypes at a locus. "The newer HG002-Q100 benchmark would consider this event as two heterozygous insertions"
HG002-Q100 v1.1: A specific GIAB structural variant benchmark dataset for sample HG002 with high-confidence regions. "Constructed by the Genome-In-A-Bottle (GIAB) group, the latest SV benchmark HG002-Q100 v1.1 contains 28,188 SVs in 2.76Gb of confident regions"
HG002-SV v0.6: An older GIAB structural variant benchmark dataset for sample HG002. "In contrast, published in 2020, the older HG002-SV benchmark v0.6 only contains 9,705 SVs in 2.66Gb."
HiFi reads (PacBio High-Fidelity reads): Highly accurate long sequencing reads produced by PacBio. "We acquired PacBio High-Fidelity reads from HPRC"
HPRC (Human Pangenome Reference Consortium): A consortium generating diverse human genome assemblies to build a pangenome reference. "assemblies from the Human Pangenome Reference Consortium (HPRC)"
HSAT2/3: Human satellite DNA families prevalent in centromeric regions. "We filtered about half of them that overlap with alpha and HSAT2/3 centromeric repeats found by dna-brnn."
IGV: Integrative Genomics Viewer, a tool for visualizing genomic data and alignments. "IGV screenshot of alignment around an LCR."
kilobase (kb): A length unit in genomics equal to 1,000 base pairs. "Suppose both haplotypes in HG002 harbor a 6kb insertion to the same location of the reference genome."
LCR (low-complexity region): A genomic region dominated by simple, repetitive sequence patterns. "We identified 35.4 Mb of low-complexity regions (LCRs) in GRCh38."
liftover: Converting genomic coordinates from one reference build to another. "we lifted its confident regions over to GRCh38"
local reassembly: Reconstructing sequence in a specific region from reads to improve variant detection. "the critical role of realignment or local reassembly in accurate SV calling."
longcallD: An SV caller that performs haplotype-aware multi-sequence realignment. "Developed in our group, longcallD achieves the lowest error rate"
longdust: A tool for detecting low-complexity sequences in genomes. "We applied longdust to GRCh38 and identified 115.4Mb of LCRs on assembled chromosomes."
minimap2: A fast sequence aligner widely used for long-read mapping. "aligned them to the primary assembly of GRCh38 with minimap2"
minigraph graph: A pangenome graph representation produced by minigraph, capturing variation across assemblies. "and used the results to annotate variant bubbles in the minigraph graph of these assemblies"
multi-sequence alignment: Simultaneously aligning multiple sequences to achieve consistent variant representation. "conducting multi-sequence alignment across samples, such methods can produce more consistent SV representations."
pangenome-based methods: Approaches that use a graph or multi-reference representation capturing population-level variation. "calling variants across samples with pangenome-based methods will be the preferred approach"
phased realignment: Realigning reads with knowledge of their haplotype phase to improve consistency. "The bottom panel shows the phased realignment by longcallD."
phasing: Determining which variants co-occur on the same haplotype. "Performing phasing and alignment within each haplotype, these assemblers are more powerful than most SV callers."
polymorphic: Varying among individuals in a population, either in sequence or in presence and absence. "It may miss polymorphic LCRs present in other human samples but missing from GRCh38."
segmental duplication (SegDup): Large duplicated genomic regions that can complicate alignment and variant calling. "segmental duplications (SegDup)"
single nucleotide polymorphism (SNP): A variant where a single base differs among sequences. "The inserted sequences however differ by one SNP between them."
Sniffles2: A structural variant caller for long-read data. "Sniffles2 may optionally take tandem repeatitive regions as input"
structural variant (SV): A genomic variant typically ≥50 bp, including insertions, deletions, duplications, inversions, etc. "Structural variants (SVs) are ≥50bp genomic variants"
tandem repetitive regions: DNA regions with adjacent repeated sequence units. "Sniffles2 may optionally take tandem repeatitive regions as input"
T2T-CHM13 genome: A telomere-to-telomere human reference assembly derived from the CHM13 cell line. "We applied the same procedure to the T2T-CHM13 genome"
truvari: A benchmarking tool that compares structural variant calls against a truth set to evaluate their accuracy. "The truvari evaluation tool also filters SVs with “*” alleles."
UCSC Genome Browser: An online platform providing genome assemblies and annotation tracks. "16.2% of the LCRs are intersected with the SegDup annotation from the “genomicSuperDups” track of the UCSC Genome Browser."
variant bubble: A subgraph in a pangenome graph representing alternative alleles between sequences. "A variant bubble was marked as an LCR if (a) ≥70% of the sequences in the bubble were LCRs in the source assemblies, and (b) the sequences in the bubble were not annotated as segmental duplications (SegDup) by HPRC."
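
The two-part rule quoted in the "variant bubble" entry can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `BubbleSeq` record and the `is_lcr_bubble` function are hypothetical names, and the sketch assumes that "≥70% of the sequences" is measured as the fraction of bases annotated as LCR across all sequences in the bubble, and that a bubble is excluded if any of its sequences carries a SegDup annotation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BubbleSeq:
    """One sequence traversing a variant bubble (hypothetical record type)."""
    length: int       # sequence length in bp
    lcr_bases: int    # bases annotated as LCR in the source assembly
    in_segdup: bool   # True if the sequence is annotated as SegDup


def is_lcr_bubble(seqs: List[BubbleSeq], min_lcr_frac: float = 0.70) -> bool:
    """Return True if the bubble satisfies conditions (a) and (b).

    Assumption: the LCR fraction is computed over bases pooled across
    all sequences in the bubble; the source does not spell this out.
    """
    total = sum(s.length for s in seqs)
    if total == 0:
        return False  # an empty bubble cannot be classified
    # Condition (b): no sequence in the bubble is annotated as SegDup.
    if any(s.in_segdup for s in seqs):
        return False
    # Condition (a): at least min_lcr_frac of the bases are LCR.
    lcr = sum(s.lcr_bases for s in seqs)
    return lcr / total >= min_lcr_frac


# Example: 150 of 200 bases are LCR and there is no SegDup overlap.
example = [BubbleSeq(100, 80, False), BubbleSeq(100, 70, False)]
print(is_lcr_bubble(example))
```

Keeping the threshold as a parameter makes it easy to see how sensitive the LCR annotation of a graph is to the 70% cutoff.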
