Challenges in structural variant calling in low-complexity regions (2509.23057v1)
Abstract: Background: Structural variants (SVs) are genomic differences $\ge$50 bp in length. They remain challenging to detect even with long sequence reads, and the sources of these difficulties are not well quantified. Results: We identified 35.4 Mb of low-complexity regions (LCRs) in GRCh38. Although these regions cover only 1.2% of the genome, they contain 69.1% of confident SVs in sample HG002. Across long-read SV callers, 77.3-91.3% of erroneous SV calls occur within LCRs, with error rates increasing with LCR length. Conclusion: SVs are enriched and difficult to call in LCRs. Special care needs to be taken when calling and analyzing these variants.
Explain it Like I'm 14
Overview
This paper looks at a tricky part of genetics: finding big DNA changes (called structural variants) in parts of the genome that have lots of repeats. These repeat-heavy areas are called low-complexity regions (LCRs). The main point is that most of the hard-to-detect DNA changes live in these LCRs, and standard tools often struggle there.
What questions were the researchers asking?
The researchers wanted to answer simple but important questions:
- Where in the genome do most structural variants (SVs) occur?
- Why do computers make more mistakes when finding SVs in certain places?
- How much do repeat-heavy regions (LCRs) affect both the number of SVs found and the error rates?
- Which methods or tools work better in these hard regions?
How did they study it?
Think of the genome as a giant book of letters (A, T, C, G). Sometimes big chunks of this book are added, removed, or rearranged—these are structural variants (SVs). LCRs are like pages with the same word or pattern printed over and over (e.g., “ATATAT…”). It’s harder to tell exactly where changes happen in such repetitive pages.
Here’s what they did, in everyday terms:
- They used a computer program (longdust) to find the “repeat-heavy pages” (LCRs) in the human reference genome (GRCh38) and in many other human genome assemblies from a big project called HPRC.
- They combined LCRs found in the reference and in many real people’s genomes to build a detailed map of where LCRs commonly appear.
- They tested 11 different SV-finding tools on high-quality, long DNA reads (PacBio HiFi reads), aligning the reads to the reference with minimap2, and then asked: which SV calls match a trusted set of answers (the GIAB benchmark for sample HG002)?
- They measured two types of mistakes:
- False discovery rate (FDR): how many called SVs are incorrect (false positives).
- False negative rate (FNR): how many true SVs were missed.
- They compared results in LCRs vs. other parts of the genome and checked how errors change when LCRs get longer.
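The two error measures above can be written as a short calculation. This is a minimal sketch, not the paper's evaluation code: the counts are hypothetical, and a single true-positive count is used for both rates for simplicity, whereas benchmarking tools may count matches separately on the call side and the truth side.

```python
def false_discovery_rate(true_pos: int, false_pos: int) -> float:
    """Fraction of called SVs that do not match the truth set."""
    called = true_pos + false_pos
    return false_pos / called if called else 0.0


def false_negative_rate(true_pos: int, false_neg: int) -> float:
    """Fraction of truth-set SVs that the caller missed."""
    truth = true_pos + false_neg
    return false_neg / truth if truth else 0.0


# Hypothetical counts, chosen only to illustrate the arithmetic.
tp, fp, fn = 900, 100, 300
fdr = false_discovery_rate(tp, fp)  # 100 / (900 + 100)
fnr = false_negative_rate(tp, fn)   # 300 / (900 + 300)
print(f"FDR = {fdr:.3f}, FNR = {fnr:.3f}")
```

A low FDR means few of the reported SVs are wrong; a low FNR means few real SVs are missed. A tool can do well on one and poorly on the other, which is why both are reported.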
Technical terms explained simply:
- Structural variant (SV): a big DNA change (50 or more letters long), like a large insertion (new text added) or deletion (text removed).
- Low-complexity region (LCR): a stretch of DNA with lots of repeated patterns; this confuses matching and alignment.
- Alignment: lining up DNA reads to the reference “book” to see differences.
- Haplotype: the version of DNA from one parent; every person has two haplotypes for most regions.
- Benchmark (truth set): a trusted list of SVs used to check how well tools perform.
What did they find?
Here are the key findings, explained simply:
- LCRs are small but packed: They cover only about 1.2% of the genome, but hold about 69% of the confident SVs in the tested sample (HG002). That means most big changes are hiding in the hardest spots.
- Most errors happen in LCRs: Depending on the tool, 77–91% of SV mistakes occurred inside LCRs. In non-repetitive regions, tools were much more accurate.
- Longer LCRs cause more trouble: As the repeat regions get longer, errors increase. Some tools missed about half of the SVs in LCRs that were 2,000 letters or longer.
- Smarter alignment helps: A tool that realigns reads while paying attention to haplotypes (longcallD) had the lowest error rates in LCRs. It effectively “looks at the puzzle pieces together” and aligns them more consistently.
- Older benchmarks missed many LCR SVs: An older trusted dataset (HG002-SV v0.6) included far fewer SVs—largely because it excluded many LCRs and combined similar changes from the two haplotypes into one, reducing counts. The newer benchmark (HG002-Q100 v1.1) is more detailed and includes many LCR regions.
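To get a feel for how "small but packed" LCRs are, the two percentages from the abstract can be turned into a rough density ratio. This is a back-of-the-envelope sketch: it assumes the 1.2% genome share and the 69.1% SV share refer to comparable regions, which is only approximately true because the SVs are counted within benchmark confident regions.

```python
lcr_genome_fraction = 0.012  # share of GRCh38 covered by LCRs (from the abstract)
lcr_sv_fraction = 0.691      # share of confident HG002 SVs inside LCRs (from the abstract)


def density_ratio(sv_fraction: float, genome_fraction: float) -> float:
    """SVs per unit of sequence inside a region set, relative to outside it."""
    inside = sv_fraction / genome_fraction
    outside = (1.0 - sv_fraction) / (1.0 - genome_fraction)
    return inside / outside


fold = density_ratio(lcr_sv_fraction, lcr_genome_fraction)
print(f"SV density inside LCRs is roughly {fold:.0f}x the density outside")
```

Under these assumptions the ratio comes out on the order of a couple of hundred, which is why a tiny share of the genome ends up dominating both the SV count and the error count.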
Why is this important?
- Many important genetic differences live in LCRs, which can affect genes and how they work. If we ignore or filter them out just because they’re hard, we might miss meaningful biology.
- Researchers and clinicians need to treat LCR SVs carefully. These regions demand better algorithms—especially those that realign reads or locally reassemble DNA—to avoid mistakes.
- Building better reference maps and using pangenome approaches (comparing many genomes together) can help keep SV calls consistent across people, even in repeat-heavy places.
- The paper shows that with high-quality data and smarter methods, accurate calling in LCRs is achievable, but it requires special attention.
Takeaway in plain words
Most big DNA changes are hiding in the most confusing parts of the genome—the parts with lots of repeats. Because of that, many tools make mistakes there. Using methods that consider both parental copies (haplotypes) and re-align reads carefully can greatly improve accuracy. Scientists should separate and analyze LCR-related SVs differently and use advanced strategies when possible.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
The following list highlights concrete gaps, uncertainties, and unexplored aspects that remain after this paper and can guide future research:
- Generalizability beyond a single individual: results are based only on HG002; performance and LCR-related error profiles across diverse ancestries, admixed samples, and trios remain unquantified.
- Sequencing technology scope: evaluation used PacBio HiFi reads only; the impact of ONT (including duplex/Q20+), CLR, hybrid datasets, and varying read length/error profiles on LCR SV calling is unknown.
- Coverage sensitivity: accuracy in LCRs at lower or uneven coverage, and the coverage thresholds needed for reliable haplotype-aware calling and realignment, were not investigated.
- Aligner dependence: only minimap2 was used; how repeat-aware aligners (e.g., winnowmap, lra), parameter tuning, or graph/pangenome aligners change LCR SV accuracy and alignment consistency remains unexplored.
- Reference choice effects: SV calling was evaluated against GRCh38; whether mapping and evaluation against T2T-CHM13 (especially non-centromeric LCRs) improves performance is untested.
- LCR definition sensitivity: the chosen thresholds (≥50 bp LCR length, +5 bp padding, ≥70% overlap to classify SVs as LCR, SegDup priority over LCR) were not subjected to sensitivity analyses to quantify how conclusions change with different cutoffs.
- Completeness and precision of LCR annotation: the false-positive/false-negative rates of longdust-derived LCRs (including missed short/complex repeats and misclassified SegDup-overlapping repeats) were not benchmarked.
- Satellite repeats handling: alpha and HSAT2/3 satellites were excluded; the effect of including other satellite classes on SV calling and evaluation (both in GRCh38 and CHM13) is not assessed.
- Truth set uncertainty in LCRs: reliance on HG002-Q100 assemblies as ground truth may include residual errors/representation choices in LCRs; quantifying truth inaccuracies and their effect on measured FDR/FNR is missing.
- Evaluation strictness: truvari “refine” settings can count non-exact allele-length matches as correct; the prevalence and magnitude of allele-length mismatches and breakpoint offsets among “true positives” in LCRs are not reported.
- Error typology: a systematic breakdown of error modes in LCRs (length misestimation, breakpoint shift, allele sequence errors, genotype/phasing errors, spurious calls) is lacking.
- Variant-type stratification: accuracy by SV class (insertions vs deletions vs duplications vs inversions vs complex events) in LCRs is not analyzed; duplications were converted to insertions for evaluation without quantifying the impact of this choice.
- Repeat composition factors: only LCR length was considered; how repeat unit size, homopolymer content, motif degeneracy, and divergence affect callability and error rates is unknown.
- SegDup–LCR interplay: prioritizing SegDup over LCR during annotation may undercount LCR-associated SVs within long polymorphic duplications; best practices to disentangle and evaluate such regions are not defined.
- Caller parameterization: most tools were run with defaults; whether targeted parameter tuning (e.g., tandem repeat hints for Sniffles2, realignment windows, clustering thresholds) improves LCR performance is untested.
- Excluded callers and formats: Severus (fewer LCR calls) and SVision-pro (no allele sequences) were excluded; converting outputs or adjusting settings to enable fair comparison could change conclusions.
- Haplotype-aware algorithm requirements: beyond longcallD and SVDSS, the minimal algorithmic features (e.g., multi-read realignment vs local reassembly vs haplotype phasing) necessary to achieve reliable LCR SV calls are not delineated.
- Multi-sample calling/merging: the reported difficulty of cross-sample merging in LCRs is not quantified; direct comparisons of traditional merging vs graph/pangenome-based joint calling on cohorts are missing.
- Precision–recall trade-offs: only FDR and FNR were reported; full PR curves, thresholded performance, and calibration analyses (e.g., quality scores vs correctness in LCRs) are absent.
- Breakpoint/allele-sequence fidelity: base-level sequence concordance and breakpoint accuracy (e.g., in bp) for matched calls in LCRs were not measured.
- Computational cost and scalability: runtime/memory impacts of realignment/reassembly in long LCRs (and their feasibility at cohort scale) were not reported.
- Coverage and mapping biases in LCRs: potential read depth anomalies, alignment pile-up artifacts, and strand/motif-specific biases in LCRs were not characterized.
- Rare polymorphic LCRs: annotation favored common polymorphic LCRs (≥5 assemblies); the callability and error rates for rare/individual-specific LCR variants were not evaluated.
- Functional context: while LCRs can overlap coding exons/regulatory elements, the fraction and characteristics of functionally relevant LCR SVs and recommended handling/validation strategies were not analyzed.
- Cross-species applicability: whether the proposed LCR annotation and SV-calling insights translate to other genomes (e.g., mouse, plant) remains an open question.
- Public resource metadata: provided LCR BED files lack per-region confidence scores, allele-length distributions, motif annotations, and population frequency estimates that would aid downstream stratified analyses.
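The LCR definition thresholds mentioned in the list above (+5 bp padding, ≥70% overlap) can be sketched as a small interval calculation. The summary gives only the thresholds, so the details here are assumptions: overlap is measured as the fraction of the SV's reference span covered by padded LCR intervals, coordinates are half-open BED-style, and the function names are hypothetical. Insertions have no reference span and would need separate handling that is not shown.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), BED-style


def pad_and_merge(intervals: List[Interval], pad: int = 5) -> List[Interval]:
    """Extend each interval by `pad` bp on both sides, then merge overlaps."""
    padded = sorted((max(0, s - pad), e + pad) for s, e in intervals)
    merged: List[Interval] = []
    for s, e in padded:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def overlap_fraction(sv: Interval, lcrs: List[Interval]) -> float:
    """Fraction of the SV span covered by non-overlapping LCR intervals."""
    sv_start, sv_end = sv
    length = sv_end - sv_start
    if length <= 0:
        return 0.0
    covered = sum(max(0, min(sv_end, e) - max(sv_start, s)) for s, e in lcrs)
    return covered / length


def is_lcr_sv(sv: Interval, lcrs: List[Interval],
              pad: int = 5, min_frac: float = 0.70) -> bool:
    """Assumed rule: an SV is an 'LCR SV' if >= min_frac of it lies in padded LCRs."""
    return overlap_fraction(sv, pad_and_merge(lcrs, pad)) >= min_frac


# Hypothetical LCR intervals and deletions on one chromosome.
lcrs = [(1000, 1100), (1200, 1260)]
print(is_lcr_sv((1010, 1110), lcrs))  # 95 of 100 bp covered
print(is_lcr_sv((1050, 1250), lcrs))  # 110 of 200 bp covered
```

Rerunning such a function with different values of `pad` and `min_frac` is one simple way to approach the sensitivity analysis that the list notes is missing.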
Practical Applications
Overview
This paper quantifies how low-complexity regions (LCRs) in the human genome disproportionately harbor structural variants (SVs) and drive errors in long-read SV calling. It provides concrete LCR annotations for GRCh38 and T2T-CHM13, shows error rates rise with LCR length, demonstrates that haplotype-aware multi-sequence realignment substantially improves accuracy (e.g., longcallD), and recommends LCR-aware stratification and realignment/reassembly for robust SV analysis. Below are practical applications—immediate and long-term—across industry, academia, policy, and daily practice, with sectors, tools/workflows, and assumptions/dependencies noted.
Immediate Applications
- LCR-aware SV annotation in existing pipelines
- Sectors: healthcare (clinical genomics), academia, software
- Tools/Workflows: integrate hg38.lcr-v4.bed.gz / chm13v2.lcr-v4.bed.gz (Zenodo) into pipelines (e.g., Nextflow/WDL), annotate SVs via bedtools or truvari stratification, flag calls overlapping LCRs
- Assumptions/Dependencies: genome build alignment (GRCh38 or CHM13; liftOver required for GRCh37), consistent handling of satellite repeat exclusions
- Stratified benchmarking and caller selection/tuning
- Sectors: academia, bioinformatics vendors, core facilities
- Tools/Workflows: evaluate callers with truvari using bench --passonly --pick ac --dup-to-ins, then refine --use-original-vcfs; report FDR/FNR stratified by LCR/SegDup and LCR length bins; prioritize haplotype-aware callers (e.g., longcallD, SVDSS+kanpig)
- Assumptions/Dependencies: access to HG002-Q100 v1.1 truth set; consistent evaluation settings; sufficient compute
- Haplotype-aware realignment for LCR hot spots
- Sectors: healthcare, academia, software
- Tools/Workflows: adopt longcallD (haplotype-aware multi-sequence realignment) or local reassembly/haplotype-resolved strategies; re-run LCR-overlapping calls with these methods; use IGV for manual triage when needed
- Assumptions/Dependencies: long-read data (PacBio HiFi/ONT), adequate coverage, compute resources, team expertise
- Reporting SOPs to flag LCR SVs for confirmation
- Sectors: policy within clinical labs; healthcare delivery
- Tools/Workflows: add “LCR-overlap” confidence attribute in LIMS and clinical reports; require orthogonal validation (e.g., local assembly, alternative platform) for LCR SVs; document lower confidence tiers or “special handling”
- Assumptions/Dependencies: lab accreditation practices (CAP/CLIA), stakeholder buy-in, turnaround times
- Targeted validation workflows for LCR SVs
- Sectors: healthcare, academia
- Tools/Workflows: PCR-free long-read enrichment/capture, ONT duplex for error reduction, haplotype-resolved local assembly (e.g., hifiasm) and assembly-to-reference SV calling, IGV inspection
- Assumptions/Dependencies: sample availability, cost/coverage budgets, specialized lab capability
- Research analyses that retain but stratify LCR SVs
- Sectors: academia
- Tools/Workflows: avoid blanket filtering of LCR-overlapping SVs; stratify analyses by LCR status; prioritize functional follow-up for LCR SVs in coding exons or regulatory loci
- Assumptions/Dependencies: recognition of LCRs’ potential functional impacts; appropriate statistical modeling of error enrichment
- Pangenome and genome browser annotation updates
- Sectors: academia, software
- Tools/Workflows: add LCR tracks to UCSC/Ensembl; annotate pangenome graphs (e.g., minigraph-derived bubbles) with LCR calls; publish track hubs
- Assumptions/Dependencies: curator bandwidth; versioning and metadata standards
- CRISPR design risk management in LCRs
- Sectors: biotech/pharma
- Tools/Workflows: update guide RNA design pipelines to penalize/avoid LCR loci; LCR-aware off-target assessment
- Assumptions/Dependencies: tool updates; understanding of LCR-induced mapping ambiguity
- Consumer genomics QA/communications
- Sectors: consumer genomics companies
- Tools/Workflows: automatically flag DTC SV calls that overlap LCR; add disclaimers and offer confirmatory lab testing options
- Assumptions/Dependencies: product policy changes; customer education
- Training and capacity building for analysts and pathologists
- Sectors: education, healthcare
- Tools/Workflows: case studies demonstrating LCR misalignment and allele-resolution pitfalls; hands-on exercises with truvari stratification and IGV review
- Assumptions/Dependencies: curricular integration; access to datasets/tools
- Coverage planning and procurement for LCR-heavy projects
- Sectors: healthcare, research operations
- Tools/Workflows: coverage calculators emphasizing higher HiFi/duplex coverage for LCR resolution; procurement strategies prioritizing long-read platforms for SV studies
- Assumptions/Dependencies: budget constraints; platform availability
- Data sharing norms that include LCR stratification metrics
- Sectors: academia, consortia
- Tools/Workflows: include LCR overlap and LCR-length-stratified accuracy metrics in published SV callsets; provide accompanying BED tracks
- Assumptions/Dependencies: journal and repository policies; community expectations
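Several of the items above call for reporting accuracy stratified by LCR length. The following is a minimal sketch of that idea, assuming each call has already been labeled as a true or false positive by a benchmarking tool and annotated with the length of the LCR it falls in. The bin edges are hypothetical, apart from the 2,000 bp cutoff mentioned in the findings, and the function names are illustrative.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable, Tuple

BIN_EDGES = [50, 200, 500, 2000]  # hypothetical lower bounds of length bins, in bp


def bin_label(lcr_len: int) -> str:
    """Map an LCR length to a human-readable length bin."""
    i = bisect_right(BIN_EDGES, lcr_len)
    if i == 0:
        return f"<{BIN_EDGES[0]}"
    if i == len(BIN_EDGES):
        return f">={BIN_EDGES[-1]}"
    return f"{BIN_EDGES[i - 1]}-{BIN_EDGES[i] - 1}"


def fdr_by_bin(calls: Iterable[Tuple[int, bool]]) -> Dict[str, float]:
    """calls: (lcr_length, is_true_positive) pairs; returns FDR per length bin."""
    counts = defaultdict(lambda: [0, 0])  # label -> [false positives, total calls]
    for lcr_len, is_tp in calls:
        entry = counts[bin_label(lcr_len)]
        entry[0] += 0 if is_tp else 1
        entry[1] += 1
    return {label: fp / total for label, (fp, total) in counts.items()}


# Hypothetical labeled calls, for illustration only.
calls = [(60, True), (120, True), (150, False),
         (800, True), (2500, False), (3000, True)]
for label, fdr in sorted(fdr_by_bin(calls).items()):
    print(f"{label:>10}: FDR = {fdr:.2f}")
```

The same grouping works for FNR by iterating over truth-set SVs instead of calls and counting the ones that were missed.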
Long-Term Applications
- Next-generation SV callers specialized for LCRs
- Sectors: software, academia
- Tools/Workflows: expand haplotype-aware multi-sequence realignment, hybrid realignment–reassembly approaches, allele-precise matching, GPU acceleration; ML scoring of alignment consistency
- Assumptions/Dependencies: funding, access to long-read training data, community benchmarks
- LCR-aware benchmarking standards and clinical reporting guidance
- Sectors: policy, academia, industry
- Tools/Workflows: GIAB expansions that enforce allele-level matching and LCR-length stratification; CAP/CLIA guidelines for minimum coverage and validation in LCRs; standardized FDR/FNR reporting by LCR category
- Assumptions/Dependencies: multi-stakeholder consensus; pilot studies; regulatory processes
- Routine clinical use of haplotype-resolved assemblies
- Sectors: healthcare
- Tools/Workflows: clinical-grade hifiasm or similar assembly workflows; assembly-to-reference SV calling; integrated haplotype phasing and graph alignment
- Assumptions/Dependencies: reduced cost/turnaround, reimbursement models, validated pipelines, sample phasing (e.g., trio data or long-range phasing)
- Pangenome-based multi-sample SV representation and merging
- Sectors: academia, software, healthcare informatics
- Tools/Workflows: graph genome infrastructures (e.g., minigraph, VG/giraffe), cross-sample multi-sequence alignment, stable SV identifiers, LCR-aware normalization
- Assumptions/Dependencies: production-grade graph tooling, training, interoperability with EHR/LIMS
- LCR-aware QC dashboards and visualization products
- Sectors: software, healthcare
- Tools/Workflows: IGV plugins and web dashboards showing LCR overlap, LCR-length-specific FDR/FNR, automated triage routing to reassembly
- Assumptions/Dependencies: product development resources; integration with existing lab informatics
- Sequencing chemistry and platform innovations for LCR resolution
- Sectors: sequencing industry
- Tools/Workflows: longer, more accurate HiFi or duplex reads; library prep improvements that maintain repeat context; error models tuned for LCRs
- Assumptions/Dependencies: R&D investment, market demand, validation across genomes
- ML calibrators for LCR misalignment risk
- Sectors: software, academia
- Tools/Workflows: train models to predict misalignment/error probability from features (LCR length, read depth, alignment entropy), integrate with callers to adjust confidence
- Assumptions/Dependencies: labeled ground truth across LCR spectra; privacy-compliant datasets
- Population catalogs of LCR SV alleles and disease/regulatory associations
- Sectors: academia, healthcare, pharma
- Tools/Workflows: large cohort long-read sequencing; pangenome graph integration; association studies (GWAS/eQTL) stratified by LCR; functional assays of LCR-mediated regulation
- Assumptions/Dependencies: cohorts and consent, compute/storage, harmonized pipelines
- Regulatory updates for diagnostic labs and DTC services
- Sectors: policy
- Tools/Workflows: codify minimum validation for LCR SVs, coverage thresholds, recommended orthogonal methods, standardized reporting language
- Assumptions/Dependencies: evidence base, stakeholder agreement, implementation timelines
- Professional education and certification modules
- Sectors: education, professional societies
- Tools/Workflows: formal curricula on LCR-aware genomics, certification/CE credits, competency frameworks
- Assumptions/Dependencies: institutional buy-in, funding
- Interoperability standards for LCR tracks and metadata
- Sectors: software, academia
- Tools/Workflows: standardize BED schemas for LCR annotations (build, version, satellite inclusion rules), registries for reference tracks, provenance tracking in pipelines
- Assumptions/Dependencies: standards bodies participation; adoption across tools
- Therapeutic strategies leveraging LCR-mediated regulation
- Sectors: pharma/biotech
- Tools/Workflows: map regulatory functions of LCRs; design therapies that modulate LCR-linked regulatory elements or avoid destabilizing repeats; gene therapy designs aware of SV risks in LCR contexts
- Assumptions/Dependencies: deep functional characterization, translational models, safety considerations
Notes on dependencies across applications:
- High-quality long-read sequencing (coverage, read length, base accuracy) is often required to realize improvements in LCRs.
- Genome build consistency matters (GRCh38, CHM13; liftOver for GRCh37); satellite repeat handling shapes LCR scope.
- Community consensus on benchmarks and allele-resolution criteria will affect comparability and clinical acceptance.
- Graph/pangenome infrastructures are powerful but require new tooling, training, and standards to be production-ready.
Glossary
- allele: One of multiple versions of a genetic variant at a specific genomic position. "943 of them have “*” as alternate alleles."
- allele resolution: The precision with which distinct alleles (often per haplotype) are represented as separate variants. "The difference is caused by the allele resolution."
- alpha repeats: Alpha-satellite DNA sequences that form part of centromeric regions in the genome. "We filtered about half of them that overlap with alpha and HSAT2/3 centromeric repeats found by dna-brnn."
- assembly gaps: Regions in a reference genome where sequence is missing or not assembled. "“NotConf” denotes not-confident regions in the HG002-Q100 v1.1 benchmark, excluding assembly gaps in GRCh38."
- assembly-to-reference alignment: Aligning assembled contigs or haplotypes back to a reference genome to identify variants. "and call variants from assembly-to-reference alignment"
- BED file: A tab-delimited genomic interval format commonly used to store coordinate-based annotations. "This resulted in a BED file with 111,067 records, covering 35.4Mb of GRCh38."
- centromeric satellites: Highly repetitive DNA arrays located at centromeres, often composed of specific satellite families. "Most of the additional regions came from centromeric satellites that are not HSAT2/3 or alpha repeats."
- confident regions: Genome intervals designated as reliable for benchmarking variant calls. "There are 29,131 SVs of ≥50bp in length contained in the confident regions in the new HG002-Q100 v1.1 benchmark."
- dna-brnn: A computational tool for classifying or detecting repetitive DNA elements using neural networks. "We filtered about half of them that overlap with alpha and HSAT2/3 centromeric repeats found by dna-brnn."
- false discovery rate (FDR): The proportion of called variants that are incorrect among all calls deemed positive. "False discovery rate (FDR) of SVs in the HG002-Q100 confident regions, measured by truvari in the “refine” mode."
- false negative rate (FNR): The proportion of true variants that are missed by the caller. "False negative rate (FNR) of SVs in HG002-Q100."
- Genome-In-A-Bottle (GIAB): A consortium that produces high-quality benchmark datasets for variant calling. "Constructed by the Genome-In-A-Bottle (GIAB) group, the latest SV benchmark HG002-Q100 v1.1 contains 28,188 SVs in 2.76Gb of confident regions, consistent with the recent counts."
- genotyping: Determining the zygosity or copy state of a variant for a sample’s haplotypes. "We used kanpig v1.1.0 for genotyping SVs called by SVDSS as is suggested in the documentation."
- GRCh38: A widely used human reference genome build (Genome Reference Consortium Human Build 38). "We applied longdust to GRCh38 and identified 115.4Mb of LCRs on assembled chromosomes."
- haplotype: A set of variants inherited together on the same chromosome copy. "Suppose both haplotypes in HG002 harbor a 6kb insertion to the same location of the reference genome."
- haplotype-resolved assemblers: Genome assemblers that separately reconstruct sequences for each haplotype. "Given accurate long reads at high coverage, we may also assemble the reads with haplotype-resolved assemblers"
- heterozygous: Having different alleles on the two haplotypes at a locus. "The newer HG002-Q100 benchmark would consider this event as two heterozygous insertions"
- HG002-Q100 v1.1: A specific GIAB structural variant benchmark dataset for sample HG002 with high-confidence regions. "Constructed by the Genome-In-A-Bottle (GIAB) group, the latest SV benchmark HG002-Q100 v1.1 contains 28,188 SVs in 2.76Gb of confident regions"
- HG002-SV v0.6: An older GIAB structural variant benchmark dataset for sample HG002. "In contrast, published in 2020, the older HG002-SV benchmark v0.6 only contains 9,705 SVs in 2.66Gb."
- HiFi reads (PacBio High-Fidelity reads): Highly accurate long sequencing reads produced by PacBio. "We acquired PacBio High-Fidelity reads from HPRC"
- HPRC (Human Pangenome Reference Consortium): A consortium generating diverse human genome assemblies to build a pangenome reference. "assemblies from the Human Pangenome Reference Consortium (HPRC)"
- HSAT2/3: Human satellite DNA families prevalent in centromeric regions. "We filtered about half of them that overlap with alpha and HSAT2/3 centromeric repeats found by dna-brnn."
- IGV: Integrative Genomics Viewer, a tool for visualizing genomic data and alignments. "IGV screenshot of alignment around an LCR."
- kilobase (kb): A length unit in genomics equal to 1,000 base pairs. "Suppose both haplotypes in HG002 harbor a 6kb insertion to the same location of the reference genome."
- LCR (low-complexity region): A genomic region dominated by simple, repetitive sequence patterns. "We identified 35.4 Mb of low-complexity regions (LCRs) in GRCh38."
- liftover: Converting genomic coordinates from one reference build to another. "we lifted its confident regions over to GRCh38"
- local reassembly: Reconstructing sequence in a specific region from reads to improve variant detection. "the critical role of realignment or local reassembly in accurate SV calling."
- longcallD: An SV caller that performs haplotype-aware multi-sequence realignment. "Developed in our group, longcallD achieves the lowest error rate"
- longdust: A tool for detecting low-complexity sequences in genomes. "We applied longdust to GRCh38 and identified 115.4Mb of LCRs on assembled chromosomes."
- minimap2: A fast sequence aligner widely used for long-read mapping. "aligned them to the primary assembly of GRCh38 with minimap2"
- minigraph graph: A pangenome graph representation produced by minigraph, capturing variation across assemblies. "and used the results to annotate variant bubbles in the minigraph graph of these assemblies"
- multi-sequence alignment: Simultaneously aligning multiple sequences to achieve consistent variant representation. "conducting multi-sequence alignment across samples, such methods can produce more consistent SV representations."
- phased realignment: Realigning reads with knowledge of their haplotype phase to improve consistency. "The bottom panel shows the phased realignment by longcallD."
- phasing: Determining which variants co-occur on the same haplotype. "Performing phasing and alignment within each haplotype, these assemblers are more powerful than most SV callers."
- pangenome-based methods: Approaches that use a graph or multi-reference representation capturing population-level variation. "calling variants across samples with pangenome-based methods will be the preferred approach"
- polymorphic: Presenting variation in sequence among individuals in a population. "It may miss polymorphic LCRs present in other human samples but missing from GRCh38."
- segmental duplication (SegDup): Large duplicated genomic regions that can complicate alignment and variant calling. "segmental duplications (SegDup)"
- single nucleotide polymorphism (SNP): A variant where a single base differs among sequences. "The inserted sequences however differ by one SNP between them."
- Sniffles2: A structural variant caller for long-read data. "Sniffles2 may optionally take tandem repeatitive regions as input"
- structural variant (SV): A genomic variant typically ≥50 bp, including insertions, deletions, duplications, inversions, etc. "Structural variants (SVs) are ≥50bp genomic variants"
- tandem repetitive regions: DNA regions with adjacent repeated sequence units. "Sniffles2 may optionally take tandem repeatitive regions as input"
- T2T-CHM13 genome: A telomere-to-telomere human reference assembly derived from the CHM13 cell line. "We applied the same procedure to the T2T-CHM13 genome"
- truvari: A benchmarking tool for evaluating variant call accuracy. "The truvari evaluation tool also filters SVs with “*” alleles."
- UCSC Genome Browser: An online platform providing genome assemblies and annotation tracks. "16.2% of the LCRs are intersected with the SegDup annotation from the “genomicSuperDups” track of the UCSC Genome Browser."
- variant bubble: A subgraph in a pangenome graph representing alternative alleles between sequences. "A variant bubble was marked as an LCR if (a) ≥70% of the sequences in the bubble were LCRs in the source assemblies, and (b) the sequences in the bubble were not annotated as segmental duplications (SegDup) by HPRC."
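The "allele resolution" entry above, together with the quoted example of two 6 kb insertions that differ by one SNP, can be illustrated with a toy count. This is a sketch under a strong simplifying assumption: a benchmark that is not allele-resolved is modeled as collapsing alleles of equal length at the same site, which is not necessarily how any real benchmark merges variants. The function name and sequences are hypothetical.

```python
from typing import List


def count_sv_records(alleles: List[str], allele_resolved: bool) -> int:
    """Count SV records at one site under two representation choices.

    allele_resolved=True  : every distinct inserted sequence is its own record.
    allele_resolved=False : alleles of the same length are collapsed (assumed rule).
    """
    keys = set()
    for seq in alleles:
        keys.add(seq if allele_resolved else len(seq))
    return len(keys)


# Two 6,000 bp insertions at the same site that differ at a single base.
ins_hap1 = "ACGT" * 1500
ins_hap2 = ins_hap1[:3000] + "T" + ins_hap1[3001:]  # position 3000 changed from A to T

print(count_sv_records([ins_hap1, ins_hap2], allele_resolved=True))   # two heterozygous records
print(count_sv_records([ins_hap1, ins_hap2], allele_resolved=False))  # one merged record
```

This difference in counting is one reason the newer and older benchmarks report such different SV totals for the same sample.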