Maximum size of jointly constrained DNA address sets

Determine the maximum possible cardinality, as a function of the address length n, of a set of DNA address sequences that simultaneously satisfy the following constraints: (i) GC-prefix balance close to 50% for all sufficiently long prefixes (D-GC-prefix-balanced), (ii) minimum mutual Hamming distance at least d, (iii) mutual uncorrelatedness where no prefix of one address appears as a proper suffix of the same or another address, and (iv) absence of secondary structure in the primer sequences.
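
As a rough illustration of what membership in such a set means, the following Python sketch checks constraints (i) to (iii) for a candidate list of equal-length addresses. The function names are hypothetical, and the reading of D-GC-prefix-balance used here (the count of G/C bases minus the count of A/T bases stays within D in every prefix) is an assumption rather than a definition quoted from the source. Constraint (iv) is not covered, since it would need a secondary-structure predictor.

```python
from itertools import combinations

def gc_prefix_balanced(seq, D):
    """Assumed reading of (i): |#(G or C) - #(A or T)| <= D in every prefix."""
    running_sum = 0
    for base in seq:
        running_sum += 1 if base in "GC" else -1
        if abs(running_sum) > D:
            return False
    return True

def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def mutually_uncorrelated(addresses):
    """(iii): no prefix of one address equals a proper suffix of the same
    or another address. Assumes all addresses share the same length n."""
    for a in addresses:
        for b in addresses:
            for k in range(1, len(a)):  # overlap lengths 1 .. n-1
                if a[:k] == b[-k:]:
                    return False
    return True

def satisfies_i_to_iii(addresses, D, d):
    """Check constraints (i)-(iii) jointly; (iv) is deliberately left out."""
    if not all(gc_prefix_balanced(s, D) for s in addresses):
        return False
    if any(hamming(a, b) < d for a, b in combinations(addresses, 2)):
        return False
    return mutually_uncorrelated(addresses)
```

For example, `satisfies_i_to_iii(["AACG", "ATGC"], D=2, d=3)` holds under these assumptions, whereas a single address such as `"ACGA"` already fails (iii) because its prefix `A` equals its suffix `A`.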

Background

Address sequences are critical for selective random access and accurate amplification in DNA-based storage. Yazdi et al. impose four constraints on address design (C1 to C4 in the paper): GC-prefix balance to ensure stability and sequencing coverage, large mutual Hamming distance to reduce mis-selection, mutual uncorrelatedness to avoid accidental cross-hybridization and assembly errors, and absence of secondary structure to prevent problems during PCR and editing.

While each constraint can be addressed individually using known coding techniques (e.g., bounded running digital sum codes for GC-balance, cross-bifix-free constructions for uncorrelatedness), constructing sets that satisfy all four constraints simultaneously and determining tight bounds on their maximal size is challenging. The paper highlights the lack of a complete solution and frames it as an open problem.
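
To make the quantity in question concrete, here is a minimal greedy sketch that builds one feasible set for small n and therefore yields only a lower bound on the maximum cardinality, not the maximum itself. It is not a construction from the paper. It assumes the same running-sum reading of GC-prefix balance as above (G/C count minus A/T count within D in every prefix), ignores the secondary-structure constraint, and all function names are hypothetical.

```python
from itertools import product

def rds_ok(seq, D):
    """Assumed GC-prefix balance: running (GC minus AT) count stays within D."""
    s = 0
    for base in seq:
        s += 1 if base in "GC" else -1
        if abs(s) > D:
            return False
    return True

def overlaps(a, b):
    """True if some prefix of a equals a proper suffix of b (equal lengths)."""
    return any(a[:k] == b[-k:] for k in range(1, len(a)))

def compatible(a, b, d):
    """Pairwise check: Hamming distance at least d and no overlap either way."""
    dist = sum(x != y for x, y in zip(a, b))
    return dist >= d and not overlaps(a, b) and not overlaps(b, a)

def greedy_address_set(n, D, d):
    """Scan all 4**n sequences in lexicographic order and keep each one that
    is feasible on its own and compatible with everything kept so far."""
    chosen = []
    for tup in product("ACGT", repeat=n):
        s = "".join(tup)
        if not rds_ok(s, D) or overlaps(s, s):
            continue
        if all(compatible(s, t, d) for t in chosen):
            chosen.append(s)
    return chosen
```

The size of `greedy_address_set(n, D, d)` depends on the scan order, which is one way to see why a greedy pass says little about the true maximum; an exact answer for a given n would require a maximum-clique style search over the compatibility graph, which quickly becomes infeasible as n grows.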

References

As already pointed out, it is an open problem to determine the largest number of address sequences that jointly satisfy the constraints C1 to C4.

DNA-Based Storage: Trends and Methods (1507.01611 - Yazdi et al., 2015) in Section “Constrained Coding for Address Sequences” (within Random Access and Rewritable DNA-Based Storage)