Maximum number of DNA address sequences satisfying constraints C1–C4

Determine, as a function of the address length n, the largest possible cardinality u(n) of a set of DNA address sequences of length n that simultaneously satisfy four constraints: (C1) each sequence and all of its sufficiently long prefixes have GC content of approximately 50%; (C2) the Hamming distance between any two distinct sequences is large (e.g., at least half the address length); (C3) the sequences are mutually uncorrelated, meaning that no proper prefix of any sequence appears as a suffix of the same sequence or of any other sequence in the set; and (C4) the sequences form no secondary (folding) structures as predicted by thermodynamic models.
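The combinatorial constraints C1–C3 can be checked mechanically for a candidate address set; the following is a minimal sketch of such a check, not the authors' screening procedure. The function names and the parameters min_prefix_len, tol, and min_dist are illustrative assumptions, since the problem statement leaves "sufficiently long", "approximately 50%", and "large" unquantified. C4 is omitted because it requires a thermodynamic folding predictor.

```python
from itertools import combinations


def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def prefixes_balanced(seq, min_prefix_len, tol):
    """C1 (assumed form): every prefix of length >= min_prefix_len,
    including the full sequence, has GC fraction within tol of 0.5."""
    return all(
        abs(gc_fraction(seq[:k]) - 0.5) <= tol
        for k in range(min_prefix_len, len(seq) + 1)
    )


def hamming(a, b):
    """Number of positions at which equal-length sequences a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def mutually_uncorrelated(seqs):
    """C3: no proper prefix of any sequence equals a suffix of the same
    sequence or of any other sequence in the set."""
    for a in seqs:
        for b in seqs:
            for k in range(1, min(len(a), len(b))):
                if a[:k] == b[-k:]:
                    return False
    return True


def satisfies_c1_c3(seqs, min_prefix_len, tol, min_dist):
    """Check C1, C2, and C3 together; thresholds are caller-chosen assumptions."""
    c1 = all(prefixes_balanced(s, min_prefix_len, tol) for s in seqs)
    c2 = all(hamming(a, b) >= min_dist for a, b in combinations(seqs, 2))
    c3 = mutually_uncorrelated(seqs)
    return c1 and c2 and c3


# Hypothetical length-8 addresses: each starts with A and contains no other A,
# so no proper prefix (which begins with A) can match any proper suffix.
addresses = ["ACGTCTGT", "AGTCTGTC"]
ok = satisfies_c1_c3(addresses, min_prefix_len=4, tol=0.25, min_dist=4)
```

Under these assumed thresholds the two example addresses pass all three checks, whereas a sequence such as ACGTACGT fails C3 on its own because its prefix ACGT is also its suffix. Determining u(n) would additionally require maximizing the set size over all such admissible sets while also enforcing C4.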

Background

The paper’s random-access DNA storage architecture relies on short address sequences to uniquely identify and selectively amplify data blocks. To ensure reliable selection without cross-hybridization and robust sequencing, the authors impose four design constraints on addresses: balanced GC content across prefixes, large mutual Hamming distance, mutual uncorrelatedness of prefixes and suffixes, and absence of secondary structures.

While the authors prove exponential bounds for sets of mutually uncorrelated sequences and construct practical address sets via expurgated balanced codes and computational screening, quantifying the maximum achievable number of address sequences that satisfy all four constraints simultaneously remains unresolved. This quantity directly impacts system scalability and achievable encoding rates.

References

It remains an open problem to determine the largest number of address sequences that jointly satisfy the constraints C1–C4.

A Rewritable, Random-Access DNA-Based Storage System (1505.02199 - Yazdi et al., 2015) in Methods, Subsection “Sequence Correlation” (after Theorem 1)