Efficient computation or approximation of the abstract nanopore channel capacity

Develop an efficient algorithm to compute or approximate the capacity C_f of the abstract nanopore channel defined by a mapping f: {A,C,G,T}^k -> {0,1,...,b-1}, where capacity C_f is the asymptotic per-base logarithmic growth rate of the number of realizable current readouts produced by f across all DNA strands. The goal is to achieve polynomial-time performance in the relevant parameters (e.g., k and b) for general mappings f.

Background

The paper introduces an abstract deterministic model of the nanopore sequencer in which each k-mer is mapped to one of b current levels by a function f. The capacity C_f is defined combinatorially as the per-base logarithmic growth rate of the number of current readouts achievable by some DNA strand. The authors give an exact algorithm based on NFA-to-DFA subset construction followed by the transfer matrix method. Because subset construction can produce a DFA whose size is exponential in the number of NFA states, this algorithm is impractical for larger k.
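The subset-construction and transfer-matrix approach described above can be sketched for small k as follows. This is an illustrative sketch, not the authors' implementation: the function name capacity_bits, the representation of f as a dictionary from k-mer strings to levels, the choice of (k-1)-mers as NFA states, and the use of log base 2 (bits per base) are all assumptions made here.

```python
from itertools import product

import numpy as np

BASES = "ACGT"


def capacity_bits(f, k):
    """Estimate C_f in bits per base (log base 2 is an assumption).

    f is a dict mapping every k-mer string to a current level.
    NFA states are the (k-1)-mers; reading base c from state s emits
    f[s + c] and moves to (s + c)[1:]. Every state may start a strand.
    """
    start = frozenset("".join(p) for p in product(BASES, repeat=k - 1))
    index = {start: 0}   # DFA state (set of NFA states) -> matrix row
    edges = []           # one entry per (DFA state, emitted level)
    stack = [start]
    while stack:
        subset = stack.pop()
        by_level = {}
        for s in subset:
            for c in BASES:
                kmer = s + c
                by_level.setdefault(f[kmer], set()).add(kmer[1:])
        for target in by_level.values():
            target = frozenset(target)
            if target not in index:
                index[target] = len(index)
                stack.append(target)
            edges.append((index[subset], index[target]))

    # Transfer matrix of the DFA: entry (i, j) counts the levels
    # that lead from DFA state i to DFA state j.
    A = np.zeros((len(index), len(index)))
    for i, j in edges:
        A[i, j] += 1
    rho = max(abs(np.linalg.eigvals(A)))  # spectral radius
    return float(np.log2(rho))
```

For example, with k = 1 and f sending A and C to level 0 and G and T to level 1, the DFA has a single state with two outgoing labels, so the sketch returns 1 bit per base. The number of DFA states visited can grow exponentially with the number of (k-1)-mers, which is the bottleneck the problem statement refers to.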

Because counting accepted strings for general NFAs is PSPACE-complete, the authors note that computing C_f may be hard in general, though their NFAs have special structure. They explicitly state that they do not currently know how to calculate or approximate C_f efficiently. They point instead to randomized approximation algorithms that can estimate the fixed-length quantity C_f(ℓ), which concerns strands of a fixed length ℓ, but not the asymptotic C_f.
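To make the fixed-length quantity concrete, the sketch below counts distinct readouts exactly by enumerating all 4^ℓ strands, so it is only usable for very small ℓ and is not the randomized approximation the authors refer to. The function name fixed_length_rate, the dictionary representation of f, the normalization by ℓ, and the use of log base 2 are assumptions made for illustration.

```python
from itertools import product
from math import log2


def fixed_length_rate(f, k, length):
    """Brute-force (1/length) * log2(number of distinct readouts).

    f is a dict mapping every k-mer string to a current level.
    A strand of `length` bases yields length - k + 1 readout symbols.
    Runs in time proportional to 4**length.
    """
    readouts = set()
    for bases in product("ACGT", repeat=length):
        strand = "".join(bases)
        readout = tuple(
            f[strand[i:i + k]] for i in range(length - k + 1)
        )
        readouts.add(readout)
    return log2(len(readouts)) / length
```

With k = 1 and an injective f, every strand gives a distinct readout, so the rate is log2(4^ℓ)/ℓ = 2 for any ℓ.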

References

Although we currently don't know how to calculate (or approximate) C_f efficiently, it is possible to approximate the capacity for strands of fixed length ℓ, C_f(ℓ).

— On Coding for an Abstracted Nanopore Channel for DNA Storage (2102.01839 - Hulett et al., 2021) in Remark, Subsection 'Complexity of Algorithm R_f' under Section 'Computing the Capacity'
