Efficient FPRAS for #NFA Counting

Updated 3 July 2025

Practical FPRAS for #NFA is an efficient randomized algorithm that estimates, with provable error bounds, the number of strings accepted by a nondeterministic finite automaton.
It combines dynamic programming with Monte Carlo sampling and optimized union estimation to reduce time complexity significantly compared with earlier schemes.
The algorithm's practical impact extends to query evaluation, data provenance, and software testing, where it makes approximate counting feasible for problems in the logspace counting class SpanL.

A fully polynomial randomized approximation scheme (FPRAS) for the #NFA problem provides an efficient, probabilistically accurate algorithm to estimate the number of strings of length $n$ accepted by a given nondeterministic finite automaton (NFA). Progress in this area has moved from purely theoretical existence proofs and highly impractical methods to the recent introduction of algorithms with time bounds near those of core automata operations, positioning #NFA as one of the few #P-hard problems with realistic randomized approximate counting solutions.

1. Definition and Computational Significance

The #NFA problem is defined as follows: for an NFA $A$ with $m$ states over a finite alphabet and a natural number $n$ (in unary), compute $|L_n(A)|$, the number of words of length $n$ accepted by $A$. This counting problem is #P-complete (under polynomial-time Turing reductions) and further SpanL-complete, placing it at the heart of logspace counting complexity. Its canonical status means that advances for #NFA often propagate immediately to broad classes of logspace counting and enumeration tasks, including regular path query evaluation, document spanner extraction, and provenance computation for data management.
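To make the definition concrete, the following minimal sketch computes $|L_n(A)|$ exactly by on-the-fly subset construction. It is a baseline for small instances only, since the number of reachable state subsets can grow exponentially in $m$; it is not part of any FPRAS. The dictionary encoding of transitions, the function name `count_accepted_exact`, and the toy automaton are illustrative assumptions.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

# Hypothetical encoding: transitions[(state, symbol)] is the set of successors.
Transitions = Dict[Tuple[int, str], Set[int]]


def count_accepted_exact(
    transitions: Transitions,
    initial: Set[int],
    accepting: Set[int],
    alphabet: Iterable[str],
    n: int,
) -> int:
    """Return |L_n(A)| exactly via on-the-fly subset construction."""
    symbols = list(alphabet)
    # weights[S] = number of distinct words of the current length after which
    # the set of reachable NFA states is exactly S.
    weights: Dict[FrozenSet[int], int] = {frozenset(initial): 1}
    for _ in range(n):
        nxt: Dict[FrozenSet[int], int] = {}
        for subset, count in weights.items():
            for a in symbols:
                target = frozenset(
                    q2 for q in subset for q2 in transitions.get((q, a), ())
                )
                if target:  # drop words that no run can read
                    nxt[target] = nxt.get(target, 0) + count
        weights = nxt
    # A word is accepted iff its reachable set contains an accepting state.
    return sum(c for s, c in weights.items() if s & accepting)


# Toy NFA over {a, b}: accepts words whose second-to-last symbol is 'a'.
TOY = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "a"): {2}, (1, "b"): {2}}
print(count_accepted_exact(TOY, {0}, {2}, "ab", 4))  # 8 = 2**3
```

Each word corresponds to exactly one reachable subset, so words are never double counted even when the NFA has several accepting runs for the same word.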

2. Theoretical Foundations: Existence of FPRAS

The breakthrough result, due to Arenas, Croquevielle, Jayaram, and Riveros (2019), proved that #NFA admits an FPRAS, settling a long-standing open question. For any $\varepsilon \in (0,1)$ and $\delta \in (0,1)$, their algorithm outputs an approximation $\hat{N}$ to $|L_n(A)|$ such that

$\Pr\left[\,\bigl|\hat{N} - |L_n(A)|\bigr| \le \varepsilon\,|L_n(A)|\,\right] \ge 1 - \delta$

with expected time polynomial in $m$, $n$, $1/\varepsilon$, and $\log(1/\delta)$. The construction relied on combining dynamic programming over an unrolled, layered representation of the NFA with Monte Carlo sampling, advanced union size estimation, and rejection/reservoir sampling to control uniformity and concentration across the exponentially large search space.
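A small sketch may help show why plain dynamic programming over the unrolled automaton is not enough on its own. The code below counts runs per state and layer, which is exact and cheap, but for an ambiguous NFA the number of accepting runs overcounts the number of accepted words; this gap is what the sampling and union estimation machinery has to close. The function name `count_runs_by_layer` and the toy automaton are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Transitions = Dict[Tuple[int, str], Set[int]]


def count_runs_by_layer(
    transitions: Transitions, initial: Set[int], num_states: int, n: int
) -> List[List[int]]:
    """runs[l][q] = number of runs of length l from an initial state to q."""
    runs = [[1 if q in initial else 0 for q in range(num_states)]]
    for _ in range(n):
        prev, cur = runs[-1], [0] * num_states
        for (p, _symbol), successors in transitions.items():
            for q in successors:
                cur[q] += prev[p]  # edge p^(l-1) -> q^l in the unrolled graph
        runs.append(cur)
    return runs


# Toy ambiguous NFA over {a, b}: accepts words containing at least one 'a'.
AMBIG = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "a"): {1}, (1, "b"): {1}}
runs = count_runs_by_layer(AMBIG, {0}, 2, 3)
print(runs[3][1])  # 12 accepting runs, but only 2**3 - 1 = 7 accepted words
```

In this toy automaton a word with $k$ occurrences of `a` has $k$ accepting runs, so the run count is only an upper bound on $|L_3(A)|$. For unambiguous automata the two quantities coincide.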

As a consequence, every function in the logspace counting class SpanL inherits an FPRAS—a remarkable metatheoretic guarantee, since SpanL contains #P-complete problems.

3. Evolution of Algorithmic Techniques and Complexity

Early FPRAS (ACJR19)

The original FPRAS of Arenas et al. had very high sample complexity: it required independent sample sets for every state and layer, along with stringent invariants ensuring uniformity across all possible state subsets. The per-state sample sets had size $|S(q^\ell)| = O\left( \frac{m^7 n^7}{\varepsilon^7} \right)$, with a total runtime of

$\widetilde{O}\left( m^{17} n^{17} \varepsilon^{-14} \log(1/\delta) \right)$

which rendered practical implementations infeasible even for modestly sized automata.

Substantial Improvements (Meel, Chakraborty, Mathur 2024)

Meel, Chakraborty, and Mathur reduced the per-state sample size to $|S(q^\ell)| = \widetilde{O}\left( \frac{n^4}{\varepsilon^2} \right)$ and the overall time complexity to

$\widetilde{O}\left( (m^2 n^{10} + m^3 n^6) \varepsilon^{-4} \log^2(1/\delta) \right)$

by weakening the uniformity requirement from max-norm to total variation distance and replacing generic union estimation with more streamlined Monte Carlo estimators.
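For readers unfamiliar with Monte Carlo union estimation, the classic Karp-Luby style estimator is sketched below on explicit Python sets. It is a generic textbook technique shown for orientation, not the specific estimator used in ACJR19 or MCM24. In the #NFA setting the sets would be languages of predecessor states, accessed only through size estimates, samples, and membership tests; that mapping, the function name `karp_luby_union`, and the toy sets are assumptions.

```python
import random
from typing import List, Set


def karp_luby_union(sets: List[Set[int]], trials: int, rng: random.Random) -> float:
    """Estimate |S_1 ∪ ... ∪ S_k| from set sizes, samples, and membership tests."""
    sizes = [len(s) for s in sets]
    total = sum(sizes)
    if total == 0:
        return 0.0
    members = [sorted(s) for s in sets]  # indexable copies for uniform sampling
    hits = 0
    for _ in range(trials):
        # Pick a set with probability proportional to its size, then a
        # uniform element of it: a uniform (set, element) pair.
        i = rng.choices(range(len(sets)), weights=sizes)[0]
        x = rng.choice(members[i])
        # Count the pair only if i is the first set containing x, so each
        # element of the union is counted through exactly one pair.
        if all(x not in sets[j] for j in range(i)):
            hits += 1
    return total * hits / trials


demo = [set(range(0, 60)), set(range(40, 100)), set(range(50, 120))]
print(karp_luby_union(demo, 20000, random.Random(0)))  # true union size is 120
```

The estimator is unbiased because the fraction of canonical pairs equals the union size divided by the sum of the set sizes, and that fraction is at least $1/k$ for $k$ sets, which keeps the number of trials needed for a given relative error moderate.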

Nearly Practical FPRAS (2025: Exploiting Sample Dependence)

The most recent advance introduces a further optimized FPRAS with time complexity

$O\left( n^2 m^3 \log(nm) \varepsilon^{-2} \log(1/\delta) \right)$

This scheme departs from the earlier independent sampling paradigm: instead, it exploits dependencies among samples by reusing and propagating sample sets across layers, carefully analyzing the resulting variance through properties of derivation paths—where dependency between two samples is quantified via their last common ancestor in the computational tree.

To achieve practical performance, the algorithm constructs, for each state at layer $\ell$, a set of samples $S(q^\ell)$ by directly extending samples from predecessor layers. Membership checks (to determine if a word is in $L(q^\ell)$) are accelerated using matrix-based caching and incremental updates, reducing what would otherwise be $O(n m^2)$ per check to $O(1)$ amortized over the sample pool. The reliance on robust statistical estimators like median-of-means ensures concentration bounds despite dependencies.
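The caching idea can be illustrated with a minimal sketch: each sampled word carries a bitmask of the states reachable after reading it, so extending the word by one symbol costs one transition step and a membership query becomes a bit test. This assumes that $L(q^\ell)$ denotes the words of length $\ell$ that can reach $q$ from an initial state. The `CachedWord` class, the `step` helper, and the bitmask encoding are hypothetical and are not taken from the paper's data structures.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical encoding: succ[symbol][q] is a bitmask of the successors of
# state q on that symbol (bit i set means state i is a successor).
SuccMasks = Dict[str, List[int]]


def step(reach: int, symbol: str, succ: SuccMasks) -> int:
    """Advance a bitmask of reachable states by one symbol."""
    out, q = 0, 0
    while reach:
        if reach & 1:
            out |= succ[symbol][q]
        reach >>= 1
        q += 1
    return out


@dataclass(frozen=True)
class CachedWord:
    word: str
    reach: int  # states reachable from the initial states after reading `word`

    def extend(self, symbol: str, succ: SuccMasks) -> "CachedWord":
        # One transition step updates the cache; the prefix is never re-read.
        return CachedWord(self.word + symbol, step(self.reach, symbol, succ))

    def reaches(self, q: int) -> bool:
        # Membership of `word` in L(q^l), with l = len(word), is one bit test.
        return bool((self.reach >> q) & 1)


# Toy NFA over {a, b} with states 0 (initial) and 1: 'a' may move 0 -> 1.
SUCC = {"a": [0b11, 0b10], "b": [0b01, 0b10]}
w = CachedWord("", 0b01).extend("b", SUCC).extend("a", SUCC)
print(w.word, bin(w.reach), w.reaches(1))  # ba 0b11 True
```

Without the cached mask, answering the same query would require re-simulating the automaton on the whole word, which is the per-check cost the incremental update avoids.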

The improved complexity is now comparable to that of basic membership checking and matrix operations on the NFA, making practical implementation plausible.

4. Applications and Practical Impact

The availability of an efficient FPRAS for #NFA underpins several practical domains:

Probabilistic Query Evaluation: Many tasks in probabilistic databases and provenance tracking reduce to counting or approximately counting accepting paths of automata. For example, the probability that a query (expressed as a path or regular expression) matches in a tuple-independent graph database can be framed as a weighted #NFA instance.
Graph Data Management: Counting and uniform sampling of regular path query answers (such as in knowledge graphs or RDF stores) maps directly to the #NFA problem.
Software Testing and Formal Methods: Coverage metrics and path enumeration in model checking are often reducible to counting automata-accepted behaviors.
Learning from Weak Supervision: Modern frameworks encode constraints from weak supervision as NFAs, where efficient marginalization equates to tractable approximate #NFA counting.

A plausible implication is that continued optimization may soon permit the integration of FPRAS-based counting directly in production systems for the above tasks, particularly as the most recent algorithms are compatible with parallel and hardware-accelerated (e.g., GPU) matrix operations.

5. Methodological Landscape and Trade-Offs

The evolution from strict independence in sample generation (favoring theoretical cleanliness but incurring impractical cost) to intentional reuse and managed dependency (facilitated by refined variance analysis and robust estimators) marks a central methodological advance. This enables the aggressive amortization of membership and union checks, yielding major savings in both time and space.
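As an illustration of the kind of robust estimator referred to here, the sketch below implements a basic median-of-means: samples are split into groups, each group is averaged, and the median of the group means is returned. It is a generic version under the assumption of a simple list of numeric samples; the function name `median_of_means` and the grouping rule are illustrative and do not reproduce the paper's variance analysis for dependent samples.

```python
import statistics
from typing import Sequence


def median_of_means(samples: Sequence[float], groups: int) -> float:
    """Median of the means of `groups` equal-sized consecutive blocks."""
    if groups <= 0 or groups > len(samples):
        raise ValueError("groups must be between 1 and len(samples)")
    size = len(samples) // groups  # trailing leftover samples are ignored
    means = [
        statistics.fmean(samples[g * size:(g + 1) * size]) for g in range(groups)
    ]
    return statistics.median(means)


# One wild outlier ruins the plain mean but not the median of block means.
data = [1.0] * 8 + [1000.0]
print(statistics.fmean(data))    # 112.0
print(median_of_means(data, 3))  # 1.0
```

Averaging within blocks reduces variance, while taking the median across blocks limits the influence of the few blocks that happen to be far off.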

Paper	Time Complexity	Core Technical Shift
ACJR19	$\tilde{O}(n^{17} m^{17}\varepsilon^{-14}\log(1/\delta))$	Max-norm invariant; independent sampling
MCM24	$\tilde{O}((n^{10}m^2 + n^6 m^3)\varepsilon^{-4}\log^2(1/\delta))$	Total variation; union estimation improved
2025	$O(n^2 m^3\log(nm)\varepsilon^{-2} \log(1/\delta))$	Sample reuse; derivation path analysis
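To get a feel for the gaps between the rows of the table, the bounds can be evaluated numerically. The sketch below plugs values into the three expressions while ignoring the hidden constants and the polylogarithmic factors suppressed by $\tilde{O}$, and it assumes natural logarithms; only the orders of magnitude are meaningful. The function names are illustrative.

```python
import math


def bound_acjr19(m: int, n: int, eps: float, delta: float) -> float:
    return m**17 * n**17 * eps**-14 * math.log(1 / delta)


def bound_mcm24(m: int, n: int, eps: float, delta: float) -> float:
    return (m**2 * n**10 + m**3 * n**6) * eps**-4 * math.log(1 / delta) ** 2


def bound_2025(m: int, n: int, eps: float, delta: float) -> float:
    return n**2 * m**3 * math.log(n * m) * eps**-2 * math.log(1 / delta)


# Illustrative instance: 10 states, words of length 10, 10% error and failure.
for name, bound in [
    ("ACJR19", bound_acjr19),
    ("MCM24", bound_mcm24),
    ("2025", bound_2025),
]:
    print(f"{name}: {bound(10, 10, 0.1, 0.1):.1e}")
```

For $m = n = 10$ and $\varepsilon = \delta = 0.1$, the three expressions evaluate to roughly $2 \times 10^{48}$, $5 \times 10^{16}$, and $1 \times 10^{8}$ respectively, which shows how far each step moved toward practicality even on a tiny instance.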

For applications requiring strong subset-uniformity among samples, earlier methods may still be relevant; however, for approximate counting as needed in FPRAS, the more recent invariants suffice.

6. Challenges, Future Directions, and Open Questions

While the recent FPRAS algorithm narrows the gap to practical deployment, several challenges remain:

Empirical Evaluation and Engineering: No large-scale implementation or benchmarking is yet reported. Efficient realization, perhaps hardware-accelerated, is needed to confirm practical speedups.
Scalability for Very Large $n$, $m$: For extremely large automata or word lengths, the quadratic dependence on $n$ and cubic dependence on $m$ could still be limiting, though the current approach substantially improves feasibility.
Optimality and Lower Bounds: Theoretical work may seek to prove whether further asymptotic improvements, particularly in $n$ or $m$, are possible, or whether the current FPRAS is near-optimal.
Extension to More Expressive Models: Generalizing these ideas to weighted automata, context-free languages, or other automata-theoretic settings represents an open frontier.

7. Connections to the Broader Algorithms and Query Evaluation Ecosystem

This line of research provides a bridge between automata theory, randomized approximate counting, and applied data systems. Recent work demonstrates that circuit model counting techniques (e.g., with nOBDDs or DNNF) for query provenance (Amarilli et al., 2023) as well as learning from weak supervision (Chen et al., 2024) can often exploit the same FPRAS techniques, or efficiently encode their core inference subproblems as #NFA or near-#NFA instances.

Such convergence suggests that optimized, practical FPRAS for #NFA will play an increasingly central role across subfields involving enumeration, sampling, and approximate counting of automata-accepted languages in both theory and applications.

References

1. Approximating Queries on Probabilistic Graphs (2023)
2. A General Framework for Learning from Weak Supervision (2024)