
Probabilistic Deterministic Finite Automata

Updated 16 January 2026
  • A PDFA is a formal automaton that transitions deterministically between states while assigning state-dependent probabilities to output symbols.
  • It supports efficient learning through methods like WL*, QUNT, and pL#, providing interpretable, minimal models with guaranteed sample complexity.
  • PDFAs are pivotal in explainable AI, neural model distillation, and robotics planning due to their clear algebraic properties and controlled prediction benchmarks.

A probabilistic deterministic finite automaton (PDFA) is a formal model for stochastic, sequence-generating, discrete event systems that compute conditional probabilities over finite strings. In a PDFA, the next state is determined deterministically by the current state and input symbol, while the emitted symbol is chosen according to a state-dependent probability distribution. PDFAs are strictly more expressive than deterministic finite automata (DFAs), providing real-valued weights for every string, yet are strictly less general than hidden Markov models (HMMs), since their transitions are deterministic functions of state and symbol. PDFAs have become central in benchmarking predictive models, distilling interpretable surrogates for neural sequence models, and inferring structure from demonstration data. Their well-understood algebraic properties, minimization procedures, and sample complexity guarantees lend themselves to rigorous learning theory, efficient extraction algorithms, and broad applicability in modeling, planning, and explainability.

1. Formal Definitions and Algebraic Properties

Let $\Sigma$ be a finite input alphabet, and let $\Sigma^*$ denote the set of finite words over $\Sigma$. A PDFA is specified as a tuple $A = (Q, \Sigma, \delta, w)$, where:

  • $Q$ is a finite set of states, often including a designated initial state $q_0 \in Q$.
  • $\delta: Q \times \Sigma \rightarrow Q$ is a total, deterministic transition function.
  • $w: Q \times \Sigma_\$ \rightarrow [0,1]$ is a stochastic output mapping assigning probabilities to each possible next symbol, with $\Sigma_\$ = \Sigma \cup \{\$\}$, where $\$$ is a special end-of-sequence symbol. For all $q \in Q$, normalization requires $\sum_{\sigma \in \Sigma_\$} w(q, \sigma) = 1$.

For a word $w = w_1 \dots w_n \in \Sigma^*$, the probability that $A$ generates $w$ and then halts is

$$P_A(w) = \prod_{i=1}^{n} w(q_{i-1}, w_i) \cdot w(q_n, \$)$$

where $q_0$ is the initial state and $q_i = \delta(q_{i-1}, w_i)$. The conditional probability of symbol $a$ after prefix $p$ is $w(q, a)$, with $q = \delta(q_0, p)$ (Weiss et al., 2019, Baumgartner et al., 2024, Dhayalkar, 12 Sep 2025, Mironov, 2015).

Alternative formalisms specify a stopping probability $\pi(q)$ for each state or a transition probability $\delta_P(q, \sigma, q')$, maintaining the normalization constraint $\sum_{a \in \Sigma} \pi(q, a) + \pi(q) = 1$ (Baumgartner et al., 2024, Baert et al., 2024).

PDFAs strictly generalize DFAs, assigning real-valued probability weights to every string, and are strictly less general than HMMs, because transitions are deterministic functions of the state and emitted symbol, not arbitrarily stochastic (Mironov, 2015).

2. Minimality, Congruence, and Structural Characterization

The canonical minimal PDFA is constructed via the concept of residual distributions, with each state representing a unique conditional probability vector over all continuations. For $f: \Sigma^* \to [0,1]$ realized by a PDFA, the minimal automaton is built on the set of residuals $\{f_u \mid u \in \Sigma^*,\ f(u) > 0\}$, where $f_u(v) = f(uv)/f(u)$. Minimization proceeds in two steps:

  1. Extraction of reachable states by breadth-first search over $\delta$.
  2. Removal of "convex-combination" states: states whose basis vector in the cutset matrix can be expressed as a convex combination of others, solved via linear programming (Mironov, 2015).

The uniqueness of the minimal PDFA is guaranteed by the residual automaton construction: any two equivalent prefix distributions are merged (Mironov, 2015). Closure properties apply: the class of cut languages of PDFAs is closed under union, concatenation, and Kleene plus, with explicit constructions provided (Mironov, 2015).

The regularity of a probabilistic language is characterized by the finiteness of a congruence relation generalizing the Myhill-Nerode theorem. The induced congruence $\equiv_\epsilon$ is defined as

$$u \equiv_\epsilon v \iff \forall w \in \Sigma^*,\ \alpha(u\,w) =_\epsilon \alpha(v\,w)$$

for an equivalence $=_\epsilon$ on probability distributions. This congruence is right-compatible and yields quotient automata and minimal PDFA recognition whenever its index is finite (Carrasco et al., 2024, Mayr et al., 2022).

3. Learning Algorithms: Query-Based, Spectral, and Tree Methods

Classical active learning for PDFAs adapts Angluin's L* algorithm to the probabilistic setting. The learning workflow involves querying conditional next-symbol distributions and organizing the responses in observation tables or trees, subject to a local tolerance $\delta > 0$ in the $\infty$-norm. Key variants are:

  • WL*: A $\delta$-tolerant L* adaptation clustering prefixes whose observation-table rows are $\delta$-equal, maintaining closedness and consistency. Hypotheses are constructed from prefix-equivalence classes with deterministic transitions, using nearest-row heuristics for undefined transitions (Weiss et al., 2019).
  • QUNT: Employs a tree-based classification structure distinguishing state classes by quantized next-symbol distributions and distinguishing suffixes, yielding significant efficiency advantages over observation-table methods, especially for large PDFAs and alphabets (Mayr et al., 2022).
  • pL# (string-probability distillation): Rather than using next-symbol probability queries, pL# infers conditional distributions via full-string probability queries, computing $P(a \mid w) = P(wa)/P(w)$, and maintains an observation tree with merging based on probability error thresholds (Baumgartner et al., 2024).
  • Congruence-based (PL*): Active learning proceeds by aggregating classes whose conditional distributions are congruent under $\equiv_\epsilon$, with termination and minimality ensured whenever the equivalence forms a true congruence (Carrasco et al., 2024).

All these algorithms utilize membership queries (next-symbol or full-string probability), equivalence queries (global behavior agreement), and counterexample-driven expansions. Their sample and time complexity is polynomial in the number of states, the alphabet size, and $1/\delta$ (or the quantization granularity), and logarithmic in the error probability (Weiss et al., 2019, Mayr et al., 2022, Carrasco et al., 2024).

4. Empirical Evaluation and Benchmarking

PDFAs serve as gold-standard benchmarks for sequence prediction, especially for evaluating neural predictors. Their enumerability and structured randomness permit controlled generation of test distributions whose true rate-accuracy curves and optimal predictors are computable. Empirical results include:

  • WL* achieves zero word error rate and NDCG = 1 on small PDFAs, outperforming spectral WFA and $n$-gram models, and often matches or exceeds them on larger, more entropic tasks (Weiss et al., 2019).
  • QUNT scales linearly or near-linearly with the number of states and alphabet size, far outpacing observation-table learners in efficiency, especially for $n > 1000$ states (Mayr et al., 2022).
  • pL# distills compact, interpretable PDFAs from neural LMs (LSTMs, Transformers), achieving mean squared error of $10^{-6}$ to $10^{-10}$ with far fewer states than non-minimal competitors (Baumgartner et al., 2024).
  • In head-to-head RNN benchmarks, LSTMs, reservoir computers (RCs), and generalized linear models (GLMs) fall short of the PDFA's optimal predictive accuracy by as much as 50% after training, and typically by 5%, even for simple processes. Classical causal-state inference trivially recovers the optimal Bayes predictor with orders of magnitude less data (Marzen et al., 2019).
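Because the target distribution is fully computable, benchmark scores like those above can be checked exactly against the automaton. A minimal sketch of the generation semantics from Section 1, including the pL#-style ratio query, on a hypothetical two-state PDFA over $\Sigma = \{a, b\}$ (the automaton and all probabilities here are illustrative, not taken from the cited papers):

```python
END = "$"  # special end-of-sequence symbol

# Hypothetical two-state PDFA over Sigma = {a, b}.
# delta: (state, symbol) -> next state (total and deterministic).
delta = {
    ("q0", "a"): "q0", ("q0", "b"): "q1",
    ("q1", "a"): "q0", ("q1", "b"): "q1",
}
# w: state -> distribution over Sigma plus the end symbol; each row sums to 1.
w = {
    "q0": {"a": 0.5, "b": 0.3, END: 0.2},
    "q1": {"a": 0.1, "b": 0.6, END: 0.3},
}

def prefix_probability(word, q0="q0"):
    """Probability that a generated sequence starts with `word` (no halting factor)."""
    q, p = q0, 1.0
    for sym in word:
        p *= w[q][sym]
        q = delta[(q, sym)]
    return p

def string_probability(word, q0="q0"):
    """P_A(word): emit every symbol of `word`, then halt with w(q_n, $)."""
    q = q0
    for sym in word:
        q = delta[(q, sym)]
    return prefix_probability(word, q0) * w[q][END]

# pL#-style ratio query over prefix probabilities: P(b | "a") = P("ab") / P("a"),
# which recovers the state-local emission probability w(q0, b).
ratio = prefix_probability("ab") / prefix_probability("a")
print(string_probability("ab"))  # = 0.5 * 0.3 * 0.3
print(ratio)                     # = w(q0, b) = 0.3
```

Determinism is what keeps this cheap: each prefix pins down exactly one state, so both queries run in time linear in the word length.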

PDFAs also underlie state-of-the-art methods for learning task specifications from demonstration, inferring interpretable models encoding sub-goals and temporal dependencies, essential for robot planning and adaptation (Baert et al., 2024).

5. Spectral, Neural, and Algebraic Simulation Theories

PDFAs are amenable to both spectral learning approaches and symbolic simulation in neural architectures.

  • Spectral extraction methods build Hankel matrices of observed string probabilities and factorize them to obtain weighted automata. The resulting automata may be nondeterministic and less interpretable than minimal PDFA models (Weiss et al., 2019).
  • Symbolic feedforward networks can exactly simulate PDFA behavior: each state distribution is a vector, each transition function is a row-stochastic matrix, and processing a string amounts to unrolled matrix-vector products followed by a read-out layer (Dhayalkar, 12 Sep 2025). There is a formal equivalence: every PDFA corresponds to such a network and vice versa. These networks are learnable via standard gradient descent by minimizing squared or cross-entropy loss over observed string probabilities, with convergence to exact behavior under ideal conditions (Dhayalkar, 12 Sep 2025).
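The matrix-vector view can be made concrete in a few lines. The sketch below folds each symbol's emission probability into its transition matrix (a simplification for illustration, not the exact architecture of Dhayalkar, 12 Sep 2025), reusing a hypothetical two-state automaton:

```python
import numpy as np

# T[sym][i, j] = w(q_i, sym) if delta(q_i, sym) = q_j, else 0.
# Emissions are folded into the transition matrices, so processing a string is
# an unrolled chain of vector-matrix products plus a halting read-out.
T = {
    "a": np.array([[0.5, 0.0],    # q0 -a-> q0 (prob 0.5)
                   [0.1, 0.0]]),  # q1 -a-> q0 (prob 0.1)
    "b": np.array([[0.0, 0.3],    # q0 -b-> q1 (prob 0.3)
                   [0.0, 0.6]]),  # q1 -b-> q1 (prob 0.6)
}
halt = np.array([0.2, 0.3])       # w(q, $): halting probability per state
v0 = np.array([1.0, 0.0])         # one-hot initial state vector

def string_probability(word):
    """P_A(word) as a chain of matrix-vector products and a read-out layer."""
    v = v0
    for sym in word:
        v = v @ T[sym]
    return float(v @ halt)

# Per-state normalization: emissions over all symbols plus halting sum to 1.
row_sums = sum(T.values()).sum(axis=1) + halt
print(string_probability("ab"))   # = 0.5 * 0.3 * 0.3
print(np.allclose(row_sums, 1.0)) # True
```

Because the underlying automaton is deterministic, the state vector stays a scaled one-hot throughout; relaxing the matrix entries to arbitrary stochastic values is what gradient-based training operates on.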

6. Applications, Limitations, and Future Directions

PDFAs are central to explainable machine learning (as surrogate models for neural LMs), reverse-engineering black-box sequence generators, task specification from demonstration (robotics), and optimal prediction benchmarking. Their advantages include interpretability, compactness, determinism, and polynomial-time minimization. Their limitations include worst-case sample complexity for targets with many states or low-probability events, sensitivity to hybrid or noisy distributions, and suboptimality when the equivalence relation is non-transitive (mere tolerances fail to guarantee minimality) (Weiss et al., 2019, Carrasco et al., 2024).
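The "convex-combination state" test behind polynomial-time minimization (Section 2) reduces to a small feasibility linear program. A sketch, assuming the candidate state's row and the remaining basis rows of the cutset matrix are given as vectors (the data below is illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def is_convex_combination(r, R):
    """Feasibility LP: does some lam >= 0 with sum(lam) = 1 satisfy lam @ R = r?"""
    k = R.shape[0]
    # Equality constraints: R.T @ lam = r (componentwise) and sum(lam) = 1.
    A_eq = np.vstack([R.T, np.ones((1, k))])
    b_eq = np.concatenate([r, [1.0]])
    # Zero objective: we only care whether the constraint set is non-empty.
    res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * k)
    return res.success

basis = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
print(is_convex_combination(np.array([0.5, 0.5]), basis))   # True: midpoint of the rows
print(is_convex_combination(np.array([1.5, -0.5]), basis))  # False: needs a negative weight
```

A state whose row passes this test carries no distributional information beyond the other states, so it can be removed during minimization.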

Potential future extensions include adversarial counterexample generation, generalization to partially observable or weighted nondeterministic automata, sharpening sample-complexity bounds under realistic noise, and integrating spectral, tree-based, and congruence-based learning strategies (Baumgartner et al., 2024).

References and Key Papers

  • Learning Deterministic Weighted Automata with Queries... (Weiss et al., 2019): WL*, δ-consistency, empirical benchmarks, minimization.
  • Towards Efficient Active Learning of PDFA (Mayr et al., 2022): tree-based QUNT, quantization, query efficiency.
  • PDFA Distillation via String Probability Queries (Baumgartner et al., 2024): pL# algorithm, full-string queries, distillation.
  • Congruence-based Learning of Probabilistic... (Carrasco et al., 2024): Myhill-Nerode congruence for PDFA, canonical minimality.
  • Probabilistic Deterministic Finite Automata and Recurrent... (Marzen et al., 2019): benchmark methodology, predictive gaps, causal states.
  • A theory of probabilistic automata, part 1 (Mironov, 2015): formal theory, minimization algorithms, closure.
  • Symbolic Feedforward Networks for Probabilistic Finite Automata (Dhayalkar, 12 Sep 2025): neural network simulation, learnability, equivalence.
  • Learning Task Specifications from Demonstrations... (Baert et al., 2024): sub-goal inference, planning, demonstration modeling.

These works collectively establish PDFAs as a foundational structure for finite-state stochastic modeling, rigorous learning theory, and systematic empirical evaluation.
