A PDFA is a formal automaton that transitions deterministically between states while assigning state-dependent probabilities to output symbols.
It supports efficient learning through methods such as WL*, QUNT, and pL#, which yield interpretable, minimal models with sample-complexity guarantees.
PDFAs are pivotal in explainable AI, neural model distillation, and robotics planning due to their clear algebraic properties and controlled prediction benchmarks.
A probabilistic deterministic finite automaton (PDFA) is a formal model for stochastic, sequence-generating, discrete-event systems that compute conditional probabilities over finite strings. In a PDFA, the next state is determined deterministically by the current state and input symbol, while the emitted symbol is chosen according to a state-dependent probability distribution. PDFAs are strictly more expressive than deterministic finite automata (DFAs), assigning real-valued weights to every string, yet are strictly less general than hidden Markov models (HMMs), since their transitions are deterministic functions of state and symbol. PDFAs have become central in benchmarking predictive models, distilling interpretable surrogates for neural sequence models, and inferring structure from demonstration data. Their well-understood algebraic properties, minimization procedures, and sample-complexity guarantees lend themselves to rigorous learning theory, efficient extraction algorithms, and broad applicability in modeling, planning, and explainability.
1. Formal Definitions and Algebraic Properties
Let Σ be a finite input alphabet, and let Σ∗ denote the set of finite words over Σ. A PDFA is specified as a tuple A=(Q,Σ,δ,w), where:
Q is a finite set of states, with a distinguished initial state q0 ∈ Q.
δ:Q×Σ→Q is a total, deterministic transition function.
w : Q × Σ_$ → [0,1] is a stochastic output mapping assigning probabilities to each possible next symbol, where Σ_$ = Σ ∪ {$} and $ is a special end-of-sequence symbol. For all q ∈ Q, normalization requires ∑_{σ∈Σ_$} w(q, σ) = 1.

For a word w = w₁⋯wₙ ∈ Σ∗, the probability that A generates w and then halts is

P_A(w) = ∏_{i=1}^{n} w(q_{i−1}, w_i) · w(q_n, $)

where q₀ is the initial state and q_i = δ(q_{i−1}, w_i). The conditional probability of symbol a after prefix p is w(q, a) with q = δ(q₀, p) (Weiss et al., 2019; Baumgartner et al., 2024; Dhayalkar, 12 Sep 2025; Mironov, 2015).

Alternative formalisms specify a stopping probability π(q) for each state or a transition probability δ_P(q, σ, q′), maintaining the normalization constraint ∑_{a∈Σ} π(q, a) + π(q) = 1 (Baumgartner et al., 2024; Baert et al., 2024).

PDFAs strictly generalize DFAs, assigning real-valued probability weights to every string, and are strictly less general than HMMs, because their transitions are deterministic functions of the current state and emitted symbol rather than arbitrarily stochastic (Mironov, 2015).

2. Minimality, Congruence, and Structural Characterization

The canonical minimal PDFA is constructed via residual distributions, with each state representing a unique conditional probability vector over all continuations. For f : Σ∗ → [0,1] realized by a PDFA, the minimal automaton is built on the set {f_u(v) = f(uv)/f(u) | u, v ∈ Σ∗, f(u) > 0}. Minimization proceeds in two steps:

1. Extraction of reachable states by breadth-first search over δ.
2. Removal of "convex-combination" states: states whose basis vector in the cutset matrix can be expressed as a convex combination of others, solved via linear programming (Mironov, 2015).
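The generation probability defined above can be made concrete with a short Python sketch (a minimal illustration; the two-state automaton below is hypothetical, not drawn from the cited papers):

```python
# Minimal PDFA sketch: states are hashable labels, "$" is end-of-sequence.
# delta: (state, symbol) -> state; w: state -> {symbol_or_$: probability}.

def string_probability(q0, delta, w, word):
    """P_A(word) = prod_i w(q_{i-1}, word_i) * w(q_n, '$')."""
    q, p = q0, 1.0
    for sym in word:
        p *= w[q].get(sym, 0.0)        # emission probability of sym in state q
        if p == 0.0:
            return 0.0                 # word cannot be generated
        q = delta[(q, sym)]            # deterministic transition
    return p * w[q].get("$", 0.0)      # halt with end-of-sequence probability

# Hypothetical two-state PDFA over {a, b}: state 0 favors 'a', state 1 favors 'b'.
delta = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}
w = {0: {"a": 0.6, "b": 0.2, "$": 0.2},
     1: {"a": 0.1, "b": 0.7, "$": 0.2}}

p = string_probability(0, delta, w, "ab")
# P("ab") = w(0,'a') * w(0,'b') * w(1,'$') = 0.6 * 0.2 * 0.2 = 0.024
```

Note that each state's distribution over Σ_$ sums to 1, so the model defines a proper probability distribution over terminated strings.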
The uniqueness of the minimal PDFA is guaranteed by the residual automaton construction: any two equivalent prefix distributions are merged (Mironov, 2015). Closure properties also hold: the class of cut languages of PDFAs is closed under union, concatenation, and Kleene plus, with explicit constructions provided (Mironov, 2015).
The regularity of a probabilistic language is characterized by the finiteness of a congruence relation generalizing the Myhill-Nerode theorem. The induced congruence ≡_ε is defined as

u ≡_ε v ⟺ ∀ w ∈ Σ∗, α(uw) =_ε α(vw)

for an equivalence =_ε on probability distributions. This congruence is right-compatible and yields quotient automata and minimal PDFA recognition whenever its index is finite (Carrasco et al., 2024; Mayr et al., 2022).
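A brute-force, bounded-depth approximation of this congruence test can be sketched in Python (illustrative only: a practical decision procedure exploits right-compatibility rather than enumerating suffixes, and the two-state automaton here is hypothetical):

```python
from itertools import product

# Hypothetical PDFA over {a, b}: delta is deterministic, w gives the
# next-symbol distributions (with "$" the end-of-sequence symbol).
SIGMA = ["a", "b"]
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}
w = {0: {"a": 0.5, "b": 0.3, "$": 0.2},
     1: {"a": 0.6, "b": 0.2, "$": 0.2}}

def state_after(q, word):
    """Follow the deterministic transitions for each symbol of word."""
    for sym in word:
        q = delta[(q, sym)]
    return q

def congruent(u, v, eps=1e-9, max_len=4):
    """Approximate u ≡_ε v by comparing next-symbol distributions
    (sup-norm, tolerance eps) after every continuation of length <= max_len."""
    for n in range(max_len + 1):
        for suffix in product(SIGMA, repeat=n):
            qu = state_after(0, u + "".join(suffix))
            qv = state_after(0, v + "".join(suffix))
            if any(abs(w[qu][s] - w[qv][s]) > eps for s in SIGMA + ["$"]):
                return False
    return True
```

Here the prefixes "aa" and "ba" reach the same state and are congruent, while "a" and "b" reach states with different residual distributions and are not.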
3. Learning Algorithms: Query-Based, Spectral, and Tree Methods
Classical active learning for PDFAs adapts Angluin's L* algorithm to the probabilistic setting. The learning workflow involves querying conditional next-symbol distributions and organizing the responses in observation tables or trees, subject to a local tolerance δ > 0 in the ∞-norm. Key variants are:

WL*: a δ-tolerant L* adaptation clustering prefixes whose observation-table rows are δ-equal, maintaining closedness and consistency. Hypotheses are constructed from prefix-equivalence classes with deterministic transitions, using nearest-row heuristics for undefined transitions (Weiss et al., 2019).
QUNT: employs a tree-based classification structure distinguishing state classes by quantized next-symbol distributions and distinguishing suffixes, yielding significant efficiency advantages over observation-table methods, especially for large PDFAs and alphabets (Mayr et al., 2022).
pL# (string-probability distillation): rather than using next-symbol probability queries, pL# infers conditional distributions via full-string probability queries, computing P(a | w) = P(wa)/P(w), and maintains an observation tree with merging based on probability-error thresholds (Baumgartner et al., 2024).
Congruence-based (PL*): active learning proceeds by aggregating classes whose conditional distributions are congruent under ≡_ε, with termination and minimality ensured whenever the equivalence forms a true congruence (Carrasco et al., 2024).

All these algorithms use membership queries (next-symbol or full-string probability), equivalence queries (global behavior agreement), and counterexample-driven expansions. Their sample and time complexity is polynomial in the number of states, the alphabet size, and 1/δ (or the quantization granularity), and logarithmic in the error probability (Weiss et al., 2019; Mayr et al., 2022; Carrasco et al., 2024).

4. Empirical Evaluation and Benchmarking

PDFAs serve as gold-standard benchmarks for sequence prediction, especially for evaluating neural predictors. Their enumerability and structured randomness permit controlled generation of test distributions whose true rate-accuracy curves and optimal predictors are computable. Empirical results include:

WL* achieves zero word error rate and NDCG = 1 on small PDFAs, outperforming spectral WFA and n-gram models, and often matches or exceeds them on larger, more entropic tasks (Weiss et al., 2019).
QUNT scales linearly or near-linearly with the number of states and alphabet size, far outperforming observation-table learners in efficiency, especially for n > 1000 states (Mayr et al., 2022).
pL# distills compact, interpretable PDFAs from neural language models (LSTMs, Transformers), achieving mean squared error between 10⁻⁶ and 10⁻¹⁰ with far fewer states than non-minimal competitors (Baumgartner et al., 2024).
In head-to-head RNN benchmarks, LSTMs, reservoir computers, and GLMs fall short of PDFA-optimal predictive accuracy by as much as 50% after training, and typically miss by about 5%, even for simple processes. Classical causal-state inference recovers the optimal Bayes predictor with orders of magnitude less data (Marzen et al., 2019).
PDFAs also underlie state-of-the-art methods for learning task specifications from demonstration, inferring interpretable models encoding sub-goals and temporal dependencies, essential for robot planning and adaptation (Baert et al., 2024).
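The pL#-style reduction described in Section 3, recovering conditional next-symbol probabilities from full-string probability queries alone, can be sketched as follows (a hedged sketch: the oracle below stands in for a black-box sequence model, and its i.i.d. form is an assumption for illustration):

```python
# pL#-style query reduction: P(a | w) = P(wa) / P(w), where prefix_prob
# plays the role of a full-string probability oracle for a black-box model.

def next_symbol_distribution(prefix_prob, w, alphabet):
    """Estimate the conditional next-symbol distribution after prefix w
    using only full-string probability queries."""
    pw = prefix_prob(w)
    if pw == 0.0:
        raise ValueError("prefix has zero probability")
    return {a: prefix_prob(w + a) / pw for a in alphabet}

# Hypothetical oracle: an i.i.d. source emitting 'a' w.p. 0.7, 'b' w.p. 0.3.
def prefix_prob(s):
    p = 1.0
    for sym in s:
        p *= {"a": 0.7, "b": 0.3}[sym]
    return p

dist = next_symbol_distribution(prefix_prob, "ab", ["a", "b"])
# dist is {"a": 0.7, "b": 0.3} (up to rounding), independent of the prefix,
# since the source is i.i.d.
```

An actual pL# learner issues such ratio queries at the nodes of an observation tree and merges nodes whose distributions agree within an error threshold.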
5. Spectral, Neural, and Algebraic Simulation Theories
PDFAs are amenable to both spectral learning approaches and symbolic simulation in neural architectures.
Spectral extraction methods build Hankel matrices of observed string probabilities and factorize them to obtain weighted automata; the resulting models may be nondeterministic and less interpretable than minimal PDFAs (Weiss et al., 2019).
Symbolic feedforward networks can exactly simulate PDFA behavior: each state distribution is a vector, each transition function is a row-stochastic matrix, and processing a string amounts to unrolling matrix-vector products followed by a read-out layer (Dhayalkar, 12 Sep 2025). There is a formal equivalence: every PDFA corresponds to such a network and vice versa. These networks are learnable via standard gradient descent by minimizing squared or cross-entropy loss over observed string probabilities, with convergence to exact behavior under ideal conditions (Dhayalkar, 12 Sep 2025).
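This matrix-vector view can be sketched with NumPy (an illustrative encoding of a hypothetical two-state automaton; it mirrors the network construction only schematically, not the paper's exact architecture):

```python
import numpy as np

# Each symbol gets a 0/1 transition matrix (row-stochastic because the PDFA
# is deterministic: exactly one 1 per row). The state distribution is a
# one-hot vector; reading a string is a chain of vector-matrix products.
M = {
    "a": np.array([[1.0, 0.0], [1.0, 0.0]]),  # delta(q, a) = 0 for both states
    "b": np.array([[0.0, 1.0], [0.0, 1.0]]),  # delta(q, b) = 1 for both states
}
# Emission probabilities per state; columns ordered (a, b, $).
E = np.array([[0.6, 0.2, 0.2],
              [0.1, 0.7, 0.2]])
SYM = {"a": 0, "b": 1, "$": 2}

def simulate(word):
    """Return P_A(word) by unrolling matrix-vector products."""
    v = np.array([1.0, 0.0])                # one-hot initial state q0 = 0
    p = 1.0
    for sym in word:
        p *= float(v @ E[:, SYM[sym]])      # emission prob in current state
        v = v @ M[sym]                      # deterministic transition step
    return p * float(v @ E[:, SYM["$"]])    # read-out: halt probability
```

The read-out layer here is just the dot product with the end-of-sequence column; a trained network would learn M and E from observed string probabilities.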
6. Applications, Limitations, and Future Directions
PDFAs are central to explainable machine learning (as surrogate models for neural LMs), reverse-engineering black-box sequence generators, task specification from demonstration (robotics), and optimal prediction benchmarking. Their advantages include interpretability, compactness, determinism, and polynomial-time minimization. Their limitations manifest in worst-case sample complexity for high-state or low-probability targets, sensitivity to hybrid or noisy distributions, and suboptimality if equivalence relations are non-transitive (mere tolerances fail to guarantee minimality) (Weiss et al., 2019, Carrasco et al., 2024).
Potential future extensions include adversarial counterexample generation, generalization to partially observable or weighted nondeterministic automata, sharpening sample-complexity bounds under realistic noise, and integrating spectral, tree-based, and congruence-based learning strategies (Baumgartner et al., 2024).
References and Key Papers
| Paper Title | arXiv id | Highlighted Contribution |
| --- | --- | --- |
| Learning Deterministic Weighted Automata with Queries... | | |
These works collectively establish PDFAs as a foundational structure for finite-state stochastic modeling, rigorous learning theory, and systematic empirical evaluation.