Chomsky Normal Form
- Chomsky Normal Form is a canonical form for context-free grammars that restricts each production rule to either a single terminal or a pair of nonterminals, which gives every parse tree a uniform binary shape.
- The conversion process eliminates epsilon-productions, unit rules, and useless and inaccessible symbols, and then enforces terminal isolation and binarization to achieve CNF.
- The formalization in Coq provides machine-checked proofs of language equivalence and complexity bounds, bolstering its significance in formal language theory and compiler design.
Chomsky Normal Form (CNF) is a canonical form for context-free grammars (CFGs), where productions are constrained to a particular structure that facilitates algorithms in parsing and the analysis of formal languages. A CFG is in CNF if each of its productions is either of the form $A \to a$ (where $A$ is a nonterminal and $a$ is a terminal), $A \to BC$ (where $A$, $B$, $C$ are nonterminals), or, if the language includes the empty string, a single rule $S \to \varepsilon$ with the start symbol $S$ not occurring on any right-hand side. The existence and formal properties of CNF are foundational in computer language processing and formal language theory, underpinned by rigorous mechanical formalization and proof (Ramos et al., 2015).
1. Theorem Statement: Existence of Chomsky Normal Form
The existence of CNF for arbitrary CFGs is formalized as follows:
Mathematical Statement:
$$\forall G \;\; \exists G' : \; L(G') = L(G) \,\wedge\, G' \text{ is in CNF}$$
This asserts that for every CFG $G$, there exists an equivalent CFG $G'$ in CNF generating the same language.
Coq Statement:
Theorem g_cnf_final:
  ∀ g: cfg non_terminal terminal,
  (produces_empty g ∨ ¬ produces_empty g) ∧
  (produces_non_empty g ∨ ¬ produces_non_empty g) →
  ∃ g': cfg non_terminal' terminal,
  g_equiv g' g ∧
  (is_cnf g' ∨ is_cnf_with_empty_rule g').
2. Structural Characterization of CNF Grammars
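A grammar $G = (V, \Sigma, P, S)$ is in CNF if and only if every production rule in $P$ satisfies exactly one of:
- $A \to a$, for $A \in V$, $a \in \Sigma$,
- $A \to BC$, for $A, B, C \in V$,
- Optionally, $S \to \varepsilon$, if $\varepsilon \in L(G)$ and $S$ is absent from any right side.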
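The three cases above translate directly into a membership check. The following is a minimal sketch, not part of the Coq development: the representation of rules as `(lhs, rhs)` tuples and the function name `is_cnf` are assumptions chosen for illustration.

```python
from typing import Iterable, Set, Tuple

# Assumed representation: a rule is (left-hand side, right-hand side tuple).
Rule = Tuple[str, Tuple[str, ...]]


def is_cnf(rules: Iterable[Rule], nonterminals: Set[str],
           terminals: Set[str], start: str) -> bool:
    """Return True if every rule is A -> a, A -> B C, or the optional S -> epsilon."""
    rules = list(rules)
    # The empty rule is only allowed when the start symbol never occurs on a right side.
    start_on_rhs = any(start in rhs for _, rhs in rules)
    for lhs, rhs in rules:
        if lhs not in nonterminals:
            return False
        if len(rhs) == 1 and rhs[0] in terminals:
            continue  # A -> a
        if len(rhs) == 2 and all(s in nonterminals for s in rhs):
            continue  # A -> B C
        if len(rhs) == 0 and lhs == start and not start_on_rhs:
            continue  # S -> epsilon
        return False
    return True
```

For instance, with nonterminals `{"S", "A", "B"}` and terminals `{"a", "b"}`, the rule set `S -> A B, A -> a, B -> b` passes the check, while `S -> A b` fails because a terminal appears inside a right-hand side of length two.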
In the Coq formalization, new nonterminals in the CNF construction are encoded as:
Inductive non_terminal' (NT T:Type): Type :=
| Lift_r: list (NT+T) → non_terminal'.

Notation sf' := list (non_terminal' + terminal).
3. Transformation Algorithm: Conversion to CNF
Transforming a general CFG into an equivalent CNF grammar is accomplished in three main phases:
3.1. Preprocessing and Simplification
- Elimination of $\varepsilon$-productions, retaining $S \to \varepsilon$ only if $\varepsilon \in L(G)$
- Elimination of unit rules ($A \to B$)
- Removal of useless symbols (nonterminals not deriving terminal strings)
- Removal of inaccessible symbols (never derivable from $S$)
These passes yield a simplified grammar as per:
Theorem g_simpl:
  ∀ g: cfg NT T, non_empty g →
  ∃ g': cfg NT' T,
  g_equiv g' g ∧
  has_no_inaccessible_symbols g' ∧
  has_no_useless_symbols g' ∧
  (generates_empty g → has_one_empty_rule g') ∧
  (¬ generates_empty g → has_no_empty_rules g') ∧
  has_no_unit_rules g' ∧
  start_symbol_not_in_rhs g'.
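Since the Coq development specifies these passes as predicates rather than executable functions, it may help to see two of them as ordinary code. The sketch below covers only epsilon-elimination and unit-rule elimination, using the textbook constructions; the names `nullable_symbols`, `eliminate_epsilon`, and `eliminate_units` and the tuple representation of rules are assumptions for illustration. Removal of useless and inaccessible symbols is omitted.

```python
from itertools import product
from typing import Set, Tuple

# Assumed representation: a rule is (left-hand side, right-hand side tuple).
Rule = Tuple[str, Tuple[str, ...]]


def nullable_symbols(rules: Set[Rule]) -> Set[str]:
    """Nonterminals that derive the empty string (least fixed point)."""
    nullable: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    return nullable


def eliminate_epsilon(rules: Set[Rule], start: str) -> Set[Rule]:
    """Drop empty rules, keeping start -> epsilon only if start is nullable."""
    nullable = nullable_symbols(rules)
    result: Set[Rule] = set()
    for lhs, rhs in rules:
        # Each nullable symbol may be kept or dropped; other symbols are always kept.
        options = [((s,), ()) if s in nullable else ((s,),) for s in rhs]
        for choice in product(*options):
            new_rhs = tuple(s for part in choice for s in part)
            if new_rhs:
                result.add((lhs, new_rhs))
    if start in nullable:
        result.add((start, ()))
    return result


def eliminate_units(rules: Set[Rule], nonterminals: Set[str]) -> Set[Rule]:
    """Replace unit rules A -> B by copies of the non-unit rules reachable from A."""
    def is_unit(rhs: Tuple[str, ...]) -> bool:
        return len(rhs) == 1 and rhs[0] in nonterminals

    # unit_reach[a] holds every b such that a derives b using unit rules only.
    unit_reach = {a: {a} for a in nonterminals}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if is_unit(rhs):
                for a in nonterminals:
                    if lhs in unit_reach[a] and rhs[0] not in unit_reach[a]:
                        unit_reach[a].add(rhs[0])
                        changed = True
    return {(a, rhs) for a in nonterminals for lhs, rhs in rules
            if lhs in unit_reach[a] and not is_unit(rhs)}
```

Running `eliminate_epsilon` and then `eliminate_units` on `S -> A B, A -> a | epsilon, B -> b | A` leaves `S -> A B | a | b | epsilon`, `A -> a`, and `B -> b | a`, which generate the same five strings as the original grammar. Note that this sketch does not remove the start symbol from right-hand sides, which the theorem above additionally guarantees.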
3.2. Introduction of Fresh Start Symbol
A fresh start symbol is introduced for technical uniformity:
Definition start_symbol (g_cnf g) := Lift_r [ inl (start_symbol g) ].
3.3. Enforcement of CNF Form
(a) Terminal Isolation:
If a rule contains terminals within longer right-hand sides, e.g., $A \to X_1 X_2 \cdots X_k$, for $k \ge 2$, with some $X_i = a \in \Sigma$, then associate each such $a$ with a new nonterminal $N_a$, add the rule $N_a \to a$, and substitute $N_a$ for $a$ accordingly.
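Terminal isolation can be sketched in a few lines. This is an illustrative sketch only: the function name `isolate_terminals` and the naming scheme `<a>` for the fresh nonterminal attached to terminal `a` are assumptions, and the scheme presumes no existing symbol already has that name.

```python
from typing import Set, Tuple

# Assumed representation: a rule is (left-hand side, right-hand side tuple).
Rule = Tuple[str, Tuple[str, ...]]


def isolate_terminals(rules: Set[Rule], terminals: Set[str]) -> Set[Rule]:
    """Replace terminals inside right-hand sides of length >= 2 by proxy nonterminals."""
    result: Set[Rule] = set()
    for lhs, rhs in rules:
        if len(rhs) < 2:
            result.add((lhs, rhs))  # A -> a and empty rules are left untouched
            continue
        new_rhs = []
        for sym in rhs:
            if sym in terminals:
                proxy = f"<{sym}>"  # hypothetical fresh name for this terminal
                result.add((proxy, (sym,)))
                new_rhs.append(proxy)
            else:
                new_rhs.append(sym)
        result.add((lhs, tuple(new_rhs)))
    return result
```

Applied to `S -> a B b` and `B -> b`, the sketch produces `S -> <a> B <b>`, `<a> -> a`, `<b> -> b`, and leaves `B -> b` unchanged. Because results are collected in a set, a terminal that occurs in several rules still gets a single proxy rule.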
(b) Binarization:
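For any rule $A \to B_1 B_2 \cdots B_k$, $k \ge 3$, use a laddered approach to recursively reduce the arity, with fresh nonterminals $C_1, \ldots, C_{k-2}$:
$$A \to B_1 C_1,\quad C_1 \to B_2 C_2,\quad \ldots,\quad C_{k-2} \to B_{k-1} B_k$$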
Relevant constructors in the Coq definition include:
| Lift_cnf_t (* terminal isolation *)
| Lift_cnf_1 (* direct terminal rule *)
| Lift_cnf_2 (* start of binary chain *)
| Lift_cnf_3 (* ladder step in binary chain *)

Inductive g_cnf'_rules ... :=
| Lift_cnf'_all: (* all rules of g_cnf_rules *)
| Lift_cnf'_new: g_cnf'_rules g (start_symbol (g_cnf g)) [].
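The laddered binarization of step (b) can be sketched as follows, assuming terminals have already been isolated. The function name `binarize` is an assumption for illustration. Each fresh nonterminal is named after the tail of symbols it stands for, a choice loosely inspired by the `Lift_r` encoding, in which new nonterminals wrap lists of original symbols; it is not a transcription of the Coq rules.

```python
from typing import Set, Tuple

# Assumed representation: a rule is (left-hand side, right-hand side tuple).
Rule = Tuple[str, Tuple[str, ...]]


def binarize(rules: Set[Rule]) -> Set[Rule]:
    """Split every right-hand side longer than two symbols into a chain of binary rules."""
    result: Set[Rule] = set()
    for lhs, rhs in rules:
        current, rest = lhs, rhs
        while len(rest) > 2:
            # Hypothetical fresh nonterminal standing for the remaining tail.
            tail = "<" + " ".join(rest[1:]) + ">"
            result.add((current, (rest[0], tail)))
            current, rest = tail, rest[1:]
        result.add((current, rest))  # final rule has at most two symbols
    return result
```

For `S -> A B C D` this yields `S -> A <B C D>`, `<B C D> -> B <C D>`, and `<C D> -> C D`: two new nonterminals and three rules for a right-hand side of length four. Naming fresh symbols by their tail also lets rules with identical tails share the same ladder steps.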
4. Language Equivalence: Key Lemmas and Proof Strategy
The essential invariant is bidirectional simulation between derivations in $G$ and in $G'$:
- Soundness: For each rule $A \to \alpha$ in $G$, either $A \to \alpha$ is included in $G'$ or it is realized through a sequence of rules in $G'$, corresponding to the constructively decomposed rules via induction on rule shape.
- Completeness: All sentential forms in $G'$ can be mapped back to $G$ by an “extraction” function, removing the Lift_r wrappers. Formally, for sentential forms $\alpha'$ of $G'$, the extraction property holds by induction. Together with soundness, this guarantees that $L(G') = L(G)$.
The equivalence is machine-checked in Coq, combining the inductive definitions and extraction functions as proof artifacts (Ramos et al., 2015).
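To make the idea of an extraction function concrete, the sketch below models a lifted nonterminal as a wrapper around a list of original symbols, following the shape of the `Lift_r: list (NT+T) → non_terminal'` constructor. The class `Lift` and the function `extract` are hypothetical stand-ins; the actual extraction function in the Coq development may be defined differently.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Lift:
    """Hypothetical stand-in for Lift_r: a new nonterminal wrapping original symbols."""
    body: Tuple[str, ...]


# A symbol of the new grammar is either a lifted nonterminal or a terminal.
Symbol = Union[Lift, str]


def extract(form: Tuple[Symbol, ...]) -> Tuple[str, ...]:
    """Map a sentential form of the new grammar back to one of the original grammar."""
    out = []
    for sym in form:
        if isinstance(sym, Lift):
            out.extend(sym.body)  # unwrap: the wrapper stands for its contents
        else:
            out.append(sym)       # terminals are unchanged
    return tuple(out)
```

Under this model, the form `Lift(("A",)) b Lift(("C", "D"))` extracts to `A b C D`, and a form consisting only of terminals extracts to itself, which is why the two grammars agree on the sentences they generate.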
5. Constraints, Correctness, and Complexity
All transformation steps (simplification, CNF conversion) are specified logically, with rules encoded as inductive predicates in Coq's Prop. No executable algorithm is extracted; deriving one would require changing from logical predicates to data representations and reproving the associated lemmas.
Complexity is controlled: the combinatorial blow-up remains linear relative to the total length of all right sides in the grammar. Each terminal in a long rule yields a new nonterminal and rule; each right-hand side of length $k$ in nonterminals results in $k-2$ new nonterminals and $k-1$ production rules. Consequently, the size of the resulting grammar is polynomial in the original grammar’s description length.
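As a worked instance of these counts, consider a single rule with right-hand side of length $k = 4$ containing two terminals. The names $N_a$, $N_c$ (fresh nonterminals for the terminals) and $C_1$, $C_2$ (fresh nonterminals for the ladder) are chosen here for illustration.

```latex
\begin{align*}
\text{original rule:}\quad & A \to a\,B\,c\,D \qquad (k = 4)\\
\text{terminal isolation:}\quad & N_a \to a,\quad N_c \to c,\quad A \to N_a\,B\,N_c\,D\\
\text{binarization:}\quad & A \to N_a\,C_1,\quad C_1 \to B\,C_2,\quad C_2 \to N_c\,D\\
\text{totals:}\quad & 2 + (k-2) = 4 \text{ new nonterminals},\quad 2 + (k-1) = 5 \text{ rules}
\end{align*}
```

One rule of length four thus becomes five rules, and the growth per rule is proportional to the length of its right-hand side.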
The formalization (encompassing closures, simplification, CNF, and a generic CFG library) is fully machine-checked in approximately 18,000 lines of Coq. No further asymptotic efficiency claims are advanced beyond the “polynomial in the grammar description” bound (Ramos et al., 2015).
6. Context and Significance in Formal Language Theory
The rigorous formalization and machine-checked proof of the CNF theorem, as accomplished by Ramos et al. in Coq, strengthens theoretical foundations and offers a verified blueprint for transformations ubiquitous in parsing, automata theory, and compiler implementation. The work demonstrates that every CFG satisfying the hypotheses of g_cnf_final admits an equivalent CNF grammar, ensuring standard algorithms and theoretical results can operate on the normal form with full confidence in correctness and formal properties (Ramos et al., 2015). A plausible implication is the encouragement of further formalization of conversion algorithms and closure properties in proof assistants, expanding the guarantees available in foundational computational theory toolchains.