Chomsky Normal Form

Updated 3 March 2026
  • Chomsky Normal Form is a canonical form for context-free grammars that restricts every production to either a single terminal or a pair of nonterminals, so no right-hand side is longer than two symbols.
  • Conversion first eliminates epsilon-productions, unit rules, useless symbols, and inaccessible symbols, then isolates terminals and binarizes long right-hand sides.
  • The Coq formalization provides a machine-checked proof that the converted grammar generates the same language, and the construction keeps grammar growth bounded, which matters for formal language theory and compiler design.

Chomsky Normal Form (CNF) is a canonical form for context-free grammars (CFGs), where productions are constrained to a particular structure that facilitates algorithms in parsing and the analysis of formal languages. A CFG is in CNF if each of its productions is either of the form $A \to a$ (where $A$ is a nonterminal and $a$ is a terminal), $A \to BC$ (where $A, B, C$ are nonterminals), or, if the language includes the empty string, a single rule $S' \to \epsilon$ with $S'$ the start symbol not occurring on any right-hand side. The existence and formal properties of CNF are foundational in computer language processing and formal language theory, underpinned by rigorous mechanical formalization and proof (Ramos et al., 2015).
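One way to see why the restricted rule shapes help parsing is the classical CYK membership test, which relies on every rule being either $A \to a$ or $A \to BC$. The following is a minimal sketch, not part of the cited formalization; the grammar encoding (a list of `(lhs, rhs)` pairs with tuple right-hand sides) and the names `cyk_accepts` and `ANBN` are assumptions made for illustration.

```python
def cyk_accepts(rules, start, word):
    """Return True if `word` is derivable from `start` in a CNF grammar.

    `rules` is a list of (lhs, rhs) pairs (hypothetical encoding): rhs is a
    1-tuple holding a terminal or a 2-tuple holding two nonterminals.
    The empty word is not handled by this sketch.
    """
    n = len(word)
    if n == 0:
        return False
    # table[i][l] = set of nonterminals deriving word[i:i+l], for l in 1..n
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, ch in enumerate(word):
        for lhs, rhs in rules:
            if rhs == (ch,):
                table[i][1].add(lhs)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split]
                right = table[i + split][length - split]
                for lhs, rhs in rules:
                    if len(rhs) == 2 and rhs[0] in left and rhs[1] in right:
                        table[i][length].add(lhs)
    return start in table[0][n]


# A CNF grammar for { a^n b^n : n >= 1 }, used only as an example.
ANBN = [
    ("S", ("A", "B")), ("S", ("A", "T")),
    ("T", ("S", "B")),
    ("A", ("a",)), ("B", ("b",)),
]
```

Because every nonterminal rule splits a substring into exactly two parts, the table can be filled by trying each split point once, which is what the innermost loops do.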

1. Theorem Statement: Existence of Chomsky Normal Form

The existence of CNF for arbitrary CFGs is formalized as follows:

Mathematical Statement:

$$\forall G=(V,\Sigma,P,S).\ \exists G'=(V',\Sigma,P',S').\ L(G)=L(G')\ \wedge\ \forall (A\to\beta)\in P'.\ (|\beta|=1\wedge\beta\in\Sigma)\ \vee\ (|\beta|=2\wedge\beta\in V'\times V').$$

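This asserts that for every CFG $G$, there exists an equivalent CFG $G'$ in CNF generating the same language. As written, the statement covers languages that do not contain the empty string; when $\epsilon \in L(G)$, the additional rule $S' \to \epsilon$ is permitted, which the Coq statement captures with a separate disjunct.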

Coq Statement:

Theorem g_cnf_final:
  ∀ g: cfg non_terminal terminal,
    (produces_empty g ∨ ¬ produces_empty g) ∧ 
    (produces_non_empty g ∨ ¬ produces_non_empty g) →
  ∃ g': cfg non_terminal' terminal,
    g_equiv g' g ∧
    (is_cnf g' ∨ is_cnf_with_empty_rule g').
Here, predicates enforce that every derivable string in $g$ is preserved in $g'$, that CNF constraints are met, and that the empty string is handled via $S' \to \epsilon$ when required, provided $S'$ is not used on any right-hand side (Ramos et al., 2015).

2. Structural Characterization of CNF Grammars

A grammar $G'=(V', \Sigma, P', S')$ is in CNF if and only if every production rule in $P'$ satisfies exactly one of:

  • $A \to a$, for $A \in V'$, $a \in \Sigma$
  • $A \to BC$, for $A, B, C \in V'$
  • Optionally, $S' \to \epsilon$ if $\epsilon \in L(G)$ and $S'$ is absent from every right-hand side
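The three rule shapes above translate directly into a syntactic check. The sketch below is illustrative only and assumes a hypothetical encoding in which a grammar is given by a set of nonterminals, a set of terminals, a list of `(lhs, rhs)` pairs with tuple right-hand sides, and a start symbol; the function name `is_cnf` mirrors the Coq predicate in name only.

```python
def is_cnf(nonterminals, terminals, rules, start):
    """Check that every rule has one of the three CNF shapes."""
    # The epsilon rule is only allowed if the start symbol is on no right side.
    start_on_rhs = any(start in rhs for _, rhs in rules)
    for lhs, rhs in rules:
        if lhs not in nonterminals:
            return False
        if len(rhs) == 1 and rhs[0] in terminals:
            continue  # A -> a
        if len(rhs) == 2 and all(x in nonterminals for x in rhs):
            continue  # A -> B C
        if len(rhs) == 0 and lhs == start and not start_on_rhs:
            continue  # S' -> epsilon
        return False
    return True
```

For example, the rules `S -> A B`, `A -> a`, `B -> b` pass the check, while adding `S -> A b` makes it fail because a terminal appears inside a right-hand side of length two.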

In the Coq formalization, new nonterminals in the CNF construction are encoded as:

Inductive non_terminal' (NT T:Type): Type :=
  | Lift_r: list (NT+T) → non_terminal'.
Sentential forms are:
Notation sf' := list (non_terminal' + terminal).
This encoding maintains the structure, ensuring each rule's right-hand side conforms to the CNF pattern (Ramos et al., 2015).
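A rough analogue of this encoding in an ordinary programming language is a wrapper type whose identity is the list of symbols it wraps. The sketch below is an assumption-laden illustration, not a translation of the Coq development: the class names `NT`, `T`, and `Lifted` and the helper `lift` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class NT:
    """An original nonterminal (plays the role of `inl`)."""
    name: str


@dataclass(frozen=True)
class T:
    """A terminal (plays the role of `inr`)."""
    char: str


Symbol = Union[NT, T]


@dataclass(frozen=True)
class Lifted:
    """A new nonterminal named by the sequence of symbols it stands for."""
    body: Tuple[Symbol, ...]


def lift(*symbols: Symbol) -> Lifted:
    """Build the lifted nonterminal for a sequence of original symbols."""
    return Lifted(tuple(symbols))
```

Because equality is structural, `lift(NT("B"), NT("C"))` built in two different places denotes the same nonterminal, so no separate supply of fresh names is needed; this appears to be the practical benefit of naming new nonterminals by lists.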

3. Transformation Algorithm: Conversion to CNF

Transforming a general CFG $g$ into an equivalent CNF grammar $g'$ is accomplished in three main phases:

3.1. Preprocessing and Simplification

  • Elimination of $\epsilon$-productions, retaining $S \to \epsilon$ only if $\epsilon \in L(G)$
  • Elimination of unit rules ($A \to B$)
  • Removal of useless symbols (nonterminals not deriving terminal strings)
  • Removal of inaccessible symbols (never derivable from $S$)
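The first of these passes can be sketched as follows. This is the usual textbook construction written in Python, not the logical specification used in the Coq development; the `(lhs, rhs)` encoding and the function names are assumptions for illustration.

```python
from itertools import product


def nullable_symbols(rules):
    """Nonterminals that can derive the empty string (least fixed point)."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in nullable and all(x in nullable for x in rhs):
                nullable.add(lhs)
                changed = True
    return nullable


def eliminate_epsilon(rules, start):
    """Drop epsilon rules, adding variants with nullable symbols omitted."""
    nullable = nullable_symbols(rules)
    out = set()
    for lhs, rhs in rules:
        # Each nullable symbol may be kept or dropped; others are always kept.
        choices = [((x,), ()) if x in nullable else ((x,),) for x in rhs]
        for pick in product(*choices):
            new_rhs = tuple(s for part in pick for s in part)
            if new_rhs:  # never emit an empty right-hand side here
                out.add((lhs, new_rhs))
    if start in nullable:
        out.add((start, ()))  # keep S -> epsilon only if epsilon is in L(G)
    return out
```

For the rules `S -> A b A`, `A -> a`, `A -> ε`, the result contains `S -> A b A`, `S -> A b`, `S -> b A`, `S -> b`, and `A -> a`. Retaining `S -> ε` is only harmless when `S` does not occur on a right-hand side, which is why the start symbol is treated specially later.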

These passes yield a simplified grammar as per:

Theorem g_simpl:
  ∀ g: cfg NT T, non_empty g →
  ∃ g': cfg NT' T,
    g_equiv g' g ∧
    has_no_inaccessible_symbols g' ∧
    has_no_useless_symbols g' ∧
    (generates_empty g → has_one_empty_rule g') ∧
    (¬ generates_empty g → has_no_empty_rules g') ∧
    has_no_unit_rules g' ∧
    start_symbol_not_in_rhs g'.
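The remaining simplification passes named in the theorem can be illustrated in the same style. The sketch below follows the standard textbook procedures (unit-rule closure, then removal of non-generating and unreachable nonterminals) and is not derived from the Coq proofs; the encoding and all function names are assumptions.

```python
def eliminate_unit_rules(nonterminals, rules):
    """Replace unit rules A -> B by the non-unit rules reachable through them."""
    rules = set(rules)
    # closure[a] = all b with a =>* b using unit rules only
    closure = {a: {a} for a in nonterminals}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if len(rhs) == 1 and rhs[0] in nonterminals:
                for a in nonterminals:
                    if lhs in closure[a] and rhs[0] not in closure[a]:
                        closure[a].add(rhs[0])
                        changed = True
    return {
        (a, rhs)
        for a in nonterminals
        for lhs, rhs in rules
        if lhs in closure[a]
        and not (len(rhs) == 1 and rhs[0] in nonterminals)
    }


def generating(nonterminals, rules):
    """Nonterminals that derive at least one terminal string."""
    gen = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            ok = all(x in gen or x not in nonterminals for x in rhs)
            if lhs not in gen and ok:
                gen.add(lhs)
                changed = True
    return gen


def reachable(nonterminals, rules, start):
    """Nonterminals that occur in some sentential form derived from start."""
    seen, todo = {start}, [start]
    while todo:
        a = todo.pop()
        for lhs, rhs in rules:
            if lhs == a:
                for x in rhs:
                    if x in nonterminals and x not in seen:
                        seen.add(x)
                        todo.append(x)
    return seen


def remove_useless(nonterminals, rules, start):
    """Drop non-generating symbols first, then unreachable ones."""
    gen = generating(nonterminals, rules)
    kept = {
        (l, r) for l, r in rules
        if l in gen and all(x in gen or x not in nonterminals for x in r)
    }
    reach = reachable(nonterminals, kept, start)
    return {(l, r) for l, r in kept if l in reach}
```

The order inside `remove_useless` matters: removing non-generating symbols can make further symbols unreachable, so reachability is computed afterwards.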

3.2. Introduction of Fresh Start Symbol

A fresh start symbol is introduced for technical uniformity:

Definition start_symbol (g_cnf g) := Lift_r [ inl (start_symbol g) ].
This construction ensures the start symbol never appears in any right-hand side.
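The formalization obtains its start symbol by lifting. A more conventional textbook route to the same property, shown here only as a sketch under assumed names and encoding, is to add a new start symbol that copies the rules of the old one:

```python
def add_fresh_start(nonterminals, rules, start):
    """Add a start symbol that occurs on no right-hand side.

    Returns (new_nonterminals, new_rules, fresh_start). The fresh symbol
    copies every rule of the old start symbol, so the language is unchanged.
    """
    fresh = start + "'"
    while fresh in nonterminals:  # keep priming until the name is unused
        fresh += "'"
    copied = {(fresh, rhs) for lhs, rhs in rules if lhs == start}
    return set(nonterminals) | {fresh}, set(rules) | copied, fresh
```

Since the fresh symbol is new, it cannot occur in any existing right-hand side, and the copied rules only reuse right-hand sides that already existed.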

3.3. Enforcement of CNF Form

(a) Terminal Isolation:

If a rule contains terminals within longer right-hand sides, e.g., $A \to \ldots t \ldots$ for $t \in \Sigma$, then associate each such $t$ with a new nonterminal $[t] = \text{Lift}_r[\text{inr}\ t]$, add the rule $[t] \to t$, and substitute $t$ with $[t]$ accordingly.
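A minimal sketch of this step, assuming symbols are plain strings and that bracketed names such as `[a]` do not clash with existing nonterminals (both are assumptions of this illustration, as is the function name):

```python
def isolate_terminals(terminals, rules):
    """Replace terminals inside right-hand sides of length >= 2 by [t]."""
    out = set()
    for lhs, rhs in rules:
        if len(rhs) >= 2:
            new_rhs = tuple(f"[{x}]" if x in terminals else x for x in rhs)
            out.add((lhs, new_rhs))
            # One rule [t] -> t per terminal that was replaced.
            out.update((f"[{x}]", (x,)) for x in rhs if x in terminals)
        else:
            out.add((lhs, rhs))  # A -> a is already in the required shape
    return out
```

Applied to `S -> a S b` and `S -> c`, this yields `S -> [a] S [b]`, `[a] -> a`, `[b] -> b`, and leaves `S -> c` untouched.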

(b) Binarization:

For any rule $A \to X_1 X_2 \ldots X_k$ with $k \ge 3$, use a laddered approach to recursively reduce the arity:

  • $A \to X_1 [N_{2..k}]$
  • $[N_{2..k}] \to X_2 [N_{3..k}]$
  • $\ldots$
  • $[N_{k-1..k}] \to X_{k-1} X_k$
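The ladder can be sketched as a loop that peels off the first symbol and names the remainder after the suffix it derives. The string-based naming scheme and the function name are assumptions of this illustration; it also assumes original symbol names contain no brackets or spaces.

```python
def binarize(rules):
    """Split every right-hand side of length >= 3 into a chain of binary rules."""
    out = set()
    for lhs, rhs in rules:
        head = lhs
        while len(rhs) > 2:
            # The new nonterminal is named after the suffix it stands for.
            tail = "[" + " ".join(rhs[1:]) + "]"
            out.add((head, (rhs[0], tail)))
            head, rhs = tail, rhs[1:]
        out.add((head, rhs))
    return out
```

For `A -> X1 X2 X3 X4` this produces `A -> X1 [X2 X3 X4]`, `[X2 X3 X4] -> X2 [X3 X4]`, and `[X3 X4] -> X3 X4`. Naming each new nonterminal by its suffix means two rules that share a suffix also share the ladder nonterminal, which is sound because that nonterminal derives exactly the suffix.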

Relevant constructors in the Coq definition include:

| Lift_cnf_t    (* terminal isolation *)
| Lift_cnf_1    (* direct terminal rule *)
| Lift_cnf_2    (* start of binary chain *)
| Lift_cnf_3    (* ladder step in binary chain *)
If the grammar generates $\epsilon$, an extra rule is added on top:
Inductive g_cnf'_rules ... :=
  | Lift_cnf'_all: (* all rules of g_cnf_rules *)
  | Lift_cnf'_new: g_cnf'_rules g (start_symbol (g_cnf g)) [].
(Ramos et al., 2015)

4. Language Equivalence: Key Lemmas and Proof Strategy

The essential invariant is bidirectional simulation between derivations in $g$ and in $g_{\text{cnf}}$:

  • Soundness: For each rule $A \to \alpha$ in $g$, either $A \to \alpha$ is itself included in $g_{\text{cnf}}$ or $A \xRightarrow{*} \alpha$ is realized through a sequence of steps in $g_{\text{cnf}}$ built from the decomposed rules; the proof proceeds by induction on the shape of the rule.
  • Completeness: All sentential forms in $g_{\text{cnf}}$ can be mapped back to $g$ by an “extraction” function that removes the $\text{Lift}_r$ wrappers. Formally, whenever $s_1 \xRightarrow{*}_{g_{\text{cnf}}} s_2$, the derivation $\mathit{extract\_sf}\ s_1 \xRightarrow{*}_{g} \mathit{extract\_sf}\ s_2$ holds by induction. Together with soundness, this guarantees that $L(g) = L(g_{\text{cnf}})$.

The equivalence is machine-checked in Coq, combining the inductive definitions and extraction functions as proof artifacts (Ramos et al., 2015).
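The extraction idea can be illustrated on a single derivation. In the sketch below, a lifted nonterminal is modelled as a tuple of the symbols it wraps and every other symbol as a string; this encoding and the Python function `extract_sf` are assumptions made for illustration and only echo the name of the Coq function.

```python
def extract_sf(sentential_form):
    """Flatten lifted nonterminals (tuples) back to the symbols they wrap."""
    flat = []
    for sym in sentential_form:
        if isinstance(sym, tuple):
            flat.extend(sym)   # remove the wrapper, keep its contents
        else:
            flat.append(sym)   # plain symbols are left unchanged
    return flat


# A derivation in the CNF grammar for the original rule A -> a S b:
steps = [
    [("A",)],                    # [A]
    [("a",), ("S", "b")],        # [A]    => [a] [S b]
    [("a",), ("S",), ("b",)],    # [S b]  => [S] [b]
    ["a", ("S",), ("b",)],       # [a]    => a
]
extracted = [extract_sf(s) for s in steps]
```

The first step extracts to `A => a S b`, a single step of the original grammar, while the later steps extract to identical sentential forms, that is, to zero steps. This is why the statement relates derivations by $\xRightarrow{*}$ rather than by single steps.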

5. Constraints, Correctness, and Complexity

All transformation steps (simplification, CNF conversion) are specified logically rather than computationally: the rules of each grammar are encoded as inductive predicates in Coq's Prop. No executable algorithm is extracted; deriving one would require changing from logical predicates to data representations and reproving the associated lemmas.

Growth is controlled: for the terminal isolation and binarization steps, the blow-up remains linear in the total length of all right-hand sides in the grammar. Each distinct terminal occurring in a long rule yields one new nonterminal and one new rule; each right-hand side of $k \ge 3$ nonterminals is replaced by $k-1$ binary rules using $k-2$ new nonterminals. The grammar produced by these steps is therefore of size polynomial (indeed linear) in the description length of the grammar they are applied to.
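As a worked check of the linear bound for terminal isolation and binarization, measure grammar size as $|G| = \sum_{(A\to\beta)\in P} |\beta|$ (a size measure assumed here for illustration), and assume the input has already been simplified so that $|\beta| \ge 1$ for every rule other than a possible start rule $S \to \epsilon$. A rule with $|\beta| = k \ge 3$ becomes $k-1$ binary rules of total length $2(k-1)$, shorter rules keep their length, and terminal isolation adds at most one rule $[t] \to t$ per terminal:

```latex
\[
|G_{\text{cnf}}|
  \;\le\; \sum_{(A\to\beta)\in P} \max\bigl(|\beta|,\; 2(|\beta|-1)\bigr) \;+\; |\Sigma|
  \;\le\; 2\,|G| + |\Sigma| .
\]
```

For instance, $A \to X_1 X_2 X_3 X_4$ has length 4 and is replaced by three binary rules of total length $6 = 2(4-1)$.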

The formalization (encompassing closures, simplification, CNF, and a generic CFG library) is fully machine-checked in approximately 18,000 lines of Coq. No further asymptotic efficiency claims are advanced beyond the “polynomial in the grammar description” bound (Ramos et al., 2015).

6. Context and Significance in Formal Language Theory

The rigorous formalization and machine-checked proof of the CNF theorem, as accomplished by Ramos et al. in Coq, strengthens theoretical foundations and offers a verified blueprint for transformations ubiquitous in parsing, automata theory, and compiler implementation. The work demonstrates that every CFG satisfying the theorem's hypotheses admits an equivalent CNF grammar, so standard algorithms and theoretical results can operate on the normal form with full confidence in its correctness (Ramos et al., 2015). A plausible implication is the encouragement of further formalization of conversion algorithms and closure properties in proof assistants, expanding the guarantees available in foundational computational theory toolchains.
