Chomsky Normal Form

Updated 3 March 2026

Chomsky Normal Form is a canonical form for context-free grammars that restricts every production to either a single terminal or a pair of nonterminals, with an optional empty rule for the start symbol.
The conversion eliminates epsilon-productions, unit rules, useless symbols, and inaccessible symbols, then isolates terminals and binarizes long rules.
The Coq formalization provides a machine-checked proof of language equivalence, accompanied by a polynomial bound on the size of the resulting grammar, which makes it relevant to formal language theory and compiler design.

Chomsky Normal Form (CNF) is a canonical form for context-free grammars (CFGs), where productions are constrained to a particular structure that facilitates algorithms in parsing and the analysis of formal languages. A CFG is in CNF if each of its productions is either of the form $A \to a$ (where $A$ is a nonterminal, $a$ is a terminal), $A \to BC$ (where $A,B,C$ are nonterminals), or, if the language includes the empty string, a single rule $S' \to \epsilon$ with $S'$ the start symbol not occurring on any right-hand side. The existence and formal properties of CNF are foundational in computer language processing and formal language theory, underpinned by rigorous mechanical formalization and proof (Ramos et al., 2015).

1. Theorem Statement: Existence of Chomsky Normal Form

The existence of CNF for arbitrary CFGs is formalized as follows:

Mathematical Statement:

$\forall G=(V,\Sigma,P,S).\ \exists G'=(V',\Sigma,P',S').\ L(G)=L(G')\ \wedge\ \forall (A\to\beta)\in P'.\ (|\beta|=1\wedge\beta\in\Sigma)\ \vee\ (|\beta|=2\wedge\beta\in V'\times V').$

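This asserts that every CFG $G$ has an equivalent CFG $G'$, generating the same language, whose productions all have one of the two CNF shapes. As written, the formula leaves out the empty-string case: when $\epsilon \in L(G)$, the additional rule $S' \to \epsilon$ is also permitted, as in the definition above and in the Coq statement below.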

Coq Statement:

Theorem g_cnf_final:
  ∀ g: cfg non_terminal terminal,
    (produces_empty g ∨ ¬ produces_empty g) ∧ 
    (produces_non_empty g ∨ ¬ produces_non_empty g) →
  ∃ g': cfg non_terminal' terminal,
    g_equiv g' g ∧
    (is_cnf g' ∨ is_cnf_with_empty_rule g').

Here, the predicates enforce that every string derivable in g is preserved in g', that the CNF constraints are met, and that the empty string is handled via $S' \to \epsilon$ when required, provided $S'$ is not used on any right-hand side (Ramos et al., 2015).

2. Structural Characterization of CNF Grammars

A grammar $G'=(V', \Sigma, P', S')$ is in CNF if and only if every production rule in $P'$ satisfies exactly one of:

$A \to a$, for $A \in V'$, $a \in \Sigma$
$A \to BC$, for $A,B,C \in V'$
Optionally, $S' \to \epsilon$, if $\epsilon \in L(G')$ and $S'$ does not occur on any right-hand side
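
The three permitted rule shapes can be checked mechanically. The following is a minimal Python sketch, not part of the Coq development: the representation of a rule as a pair `(lhs, rhs)` and the helper name `is_cnf` are assumptions made for illustration only.

```python
from typing import Iterable, Set, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (left-hand side, right-hand side symbols)

def is_cnf(rules: Iterable[Rule], nonterminals: Set[str],
           terminals: Set[str], start: str) -> bool:
    """Return True if every rule has one of the three CNF shapes (hypothetical helper)."""
    rules = list(rules)
    # The empty rule is only allowed when the start symbol is on no right-hand side.
    start_on_rhs = any(start in rhs for _, rhs in rules)
    for lhs, rhs in rules:
        if lhs not in nonterminals:
            return False
        if len(rhs) == 1 and rhs[0] in terminals:
            continue  # shape A -> a
        if len(rhs) == 2 and all(x in nonterminals for x in rhs):
            continue  # shape A -> B C
        if len(rhs) == 0 and lhs == start and not start_on_rhs:
            continue  # shape S' -> epsilon
        return False
    return True
```

For example, the rules `S -> A B`, `A -> a`, `B -> b` pass the check, while `S -> a A` fails because a terminal occurs inside a right-hand side of length two.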

In the Coq formalization, new nonterminals in the CNF construction are encoded as:

Inductive non_terminal' (NT T: Type): Type :=
| Lift_r: list (NT + T) → non_terminal'.

Sentential forms are:

Notation sf' := list (non_terminal' + terminal).

This encoding maintains the structure, ensuring each rule's right-hand side conforms to the CNF pattern (Ramos et al., 2015).

3. Transformation Algorithm: Conversion to CNF

Transforming a general CFG $g$ into an equivalent CNF grammar $g'$ is accomplished in three main phases:

3.1. Preprocessing and Simplification

Elimination of $\epsilon$-productions, retaining $S\to\epsilon$ only if $\epsilon \in L(G)$
Elimination of unit rules ($A \to B$)
Removal of useless symbols (nonterminals that derive no terminal string)
Removal of inaccessible symbols (symbols that occur in no sentential form derivable from $S$)

These passes yield a simplified grammar as per:

Theorem g_simpl:
  ∀ g: cfg NT T, non_empty g →
  ∃ g': cfg NT' T,
    g_equiv g' g ∧
    has_no_inaccessible_symbols g' ∧
    has_no_useless_symbols g' ∧
    (generates_empty g → has_one_empty_rule g') ∧
    (¬ generates_empty g → has_no_empty_rules g') ∧
    has_no_unit_rules g' ∧
    start_symbol_not_in_rhs g'.
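
Two of the simplification passes listed above, the removal of useless and of inaccessible symbols, can be illustrated with a short Python sketch. It is a conventional fixed-point formulation under assumed names (`generating`, `reachable`, `remove_useless`) and an assumed rule representation; it is not the definition used in the Coq development, where the rules are given as logical predicates.

```python
from typing import List, Set, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (left-hand side, right-hand side symbols)

def generating(rules: List[Rule], terminals: Set[str]) -> Set[str]:
    """Nonterminals that derive at least one terminal string (least fixed point)."""
    gen: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in gen and all(x in terminals or x in gen for x in rhs):
                gen.add(lhs)
                changed = True
    return gen

def reachable(rules: List[Rule], start: str) -> Set[str]:
    """Symbols occurring in some sentential form derivable from the start symbol."""
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for lhs, rhs in rules:
            if lhs == current:
                for x in rhs:
                    if x not in seen:
                        seen.add(x)
                        stack.append(x)
    return seen

def remove_useless(rules: List[Rule], terminals: Set[str], start: str) -> List[Rule]:
    """Drop rules mentioning non-generating symbols, then rules with inaccessible heads."""
    gen = generating(rules, terminals)
    kept = [(l, r) for l, r in rules
            if l in gen and all(x in terminals or x in gen for x in r)]
    reach = reachable(kept, start)
    return [(l, r) for l, r in kept if l in reach]
```

The order of the two passes matters in this sketch: removing non-generating symbols first can make further symbols inaccessible, so accessibility is computed on the already filtered rule set.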

3.2. Introduction of Fresh Start Symbol

A fresh start symbol is introduced for technical uniformity:

Definition start_symbol (g_cnf g) := Lift_r [ inl (start_symbol g) ].

This construction ensures the start symbol never appears in any right-hand side.

3.3. Enforcement of CNF Form

(a) Terminal Isolation:

If the right-hand side of a rule has length at least two and contains terminals, e.g., $A \to \ldots t \ldots$ for $t \in \Sigma$, then associate each such $t$ with a new nonterminal $[t] = \text{Lift}_r[\text{inr}\ t]$, add the rule $[t]\to t$, and replace each occurrence of $t$ in that right-hand side with $[t]$.
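
The terminal-isolation step can be sketched in Python as follows. The function name `isolate_terminals`, the rule representation, and the bracketed naming of fresh nonterminals are illustrative assumptions; the sketch also assumes that names of the form `[t]` do not clash with existing nonterminals.

```python
from typing import List, Set, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (left-hand side, right-hand side symbols)

def isolate_terminals(rules: List[Rule], terminals: Set[str]) -> List[Rule]:
    """Replace terminals inside right-hand sides of length >= 2 by fresh nonterminals."""
    out: List[Rule] = []
    lifted: Set[str] = set()  # terminals that already have a rule [t] -> t
    for lhs, rhs in rules:
        if len(rhs) < 2:
            out.append((lhs, rhs))  # A -> a (or an empty rule) is left untouched
            continue
        new_rhs = []
        for sym in rhs:
            if sym in terminals:
                fresh = f"[{sym}]"  # assumed not to clash with existing names
                if sym not in lifted:
                    lifted.add(sym)
                    out.append((fresh, (sym,)))  # add the rule [t] -> t once
                new_rhs.append(fresh)
            else:
                new_rhs.append(sym)
        out.append((lhs, tuple(new_rhs)))
    return out
```

Applied to `S -> a B b` and `B -> b`, the sketch yields `S -> [a] B [b]`, the new rules `[a] -> a` and `[b] -> b`, and leaves `B -> b` unchanged.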

(b) Binarization:

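For any rule $A \to X_1 X_2 \ldots X_k$ with $k \ge 3$ (all $X_i$ being nonterminals once terminals have been isolated), a ladder of fresh nonterminals reduces the arity one symbol at a time, where $[N_{i..k}]$ stands for the suffix $X_i \ldots X_k$:

$A \to X_1 [N_{2..k}]$
$[N_{2..k}] \to X_2 [N_{3..k}]$
$\ldots$
$[N_{k-1..k}] \to X_{k-1} X_k$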
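
The ladder can be sketched in Python as follows. The function name `binarize` and the convention of naming each fresh nonterminal after the suffix it stands for are assumptions for illustration; the naming loosely mirrors the way `Lift_r` wraps a list of symbols, but this is not the Coq definition.

```python
from typing import List, Set, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (left-hand side, right-hand side symbols)

def binarize(rules: List[Rule]) -> List[Rule]:
    """Split every right-hand side of length >= 3 into a chain of binary rules."""
    out: List[Rule] = []
    seen: Set[Rule] = set()

    def emit(rule: Rule) -> None:
        # Identical suffixes produce identical rules, so emit each rule only once.
        if rule not in seen:
            seen.add(rule)
            out.append(rule)

    for lhs, rhs in rules:
        head, body = lhs, tuple(rhs)
        while len(body) > 2:
            rest = body[1:]
            fresh = "[" + " ".join(rest) + "]"  # fresh nonterminal for the suffix
            emit((head, (body[0], fresh)))
            head, body = fresh, rest
        emit((head, body))  # final rule of the chain, or an already short rule
    return out
```

For the single rule `A -> B C D E`, the sketch returns `A -> B [C D E]`, `[C D E] -> C [D E]`, and `[D E] -> D E`: a right-hand side of length 4 becomes 3 binary rules using 2 fresh nonterminals.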

Relevant constructors in the Coq definition include:

| Lift_cnf_t    (* terminal isolation *)
| Lift_cnf_1    (* direct terminal rule *)
| Lift_cnf_2    (* start of binary chain *)
| Lift_cnf_3    (* ladder step in binary chain *)

If the grammar generates $\epsilon$, an extra rule is added on top:

Inductive g_cnf'_rules ... :=
  | Lift_cnf'_all: (* all rules of g_cnf_rules *)
  | Lift_cnf'_new: g_cnf'_rules g (start_symbol (g_cnf g)) [].

(Ramos et al., 2015)

4. Language Equivalence: Key Lemmas and Proof Strategy

The essential invariant is bidirectional simulation between derivations in $g$ and in $g_{\text{cnf}}$:

Soundness: For each rule $A\to\alpha$ in $g$, either $A\to\alpha$ is included in $g_{\text{cnf}}$, or $A\xRightarrow{*}\alpha$ is realized by a sequence of steps in $g_{\text{cnf}}$ that uses the decomposed rules; the proof is by induction on rule shape.
Completeness: Every sentential form in $g_{\text{cnf}}$ can be mapped back to $g$ by an “extraction” function that removes the $\text{Lift}_r$ wrappers. Formally, whenever $s_1 \xRightarrow{*}_{g_{\text{cnf}}} s_2$, the derivation $extract\_sf\ s_1 \xRightarrow{*}_g extract\_sf\ s_2$ holds by induction. Together, the two directions guarantee that $L(g) = L(g_{\text{cnf}})$.
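
The extraction idea can be illustrated with a small Python sketch. It assumes, for illustration only, that a lifted nonterminal is represented as a tuple of the original symbols it wraps and that a terminal is a plain string; the Python function `extract_sf` is a hypothetical analogue of the Coq function of the same name, not a transcription of it.

```python
from typing import List, Tuple, Union

Lifted = Tuple[str, ...]          # stands for Lift_r applied to a list of original symbols
Symbol = Union[Lifted, str]       # a lifted nonterminal or a plain terminal

def extract_sf(sf: List[Symbol]) -> List[str]:
    """Map a sentential form of the CNF grammar back to one of the original grammar."""
    out: List[str] = []
    for sym in sf:
        if isinstance(sym, tuple):
            out.extend(sym)       # unwrap: a lifted symbol contributes the symbols it wraps
        else:
            out.append(sym)       # terminals are kept as they are
    return out
```

Under this representation, a step of the CNF grammar from the form `[("A",)]` to `[("B",), ("c", "D")]` extracts to the original step from `A` to `B c D`, while a ladder step from `[("c", "D")]` to `[("c",), ("D",)]` extracts to `c D` on both sides, that is, to a derivation of zero steps.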

The equivalence is machine-checked in Coq, combining the inductive definitions and extraction functions as proof artifacts (Ramos et al., 2015).

5. Constraints, Correctness, and Complexity

All transformation steps (simplification, CNF conversion) are specified logically, with the rules of each new grammar encoded as inductive predicates in Coq's Prop. No executable algorithm is extracted; deriving one would require replacing the logical predicates with data representations and reproving the associated lemmas.

Complexity is controlled: for the terminal-isolation and binarization steps, the growth is linear in the total length of all right-hand sides of the grammar being converted. Each terminal occurring in a right-hand side of length at least two contributes at most one new nonterminal and one new rule; each right-hand side of $k \ge 3$ nonterminals is replaced by $k-1$ binary rules that use $k-2$ new nonterminals. Consequently, the size of the resulting grammar is polynomial in the description length of the grammar to which these steps are applied.

The formalization (encompassing closures, simplification, CNF, and a generic CFG library) is fully machine-checked in approximately 18,000 lines of Coq. No further asymptotic efficiency claims are advanced beyond the “polynomial in the grammar description” bound (Ramos et al., 2015).

6. Context and Significance in Formal Language Theory

The rigorous formalization and machine-checked proof of the CNF theorem, as accomplished by Ramos et al. in Coq, strengthens theoretical foundations and offers a verified blueprint for transformations ubiquitous in parsing, automata theory, and compiler implementation. The work demonstrates that every CFG admits an equivalent CNF grammar (in the Coq statement, under the hypotheses that produces_empty and produces_non_empty are each either true or false), so that standard algorithms and theoretical results can operate on the normal form with confidence in its correctness (Ramos et al., 2015). A plausible implication is the encouragement of further formalization of conversion algorithms and closure properties in proof assistants, expanding the guarantees available in foundational computational theory toolchains.


References

Ramos et al. (2015). Formalization of context-free language theory.
