Chomsky normal form
{{Short description|Notation for context-free formal grammars}}
{{distinguish|conjunctive normal form}}
In formal language theory, a context-free grammar, G, is said to be in Chomsky normal form (first described by Noam Chomsky){{cite journal |last=Chomsky |first=Noam |date=1959 |title=On Certain Formal Properties of Grammars |journal=Information and Control |volume=2 |issue=2 |pages=137–167 |doi=10.1016/S0019-9958(59)90362-6 |doi-access= }} Here: Sect.6, p.152ff. if all of its production rules are of the form:{{cite web |last1=D'Antoni |first1=Loris |title=Page 7, Lecture 9: Bottom-up Parsing Algorithms |url=http://pages.cs.wisc.edu/~loris/cs536/slides/lec9.pdf |archive-url=https://web.archive.org/web/20210719220611/http://pages.cs.wisc.edu/~loris/cs536/slides/lec9.pdf |archive-date=2021-07-19 |url-status=live |website=CS536-S21 Intro to Programming Languages and Compilers |publisher=University of Wisconsin-Madison}}{{Cite book |last=Sipser |first=Michael |url=https://archive.org/details/introductiontoth00sips |title=Introduction to the theory of computation |date=2006 |publisher=Thomson Course Technology |isbn=0-534-95097-3 |edition=2nd |location=Boston |at=Definition 2.8 |oclc=58544333 |url-access=registration}}
: A → BC, or
: A → a, or
: S → ε,
where A, B, and C are nonterminal symbols, the letter a is a terminal symbol (a symbol that represents a constant value), S is the start symbol, and ε denotes the empty string. Also, neither B nor C may be the start symbol, and the third production rule can only appear if ε is in L(G), the language produced by the context-free grammar G.{{cite book |last1=Hopcroft |first1=John E. |last2=Ullman |first2=Jeffrey D. |date=1979 |title=Introduction to Automata Theory, Languages and Computation |publisher=Addison-Wesley Publishing |location=Reading, Massachusetts |isbn=978-0-201-02988-8 |url=https://archive.org/details/introductiontoau00hopc }}{{rp|92–93,106}}
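The three rule shapes can be checked mechanically. The following is a minimal sketch, not taken from any of the cited sources; it assumes a grammar is given as a list of hypothetical `(lhs, rhs)` pairs, where `rhs` is a tuple of symbols and the empty tuple stands for ε.

```python
def is_cnf(rules, nonterminals, start):
    """Return True if every rule (lhs, rhs) has one of the three CNF shapes.

    Assumed encoding (illustrative only): rhs is a tuple of symbols,
    () stands for the empty string, and any symbol that is not in
    `nonterminals` is treated as a terminal.
    """
    for lhs, rhs in rules:
        if lhs not in nonterminals:
            return False
        if len(rhs) == 0:
            # S -> ε is allowed only for the start symbol.
            if lhs != start:
                return False
        elif len(rhs) == 1:
            # A -> a: the single symbol must be a terminal.
            if rhs[0] in nonterminals:
                return False
        elif len(rhs) == 2:
            # A -> B C: both nonterminals, neither the start symbol.
            if any(x not in nonterminals or x == start for x in rhs):
                return False
        else:
            return False
    return True
```

The condition that S → ε may only appear if ε is in L(G) needs no separate check here, since the presence of that rule already puts ε into the language.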
Every grammar in Chomsky normal form is context-free, and conversely, every context-free grammar can be transformed into an equivalent one (that is, one that produces the same language) which is in Chomsky normal form and has a size no larger than the square of the original grammar's size.
Converting a grammar to Chomsky normal form
To convert a grammar to Chomsky normal form, a sequence of simple transformations is applied in a certain order; this is described in most textbooks on automata theory.{{rp|87–94}}{{cite book |last1=Hopcroft |first1=John E. |last2=Motwani |first2=Rajeev |last3=Ullman |first3=Jeffrey D. |date=2006 |title=Introduction to Automata Theory, Languages, and Computation |edition=3rd |publisher=Addison-Wesley |isbn=978-0-321-45536-9 |url-access=registration |url=https://archive.org/details/introductiontoau0000hopc }} Section 7.1.5, p.272{{cite book |last=Rich |first=Elaine |author-link=Elaine Rich|date=2007 |title=Automata, Computability, and Complexity: Theory and Applications |publisher=Prentice-Hall |edition=1st |page=169 |section=11.8 Normal Forms |url=https://www.cs.utexas.edu/~ear/cs341/automatabook/AutomataTheoryBook.pdf |archive-url=https://archive.today/20230117061906/https://www.cs.utexas.edu/~ear/cs341/automatabook/AutomataTheoryBook.pdf |archive-date=2023-01-17 |isbn=978-0132288064}}{{cite book |last=Wegener |first=Ingo |date=1993 |title=Theoretische Informatik - Eine algorithmenorientierte Einführung |language=de |series=Leitfäden und Mongraphien der Informatik |publisher=B. G. Teubner |location=Stuttgart |isbn=978-3-519-02123-0}} Section 6.2 "Die Chomsky-Normalform für kontextfreie Grammatiken", p. 149–152
The presentation here follows Hopcroft, Ullman (1979), but is adapted to use the transformation names from Lange, Leiß (2009).{{cite journal |last1=Lange |first1=Martin |last2=Leiß |first2=Hans |date=2009 |title=To CNF or not to CNF? An Efficient Yet Presentable Version of the CYK Algorithm |journal=Informatica Didactica |volume=8 |url=http://ddi.cs.uni-potsdam.de/InformaticaDidactica/LangeLeiss2009.pdf |archive-url=https://web.archive.org/web/20110719111029/http://ddi.cs.uni-potsdam.de/InformaticaDidactica/LangeLeiss2009.pdf |archive-date=2011-07-19 |url-status=live }} (For example, Hopcroft, Ullman (1979) merged TERM and BIN into a single transformation.) Each of the following transformations establishes one of the properties required for Chomsky normal form.
=START: Eliminate the start symbol from right-hand sides=
Introduce a new start symbol S0, and a new rule
:S0 → S,
where S is the previous start symbol.
This does not change the grammar's produced language, and S0 will not occur on any rule's right-hand side.
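As an illustration, START can be sketched as follows, assuming rules are hypothetical `(lhs, rhs)` pairs with `rhs` a tuple of symbols; the default name `S0` for the new start symbol is likewise an assumption of this sketch.

```python
def start_step(rules, nonterminals, start, new_start="S0"):
    """Add a fresh start symbol with the single rule new_start -> start.

    Illustrative encoding: rules are (lhs, rhs) pairs, rhs is a tuple.
    The name of the fresh symbol is an assumption and must be unused.
    """
    if new_start in nonterminals:
        raise ValueError("new start symbol must not already be in use")
    new_rules = [(new_start, (start,))] + list(rules)
    return new_rules, set(nonterminals) | {new_start}, new_start
```

Because the new symbol is fresh, it cannot occur on any right-hand side, even if the old start symbol did.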
=TERM: Eliminate rules with nonsolitary terminals=
To eliminate each rule
:A → X1 ... a ... Xn
with a terminal symbol a being not the only symbol on the right-hand side, introduce, for every such terminal, a new nonterminal symbol Na, and a new rule
:Na → a.
Change every rule
:A → X1 ... a ... Xn
to
:A → X1 ... Na ... Xn.
If several terminal symbols occur on the right-hand side, simultaneously replace each of them by its associated nonterminal symbol.
This does not change the grammar's produced language.{{rp|92}}
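A minimal sketch of TERM under the same kind of illustrative encoding (rules as `(lhs, rhs)` pairs, `rhs` a tuple of symbols). The proxy names of the form `N_a` are an assumption of this sketch and are presumed not to clash with existing nonterminals.

```python
def term_step(rules, nonterminals):
    """Replace terminals in right-hand sides of length >= 2 by proxy nonterminals.

    Any symbol not in `nonterminals` is treated as a terminal. Proxy names
    'N_<terminal>' are illustrative and assumed to be unused.
    """
    proxies = {}      # terminal -> proxy nonterminal
    new_rules = []
    for lhs, rhs in rules:
        if len(rhs) < 2:
            # A solitary terminal (or ε) is left untouched.
            new_rules.append((lhs, rhs))
            continue
        new_rhs = []
        for sym in rhs:
            if sym in nonterminals:
                new_rhs.append(sym)
            else:
                proxies.setdefault(sym, "N_" + sym)
                new_rhs.append(proxies[sym])
        new_rules.append((lhs, tuple(new_rhs)))
    # One rule N_a -> a per replaced terminal.
    for terminal, proxy in proxies.items():
        new_rules.append((proxy, (terminal,)))
    return new_rules, set(nonterminals) | set(proxies.values())
```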
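=BIN: Eliminate right-hand sides with more than 2 nonterminals=
Replace each rule
:A → X<sub>1</sub> X<sub>2</sub> ... X<sub>n</sub>
with more than 2 nonterminals X<sub>1</sub>, ..., X<sub>n</sub> by rules
:A → X<sub>1</sub> A<sub>1</sub>,
:A<sub>1</sub> → X<sub>2</sub> A<sub>2</sub>,
:... ,
:A<sub>n-2</sub> → X<sub>n-1</sub> X<sub>n</sub>,
where the A<sub>i</sub> are new nonterminal symbols.
This does not change the grammar's produced language.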
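The usual way to carry out BIN is to split a long right-hand side into a chain of binary rules, introducing one fresh nonterminal per split. The sketch below assumes rules are `(lhs, rhs)` pairs with `rhs` a tuple of symbols; the fresh names of the form `A_1`, `A_2` are hypothetical and presumed unused.

```python
def bin_step(rules, nonterminals):
    """Split every right-hand side longer than 2 into a chain of binary rules.

    A -> X1 X2 ... Xn becomes A -> X1 A_1, A_1 -> X2 A_2, ..., with the
    last rule holding the final two symbols. Fresh names are illustrative
    and assumed not to clash with existing nonterminals.
    """
    new_rules = []
    new_nts = set(nonterminals)
    counter = 0
    for lhs, rhs in rules:
        head = lhs
        while len(rhs) > 2:
            counter += 1
            fresh = f"{lhs}_{counter}"
            new_nts.add(fresh)
            # Peel off the first symbol; the rest is derived from `fresh`.
            new_rules.append((head, (rhs[0], fresh)))
            head, rhs = fresh, rhs[1:]
        new_rules.append((head, rhs))
    return new_rules, new_nts
```

A rule with n > 2 symbols on the right yields n − 1 binary rules, so the grammar size grows only linearly in this step.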
=DEL: Eliminate ε-rules=
An ε-rule is a rule of the form
:A → ε,
where A is not S0, the grammar's start symbol.
To eliminate all rules of this form, first determine the set of all nonterminals that derive ε.
Hopcroft and Ullman (1979) call such nonterminals nullable, and compute them as follows:
- If a rule A → ε exists, then A is nullable.
- If a rule A → X1 ... Xn exists, and every single Xi is nullable, then A is nullable, too.
Obtain an intermediate grammar by replacing each rule
:A → X1 ... Xn
by all versions with some nullable Xi omitted.
By deleting in this grammar each ε-rule, unless its left-hand side is the start symbol, the transformed grammar is obtained.{{rp|90}}
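The set of nullable nonterminals described above is a least fixed point and can be computed by repeating the two closure rules until nothing changes. A minimal sketch, assuming rules are hypothetical `(lhs, rhs)` pairs in which `rhs` is a tuple of symbols and the empty tuple stands for ε:

```python
def nullable_nonterminals(rules):
    """Return the set of nonterminals that can derive the empty string.

    Illustrative encoding: rules are (lhs, rhs) pairs, rhs is a tuple,
    and () stands for ε.
    """
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            # all() over an empty rhs is True, which covers A -> ε;
            # otherwise every symbol of rhs must already be nullable.
            if lhs not in nullable and all(sym in nullable for sym in rhs):
                nullable.add(lhs)
                changed = True
    return nullable
```

Terminals never enter the set, so any right-hand side containing a terminal can never make its left-hand side nullable.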
For example, in the following grammar, with start symbol S0,
: S0 → AbB | C
: B → AA | AC
: C → b | c
: A → a | ε
the nonterminal A, and hence also B, is nullable, while neither C nor S0 is.
Hence the following intermediate grammar is obtained (a kept nonterminal N is shown as {{color|#006000|N}}, an omitted one as {{color|#ffc0c0|N}}):
: S0 → {{color|#006000|A}}b{{color|#006000|B}} | {{color|#006000|A}}b{{color|#ffc0c0|B}} | {{color|#ffc0c0|A}}b{{color|#006000|B}} | {{color|#ffc0c0|A}}b{{color|#ffc0c0|B}} | C
: B → {{color|#006000|AA}} | {{color|#ffc0c0|A}}{{color|#006000|A}} | {{color|#006000|A}}{{color|#ffc0c0|A}} | {{color|#ffc0c0|A}}ε{{color|#ffc0c0|A}} | {{color|#006000|A}}C | {{color|#ffc0c0|A}}C
: C → b | c
: A → a | ε
In this grammar, all ε-rules have been "inlined at the call site". (If the grammar had a rule S0 → ε, it could not be "inlined", since it would have no "call sites"; therefore it could not be deleted in the next step.)
In the next step, they can hence be deleted, yielding the grammar:
: S0 → AbB | Ab | bB | b | C
: B → AA | A | AC | C
: C → b | c
: A → a
This grammar produces the same language as the original example grammar, viz. {ab,aba,abaa,abab,abac,abb,abc,b,ba,baa,bab,bac,bb,bc,c}, but has no ε-rules.
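The example can be reproduced with a short sketch of DEL. It assumes rules are `(lhs, rhs)` pairs with `rhs` a tuple of symbols (the empty tuple standing for ε) and takes the nullable set {A, B} determined above as given; the function name `del_step` is hypothetical.

```python
from itertools import product

def del_step(rules, nullable, start):
    """Inline ε-rules: emit every version of each rule with some nullable
    symbols omitted, then drop ε-rules except for the start symbol."""
    result = set()
    for lhs, rhs in rules:
        # A nullable symbol may be kept or omitted; any other symbol is kept.
        choices = [((sym,), ()) if sym in nullable else ((sym,),) for sym in rhs]
        for pick in product(*choices):
            new_rhs = tuple(s for part in pick for s in part)
            if new_rhs or lhs == start:
                result.add((lhs, new_rhs))
    return result

example = [
    ("S0", ("A", "b", "B")), ("S0", ("C",)),
    ("B", ("A", "A")), ("B", ("A", "C")),
    ("C", ("b",)), ("C", ("c",)),
    ("A", ("a",)), ("A", ()),
]
transformed = del_step(example, nullable={"A", "B"}, start="S0")
```

Using a set for the result merges duplicates such as the two ways of obtaining B → A from B → AA.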
=UNIT: Eliminate unit rules=
A unit rule is a rule of the form
:A → B,
where A, B are nonterminal symbols.
To remove it, for each rule
:B → X1 ... Xn,
where X1 ... Xn is a string of nonterminals and terminals, add rule
:A → X1 ... Xn
unless this is a unit rule which has already been (or is being) removed. The skipping of nonterminal symbol B in the resulting grammar is possible due to B being a member of the unit closure of nonterminal symbol A.{{Cite book |last=Allison |first=Charles D. |title=Foundations of Computing: An Accessible Introduction to Automata and Formal Languages |publisher=Fresh Sources, Inc. |year=2022 |isbn=9780578944173 |pages=176 |language=en}}
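One way to implement UNIT is via the unit closure mentioned above: for each nonterminal A, collect every B reachable through unit rules alone, then give A all non-unit rules of those B. This is a sketch of that variant rather than of the rule-by-rule removal described in the text, and it assumes rules are hypothetical `(lhs, rhs)` pairs with `rhs` a tuple of symbols.

```python
def unit_step(rules, nonterminals):
    """Eliminate unit rules A -> B using unit closures."""
    rules = set(rules)

    def is_unit(rhs):
        return len(rhs) == 1 and rhs[0] in nonterminals

    # closure[a] = all nonterminals reachable from a via unit rules only.
    closure = {a: {a} for a in nonterminals}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if not is_unit(rhs):
                continue
            for a in nonterminals:
                if lhs in closure[a] and rhs[0] not in closure[a]:
                    closure[a].add(rhs[0])
                    changed = True

    # A inherits every non-unit rule of each member of its closure.
    result = set()
    for a in nonterminals:
        for lhs, rhs in rules:
            if lhs in closure[a] and not is_unit(rhs):
                result.add((a, rhs))
    return result
```

Since each nonterminal may inherit the rules of every other one, the number of rules can grow quadratically.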
=Order of transformations=
{| class="wikitable collapsible" style="float:right"
|+ Mutual preservation of transformation results
|-
| colspan=6 | Transformation X {{color|#004000|always preserves}} ({{Aye}}) resp. {{color|#400000|may destroy}} ({{Nay}}) the result of Y:
|-
! {{diagonal split header|X|Y}}
! START !! TERM !! BIN !! DEL !! UNIT
|-
! START
| || {{Ya}} || {{Ya}} || {{Na}} || {{Na}}
|-
! TERM
| {{Ya}} || || {{Na}} || {{Ya}} || {{Ya}}
|-
! BIN
| {{Ya}} || {{Ya}} || || {{Ya}} || {{Ya}}
|-
! DEL
| {{Ya}} || {{Ya}} || {{Ya}} || || {{Na}}
|-
! UNIT
| {{Ya}} || {{Ya}} || {{Ya}} ||{{Ya|text=({{Aye}})*}}||
|-
| colspan=6 | *UNIT preserves the result of DEL if START had been called before.
|}
When choosing the order in which the above transformations are to be applied, it has to be considered that some transformations may destroy the result achieved by other ones. For example, START will re-introduce a unit rule if it is applied after UNIT. The table shows which orderings are admitted.
Moreover, the worst-case bloat in grammar size (i.e. written length, measured in symbols) depends on the transformation order. Using |G| to denote the size of the original grammar G, the size blow-up in the worst case may range from |G|<sup>2</sup> to 2<sup>2|G|</sup>, depending on the transformation algorithm used.{{rp|7}} The blow-up in grammar size depends on the order between DEL and BIN. It may be exponential when DEL is done first, but is linear otherwise. UNIT can incur a quadratic blow-up in the size of the grammar.{{rp|5}} The orderings START,TERM,BIN,DEL,UNIT and START,BIN,DEL,UNIT,TERM lead to the least (i.e. quadratic) blow-up.
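The effect of the order between DEL and BIN can be illustrated by counting rules for a hypothetical grammar family with one long rule S → A1 A2 ... An in which every Ai is nullable. The sketch below is an illustration under that assumption only; it counts the nonempty right-hand sides DEL produces, first for the long rule directly and then for the binary chain that BIN would have built from it.

```python
from itertools import product

def versions_after_del(rhs, nullable):
    """Number of distinct nonempty right-hand sides DEL derives from one rule."""
    choices = [((s,), ()) if s in nullable else ((s,),) for s in rhs]
    seen = {tuple(x for part in pick for x in part) for pick in product(*choices)}
    seen.discard(())
    return len(seen)

n = 10
symbols = [f"A{i}" for i in range(1, n + 1)]
nullable = set(symbols)

# DEL first: the single rule S -> A1 ... An is expanded directly.
del_first = versions_after_del(tuple(symbols), nullable)

# BIN first: S -> A1 T1, T1 -> A2 T2, ..., T(n-2) -> A(n-1) An.
# The chain nonterminals T1..T(n-2) are nullable as well.
heads = ["S"] + [f"T{i}" for i in range(1, n - 1)]
binary_rules = []
for i, head in enumerate(heads):
    second = heads[i + 1] if i + 1 < len(heads) else symbols[-1]
    binary_rules.append((head, (symbols[i], second)))
bin_first = sum(
    versions_after_del(rhs, nullable | set(heads[1:])) for _, rhs in binary_rules
)
```

For n = 10, applying DEL first yields 2<sup>10</sup> − 1 = 1023 rules from the single long rule, whereas applying BIN first yields 3 · 9 = 27.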
Example
[[File:Syntax tree of arithmetic expression wrt Chomsky normal form grammar.gif|thumb|Syntax tree of the arithmetic expression "a^2+4*b" wrt. the example grammar (top) and its Chomsky normal form (bottom)]]
The following grammar, with start symbol Expr, describes a simplified version of the set of all syntactically valid arithmetic expressions in programming languages like C or Algol60. Both number and variable are considered terminal symbols here for simplicity, since in a compiler front end their internal structure is usually not considered by the parser. The terminal symbol "^" denoted exponentiation in Algol60.
: Expr → Term | Expr AddOp Term | AddOp Term
: Term → Factor | Term MulOp Factor
: Factor → Primary | Factor ^ Primary
: Primary → number | variable | ( Expr )
: AddOp → + | −
: MulOp → * | /
In step "START" of the above conversion algorithm, just a rule S0→Expr is added to the grammar.
After step "TERM", the grammar looks like this:
: S0 → Expr
: Expr → Term | Expr AddOp Term | AddOp Term
: Term → Factor | Term MulOp Factor
: Factor → Primary | Factor PowOp Primary
: Primary → number | variable | Open Expr Close
: AddOp → + | −
: MulOp → * | /
: PowOp → ^
: Open → (
: Close → )
After step "BIN", the following grammar is obtained:
: S0 → Expr
: Expr → Term | Expr AddOp_Term | AddOp Term
: Term → Factor | Term MulOp_Factor
: Factor → Primary | Factor PowOp_Primary
: Primary → number | variable | Open Expr_Close
: AddOp → + | −
: MulOp → * | /
: PowOp → ^
: Open → (
: Close → )
: AddOp_Term → AddOp Term
: MulOp_Factor → MulOp Factor
: PowOp_Primary → PowOp Primary
: Expr_Close → Expr Close
Since there are no ε-rules, step "DEL" does not change the grammar.
After step "UNIT", the following grammar is obtained, which is in Chomsky normal form:
: S0 → number | variable | Open Expr_Close | Factor PowOp_Primary | Term MulOp_Factor | Expr AddOp_Term | AddOp Term
: Expr → number | variable | Open Expr_Close | Factor PowOp_Primary | Term MulOp_Factor | Expr AddOp_Term | AddOp Term
: Term → number | variable | Open Expr_Close | Factor PowOp_Primary | Term MulOp_Factor
: Factor → number | variable | Open Expr_Close | Factor PowOp_Primary
: Primary → number | variable | Open Expr_Close
: AddOp → + | −
: MulOp → * | /
: PowOp → ^
: Open → (
: Close → )
: AddOp_Term → AddOp Term
: MulOp_Factor → MulOp Factor
: PowOp_Primary → PowOp Primary
: Expr_Close → Expr Close
The Na introduced in step "TERM" are PowOp, Open, and Close.
The Ai introduced in step "BIN" are AddOp_Term, MulOp_Factor, PowOp_Primary, and Expr_Close.
Alternative definition
= Chomsky reduced form =
Another way to define the Chomsky normal form{{rp|92}} (see also Hopcroft et al. (2006){{page needed|date=November 2014}}) is:
A formal grammar is in Chomsky reduced form if all of its production rules are of the form:
: A → BC or
: A → a,
where A, B and C are nonterminal symbols, and a is a terminal symbol. When using this definition, B or C may be the start symbol. Only those context-free grammars which do not generate the empty string can be transformed into Chomsky reduced form.
= Floyd normal form =
In a letter in which he proposed the term Backus–Naur form (BNF), Donald E. Knuth implied that a BNF "syntax in which all definitions have such a form may be said to be in 'Floyd Normal Form'",
: ⟨A⟩ ::= ⟨B⟩ | ⟨C⟩ or
: ⟨A⟩ ::= ⟨B⟩⟨C⟩ or
: ⟨A⟩ ::= a,
where ⟨A⟩, ⟨B⟩ and ⟨C⟩ are nonterminal symbols, and a is a terminal symbol,
because Robert W. Floyd found in 1961 that any BNF syntax can be converted to the above form.{{cite journal | url=https://core.ac.uk/download/pdf/82003923.pdf |archive-url=https://web.archive.org/web/20210305050258/https://core.ac.uk/download/pdf/82003923.pdf |archive-date=2021-03-05 |url-status=live | author=Floyd, Robert W. | title=Note on mathematical induction in phrase structure grammars. | journal=Information and Control | volume=4 | pages=353–358 | year=1961 | issue=4 | doi=10.1016/S0019-9958(61)80052-1 | doi-access=free }} Here: p.354 However, Knuth withdrew this term, "since doubtless many people have independently used this simple fact in their own work, and the point is only incidental to the main considerations of Floyd's note."{{cite journal |last=Knuth |first=Donald E. |date=December 1964 |title=Backus Normal Form vs. Backus Naur Form |journal=Communications of the ACM |doi=10.1145/355588.365140 |volume=7 |issue=12 |pages=735–736|s2cid=47537431 |doi-access=free }} While Floyd's note cites Chomsky's original 1959 article, Knuth's letter does not.
Application
Besides its theoretical significance, CNF conversion is used as a preprocessing step in some algorithms, e.g., the CYK algorithm, a bottom-up parsing algorithm for context-free grammars, and its probabilistic variant.{{cite book |last1=Jurafsky |first1=Daniel |last2=Martin |first2=James H. |date=2008 |title=Speech and Language Processing |edition=2nd |publisher=Pearson Prentice Hall |isbn=978-0-13-187321-6 |page=465}}
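To show why the normal form is convenient for parsing, the following is a minimal sketch of a CYK recognizer applied to the Chomsky normal form of the arithmetic grammar from the example section. The `(lhs, rhs)` rule encoding and the pre-tokenized input are assumptions of this sketch, and the ASCII "-" stands in for the minus sign of the example grammar.

```python
def cyk(tokens, rules, start):
    """Return True if `tokens` is derivable from `start` in a CNF grammar."""
    n = len(tokens)
    if n == 0:
        return (start, ()) in rules
    # table[i][l] = nonterminals deriving the l tokens starting at position i.
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        for lhs, rhs in rules:
            if rhs == (tok,):
                table[i][1].add(lhs)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split]
                right = table[i + split][length - split]
                for lhs, rhs in rules:
                    if len(rhs) == 2 and rhs[0] in left and rhs[1] in right:
                        table[i][length].add(lhs)
    return start in table[0][n]

# CNF grammar of the example section, built up alternative by alternative.
primary = [("number",), ("variable",), ("Open", "Expr_Close")]
factor = primary + [("Factor", "PowOp_Primary")]
term = factor + [("Term", "MulOp_Factor")]
expr = term + [("Expr", "AddOp_Term"), ("AddOp", "Term")]
grammar = {
    "S0": expr, "Expr": expr, "Term": term, "Factor": factor, "Primary": primary,
    "AddOp": [("+",), ("-",)], "MulOp": [("*",), ("/",)], "PowOp": [("^",)],
    "Open": [("(",)], "Close": [(")",)],
    "AddOp_Term": [("AddOp", "Term")], "MulOp_Factor": [("MulOp", "Factor")],
    "PowOp_Primary": [("PowOp", "Primary")], "Expr_Close": [("Expr", "Close")],
}
rules = [(lhs, rhs) for lhs, alts in grammar.items() for rhs in alts]

# "a^2+4*b" after tokenization into the grammar's terminal symbols.
tokens = ["variable", "^", "number", "+", "number", "*", "variable"]
accepted = cyk(tokens, rules, "S0")
```

Because every non-terminal rule has exactly two symbols on its right-hand side, each table entry only has to consider a single split point at a time, which is what gives the algorithm its cubic running time in the input length.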
See also
- Backus–Naur form
- CYK algorithm
- Greibach normal form
- Kuroda normal form
- Pumping lemma for context-free languages — its proof relies on the Chomsky normal form
Notes
{{reflist|group=note}}
References
Further reading
- Cole, Richard. Converting CFGs to CNF (Chomsky Normal Form), October 17, 2007. [http://cs.nyu.edu/courses/fall07/V22.0453-001/cnf.pdf (pdf)] — uses the order TERM, BIN, START, DEL, UNIT.
- {{cite book |author=John Martin |year=2003 |url=https://archive.org/details/introductiontola0000mart |title=Introduction to Languages and the Theory of Computation |publisher=McGraw Hill |isbn=978-0-07-232200-2 |url-access=registration}} (Pages 237–240 of section 6.6: simplified forms and normal forms.)
- {{cite book
| author-link = Michael Sipser
| author = Michael Sipser
| year = 1997
| title = Introduction to the Theory of Computation
| publisher = PWS Publishing
| isbn = 978-0-534-94728-6
| url = https://archive.org/details/introductiontoth00sips
}} (Pages 98–101 of section 2.1: context-free grammars. Page 156.)
- {{cite book|author=Charles D. Allison (2021)|title = Foundations of Computing: An Accessible Introduction to Formal Language| date=20 August 2021 |publisher=Fresh Sources, Inc.|isbn = 9780578944173}} (pages 171-183 of section 7.1: Chomsky Normal Form)
- Sipser, Michael. Introduction to the Theory of Computation, 2nd edition.
- {{cite book|author=Alexander Meduna|title=Automata and Languages: Theory and Applications|url=https://books.google.com/books?id=a-rjBwAAQBAJ&q=%22Chomsky+normal+form%22|date=6 December 2012|publisher=Springer Science & Business Media|isbn=978-1-4471-0501-5}}
{{Noam Chomsky}}