Parikh's theorem

Parikh's theorem in theoretical computer science says that if one looks only at the number of occurrences of each terminal symbol in a context-free language, without regard to their order, then the language is indistinguishable from a regular language.{{cite book |title=Automata and Computability|last=Kozen |first=Dexter|year=1997 |publisher=Springer-Verlag|location=New York|isbn=3-540-78105-6}} It is useful for deciding that strings with a given number of terminals are not accepted by a context-free grammar.{{cite web|url=http://www8.cs.umu.se/kurser/TDBC92/VT06/final/3.pdf|title=Parikh's theorem|author=Håkan Lindqvist|publisher=Umeå Universitet}} It was first proved by Rohit Parikh in 1961{{cite journal |last1=Parikh |first1=Rohit|authorlink=Rohit Jivanlal Parikh|year=1961 |title=Language Generating Devices|journal=Quarterly Progress Report, Research Laboratory of Electronics, MIT}} and republished in 1966.{{cite journal |last1=Parikh |first1=Rohit|authorlink=Rohit Jivanlal Parikh|year=1966 |title=On Context-Free Languages|journal=Journal of the Association for Computing Machinery|volume=13|issue=4|pages=570–581|doi=10.1145/321356.321364|s2cid=12263468|doi-access=free}}

Definitions and formal statement

Let \Sigma=\{a_1,a_2,\ldots,a_k\} be an alphabet. The Parikh vector of a word is defined as the function p:\Sigma^*\to\mathbb{N}^k, given by

p(w)=(|w|_{a_1}, |w|_{a_2}, \ldots, |w|_{a_k})

where |w|_{a_i} denotes the number of occurrences of the symbol a_i in the word w.

A subset of \mathbb{N}^k is said to be linear if it is of the form

u_0 + \mathbb{N}u_1 + \dots + \mathbb{N}u_m = \{u_0+t_1u_1+\dots+t_mu_m \mid t_1,\ldots,t_m\in\mathbb{N}\}

for some vectors u_0,\ldots,u_m.

A subset of \mathbb{N}^k is said to be semi-linear if it is a union of finitely many linear subsets.

{{Math theorem|name=Theorem|note=|math_statement=

Let L be a context-free language or a regular language, let P(L) be the set of Parikh vectors of words in L, that is, P(L) = \{p(w) \mid w \in L\}. Then P(L) is a semi-linear set.

If S is any semi-linear set, then there exists a regular language (which a fortiori is context-free) whose Parikh vectors is S.

}}

In short, the image under p of context-free languages and of regular languages is the same, and it is equal to the set of semilinear sets.

Two languages are said to be commutatively equivalent if they have the same set of Parikh vectors. Thus, every context-free language is commutatively equivalent to some regular language.

Proof

The second part is easy to prove.

{{Math proof|title=Proof|proof=

Given semi-linear set S, to construct a regular language whose set of Parikh vectors is S.

S is a union of 0 or more linear sets. Since the empty language is regular, and union of regular languages is regular, it suffices to prove that any linear set is the set of Parikh vectors of a regular language.

Let S = \{u_0+t_1u_1+\dots+t_mu_m \mid t_1,\ldots,t_m\in\mathbb{N}\}, then it is the set of Parikh vectors of \{z_0\} \cdot (\cup_{i=1}^m \{z_i\})^*, where each z_i has Parikh vector u_i.

}}

The first part is less easy. The following proof is credited to Goldstine.{{Cite journal |last=Goldstine |first=J. |date=1977-01-01 |title=A simplified proof of parikh's theorem |url=https://dx.doi.org/10.1016/0012-365X%2877%2990103-0 |journal=Discrete Mathematics |language=en |volume=19 |issue=3 |pages=235–239 |doi=10.1016/0012-365X(77)90103-0 |issn=0012-365X}}

First we need a small strengthening of the pumping lemma for context-free languages:

{{Math theorem

| name = Lemma

| note =

| math_statement = If L is generated by a Chomsky normal form grammar, then \exists N \geq 1, such that the following is true.

For any k \geq 1, and for any w \in L with |w| \geq N^k, given a derivation S \Rightarrow^* uAv of w from the axiom S, we can split w into segments w = ux_1\cdots x_kzy_k\cdots y_1v, and pick a nonterminal symbol A, such that we have:

|x_iy_i|\geq 1 for all i, and |x_1\cdots x_kzy_k\cdots y_1| \leq N^k

and the derivation S \Rightarrow^* uAv can be split in the following (with the same parse tree, in particular using precisely the same nonterminals):

S \Rightarrow^* uAv\quad \forall i, A \Rightarrow^* x_iAy_i A \Rightarrow^* z\quad

}}

The proof is essentially the same as the standard pumping lemma: use the pigeonhole principle to find k copies of some nonterminal symbol A in the longest path in the shortest derivation tree.

Now we prove the first part of Parikh's theorem, making use of the above lemma.

{{Math proof|title=Proof|proof=

First, construct a Chomsky normal form grammar for L.

For each finite nonempty subset of nonterminals U, define L_U to be the set of sentences in L such that there exists a derivation that uses every nonterminal in U, no more and no less. It is clear that L = \cup_U L_U, so it suffices to prove that each p(L_U) is a semilinear set.

Now fix some U, and let k = |U|. We construct two finite sets F, G, such that p(L_U) = p(F\cdot G^*), which is obviously semilinear.

: For notational clarity, write \Rightarrow^*_U to mean "there exists a derivation using no more (but possibly less) than nonterminals in U. With that, we define F, G as follows:

F = \{w \in L_U: |w| < N^k\}

G = \{xy: 1 \leq |xy| \leq N^k \text{ and there exists } A\in U \text{ such that } A \Rightarrow^*_U xAy\}

To prove p(L_U) \subset p(F\cdot G^*), we induct on the length of w\in L_U.

: If |w| < N^k, then w \in F, so p(w) \in p(F\cdot G^*). Otherwise, since w \in L_U, there is a derivation of w using precisely the elements of U. By the strengthened pumping lemma, this derivation can be rewritten to be of the form

S

\underset{d_0}{\stackrel{*}{\Rightarrow}} u A v

\underset{d_1}{\stackrel{*}{\Rightarrow}} u x_1 A y_1 v

\underset{d_2}{\stackrel{*}{\Rightarrow}} \cdots

\underset{d_{k}}{\stackrel{*}{\Rightarrow}} u x_1 \cdots x_k A y_k \cdots y_1 v

\underset{d_{k+1}}{\stackrel{*}{\Rightarrow}} u x_1 \cdots x_k z y_k \cdots y_1 v

: where A\in U, 1 \leq |x_iy_i|, and |x_1 \cdots x_k z y_k \cdots y_1| \leq N^k.

: Since there are only k-1 elements in U\setminus \{A\}, but there are k sub-derivations d_1, ..., d_k in the middle, picking one witnessing subderivation to use each of the nonterminals in U\setminus \{A\}, we may delete one sub-derivation d_i and obtain a shorter w' that still uses all the nonterminals of U, i.e., which is still in L_U, with

p(w) = p(uzv) + p(x_1y_1) + \cdots + p(x_ky_k) = p(w') + p(x_iy_i)

: By induction, p(w') \in p(F\cdot G^*), and by construction, x_iy_i\in G, so p(w)\in p(F\cdot G^*).

To prove p(L_U) \supset p(F\cdot G^*), consider an element w\in F\cdot G^*. We need to show that p(w) \in p(L_U). We induct on the minimal number of factors from G that is needed to identify w as an element of F\cdot G^*.

: If no factor from G is needed, then w \in F \subset L_U.

: Otherwise, w = w'xy for some w'\in F\cdot G^* and xy\in G, where w' requires less factors from G than w.

: By induction, p(w') = p(w) for some w\in L_U. By construction of G, there exists some A\in U such that A\Rightarrow^*_U xAy.

: By construction of L_U, the symbol A appears in a derivation of w using exactly all of U. Then we can interpolate A\Rightarrow^*_U xAy into that derivation to obtain some w'\in L_U such that

p(w') = p(w) + p(xy) = p(w') + p(xy) = p(w)

}}

Strengthening for bounded languages

A language L is bounded if L\subset w_1^*\ldots w_k^* for some fixed words w_1,\ldots, w_k.

Ginsburg and Spanier {{cite journal |last1=Ginsburg|first1=Seymour|last2=Spanier|first2=Edwin H.|year=1966 |title=Presburger formulas, and languages|journal=Pacific Journal of Mathematics|volume=16|issue=2|pages=285–296|doi=10.2140/pjm.1966.16.285|url=https://projecteuclid.org/euclid.pjm/1102994974|doi-access=free}}

gave a necessary and sufficient condition, similar to Parikh's theorem, for bounded languages.

Call a linear set stratified, if in its definition for each i\ge 1 the vector u_i has the property that it has at most two non-zero coordinates, and for each i,j\ge 1 if each of the vectors u_i,u_j has two non-zero coordinates, i_1 and j_1, respectively, then their order is not i_1.

A semi-linear set is stratified if it is a union of finitely many stratified linear subsets.

{{Math theorem|name=Ginsburg-Spanier|note=|math_statement=

A bounded language L is context-free if and only if \{(n_1,\ldots,n_k)\mid w_1^{n_1}\ldots w_k^{n_k}\in L\} is a stratified semi-linear set.

}}

Significance

The theorem has multiple interpretations. It shows that a context-free language over a singleton alphabet must be a regular language and that some context-free languages can only have ambiguous grammars{{explain|reason=why exactly?|date=April 2017}}. Such languages are called inherently ambiguous languages. From a formal grammar perspective, this means that some ambiguous context-free grammars cannot be converted to equivalent unambiguous context-free grammars.

References