Chomsky–Schützenberger enumeration theorem

From HandWiki

In formal language theory, the Chomsky–Schützenberger enumeration theorem is a theorem derived by Noam Chomsky and Marcel-Paul Schützenberger about the number of words of a given length generated by an unambiguous context-free grammar. The theorem provides an unexpected link between the theory of formal languages and abstract algebra.

Statement

In order to state the theorem, a few notions from algebra and formal language theory are needed.

Let [math]\displaystyle{ \mathbb{N} }[/math] denote the set of nonnegative integers. A power series over [math]\displaystyle{ \mathbb{N} }[/math] is an infinite series of the form

[math]\displaystyle{ f = f(x) = \sum_{k=0}^\infty a_k x^k = a_0 + a_1 x^1 + a_2 x^2 + a_3 x^3 + \cdots }[/math]

with coefficients [math]\displaystyle{ a_k }[/math] in [math]\displaystyle{ \mathbb{N} }[/math]. The multiplication of two formal power series [math]\displaystyle{ f }[/math] and [math]\displaystyle{ g }[/math] is defined in the expected way as the convolution of the sequences [math]\displaystyle{ a_n }[/math] and [math]\displaystyle{ b_n }[/math]:

[math]\displaystyle{ f(x)\cdot g(x) = \sum_{k=0}^\infty \left(\sum_{i=0}^k a_i b_{k-i}\right) x^k. }[/math]

In particular, we write [math]\displaystyle{ f^2 = f(x)\cdot f(x) }[/math], [math]\displaystyle{ f^3 = f(x)\cdot f(x)\cdot f(x) }[/math], and so on. In analogy to algebraic numbers, a power series [math]\displaystyle{ f(x) }[/math] is called algebraic over [math]\displaystyle{ \mathbb{Q}(x) }[/math], if there exists a finite set of polynomials [math]\displaystyle{ p_0(x), p_1(x), p_2(x), \ldots , p_n(x) }[/math] each with rational coefficients such that

[math]\displaystyle{ p_0(x) + p_1(x) \cdot f + p_2(x)\cdot f^2 + \cdots + p_n(x)\cdot f^n = 0. }[/math]

A context-free grammar is said to be unambiguous if every string generated by the grammar admits a unique parse tree or, equivalently, only one leftmost derivation. Having established the necessary notions, the theorem is stated as follows.

Chomsky–Schützenberger theorem. If [math]\displaystyle{ L }[/math] is a context-free language admitting an unambiguous context-free grammar, and [math]\displaystyle{ a_k := | L \ \cap \Sigma^k | }[/math] is the number of words of length [math]\displaystyle{ k }[/math] in [math]\displaystyle{ L }[/math], then [math]\displaystyle{ G(x)=\sum_{k = 0}^\infty a_k x^k }[/math] is a power series over [math]\displaystyle{ \mathbb{N} }[/math] that is algebraic over [math]\displaystyle{ \mathbb{Q}(x) }[/math].

Proofs of this theorem are given by (Kuich Salomaa), and by (Panholzer 2005).

Usage

Asymptotic estimates

The theorem can be used in analytic combinatorics to estimate the number of words of length n generated by a given unambiguous context-free grammar, as n grows large. The following example is given by (Gruber Lee): the unambiguous context-free grammar G over the alphabet {0,1} has start symbol S and the following rules

SM | U
M → 0M1M | ε
U → 0S | 0M1U.

To obtain an algebraic representation of the power series [math]\displaystyle{ G(x) }[/math] associated with a given context-free grammar G, one transforms the grammar into a system of equations. This is achieved by replacing each occurrence of a terminal symbol by x, each occurrence of ε by the integer '1', each occurrence of '→' by '=', and each occurrence of '|' by '+', respectively. The operation of concatenation at the right-hand-side of each rule corresponds to the multiplication operation in the equations thus obtained. This yields the following system of equations:

S = M + U
M = M²x² + 1
U = Sx + MUx²

In this system of equations, S, M, and U are functions of x, so one could also write [math]\displaystyle{ S(x) }[/math], [math]\displaystyle{ M(x) }[/math], and [math]\displaystyle{ U(x) }[/math]. The equation system can be resolved after S, resulting in a single algebraic equation:

[math]\displaystyle{ x(2x-1)S^2 + (2x-1)S +1 = 0 }[/math].

This quadratic equation has two solutions for S, one of which is the algebraic power series [math]\displaystyle{ G(x) }[/math]. By applying methods from complex analysis to this equation, the number [math]\displaystyle{ a_n }[/math] of words of length n generated by G can be estimated, as n grows large. In this case, one obtains [math]\displaystyle{ a_n \in O(2+\epsilon)^n }[/math] but [math]\displaystyle{ a_n \notin O(2-\epsilon)^n }[/math] for each [math]\displaystyle{ \epsilon\gt 0 }[/math].[1]

The following example is from (Bassino Nicaud):[math]\displaystyle{ \left\{\begin{array} { l } { S \rightarrow X Y } \\ { T \rightarrow a T | T b T | Y c Y } \\ { Y \rightarrow Y a Y | c Y | a b T a Y Y a | X } \\ { X \rightarrow a | b | c } \end{array} \Rightarrow \left\{\begin{array}{l} s(z)=x(z) y(z) \\ t(z)=z t(z)+z t(z)^2+z y(z)^2 \\ y(z)=z y(z)^2+z y(z)+z^4 t(z) y(z)^2+x(z) \\ x(z)=3 z \end{array}\right.\right. }[/math]which simplifies to[math]\displaystyle{ s(z)^8-27\left(z^3-z^2\right) s(z)^5+\ldots+59049 z^{10}=0 }[/math]

Inherent ambiguity

In classical formal language theory, the theorem can be used to prove that certain context-free languages are inherently ambiguous. For example, the Goldstine language [math]\displaystyle{ L_G }[/math] over the alphabet [math]\displaystyle{ \{a,b\} }[/math] consists of the words [math]\displaystyle{ a^{n_1}ba^{n_2}b\cdots a^{n_p}b }[/math] with [math]\displaystyle{ p\ge 1 }[/math], [math]\displaystyle{ n_i\gt 0 }[/math] for [math]\displaystyle{ i \in \{1,2,\ldots,p\} }[/math], and [math]\displaystyle{ n_j \neq j }[/math] for some [math]\displaystyle{ j \in \{1,2,\ldots,p\} }[/math].

It is comparably easy to show that the language [math]\displaystyle{ L_G }[/math] is context-free.[2] The harder part is to show that there does not exist an unambiguous grammar that generates [math]\displaystyle{ L_G }[/math]. This can be proved as follows: If [math]\displaystyle{ g_k }[/math] denotes the number of words of length [math]\displaystyle{ k }[/math] in [math]\displaystyle{ L_G }[/math], then for the associated power series holds [math]\displaystyle{ G(x) = \sum_{k=0}^\infty g_k x^k = \frac{1-x}{1-2x}- \frac1x \sum_{k \ge 1} x^{k(k+1)/2-1} }[/math]. Using methods from complex analysis, one can prove that this function is not algebraic over [math]\displaystyle{ \mathbb{Q}(x) }[/math]. By the Chomsky-Schützenberger theorem, one can conclude that [math]\displaystyle{ L_G }[/math] does not admit an unambiguous context-free grammar.[3]

Notes

  1. See (Gruber Lee) for a detailed exposition.
  2. Berstel & Boasson (1990).
  3. See (Berstel Boasson) for detailed account.

References

  • Bassino, Frederique; Nicaud, Cyril (December 16, 2011). "Philippe Flajolet & Analytic Combinatorics: Inherent Ambiguity of Context-Free Languages". https://algo.inria.fr/pfac/PFAC/Program_files/nicaud.pdf. 
  • Berstel, Jean; Boasson, Luc (1990). "Context-free languages". in van Leeuwen, Jan. Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics. Elsevier and MIT press. pp. 59–102. ISBN 0-444-88074-7. http://monge.univ-mlv.fr/~berstel/Articles/1990HandbookCfl.pdf. 
  • "The Algebraic Theory of Context-Free Languages". In P. Braffort and D. Hirschberg, eds., Computer Programming and Formal Systems (pp. 118–161). Amsterdam: North-Holland. http://www-igm.univ-mlv.fr/~berstel/Mps/Travaux/A/1963-7ChomskyAlgebraic.pdf. 
  • Analytic Combinatorics. Cambridge: Cambridge University Press. ISBN 978-0-521-89806-5. 
  • Gruber, Hermann; Lee, Jonathan; Shallit, Jeffrey (2012). "Enumerating regular expressions and their languages". arXiv:1204.4982 [cs.FL].
  • Semirings, Automata, Languages. Berlin: Springer-Verlag. ISBN 978-3-642-69961-0. 
  • Panholzer, Alois (2005). "Gröbner Bases and the Defining Polynomial of a Context-free Grammar Generating Function". Journal of Automata, Languages and Combinatorics 10: 79–97.