Consider the language
L = { 0n1m | n > m }
Here is a grammar for this language.
S → A B
A → 0 A | 0
B → 0 B 1 | ε
For any given x ∈ L we must be able to construct a derivation for x using the given grammar. Here is an example: to derive the string x = 000011 we do
S → A B → 0 A B → 00B → 000B1 → 0000B11 → 000011
An alternative to giving a derivation is to construct a parse tree. Like a derivation, a parse tree shows what variables get replaced with what right hand sides. For example, if we use the rule
B → 0 B 1
in a derivation, the parse tree will contain an associated subtree that looks like this:
Here is the full parse tree for the string x = 000011:
Lemma If L is a CFL there is a number p such that for all s ∈ L with |s| ≥ p we have that
Proof If L is a CFL, there is some grammar G for L. The arity of G is the length k of the longest right-hand side in G. Let n be number of variables in G. Consider a string s in L with length greater than or equal to p = kn+1. The parse tree for this string has a height greater than or equal to n + 1. This means that some path from the root down to a leaf in the parse tree has at least n + 1 internal nodes on the path. This in turn means that at least one variable along that path has to be repeated. Let A be the first variable on that path to get repeated as we work our way up from the leaf to the root along that path.
This results in a parse tree that contains structures like this:
If this is a valid parse tree, then this is also a valid parse tree for the string uv2wx2y:
This is a valid parse tree for the string uwy:
By using similar tricks, we can construct valid parse trees for any string of the form uvnwxny.
Consider the language
L = { zz | z ∈ Σ* }
If we assume that L is a CFL, let p be the pumping length for L. Consider the string
s = 0p1p0p1p ∈ L
Now consider what happens when we decompose s = uvwxy with |vwx| ≤ p. Because we have this latter constraint, the substring uvw can span at most two of the four subgroups in s. If we form the new string uv2wx2y, the new string will introduce new 0s or 1s to at most two of the four subgroups in s.
For example, the new string could equal 0p1p+k0p+k1p. This string is not in L. In every similar case we will see that there is no way to select v and x to give uv2wx2y ∈ L.