Infragistics Parsing Framework - Removing Ambiguities, Part 3

Mike Dour / Thursday, March 14, 2013

This post is part of a series on the parsing framework provided by Infragistics starting in version 12.2. The previous post in the series can be found here and the next post can be found here.

In the previous posts about ambiguities, I discussed the conflicts which can occur while parsing but only showed examples of local ambiguities. Other than the performance implications, you really don’t have to worry about local ambiguities because the parser will eventually figure things out as it reads more tokens from the lexer. But with global ambiguities, things get more complicated because the same tokens can be interpreted in two or more different ways.

There are many ways to produce a globally ambiguous grammar, some of which are not so obvious. Let’s look at a simple grammar which can produce a global ambiguity:

  1. S → x y z
  2. S → x A z
  3. A → y

This grammar has a shift/reduce conflict if the lexer provides the tokens <x><y><z>. The “z” symbol moves the parser along in production 1, but it is also a valid look-ahead to reduce “A” because production 3 is complete. Recall that a reduce operation is when the body of a production has been read completely, the next token is something that can legally follow the symbol at the head of that production, and so a parent node can be created to represent the head symbol in the syntax tree. Productions 1 and 2 can both represent the total content of the document, so two syntax trees will be generated.

Sometimes global ambiguities cannot be avoided. The language may be inherently ambiguous. C# is an example of such a language. The method invocation “A(B<C,D>(E+F))” can either have two parameters which are Boolean expressions or a single parameter which is an invocation of generic method B. In ambiguous languages, it is a good idea to tell the parser which interpretation to use in ambiguous situations because otherwise, the parser will arbitrarily pick one. As I mentioned in my first post, this type of situation can be handled by setting HasPriority on one of the non-terminal symbols. In 12.2, this was the only way to handle a global ambiguity. The parser will look to see which sub-tree has a priority node at the highest level and choose that one. In 13.1, we have also added some APIs to allow for code to decide which interpretation gets used. I’ll discuss these as the 13.1 release gets closer.

However, even if the language is not ambiguous, it is possible to make ambiguous grammars to describe it. This can be done unintentionally if the grammar writer is not careful and allows for the same content to be parsed in multiple ways, even if that content is empty. Consider the following EBNF grammar:

S = [A];
A = [x];

This grammar creates the following productions (because the square brackets contain optional sequences):

  1. S
  2. S → A
  3. A
  4. A → x

Now if an empty document is parsed, there are two possible trees:

In the first interpretation, the empty version of the start symbol is used (production 1). In the second interpretation, the version of the start symbol with a single “A” is used (production 2) and that “A” is empty (production 3). There are a few ways to fix this. One way is to remove the optional brackets around “x” from the “A” symbol definition. However, if “A” is used in other contexts, each usage of “A” will have to be made optional, which may not be ideal. Another way to handle this is to split the optionality of “A” with the actual symbols that can be used:

S = AOptional;
AOptional = [ARequired];
ARequired = x;

Now each symbol which was using “A” can use either “AOptional” or “ARequired” depending on whether it was optional or not. A third way to handle this is to remove the optional brackets from the usage of “A” in the start symbol:

S = A;
A = [x];

The only side-effect to this solution is that an “A” node will always be in the syntax tree, whether the document contains an “x” or not. This may or may not be ok depending on your logic which processes the tree.

Let’s looks at a slightly more complex global ambiguity (remember that curly braces indicate a sequence which can be repeated 0 or more times):

S = {A};
A = [x];

Similar to the previous example, this one allows for multiple ways of interpreting an empty document, but it also allows for multiple ways of interpreting documents with content. To see why, let’s look at the productions created by this grammar (remember that repetitions generate hidden recursive symbols to represent them):

  1. S → S1
  2. S1
  3. S1 → S1 A
  4. A
  5. A → x

Now if an empty document is parsed, there are two possible trees (the hidden S1 nodes are shown here even though they won’t appear in the final syntax tree):

In the first interpretation, the empty version of the repetition is used. In the second interpretation, the version of a repetition with a single “A” is used and that “A” is empty. Actually, it is theoretically possible for an infinite number of ambiguous trees to be generated here. An empty document can be interpreted as N “A” symbols for any value of N, all of which are empty, but our parser implementation does not generate these extra trees.

Now let’s look at the trees generated when a single “x” is present in the document:

And again, an infinite number of trees could be generated in this case, but these are the only trees actually generated by our parser.

So how can we fix this? Well, some of the same solutions from the previous example can be used here as well. You can make “A” always contains an “x” and then make all other usages of “A” optional if they are not in a repetition. Or you could separate “A” into required and optional content:

S = {ARequired};
AOptional = [ARequired];
ARequired = x;

We can also take this type of ambiguity to the next level – a repetition of repetitions:

S = {A};
A = {x};

I won’t go through the examples of ambiguous trees for this one, but you can imagine how this grows in complexity from the previous example.

These examples were all theoretical, but next time we’ll see some real world global ambiguities which can occur when writing grammar rules for arithmetic operator expressions. Then I’ll demonstrate how a different set of grammar rules can create non-ambiguous parses which actually arrange the nodes in the syntax tree in a way which reflects the precedence levels of the operators.

By Mike Dour

The information contained herein is provided for reference purposes only.

Copyright © 2013 Infragistics, Inc., All Rights Reserved.