Infragistics Parsing Framework - Removing Ambiguities, Part 2

Mike Dour / Wednesday, December 19, 2012

This post is part of a series on the parsing framework provided by Infragistics starting in version 12.2. The previous post in the series can be found here and the next post can be found here.

[After reading my previous post, a friend of mine, Doug Rubino, suggested I use the term “substitute” instead of “promote” for the strategy of removing ambiguities by placing a non-terminal’s production bodies into the contexts in which it is used. I like that term better as well, so from now on, I’ll call this technique substitution.]

This time, I want to discuss shift/reduce conflicts and how we can resolve them using a similar technique we used to resolve reduce/reduce conflicts.

Shift/Reduce Conflicts

As I have said in the previous post, a shift/reduce conflict occurs when “The next k tokens from the lexer tell the parser that a production body has been read completely and can be reduced to its head symbol, but the first of those tokens also moves the parser along in a different production body which is not yet complete.” And for the Infragistics parser, that k value is 1. So let’s look at an example of a shift/reduce conflict. Let’s say we have the following grammar with “S” as the start symbol:

  1. S → A y z
  2. S → B
  3. A → x
  4. B → x y

And let’s say the lexer gives the parser the tokens <x><y><z>. After reading the <x> token, the parser will be in a state where it could be done completing production 3 or in the middle of production 4. Upon reaching the next token, <y>, the parser has two choices: either move to the next position in production 4 (shift) or create non-terminal “A” in the tree because production 3 was completed and <y> is a valid look-ahead token for “A” (reduce). And since there are two choices, the parser splits into two so both can be attempted. After reading the <z> token, the parser which shifted will go away because a <z> token cannot follow a “B” non-terminal and the other parser will parse correctly. And again, this isn’t necessarily a problem for the parser, but it will impact performance if these sorts of conflicts happen often. So let’s see if we can fix it.

As I said previously, the crux of the problem with all these conflicts is there are too many reductions and therefore too many parent nodes to create for the final parse tree. For reduce/reduce conflicts, we had two general ways of dealing with this: merging two or more non-terminal symbols into one symbol or substituting (called “promoting” in the previous post) a non-terminal symbol’s production bodies in the places where it was being used. Merging decreased multiple reductions down to one and substituting decreased multiple reductions down to zero. But in this case, we don’t have multiple reductions. One reduction is already too many for a shift/reduce conflict. So we cannot use the merging strategy because there is nothing to merge. Substitution is our only option here. Here is how the grammar would look with the “A” non-terminal symbol’s production bodies substituted:

  1. S → x y z
  2. S → B
  3. B → x y

This grammar does not have a conflict when reading the tokens <x><y><z> and luckily, it has not produced a less restrictive grammar either. However, we have removed the “A” node from the final parse tree, which could be a downside to this approach. As I said previously, this changes the grammatical structure of a parsed document and may require some additional work on the part of the developer to analyze the tree as if the original grammar were being used.

By the way, I should have mentioned this in the previous post, but it is important to note that these are just general ways to remove ambiguities. There may be a better solution for your specific grammar which only works in your scenario. For example, we could have also fixed this grammar by moving “y” into the “A” non-terminal and then merging “A” and “B”:

  1. S → AB z
  2. S → AB
  3. AB → x y

You may have to get creative. Merging and substituting are just two techniques to help you get started and understand what’s happening under the covers so the parsing process doesn’t seem so black box. But ultimately, you have to use what makes sense for your language and your logic which will process the completed parse tree.

Another thing to note is that I’ve been showing some pretty simple examples where all reductions happen because a single terminal symbol is directly after the completed production being reduced. However, things can get complicated pretty quickly as non-terminal symbols follow a completed production, because any token which can start those non-terminals will be a look-ahead for reducing the production. For example, let’s consider this grammar:

  1. S → A B
  2. S → x y
  3. A → x
  4. B → C
  5. C → D
  6. D → E
  7. E → y z

If the lexer provides the tokens <x><y> here, this will be a shift/reduce conflict between shifting <y> for production 2 and reducing production 3 to non-terminal “A” since “A” can be followed by a “B”, which could potentially start with a <y> token. There could be long chains of dependencies like this in a typical language grammar which makes it hard to determine which tokens could potentially start a non-terminal symbol.

So far, all the conflicts I’ve shown have been local ambiguities. Next time, I’ll show some global ambiguities.

By Mike Dour

The information contained herein is provided for reference purposes only.

Copyright © 2012 Infragistics, Inc., All Rights Reserved.