Unit - 2
Basic Parsing Techniques
The parser is a compiler used to break down the information from the lexical analysis process into smaller components.
A parser receives input in the form of a token sequence and generates output in the form of a parse tree.
The parser is the compiler level, which uses a token string as input and transforms it into the corresponding parse tree with the aid of existing grammar. Often known as Syntax Analyzer.
Parsing consists of two types: parsing top-down and parsing bottom up.
Fig 1: Types of the parser
2.1.1 Top-down parsing
Recursive or predictive parsing is known as top-down parsing.
The parsing begins with the start symbol in the top-down parsing and transforms it into the input symbol.
The top-down parser is the parser that, by expanding the non-terminals, generates a parse for the given input string with the aid of grammar productions, i.e. it begins from the start symbol and ends on the terminals.
Parse Tree The input string representation of acdb is as follows:
Fig 2: parse tree
There are 2 types of additional Top-down parsers: Recursive descent parser, and Non-recursive descent parser.
It is also known as the Brute force parser or the backtracking parser. Basically, by using brute force and backtracking, it produces the parse tree.
A top-down parsing technique that builds the parse tree from the top and reads the input from left to right is recursive descent. For any terminal and non-terminal entity, it uses procedures.
The input is recursively parsed by this parsing technique to make a parse tree, which may or may not require back-tracking. But the related grammar (if not left-faced) does not prevent back-tracking. Predictive is considered a type of recursive-descent parsing that does not require any back-tracking.
As it uses context-free grammar that is recursive, this parsing technique is called recursive.
II. Non-recursive descent parser:
It is also known with or without a backtracking parser or dynamic parser, LL(1) parser, or predictive parser. It uses the parsing table instead of backtracking to create the parsing tree.
2.1.2 Bottom-up parsing
Bottom-up parsing is also known as parsing for shift-reduce.
Bottom-up parsing is used to build an input string parsing tree.
The parsing begins with the input symbol in the bottom-up parsing and builds the parse tree up to the start symbol by tracing the rightmost string derivations in reverse.
Parsing from the bottom up is known as multiple parsing. Those are like the following:
Key takeaway:
- A parser receives input in the form of a token sequence and generates output in the form of a parse tree.
- Bottom-up parsing is used to build an input string parsing tree.
- A top-down parsing technique that builds the parse tree from the top and reads the input from left to right is recursive descent.
● Shift reduction parsing is a mechanism by which a string is reduced to the beginning symbol of a grammar.
● Using a stack to hold the grammar and an input tape to hold the string, Shift reduces parsing.
● The two acts are done by Shift Decrease Parsing: move and decrease. This is why it is considered that shift-reduce parsing.
● The current symbol in the input string is moved into the stack for the shift operation.
● The symbols will be substituted by the non-terminals at each reduction. The mark is the right side of manufacturing and the left side of output is non-terminal.
Example:
Grammar:
Input string :
a1-(a2+a3)
Parsing table :
For bottom-up parsing, shift-reduce parsing uses two special steps. Such steps are referred to as shift-step and reduce-step.
Shift - step: The shift step refers to the progression to the next input symbol, which is called the shifted symbol, of the input pointer. Into the stack, this symbol is moved. The shifted symbol will be viewed as a single parse tree node.
Reduce - step: When a complete grammar rule (RHS) is found by the parser and substituted with (LHS), it is known as a reduction stage. This happens when a handle is located at the top of the stack. A POP function is performed on the stack that pops off the handle and substitutes it with a non-terminal LHS symbol to decrease.
As follows, there are two major types of shift reduction parsing:
2.2.1 Operator precedence parsing
The grammar of operator precedence is a kind of shift reduction system of parsing. It is applied to a small grammar class of operators.
A grammar, if it has two properties, is said to be operator precedence grammar:
● There's no R.H.S. of any production.
● There are no two adjoining non-terminals.
Operator precedence can only be defined between the terminals of the grammar. A non-terminal is overlooked.
The three operator precedence relations exist:
➢ a ⋗ b suggests that the precedence of terminal "a" is greater than terminal "b"
➢ a ⋖ b means the lower precedence of terminal "a" than terminal "b"
➢ a ≐ b indicates that all terminals "a" and "b" have the same precedence.
Precedence table:
Parsing Action
➔ Add the $ symbol at both ends of the specified input string.
➔ Now search the right-left input string until the ⁇ is encountered.
➔ Scan to the leftover all of the equal precedences until much of the first left is met.
➔ A handle is anything between most of the left and most of the right.
➔ $ on $ means the parsing succeeds.
Example - Grammar:
Given string:
For it, let us take a parse tree as follows:
We can design the following operator precedence table based on the tree above:
Let us now process the string with the assistance of the precedence table above:
2.2.2 LR parsers
● In the LR parser, “ L” means left to right scanning of any input.
● One form of bottom-up parsing is LR parsing. It is used to evaluate a broad grammar class.
● R" stands for the construction of a proper reverse derivation.
● "K" is the number of look-ahead input symbols used to make some parsing decisions.
LR parsing is divided into four parts: parsing for LR (0), parsing for SLR, parsing for CLR, and parsing for LALR.
Fig 3: Types of LR parser
LR algorithm :
The stack, input, output, and parsing table are required by the LR algorithm. The input, output, and stack are the same in all modes of LR parsing, but the parsing table is different.
Fig 4: Diagram of LR parser
● The input buffer is used to denote the end of the input and includes the string followed by the $ symbol to be parsed.
● To contain a series of grammar symbols with $ at the bottom of the stack, a stack is used.
● A two-dimensional array is a parsing table. It comprises two parts: The action part and the Goes To part.
LR (1) parser :
Different steps involved in LR (1) Parsing:
❏ Write context-free grammar for the specified input string.
❏ Verify the grammar's uncertainty.
❏ Add the Augmentation output to the specified grammar.
❏ Creating a canonical set of objects from LR (0).
❏ Drawing a diagram of data flow (DFA).
❏ Build a table for LR (1) parsing.
Augment Grammar :
If we add one more output to the given grammar G, Augmented Grammar G will be generated. It allows the parser to define when the parsing should be stopped and to announce the acceptance of the input.
Example
Given grammar
S → AA
A → aA | b
The Augment grammar G is represented by
Key takeaway:
- The current symbol in the input string is moved into the stack for the shift operation.
- The shifted symbol will be viewed as a single parse tree node.
- LR(1) parser Verify the grammar's uncertainty.
- In the LR parser, “ L” means left to right scanning of any input.
- The stack, input, output, and parsing table are required by the LR algorithm.
An LR (0) object is an output G with a dot on the right side of the production at a certain location.
LR(0) items are useful for showing how much of the input has been scanned in the parsing phase up to a given stage. We position the reduction node in the entire row in the LR (0).
We increase the grammar and get a fresh output of this one; take its closure. That is the collection's first element; call it Z (we will call it I0). Try GOTO from Z, i.e. consider GOTO(Z, X) for each grammar symbol; each of these (almost) is another element of the set.
From each of these new elements of the set, try GOTOing now, etc. Start with Jane Smith, add F to all her friends, then add F to everyone's friends in F, named FF, then add FF to everyone's friends, etc.
The (almost) is because GOTO(Z, X) may be empty, so we formally create the LR(0) items canonical set, C, as follows :
In the DFA I created earlier, this GOTO gives exactly the arcs. Formal therapy does not include the NFA but starts from the beginning of the DFA.
Example -
E' → E
E → E + T | T
T → T * F | F
F → ( E ) | id
It's bigger than the toy that I made before. The NFA will have 2+4+2+4+2+4+2=20 states (a development on the RHS with k symbols gives k+1 N-states as there are k+1 places to position the dot). This brings 12 D-states into being. The creation in the book, which we are following now, however, explicitly constructs the DFA.
Begin to build the diagram on the platform:
Start with {E ' → ·E}, close the lock, and then proceed to apply GOTO.
Fig 5: DFA diagram
Key takeaway:
- LR(0) items are useful for showing how much of the input has been scanned in the parsing phase up to a given stage.
SLR (1) refers to the fast parsing of LR. It's the same as parsing for LR(0). The only difference is in the parsing table. We use a canonical LR (0) item array to create the SLR (1) parsing table.
We just position the reduction step in the follow-up of the left-hand side in the SLR (1) parsing.
Different steps related to SLR (1) Parsing:
❏ Write a context-free grammar for the specified input string
❏ Verify the grammar's uncertainty
❏ Add Increase output in the grammar specified
❏ Build a canonical set of things from LR (0)
❏ Drawing a diagram of data flow (DFA)
❏ Build an SLR (1) table for parsing
SLR(1) Table Construction :
Start at rule zero for the rules in an augmented grammar, G ', and follow the steps below:
Steps of State Formation (big picture algorithm)
1. Apply the Start and Start operations
2. Attain the State:
Example :
Consider the augmented grammar G'
S’ ‐‐> S$
S ‐‐> aSbS
S ‐‐> a
The set of Item Sets LR(0), states:
I0 [S’ -> .S$] (Operation to start; read on S goes to I11 (state 1) )
[S -> .aSbS] (Full S rule operation; the reading on 'a' goes to I2 (state 2) )
[S -> .a] (For all rules 'S', continue to complete; read on 'a', to state 2 )
I1 [S’ -> S.$] (From the first line, read on 'S'; Note: never read on '$'
Nothing to read about; nothing to end )
I2 [S -> a.SbS] (Reading from state I0 on 'a'; reading on'S 'goes to I3 (state 3) )
[S -> a.] (Continue reading on 'a' from state I0 (see stage 2 of formation of state) with nothing to read on; nothing to complete )
[S -> .aSbS] (Complete the state in the first item read on 'a' cycles back to state 2 because of '.S' )
[S -> .a] (All grammar rules for 'S' ditto read on 'a' cycles remain full back to state 2 )
I3 [S -> aS.bS] (The dot is before a non-terminal from the reading on'S 'from state I2, no full operation read on' b 'goes to I4 (state 4) )
I4 [S -> aSb.S] (From reading from state I3 on 'b'; reading from'S 'goes to I5 (state 5) )
[S -> .aSbS] (Due to the '.S' in the first object, complete the state;
Note: always dot in front for complete points
Return to State 2 read on 'a' cycles )
[S -> .a] ( Continue complete; ditto read back to state 2 on 'a' cycles )
I5 [S -> aSbS.] (From state 5 to read on 'S'; nothing to read on )
Construct the parse table :
You need the FIRST and FOLLOW sets for each non-terminal in your grammar to construct the parsing table.
You will need the completed collection of products for the state from above.
Now, for your parse table, draw the n x m table and mark it appropriately. The n is equal to the number of states from part 1 that you have. By counting all non-terminals and all terminals in the grammar, you decide the M. Provide a row or column for each element of n and m.
Label each row n, starting at zero, with a state number. Mark each column m, beginning from left to right, with one terminal per column. Labeling the remaining columns after labeling with all the terminals
With those non-terminals.
The ACTION side is the group of columns on the left (terminals). The group of columns is the GOTO side on the right (non-terminals)
You obey these four rules to fill out the table,
Construction rules
α, β = any string of terminals and/or non‐terminals
X, S’, S = non‐terminals
(When dot is in middle)
1. if [A ‐‐> α.aβ] ε Ii and read on ‘a’ produces Ij then ACTION [i , a] = SHIFT j.
2. if [A ‐‐> α.Xβ] ε Ii and read on ‘X’ produces Ij then GOTO [i , X] = j.
(When dot is at end)
3. if [A ‐‐> α.] ε Ii then ACTION [i , a] = REDUCE on A ‐> α for all a ε FOLLOW(A).
4. if [S’ ‐‐> S.] ε Ii then ACTION [i , $] = ACCEPT.
For example, the parsing table:
Key takeaway :
- SLR (1) refers to the fast parsing of LR.
- It's the same as parsing for LR(0).
➢ Canonical lookahead is referred to by CLR. To create the CLR (1) parsing table, CLR parsing utilizes the canonical set of LR (1) objects. As compared to the SLR (1) parsing, the CLR (1) parsing table produces a greater number of states.
➢ We only position the reduction node in the lookahead symbols in the CLR (1).
➢ Different steps used in CLR (1) Parsing:
➔ Write a context-free grammar for the specified input string
➔ Verify the grammar's uncertainty
➔ Add Increase output in the grammar specified
➔ Build a canonical set of things from LR (0)
➔ Drawing a diagram of data flow (DFA)
➔ Build a CLR (1) table for parsing
LR (1) item
Item LR(1) is a list of things from LR(0) and a look-ahead sign.
LR (1) item = LR (0) item + look ahead
The lookahead is used to decide where the final item will be put.
For the development of claims, the look-ahead often adds the $ symbol.
Example
CLR ( 1 ) Grammar
Add Augment Production, insert the symbol '•' for each production in G at the first position, and add the lookahead as well.
I0 state :
Add the output of Augment to the I0 State and compute the closure
I0 = Closure (S → •S)
In changed I0 State, add all productions beginning with S since "." is followed by the non-terminal. So, it will become the I0 State.
I0 = S → •S, $
S → •AA, $
In changed I0 State, add all productions beginning with A since "." is followed by the non-terminal. So, it will become the I0 State
I0= S → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
I1= Go to (I0, S) = closure (S → S•, $) = S → S•, $
I2= Go to (I0, A) = closure ( S → A•A, $ )
"In I2 State, add all productions beginning with A since "." is followed by the non-terminal. The I2 Condition, then, becomes
I2= S → A•A, $
A → •aA, $
A → •b, $
I3= Go to (I0, a) = Closure ( A → a•A, a/b )
In the I3 State, add all productions beginning with A since "." is followed by the non-terminal. The I3 Condition, then, becomes
I3= A → a•A, a/b
A → •aA, a/b
A → •b, a/b
Go to (I3, a) = Closure (A → a•A, a/b) = (same as I3)
Go to (I3, b) = Closure (A → b•, a/b) = (same as I4)
I4= Go to (I0, b) = closure ( A → b•, a/b) = A → b•, a/b
I5= Go to (I2, A) = Closure (S → AA•, $) =S → AA•, $
I6= Go to (I2, a) = Closure (A → a•A, $)
"In I6 State, add all productions beginning with A since "." is followed by the non-terminal. The I6 Condition, then, becomes
I6 = A → a•A, $
A → •aA, $
A → •b, $
Go to (I6, a) = Closure (A → a•A, $) = (same as I6)
Go to (I6, b) = Closure (A → b•, $) = (same as I7)
I7= Go to (I2, b) = Closure (A → b•, $) = A → b•, $
I8= Go to (I3, A) = Closure (A → aA•, a/b) = A → aA•, a/b
I9= Go to (I6, A) = Closure (A → aA•, $) = A → aA•, $
Diagram DFA :
CLR(1) Parsing Table :
Productions are numbered according to the following:
The location of the move node in the CLR (1) parsing table is the same as the parsing table of the SLR (1).
Key takeaway:
- To create the CLR (1) parsing table, CLR parsing utilizes the canonical set of LR (1) objects.
- As compared to the SLR (1) parsing, the CLR (1) parsing table produces a greater number of states.
The lookahead LR is alluded to by LALR. We use the canonical set of LR (1) objects to create the LALR (1) parsing table.
In the LALR (1) analysis, the LR (1) products that have the same outputs but look ahead differently are combined to form a single set of items
The parsing of LALR(1) is the same as the parsing of CLR(1), the only distinction in the parsing table.
LALR (1) Parsing:
LALR refers to the lookahead LR. To construct the LALR (1) parsing table, we use the canonical collection of LR (1) items.
In the LALR (1) parsing, the LR (1) items that have the same productions but different look ahead are combined to form a single set of items
LALR (1) parsing is the same as the CLR (1) parsing, with only a difference in the parsing table.
Example
LALR ( 1 ) Grammar
Add Augment Production, insert the '•' symbol for each production in G at the first place and add the look ahead as well.
S → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
I0 State:
Add the output of Augment to the I0 State and measure Closure L
I0 = Closure (S → •S)
Add all productions beginning with S to I0 State, since the non-terminal is followed by '•'. The I0 Condition, then, becomes
I0 = S → •S, $
S → •AA, $
In changed I0 State, add all productions beginning with A since "•" is followed by the non-terminal. So, it will become the I0 State.
I0= S → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
I1= Go to (I0, S) = closure (S → S•, $) = S → S•, $
I2= Go to (I0, A) = closure ( S → A•A, $ )
In the I2 State, add all productions beginning with A since "•" is followed by the non-terminal. The I2 Condition, then, becomes
I2= S → A•A, $
A → •aA, $
A → •b, $
I3= Go to (I0, a) = Closure ( A → a•A, a/b )
In the I3 State, add all productions beginning with A since "•" is followed by the non-terminal. The I3 Condition, then, becomes
I3= A → a•A, a/b
A → •aA, a/b
A → •b, a/b
Go to (I3, a) = Closure (A → a•A, a/b) = (same as I3)
Go to (I3, b) = Closure (A → b•, a/b) = (same as I4)
I4= Go to (I0, b) = closure ( A → b•, a/b) = A → b•, a/b
I5= Go to (I2, A) = Closure (S → AA•, $) =S → AA•, $
I6= Go to (I2, a) = Closure (A → a•A, $)
"In I6 State, add all productions beginning with A since "•" is followed by the non-terminal. The I6 Condition, then, becomes
I6 = A → a•A, $
A → •aA, $
A → •b, $
Go to (I6, a) = Closure (A → a•A, $) = (same as I6)
Go to (I6, b) = Closure (A → b•, $) = (same as I7)
I7= Go to (I2, b) = Closure (A → b•, $) = A → b•, $
I8= Go to (I3, A) = Closure (A → aA•, a/b) = A → aA•, a/b
I9= Go to (I6, A) = Closure (A → aA•, $) A → aA•, $
If we evaluate, then I3 and I6 things from LR (0) are the same, but they only differ in their lookahead.
I3 = { A → a•A, a/b
A → •aA, a/b
A → •b, a/b
}
I6= { A → a•A, $
A → •aA, $
A → •b, $
}
In their LR (0) objects, I3 and I6 are obviously the same but differ in their lookahead, so we can combine them and name them I36.
I36 = { A → a•A, a/b/$
A → •aA, a/b/$
A → •b, a/b/$
}
The I4 and I7 are the same but they vary only in their look ahead, so we can combine them and call them I47.
I47 = {A → b•, a/b/$}
The I8 and I9 are the same, but they just vary in their outlook, so we can merge them and name them I89.
I89 = {A → aA•, a/b/$}
Drawing DFA :
LALR (1) Parsing Table :
Key takeaway:
- We use the canonical set of LR (1) objects to create the LALR (1) parsing table.
- The parsing of LALR(1) is the same as the parsing of CLR(1), the only distinction in the parsing table.
You may use the LR parser to parse ambiguous grammar. In the parsing table of ambiguous grammars, the LR parser resolves the conflicts (shift/reduce or decrease/reduce) based on certain grammar rules (precedence and/or associativity of operators).
Example :
Let's take the unclear grammar below:
E -> E+E
E -> E*E
E -> id
The precedence and associativity of the grammar operators (+ and *) can be considered to be as follows:
● “+” and “*” both are left-associative,
● Precedence of “*” is higher than the precedence of “+”.
If we use the LALR(1) parser, the LR(1) item DFA will be:
We can see from the LR(1) DFA item that there is shift/reduce conflicts in the I5 and I6 states.
"In I5 and I6, moves on "+ and "*" are both moved and decreased. To settle this dispute, we will use the precedence and associativeness of the operators to decide which phase to preserve and which to discard from the table.
Key takeaway:
- In the parsing table of ambiguous grammars, the LR parser resolves the conflicts based on certain grammar rules.
YACC is an automatic tool that generates the programme for the parser.
YACC stands for Compiler Compiler Yet Another. The software is available under UNIX OS OS
LR parser construction requires a lot of work for the input string to be parsed.
Therefore, to achieve productivity in parsing and data, the procedure must provide automation.
YACC is an LALR parser generator that reports conflicts or uncertainties in the form of error messages (if at all present).
As we addressed YACC in the first unit of this tutorial, to make it more understandable, you can go through the concepts again.
● YACC
As shown in the picture, the standard YACC translator can be represented as
Fig 6: YACC automatic parser generator
Key takeaway:
- YACC is an automatic tool that generates the programme for the parser.
- YACC is an LALR parser generator that reports conflicts or uncertainties in the form of error messages (if at all present).
During the syntax review, we will consider the problem of how to select the products to be used. To this end, for a given grammar G, we use a predictive parsing table.
We use the following algorithm whose key ideas are used to construct this table:
1. Suppose that A → α is output with an in FIRST(α). Then, when the current input symbol is a the parser extends A by α.
2. Complication happens when α = ∈ or α ∗⇒ ∈ . In this scenario, again, we need to be If the current input symbol is in FOLLOW (A), extend A by α
Algorithm :
Input: Grammar G
Output: Parsing table M
Algorithm: Construction of a predictive parsing table
1. For each production A → α of G, do steps 2 and 3
2. For each terminal an in FIRST(α), add A → α to M[A, a]
3. If α ∗⇒ ∈, add A → α to M[A, b] for each terminal b in FOLLOW(A)
If α ∗⇒∈ and $ is in FOLLOW(A), add A → α to M[A, $]
4. Make each undefined entry of M be error
Example :
Grammar G 3 = ( {E, E′, T, T ′, F }, {a, ∗, +, (, )}, P, E) with the set of productions P:
E → TE′ T → FT ′
F → (E) E′ → +TE′
T ′ → ∗FT ′ F → a
E′ → ∈ T ′ → ∈
Parsing table M:
Key takeaway:
- During the syntax review, we will consider the problem of how to select the products to be used.
References:
1. Kenneth Louden,” Compiler Construction”, Cengage Learning.
2. Charles Fischer and Ricard LeBlanc,” Crafting a Compiler with C”, Pearson Education