
How Compilers Work - The Problem of Parsing

Parsing is the second stage of compilation; it comes right after lexical analysis. While lexical analysis focuses on breaking the input text down into a stream of tokens, parsing focuses on ensuring that the relationships between those tokens are valid. Lexical analysis is about valid words; parsing is about grammatically correct sentences.

The problem with parsing is that at this point, most discussions move headfirst into theory - and there is quite a bit of it. To make matters worse, it’s quite abstract and quite disconnected from much of what comes afterwards (namely parser generation). So instead of going theoretical, let’s jump straight into implementing a parser by hand and come back to the theory afterwards. Let’s take a look at an example of the Cunning language.

func add
{
	int a, b, sum;
	a = 1;
	b = 2;

	sum = a + b;
}

This is what the programmer sees while writing code. From the point of view of a compiler, the challenge is to ensure that the programmer adheres to proper grammar, and this requires checking the syntax - the grammatical structure of the source. For example, while the following is a valid way to assign a value to a variable:

	a = 1;

This is not:

	1 = a;

In other words, the variable identifier should always come first, followed by the assignment operator ‘=’ and then by a number (an integer in this case) that is assigned to the variable. This rule can be expressed as a pattern - a sort of template to check the sentence against. If the sentence conforms to the template, the grammar is valid. If it does not, it is invalid.
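The rule-as-template idea can be sketched directly before we write any real parser: store the expected token kinds in order, then compare the incoming sequence against them. This is a minimal standalone sketch; the Token names mirror the ones used later in this post, but the values are made up here.

```cpp
#include <cassert>
#include <vector>

// Token kinds, mirroring the names used later in this post.
enum Token { TIdentifier, TOperatorAssign, TLitInteger, TPunctuationEnd };

// The assignment rule expressed as a template: the token kinds
// an assignment must contain, in this exact order.
const std::vector<Token> kAssignmentRule = {
    TIdentifier, TOperatorAssign, TLitInteger, TPunctuationEnd
};

// Returns true if the token sequence matches the template exactly.
bool MatchesRule(const std::vector<Token>& rule, const std::vector<Token>& tokens)
{
    return rule == tokens;
}
```

With this in place, the token sequence for a = 1; matches the template, while the sequence for 1 = a; (literal first) does not.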

Remember, however, that parsing is the second stage of compilation. It sees what is emitted by the lexical analyser and nothing else. The above code arrives as a stream of token values that looks like so:

a : 2  (TIdentifier)
= : 8  (TOperatorAssign)
1 : 10 (TLitInteger)
; : 5  (TPunctuationEnd)

To parse the above stream of tokens, we can write a function that checks if the token stream has the correct tokens in the correct order.

bool IsCorrectAssignment(Lexer& lexer)
{
	// lexer.Get() returns the next token
	if (lexer.Get() != TIdentifier)
		return false;

	if (lexer.Get() != TOperatorAssign)
		return false;

	if (lexer.Get() != TLitInteger)
		return false;

	if (lexer.Get() != TPunctuationEnd)
		return false;

	return true;
}

Now we have a function that can check if the token stream contains the correct grammar for an assignment. We just need to call it with the lexical analyser set to a point just before the assignment statement.
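To see this in action without the full lexical analyser from the previous post, here is a hedged sketch that drives the function with a stand-in lexer replaying a pre-recorded token stream (FakeLexer is an invention for this example; in practice the real Lexer would sit here):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

enum Token { TIdentifier, TOperatorAssign, TLitInteger, TPunctuationEnd, TEOF };

// Stand-in for the real lexical analyser: replays a fixed token stream.
class FakeLexer
{
public:
    explicit FakeLexer(std::vector<Token> tokens) : mTokens(std::move(tokens)) {}
    Token Get() { return mPos < mTokens.size() ? mTokens[mPos++] : TEOF; }

private:
    std::vector<Token> mTokens;
    std::size_t mPos = 0;
};

// Same checking logic as in the post, against the stand-in lexer.
bool IsCorrectAssignment(FakeLexer& lexer)
{
    if (lexer.Get() != TIdentifier)     return false;
    if (lexer.Get() != TOperatorAssign) return false;
    if (lexer.Get() != TLitInteger)     return false;
    if (lexer.Get() != TPunctuationEnd) return false;
    return true;
}
```

Feeding it the token stream for a = 1; succeeds, while the stream for 1 = a; fails at the very first check.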

That’s it, parsing in its most basic form.

No, seriously.

That is the core idea behind parsing - albeit in a very simplistic form. Parsing is higher-order pattern matching: it checks the grammatical correctness of the input, i.e. the syntax. Now let’s build up the code to make it a little more respectable. The first thing we can do is create a method called Accept, which accepts a token if it’s expected and rejects it otherwise.

void Accept(int expected, int actual)
{
	if (expected != actual)
		throw;
}

Now we can use this method to avoid writing ifs all the time.

void IsCorrectAssignment(Lexer& lexer)
{
	Accept(TIdentifier, lexer.Get());
	Accept(TOperatorAssign, lexer.Get());
	Accept(TLitInteger, lexer.Get());
	Accept(TPunctuationEnd, lexer.Get());
}

Beautiful. Now we have a simple way of defining parsing rules. We can accept only the expected tokens, rejecting everything else. The above code will successfully parse a = 1 but not 1 = a. The latter is a compilation error, and we should probably look into giving more constructive feedback rather than crashing and burning - but we will leave that for later.

Generating a Parse Tree

Parsing is not only about accepting or rejecting statements based on grammar. It’s also about building a representation of the program in memory for the later stages of the compilation process to work with. You can think of this as arranging all the tokens returned from the lexical analyser into a structured form.

func add
{
	int a, b, sum;
	a = 1;
	b = 2;

	sum = a + b;
}

One way of looking at the above program is to arrange it into a hierarchy. The function add exists close to the top - close, because there can be multiple functions in Cunning, so you can expect a higher level item above the add function, which we can call the program. Under the add function, multiple statements form the logic. This structure naturally forms a tree, and the tree contains all of the tokens that make up the program. You can walk it in the correct order to regenerate the original program. Let’s start by defining a node for the tree structure, which we’ll call ParseNode.

class ParseNode
{
private:
	Token mToken;
	string mText;
	vector <shared_ptr<ParseNode>> mChildren;

public:
	ParseNode() {...}
	ParseNode(Token token, string text) {...}

	void AddChild(shared_ptr<ParseNode> child)
	{
		mChildren.push_back(child);
	}
};

The class should be fairly self explanatory. Each node stores the token type and any text associated with that token (for example, a token of type TIdentifier will have a text value of the actual identifier name: a, b or sum). We use shared pointers to easily manage the object sharing involved when adding child nodes under the current node. Now let’s edit the Accept function to create the node structure as it checks the token types.

shared_ptr<ParseNode> Accept(Token expected, Token actual, shared_ptr<ParseNode> parent)
{
    if (expected == actual)
    {
        shared_ptr<ParseNode> node(new ParseNode(actual, mLexer.GetText()));
        parent->AddChild(node);
        return node;
    }
    else
    {
        Error(string("Unexpected token: ") + to_string(actual));
        return nullptr; // Unreachable - Error throws
    }
}

Notice a few changes. The Accept method now takes 3 parameters, the new addition being the parent under which to place a new node if the expected token matches the actual token. We’ve switched to using the Token enumeration for better readability. The method returns the newly created node if no error has occurred, and if an error has, it prints out a message before passing out under a bench.

We can use this new function to build the parse tree as we go along in the IsCorrectAssignment procedure, which we really should rename to something more parsy now, like ParseAssignment.

void ParseAssignment(Lexer& lexer, shared_ptr<ParseNode> parent)
{
	Accept(TIdentifier, lexer.Get(), parent);
	Accept(TOperatorAssign, lexer.Get(), parent);
	Accept(TLitInteger, lexer.Get(), parent);
	Accept(TPunctuationEnd, lexer.Get(), parent);
}

Now, when this procedure finishes executing, the parent ParseNode should have 4 ParseNodes under it. But before we go any further, we need to ensure that all assignment statements are correctly parsed, not just simple cases such as a = 1;.

a = 1; // Will parse
b = 2; // Will parse

// Will not parse, but is correct grammar.
sum = a + b;

Let’s assume for the time being that there can only be 2 types of assignment: one that assigns an integer literal to a variable, in the form a = 1, and another that assigns the sum of 2 variables to a third, in the form sum = a + b. This simplifies parsing assignments a great deal for the purpose of understanding things.

void ParseAssignment(shared_ptr<ParseNode> statement, Token token)
{
    Accept(TIdentifier, token, statement);
    Accept(TOperatorAssign, mLexer.Get(), statement);

    // Get the next token and use it to predict
    // what kind of assignment it is.
    token = mLexer.Get();
    if (token == TLitInteger)
    {
        Accept(TLitInteger, token, statement);
    }
    else if (token == TIdentifier)
    {
        Accept(TIdentifier, token, statement);
        Accept(TOperatorPlus, mLexer.Get(), statement);
        Accept(TIdentifier, mLexer.Get(), statement);
    }

    Accept(TPunctuationEnd, mLexer.Get(), statement);
}

The important thing to notice here is the branching of the Accept calls, predicated on the next token. Without this hack, it would be impossible to decide whether the assignment was of a single literal or of the sum of two variables. This kind of parsing is called lookahead parsing, and it plays an important role in parsing algorithms, so stash away this bit of knowledge for later. Also notice that the parent ParseNode has been renamed statement, to reflect the fact that the assignment operation is a statement and appears under a statement node within the parse tree. This follows the natural nesting of the source code.
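In ParseAssignment, lookahead is implemented by consuming the token and passing it around. Another common way to get the same effect is to let the lexer peek at the next token without consuming it. Here is a hedged sketch of such a wrapper (PeekLexer is an invention for this example, not part of the post's Lexer):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

enum Token { TIdentifier, TOperatorAssign, TLitInteger, TOperatorPlus, TPunctuationEnd, TEOF };

// A lexer with one token of lookahead: Peek() inspects the next
// token without consuming it, Get() consumes and returns it.
class PeekLexer
{
public:
    explicit PeekLexer(std::vector<Token> tokens) : mTokens(std::move(tokens)) {}

    Token Peek() const { return mPos < mTokens.size() ? mTokens[mPos] : TEOF; }

    Token Get()
    {
        Token token = Peek();
        if (mPos < mTokens.size())
            mPos++;
        return token;
    }

private:
    std::vector<Token> mTokens;
    std::size_t mPos = 0;
};
```

With Peek() available, a parse function can inspect the upcoming token to choose a rule, then let the chosen rule consume it via Get(). Both styles are equivalent with a single token of lookahead; passing the consumed token around just avoids the extra method.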

The example source as we have written it only contains one other kind of statement - definitions. They take the form int a; or int a, b, c, ... z;. Our parser can now parse all the assignment statements, including ones that add two variables, but it cannot parse the definitions.

func add
{
	int a, b, sum; // Won't parse
	a = 1;
	b = 2;

	sum = a + b;
}

Let’s fix that with the following function.

void ParseDefinition(shared_ptr<ParseNode> statement, Token token)
{
    Accept(TReservedInt, token, statement);
    Accept(TIdentifier, mLexer.Get(), statement);

    token = mLexer.Get();
    while (true)
    {
        if (token == TPunctuationSeperator)
        {
            Accept(TPunctuationSeperator, token, statement);
        }
        else if (token == TIdentifier)
        {
            Accept(TIdentifier, token, statement);
        }
        else if (token == TPunctuationEnd)
        {
            Accept(TPunctuationEnd, token, statement);
            break;
        }
        else
        {
            Error(string("Unexpected token: (") + to_string(token) + ") '" + mLexer.GetText() + "'"); // Throws
        }

        token = mLexer.Get();
    }
}

Again, notice that the function takes a parent of type statement. It also does the same thing as ParseAssignment in that it uses a single lookahead token to predict which parse rule to match the input against. It goes a step further though: it wraps the whole thing in a loop, allowing for as many definitions as a programmer would like to shoehorn into a single statement.

The ParseDefinition and ParseAssignment functions together allow us to parse any kind of statement. But how do you pick the correct function to call if you’re at the start of a statement? If the lexical analyser is at the start of the line int a, b, sum;, then it should call ParseDefinition, while if it’s at the start of the line a = 1; it should call ParseAssignment instead. The solution is to have a separate function decide which Parse... function to call, and since we’re dealing with statements it makes sense to call it ParseStatement.

void ParseStatement(shared_ptr<ParseNode> function, Token token)
{
    shared_ptr<ParseNode> statement(new ParseNode());
    function->AddChild(statement);

    if (token == TReservedInt)
    {
        ParseDefinition(statement, token);
    }
    else if (token == TIdentifier)
    {
        ParseAssignment(statement, token);
    }
}

The parent of the statement is the function, as the function contains all statements - a fact reflected in the naming of its parent ParseNode. Notice that we create a new ParseNode out of nothing: the statement itself is a placeholder that denotes structure, and does not have a token or any text associated with it. The procedure is passed the first token on the line (the reason for this will become apparent momentarily), which it uses as the lookahead token to decide which parse branch to follow. Fairly simple stuff really, and strangely similar to the ParseDefinition and ParseAssignment procedures that it calls. This similarity is by design, a fact that will become evident when you see the procedure that parses a single function.

void ParseFunction(shared_ptr<ParseNode> program, Token token)
{
    shared_ptr<ParseNode> function = Accept(TReservedFunc, token, program);

    Accept(TIdentifier, mLexer.Get(), function);
    Accept(TPunctuationScopeOpen, mLexer.Get(), function);

    token = mLexer.Get();

    while (token != TPunctuationScopeClose)
    {
        ParseStatement(function, token);
        token = mLexer.Get();
    }

    Accept(TPunctuationScopeClose, token, function);
}

The function’s parent is the program. It starts by looking for the func keyword (TReservedFunc token) and accepts the tokens that follow. It then executes a loop where it checks to see if the } character (TPunctuationScopeClose token) is next, denoting the end of the function. If it isn’t, it passes the token to the ParseStatement procedure to continue the parse (which is why the first token inside ParseStatement is taken from the parameter).

When a parser is written by hand, as a set of procedures which call each other, it is called a Handwritten Recursive Descent Parser. Handwritten should really be keyboard-typed…

What dear? Yes I’ve taken my pedantic pills this morning. No there are no more bodies in the backyard. Technically that’s a hatchet, even then the term would be axe murderer.

In a Recursive Descent Parser, the form of the parser follows the form of the language quite closely, descending from the highest level abstraction of the source (the Program) down to the lowest level details inside a single statement. Let’s finish up the parser with the top level ParseProgram procedure.

shared_ptr<ParseNode> ParseProgram()
{
    shared_ptr<ParseNode> program (new ParseNode());

    Token token = mLexer.Get();
    while (token != TEOF)
    {
        ParseFunction(program, token);
        token = mLexer.Get();
    }

    return program;
}

If you now go back up through this text, you’ll notice the scope of each preceding parse function getting smaller and smaller. (Also, you will hear a subliminal message asking you to deposit money into a certain account.) This is the beauty of parsing by recursive descent: its logic and method are transparent to the programmer writing the parser. There is no magic happening under the table (or in tables, ha ha, bottom up LALR, ha ha, wink wink).

The parser will also generate the parse tree, which we can print by adding a few utility methods to ParseNode.

class ParseNode
{
private:
    Token mToken;
    string mText;
    vector <shared_ptr<ParseNode>> mChildren;

public:
    ParseNode() {...}
    ParseNode(Token token, string text) {...}
    void AddChild(shared_ptr<ParseNode> child) {...}

    string ToString()
    {
        string result;
        result += TokenToString(mToken);
        if (!mText.empty())
            result += " '" + mText + "'";

        return result;
    }

    void Print(int indent)
    {
        for (int p = 0; p < indent; p++)
            cout << " ";

        if (mToken != TEOF)
            cout << ToString() << endl;
        else
            cout << "[+]" << endl;

        for (auto& itr : mChildren)
        {
            itr->Print(indent + 3);
        }
    }
};

The resultant parse tree looks like this:

[+]
   TReservedFunc 'func'
      TIdentifier 'add'
      TPunctuationScopeOpen '{'
      [+]
         TReservedInt 'int'
         TIdentifier 'a'
         TPunctuationSeperator ','
         TIdentifier 'b'
         TPunctuationSeperator ','
         TIdentifier 'sum'
         TPunctuationEnd ';'
      [+]
         TIdentifier 'a'
         TOperatorAssign '='
         TLitInteger '1'
         TPunctuationEnd ';'
      [+]
         TIdentifier 'b'
         TOperatorAssign '='
         TLitInteger '2'
         TPunctuationEnd ';'
      [+]
         TIdentifier 'sum'
         TOperatorAssign '='
         TIdentifier 'a'
         TOperatorPlus '+'
         TIdentifier 'b'
         TPunctuationEnd ';'
      TPunctuationScopeClose '}'

Notice that the resultant parse tree contains every character from the source code, with the exception of white space. This is a property of the parse tree: given the parse tree, you can regenerate the source code fairly easily.
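To make that claim concrete, here is a hedged sketch of such a regeneration pass over a stripped-down ParseNode (the GetText and GetChildren accessors are additions for this example): walk the tree depth-first and emit the text of every node that has any.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stripped-down ParseNode; GetText and GetChildren are accessors
// added here for illustration, not part of the class in the post.
class ParseNode
{
public:
    ParseNode() {}
    explicit ParseNode(std::string text) : mText(std::move(text)) {}

    void AddChild(std::shared_ptr<ParseNode> child) { mChildren.push_back(child); }
    const std::string& GetText() const { return mText; }
    const std::vector<std::shared_ptr<ParseNode>>& GetChildren() const { return mChildren; }

private:
    std::string mText;
    std::vector<std::shared_ptr<ParseNode>> mChildren;
};

// Depth-first walk, emitting each node's text followed by a space.
// Separating tokens with spaces loses the original layout, but the
// regenerated code is still valid Cunning.
std::string Regenerate(const std::shared_ptr<ParseNode>& node)
{
    std::string result;
    if (!node->GetText().empty())
        result += node->GetText() + " ";
    for (const auto& child : node->GetChildren())
        result += Regenerate(child);
    return result;
}
```

Building the statement subtree for a = 1; by hand and passing it to Regenerate yields the text "a = 1 ; " - the original statement back, minus the exact whitespace.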

Finally, let’s clean up the parser and put it in a class for reuse.

class Parser
{
private:
	Lexer mLexer;

	void Error(string error)
	{
		cout << "Error: " << error << endl;
		throw;
	}

	shared_ptr<ParseNode> Accept(Token expected, Token actual, shared_ptr<ParseNode> parent)
	{
		if (expected == actual)
		{
			shared_ptr<ParseNode> node(new ParseNode(actual, mLexer.GetText()));
			parent->AddChild(node);
			return node;
		}
		else
		{
			Error(string("Unexpected token: ") + to_string(actual));
			return nullptr; // Unreachable - Error throws
		}
	}

	void ParseDefinition(shared_ptr<ParseNode> statement, Token token)
	{
		Accept(TReservedInt, token, statement);
		Accept(TIdentifier, mLexer.Get(), statement);

		token = mLexer.Get();
		while (true)
		{
			if (token == TPunctuationSeperator)
			{
				Accept(TPunctuationSeperator, token, statement);
			}
			else if (token == TIdentifier)
			{
				Accept(TIdentifier, token, statement);
			}
			else if (token == TPunctuationEnd)
			{
				Accept(TPunctuationEnd, token, statement);
				break;
			}
			else
			{
				Error(string("Unexpected token: (") + to_string(token) + ") '" + mLexer.GetText() + "'"); // Throws
			}

			token = mLexer.Get();
		}
	}

	void ParseAssignment(shared_ptr<ParseNode> statement, Token token)
	{
		Accept(TIdentifier, token, statement);
		Accept(TOperatorAssign, mLexer.Get(), statement);

		token = mLexer.Get();
		if (token == TLitInteger)
		{
			Accept(TLitInteger, token, statement);
		}
		else if (token == TIdentifier)
		{
			Accept(TIdentifier, token, statement);
			Accept(TOperatorPlus, mLexer.Get(), statement);
			Accept(TIdentifier, mLexer.Get(), statement);
		}

		Accept(TPunctuationEnd, mLexer.Get(), statement);
	}

	void ParseStatement(shared_ptr<ParseNode> function, Token token)
	{
		shared_ptr<ParseNode> statement(new ParseNode());
		function->AddChild(statement);

		if (token == TReservedInt)
		{
			ParseDefinition(statement, token);
		}
		else if (token == TIdentifier)
		{
			ParseAssignment(statement, token);
		}
	}

	void ParseFunction(shared_ptr<ParseNode> program, Token token)
	{
		shared_ptr<ParseNode> function = Accept(TReservedFunc, token, program);

		Accept(TIdentifier, mLexer.Get(), function);
		Accept(TPunctuationScopeOpen, mLexer.Get(), function);

		token = mLexer.Get();

		while (token != TPunctuationScopeClose)
		{
			ParseStatement(function, token);
			token = mLexer.Get();
		}

		Accept(TPunctuationScopeClose, token, function);
	}

public:
	Parser(Lexer& lexer)
		: mLexer(lexer)
	{
	}


    shared_ptr<ParseNode> ParseProgram()
    {
        shared_ptr<ParseNode> program (new ParseNode());

        Token token = mLexer.Get();
        while (token != TEOF)
        {
            ParseFunction(program, token);
            token = mLexer.Get();
        }

        return program;
    }
};

The above parser will almost correctly parse the Cunning source code.

func add
{
    int a, b, sum;
    a = 1;
    b = 2;

    sum = a + b;
}

I say almost because there is an error in the parser. It will accept incorrect Cunning code. See if you can spot it.

This is one of the downsides of handwritten recursive descent parsers: they are difficult to verify as correct. There is no way to be certain that the parser will accept the source language and only the source language. This is why many parsers are generated by tools, and grammars are defined formally rather than through example as we did here.
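For reference, the grammar our parser is meant to implement can be written down formally. Here is one possible rendering in rough EBNF (a notation sketch for illustration; formal definitions like this are what parser generator tools consume):

```ebnf
program    = { function } ;
function   = "func" identifier "{" { statement } "}" ;
statement  = definition | assignment ;
definition = "int" identifier { "," identifier } ";" ;
assignment = identifier "=" ( integer | identifier "+" identifier ) ";" ;
```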

This is a post in the Compilers series.