
How Compilers Work - Lexical Analysis

Compilers are wonderful, if somewhat underappreciated, programs. Like the foundations of great buildings, they form the latticework on which modern computing is built. Unfortunately, they have been a bit neglected of late. Writing code tends to be more about innovating and creating exciting new businesses than about the pleasure of knowing how things work, how to break them, and how to fix them again. The modern-day queen of compilers, Grace Hopper[1], started life by taking clocks apart and trying to put them back together.

There is a joy that comes from the act of creation that somehow gets lost in the details of building business models, but I digress.

The process of taking human-readable text and converting it into a form that a machine can understand and execute efficiently is not as complex as you might expect. Granted, there are a lot of details, and some of those details are fairly, eh, detailed themselves, but the general process isn't hard to understand or implement. Take, for example, the following play language (which I henceforth name cunning).

func add
{
    int a, b, sum;
    a = 1;
    b = 2;

    sum = a + b;
}

The above language is made with people in mind. It's meant to be easy to remember, easy to reason in and easy to interpret. From the point of view of a machine, however, it's just a stream of bytes with no special meaning.
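To see just how meaning-free that stream is, here is a minimal sketch (not part of the compiler itself) that prints the raw byte values a machine reads from the source file. The file name add.cun is a hypothetical name for the cunning code above:

#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    // add.cun is a hypothetical file holding the cunning source above.
    ifstream source("add.cun");
    int c;

    // Print each byte's numeric value; "func" comes out as 102 117 110 99.
    while ((c = source.get()) != char_traits<char>::eof())
        cout << c << ' ';
    cout << endl;
}

Imbuing the machine with a sense of meaning is the primary task of a compiler. The first step in this process is lexical analysis.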

Lexical Analysis

Lexical analysis is the process of grouping the characters of a programming language into higher-order units, units that we would call "words" in everyday language. During lexical analysis, those words are called tokens. Take a look at the code again.

func add
{
    int a, b, sum;
    a = 1;
    b = 2;

    sum = a + b;
}

We can see words like func, add and int. We can also see single-character words like a, = and +. While these are all tokens, they are fundamentally different from each other.

func is a word that signals that the next bit of code defines a function.

add, a, b and sum are all names of things, be they functions or variables.

{, = and + have special meanings such as the start of scope, assignment and addition.

During the lexical analysis process, the compiler concentrates on creating tokens that make sense. Sometimes a token represents something with a specific, constant spelling, like func. Other times a token represents something that can take a range of values, like the variable names count or customer, or the literal values 42 or "Zaphod". Creating a lexical analyser isn't difficult if you know what you're looking for. The first thing to do is to create a list of token types you expect to find. An enum fills this requirement fairly decently.

    enum Token
    {
        TEOF,                    /* Special token for end of source file */
        TReservedFunc,           /* Reserved keyword: func */
        TIdentifier,             /* All identifiers: names of variables and functions */
        TReservedInt,            /* Reserved keyword: int */
        TPunctuationSeparator,   /* , */
        TPunctuationEnd,         /* ; */
        TPunctuationScopeOpen,   /* { */
        TPunctuationScopeClose,  /* } */
        TOperatorAssign,         /* = */
        TOperatorPlus,           /* + */
        TLitInteger              /* Literal: any integer, 1795 for example */
    };

Now that we have tokens, it's time to match them against the contents of the cunning source. You could use regular expressions for this process, but that route is quite complex in itself, as the sketch below suggests.
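For contrast, here is a minimal sketch of the regular-expression route using std::regex. The two patterns are my assumptions about what an integer literal and a name look like in cunning; the lexer built in this post does not use them:

#include <iostream>
#include <regex>
#include <string>

using namespace std;

int main()
{
    // Hypothetical patterns for two of cunning's token classes.
    const regex integerPattern("[0-9]+");
    const regex namePattern("[A-Za-z][A-Za-z0-9]*");

    const string words[] = { "sum", "42", "func" };
    for (const string& word : words)
    {
        if (regex_match(word, integerPattern))
            cout << word << " : integer literal" << endl;
        else if (regex_match(word, namePattern))
            cout << word << " : name or reserved word" << endl;
    }
}

Even this small sketch hints at the bookkeeping a full regex-driven lexer needs: you still have to decide which pattern wins and how much input each match consumes. A much clearer alternative is an if block and some hand-written scanning code. First we handle the more complicated cases: names, reserved words and literals.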

        mText.clear();
        
        // Read characters as int so end of file can be detected reliably.
        int c = mSource.get();

        // Skip whitespace between tokens before checking for end of file.
        while (isspace(c))
            c = mSource.get();

        if (c == char_traits<char>::eof())
            return TEOF;
                            
        if (isdigit(c))
        {
            mText += static_cast<char>(c);
            
            // peek() looks at the next character without consuming it, so
            // the loop only consumes digits that belong to this literal.
            while (isdigit(mSource.peek()))
                mText += static_cast<char>(mSource.get());

            return TLitInteger;
        }
        else if (isalpha(c))
        {
            mText += static_cast<char>(c);
            
            // Names start with a letter and may continue with letters or digits.
            while (isalnum(mSource.peek()))
                mText += static_cast<char>(mSource.get());

            // Reserved words are singled out by explicit comparison.
            if (mText == "func")
                return TReservedFunc;
            else if (mText == "int")
                return TReservedInt;
            else
                return TIdentifier;
        }

The code is fairly self-evident. Integer literals consist entirely of digits, while names and reserved words must begin with a letter and may contain digits after that; explicit string comparisons single out the reserved words. The one subtlety is the get/peek pair: peek looks at the next character without consuming it, so each loop only consumes characters that actually belong to the current token (see the sketch below).
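Here is that pattern in isolation, a minimal sketch using a std::istringstream as a stand-in for the source file:

#include <cctype>
#include <iostream>
#include <sstream>
#include <string>

using namespace std;

int main()
{
    istringstream source("sum42+");
    string text;

    // get() consumes the first character of the token: 's'.
    text += static_cast<char>(source.get());

    // peek() inspects the next character without consuming it, so the
    // loop stops at '+' and leaves it in the stream for the next token.
    while (isalnum(source.peek()))
        text += static_cast<char>(source.get());

    cout << text << endl;   // prints "sum42"
}

Punctuation, on the other hand (like { and }), is simpler to handle.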

        // Single-character lexical units
        mText += static_cast<char>(c);
        if (c == '{')
            return TPunctuationScopeOpen;
        else if (c == '}')
            return TPunctuationScopeClose;
        else if (c == ',')
            return TPunctuationSeparator;
        else if (c == ';')
            return TPunctuationEnd;
        else if (c == '=')
            return TOperatorAssign;
        else if (c == '+')
            return TOperatorPlus;
        
        throw exception();

Notice that at the end we throw an exception if the text doesn't match any token that we look for? That is a compiler error that we will eventually emit from the lexical analysis stage; for now we simply throw a bare, messageless exception.
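If you want the error to carry some context already, one possible shape is sketched below. The name LexerError and its message are illustrative assumptions; nothing else in this post depends on them:

#include <stdexcept>
#include <string>

// A hypothetical error type the lexer could throw instead of a bare
// exception, recording the character it could not make sense of.
class LexerError : public std::runtime_error
{
public:
    explicit LexerError(char offending)
        : std::runtime_error(std::string("unexpected character: ") + offending)
    {
    }
};

Now you can put it all together into a class called Lexer and call it from the main function to tokenize our cunning code. We also include a function that returns the text of the current token; we will need it at times during the later stages of compilation.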

#include <cctype>
#include <exception>
#include <fstream>
#include <iostream>
#include <string>

using namespace std;

// The Token enum from earlier goes here, above the class.

class Lexer
{
private:
    
    ifstream& mSource;
    string mText;
public:
    
    Lexer(ifstream& source)
        : mSource(source)
    {
    }

    Token Get()
    {
        mText.clear();
        
        // Read characters as int so end of file can be detected reliably.
        int c = mSource.get();

        // Skip whitespace between tokens before checking for end of file.
        while (isspace(c))
            c = mSource.get();

        if (c == char_traits<char>::eof())
            return TEOF;
                            
        if (isdigit(c))
        {
            mText += static_cast<char>(c);
            
            // Only consume characters that belong to the current token.
            while (isdigit(mSource.peek()))
                mText += static_cast<char>(mSource.get());

            return TLitInteger;
        }
        else if (isalpha(c))
        {
            mText += static_cast<char>(c);
            
            while (isalnum(mSource.peek()))
                mText += static_cast<char>(mSource.get());

            if (mText == "func")
                return TReservedFunc;
            else if (mText == "int")
                return TReservedInt;
            else
                return TIdentifier;
        }

        // Single-character lexical units
        mText += static_cast<char>(c);
        if (c == '{')
            return TPunctuationScopeOpen;
        else if (c == '}')
            return TPunctuationScopeClose;
        else if (c == ',')
            return TPunctuationSeparator;
        else if (c == ';')
            return TPunctuationEnd;
        else if (c == '=')
            return TOperatorAssign;
        else if (c == '+')
            return TOperatorPlus;
        
        throw exception();
    }

    string GetText()
    {
        return mText;
    }
};

int main(int argc, char** argv)
{
    // Guard against a missing file argument instead of reading argv[1] blindly.
    if (argc < 2)
    {
        cerr << "usage: " << argv[0] << " <source file>" << endl;
        return 1;
    }

    ifstream source(argv[1]);
    Lexer lex(source);
    int tok;
        
    while ((tok = (int)lex.Get()) != TEOF) {
        cout << lex.GetText() << " : " << tok << endl;
    }
}

Compiling the program and running it over the cunning source produces the following output in the terminal, where each number is the integer value assigned to the corresponding enum item (TEOF is 0, TReservedFunc is 1, and so on).

func : 1
add : 2
{ : 6
int : 3
a : 2
, : 4
b : 2
, : 4
sum : 2
; : 5
a : 2
= : 8
1 : 10
; : 5
b : 2
= : 8
2 : 10
; : 5
sum : 2
= : 8
a : 2
+ : 9
b : 2
; : 5
} : 7

And that’s it. The first phase of compiling the cunning language is now complete.

[1]

“Amazing” Grace Hopper is widely considered to have built the first modern compiler, though her work was preceded by Konrad Zuse’s Plankalkül (“plan calculus”). Zuse would have gone further if not for the Second World War and its effects on his work. Namely, the British bombed it out of existence.

This is a post in the Compilers series.