Skip to main content

About this book

This textbook is intended for an introductory course on Compiler Design, suitable for use in an undergraduate programme in computer science or related fields.

Introduction to Compiler Design presents techniques for making realistic, though non-optimizing compilers for simple programming languages using methods that are close to those used in "real" compilers, albeit slightly simplified in places for presentation purposes. All phases required for translating a high-level language to machine language is covered, including lexing, parsing, intermediate-code generation, machine-code generation and register allocation. Interpretation is covered briefly.

Aiming to be neutral with respect to implementation languages, algorithms are presented in pseudo-code rather than in any specific programming language, and suggestions for implementation in several different language flavors are in many cases given. The techniques are illustrated with examples and exercises.

The author has taught Compiler Design at the University of Copenhagen for over a decade, and the book is based on material used in the undergraduate Compiler Design course there.

Additional material for use with this book, including solutions to

selected exercises, is available at

Table of Contents

Chapter 2. Syntax Analysis

Where lexical analysis splits the input into tokens, the purpose of syntax analysis (also known as parsing) is to recombine these tokens. Not back into a list of characters, but into something that reflects the structure of the text. This “something” is typically a data structure called the syntax tree of the text. As the name indicates, this is a tree structure. The leaves of this tree are the tokens found by the lexical analysis, and if the leaves are read from left to right, the sequence is the same as in the input text. Hence, what is important in the syntax tree is how these leaves are combined to form the structure of the tree and how the interior nodes of the tree are labelled. In addition to finding the structure of the input text, the syntax analysis must also reject invalid texts by reporting syntax errors. As syntax analysis is less local in nature than lexical analysis, more advanced methods are required. We, however, use the same basic strategy: A notation suitable for human understanding is transformed into a machine-like low-level notation suitable for efficient execution. This process is called parser generation.
Torben Ægidius Mogensen

Chapter 4. Interpretation

The simplest way to execute a program is interpretation. Interpretation is done by a program called an interpreter, which takes the abstract syntax tree of a program and executes it by inspecting the syntax tree to see what needs to be done. This is similar to how a human evaluates a mathematical expression: We insert the values of variables in the expression and evaluate it bit by bit, starting with the innermost parentheses and moving out until we have the result of the expression. We can then repeat the process with other values for the variables. There are some differences, however. Where a human being will copy the text of the formula with variables replaced by values and then write a sequence of more and more reduced copies of a formula until it is reduced to a single value, an interpreter will keep the formula (or, rather, the abstract syntax tree of an expression) unchanged and use a symbol table to keep track of the values of variables. Instead of reducing a formula, the interpreter is a function that takes an abstract syntax tree and a symbol table as arguments and returns the value of the expression represented by the abstract syntax tree. The function can call itself recursively on parts of the abstract syntax tree to find the values of subexpressions, and when it evaluates a variable, it can look its value up in the symbol table. This process can be extended to also handle statements and declarations, but the basic idea is the same: A function takes the abstract syntax tree of the program and, possibly, some extra information about the context (such as a symbol table or the input to the program) and returns the output of the program.
Torben Ægidius Mogensen

Chapter 7. Machine-Code Generation

The intermediate language we have used in Chap. 6 is quite low-level and similar to the type of machine code you can find on modern RISC processors, with a few exceptions: We have used an unbounded number of variables, where a processor will have a bounded number of registers, we have used high-level instructions for function definitions, calls and return, we have assumed that any constant can be an operand to an arithmetic instruction. Typically, RISC processors allow only small constants as operands, and in the intermediate language, the IF-THEN-ELSE instruction has two target labels, where, on most processors, the conditional jump instruction has only one target label, and simply falls through to the next instruction when the condition is false. The problem of mapping a large set of variables to a small number of registers is handled by register allocation, as explained in Chap. 8. Functions are treated in Chap. 9. We will look at the remaining two problems in this chapter.
Torben Ægidius Mogensen

Chapter 9. Functions

In Chap. 6 we have shown how to translate the body of a single function. Function calls and returns were left (mostly) untranslated by using the CALL and RETURN instructions in the intermediate code. Nor did we in Chap. 7 show how these instructions should be translated. We will, in this chapter, remedy these omissions. We will initially assume that all variables are local to the function that access them and that parameters are call-by-value, meaning that the value of an argument expression is passed to the called function. This is the default parameter-passing mechanism in most languages, and in many languages (e.g, C or SML) it is the only one. A single procedure body uses (in most languages) a finite number of variables. We have seen in Chap. 8 that we can map these variables into a (possibly smaller) set of registers. A program that uses recursive procedures or functions may, however, use an unbounded number of variables, as each recursive invocation of the function has its own set of variables, and there is no bound on the recursion depth. We can not hope to keep all these variables in registers, so we will use memory for some of these. The basic idea is that only variables that are local to the active (most recently called) function will be kept in registers. All other variables will be kept in memory. When a function is called, all the live variables of the calling function (which we will refer to as the caller) need to be stored in memory so the registers will be free for use by the called function (the callee). When the callee returns, the stored variables are loaded back into registers. We will use a stack for this storing and loading, pushing register contents on the stack when they must be saved and popping them back into registers when they must be restored.
Torben Ægidius Mogensen
Additional information