Skip to main content
main-content

About this book

The second edition of this textbook has been fully revised and adds material about loop optimisation, function call optimisation and dataflow analysis. It presents techniques for making realistic compilers for simple programming languages, using techniques that are close to those used in "real" compilers, albeit in places slightly simplified for presentation purposes. All phases required for translating a high-level language to symbolic machine language are covered, including lexing, parsing, type checking, intermediate-code generation, machine-code generation, register allocation and optimisation, interpretation is covered briefly.

Aiming to be neutral with respect to implementation languages, algorithms are presented in pseudo-code rather than in any specific programming language, but suggestions are in many cases given for how these can be realised in different language flavours.

Introduction to Compiler Design is intended for an introductory course in compiler design, suitable for both undergraduate and graduate courses depending on which chapters are used.

Table of Contents

Chapter 1. Lexical Analysis

A lexical analyser, also called a lexer or scanner, will as its input take a string of individual letters and divide this string into word-like entities called tokens. Additionally, it will filter out whatever separates the tokens (the so-called white-space), i.e, lay-out characters (spaces, newlines etc.) and comments. For lexical analysis, specifications are traditionally written using regular expressions: An algebraic notation for describing sets of strings. The generated lexers are in a class of extremely simple programs called finite automata. This chapter will describe regular expressions and finite automata, their properties and how regular expressions can be converted to finite automata. Finally, we discuss some practical aspects of lexer generators.

Torben Ægidius Mogensen

Chapter 2. Syntax Analysis

Where lexical analysis splits a text into tokens, the purpose of syntax analysis (also known as parsing) is to recombine these tokens. Not back into a list of characters, but into something that reflects the structure of the text. This “something” is typically a data structure called the syntax tree of the text. As the name indicates, this is a tree structure. The leaves of this tree are the tokens found by the lexical analysis, and if the leaves are read from left to right, the sequence is the same as in the input text. Hence, what is important in the syntax tree is how these leaves are combined to form the structure of the tree, and how the interior nodes of the tree are labelled. In addition to finding the structure of the input text, the syntax analysis must also reject invalid texts by reporting syntax errors. As syntax analysis is less local in nature than lexical analysis, more advanced methods are required. We, however, use the same basic strategy: A notation suitable for human understanding and algebraic manipulation is transformed into a machine-like low-level notation suitable for efficient execution. This process is called parser generation.

Torben Ægidius Mogensen

Chapter 3. Scopes and Symbol Tables

An important concept in programming languages is the ability to name items such as variables, functions and types. Each such named item will have a declaration, where the name is defined as a synonym for the item. This is called binding. Each name will also have a number of uses, where the name is used as a reference to the item to which it is bound. Often, the declaration of a name has a limited scope: a portion of the program where the name will be visible. Such declarations are called local declarations, whereas a declaration that makes the declared name visible in the entire program is called global. It may happen that the same name is declared in several nested scopes. In this case, it is normal that the declaration closest to a use of the name will be the one that defines that particular use. A compiler will need to keep track of names and the items these are bound to, so that any use of a name will be attributed correctly to its declaration. This is typically done using a symbol table.

Torben Ægidius Mogensen

Chapter 4. Interpretation

The simplest way to execute a program is interpretation. Interpretation is done by a program call ed an interpreter, which takes the abstract syntax tree of a program and executes it by inspecting the syntax tree to see what needs to be done. This is similar to how a human evaluates a mathematical expression: We insert the values of variables in the expression and evaluate it bit by bit, starting with the innermost parentheses and moving out until we have the result of the expression. We can then repeat the process with other values for the variables. There are some differences, however. Where a human being will copy the text of the formula with variables replaced by values, and then write a sequence of more and more reduced copies of a formula until it is reduced to a single value, an interpreter will keep the formula (or, rather, the abstract syntax tree of an expression) unchanged and use a symbol table to keep track of the values of variables. Instead of reducing a formula, the interpreter is a function that takes an abstract syntax tree and a symbol table as arguments and returns the value of the expression represented by the abstract syntax tree. The function can call itself recursively on parts of the abstract syntax tree to find the values of subexpressions, and when it evaluates a variable, it can look its value up in the symbol table. This process can be extended to also handle statements and declarations, but the basic idea is the same: A function takes the abstract syntax tree of the program and, possibly, some extra information about the context (such as a symbol table or the input to the program) and returns the output of the program.

Torben Ægidius Mogensen

Chapter 5. Type Checking

LexingType checkingLexing andparsingParsing will reject many texts as not being correct programs. However, many languages have well-formedness requirements that can not be handled exclusively by the techniques seen so far. These requirements can, for example, be static type correctness or a requirement that pattern-matching or case-statements are exhaustive.

Torben Ægidius Mogensen

Chapter 6. Intermediate-Code Generation

The final goal of a compiler is to get programs written in a high-level languageLanguagehigh-level to run on a computer. This means that, eventually, the program will have to be expressed as machine codeMachine code that can run on the computer. This does not mean that we need to translate directly from the high-level abstract syntax to machine code. Many compilers use a medium-level language as a stepping-stone between the high-level language and the very low-level machine code. Such stepping-stone languages are called intermediate codeIntermediate code.

Torben Ægidius Mogensen

Chapter 7. Machine-Code Generation

The intermediate languageIntermediate language we have used in Chap. 6 is quite low-level and similar to the type of machine code you can find on modern RISCRISC processors, with a few exceptions.

Torben Ægidius Mogensen

Chapter 8. Register Allocation

When generating intermediate codeIntermediate code in Chap. 6, we have freely used as many variables as we found convenient.In Chap. 7, we have simply translated variables in the intermediate language one-to-one into registersRegister in the machine language. Processors, however, do not have an unlimited number of registers, so we need register allocation to handle this conflict. The purpose of register allocationRegister allocation is to map a large number of variables into a small(ish) number of registers. This can often be done by letting several variables share a single register, but sometimes there are simply not enough registers in the processor. In this case, some of the variables must be temporarily stored in memory. This is called spillingSpilling

Torben Ægidius Mogensen

Chapter 9. Functions

InFunction calls Chap. 6 we have shown how to translate the body of a single function. Function calls and returns were left (mostly) untranslated by using the CALL and RETURN instructions in the intermediate code. Nor did we in Chap. 7 show how these instructions should be translated.

Torben Ægidius Mogensen

Chapter 10. Data-Flow Analysis and Optimisation

Most compilers perform some optimisations on the code that generated by the main phases of the compiler.

Torben Ægidius Mogensen

Chapter 11. Optimisations for Loops

Since programs tend to spend most of their time in loops, it is worthwhile to study optimisations specific for loops. We will in this chapter look at three specific optimisations for loops: Code hoistingCode hoisting, which moves loop-invariant code out of a loop, memory prefetchingMemory prefetching, which improves cache utilisation, and incrementalisationIncrementalisation, which replaces multiplication in a loop by addition.

Torben Ægidius Mogensen
Additional information