Writing a simple compiler!
writing-a-simple-compiler
Blogs
Robin Tempert
Blogs
12/16/2025
7 min
0

Writing a simple compiler!

12/16/2025
7 min
0

As developers we write a lot of code. With the use of many tools it will be converted to machine code that can be executed by the CPU.

Personally, I’ve always seen compilers as a kind of magic box. Recently, I did a deep dive to understand how they work, and the core concept isn’t as difficult as you might think.

In 1952, Grace Hopper’s team created the first compiler for the A-0 system. It translated instructions into machine code, similar to assembly today. Today, there is little reason to write your own compiler, as the problem has largely been solved.

Within this article we will write a very simple compiler that converts a simple expression to a program we can execute. The program will convert text into assembly using C#. The assembly will then be converted into a program that can be executed.


The core concept

Programming code is text that is converted into tokens. These tokens are then transformed into a tree structure based on predefined rules, which can subsequently be used to generate machine code.

The text is converted into tokens through lexical analysis, also called a lexer. It separates words, numbers, operators, and similar elements into tokens with specific types. This is typically done by splitting the code using whitespace and word boundaries. Within the lexer, we define the tokens that the language will recognize, for example, whether function declarations use funcfunction, or void.

The tokens produced by the lexer are then used to create an abstract syntax tree (AST). This hierarchical structure represents the programming language. Each node in the tree requires certain values, and invalid values result in errors. This is also where the language’s rules, such as whether curly brackets are required, are enforced.

The AST is then used to generate machine code. There are many approaches to this process. However, in this example we generate assembly code, which can then be converted into machine code. The compiler walks through the AST and emits the necessary assembly instructions.

Our program

For this article we will create a very simple programming language that is only able to add two numbers together.

10 + 5

This will be stored in a variable content.

The lexer

The lexer is responsible for converting source code into tokens. In the example above, we see three tokens: 10+, and 5. The lexer operates in a loop until it reaches the end of the file. Each token is assigned a type, which is later used to implement the language’s logic. These types can be stored in an enum such as SyntaxTokenType.

public enum SyntaxTokenType
{
EndOfFile,

Number,
Plus
...
}

In addition to the type, each token requires extra information, specifically its start and end positions in the source text, which allows us to extract its content. This information can be stored in a class named SyntaxToken.

public sealed class SyntaxToken
{
public SyntaxTokenType Type { get; }
public int StartPosition { get; }
public int EndPosition { get; }

...
}

Using SyntaxToken, we can now construct our lexer class.

public sealed class Lexer
{
private readonly string _content;
private int _position = 0;

private char Current => _content[_position];

public Lexer(string content)
{
_content = content;
}

public SyntaxToken Peek()
{
...
}
}

A lexer ignores all whitespace but uses it to indicate token boundaries. When we start to peek at the next character, we must first consume any whitespace.

public sealed class Lexer
{
...

public SyntaxToken Peek()
{
ConsumeWhiteSpace();

...
}

private void ConsumeWhiteSpace()
{
while (true)
{
if (!Char.IsWhiteSpace(Current))
{
break;
}

_position++;
}
}
}

Once we are positioned at a token, we can determine its type. The first token in our example is a number. After consuming any preceding whitespace, we check if the current character is a digit. If so, we continue reading until we encounter a non-digit character. This forms our first number token.

public sealed class Lexer
{
...

public SyntaxToken Peek()
{
ConsumeWhiteSpace();

var startPosition = _position;

if (Char.IsDigit(Current))
{
while (Char.IsDigit(Current))
{
_position++;
}

return new SyntaxToken(SyntaxTokenType.Number, startPosition, _position);
}

...
}
}

On the next peek, we again consume any whitespace and begin reading the next token, which is +. Similar to the number token, we check if the current character is a + symbol and create a corresponding token.

public sealed class Lexer
{
...

public SyntaxToken Peek()
{
ConsumeWhiteSpace();

var startPosition = _position;

...

if (Current == '+')
{
return new SyntaxToken(SyntaxTokenType.Plus, startPosition, _position);
}

...
}
}

The following token is another number. Since we already implemented number parsing, it will be consumed in the same manner. The last character in the file is a null terminator, which indicates that we have reached the end of the source and the lexer has completed its work.

public sealed class Lexer
{
...

public SyntaxToken Peek()
{
ConsumeWhiteSpace();

var startPosition = _position;

...

if (Current == '\0')
{
return new SyntaxToken(SyntaxTokenType.EndOfFile, startPosition, _position);
}

...
}
}

We can extend the lexer to recognize reserved words in the language. For example, when the lexer encounters the word function, it can generate a SyntaxTokenType.Function.

Abstract syntax tree

The tokens can now be used to generate the Abstract Syntax Tree (AST). This is a tree-like representation of the code. The deeper you go into the tree, the smaller and more specific the representation becomes. Nodes in an AST indicate what they represent, such as a FunctionExpressionMethodCall, etc. At this stage, we also implement most of the programming language logic.


An expression is simply a syntactic element within a language. Every line and statement we write is an expression. First, we create a base class that will serve as the foundation for all AST nodes.
public abstract class ExpressionSyntax;

Next, we can create a BinaryExpression class to represent expressions like 1 + 1. As the name indicates, this expression has two components: a left-hand side, a right-hand side, and an operator.

public sealed class BinaryExpressionSyntax : ExpressionSyntax
{
public SyntaxToken Left { get; }
public SyntaxToken Operator { get; }
public SyntaxToken Right { get; }
}

To keep things simple, the left and right components will initially be SyntaxToken instances. However, they could also be expressions. For example, in the expression 1 + (2 + 3), the right-hand side is itself a binary expression. When creating a syntax class for a function, it will require the function token, an identifier, arguments, and a body.

Syntax parser

We can now construct our AST by iterating through the syntax tokens. This process is similar to the lexer, where we move sequentially through a list of tokens. We continue reading tokens until we encounter the end-of-file token. Since everything in our language is an expression, we attempt to parse each as one.

public sealed class SyntaxParser
{
private readonly List _tokens;
private int _position = 0;

private SyntaxToken Current => _tokens[_position];

public SyntaxPraser(List tokens)
{
_tokens = tokens;
}

public List Parse()
{
var expressions = new List();
while (Current.Kind != SyntaxTokenType.EndOfFile)
{
expressions.Add(ParseExpression());
}

return expressions;
}
}

As our language currently only supports binary expressions, we can parse them directly. We read each syntax token and advance the position to the next one. However, we can also check the type of each syntax token before parsing. For example, if the token is SyntaxTokenType.Function, we can begin parsing a function expression.

public sealed class SyntaxParser
{
...
private SyntaxToken Current => _tokens[_position];
...

public Expression ParseExpression()
{
var left = Current;
_position++;
var operatorToken = Current;
_position++;
var right = Current;

return new BinaryExpressionSyntax(left, operatorToken, right);
}
}

This approach can be extended to support multiple expression types. We can define as many parsing rules as needed — for instance, whether to allow or disallow parentheses, brackets, and other syntactic constructs. The parser reads tokens sequentially until an expression is fully parsed. If an unexpected token is encountered, the parser can return an error and terminate compilation. Such an error constitutes a build-time error.

Compilation

Now that we have our AST, we can use it to generate machine code. There are many approaches to this step. For example, C# compiles into the Common Intermediate Language (IL), which is then translated into machine code, while Java compiles into bytecode first. The advantage of these approaches is that the output is platform-independent.

To keep our example simple, we will generate assembly code directly, which we will then compile into machine code. We will use x86–64 assembly. This process mirrors what a compiler does in general.

public sealed class Compiler
{
private readonly List _expressions;

public Compiler(List expressions)
{
_expressions = expressions;
}

public void Compile()
{
var builder = new StringBuilder();
...
File.WriteAllText("program.asm", ...);

// Code to call nasm and ld
}
}

We can write a compiler class that outputs text into a file called program.asm. The assembly requires a specific structure. Within the .text section, we indicate that the program starts at the _start label. At _start, we first load our two values, add them together, and then exit the program. The result of the addition will be used as the program’s exit code

section .text
global _start

_start:
mov eax, 1 ; First value
mov ebx, 2 ; Second value
add eax, ebx ; Add ebx to eax

; Exit program
mov ebx, eax ; Move result to ebx for exit code
mov eax, 1 ; Set exit flag
int 0x80 ; Systemcall

Which will result into the following machine code.

b8 01 00 00 00          mov    eax, 0x1
bb 02 00 00 00 mov ebx, 0x2
01 d8 add eax, ebx
89 c3 mov ebx, eax
b8 01 00 00 00 mov eax, 0x1
cd 80 int 0x80

In our compiler class we can loop through our expression and generate the assembly from it.

public sealed class Compiler
{
...

public void Compile()
{
var builder = new StringBuilder();

foreach (var expression in _expressions)
{
if (expression is BinaryExpression binaryExpression)
{
builder.Append($"mov eax, {expression.Left.Content}");
builder.Append($"mov ebx, {expression.Right.Content}");
builder.Append(expression.Operator.Type switch
{
SyntaxTokenType.Plus => "add eax, ebx"
});
}
}

var asm = """
...
_start:
{builder}
...
"
"";
File.WriteAllText("program.asm", asm);
}
}

Here, elf64 stands for Executable and Linkable Format, 64-bit, which produces a program.o file. This object file can then be linked into an executable using ld, which resolves any system library dependencies and produces a runnable program. Once executed, the exit code of the program can be retrieved with:

nasm -f elf64 program.asm
ld program.o -o program
./program
echo $?

Adding it all together

We have our lexer, AST and compiler. And can execute the program.

// Lexer
var lexer = new Lexer(content);

var tokens = new List();

SyntaxToken token;
do
{
token = lexer.Peek();
tokens.Add(token);
} while (token.Kind is not SyntaxTokenType.Unknown and not SyntaxTokenType.EndOfFile);

// Parser
var parser = new SyntaxParser(tokens);
var expressions = parser.Parse();

// Compiler
var compiler = new Compiler();
compiler.Compile();

// Execute ./program

Conclusion

That is it! We now have the most basic compiler ever. There is a lot to be desired in error handling, performance, etc. However, these are the basic concepts of a compiler. I have a lot of respect for Grace Hopper’s team, who created the first compiler without the tools we have today.

Reacties
Categorieën