The goal is to create a simple C compiler. The compiler should be able to read a basic C file, convert it into NASM assembly, and generate an executable file from it.
C is a straightforward language, which is why the compiler was chosen to be a C compiler.
Through this project, I will learn how a compiler is structured and gain a deeper understanding of the theory behind it. Additionally, I will become familiar with and use assembly language. Patterns like Lexer and Parser will be applied, giving me practical experience.
The compiler will be developed according to the Open-Closed Principle, allowing for incremental extensions over time. It should be able to output strings without preprocessor directives and handle basic integer calculations, as well as initialize and use variables. A semantic analyzer and optimizer will not be implemented initially. However, calculations that don’t involve variables will be computed directly.
The compiler is being developed using .NET 6.0 and C#. I chose the .NET framework and C# because I have the most experience with them, and the project itself is already very complex. However, I am aware that a functional programming language (e.g., Haskell) would likely be more suitable.
Lunarvim is my preferred development environment on the Linux distribution EndeavourOS. Since I work on this project in my free time, it was advantageous to stay in my familiar environment. This allowed me to focus entirely on the complexity of the project.
The lexer analyzes source code and divides it into tokens.
To Do:
Implement a lex
method.
Input: C source code Output: List of Tokens List<Token>
Structure:
There is a Token
class that includes the following elements:
-
Two enumerations:
- TokenType
- LiteralType
- TokenType
-
Three attributes:
- TokenType Type
- string Value
- LiteralType Literal
- TokenType Type
-
Two Enumerations:
- TokenType: {Int, Return, Literal, Semicolon, …}
- LiteralType: {StringLiteral, IntegerLiteral, …}
- TokenType: {Int, Return, Literal, Semicolon, …}
-
Example: For the statement
Int a = 5;
- Type = Identifier
- Value = “a”
- Literal = IntegerLiteral
- Type = Identifier
Lexical Elements
Here are all the tokens that this lexer can recognize, along with the regular expressions that define each one:
Term | Definition | Example |
---|---|---|
Keyword | All lowercase letters used for reserved keywords (e.g., basic data types, if-else statements). | int, return, ... |
Identifier | A sequence of non-numeric characters (both uppercase and lowercase) including underscores (e.g., function and variable names). | [a-zA-Z], [0-9] |
Constant | A constant holds a value (either a string or an integer). | string-literal, integer-literal |
String Literal | A sequence of characters enclosed in double quotation marks. | "[a-zA-Z]", "[0-9]", ... |
Integer Literal | A sequence of numeric characters. | [0-9] |
Operator | Used for arithmetic operations. | arithmetic |
Punctuation | Symbols used for grouping and assignment. | (), =, :, ... |
Example: int variableName = 5;
- Keyword: int
- Identifier: variableName
- Punctuator: =
- Constant: Integer Literal: 5
- Punctuator: ;
The parser processes the token list to create an Abstract Syntax Tree (AST) and performs a partial syntax check. This verification occurs automatically, as, for instance, a variable name is expected after a data type.
To Do:
Implement a pars
method.
Input: List of Tokens List<Token>
Output: Abstract Syntax Tree (AST)
Structure:
The program consists of two main classes: Node
and ExpressionTree
, along with their subclasses.
Class: Node
The Node class represents the nodes that form the Abstract Syntax Tree (AST).
-
Constructor:
- This constructor checks whether the sequence of tokens represents an expression or a statement, and initializes the
Type
andTokens
variables accordingly.
- This constructor checks whether the sequence of tokens represents an expression or a statement, and initializes the
-
Attributes:
- NodeType Type
- List Tokens
- Node Left
- Node Right
- NodeType Type
-
Enum:
NodeType {Program, FuncDecl, Statement, IntegerExpression, StringExpression}
Subclass: ExpressionNode
This subclass creates string
or integer literal
objects from the ExpressionTree
class and constructs an AST for handling those expressions.
Subclass: StatementNode
This subclass is designed to handle statements such as return
and if-else
. However, its functionality will not be implemented in this project. It exists to support simple statements like return 0;
.
Class: ExpressionTree
- Enum:
OperatorType {Addition, Subtraction, Multiplication, Division}
Subclass: IntegerLiteralExpressionTree
This subclass recursively forms an equation AST and implements the method TreeNodeSolver()
, which computes equations that do not contain variables directly.
- Attributes:
- OperatorType Operand
- bool IsOperator
- string Value
- bool IsValue
- int Precedence
- bool TreeGotOptimized
- OperatorType Operand
Subclass: StringLiteralExpressionTree
This subclass constructs an AST with a node that contains the string being processed.
- Attribute:
- string Value
- string Value
AST:
A data structure that represents the structure of a program (or code snippet). The Abstract Syntax Tree (AST) should be rooted in a program node and trigger an error in the event of invalid syntax. The left nodes form the main trunk that runs through the program, while the right nodes represent branches (expressions that are handled by the ExpressionTree class
).
The code generator translates an Abstract Syntax Tree (AST) into assembly code. During the recursive traversal of the AST nodes, each node is passed to its corresponding translation function.
To Do:
Implement a generate
method.
Input: AST Output: Assembly Code
Structure:
There are two main classes: AssemblyGenerator
and Section
.
AssemblyGenerator
The AssemblyGenerator```` class traverses the AST (Abstract Syntax Tree) and organizes all
StringBuilderobjects, which are populated by the
Section``` class, into a complete assembly code output. It also collects all variables to ensure they are correctly initialized in the appropriate sections.
The newLine
string helps generalize the code and makes it easier to insert line breaks.
- Attributes:
- string newLine
- List stringVariables
- Dictionary<string, string> integerVariables
- StringBuilder SECTIONdata
- StringBuilder SECTIONtextTop
- StringBuilder SECTIONtextBody
- StringBuilder SECTIONbss
- string newLine
Section
The Section
class handles each node
passed to it by the AssemblyGenerator
, along with the corresponding StringBuilder
. The respective StringBuilder
is then used to append the newly generated assembly code.
- Attribute:
newLine
The following C code served as the template for testing:
int main(){
int a = 6;
int b = 2 + 3;
printf("ThisIsAString");
int c = 3 + (8 - 2) * 3;
int d = 3 + (a - b) * 3;
printf("AnotherString");
return 0;
}
This code is constantly being modified.
Additionally, an assembler code file has been generated from the existing test code to serve as a reference. Generated code is reviewed and analyzed with ChatGPT, as debugging with assembler code is not possible.
Functional Lexer:
The lexer operates reliably according to the defined requirements, successfully breaking down the source code into the appropriate token types. Recognized token types include the following keywords: Int, String, Return, If, Else, Identifier, Literal, Operand, OpenParenthesis, CloseParenthesis, OpenBrace, CloseBrace, Equals, and Semicolon
. The source code is deconstructed into its components, and a corresponding token object is created for each recognized element. This serves as the foundation for the subsequent parsing steps.
Parser Progress:
The parser is capable of processing a list of tokens and generating an abstract syntax tree (AST) that accurately represents the logical structure of the code. While the handling of mathematical equations is still flawed, the parser functions correctly for all other elements and accurately reflects the program's structure.
Code Generator:
The code generator can completely translate the correctly generated AST into assembly code. This encompasses all essential language features that the parser can successfully handle. The generated assembly code reflects the logical structure of the AST, maintaining the correct sequence of instructions. This ensures that the resulting assembly code meets the expected criteria.
Currently, there is a major issue with the incorrect processing of equations, particularly regarding operator precedence (parentheses before multiplication before addition). The existing parsing routines are unable to correctly establish the order of operations.
Functioning
Equation: 3 + a * b
AST:
+
/ \
3 /
/ \
a 5
Assembly Code:
push 3
push dword [a]
push 5
pop rbx
pop rax
cqo
idiv rbx
push rax
pop rbx
pop rax
add rax, rbx
push rax
Incorrect
Equation: 3 + (a – b) * 3 / 6
Expected AST: Actual AST:
+ +
/ \ / \
3 / 3 *
/ \ / \
* 6 - /
/ / \ / \
- a b 3 6
/ \
a b
Due to the faulty AST, the assembly code generated is also flawed accordingly.
In the current state of the project, the generation of the abstract syntax tree (AST) for equations is not yet fully functional. Specifically, it's necessary to ensure that the correct operator precedence—parentheses before multiplication/division before addition/subtraction—is maintained. This is crucial for accurately parsing and interpreting mathematical expressions.
Accordingly, the code generator is also being adjusted.