Skip to content

Latest commit

 

History

History
300 lines (205 loc) · 11.9 KB

doc.md

File metadata and controls

300 lines (205 loc) · 11.9 KB

C Compiler

1 Objective

The goal is to create a simple C compiler. The compiler should be able to read a basic C file, convert it into NASM assembly, and generate an executable file from it.

C is a straightforward language, which is why the compiler was chosen to be a C compiler.

Through this project, I will learn how a compiler is structured and gain a deeper understanding of the theory behind it. Additionally, I will become familiar with and use assembly language. Patterns like Lexer and Parser will be applied, giving me practical experience.

The compiler will be developed according to the Open-Closed Principle, allowing for incremental extensions over time. It should be able to output strings without preprocessor directives and handle basic integer calculations, as well as initialize and use variables. A semantic analyzer and optimizer will not be implemented initially. However, calculations that don’t involve variables will be computed directly.

2 Approach

2.1 Technology Selection

The compiler is being developed using .NET 6.0 and C#. I chose the .NET framework and C# because I have the most experience with them, and the project itself is already very complex. However, I am aware that a functional programming language (e.g., Haskell) would likely be more suitable.

Lunarvim is my preferred development environment on the Linux distribution EndeavourOS. Since I work on this project in my free time, it was advantageous to stay in my familiar environment. This allowed me to focus entirely on the complexity of the project.

2.2 Modular Structure

2.2.1 Lexer

The lexer analyzes source code and divides it into tokens.

To Do:
Implement a lex method.
Input: C source code Output: List of Tokens List<Token>

Structure:

There is a Token class that includes the following elements:

  • Two enumerations:

    • TokenType
    • LiteralType
  • Three attributes:

    • TokenType Type
    • string Value
    • LiteralType Literal
  • Two Enumerations:

    • TokenType: {Int, Return, Literal, Semicolon, …}
    • LiteralType: {StringLiteral, IntegerLiteral, …}
  • Example: For the statement Int a = 5;

    • Type = Identifier
    • Value = “a”
    • Literal = IntegerLiteral

Lexical Elements

Here are all the tokens that this lexer can recognize, along with the regular expressions that define each one:

Term Definition Example
Keyword All lowercase letters used for reserved keywords (e.g., basic data types, if-else statements). int, return, ...
Identifier A sequence of non-numeric characters (both uppercase and lowercase) including underscores (e.g., function and variable names). [a-zA-Z], [0-9]
Constant A constant holds a value (either a string or an integer). string-literal, integer-literal
String Literal A sequence of characters enclosed in double quotation marks. "[a-zA-Z]", "[0-9]", ...
Integer Literal A sequence of numeric characters. [0-9]
Operator Used for arithmetic operations. arithmetic
Punctuation Symbols used for grouping and assignment. (), =, :, ...

Example: int variableName = 5;

  • Keyword: int
  • Identifier: variableName
  • Punctuator: =
  • Constant: Integer Literal: 5
  • Punctuator: ;

2.2.2 Parser

The parser processes the token list to create an Abstract Syntax Tree (AST) and performs a partial syntax check. This verification occurs automatically, as, for instance, a variable name is expected after a data type.

To Do:
Implement a pars method.
Input: List of Tokens List<Token> Output: Abstract Syntax Tree (AST)

Structure:

The program consists of two main classes: Node and ExpressionTree, along with their subclasses.

Class: Node

The Node class represents the nodes that form the Abstract Syntax Tree (AST).

  • Constructor:

    • This constructor checks whether the sequence of tokens represents an expression or a statement, and initializes the Typeand Tokensvariables accordingly.
  • Attributes:

    • NodeType Type
    • List Tokens
    • Node Left
    • Node Right
  • Enum:

    • NodeType {Program, FuncDecl, Statement, IntegerExpression, StringExpression}

Subclass: ExpressionNode

This subclass creates string or integer literal objects from the ExpressionTree class and constructs an AST for handling those expressions.

Subclass: StatementNode

This subclass is designed to handle statements such as return and if-else. However, its functionality will not be implemented in this project. It exists to support simple statements like return 0;.

Class: ExpressionTree

  • Enum:
    • OperatorType {Addition, Subtraction, Multiplication, Division}

Subclass: IntegerLiteralExpressionTree

This subclass recursively forms an equation AST and implements the method TreeNodeSolver(), which computes equations that do not contain variables directly.

  • Attributes:
    • OperatorType Operand
    • bool IsOperator
    • string Value
    • bool IsValue
    • int Precedence
    • bool TreeGotOptimized

Subclass: StringLiteralExpressionTree

This subclass constructs an AST with a node that contains the string being processed.

  • Attribute:
    • string Value

AST:

A data structure that represents the structure of a program (or code snippet). The Abstract Syntax Tree (AST) should be rooted in a program node and trigger an error in the event of invalid syntax. The left nodes form the main trunk that runs through the program, while the right nodes represent branches (expressions that are handled by the ExpressionTree class).

2.2.3 Code Generator

The code generator translates an Abstract Syntax Tree (AST) into assembly code. During the recursive traversal of the AST nodes, each node is passed to its corresponding translation function.

To Do:
Implement a generate method.
Input: AST Output: Assembly Code

Structure:

There are two main classes: AssemblyGenerator and Section.

AssemblyGenerator

The AssemblyGenerator```` class traverses the AST (Abstract Syntax Tree) and organizes all StringBuilderobjects, which are populated by theSection``` class, into a complete assembly code output. It also collects all variables to ensure they are correctly initialized in the appropriate sections.

The newLine string helps generalize the code and makes it easier to insert line breaks.

  • Attributes:
    • string newLine
    • List stringVariables
    • Dictionary<string, string> integerVariables
    • StringBuilder SECTIONdata
    • StringBuilder SECTIONtextTop
    • StringBuilder SECTIONtextBody
    • StringBuilder SECTIONbss

Section

The Section class handles each node passed to it by the AssemblyGenerator, along with the corresponding StringBuilder. The respective StringBuilder is then used to append the newly generated assembly code.

  • Attribute:
    • newLine

2.2.4 Test Strategy

The following C code served as the template for testing:

int main(){
    int a = 6;
    int b = 2 + 3;
    printf("ThisIsAString");
    int c = 3 + (8 - 2) * 3;
    int d = 3 + (a - b) * 3;
    printf("AnotherString");
    return 0;
}

This code is constantly being modified.
Additionally, an assembler code file has been generated from the existing test code to serve as a reference. Generated code is reviewed and analyzed with ChatGPT, as debugging with assembler code is not possible.

3 Current Status

3.1 Milestones Achieved

Functional Lexer:

The lexer operates reliably according to the defined requirements, successfully breaking down the source code into the appropriate token types. Recognized token types include the following keywords: Int, String, Return, If, Else, Identifier, Literal, Operand, OpenParenthesis, CloseParenthesis, OpenBrace, CloseBrace, Equals, and Semicolon. The source code is deconstructed into its components, and a corresponding token object is created for each recognized element. This serves as the foundation for the subsequent parsing steps.

Parser Progress:

The parser is capable of processing a list of tokens and generating an abstract syntax tree (AST) that accurately represents the logical structure of the code. While the handling of mathematical equations is still flawed, the parser functions correctly for all other elements and accurately reflects the program's structure.

Code Generator:

The code generator can completely translate the correctly generated AST into assembly code. This encompasses all essential language features that the parser can successfully handle. The generated assembly code reflects the logical structure of the AST, maintaining the correct sequence of instructions. This ensures that the resulting assembly code meets the expected criteria.

3.2 Problems

Currently, there is a major issue with the incorrect processing of equations, particularly regarding operator precedence (parentheses before multiplication before addition). The existing parsing routines are unable to correctly establish the order of operations.

3.3 Example

Functioning
Equation: 3 + a * b

AST:

    +
  /    \
3       /
      /  \
    a     5

Assembly Code:

push	3
push dword [a]
push	5
pop	rbx
pop	rax
cqo
idiv	rbx
push 	rax
pop 	rbx
pop 	rax
add 	rax, rbx
push 	rax

Incorrect
Equation: 3 + (a – b) * 3 / 6

Expected AST: Actual AST:

		+							+
     /    \						  /   \
   3       /					 3     *
        /  \				  /    \
       *     6				 -      /
     /       			   /   \   /  \
    -     				  a     b 3    6
  /   \   	
 a     b 

Due to the faulty AST, the assembly code generated is also flawed accordingly.

4 Outlook

In the current state of the project, the generation of the abstract syntax tree (AST) for equations is not yet fully functional. Specifically, it's necessary to ensure that the correct operator precedence—parentheses before multiplication/division before addition/subtraction—is maintained. This is crucial for accurately parsing and interpreting mathematical expressions.

Accordingly, the code generator is also being adjusted.

5 References

Nora Sandler write a compiler

lexer tutorial

c bible

compiler explorer

assembly totorial

memory sheit

stack stuff

GG Simple Code Generator