Note
The aim of the minishell 42 project is to create a lightweight command-line interpreter that reproduces the essential features of bash. What sets this implementation apart is its robust parsing system, completely decoupled from execution, built on LALR(1) grammar principles, producing a clean and efficient Abstract Syntax Tree (AST) for command execution. This project demonstrates advanced parsing techniques and provides a basis for understanding how some modern shells interpret and execute commands.
- 🧩 Tokenizer: Flexible and scalable lexical analyzer that converts raw input into meaningful tokens
- 🔎 Grammar Parser: Bottom-up (shift-reduce) parsing using Look-Ahead LR(1), i.e. LALR(1), techniques
- 🔃 AST Generation: Efficient Abstract Syntax Tree construction thanks to grammar production rules
- 🔗 Efficient Builtins: Implementation of essential shell builtins (cd, echo, exit, etc.)
- 🧹 Resource Caching: Cached file descriptors and memory allocations with automatic cleanup on program exit
- ⚡ Hashmap-powered Environment: Fast, average-case O(1) access to environment variables
- 📏 42 School Compliant: Follows 42 School norm and coding standards
- Clang compiler
- GNU Make
- readline library
# Clone the repository
git clone --recurse-submodules https://github.com/MykleR/minishell.git
# Enter the directory and compile project
cd minishell; make
# Run the shell
./minishell
Important
Don't forget --recurse-submodules, otherwise the dependencies will not be cloned.
An LR parser is a powerful tool used by interpreters and compilers to analyze the structure of code or commands. "LR" stands for Left-to-right scanning of the input combined with a Rightmost derivation constructed in reverse. This type of parser works from the bottom up: it starts with the raw input (like shell commands), gradually groups symbols to form higher-level structures, and ultimately recognizes valid syntax.
This grammar formally describes the language's syntax.
- Left side: The symbol being defined (a non-terminal), which in our case corresponds to an AST node.
- Right side: The sequence of symbols required to form it (these may be tokens or other non-terminals).
program -> list
list -> list AND list
list -> list OR list
list -> list PIPE list
list -> LBRACKET list RBRACKET
list -> command
redirection -> REDIR_IN arg
redirection -> REDIR_OUT arg
redirection -> REDIR_APP arg
command -> arg
command -> redirection
command -> command arg
command -> command redirection
arg -> ARG
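To make the grammar usable by a table-driven parser, each production is typically stored as a pair: the symbol it produces and the number of symbols on its right side. The sketch below shows one possible C encoding of the rules above. The names (`t_symbol`, `t_rule`, `g_rules`, `rule_lhs`, `rule_pop_count`) and the rule numbering are illustrative assumptions, not the identifiers used in the project.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical non-terminal identifiers; the project's real names may differ. */
typedef enum e_symbol
{
	SYM_PROGRAM,
	SYM_LIST,
	SYM_REDIRECTION,
	SYM_COMMAND,
	SYM_ARG
}	t_symbol;

/* One grammar production: the symbol it yields and its right-side length. */
typedef struct s_rule
{
	t_symbol	lhs;
	size_t		rhs_len;
}	t_rule;

/* Rules listed in the same order as the grammar above (assumed numbering). */
static const t_rule	g_rules[] = {
	{SYM_PROGRAM, 1},     /*  0: program -> list */
	{SYM_LIST, 3},        /*  1: list -> list AND list */
	{SYM_LIST, 3},        /*  2: list -> list OR list */
	{SYM_LIST, 3},        /*  3: list -> list PIPE list */
	{SYM_LIST, 3},        /*  4: list -> LBRACKET list RBRACKET */
	{SYM_LIST, 1},        /*  5: list -> command */
	{SYM_REDIRECTION, 2}, /*  6: redirection -> REDIR_IN arg */
	{SYM_REDIRECTION, 2}, /*  7: redirection -> REDIR_OUT arg */
	{SYM_REDIRECTION, 2}, /*  8: redirection -> REDIR_APP arg */
	{SYM_COMMAND, 1},     /*  9: command -> arg */
	{SYM_COMMAND, 1},     /* 10: command -> redirection */
	{SYM_COMMAND, 2},     /* 11: command -> command arg */
	{SYM_COMMAND, 2},     /* 12: command -> command redirection */
	{SYM_ARG, 1},         /* 13: arg -> ARG */
};

#define RULE_COUNT (sizeof(g_rules) / sizeof(g_rules[0]))

/* Number of productions in the table. */
size_t	rule_count(void)
{
	return (RULE_COUNT);
}

/* How many stack entries a reduction by this rule pops. */
size_t	rule_pop_count(size_t rule)
{
	assert(rule < RULE_COUNT);
	return (g_rules[rule].rhs_len);
}

/* Which symbol a reduction by this rule pushes back. */
t_symbol	rule_lhs(size_t rule)
{
	assert(rule < RULE_COUNT);
	return (g_rules[rule].lhs);
}
```

Note that the three binary `list` rules are ambiguous on their own; a parser generator normally resolves this with operator precedence and associativity declarations, which this sketch does not model.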
LR parsers rely on two main sets of instructions, called tables:
- Action Table: This table tells the parser what to do next, depending on the current situation. The possible actions are:
- Shift: Reads and places the next token from the input onto the stack, gathering more information before reducing to a grammar rule.
- Reduce: Replaces gathered symbols on the stack with a single symbol, according to a grammar rule (e.g., a sequence of word tokens might be reduced to a single "command" symbol).
- Accept: Successfully finish parsing. The grammar was fully respected.
- Error: Indicate a problem in the input. The grammar was not respected.
- Goto Table: After a reduction, this table tells the parser which state to move to next, based on the new symbol on top of the stack.
Note
The parser uses a stack to keep track of symbols and parser states. As it shifts tokens and reduces groups of symbols, the stack helps the parser remember where it is and what structures have been recognized so far. The action and goto tables are the central data structures of any LR parser, and they are usually produced by a parser generator rather than written by hand.
In our minishell project, these action and goto tables are precomputed and built directly into the parsing engine. When the user enters a command, the parser looks up each token in these tables to decide whether to shift, reduce, accept, or signal an error. Because every decision is a table lookup, parsing takes time proportional to the length of the input, and complex shell command syntax is handled reliably.
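The table-driven loop described above can be sketched on a deliberately tiny grammar (`command -> command ARG | ARG`, a subset of the full one). The tables below were worked out by hand for this toy grammar only; they are not the project's tables, and every identifier (`t_action`, `g_action`, `g_goto`, `lr_parse`) is a hypothetical name chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Token values double as column indices into the action table. */
typedef enum e_token
{
	TOK_ARG,
	TOK_END,
	TOK_COUNT
}	t_token;

/* ACT_ERROR is 0 so that any unset table entry means "syntax error". */
typedef enum e_act_kind
{
	ACT_ERROR,
	ACT_SHIFT,
	ACT_REDUCE,
	ACT_ACCEPT
}	t_act_kind;

typedef struct s_action
{
	t_act_kind	kind;
	int			arg; /* target state for shift, rule number for reduce */
}	t_action;

#define STATE_COUNT 4
#define STACK_MAX 64

/* Toy grammar: rule 1: command -> command ARG, rule 2: command -> ARG.
** Rule 0 (start -> command) is never reduced; it is handled by ACT_ACCEPT. */
static const int		g_rhs_len[] = {1, 2, 1};

/* Rows are states, columns are tokens (TOK_ARG, TOK_END). */
static const t_action	g_action[STATE_COUNT][TOK_COUNT] = {
	{{ACT_SHIFT, 1}, {ACT_ERROR, 0}},   /* 0: nothing read yet */
	{{ACT_REDUCE, 2}, {ACT_REDUCE, 2}}, /* 1: command -> ARG . */
	{{ACT_SHIFT, 3}, {ACT_ACCEPT, 0}},  /* 2: one full command recognized */
	{{ACT_REDUCE, 1}, {ACT_REDUCE, 1}}, /* 3: command -> command ARG . */
};

/* Only one non-terminal (command), so goto is indexed by state alone. */
static const int		g_goto[STATE_COUNT] = {2, -1, -1, -1};

/* Parses a TOK_END-terminated token array.
** Returns the number of reductions performed, or -1 on a syntax error. */
int	lr_parse(const t_token *tokens)
{
	int			stack[STACK_MAX];
	int			sp;
	int			next;
	int			reductions;
	size_t		i;
	t_action	act;

	sp = 0;
	i = 0;
	reductions = 0;
	stack[0] = 0;
	while (1)
	{
		if ((unsigned int)tokens[i] >= TOK_COUNT)
			return (-1);
		act = g_action[stack[sp]][tokens[i]];
		if (act.kind == ACT_SHIFT)
		{
			if (sp + 1 >= STACK_MAX)
				return (-1);
			stack[++sp] = act.arg; /* push the new state, consume the token */
			i++;
		}
		else if (act.kind == ACT_REDUCE)
		{
			sp -= g_rhs_len[act.arg]; /* pop the rule's right side */
			assert(sp >= 0);
			next = g_goto[stack[sp]];
			assert(next >= 0);
			stack[++sp] = next;       /* push the goto state */
			reductions++;             /* a real parser builds an AST node here */
		}
		else if (act.kind == ACT_ACCEPT)
			return (reductions);
		else
			return (-1);
	}
}
```

For the input `ARG ARG ARG END` the loop performs three reductions before accepting, one per recognized argument. An input consisting only of `END` is rejected because state 0 has no action for that token.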
Tip
You can generate tables, visualize parse trees, and try other grammars with this online tool: LALR(1) Parser Generator.
flowchart LR
A[Input] -->|Lexer|B(TOKENS)
B -->|"special tokens" |G(HEREDOC)
G -->|Parser |C{AST}
C -->|Execution |C
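As an illustration of the first stage in the flowchart, here is a minimal lexer sketch that classifies operators and words. It is a simplification under stated assumptions: quotes, variable expansion, and error reporting are ignored, a lone `&` is treated as an ordinary word character, and the enum names are hypothetical (the project's real token list lives in headers/lexer.h).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token types, named after the grammar's terminals. */
typedef enum e_token_type
{
	TOK_ARG,
	TOK_PIPE,
	TOK_AND,
	TOK_OR,
	TOK_REDIR_IN,
	TOK_REDIR_OUT,
	TOK_REDIR_APP,
	TOK_HEREDOC,
	TOK_LBRACKET,
	TOK_RBRACKET,
	TOK_END
}	t_token_type;

typedef struct s_operator
{
	const char		*text;
	t_token_type	type;
}	t_operator;

/* Two-character operators come first so that ">>" wins over ">". */
static const t_operator	g_ops[] = {
	{"&&", TOK_AND}, {"||", TOK_OR},
	{">>", TOK_REDIR_APP}, {"<<", TOK_HEREDOC},
	{"|", TOK_PIPE}, {"<", TOK_REDIR_IN}, {">", TOK_REDIR_OUT},
	{"(", TOK_LBRACKET}, {")", TOK_RBRACKET},
};

static int	is_space(char c)
{
	return (c == ' ' || c == '\t' || c == '\n');
}

/* Returns the length of the operator starting at s, or 0 if there is none. */
static size_t	match_operator(const char *s, t_token_type *type)
{
	size_t	i;
	size_t	len;

	i = 0;
	while (i < sizeof(g_ops) / sizeof(g_ops[0]))
	{
		len = strlen(g_ops[i].text);
		if (strncmp(s, g_ops[i].text, len) == 0)
		{
			*type = g_ops[i].type;
			return (len);
		}
		i++;
	}
	return (0);
}

/* Writes token types into out (capacity max), ending with TOK_END.
** Returns the number of tokens written, or 0 if out is too small. */
size_t	tokenize(const char *input, t_token_type *out, size_t max)
{
	size_t			n;
	size_t			len;
	t_token_type	type;
	t_token_type	ignored;

	assert(input != NULL && out != NULL);
	n = 0;
	while (*input)
	{
		if (is_space(*input))
		{
			input++;
			continue ;
		}
		if (n + 1 >= max) /* keep room for this token and TOK_END */
			return (0);
		len = match_operator(input, &type);
		if (len == 0)
		{
			/* A word runs until whitespace, an operator, or end of input. */
			type = TOK_ARG;
			while (input[len] && !is_space(input[len])
				&& match_operator(input + len, &ignored) == 0)
				len++;
		}
		out[n++] = type;
		input += len;
	}
	if (n >= max)
		return (0);
	out[n++] = TOK_END;
	return (n);
}
```

Calling `tokenize("echo hi | cat >> out && ls", buf, 16)` fills `buf` with ARG, ARG, PIPE, ARG, REDIR_APP, ARG, AND, ARG, END. A real lexer would also keep each token's text, which this sketch leaves out to stay short.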
- Input Capture:
- GNU Readline for input with history support
- Tokenization:
- Splits the input string into tokens such as redirections '>>', pipes '|', or words
- Each token is classified based on its role in the shell language
- You will find an exhaustive list of all the token types in "headers/lexer.h".
- Token enum values are very important, as they are used as indices into the action table
- Heredoc Processing:
- Handles heredocs by writing their content to a temp file and converting them to input redirections '<' from that file
- Forks the program to allow readline input while the heredoc is being read
- AST Construction:
- Builds an abstract syntax tree using the LALR(1) parser
- As grammar rules are recognized, corresponding AST nodes are created
- Nodes are connected to form a tree structure representing the command hierarchy
- The tree captures command relationships and execution order
- Tree Traversal:
- Executes commands through a post-order traversal of the binary tree, so that child nodes are handled before the operator that combines them and command dependencies are respected
- Nodes are processed according to their type (command, redirection, logical operator)
- Execution results propagate up the tree to determine logical branch paths and exit code status
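The traversal step can be sketched as a small recursive function. This is an illustration under assumptions, not the project's executor: commands are simulated by a stored exit status instead of being forked and executed, pipes run their two sides one after the other, and the names (`t_node`, `exec_node`) are hypothetical. It also assumes bash-like short-circuiting, where the right side of `&&` or `||` is skipped once the left side's status already decides the result.

```c
#include <assert.h>
#include <stddef.h>

typedef enum e_node_type
{
	NODE_CMD,
	NODE_AND,
	NODE_OR,
	NODE_PIPE
}	t_node_type;

typedef struct s_node
{
	t_node_type		type;
	int				status; /* simulated exit status, used by NODE_CMD only */
	int				runs;   /* how many times this command was "executed" */
	struct s_node	*left;
	struct s_node	*right;
}	t_node;

/* Visits children first, then lets the operator combine their results.
** Returns the exit status that propagates up to the parent node. */
int	exec_node(t_node *node)
{
	int	status;

	assert(node != NULL);
	if (node->type == NODE_CMD)
	{
		node->runs++; /* a real shell would fork and execve here */
		return (node->status);
	}
	status = exec_node(node->left);
	if (node->type == NODE_AND && status != 0)
		return (status); /* left failed: skip the right side of && */
	if (node->type == NODE_OR && status == 0)
		return (status); /* left succeeded: skip the right side of || */
	/* && after success, || after failure, and pipes all run the right
	** side, whose status becomes the status of the whole node. */
	return (exec_node(node->right));
}
```

For a tree representing `false && echo || true`, the `false` node runs once and fails, `echo` is skipped, `true` runs, and the root returns 0.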