A Python lexical analyzer and parser written in C++23 that tokenizes Python source code and builds an Abstract Syntax Tree (AST).
- Lexical Analysis: Tokenizes Python source code into meaningful tokens
- Syntax Parsing: Builds an AST using recursive descent parsing
- Python Language Support:
  - Function definitions (`def`)
  - Class definitions (`class`)
  - Control flow (`if`, `elif`, `else`, `while`, `for`)
  - Pattern matching (`match`/`case`)
  - Binary operators (`+`, `-`, `*`, `/`, `//`, `%`, `**`)
  - Assignment operators (`=`, `+=`, `-=`, `*=`, `/=`)
  - Literals (integers, floats, strings)
  - Return statements
- CMake 3.10 or higher
- C++23 compatible compiler (GCC 11+, Clang 14+)
cmake -B build
cmake --build build

The executable will be located at build/bin/lexical_analyzer.

./build/bin/lexical_analyzer <python_file>
./build/bin/lexical_analyzer example.py

This will:
- Tokenize the input Python file
- Parse the tokens into an AST
- Display the AST structure
lexical/
├── src/
│ ├── lexical.hpp # Lexer interface
│ ├── lexical.cpp # Lexer implementation
│ ├── parser.hpp # Parser interface
│ ├── parser.cpp # Parser implementation
│ ├── ast.hpp # AST node definitions
│ └── token.hpp # Token type definitions
├── main.cpp # Entry point
├── CMakeLists.txt # Build configuration
└── example.py # Example Python file
The parser generates an AST with the following node types:
- `PROGRAM` - Root node
- `FUNCTION_DEF` - Function definitions
- `CLASS_DEF` - Class definitions
- `IF_STMT`, `ELIF_STMT`, `ELSE_STMT` - Conditional statements
- `WHILE_STMT`, `FOR_STMT` - Loop statements
- `MATCH_STMT`, `CASE_STMT` - Pattern matching
- `ASSIGNMENT` - Variable assignments
- `BINARY_OP` - Binary operations
- `RETURN_STMT` - Return statements
- `IDENTIFIER` - Variable/function names
- `INTEGER_LITERAL`, `FLOAT_LITERAL`, `STRING_LITERAL` - Literals

The example output further down also contains `PARAMETER_LIST`, `PARAMETER`, and `BLOCK` nodes.
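For illustration, a tree built from these node types could be represented roughly as below. This is a minimal sketch, not the contents of `ast.hpp`: the `NodeType` enum (a subset of the list above), the `AstNode` layout, and the `add`, `to_string`, and `count_nodes` helpers are all hypothetical names chosen for the example.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node kinds mirroring a subset of the node types listed above.
enum class NodeType {
    PROGRAM, FUNCTION_DEF, ASSIGNMENT, BINARY_OP, RETURN_STMT,
    IDENTIFIER, INTEGER_LITERAL, FLOAT_LITERAL, STRING_LITERAL
};

// Hypothetical node layout: a kind, an optional text value, and owned children.
struct AstNode {
    NodeType type;
    std::string value;
    std::vector<std::unique_ptr<AstNode>> children;

    explicit AstNode(NodeType t, std::string v = "")
        : type(t), value(std::move(v)) {}

    // Appends a child and returns a reference to it so a subtree can be
    // filled in. The reference stays valid because children are heap-allocated.
    AstNode& add(NodeType t, std::string v = "") {
        children.push_back(std::make_unique<AstNode>(t, std::move(v)));
        return *children.back();
    }
};

// Maps a node kind to the name used in the list above.
inline const char* to_string(NodeType t) {
    switch (t) {
        case NodeType::PROGRAM:         return "PROGRAM";
        case NodeType::FUNCTION_DEF:    return "FUNCTION_DEF";
        case NodeType::ASSIGNMENT:      return "ASSIGNMENT";
        case NodeType::BINARY_OP:       return "BINARY_OP";
        case NodeType::RETURN_STMT:     return "RETURN_STMT";
        case NodeType::IDENTIFIER:      return "IDENTIFIER";
        case NodeType::INTEGER_LITERAL: return "INTEGER_LITERAL";
        case NodeType::FLOAT_LITERAL:   return "FLOAT_LITERAL";
        case NodeType::STRING_LITERAL:  return "STRING_LITERAL";
    }
    return "UNKNOWN";
}

// Counts the nodes in a subtree, including the root.
inline std::size_t count_nodes(const AstNode& node) {
    std::size_t total = 1;
    for (const auto& child : node.children) {
        total += count_nodes(*child);
    }
    return total;
}
```

Owning children through `std::unique_ptr` keeps the tree free of manual memory management; whether the project uses the same ownership model is not stated here.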
For the following Python code:
def greet(name):
    message = "Hello, " + name
    return message

The parser generates:
Node: PROGRAM
  Node: FUNCTION_DEF (value: "greet")
    Node: PARAMETER_LIST
      Node: PARAMETER (value: "name")
    Node: BLOCK
      Node: ASSIGNMENT (value: "=")
        Node: IDENTIFIER (value: "message")
        Node: BINARY_OP (value: "+")
          Node: STRING_LITERAL (value: "Hello, ")
          Node: IDENTIFIER (value: "name")
      Node: RETURN_STMT
        Node: IDENTIFIER (value: "message")
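Output in this `Node: TYPE (value: "...")` form can be produced by a short recursive walk over the tree. The sketch below is an assumption about how such a printer might look, not the project's code: the simplified `Node` struct, the `render` function, and the two-space indent per depth level are all hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified node: the kind is stored as text to keep the sketch short.
struct Node {
    std::string type;
    std::string value;            // empty when the node carries no value
    std::vector<Node> children;
};

// Renders a subtree, one node per line, indenting two spaces per depth level.
std::string render(const Node& node, int depth = 0) {
    std::string out(static_cast<std::size_t>(depth) * 2, ' ');
    out += "Node: " + node.type;
    if (!node.value.empty()) {
        out += " (value: \"" + node.value + "\")";
    }
    out += '\n';
    for (const Node& child : node.children) {
        out += render(child, depth + 1);
    }
    return out;
}
```

Returning a string instead of writing straight to `std::cout` makes the printer easy to check in a unit test; a caller can still stream the result.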
The parser uses recursive descent with the following precedence hierarchy:
- `parse_statement()` - Statement dispatcher
- `parse_assignment()` - Assignment expressions
- `parse_operator()` - Binary operators
- `parse_expression()` - Primary expressions
- Specialized parsers for control structures
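The layering above can be sketched roughly as follows. This is a minimal illustration, not the project's parser: `MiniParser`, the pre-split token vector it consumes, the parenthesized string it returns in place of AST nodes, and the precedence table inside `parse_operator` are all assumptions made for the example. Only the three function names come from the list above. The sketch uses precedence climbing for binary operators and treats `**` as right-associative, as Python does.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical mini-parser over pre-split tokens. It returns a parenthesized
// string instead of AST nodes so the resulting grouping is easy to inspect.
class MiniParser {
public:
    explicit MiniParser(std::vector<std::string> tokens)
        : tokens_(std::move(tokens)) {}

    // Assignment level: NAME '=' expression, otherwise defer to the operator level.
    std::string parse_assignment() {
        if (pos_ + 1 < tokens_.size() && tokens_[pos_ + 1] == "=") {
            const std::string target = tokens_[pos_];
            pos_ += 2;  // consume the target name and '='
            return "(= " + target + " " + parse_operator(1) + ")";
        }
        return parse_operator(1);
    }

private:
    // Binding strength of a binary operator; 0 means "not a binary operator".
    static int precedence(const std::string& op) {
        if (op == "+" || op == "-") return 1;
        if (op == "*" || op == "/" || op == "//" || op == "%") return 2;
        if (op == "**") return 3;
        return 0;
    }

    // Operator level: precedence climbing. Only operators that bind at least
    // as tightly as min_prec are consumed at this recursion depth.
    std::string parse_operator(int min_prec) {
        std::string left = parse_expression();
        while (pos_ < tokens_.size()) {
            const std::string op = tokens_[pos_];
            const int prec = precedence(op);
            if (prec == 0 || prec < min_prec) break;
            ++pos_;  // consume the operator
            // '**' is right-associative; the other operators are left-associative.
            const int next_min = (op == "**") ? prec : prec + 1;
            const std::string right = parse_operator(next_min);
            left = "(" + op + " " + left + " " + right + ")";
        }
        return left;
    }

    // Primary level: a single identifier or literal token.
    std::string parse_expression() {
        return pos_ < tokens_.size() ? tokens_[pos_++] : std::string("<missing>");
    }

    std::vector<std::string> tokens_;
    std::size_t pos_ = 0;
};
```

With the tokens `a`, `+`, `b`, `*`, `c`, `parse_assignment()` returns `(+ a (* b c))`, and with `a`, `-`, `b`, `-`, `c` it returns `(- (- a b) c)`. Unary operators, parentheses, and error reporting are left out to keep the sketch short.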
The lexer recognizes:
- Keywords: `def`, `class`, `if`, `while`, `for`, `match`, `case`, `return`, etc.
- Operators: `+`, `-`, `*`, `/`, `//`, `%`, `**`, `=`, `+=`, etc.
- Literals: integers, floats, strings
- Delimiters: `(`, `)`, `[`, `]`, `{`, `}`, `:`, `,`
- Indentation: `INDENT`, `DEDENT`, `NEWLINE`
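The `INDENT` and `DEDENT` tokens are usually derived from a stack of indentation widths, the approach described in the Python language reference. The sketch below shows that technique in isolation and is not taken from `lexical.cpp`: the `layout_tokens` function and the `LINE` placeholder (standing in for the ordinary tokens of a line) are hypothetical, and the sketch assumes space-only indentation without checking for inconsistent dedents.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: emits only layout tokens, plus a "LINE" placeholder for
// the content of each non-blank line, tracking indentation with a stack.
std::vector<std::string> layout_tokens(const std::string& source) {
    std::vector<std::string> out;
    std::vector<std::size_t> indents{0};  // stack of active indentation widths
    std::istringstream in(source);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t width = line.find_first_not_of(' ');
        if (width == std::string::npos) continue;  // blank lines produce no tokens
        if (width > indents.back()) {
            // Deeper than the current block: open a new one.
            indents.push_back(width);
            out.push_back("INDENT");
        }
        while (width < indents.back()) {
            // Shallower: close blocks until the widths match again.
            indents.pop_back();
            out.push_back("DEDENT");
        }
        out.push_back("LINE");
        out.push_back("NEWLINE");
    }
    // Close any blocks still open at end of input.
    while (indents.size() > 1) {
        indents.pop_back();
        out.push_back("DEDENT");
    }
    return out;
}
```

For a two-line input such as `if a:` followed by an indented `b`, the function yields `LINE NEWLINE INDENT LINE NEWLINE DEDENT`. A production lexer would also need to handle tabs, line continuations inside brackets, and dedents to a width that was never opened.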
This project is provided as-is for educational purposes.
Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.