Chapter 6 of Writing a C Compiler was a short one. So will Chapter 7. Do not even bother to buckle up.
Syntax Tree and Parsing
See? Not even new tokens for lexer.
This chapter is all about implementing compound statements. To abstract the similarities between a function body and compound statements, a new Block
AST node shall be created. Along with a tiny change in FuncDef
, you get this beauty.
pub const FuncDef = struct {
name: []const u8,
body: Block,
};
pub const Block = struct {
body: std.SegmentedList(BlockItem, 0),
};
And adding a new statement type in Stmt
.
pub const Stmt = union(enum) {
@"return": *Expr,
expr: *Expr,
@"if": struct { cond: *Expr, then: *Stmt, @"else": ?*Stmt },
compound: Block, // <-- this one
null: void,
};
I think it does not need a pointer. I will wait to see if the Zig compiler yells at me.
Updating the parser requires a new function: parse_block
. It is simple enough and the implementation is copied over from parse_func_def
.
fn parse_block(
arena: std.mem.Allocator,
tokens: *lexer.Tokenizer,
) Error!ast.Block {
try expect(.l_brace, tokens);
var body: std.SegmentedList(ast.BlockItem, 0) = .{};
while (tokens.next()) |next_token| {
if (next_token.tag == .r_brace) break;
tokens.put_back(next_token);
const item = try parse_block_item(arena, tokens);
try body.append(arena, item);
} else return error.NotEnoughJunk;
return .{ .body = body };
}
And parse_func_def
is adjusted accordingly:
fn parse_func_def(
arena: std.mem.Allocator,
tokens: *lexer.Tokenizer,
) Error!ast.FuncDef {
try expect(.type_int, tokens);
const name = try expect(.identifier, tokens);
try expect(.l_paren, tokens);
try expect(.keyword_void, tokens);
try expect(.r_paren, tokens);
const block = try parse_block(arena, tokens);
return .{ .name = name, .block = block };
}
And that is it.
Pretty Printing Tricks
I encountered an interesting problem in the AST pretty printer. Because I am using an indentation based scheme, rather than braces, it makes it slightly difficult to know where blocks begin and end if I just, increase the indentation. So, for a regular block, I add the keyword DO
.
int main(void) {
int a = 2;
int b;
{
a = -4;
int a = 7;
b = a + 1;
}
return b == 8 && a == -4;
}
PROGRAM
FUNCTION main
int a <- 2;
int b;
DO
a <- (- 4);
int a <- 7;
b <- (+ a 1);
RETURN (&& (== b 8) (== a (- 4)))
But when the block is in the if
statement, this would look something like this, which is .. eh, unseemly.
IF a
DO
int b <- 2;
RETURN b
So instead I get rid of DO
and just have indent immediately after IF
. The thing is, the logic for compound statement printing does not know, normally, whether it is preceded by IF
(and later FOR
and WHILE
), or if it is standing by its own. Doing the custom formatting when printing IF
would lead to a lot of duplicated code for every new control flow statement. So time to abuse another feature of the formatter!
One of the options in srd.fmt.FormatOptions
is alignment
, which is normally used to align the printed items right, left, or center, within given padding. In the fmt
string, <
indicates left, ^
indicates center, and >
indicates right, which is the default value.
In other words, when printing the statement following IF
, all I need to do there is this, which sets the alignment
option in the called function to .center
.
try writer.print("{:^[1]}", .{ cs.then, w + 1 });
// ^ right here
And now I can do the check right within .compound
:
.compound => |b| {
var iter = b.body.constIterator(0);
var cw: usize = undefined;
if (options.alignment != .center) { // <-- here
try writer.writeAll("DO");
cw = w + 1;
} else {
if (iter.next()) |item|
try writer.print("{}", .{item});
cw = w;
}
while (iter.next()) |item| {
try writer.print("\n{:[1]}", .{
item,
cw,
});
}
},
The other issue, which you might have noticed, is that the new line characters which separate the internal block items are printed before each statement. The reason for that is for the block not to end in a newline character, as the new line is usually taken care of by the parent block. This simply avoids an additional blank line after every block.
This prints the AST nicely, even for ridiculous C code.
int main(void) {
int ten = 10;
{}
int twenty = 10 * 2;
{{}}
return ten + twenty;
}
PROGRAM
FUNCTION main
int ten <- 10;
DO
int twenty <- (* 10 2);
DO
DO
RETURN (+ ten twenty)
I could just as well omit the empty blocks right here and then.1 Maybe I will do that if I ever write a real compiler.
Semantic Analysis
I wish this were as easy as filling up the blanks, but that is actually where the meat of this chapter is. Because C allows shadowing in inner scopes. The following is perfectly valid C code.
int main(void) {
int x = 1; // <-- x0
{
int x = 2; // <-- x1
if (x > 5) { // <-- x1
x = 3; // <-- x1
int x = 4; // <-- x2
}
return x ; // <-- x1
}
return x; // <-- x0
}
What this means in practice is that each new scope (read: new block), requires its own variable_map
. Previously, the variable_map
was defined at program scope in resolve_prgm
:
pub fn resolve_prgm(
gpa: std.mem.Allocator,
strings: *utils.StringInterner,
prgm: *ast.Prgm,
) Error!void {
var variable_map: std.StringHashMapUnmanaged([:0]const u8) = .empty;
defer variable_map.deinit(gpa);
const bp: Boilerplate = .{
.gpa = gpa,
.strings = strings,
.variable_map = &variable_map,
};
try resolve_func_def(bp, prgm.func_def);
}
The variable_map
is discarded by the end of this function because, by the time the function is done, all the user-supplied variable names have been given unique names. The scopes are identified and done at this point.
It does not serve to create a new empty variable_map
for each new block, because blocks can use variables from parent blocks.[^global] So for every new block, or compound statement, a copy of the variable map is passed instead of the original, and existing keys are replaced by the bindings in the inner scope. However, to avoid double declarations, a marker is needed to note whether this variable belongs to the current scope (so it cannot be declared again), or a parent one, (so it can).
This is the part where double declarations are checked:
fn resolve_decl(
bp: Boilerplate,
decl: *ast.Decl,
) Error!void {
if (bp.variable_map.contains(decl.name))
return error.DuplicateVariableDecl;
// snip --
The data for whether this is shadowing a parent variable (legal) or repeating a local declaration (illegal) is not tracked yet. The simplest way to use a Entry
type that has both the new name, and a little boolean toggle whether it is local. The data oriented way is perhaps to create a new map for each new scope, but this seems complicated when the scopes get too deep and you need to keep a stack of maps, etc. I will stick with a boolean, or at least an enum.
const Entry = struct {
new_name: [:0]const u8,
scope: enum { local, parent } = .local,
};
I really love anonymous types. Then change the variable map to a std.StringHashMapUnmanaged(Entry)
. This currently needs a couple of call sites to update, but all very straightforward. The default value of .local
makes initialization simpler, just a .{ .name = unique_name }
when inserting into the map.
Now a small change to the duplicate declaration checking can be done. I need to check if it exists and if it is a .local
scope. A cute zig syntactic trick makes this nicely terse. Who needs if let
chains.
if (bp.variable_map.get(decl.name)) |entry| if (entry.scope == .local)
return error.DuplicateVariableDecl;
All that is left is creating a new, duplicate, variable_map
when resolving a compound statement. Before I do that, I will do some refactorings moving the old variable_map
into FuncDef
scope instead, and moving the old resolve_func_def
logic into a new resolve_block
that can be reused. This also simplifies later chapters when each function has its own scope as well.
pub fn resolve_prgm(
gpa: std.mem.Allocator,
strings: *utils.StringInterner,
prgm: *ast.Prgm,
) Error!void {
try resolve_func_def(gpa, strings, prgm.func_def);
}
fn resolve_func_def(
gpa: std.mem.Allocator,
strings: *utils.StringInterner,
func_def: *ast.FuncDef,
) Error!void {
var variable_map: std.StringHashMapUnmanaged(Entry) = .empty;
defer variable_map.deinit(gpa);
const bp: Boilerplate = .{
.gpa = gpa,
.strings = strings,
.variable_map = &variable_map,
};
try resolve_block(bp, &func_def.block);
}
fn resolve_block(
bp: Boilerplate,
block: *ast.Block,
) Error!void {
var iter = block.body.iterator(0);
while (iter.next()) |item| switch (item.*) {
.S => |*s| try resolve_stmt(bp, s),
.D => |*d| try resolve_decl(bp, d),
};
}
Now back to compound statements. First the variable_map
is copied into a new one, with all the .local
s changes into .parent
s. Zig helpfully has a clone
method. then iterate over all the values and update the scope
field.
// inside resolve_stmt
.compound => |*b| {
var variable_map = try bp.variable_map.clone(bp.gpa);
defer variable_map.deinit(bp.gpa);
var iter = variable_map.valueIterator();
while (iter.next()) |value|
value.* = .{ .name = value.name, .scope = .parent };
const inner_bp: Boilerplate = .{
.gpa = bp.gpa,
.strings = bp.strings,
.variable_map = &variable_map,
};
try resolve_block(inner_bp, b);
},
This compiles fine. The eye test passes on all C files in chapter 7's tests in the test suite. Time to actually run the test suite, and thankfully all tests pass.
Now what? Almost nothing. No changes to IR (except handling the new statement) or any following stage necessary. Running the full tests, up to compilation, works fine. Time to move on to chapter 8.
Lessons Learned
The most fun I had here was experimenting with the AST pretty printer. Cracking the nut of the duplicate new line problem.