Abdul Rahman Sibahi | Writing a C Compiler, Chapter 6, in Zig

This is chapter 6 of implementing Writing a C Compiler in Zig. So, without further ado, time to get on with if statements and conditional expressions.

Lexer

New keywords! if and else, and two new tokens ? and :. Apparently the word query is an acceptable single word name for a question mark so I am going with that in the lexer.

The new keywords are simple, enough, but it requires updating the keyword map.

pub const keywords = std.StaticStringMap(Tag).initComptime(.{
    .{ "int", .type_int },
    .{ "return", .keyword_return },
    .{ "void", .keyword_void },

    // new lines :
    .{ "if", .keyword_if },
    .{ "else", .keyword_else },
});

The two new tokens are trivial to add and require no new states in the lexer's state machine. So onto AST and parsing.

AST

One new statement and one new expression today. Compound statements survive another day without being implemented.

pub const Stmt = union(enum) {
    // snip --
    @"if": struct { cond: *Expr, then: *Stmt, @"else": ?*Stmt },
};

pub const Expr = union(enum) {
    // snip --
    ternary: struct { *Expr, *Expr, *Expr },
};

And that is pretty much it for the AST. No revision of previous major design decisions.

Parser

So how do you parse an expression like c ? a : b?. The rule for it is that it is right associative (not unlike the assignment operator =), but the expression in the middle is treated as it is part of the expression itself or between two parenthesis. In other words, think of ? a : as an binary operator between c and b all by itself, only for parsing purposes. So first update the binop_precedence function

pub fn binop_precedence(self: @This()) ?struct { u8, u8 } {
    return switch (self) {
        .query => .{ 3, 0 },
        // the rest
    };
}

Then update parse_expr to deal with it. This is slightly trickier than usual. The middle expression should be parsed before the right hand side is. Therefore this silly dance, before parsing rhs.

const then_ptr: ?*ast.Expr = if (current.tag == .query) t: {
    const then = try parse_expr(arena, tokens, 0);
    try expect(.colon, tokens);

    break :t try utils.create(ast.Expr, arena, then);
} else null;

Then, in the big switch board, this tiny line is added.

.greater_equals => .{ .binop_ge = bin_op },
.query => .{ .ternary = .{ lhs_ptr, then_ptr.?, rhs_ptr } }, // <-

And voila. Ternary expressions are parsed now. This creates a very tiny bit of waste creating the boiler-plate reducing bin_op (which is shared between all the other expressions). But all is fair in the name of easier to write code.

For parsing the if statement, I hit upon a previous design decision. In adding declarations last chapter, I decided to forgo a separate parse_stmt in exchange for a simpler, at the time, single parse_block_item, and discriminating then between declarations and statements. Now, since the then argument of the if statement can only be a statement and not a declaration, this requires a divorce, and a separate parse_stmt. Declarations are not ready to leave the house yet and therefore will remain in parse_block_item. This is the trimmed parse_block_item, followed by parse_stmt building its own house.

fn parse_block_item(
    arena: std.mem.Allocator,
    tokens: *lexer.Tokenizer,
) Error!ast.BlockItem {
    const current = tokens.next() orelse
        return error.NotEnoughJunk;

    switch (current.tag) {
        .type_int => {
            const name = try expect(.identifier, tokens);
            const new_token = tokens.next() orelse
                return error.NotEnoughJunk;

            const init: ?*ast.Expr = switch (new_token.tag) {
                .equals => ret: {
                    const expr = try parse_expr(arena, tokens, 0);
                    const expr_ptr = try utils.create(ast.Expr, arena, expr);

                    try expect(.semicolon, tokens);
                    break :ret expr_ptr;
                },
                .semicolon => null,
                else => return error.SyntaxError,
            };

            return .decl(.{ .name = name, .init = init });
        },
        else => {
            tokens.put_back(current);
            const stmt = try parse_stmt(arena, tokens);

            return .stmt(stmt);
        },
    }
}

fn parse_stmt(
    arena: std.mem.Allocator,
    tokens: *lexer.Tokenizer,
) Error!ast.Stmt {
    const current = tokens.next() orelse
        return error.NotEnoughJunk;
    switch (current.tag) {
        .semicolon => return .null,
        .keyword_return => {
            const expr = try parse_expr(arena, tokens, 0);
            const expr_ptr = try utils.create(ast.Expr, arena, expr);
            try expect(.semicolon, tokens);
            return .{ .@"return" = expr_ptr };
        },
        // if statement goes here
        else => {
            tokens.put_back(current);
            const expr = try parse_expr(arena, tokens, 0);
            const expr_ptr = try utils.create(ast.Expr, arena, expr);
            try expect(.semicolon, tokens);
            return .{ .expr = expr_ptr };
        },
    }
}

The if statement parsing is slightly more involved than usual. The condition is always an expression that is parenthesized, and the then statement always exists. The else statement's existence depends on the existence of the keyword else, so I have to peek. But because of the way I implemented peeking, by "putting back" the consumed token, it makes for some weird looking code, making more than usual use of block expressions.

.keyword_if => {
    try expect(.l_paren, tokens);
    const cond = try parse_expr(arena, tokens, 0);
    const cond_ptr = try utils.create(ast.Expr, arena, cond);
    try expect(.r_paren, tokens);

    const then = try parse_stmt(arena, tokens);
    const then_ptr = try utils.create(ast.Stmt, arena, then);

    const peek = tokens.next() orelse
        return error.NotEnoughJunk;

    // weird looking code starts here.
    const else_ptr: ?*ast.Stmt = if (peek.tag == .keyword_else) s: {
        const e = try parse_stmt(arena, tokens);
        break :s try utils.create(ast.Stmt, arena, e);
    } else n: {
        tokens.put_back(peek);
        break :n null;
    };

    return .{ .@"if" = .{
        .cond = cond_ptr,
        .then = then_ptr,
        .@"else" = else_ptr,
    } };
},

Now, before eating the pudding, the semantic analysis phase requires some tiny updates. There are no changes other than handling the new statements and expressions, and adding thoise is 11 lines total.

// resolve_stmt
.@"if" => |i| {
    try resolve_expr(bp, i.cond);
    try resolve_stmt(bp, i.then);
    if (i.@"else") |e|
        try resolve_stmt(bp, e);
},

// resolve_expr
.ternary => |t| {
    try resolve_expr(bp, t.@"0");
    try resolve_expr(bp, t.@"1");
    try resolve_expr(bp, t.@"2");
},

And that is it, really. Back to proving the pudding. Testing it on this lovely and non-confusing C file:

int main(void) {
    int a = 0;
    if (!a)
        if (3 / 4)
            a = 3;
        else
            a = 8 / 2;

    return a;
}

Gives me this result, which judging by the formatting is probably parsed correctly.

PROGRAM
	FUNCTION main
		int a <- 0;
		IF (! a)
			IF (/ 3 4)
				a <- 3;
			ELSE
				a <- (/ 8 2);
		RETURN a

All parsing and semantic analysis are succeeding and failing where they should. That's good news.

The Eye Test

Since I implemented all this nice pretty parsing, I would like to inspect on all the valid test files of the chapter. At least, visually confirm that there is nothing out of the ordinary. Same as I have been doing for C files in each chapter, but instead of copying a couple of interesting ones and running my app on them, I can do a batch comparison between each C file and the visual result of parsing it or validation or whatever.

The first thought was to do it with a shell script. But shell is voodoo to me.¹ build.zig is right there, so maybe I can use it to that effect.

The new command would need two arguments: the stage, which is then passed directly to the executable, and the target folder. I cannot just pass the target chapter because not all chapters have the same structure.

What followed was one of the most aggravating and annoying parts of this series.

The first hurdle was navigating Zig's standard library's file system API. The docs are not very clear and the different APIs are organized in weird places, and how does any of that tie back in the build system?

There is a walk function.² Excellent, but it needs a std.fs.Dir object. All I have is a std.Build.LazyPath. How do I make one into the other?

Turns out LazyPath has a getPath3 method (getPath and getPath2 are deprecated) that gives you a std.Build.Cache.Path object. And that in its turn has an openDir method that gives you a std.fs.Dir that then you can walk. Try figuring that out only from the docs.

The document traversal afterwards was simple enough. Basic Zig code.

while (try walker.next()) |entry| if (entry.kind == .file and
    std.mem.endsWith(u8, entry.basename, ".c")) // not to do .s files
{
    // insert entry handling here
};

The second hurdle, and if you actually know what I know now this will all seem very stupid, is a bit more involved. At first I passed the arguments as I did for the other previous commands. So like this

zig build eye -- path/to/test/folder --parse

And I desugared the arguments accordingly.

const args: []const []const u8 = b.args orelse
    &.{ "./c_files/", "--lex" }; // least harmful default;

So this way, if I don't pass anything, these would pass and the tree would be walked properly. Now zig build eye worked fine.

The problem is, if I run another command than eye, (say .. run), the directory walk would run anyway and then the build script would fail because the first argument is not a valid directory or whatever.

I did know that, btw.

Looking for help again someone suggested that the solution is to create a custom step, and even provided helpful code to do so. Here was a nice Gist I found on Google. This seems perfect. So I do this:

const std = @import("std");
pub fn build(b: *std.Build) !void {

    // stuff unchanged

    { // `zig build eye` command
        const eye_step = b.step("eye", "eye test all the files in a given directory");

        var closure = b.allocator.create(Closure) catch unreachable;
        closure.* = .{ .exe = exe, .step = std.Build.Step.init(.{
            .id = .custom,
            .name = "inner_eye",
            .makeFn = make_eye_step,
            .owner = b,
        }) };

        eye_step.dependOn(&closure.step);
    }
}

const Closure = struct {
    exe: *std.Build.Step.Compile,
    step: std.Build.Step,
};

fn make_eye_step(step: *std.Build.Step, _: std.Build.Step.MakeOptions) !void {
    const b = step.owner;
    const closure: *const Closure = @fieldParentPtr("step", step);
    const exe = closure.exe;

    const args: []const []const u8 = b.args orelse
        &.{ "./c_files/", "--lex" }; // least harmful default;

    // directory api boilerplate
    const lazy = b.path(args[0]);
    const path = lazy.getPath3(b, null);

    const dir = try path.openDir("", .{ .access_sub_paths = false });
    var walker = try dir.walk(b.allocator);
    defer walker.deinit();

    var prev_run_cmd: ?*std.Build.Step.Run = null;

    while (try walker.next()) |entry| {
        if (entry.kind == .file and std.mem.endsWith(u8, entry.basename, ".c")) {
            const file = entry.path;

            const bat = b.addSystemCommand(&.{ "bat", file });
            bat.setCwd(lazy); // I am proud of myself for finding this
            bat.stdio = .inherit;

            // each invokation of `bat` depends on the previous run step so they're sequential
            if (prev_run_cmd) |c|
                bat.step.dependOn(&c.step);

            const run_cmd = b.addRunArtifact(exe);
            run_cmd.setCwd(lazy);
            run_cmd.addArg(file);
            run_cmd.addArgs(args[1..]);
            run_cmd.stdio = .inherit;

            // exactly what the default for `zig build run` does.
            run_cmd.step.dependOn(b.getInstallStep());

            // keeping the squence
            run_cmd.step.dependOn(&bat.step);
            prev_run_cmd = run_cmd;
        }
    }

    // making sure the OG step depends on the tail of the chain.
    step.dependOn(&prev_run_cmd.?.step);
}

The way I see it, I am doing all the right things. All the steps depend on the proper steps, and best of all, the other zig build commands like run and test and what have you, all work fine.

Except this does not work. And I could not figure out.

Running this would dutifully go through the loop properly, as I was able to verify by inserting a few prints, but the commands do not run. And asking for help on the Discord only led me into StackOverflow-esque answers that I am not interested in repeating.

Asking for help on Ziggit actually gave me enough info to understand the real problem. This reply in particular answered the dilemma for me.

Do you know where the two phases of the build system are mentioned in the Official Build System Documentation and Intro? Because I cannot find it. It is mentioned in passing in the documentation of some methods like getPath3 and run, but with no proper explanation.

Anyway, it clicked, and I was able to use the user provided options to insert the needed details.

{ // `zig build eye` command
    const eye_step = b.step("eye", "Eye test all the files in a given directory");

    if (b.option(std.Build.LazyPath, "folder", "Path to eye")) |lazy| {
        // same logic as before, more or less
        try walk_tree(b, exe, eye_step, lazy);
    } else {
        // *this* step, and only this one would fail
        // if the `folder` option is not set, because it
        // depends on `fail` only if the path does not exist
        const fail = b.addFail("folder needed for eye");
        eye_step.dependOn(&fail.step);
    }
}

b.option creates a new argument for the user of the build system. This way, if the argument is present, the directory tree is walked and the proper dependencies are added to the eye command. If it is not, then the eye command will fail.

Other commands are unchanged and unaffected. The only difference from my first envisioning it is the way it is called, which I am not happy with, but eh. Pick your battles.

zig build eye -Dfolder="./c_files/" -- --parse

And it works exactly like I wanted. It calls bat on every file in a given directory followed by paella at the given stage, allowing for a quick eye test between the two. I'd paste the whole output here but I will settle for one file.

File: multiple_if.c
1 int main(void) {
2     int a = 0;
3     int b = 0;
4
5     if (a)
6         a = 2;
7     else
8         a = 3;
9
10     if (b)
11         b = 4;
12     else
13         b = 5;
14
15     return a + b;
16 }
PROGRAM
	FUNCTION main
		int a <- 0;
		int b <- 0;
		IF a
			a <- 2;
		ELSE
			a <- 3;
		IF b
			b <- 4;
		ELSE
			b <- 5;
		RETURN (+ a b)
================================

Automatic Formatting

While mucking around in the documentation I discovered a nice addFmt method that is there to check for proper formatting in CI, but can just format the code for you. Adding these two lines pretty much anywhere in the file formatted everything every time I zig build.

const fmt_step = b.addFmt(.{ .paths = &.{"./"} });
exe.step.dependOn(&fmt_step.step);

Internal Representation

Now is the time to get on with new IR generation. Similarly to lat chapter, no new instructions need be appended. So after the IR generation stage, it is done. Short chapter all around.

Generating the IR for if statements is straightforward, with a slight complication around the optional else statement.

.@"if" => |c| {
    const cond = try expr_emit_ir(bp, c.cond);
    const else_label = try bp.make_temporary("else");
    try bp.append(.{ .jump_z = .{ .cond = cond, .target = else_label } });
    try stmt_emit_ir(bp, c.then);
    if (c.@"else") |@"else"| {
        const end_label = try bp.make_temporary("end");
        try bp.append(.{ .jump = end_label });
        try bp.append(.{ .label = else_label });
        try stmt_emit_ir(bp, @"else");
        try bp.append(.{ .label = end_label });
    } else try bp.append(.{ .label = else_label });
},

For conditional expressions, it is much of the same, except expr_emit_ir is called instead and its value is returned, and there is no optional else clause.

.ternary => |t| {
    const else_label = try bp.make_temporary("else");
    const end_label = try bp.make_temporary("end");
    const dst_name = try bp.make_temporary("ter");
    const dst: ir.Value = .{ .variable = dst_name };

    const cond = try expr_emit_ir(bp, t.@"0");
    try bp.append(.{ .jump_z = .{ .cond = cond, .target = else_label } });
    const then = try expr_emit_ir(bp, t.@"1");
    try bp.append(.{ .copy = .init(then, dst) });

    try bp.append(.{ .jump = end_label });
    try bp.append(.{ .label = else_label });

    const else_ = try expr_emit_ir(bp, t.@"2");
    try bp.append(.{ .copy = .init(else_, dst) });

    try bp.append(.{ .label = end_label });

    return dst;
},

And that's it. I do not even have to run the intermediate codepaths as no changes to instructions or assembly generation this chapter. There is a small extra credit of implementing goto, but I am skipping extra credit this time.

Lessons Learned

Not to repeat ranting at Zig's documentation, but I definitely now understand Zig's build system at a deeper level than before.