Writing a C Compiler, Chapter 1, in Zig

First foray into the Zig programming language

In a constant drive to spend, or waste, time while I am looking for work, I got and worked through Nora Sandler's Writing a C Compiler in Rust. It is an excellent introduction to things like immediate representation and z86_64 assembly. About halfway through the book, however, I lost focus.

Continuing to have nothing particularly productive to do, I figured that I could try following the book again in a different language: either Swift or Zig. I do want to learn Swift eventually as writing Apps is probably more useful than writing operating systems, but the tooling leaves something to be desired. So Zig it is, for the time being.

Zig is a very opinionated "C replacement". It has a lot of weird and arbitrary design decisions: no typed integer literals (@as(u8, 1) instead of 1u8) for example. But it also has a lot of nice ideas and it is being posited as the hottest thing since sliced bread, so why not Zig. Swift can happen later.

Zig, and the Zig Version Manager were already on my machine from previous flirtations with the language. 0.14.0 is the latest zig "release", so that is what I am using.

I decided to call this project paella. I like paella. So without further ado, let's go.

build.zig

Zig is not being marketed, as it were, as a language. It is being marketed as a build system that just happens to have a language attached. zig init in the terminal creates a template project with full explanations of the basic building blocks.

Editing it to what I need, this is where I landed:

// build.zig
const std = @import("std");
pub fn build(b: *std.Build) void {

    // entry point
    const exe_mod = b.createModule(.{
        .root_source_file = b.path("src/main.zig"),

        // this is the target for the Book and my machine.
        .target = b.resolveTargetQuery(.{
            .cpu_arch = .x86_64,
            .os_tag = .macos,
        }),

        // choose optimization based on --release
        .optimize = switch (b.release_mode) {
            .off => .Debug,
            else => .ReleaseFast,
        }
    });

    const exe = b.addExecutable(.{
        .name = "paella",
        .root_module = exe_mod,
    });
    b.installArtifact(exe);

    // the rest is boiler plate for `zig build run` and `zig build test`
    // ...
}

I did not include all the boiler plate that would allow me to, use zig build run instead of zig run main src/main.zig. Not much better, just nicer. Also a build system currently would let me add dependecnies easier later.

build.zig.zon is whatever the initial command spit up, tbh. But it is where dependencies are to be listed later on.

The test suite and the book expect everything to use the x86_64 CPU architecture, which is a little inconvenient in macOS due to it being on aarm64. This is howver well explained in the book and the environemnt set up for it is already done when doing the Book previously in Rust.


The Compiler Driver

The next step is what is called in the book the Compiler Driver, or rather the framework which surrounds the compiler bits the Book is axctually about.

The Driver does several things:

  1. Takes a file path and an optional phase flag as arguments.
  2. Calls the system's compiler, gcc to preprocess the C source file at said path, (so no macros or comments or compiler flags or whatever),
  3. Calls the to-be-implemented compiler to transform the preproccessed file on disk into assembly, or stop at the requested compilation phase.
  4. And finally, calls gcc againt to assemble the assembly file into machine code.

The reason I am actually spelling it out, even though it is spelt out in the Book, is that it got me really confused at first. I thought the Driver was a separate thing that we will get to later. But no, it was pretty much step zero.

Argument Parsing

The simple argument parsing needed doesnt warrant getting a whole library in. The program will be called by the test suite and is not really meant for public consumption. So the bare mimimum it is. (The returned errors are for my benenfit.)

const Args = struct {
    path: [:0]const u8, // a string in zigland
    mode: Mode,
};
pub const Mode = enum { lex, parse, codegen, compile, assembly };

fn parse_args() !Args {
    var args = std.process.args();
    _ = args.skip();
    const path = args.next() orelse
        return error.PathNotFound;

    const mode: Mode = if (args.next()) |arg|
        std.meta.stringToEnum(Mode, arg[2..]) orelse
            return error.UnrecognizedFlag
    else
        .compile;

    return .{ .path = path, .mode = mode };
}

This code snippet makes use of some of the higher level Zig features, including two ways to unwrap optional values.

const optional_int: ?i32 = 5; // ?i32 is an optional i32

// first one
const If_is_an_expression: i32 = if (optional_int) |int|
    // `int` is in scope here and we can do things
    int
else
    7;

// second onw
// unwraps the result or returns a value.
const definite_int: i32 = optional_int orelse 7;

// the two are exactly the same.

Calling Commands

To run the preprocessor before compiling and the assembler after compiling, one needs to make use of std.process.Child.

The quickest and least painful way to do so is Child.run(). However, this by default allocates, and returns, stdout and stderr. If the intent is to forward them to the parent process (which it is), the easiest way is this:

var child = std.process.Child.init(
    &.{ "echo", "something" },
    allocator, // zig is all about passing allocators
);
// change options of `child` here: for example
// child.stdout_behavior = .Ignore;

// term is the retruned result of said process.
const term = try child.spawnAndWait();

Manipulating File Extensions

However, before that, let's get started at the much more complex problem of changing a file's extension.[^estension] Zig's standard library provides a few ways to get the components, as it were, from a file path. They live in the std.fs.path and std.mem namespacez. After a bunch of trails and errors, I came up with the folliwng.

const path = "foo/bar.txt";
const stem = try std.fs.path.join(
    alloc,
    &.{
        std.fs.path.dirname(path) orelse "",
        std.fs.path.stem(path),
    },
);
defer alloc.free(stem);

const new_path = try std.mem.join(
    alloc,
    ".",
    &.{
        stem,
        "TXT",
    },
);
defer alloc.free(new_path);
std.debug.assert(std.mem.eql(u8, new_path, "foo/bar.TXT"));

Finally!(?)

Putting two and two together, this is the first draft of the compiler driver. I am not even sure if it compiles or if I made any dumb mistakes, because the hole in the middle is rather large and it needs filling first.

const std = @import("std");

pub fn main() !void {
    var debug_allocator = std.heap.DebugAllocator(.{}).init;
    defer _ = debug_allocator.deinit(); // leak detection in Debug

    const gpa = debug_allocator.allocator();

    const args = try parse_args();
    try run(gpa, args);
}

pub fn run(
    alloc: std.mem.Allocator,
    args: Args,
) !void {
    const input_path = args.path;
    const pp_out, const asm_out, const exe =
        try get_output_paths(alloc, input_path);
    defer { // yay for manual memory management
        alloc.free(pp_out);
        alloc.free(asm_out);
        alloc.free(exe);
    }

    { // preprocessor
        var child = std.process.Child.init(
            &.{ "gcc", "-E", "-P", input_path, "-o", pp_out },
            alloc,
        );

        const term = try child.spawnAndWait();
        if (!std.meta.eql(term, .{ .Exited = 0 }))
            return error.PreprocessorFail;
    }

    { // compiler
        _ = args.mode;
        // mode controls compilation here

        // todo
        // take from path `pp_out` output to path `asm_out`
    }

    { // assembler
        var child = std.process.Child.init(
            &.{ "gcc", asm_out, "-o", exe },
            alloc,
        );

        defer std.fs.cwd().deleteFile(asm_out) catch {}; // cleanup

        const term = try child.spawnAndWait();
        if (!std.meta.eql(term, .{ .Exited = 0 }))
            return error.AssemblerFail;
    }
}

pub const Args = struct {
    path: [:0]const u8,
    mode: Mode,
};
pub const Mode = enum {
    lex,
    parse,
    codegen,
    compile, // default
    assembly, // unused by test script - useful for debugging
};

pub fn parse_args() !Args {
    var args = std.process.args();
    _ = args.skip();
    const path = args.next() orelse
        return error.PathNotFound;

    const mode: Mode = if (args.next()) |arg|
        std.meta.stringToEnum(Mode, arg[2..]) orelse
            return error.UnrecognizedFlag
    else
        .compile;

    return .{ .path = path, .mode = mode };
}

fn get_output_paths(
    alloc: std.mem.Allocator,
    input_path: []const u8,
) !struct {
    []const u8,
    []const u8,
    []const u8,
} {
    const exe = try std.fs.path.join(
        alloc,
        &.{
            std.fs.path.dirname(input_path) orelse "",
            std.fs.path.stem(input_path),
        },
    );
    errdefer alloc.free(exe);

    const pp = try std.mem.join(
        alloc,
        ".",
        &.{ exe, "i" },
    );
    errdefer alloc.free(pp);

    // keywords and arbitrary strings become idenitfiers in zig like so
    const @"asm" = try std.mem.join(
        alloc,
        ".",
        &.{ exe, "s" },
    );

    return .{ pp, @"asm", exe };
}

Lexer

The Book takes things very gradually. So in the first chapter it just focuses on build the bare minimum of the compiler. The C program to be compiled in that chapter is this:

int main (void) {
    return 0;
}

No, seriously, that's it. And the preprocessor takes care of pesky things like comments. The lexer is absurdly simple, for now.

In Ruat, I did all the lexing ahd parsing with nom, the lovely parser combinator library. However, here things should be done the old-fashioned, imperative way. That's why the 0.14.0 Zig release instroduced a new feature: labelled switch!1

const State = enum { a, b, c, d };

loop: switch (State.a) {
    .a => {
        std.debug.print("reached .a\n", .{});
        // fallthrough at home
        continue :loop .b;
    },
    .b => {
        std.debug.print("reached .b\n", .{});
        if (today == .monday)
            continue :loop .c
        else
            continue :loop .d;
    },
    .c => std.debug.print("reached .c\n", .{}),
    .d => std.debug.print("reached .d\n", .{}),
}

The biggest, clearest example of using it is in the Zig compiler itself. It is a cool design so I decided to just copy the design as is.2.

There are a few moving parts, so I added a bunch of comments to explain.

// lexer.zig
const std = @import("std");

// each Token carries a tag and a span (called `loc` here.)
pub const Token = struct {
    tag: Tag,
    loc: Loc,

    // location within the source file
    pub const Loc = struct {
        start: usize,
        end: usize,
    };

    // keywords identified here with their corresponding tags.
    pub const keywords = std.StaticStringMap(Tag).initComptime(.{
        .{ "return", .keyword_return },
        .{ "int", .type_int },
        .{ "void", .keyword_void },
        .{ "main", .keyword_main },
    });

    // helper function for the above
    pub fn getKeyword(bytes: []const u8) ?Tag {
        return keywords.get(bytes);
    }

    // the tag collection
    pub const Tag = enum {
        // punctuation
        l_paren,
        r_paren,
        l_brace,
        r_brace,
        semicolon,

        number_literal,

        // keywords
        type_int,
        keyword_void,
        keyword_return,
        keyword_main, // identifiers happen later

        // helpers
        identifier, // useful for state for now
        invalid,

        // this and the one below it are unused for now, but maybe useful later?
        pub fn lexeme(tag: Tag) ?[]const u8 {
            return switch (tag) {
                .invalid,
                .number_literal,
                .identifier,
                => null,

                .l_paren => "(",
                .r_paren => ")",
                .l_brace => "{",
                .r_brace => "}",

                .semicolon => ";",

                .type_int => "int",
                .keyword_void => "void",
                .keyword_return => "return",
                .keyword_main => "main",
            };
        }

        pub fn symbol(tag: Tag) []const u8 {
            return tag.lexeme() orelse switch (tag) {
                .invalid => "invalid token",
                .identifier => "an identifier",
                .number_literal => "a number literal",
                else => unreachable,
            };
        }
    };
};

// the lexer state machine
pub const Tokenizer = struct {
    // pointer to source
    buffer: [:0]const u8,

    // where in source we are
    index: usize = 0,

    pub fn init(buffer: [:0]const u8) Tokenizer {
        return .{ .buffer = buffer };
    }

    // states. this will grow in complexity with more tokens.
    const State = enum {
        start,
        identifier,
        int,
    };

    // the loop itself
    pub fn next(self: *Tokenizer) ?Token {
        // the eventually returned value.
        var result: Token = .{
            .tag = undefined,
            .loc = .{
                .start = self.index,
                .end = undefined,
            },
        };
        state: switch (State.start) {
            // the starting state for every new token.
            .start => switch (self.buffer[self.index]) {
                0 => { // nullbyte
                    if (self.index == self.buffer.len) {
                        return null; // eof
                    } else {
                        result.tag = .invalid;
                    }
                },
                ' ', '\n', '\t', '\r' => { // whitespace
                    self.index += 1; // advance cursor
                    result.loc.start = self.index;
                    continue :state .start; // and restart
                },
                'a'...'z', 'A'...'Z', '_' => {
                    result.tag = .identifier;
                    continue :state .identifier; // move to identifier state
                },
                '0'...'9' => {
                    result.tag = .number_literal;
                    self.index += 1;
                    continue :state .int; // move to integer state
                },
                // all of the following is self-explanatory really
                '(' => {
                    result.tag = .l_paren;
                    self.index += 1;
                },
                ')' => {
                    result.tag = .r_paren;
                    self.index += 1;
                },
                ';' => {
                    result.tag = .semicolon;
                    self.index += 1;
                },
                '{' => {
                    result.tag = .l_brace;
                    self.index += 1;
                },
                '}' => {
                    result.tag = .r_brace;
                    self.index += 1;
                },
                else => result.tag = .invalid,
            },

            .identifier => {
                self.index += 1;
                switch (self.buffer[self.index]) {
                    // keep going until ...
                    'a'...'z', 'A'...'Z', '_', '0'...'9' => continue :state .identifier,
                    // .. something other than those happens
                    else => {
                        const ident = self.buffer[result.loc.start..self.index];

                        // check if it s a keyword
                        if (Token.getKeyword(ident)) |tag| {
                            result.tag = tag;
                        }
                    },
                }
            },

            .int => switch (self.buffer[self.index]) {
                '0'...'9' => {
                    self.index += 1;
                    continue :state .int;
                },
                // integers not allowed to have letters after them, for now.
                'a'...'z', 'A'...'Z' => result.tag = .invalid,
                else => {},
            },
        }

        // tidy up.
        result.loc.end = self.index;
        return result;
    }
};

It is a bit cut down from the original behemoth, but this will grow in complexity as I advance through the book. The corresponding file in my Rust implementation eventually became full of macros and slowed rust-analyzer to a crawl.

The stub in main.zig is filled as follows:

// main.zig
{ // compiler
    // the use of the more complex function here is to specify the sentinel null terminator ...
    const src = try std.fs.cwd().readFileAllocOptions(
        alloc,
        pp_out,
        std.math.maxInt(usize),
        null,
        @alignOf(u8),
        0, // .. this null terminator.here. The rest are defaults.
    );
    try std.fs.cwd().deleteFile(pp_out); // cleanup
    defer alloc.free(src);

    var tokenizer = lexer.Tokenizer.init(src);
    while (tokenizer.next()) |token| {
        // just to see the results for now.
        std.debug.print("{?}: {s}\n", .{
            token.tag,
            src[token.loc.start..token.loc.end],
        });

        switch (token.tag) {
            .invalid => return error.LexFail,
            else => {},
        }
    }

    if (args.mode == .lex) return;

    // to be continued
}

Unlike the original implementation, there are a few differences: it does not attempt to recover on new lines, but breaks as soon as one invalid token happens. This is actually a terrible idea but it is passable in a toy compiler. Also there is no eof token, choosing to return null on source file end. Not very data oriented of me, as it increases the size by a byte (probably), but I will live.

Bugs shall be found when I tie it to the book's test suite.

Setting Up Tests

This is actually so dumb it is not worth its own section. Just clone the book's tests repo and create a symlink to the compiler in the base directory. For futre reference, this is what I typed to create said symlink:

# while in the tests folder
ln -s ../paella/zig-out/bin/paella

Then I just call the tests as follows from the same said folfer.

./test_compiler ./paella --chaoter 1 --stage lex

All tests should pass now.

Ran 24 tests in 2.008s

FAILED (failures=19)

Oh come on.

All of these errors seem to be hitting the error.UnrecognizedFlag specified in the argument parser. Debugging the situation, with friendly std.debug.print, showed me that not all calls to the function have the same order of arguments, which is definitely a surprise. The argument parsing code clearly needs to be revised.

pub fn parse_args() !Args {
    var args = std.process.args();
    _ = args.skip();

    var path: ?[:0]const u8 = null;
    var mode: Mode = .compile;

    while (args.next()) |arg| {
        if (arg[0] == '-') {
            mode = std.meta.stringToEnum(Mode, arg[2..]) orelse
                return error.UnrecognizedFlag;
        } else if (path == null) {
            path = arg;
        } else {
            return error.PathDeplicated;
        }
    }

    return .{
        .path = path orelse return error.PathNotFound,
        .mode = mode,
    };
}

Let's try again.

----------------------------------------------------------------------
Ran 24 tests in 2.420s

OK

Yay!


AST

Dealing with abstract syntax trees (ASTs) is pretty much the reason the book tells the reader to use a functional language with sum types3 and pattern matching. In fact, the reference implementation is in relatively easy to read OCaml.

To explore how to represent the AST in Zig, I mostly just looked for C resources. This very nice article explains how to represent a basic AST in C, and as a bonus compares it to an equivelant OCaml (as it happens) implementation.

I also had a read on Mitchell Hashimoto's series on the Zig compiler. The series is very informative, but to be honest there is no real need for this project to be super data oriented design driven with maximum cache friendliness and whatever maybe. A, relatively, naive implementation of an AST is perhaps much more useful for learning putposes.

That said, I can adapt some of the interesting ideas mentionedm like "splatting" some enums. For example, assume an expression can be either a binary operation or a unary operation. It would be represented naively as follows:

const Expr = union(enum) {
    number: u64,
    binary: struct { BinOp, *Expr, *Expr },
    unary: struct { UnOp, *Expr },
};

const BinOp = enum { add, sub, and_, or_ };
const UnOp = enum { negate, not };

However, splatting the Expr enum would make the whole type slightly smaller. As an added benefit, many parts of the code would have different behaviour depending on what type of binary or unary operator it is, leading to switch statements within switch statements. The flatter design would look liek this:

const Expr = union(enum) {
    number: u64,
    bin_add: struct { *Expr, *Expr },
    bin_sub: struct { *Expr, *Expr },
    bin_and: struct { *Expr, *Expr },
    bin_or: struct { *Expr, *Expr },
    un_negate: *Expr,
    un_not: *Expr,
};

It looks slightly more verbose, but it is exactly the same amount of data. The types are being summed in our sum types!

In the actual Zig compiler, the pointers are replaced by indices into the other Expr nodes where they are stored in a giant array: Handles being the better pointers and all that. As I do not particualy need that kind of optimization, I will just throw every node into an Arena and free it all at once once I am done with the AST (in a later stage from parsing.)

The needs for this chapter are a lot simpler, however. The entirety of our current AST can be summarized as follows.

const Prgm = struct {
    func_def: *FuncDef,
};

const FuncDef = struct {
    name: []const u8,
    body: *Stmt,
};

const Stmt = union(enum) {
    @"return": *Expr,
};

const Expr = union(enum) {
    constant: u64,
};

Parser

Writing a parser is a bit more involved, and it is something I have never done before. In my previous implementation I used a parser combinator library. It was probably the easiest and smoothest part.

There are parser combinator libraries for Zig. mecha is the first result on Google.4

I think at first I will try my handrolling the parser. Should that prove to be arduous, I will maybe switch to mecha.

The pseudocode given in the book translates, in Zig, to the next listing. I am passing the allocator as all items should be allocated in an Arena, as mentioned before. I am also passing the tokenizer as a reference. I do not expect to have a lot of backtracking, as it is C I am compiling, after all, so the tokens are allocated nowehre and the tokenizer just produces them as it goes along.

fn parse_stmt(
    alloc: std.mem.Allocator,
    tokens: *lexer.Tokenizer,
) !*ast.Stmt {
    try expect(.keyword_return, tokens);
    const expr = try parse_expr(alloc, tokens); // stub
    try expect(.semicolon, tokens);

    const ret = try alloc.create(ast.Stmt);
    ret.* = .{ .@"return" = expr };

    return ret;
}

fn expect(
    expected: lexer.Token.Tag,
    tokens: *lexer.Tokenizer,
) !void {
    if (tokens.next()) |actual| {
        if (actual.tag != expected)
            return error.SyntaxError;
    } else return error.SyntaxError;
}

Looks simple enough. Let's try the rest.

parse_expr has currently only one thing: an integer. It should have some structure, however. It does not use expect as expect discards its token.

fn parse_expr(
    alloc: std.mem.Allocator,
    tokens: *lexer.Tokenizer,
) !*ast.Expr {
    const current = tokens.next() orelse
        return error.ExpectExpr;

    switch (current.tag) {
        .number_literal => {
            const lit = tokens.buffer[current.loc.start..current.loc.end];
            const res = try std.fmt.parseInt(u64, lit, 10);

            const ret = try alloc.create(ast.Expr);
            ret.* = .{ .constant = res };

            return ret;
        },
        else => return error.ExpectExpr,
    }
}

The other two, parse_func_def and parse_prgm are as straightforward as they get. I am putting off dealing with identifiers for now, because I do not want to put them in the arena, but use some sort of global hashmap instead. Maybe for the next chapter.

fn parse_prgm(
    alloc: std.mem.Allocator,
    tokens: *lexer.Tokenizer,
) !*ast.Prgm {
    const func_def = try parse_func_def(alloc, tokens);

    const ret = try alloc.create(ast.Prgm);
    ret.* = .{ .func_def = func_def };

    return ret;
}

fn parse_func_def(
    alloc: std.mem.Allocator,
    tokens: *lexer.Tokenizer,
) !*ast.FuncDef {
    try expect(.type_int, tokens);

    // this should be an identifier but i am taking a rain check
    try expect(.keyword_main, tokens);

    try expect(.l_paren, tokens);
    try expect(.keyword_void, tokens);
    try expect(.r_paren, tokens);

    try expect(.l_brace, tokens);
    const body = try parse_stmt(alloc, tokens);
    try expect(.r_brace, tokens);

    const ret = try alloc.create(ast.FuncDef);
    ret.* = .{ .name = "main", .body = body };

    return ret;
}

Great. Now to call them from main.zig and build and run the tests.

AssertionError: Didn't catch error in chapter_1/invalid_parse/extra_junk.c

----------------------------------------------------------------------
Ran 24 tests in 4.164s

FAILED (failures=1)

Pfft. This is extra_junk.c

int main(void)
{
    return 2;
}
// A single identifier outside of a declaration isn't a valid top-level construct
foo

Oh that's why. Adding a small check to parse_prgm should solve the issue.

pub fn parse_prgm(
    alloc: std.mem.Allocator,
    tokens: *lexer.Tokenizer,
) !*ast.Prgm {
    const func_def = try parse_func_def(alloc, tokens);

    // funny piece of syntax.
    if (tokens.next()) |_| return error.ExtraJunk;

    const ret = try alloc.create(ast.Prgm);
    ret.* = .{ .func_def = func_def };

    return ret;
}

This passes all tests.

To be honest I do not if I am doing things correctly yet. Less so in writing a compiler part but more so in the not leaking any memory part. The DebugAllocator does tell if there are leakages, but the test runner hides the output from me unless there is a test failure. It does not matter in the writing a compiler side of things, but I guess it does matter on the learning properly side of things. Whatever. What's the next task?

Pretty Printing the AST

Before I start on the actual next task, now would be a pretty good time to start implementing pretty printing. It would help massively down the line with debugging.

It is possible in Zig to create custom formatters for types, which I plan to use. However it relies on weird generic pseudo-interface shenanigans, very unlike what is in the Rust system. Zig's interface like behaviour is structural: it checks if there is a function a specific name and type in there, and if not, does something else.

First let's show the result without pretty printing, fir the following C program.

int main (void) {
    return 0;
}

When parsed into an AST, it prints by default like that:

ast.Prgm{ .func_def = ast.FuncDef{ .name = { 109, 97, 105, 110 }, .body = ast.Stmt{ .return = ast.Expr{ ... } } } }

Not particularly useful or helpful.

The way to create a custom formatter is to add this function to the type's definition:

pub fn format(
    self: @This(),
    comptime fmt: []const u8,
    options: std.fmt.FormatOptions,
    writer: anytype,
) !void {
    // do stuff
}

The writer field seems to have the functions print and writeAll, based on various examples, but I do not actually know from this signature what is vailable.5 This resolves at compile time via C++ template-like SFINAE behaviour (as far as I understand it anyway).

FormatOptions is a lot more interesting. this is the definition from the Zig standard library:

pub const FormatOptions = struct {
    precision: ?usize = null,
    width: ?usize = null,
    alignment: Alignment = default_alignment,
    fill: u21 = default_fill_char,
};

Like in did in my Rust implementation, I can use the optional width variable to pass the indentation level to each child element. If this does not make sense now, it might in a minute.

The formatting apparatus in Zig is also explained in the standard library documentation, which is explained in neat examples in zig.guide. Here is the format written out for your convenience:

/// `{[argument][specifier]:[fill][alignment][width].[precision]}`

There was one missing piece though. How do I pass a variable to the width paramter? All the examples seem to use a literal. This took some tinkering to figure out, but figure it out I did.

The signature of std.debug.print accepts in its second argument an anytype, which is analyzed at comptime and matched with the given formatting string (which must be comptime known), and it fails to compile if they do not. I always passed on a tuple and relied on its general ordering, like so.

std.debug.print("{?}\n", .{prgm});

But the args parameter does not have to be a tuple. The struct elements can have names!! This explained very briefly in the docs (found in std.fmt.zig in the standard library) like so:

/// - *argument* is either the numeric index or the field name of the argument that should be inserted
///   - when using a field name, you are required to enclose the field name (an identifier) in square
///     brackets, e.g. {[score]...} as opposed to the numeric index form which can be written e.g. {2...}

I did not know that. But that laos meant I can pass the width as a variable as well if I pass it in as a named field and surround it in square brackets.

//    magic happens here vvvvvvv
try writer.print("{[def]:[width]}", .{
    .def = self.func_def,
    .width = w, // passed on from before.
});

// this also works. vvv index
try writer.print("{:[1]}", .{
    self.func_def,
    w,
});

This passes the w variabe to the FormatOptions of the called formatter. voila!!

What follows is conceptually simple. This is my format function implemented for ast.Prgm:

pub fn format(
    self: @This(),
    comptime _: []const u8,
    options: std.fmt.FormatOptions,
    writer: anytype,
) !void {
    try writer.print("PRORGAM\n", .{});

    try writer.print("{[def]:[w]}", .{
        .def = self.func_def,
        .w = (options.width orelse 0) + 1,
    });
}

And this is the one for FuncDef:

pub fn format(
    self: @This(),
    comptime _: []const u8,
    options: std.fmt.FormatOptions,
    writer: anytype,
) !void {
    const w = options.width orelse 0;
    try writer.writeByteNTimes('\t', w);

    try writer.print("FUNCTION {s}\n", .{self.name});
    try writer.print("{[body]:[w]}", .{
        .body = self.body,
        .w = w + 1,
    });
}

And the rest, as chess players say, is just technique. Now running our program on the C snippet gives this glorious output.

PRORGAM
	FUNCTION main
		RETURN 0

This creates an additional maintainance burden the bigger the compiler grows, but it is worth it for the debugging goodness.

Now back to work.


Assembly Generation

Straight to the point. I like it. No intermediate representations or anything tacky of the sort.6 So let's get on with it.

The first step in assembly generation is to actually put up an syntax tree for assembly. the second tree is whwat we use to actually generate the assembly code at the end. So, the first thing I will do here is to create an assembly.zig file with the similar structure as ast.zig. This is fairly straightforward.

pub const Prgm = struct {
    func_def: *FuncDef,
};

pub const FuncDef = struct {
    name: []const u8,
    instrs: std.ArrayListUnmanaged(Inst),
};

pub const Inst = union(enum) {
    mov: struct { src: Operand, dst: Operand },
    ret: void,
};

pub const Operand = union(enum) {
    imm: u64,
    reg: void,
};

To generate this AST from the source AST, I am going to srtck the logic in yet another file I will call asm_gen.zig. So many files. I am not quite sure about storing the body of instructions this way, but it saves from the allocation dance every time I need to edit the contents down the line. Using a SimplyLinkedList might be an option too. For now, this is it.

The Unmanaged part is sems intimidating, but it just really means that it does not keep a reference to its allocator, putting the responsibility of making sure the same allocator is passed every time on the caller. My understanding is that this is mean to be the default interface for ArrayList going forard.

Back to task. After wrangling a bit with zig's compiler errors and its type checks7, I came up with these intimidating looking functions with all kinds of memory leaks in the sad path (I fear).

pub fn prgm_to_asm(
    alloc: std.mem.Allocator,
    prgm: ast.Prgm,
) !assembly.Prgm {
    const func_def = try alloc.create(assembly.FuncDef);
    errdefer alloc.destroy(func_def);
    func_def.* = try func_def_to_asm(alloc, prgm.func_def.*);

    return .{ .func_def = func_def };
}
fn func_def_to_asm(
    alloc: std.mem.Allocator,
    func_def: ast.FuncDef,
) !assembly.FuncDef {
    const name = try alloc.dupe(u8, func_def.name);
    errdefer alloc.free(name);

    const instrs = try stmt_to_asm(alloc, func_def.body.*);

    return .{
        .name = name,
        .instrs = instrs,
    };
}

This next one took a lot of wrangling until I figured out the syntax for a slice of unions.

fn stmt_to_asm(
    alloc: std.mem.Allocator,
    stmt: ast.Stmt,
) !std.ArrayListUnmanaged(assembly.Inst) {
    switch (stmt) {
        .@"return" => |value| {
            var result: std.ArrayListUnmanaged(assembly.Inst) = .empty;
            try result.appendSlice(alloc, &.{
                .{ .mov = .{
                    .src = try expr_to_asm(alloc, value.*),
                    .dst = .reg,
                } },

                // I could just ype `.ret,` here but that might be confusing
                .{ .ret = {} },
            });

            return result;
        },
    }
}

And last but not least, the absolute bottom of the totem pole. The allocator is passed in here just in case because I think I might need it later.

fn expr_to_asm(
    _: std.mem.Allocator,
    expr: ast.Expr,
) !assembly.Operand {
    switch (expr) {
        .constant => |i| return .{ .imm = i },
    }
}

Testing this out with the tiny C program gives me a rather ugly output. Implementing pretty printing (based on our previous attempt) for our assembly AST gives us this nice output.

PRORGAM
	FUNCTION main
		mov imm 0, register
		ret

Obviously this can just be repurposed to generate the actual assembly. So that is what is going to happen next.

Code Emission

Here is a funny thing. The format function that needs to be implemented for the formatting interface has a field that I have so far discarded. This field is one of the option the user tells the formatter what to do, and it is used all the time for everyone. For example, when you want to print a digit as a digit, you do print("{d}, .{my_digit}). Wheh nyou want an ASCII character, print("{c}, .{my_digit}) is your friend. This tiny d or c is the string passed in through the hitherto discarded input. And I can define it to be anything I want!! I actually tried the phrase free syria with a space, and some Arabic text,, and it passed through as expected.

I have been printing to the terminal like so:

std.debug.print("{}\n", .{prgm});

Which gives me the (arguably) nice debugging output seen above. But if I, say, call it like so:

std.debug.print("{gen}\n", .{prgm});

I can have it emit the assembly code wanted just fine. And since it works with any Writer, I can write it to a file immediately. Pretty cool, and a similar interface to what Rust does, anytype notwithstanding.

Without boring you with the details, the formatting prints this really nice assembly (which does not forget the underscore since I am on macOS):

	.globl _main
_main:
	movl $0, %eax
	ret

And this concludes all tests for this chapter, which pass with flying colors.


Lessons Learned

This is, currently, a tiny do-nothing program. However, in the course of implementing it, I learned the followings things:

  1. More comfort with Zig's syntax.
  2. Learned a bit about Zig's standard library (environemnt, file system, child processes, collections), and build system.
  3. Not shown here, but I managed to figure out how to actually add dependencies. It is a bit more involved than cargo add, but it is nice enough.
  4. I used jj and learned its commands for this project. However, I have not yet gotten to pushing this project to GitHub. That maybe later.

I think this program is a decent testing ground for someone wanting to learn a new language, especially if done before in a familiar one.

Jury's out on whether to continue with chapter 2 with Zig or try reimplementing chapter 1 in Swift, (or maybe Gleam? How do you even run Gleam programs?)

That's all, and thanks for reading.