January 10, 2025

Brainfuck v2.0

Last time, we implemented a simple Brainfuck interpreter in a naive way. We defined a data pointer, an instruction pointer, and some memory, and simply stepped through each instruction. But we can do better. Let's tackle two simple optimizations. First, we compress runs of identical instructions, so that '+++++' becomes '+5'. This works for the +, -, >, and < instructions. The second optimization is to precompute the matching pairs of jump instructions. This way, when we encounter a '[', we don't have to scan all the way through the program to find the matching ']', but can instantly jump there using a precomputed address.
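To make the first optimization concrete, here's a minimal sketch of run-length compression on its own, before we build the full tokenizer. The function name compressRuns and the (Character, Int) tuple output are just for illustration; the real lexer below produces proper token types. Note the two rules it follows, which the real lexer shares: non-Brainfuck characters are comments and do not break a run, while any other Brainfuck instruction does.

let mathOps = Set("+-><")
let otherOps = Set("[].,")

// Sketch only: compress runs of identical math instructions into
// (instruction, count) pairs, e.g. "+++++" -> ("+", 5).
func compressRuns(_ source: String) -> [(Character, Int)] {
    var runs: [(Character, Int)] = []
    var runBroken = true
    for ch in source {
        if mathOps.contains(ch) {
            if !runBroken, let last = runs.last, last.0 == ch {
                runs[runs.count - 1].1 += 1  // extend the current run
            } else {
                runs.append((ch, 1))         // start a new run
            }
            runBroken = false
        } else if otherOps.contains(ch) {
            runBroken = true  // any other instruction ends the current run
        }
        // everything else is a comment and does not break a run
    }
    return runs
}

// compressRuns("+++ comment ++-") yields [("+", 5), ("-", 1)]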

What we are really doing is converting our program into an intermediate representation. Instead of a bare character, we want to define some sort of Token data type that holds more information. For what I'm calling our 'MathTokens' (i.e., +, -, >, <), we want to hold a size field. For our jump tokens, we want to hold a destination address. And for our I/O tokens, we don't need any extra information. First, let's define an enum TokenType to designate the eight different types of token:

enum TokenType {
    case Add
    case Subtract
    case Inc
    case Dec
    case JumpIfZero
    case JumpIfNotZero
    case Output
    case Input
}

We'll also find it useful to map the actual string characters to their corresponding token types:

let tokenMap: [Character: TokenType] = [
    "+": .Add,
    "-": .Subtract,
    "<": .Dec,
    ">": .Inc,
    "[": .JumpIfZero,
    "]": .JumpIfNotZero,
    ".": .Output,
    ",": .Input
]

Now we could just declare one kind of Token struct and make the size and destination properties optionals. But let's have some fun with Swift's type system and overengineer things a bit. I'm going to define a protocol. Swift protocols are a bit like abstract classes (or interfaces in other languages): they define a set of properties or methods that any conforming type is guaranteed to have.

protocol BFToken {
    var type: TokenType { get set }
    var line: Int { get set }
    var lexeme: String { get set }
}

Here we are saying that any BFToken must have a type, a line number, and a lexeme (the original program string that generated the token). Those last two might be helpful for debugging. Now we can define three different token types that conform to this protocol:

struct MathToken: BFToken {
    var type: TokenType
    var size: Int
    var line: Int
    var lexeme: String
}

struct JumpToken: BFToken {
    var type: TokenType
    var destination: Int
    var line: Int
    var lexeme: String
}

struct IOToken: BFToken {
    var type: TokenType
    var line: Int
    var lexeme: String
}

Now that we have our data types, we can implement the first pass of our lexer/tokenizer.

var line = 1
var idx = 0
var tokenList: [BFToken] = []

Note the type of tokenList: it can hold any type that conforms to the BFToken protocol. Neat!

while idx < program.count {
    let cur = program.index(program.startIndex, offsetBy: idx) // Swift uses a separate string index type!
    
    switch program[cur] {
    case "+", "-", ">", "<":
        var count = 1
        var new_idx = idx + 1
        var new_cur = program.index(program.startIndex, offsetBy: new_idx)
        let stopSymbols = Set("+-><[].,").subtracting([program[cur]])
        
        while new_idx < program.count && !stopSymbols.contains(program[new_cur]) {
            if program[new_cur] == "\n" { line += 1 }
            if program[new_cur] == program[cur] { count += 1 }
            new_idx += 1
            new_cur = program.index(program.startIndex, offsetBy: new_idx)
        }
        
        tokenList.append(MathToken(
                                type: tokenMap[program[cur]]!,
                                size: count,
                                line: line,
                                lexeme: program.substring(from: idx, to: new_idx)))
        idx = new_idx

This first case in our switch block is doing a few things. First, Swift lets us match against several cases at once, so we check for a math token. Then, we need to see how long the run of identical tokens is, so we define a new index and increment it until we run into a stop symbol (i.e., a different Brainfuck instruction). We want to continue moving past any non-Brainfuck characters, which we treat as comments. This way, a block of BF code like +++ comment ++- will get tokenized into a +5 token and a -1 token. Once we find the end of the run, we add a MathToken to our tokenList. Handling the other symbols is pretty simple:

    case "[", "]":
        tokenList.append(JumpToken(
                            type: tokenMap[program[cur]]!,
                            destination: -1,
                            line: line,
                            lexeme: String(program[cur])))
        idx += 1
    case ",", ".":
        tokenList.append(IOToken(
                            type: tokenMap[program[cur]]!,
                            line: line,
                            lexeme: String(program[cur])))
        idx += 1
    case "\n":
        idx += 1
        line += 1
    default:
        idx += 1
    }
}

And that's our first pass. We now have a list of BF tokens. However, we just filled in -1 for the destination of each JumpToken as a placeholder. Now we have to fix that in our second pass. Finding matching pairs of brackets is a classic CS problem. We scan through our list, and every time we find an open bracket, we push its index onto a stack. When we find a closing bracket, we pop from the stack. If we try to pop from an empty stack, or if we are left with a non-empty stack at the end, we know there's a mismatched bracket somewhere and we can throw an error.

var stack: [Int] = []
for (idx, token) in tokenList.enumerated() {
    if token.type == .JumpIfZero { //Open bracket
        stack.append(idx)
    } else if token.type == .JumpIfNotZero {
        if let match = stack.popLast() {
            var opener = tokenList[match] as! JumpToken
            var closer = tokenList[idx] as! JumpToken
            opener.destination = idx
            closer.destination = match
            tokenList[match] = opener
            tokenList[idx] = closer
        } else {
            fatalError("Unmatched bracket at line \(token.line)")
        }
    }
}
if !stack.isEmpty {
    fatalError("Unmatched brackets left over from line \(tokenList[stack[0]].line)")
}

I'm doing a little dancing around the type system here. Because tokenList only knows that each element is some kind of BFToken, I'm explicitly casting to JumpToken so we can modify the destination value. I'm sure there is a more idiomatic way to do this, but I haven't found it yet. If we find unmatched brackets, we throw an error and let the user know which line we found them at.
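One alternative worth sketching: if JumpToken were a final class instead of a struct, it would have reference semantics, so casting an element of the list would give us a handle to the stored object itself and the copy-back dance would disappear. This is only a sketch, with the supporting types repeated so it stands alone; in the real program you'd change the struct keyword to final class and add an initializer.

enum TokenType { case Add, Subtract, Inc, Dec, JumpIfZero, JumpIfNotZero, Output, Input }

protocol BFToken {
    var type: TokenType { get set }
    var line: Int { get set }
    var lexeme: String { get set }
}

// A class, not a struct: casting now yields a reference to the stored object.
final class JumpToken: BFToken {
    var type: TokenType
    var destination: Int = -1
    var line: Int
    var lexeme: String

    init(type: TokenType, line: Int, lexeme: String) {
        self.type = type
        self.line = line
        self.lexeme = lexeme
    }
}

// The program "[]" as two tokens; mutating through the cast updates the
// object inside the array directly, with no copy-back.
var tokenList: [BFToken] = [
    JumpToken(type: .JumpIfZero, line: 1, lexeme: "["),
    JumpToken(type: .JumpIfNotZero, line: 1, lexeme: "]"),
]
(tokenList[0] as! JumpToken).destination = 1
(tokenList[1] as! JumpToken).destination = 0

The trade-off is that reference semantics make the token list shareable mutable state, which structs deliberately avoid, so it's a judgment call rather than a strict improvement.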

So that's two passes to convert a string of characters into a more useful intermediate representation. I wouldn't quite call it an abstract syntax tree, but we're doing something similar, if much simpler. Now we can interpret this list of tokens and save ourselves a lot of iterating over our program string. After we set up our instructionPointer, dataPointer, memory, and output buffer just the same as last time, we can write a new interpret loop with more of that sweet, sweet pattern matching:

while instructionPointer < tokenList.count {
    switch tokenList[instructionPointer] {
    case let token as MathToken:
        switch token.type {
        case .Add: memory[dataPointer] += token.size
        case .Subtract: memory[dataPointer] -= token.size
        case .Inc: dataPointer += token.size // could wrap with % MAX_MEM
        case .Dec: dataPointer -= token.size
        default: break
        }
        instructionPointer += 1
    case let token as IOToken:
        if token.type == .Input {
            memory[dataPointer] = Int(getchar())
        } else if token.type == .Output {
            output.append(memory[dataPointer])
        }
        instructionPointer += 1
    case let token as JumpToken:
        if token.type == .JumpIfZero && memory[dataPointer] == 0 {
            instructionPointer = token.destination
        } else if token.type == .JumpIfNotZero && memory[dataPointer] != 0 {
            instructionPointer = token.destination
        } else {
            instructionPointer += 1
        }
    default:
        break
    }
}

This time, case let token as MathToken not only matches a token of type MathToken but also performs the cast for us, so we can treat token as a MathToken and access the size field without an explicit cast. Much cleaner. Anyway, that's our v2.0 interpreter! You can find the code on my github here, where I've moved both passes and the interpret loop into a class (because I am infected with the OOP mind virus).

One final postscript. Up there, I used a string method substring(from: Int, to: Int), which you won't find in Swift. I added it as an extension to the String type because messing around with startIndex and the special string index type every time I wanted a substring was annoying, especially when I was doing Advent of Code. I find it pretty useful, though I suppose it would be inefficient at scale. It's also a good excuse to show off extensions in Swift:

extension String {
    func substring(from: Int, to: Int) -> String {
        let start = self.index(self.startIndex, offsetBy: from)
        let end = self.index(self.startIndex, offsetBy: to)
        return String(self[start..<end])
    }
}
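A quick usage example, with the extension repeated so the snippet stands on its own:

extension String {
    func substring(from: Int, to: Int) -> String {
        let start = self.index(self.startIndex, offsetBy: from)
        let end = self.index(self.startIndex, offsetBy: to)
        return String(self[start..<end])
    }
}

let program = "+++[->+<]"
// Slice out the bracketed loop body by plain integer offsets.
let loop = program.substring(from: 3, to: 9)  // "[->+<]"

One caveat: like Swift's own index(_:offsetBy:), this traps if an offset runs past the end of the string, so out-of-range arguments crash rather than returning a partial result.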

Where do we go from here? I haven't quite decided yet. Maybe we'll try to compile to assembly from our intermediate representation. Making a just-in-time compiler might be beyond my expertise for now, but compiling to asm or even llvm IR might be fun. Stay tuned!