# Brainfuck v2.0
Last time, we implemented a simple Brainfuck interpreter in a naive way. We defined a data pointer, an instruction pointer, and some memory, and simply stepped through each instruction. But we can do better. Let's tackle two simple optimizations. First, we compress blocks of identical instructions, so that '+++++' becomes '+5'. This works for the +, -, >, and < instructions. The second optimization will be to precompute the matching pairs of jump instructions. This way, when we encounter a '[', we don't have to iterate all the way through the program to find the matching ']'; we can instantly jump there using a precomputed address.
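To make the first optimization concrete, here's a minimal, standalone sketch of run-length compressing a BF string. (The names here, like `runs`, are just for this illustration; the real tokenizer below also tracks line numbers and skips comment characters.)

```swift
// Collapse runs of identical characters into (character, count) pairs.
let src = "+++++>>--"
var runs: [(Character, Int)] = []
for ch in src {
    if let last = runs.last, last.0 == ch {
        runs[runs.count - 1].1 += 1  // extend the current run
    } else {
        runs.append((ch, 1))         // start a new run
    }
}
// runs is now [("+", 5), (">", 2), ("-", 2)]
```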
What we are really doing is converting our program into an intermediate representation. Instead of a bare character, we want to define some sort of Token data type that holds more information. For what I'm calling our 'MathTokens' (i.e., +, -, >, <), we want to hold a size field. For our jump tokens, we want to hold a destination address. And for our I/O tokens, we don't need any extra information. First, let's define an enum TokenType to designate the eight different types of token:
```swift
enum TokenType {
    case Add
    case Subtract
    case Inc
    case Dec
    case JumpIfZero
    case JumpIfNotZero
    case Output
    case Input
}
```
We'll also find it useful to map the actual string characters to their corresponding token types:
```swift
let tokenMap: [Character: TokenType] = [
    "+": .Add, "-": .Subtract, "<": .Dec, ">": .Inc,
    "[": .JumpIfZero, "]": .JumpIfNotZero,
    ".": .Output, ",": .Input
]
```
Now we could just declare a single Token struct and make the size and destination properties optional. But let's have some fun with Swift's type system and overengineer things a bit. I'm going to define a protocol. Swift protocols are a bit like abstract classes: they define a set of fields or methods that any conforming type is guaranteed to have.
```swift
protocol BFToken {
    var type: TokenType { get set }
    var line: Int { get set }
    var lexeme: String { get set }
}
```
Here we are saying that any BFToken object must have a type, a line number, and a lexeme (the original program string that generated this token). Those last two might be helpful for debugging. So we can define three different types of token that conform to this protocol:
```swift
struct MathToken: BFToken {
    var type: TokenType
    var size: Int
    var line: Int
    var lexeme: String
}

struct JumpToken: BFToken {
    var type: TokenType
    var destination: Int
    var line: Int
    var lexeme: String
}

struct IOToken: BFToken {
    var type: TokenType
    var line: Int
    var lexeme: String
}
```
Now that we have our data types, we can implement the first pass of our lexer/tokenizer.
```swift
var line = 1
var idx = 0
var tokenList: [BFToken] = []
```
Note the type of tokenList. It can hold any type that conforms to the BFToken protocol. Neat!
```swift
while idx < program.count {
    let cur = program.index(program.startIndex, offsetBy: idx) // Swift uses a separate string index type!
    switch program[cur] {
    case "+", "-", ">", "<":
        var count = 1
        var new_idx = idx + 1
        var new_cur = program.index(program.startIndex, offsetBy: new_idx)
        let stopSymbols = Set("+-><[].,").subtracting([program[cur]])
        while new_idx < program.count && !stopSymbols.contains(program[new_cur]) {
            if program[new_cur] == "\n" { line += 1 }
            if program[new_cur] == program[cur] { count += 1 }
            new_idx += 1
            new_cur = program.index(program.startIndex, offsetBy: new_idx)
        }
        tokenList.append(MathToken(
            type: tokenMap[program[cur]]!,
            size: count,
            line: line,
            lexeme: program.substring(from: idx, to: new_idx)))
        idx = new_idx
```
This first case in our switch block is doing a few things. First, Swift lets us match against several cases at once, so we check for a math token. Then, we need to see how long the run of identical tokens is, so we define a new idx and increment it until we run into a stop symbol (i.e., a different Brainfuck instruction). We want to continue moving and ignore any non-Brainfuck characters, which we treat as comments. This way, a block of BF code like
```
+++ comment
++-
```
will get tokenized into a +5 token and a -1 token. Once we find the end of the block, we add a MathToken to our tokenList. Handling the other symbols is pretty simple:
```swift
    case "[", "]":
        tokenList.append(JumpToken(
            type: tokenMap[program[cur]]!,
            destination: -1,
            line: line,
            lexeme: String(program[cur])))
        idx += 1
    case ",", ".":
        tokenList.append(IOToken(
            type: tokenMap[program[cur]]!,
            line: line,
            lexeme: String(program[cur])))
        idx += 1
    case "\n":
        idx += 1
        line += 1
    default:
        idx += 1
    }
}
```
And that's our first pass. We now have a list of BF tokens. However, we just filled in -1 for the destination of each JumpToken as a placeholder. Now we have to fix that in our second pass. Finding matching pairs of brackets is a classic CS problem. We scan through our list and every time we find an open bracket, we push the index onto a stack. When we find a closing bracket, we pop from the stack. If we try to pop from an empty stack, or if we are left with a non-empty stack at the end, we know there's a mismatched bracket somewhere and we can throw an error.
```swift
var stack: [Int] = []
for (idx, token) in tokenList.enumerated() {
    if token.type == .JumpIfZero { // Open bracket
        stack.append(idx)
    } else if token.type == .JumpIfNotZero {
        if let match = stack.popLast() {
            var opener = tokenList[match] as! JumpToken
            var closer = tokenList[idx] as! JumpToken
            opener.destination = idx
            closer.destination = match
            tokenList[match] = opener
            tokenList[idx] = closer
        } else {
            fatalError("Unmatched bracket at line \(token.line)")
        }
    }
}
if !stack.isEmpty {
    fatalError("Unmatched brackets left over from line \(tokenList[stack[0]].line)")
}
```
I'm doing a little dancing around the type system here. Because `tokenList` only knows that each element is some kind of BFToken, I'm explicitly casting to JumpToken so we can modify the destination value. I'm sure there is a more idiomatic way to do this, but I haven't found it yet. If we find unmatched brackets, we throw an error and let the user know which line we found them at.
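For what it's worth, one alternative (a sketch, not necessarily the idiomatic answer): if tokens were classes instead of structs, they'd have reference semantics, so casting with `as?` hands back a reference we can mutate in place, with no reassignment back into the array. The names `Tok` and `JumpTok` below are hypothetical, just for this standalone snippet:

```swift
protocol Tok { var line: Int { get } }

final class JumpTok: Tok {
    let line: Int
    var destination = -1
    init(line: Int) { self.line = line }
}

var toks: [Tok] = [JumpTok(line: 1), JumpTok(line: 2)]
if let j = toks[0] as? JumpTok {
    j.destination = 1  // mutates the array element in place; no write-back needed
}
```
The trade-off is that reference types give up value semantics and the easy copyability of structs, which is probably why I reached for structs in the first place.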
So that's two passes to convert a program string into a more useful intermediate representation. I wouldn't quite call it an abstract syntax tree, but we're doing something similar, if much simpler. Now we can interpret this list of tokens and save ourselves a lot of iterating over our program string. After we set up our instructionPointer, dataPointer, memory, and output buffer just the same as last time, we can write a new interpret loop with more of that sweet, sweet pattern matching:
```swift
while instructionPointer < tokenList.count {
    switch tokenList[instructionPointer] {
    case let token as MathToken:
        switch token.type {
        case .Add: memory[dataPointer] += token.size
        case .Subtract: memory[dataPointer] -= token.size
        case .Inc: dataPointer = (dataPointer + token.size) // % MAX_MEM
        case .Dec: dataPointer -= token.size
        default: break
        }
        instructionPointer += 1
    case let token as IOToken:
        if token.type == .Input {
            memory[dataPointer] = Int(getchar())
        } else if token.type == .Output {
            output.append(memory[dataPointer])
        }
        instructionPointer += 1
    case let token as JumpToken:
        if token.type == .JumpIfZero && memory[dataPointer] == 0 {
            instructionPointer = token.destination
        } else if token.type == .JumpIfNotZero && memory[dataPointer] != 0 {
            instructionPointer = token.destination
        } else {
            instructionPointer += 1
        }
    default:
        instructionPointer += 1 // unreachable, but avoids an infinite loop
    }
}
```
This time, the `case let token as MathToken` pattern not only matches a token of type MathToken, but does the cast for us, so we can treat `token` as a `MathToken` and access the size field without explicit casting. Much cleaner. Anyway, that's our v2.0 interpreter! You can find the code on my github here, where I've moved both passes and the interpret loop into a class (because I am infected with the OOP mind virus).
One final postscript. Up there, I used a string method `substring(from: Int, to: Int)` which you won't find in Swift. I added it as an extension to the String class because messing around with startIndex and the special string index type every time I wanted a substring was annoying, especially when I was doing Advent of Code. I find it pretty useful, though I suppose it would be inefficient at scale. It's also a good excuse to show off extensions in Swift:
```swift
extension String {
    func substring(from: Int, to: Int) -> String {
        let start = self.index(self.startIndex, offsetBy: from)
        let end = self.index(self.startIndex, offsetBy: to)
        return String(self[start..<end])
    }
}
```
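A quick standalone sanity check of the extension (repeating it here so the snippet compiles on its own):

```swift
extension String {
    func substring(from: Int, to: Int) -> String {
        let start = self.index(self.startIndex, offsetBy: from)
        let end = self.index(self.startIndex, offsetBy: to)
        return String(self[start..<end])
    }
}

// Half-open range: includes offset 7, excludes offset 16.
let s = "Hello, Brainfuck"
let sub = s.substring(from: 7, to: 16)
// sub is "Brainfuck"
```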
Where do we go from here? I haven't quite decided yet. Maybe we'll try to compile to assembly from our intermediate representation. Making a just-in-time compiler might be beyond my expertise for now, but compiling to asm or even LLVM IR might be fun. Stay tuned!