Why on earth would you want to do this? This is a typical question my peers ask me sometimes. My answer, more often than not, is just: “I like to reinvent the wheel…I’m masochistic like that”. But that is not the truth. Not the whole truth, at least.
Sure…I do like reinventing the wheel (and don’t even get me started on that whole masochism thing). Sitting at the desk for hours and hours trying to wrap your head around a topic you don’t understand…it is, in my opinion, the best possible way of getting to know a technology and truly understanding it.
Reinventing the wheel can teach you so much. It leads to, sometimes unplanned, innovation and keeps your grey matter working. This world needs people who reinvent the wheel. (Please don’t actually reinvent the wheel…that would be stupid, right?)
I’m not saying that any of my projects will lead to anything or are in any way better than existing solutions, but they allow me to understand a specific technology better and might allow people who are already familiar with said technology to get a new, fresh viewpoint on a specific topic.
I have previously made some attempts at writing my own instruction set and VM but I have always failed. Not because I lacked the theoretical knowledge or anything, but much rather because I lacked the deep understanding. I always got too experimental. This time around, I tried a different approach: start with what you know, go on from there.
But what is an instruction set?
An instruction set is a group of commands for a CPU in machine language. The term can refer to all possible instructions for a CPU or a subset of instructions to enhance its performance in certain situations.
An instruction set consists of opcodes. An opcode is a byte that signifies the operation to be executed. Many instruction sets are developed not to be executed by a physical CPU, but rather by a VM (or Virtual Machine).
And what is a VM?
A virtual machine is a program that simulates a CPU that implements an instruction set. A desktop CPU implements the x86 instruction set in hardware (the CPU is wired to implement the instruction set). A hypervisor like VirtualBox also implements the x86 instruction set but does this in software.
Some VM’s run instruction sets that are designed to not run on hardware, but just on software in a virtual machine. These VM’s are often called language runtimes. Some examples for this are the JVM or the CLR.
The JVM implements the Java Bytecode instruction set and is the core runtime for languages like Java, Scala or Clojure.
The CLR implements the IL instruction set and is the core runtime of the .NET framework.
And if these VMs are able to implement these instruction sets, hardware should be able to do so too, right? Surprisingly enough, Sun Microsystems (the company that used to develop the Java programming language) has attempted to do so. The project was called picoJava and apparently it even worked quite well!
Defining our own instruction set
So why would you actually define your own instruction set other than for learning purposes? The answer’s quite simple: speed. When writing interpreters, the first thing you do (some would even call it the naïve thing to do) is to run the commands from the abstract syntax tree. Comparing strings, creating dozens of objects and doing other nasty, performance-killing things in the process. While, in some cases, using tree-walk (that’s an interpreter that executes directly from the abstract syntax tree) is justified, in most cases, it isn’t.
The solution to this speed-problem is quite simple: before you run your interpreted code, you compile it to something called bytecode (an instruction set).
I have already defined an instruction set with opcodes. You can find the specification here.
So technically the human readable form isn’t the instruction set. It is what we call a mnemonic representation. My reference implementation (which is written in Go), for instance, doesn’t compile to bytecode but rather interprets this mnemonic representation line-by-line as I needed something to test the mnemonic representation of the instruction set on. A quick and dirty solution if you will. (Basically I just need it as a behavioral reference. So when I write the actual VM later on, I know how the bytecode should behave.)
But how would we actually go about implementing this? And how can we compile a language to our custom bytecode? This can be answered by a couple steps.
- Lex/Parse the source language
- Source language -> mnemonic representation
- Lex/Parse the mnemonic representation
- Mnemonic representation -> Byte representation
- Execute program byte-by-byte
The execution of the program is also fairly trivial. It is literally just an enormous switch statement on the opcode byte.
Since I now have a prototype of the instruction set, I will proceed with writing an actual VM that follows these steps. So there will definitely be a follow-up.
Oh and before I forget: I actually did finish My new home setup! But it was terribly unstable and I couldn’t be bothered to fix something in my config every other day. So I just stuck with Windows and WSL. I run all the docker stuff in VMware (I have my VMs running in bridged mode).