Instruction Encoding, aka Virtual Machines, Part 1

If you haven't read it yet and feel the need, do have a look at <2021-12-24-virtual-machines-0.txt> for a bit of background.

Instruction encoding is the mapping between abstract instructions ("add x and y") and the actual bytes that live in computer memory somewhere. There are a lot of different ways to do this, and even more ways for virtual machines, but it often comes down to how the instruction set itself is designed and what capabilities the machine has. In terms of architecture, there are roughly three schools of thought.

School 1: CISC ("Complex Instruction Set Computer")

This is the school of capable hardware, powerful instructions, and short programs (as expressed in machine code). CISC designs tend to implement a lot in hardware, have a lot of modes and configurable behavior, and have relatively short[1], powerful instructions. CISC architectures often have variable-length instructions, allowing frequently used instructions to be written compactly. The current-day poster child for this school is the x86 family, but historically there have been others, like the VAX architecture.

School 2: RISC ("Reduced Instruction Set Computer")

This is the school of relatively few, simple instructions. These architectures virtually always have fixed-length instructions and usually offer only a few ways to do a given operation. Often they require that instructions be aligned to a word-length boundary as well. The ARM instruction set is somewhat in this camp, but MIPS (not much used these days) is probably the exemplar.

School 3: Everyone Else

This school includes a bunch of different oddball stuff: VLIW (Very Long Instruction Word) architectures, stack-based architectures, architectures with non-byte-aligned instructions, the "object-oriented" iAPX 432, and so on and so forth. A lot of these are very interesting to talk about in their own right, but I'll gloss over them here because practically all modern architectures, especially those intended for general use, have coalesced into schools 1 or 2, and the rest are generally considered dead ends.

Regardless of which school an architecture falls into, the fundamental tension for encodings is generally between compactness, expressive power of a single instruction, and simplicity. Specifically, for a given set of architectural features, the following choice always appears: do you define a "more frequently used" subset of those features and give it shorter encodings, yielding more compact machine code at the price of a more complicated encoding, or do you leave your instruction set "orthogonal", with all instructions encoded the same way?

As an example, let's have a look at how the same instruction is encoded in two different architectures: MIPS and x86. We'll be looking at adding a constant value to a register.

In MIPS, that instruction is called ADDI, and its encoding is always this:

  001000 SSSSS TTTTT IIIIIIII_IIIIIIII

where 001000 is the ADDI opcode, SSSSS and TTTTT are 5-bit register specifiers, and the IIIIIIII_IIIIIIII field is a 16-bit immediate value[2]. The behavior of this operation is:

  registers[T] = registers[S] + I

where S and T may refer to the same register or to different ones. Encoding an add of any value to any specific register is a straightforward matter of sticking the register numbers into the S and T fields and the immediate into the I field, as the little sketch below spells out.
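Here's that sketch in C. The field layout is the ADDI layout described above; the function names are my own, and details like ADDI's sign extension of the immediate and its trap-on-overflow behavior are glossed over.

  /* Sketch only: pack and unpack the MIPS ADDI layout described above. */
  #include <stdint.h>
  #include <stdio.h>

  /* opcode 001000 (0x08) | rs (5 bits) | rt (5 bits) | 16-bit immediate */
  static uint32_t encode_addi(unsigned rs, unsigned rt, uint16_t imm)
  {
      return (0x08u << 26) | ((rs & 0x1Fu) << 21) | ((rt & 0x1Fu) << 16) | imm;
  }

  static void decode(uint32_t insn)
  {
      unsigned opcode = insn >> 26;
      unsigned rs     = (insn >> 21) & 0x1F;
      unsigned rt     = (insn >> 16) & 0x1F;
      int16_t  imm    = (int16_t)(insn & 0xFFFF);  /* low 16 bits, as signed */
      printf("opcode=%02x rs=%u rt=%u imm=%d\n", opcode, rs, rt, imm);
  }

  int main(void)
  {
      decode(encode_addi(4, 5, 0x100));  /* ADDI $5, $4, 0x100 */
      return 0;
  }

Decoding is nothing more than the same handful of shifts and masks run in reverse, which is the whole charm of fixed-width encodings.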
For x86 the situation is a bit murkier. We use the ADD instruction for it, which is also the instruction used to add registers to each other and to add values from/to memory. If we write:

  ADD eax, 0x100

that is encoded (in bytes) as:

  66 05 00 01 00 00

where the 66 is some x86 nonsense[3], 05 is specifically the opcode for "add a 32-bit immediate to eax", and then we have the 4-byte immediate value in little-endian order. However, if we instead wrote:

  ADD ebx, 0x100

we'd get:

  66 81 c3 00 01 00 00

where the 05 of "ADD eax, thing" has become 81 c3. In this case, 81 is the generic "add a 32-bit immediate to a 32-bit register" instruction, and the c3 byte serves to specify[4] that the register should be ebx. This is an example of x86 giving a more "common" instruction (adding to eax) a more direct encoding: while ebx is the "base" register, conceptually used to point to the beginnings of data structures, eax is the "accumulator" register, which the architecture intends to be used for counting and summing.

There are plenty of more extreme examples, but one that deserves special attention is the family of REP-prefixed string instructions that x86 has. One can write:

  REPNE SCASB

which means basically this:

  while (mem[edi] != low_8_bits(eax) && ecx != 0) {
      edi++;
      ecx--;
  }

or, essentially, the C library's memchr(3). This compound instruction has the two-byte encoding f2 ae, but if you wanted to do a similar search using any other combination of registers, you'd have to hand-write it as a loop.

As you might already be gleaning, it can be a major headache to decode x86 instructions correctly in all their glory, while doing so for MIPS or other very RISC-y architectures is a simple exercise in bit shifting and masking. To make matters worse, the sheer complexity of decoding CISC instructions often takes significant time and power on its own - so why did people do it?

The answer has to do with trade-offs between CPU and storage, especially memory. Modern computers have a very, very different ratio between CPU and memory than earlier computers did. For example, the venerable Commodore 64 (now almost 40 years old) had a 1MHz CPU but only 64KB of RAM and 20KB of ROM - and each byte of that RAM or ROM expended on instructions couldn't be used to store data. In that kind of environment it made total sense to trade away some speed and complexity in the CPU for better use of that scarce RAM. For comparison's sake, the machine I'm writing this on has a 3900 MHz CPU core, which does sound impressive, but is only about 4000 times faster[5] than the C64's core. However, it has 32GB of RAM, which is 500,000 times as much storage, and many terabytes of disk as well. To make matters even better, modern RAM is far faster to access than old RAM was, which reduces the penalty for needing to fetch more instruction bytes from it. As a result, the CPU vs. memory trade-off is totally different.

That brings us full circle back to instruction encodings for VMs. Both variable- and fixed-length encodings can work; the trade-off primarily comes down to ease of implementation (both of the VM and of related tooling) versus size of the VM's programs. Which one matters more very much depends on the problem domain, and on your own tolerance for implementing instruction decoding. For example, a complete MIPS decoder is relatively straightforward to write, but a complete x86 decoder could turn into quite a hobby. The sketch below shows the same trade-off at VM scale.
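It's a toy decode loop in C with one operation, "add an immediate to a register", given two encodings: a short form that implicitly targets register 0 (in the spirit of x86's 05 opcode) and a generic form that spends an extra byte naming a register (in the spirit of 81 c3). The opcode numbers and names are invented for illustration only; a fixed-width design would instead fetch a fixed number of bytes per instruction and pull the fields out with shifts and masks, at the cost of larger programs.

  /* Sketch only: a hypothetical variable-length bytecode, not a real ISA. */
  #include <stdint.h>
  #include <stdio.h>

  enum {
      OP_HALT     = 0x00,
      OP_ADDI_R0  = 0x01,  /* short form: opcode, imm16                  */
      OP_ADDI_REG = 0x02,  /* generic form: opcode, register byte, imm16 */
  };

  static uint32_t regs[8];

  static void run(const uint8_t *code)
  {
      size_t pc = 0;
      for (;;) {
          switch (code[pc++]) {
          case OP_ADDI_R0:                     /* 3 bytes total */
              regs[0] += code[pc] | (code[pc + 1] << 8);
              pc += 2;
              break;
          case OP_ADDI_REG:                    /* 4 bytes total */
              regs[code[pc] & 0x07] += code[pc + 1] | (code[pc + 2] << 8);
              pc += 3;
              break;
          case OP_HALT:
              return;
          }
      }
  }

  int main(void)
  {
      const uint8_t code[] = {
          OP_ADDI_R0,     0x00, 0x01,  /* add 0x100 to r0, the short way   */
          OP_ADDI_REG, 3, 0x00, 0x01,  /* add 0x100 to r3, the generic way */
          OP_HALT,
      };
      run(code);
      printf("r0=%u r3=%u\n", (unsigned)regs[0], (unsigned)regs[3]);
      return 0;
  }

The extra switch case, plus the bookkeeping about how long each form is, is the entire cost of the shorter encoding in this toy; in a real variable-length instruction set that cost compounds across prefixes, operand forms, and modes.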
Thanks for reading! Stay tuned for part 2: peripherals & IO for VMs :)

Bonus Section: Width Switching

It does seem like there might be a way to have our cake and eat it too, although again at the price of complexity. Instead of having one complex instruction decoder that can handle many different widths, what if we had several individually simple decoders and swapped between them depending on some piece of CPU state? ARM does exactly this, and x86 sort of does as well.

ARM is a RISC instruction set with fixed-width 32-bit instructions, but it has a special CPU mode called "Thumb" that can be enabled or disabled. In that mode, the CPU uses a completely separate instruction decoder with fixed-width 16-bit instructions instead. Storage-sensitive applications can compile their code as Thumb, switch into Thumb mode when they start, and potentially get considerable space savings, at the price of giving up the full 32-bit instruction set's power. There is also a separate thing, which ARM calls "Thumb-2", that extends Thumb with 32-bit instructions encoded as pairs of 16-bit halfwords; that provides a lot of the benefit without needing a width switch, but again at the price of decode complexity.

x86, too, has multiple modes, which imply different default operand and address sizes along with a bunch of other things. I won't talk about them in detail here, except to remark that footnote [3] begins to hint at the level of complexity involved. Also, as a fun extra, some ARM chips have a mode called "Jazelle", in which they can decode and execute Java VM instructions in hardware (!).

[1]: It is a relatively well-known factoid that legal x86 instructions can run as long as 15 bytes, depending on the features in use, but in general variable-length encodings are more compact.

[2]: The _ is a syntactic convention to make long bit fields easier to read by breaking them into 8-bit groups; it doesn't actually appear in the encoded instruction in any way.

[3]: Specifically, it is the "operand-size override prefix", which is present because I, like you, have a 64-bit machine but wanted to express a 32-bit add. If compiled for a 32-bit machine, this prefix would not be here. Many but not all instructions allow one of these prefixes, and there are several other sets of optional prefixes.

[4]: The c3 here is the notorious "ModR/M byte", which describes what the instruction's operands are and how to interpret them. For each instruction, only certain ModR/M combinations are legal. A series of tables in section 2 of Volume 2A of the IA-32/x86-64 architecture manual describes how this byte works.

[5]: Obviously not all clock cycles are equal, and I also have 16 of these 3900 MHz cores, plus pipelining, microcode, caches, yada yada yada. Please excuse this lie.