Instruction Encoding, aka Virtual Machines, Part 1

If you haven't read it yet and feel the need, do have a look at <2021-12-24-virtual-machines-0.txt> for a bit of background.

Instruction encoding is the mapping between abstract instructions ("add x and y") and the actual bytes that live in computer memory somewhere. There are a lot of different ways to do this, and even more ways for virtual machines, but it often comes down to how the instruction set itself is designed and what capabilities the machine has. In terms of architecture, there are roughly three schools of thought.

School 1: CISC ("Complex Instruction Set Computer")

This is the school of capable hardware, powerful instructions, and short programs (as expressed in machine code). CISC designs tend to implement a lot in hardware, have a lot of modes and configurable behavior, and have relatively short[1], powerful instructions. CISC architectures often have variable-length instructions, allowing frequently used instructions to be written compactly. The current-day poster child for this school is the x86 family, but historically there have been others, like the VAX architecture.

School 2: RISC ("Reduced Instruction Set Computer")

This is the school of relatively few, simple instructions. These architectures virtually always have fixed-length instructions and usually offer only a few ways to do a given operation. Often they require that instructions be aligned to a word-length boundary as well. The ARM instruction set is somewhat in this camp, but MIPS (not much used these days) is probably the exemplar.

School 3: Everyone Else

This school includes a bunch of different oddball stuff: VLIW (Very Long Instruction Word) architectures, stack-based architectures, architectures with non-byte-aligned instructions, the "object-oriented" iAPX 432, and so on and so forth. A lot of these are very interesting to talk about in their own right, but I'll gloss over them here because practically all modern architectures, especially those intended for general use, have coalesced into schools 1 or 2, and the rest are generally considered dead ends.

Regardless of which school an architecture falls into, the fundamental tension for encodings is generally between compactness, expressive power of a single instruction, and simplicity. Specifically, for a given set of architectural features, the following choice always appears: do you define a "more frequently used" subset of those features and give it shorter encodings, yielding more compact machine code at the price of a more complicated encoding, or do you leave your instruction set "orthogonal", with all instructions encoded the same way?

As an example, let's have a look at how the same instruction is encoded in two different architectures: MIPS and x86. We'll be looking at adding a constant value to a register.

In MIPS, that instruction is called ADDI, and its encoding is always this:

  001000 SSSSS TTTTT IIIIIIII_IIIIIIII

where 001000 is the ADDI opcode, SSSSS and TTTTT are 5-bit register specifiers, and the IIIIIIII_IIIIIIII field is a 16-bit immediate value[2]. The behavior of this operation is:

  registers[T] = registers[S] + I

where S and T may refer to the same register or to different ones. Encoding an add of any value to any specific register is a straightforward matter of sticking the register numbers into the S and T fields and the immediate into the I field, as the little sketch below spells out.
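Here's that sketch in C. The field layout is the ADDI layout described above; the function names are my own, and details like ADDI's sign extension of the immediate and its trap-on-overflow behavior are glossed over.

  /* Sketch only: pack and unpack the MIPS ADDI layout described above. */
  #include <stdint.h>
  #include <stdio.h>

  /* opcode 001000 (0x08) | rs (5 bits) | rt (5 bits) | 16-bit immediate */
  static uint32_t encode_addi(unsigned rs, unsigned rt, uint16_t imm)
  {
      return (0x08u << 26) | ((rs & 0x1Fu) << 21) | ((rt & 0x1Fu) << 16) | imm;
  }

  static void decode(uint32_t insn)
  {
      unsigned opcode = insn >> 26;
      unsigned rs     = (insn >> 21) & 0x1F;
      unsigned rt     = (insn >> 16) & 0x1F;
      int16_t  imm    = (int16_t)(insn & 0xFFFF);  /* low 16 bits, as signed */
      printf("opcode=%02x rs=%u rt=%u imm=%d\n", opcode, rs, rt, imm);
  }

  int main(void)
  {
      decode(encode_addi(4, 5, 0x100));  /* ADDI $5, $4, 0x100 */
      return 0;
  }

Decoding is nothing more than the same handful of shifts and masks run in reverse, which is the whole charm of fixed-width encodings.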
For x86 the situation is a bit murkier. We use the ADD instruction for it, which is also the instruction used to add registers to each other and to add values from/to memory. If we write:

  ADD eax, 0x100

that is encoded (in bytes) as:

  66 05 00 01 00 00

where the 66 is some x86 nonsense[3], 05 is specifically the opcode for "add a 32-bit immediate to eax", and then we have the 4-byte immediate value in little-endian order. However, if we instead wrote:

  ADD ebx, 0x100

we'd get:

  66 81 c3 00 01 00 00

where the 05 of "ADD eax, thing" has become 81 c3. In this case, 81 is the generic "add a 32-bit immediate to a 32-bit register" instruction, and the c3 byte serves to specify[4] that the register should be ebx. This is an example of x86 giving a more "common" instruction (adding to eax) a more direct encoding: while ebx is the "base" register, conceptually used to point to the beginnings of data structures, eax is the "accumulator" register, which the architecture intends to be used for counting and summing.

There are plenty of more extreme examples, but one that deserves special attention is the family of REP-prefixed string instructions that x86 has. One can write:

  REPNE SCASB

which means basically this:

  while (mem[edi] != low_8_bits(eax) && ecx != 0) {
      edi++;
      ecx--;
  }

or, essentially, the C library's memchr(3). This compound instruction has the two-byte encoding f2 ae, but if you wanted to do a similar search using any other combination of registers, you'd have to hand-write it as a loop.

As you might already be gleaning, it can be a major headache to decode x86 instructions correctly in all their glory, while doing so for MIPS or other very RISC-y architectures is a simple exercise in bit shifting and masking. To make matters worse, the sheer complexity of decoding CISC instructions often takes significant time and power on its own - so why did people do it?

The answer has to do with trade-offs between CPU and storage, especially memory. Modern computers have a very, very different ratio between CPU and memory than earlier computers did. For example, the venerable Commodore 64 (now almost 40 years old) had a 1MHz CPU but only 64KB of RAM and 20KB of ROM - and each byte of that RAM or ROM expended on instructions couldn't be used to store data. In that kind of environment it made total sense to trade away some speed and complexity in the CPU for better use of that scarce RAM. For comparison's sake, the machine I'm writing this on has a 3900 MHz CPU core, which does sound impressive, but is only about 4000 times faster[5] than the C64's core. However, it has 32GB of RAM, which is 500,000 times as much storage, and many terabytes of disk as well. To make matters even better, modern RAM is far faster to access than old RAM was, which reduces the penalty for needing to fetch more instruction bytes from it. As a result, the CPU vs. memory trade-off is totally different.

That brings us full circle back to instruction encodings for VMs. Both variable- and fixed-length encodings can work; the trade-off primarily comes down to ease of implementation (both of the VM and of related tooling) versus size of the VM's programs. Which one matters more very much depends on the problem domain, and on your own tolerance for implementing instruction decoding. For example, a complete MIPS decoder is relatively straightforward to write, but a complete x86 decoder could turn into quite a hobby. The sketch below shows the same trade-off at VM scale.
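It's a toy decode loop in C with one operation, "add an immediate to a register", given two encodings: a short form that implicitly targets register 0 (in the spirit of x86's 05 opcode) and a generic form that spends an extra byte naming a register (in the spirit of 81 c3). The opcode numbers and names are invented for illustration only; a fixed-width design would instead fetch a fixed number of bytes per instruction and pull the fields out with shifts and masks, at the cost of larger programs.

  /* Sketch only: a hypothetical variable-length bytecode, not a real ISA. */
  #include <stdint.h>
  #include <stdio.h>

  enum {
      OP_HALT     = 0x00,
      OP_ADDI_R0  = 0x01,  /* short form: opcode, imm16                  */
      OP_ADDI_REG = 0x02,  /* generic form: opcode, register byte, imm16 */
  };

  static uint32_t regs[8];

  static void run(const uint8_t *code)
  {
      size_t pc = 0;
      for (;;) {
          switch (code[pc++]) {
          case OP_ADDI_R0:                     /* 3 bytes total */
              regs[0] += code[pc] | (code[pc + 1] << 8);
              pc += 2;
              break;
          case OP_ADDI_REG:                    /* 4 bytes total */
              regs[code[pc] & 0x07] += code[pc + 1] | (code[pc + 2] << 8);
              pc += 3;
              break;
          case OP_HALT:
              return;
          }
      }
  }

  int main(void)
  {
      const uint8_t code[] = {
          OP_ADDI_R0,     0x00, 0x01,  /* add 0x100 to r0, the short way   */
          OP_ADDI_REG, 3, 0x00, 0x01,  /* add 0x100 to r3, the generic way */
          OP_HALT,
      };
      run(code);
      printf("r0=%u r3=%u\n", (unsigned)regs[0], (unsigned)regs[3]);
      return 0;
  }

The extra switch case, plus the bookkeeping about how long each form is, is the entire cost of the shorter encoding in this toy; in a real variable-length instruction set that cost compounds across prefixes, operand forms, and modes.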
Thanks for reading! Stay tuned for part 2: peripherals & IO for VMs :)

Bonus Section: Width Switching

It does seem like there might be a way to have our cake and eat it too, although again at the price of complexity. Instead of having one complex instruction decoder that can handle many different widths, what if we had several individually simple decoders and swapped between them depending on some piece of CPU state? ARM does exactly this, and x86 sort of does as well.

ARM is a RISC instruction set with fixed-width 32-bit instructions, but it has a special CPU mode called "Thumb" that can be enabled or disabled. In that mode, the CPU uses a completely separate instruction decoder with fixed-width 16-bit instructions instead. Storage-sensitive applications can compile their code as Thumb, switch into Thumb mode when they start, and potentially get considerable space savings, at the price of giving up the full 32-bit instruction set's power. There is also a separate thing, which ARM calls "Thumb-2", that extends Thumb with 32-bit instructions encoded as pairs of 16-bit halfwords; that provides a lot of the benefit without needing a width switch, but again at the price of decode complexity.

x86, too, has multiple modes, which imply different default operand and address sizes along with a bunch of other things. I won't talk about them in detail here, except to remark that footnote [3] begins to hint at the level of complexity involved. Also, as a fun extra, some ARM chips have a mode called "Jazelle", in which they can decode and execute Java VM instructions in hardware (!).

[1]: It is a relatively well-known factoid that legal x86 instructions can run as long as 15 bytes, depending on the features in use, but in general variable-length encodings are more compact.

[2]: The _ is a syntactic convention to make long bit fields easier to read by breaking them into 8-bit groups; it doesn't actually appear in the encoded instruction in any way.

[3]: Specifically, it is the "operand-size override prefix", which is present because I, like you, have a 64-bit machine but wanted to express a 32-bit add. If compiled for a 32-bit machine, this prefix would not be here. Many but not all instructions allow one of these prefixes, and there are several other sets of optional prefixes.

[4]: The c3 here is the notorious "ModR/M byte", which describes what the instruction's operands are and how to interpret them. For each instruction, only certain ModR/M combinations are legal. A series of tables in section 2 of Volume 2A of the IA-32/x86-64 architecture manual describes how this byte works.

[5]: Obviously not all clock cycles are equal, and I also have 16 of these 3900 MHz cores, plus pipelining, microcode, caches, yada yada yada. Please excuse this lie.