# Binary Transparency

Suppose that you are a Linux distribution, and you are in the business of shipping binaries to users, as most Linux distributions are[1]. Since users will generally install the new binaries you ship in a pretty timely way, that also provides a very powerful mechanism for attacks on your users, either by you or by someone else who compromises your build system. It would be very cool if you could somehow reduce that risk.

Fortunately, there's a collection of techniques, sometimes collectively called "binary transparency", that lets you manage that risk on behalf of your users and lets your users do the "verify" part of "trust-but-verify". Here's how it works!

## What Actually Is Binary Transparency?

Binary transparency provides end users with an audit trail of how a specific artifact was made. For programs, the artifact is usually the actual program binary, and the audit trail in question describes which source code that program was built from. You can use the same sort of techniques for other kinds of artifacts though - for example, certificate authorities use some of the same mechanisms to provide an audit trail for certificates they sign[2]. Let's focus on the program case for now.

You want users to be assured that the binaries they're getting from you were built from specific source code. How? There are several parts to it:

1. Being able to identify a specific set of source files (a revision)
2. Making the translation from a source revision to a binary reproducible
3. Proving that every binary comes from a specific source revision
4. Verifying on the user side that every binary has a correct audit log

## Identifying Revisions

There are a few different ways to do this, but the gold standard is a hash, taken over *all* the inputs to the build process - the source files of your program, all the included source files (system headers and so on), all the resources, all the build scripts, and so on. Listing all of those out can turn into quite a hobby, and making sure the list stays complete can be even more of one, so if you can it's best to build inside an empty chroot or similar so that your build process doesn't pick up files that aren't checked in, new system headers that you didn't account for, and that sort of thing[3].

If you already use a version control system where revisions are hashes, great - the version control system will validate hashes for source files and tie them together into revisions for you. If you're very lucky you can check everything, including the build scripts *and* your dependencies[4], into one repository and be able to identify all of the build inputs with a single version control revision. If you have multiple repositories or your version control system doesn't do hashing for you, you'll probably need to do a bit more work; one relatively cheap way is to always do builds from tarballs, in an empty chroot, and use the tarball hash to identify the source revision[5].
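To make the idea concrete, here's a minimal sketch of that kind of input hashing, assuming all of the build inputs have already been checked out under a single directory (`./src` is a placeholder). It walks the tree in sorted order and folds every file's path and contents into one SHA-256 digest; real systems (git, Nix, and friends) use richer object formats, and this sketch deliberately ignores file modes and symlinks.

```go
// hashtree: identify a source revision by hashing every build input.
// A minimal sketch, not how any particular version control system does it.
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"sort"
)

// hashTree walks root, folding each regular file's relative path and
// contents into one SHA-256 digest. Sorting the file list keeps the
// result deterministic regardless of filesystem ordering.
func hashTree(root string) (string, error) {
	var files []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.Type().IsRegular() {
			files = append(files, path)
		}
		return nil
	})
	if err != nil {
		return "", err
	}
	sort.Strings(files)

	h := sha256.New()
	for _, path := range files {
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return "", err
		}
		fmt.Fprintf(h, "%s\x00", rel) // include the path, so renames change the hash
		f, err := os.Open(path)
		if err != nil {
			return "", err
		}
		_, err = io.Copy(h, f)
		f.Close()
		if err != nil {
			return "", err
		}
		fmt.Fprintf(h, "\x00")
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
	// "./src" is a placeholder for wherever the build inputs are checked out.
	rev, err := hashTree("./src")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("source revision:", rev)
}
```

Any change to any input - a source file, a build script, a vendored dependency - changes the resulting revision hash.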
However, the source files aren't the only components of the build. There are also the build tools...

## Reproducible Builds

Unless you are particularly ambitious, you probably won't want to include compiler binaries, copies of the system headers, and so on in your version control system[6]. You'll probably have to settle for depending on binary versions of those that are installed on the system or provided some other way.

However, a property you do want is that the build is *reproducible*, which means that for a given source revision and set of build tools, the output binary is always bit-for-bit identical. There are many good reasons to do this, but the most important one for binary transparency purposes is that reproducibility allows others to actually validate your claim that something was built from a specific revision: they can check your source out at that revision, install the same build tools you used, and get the same resulting binary.

Let's take a specific example: say you built your program from revision X with clang 15.0.6 and a certain set of build flags. Anyone else[7] should be able to check out revision X, build with clang 15.0.6 and that same set of build flags, and get the exact same binary, bit-for-bit. Ideally you would have your continuous integration system validate this, by doing two separate builds and comparing the results to each other.

To make builds reproducible, you will also want to make them hermetic: they depend *only* on a specified list of dependencies and on nothing else. You can enforce that using chroot(1) or a sandboxing tool, but in the process you will also need to chase down all the dependencies of your compiler toolchain and so on, so that you can include them in that dependency list.
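As a toy version of that CI check, the sketch below runs the same build twice and fails if the output differs. The `make` invocation and the `program` artifact name are placeholders for whatever your build system actually runs and produces, and a real setup would run each build inside a fresh chroot or container rather than just a temporary directory.

```go
// reprocheck: a toy CI step that builds the same source twice and fails
// if the resulting artifact is not bit-for-bit identical.
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// buildOnce assumes srcDir already holds a pristine checkout of the source
// revision, builds it with output going to a scratch directory, and returns
// the SHA-256 of the produced artifact.
func buildOnce(srcDir string) ([]byte, error) {
	scratch, err := os.MkdirTemp("", "repro-build-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(scratch)

	cmd := exec.Command("make", "-C", srcDir, "OUT="+scratch) // placeholder build command
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		return nil, err
	}

	artifact, err := os.ReadFile(filepath.Join(scratch, "program")) // placeholder artifact
	if err != nil {
		return nil, err
	}
	sum := sha256.Sum256(artifact)
	return sum[:], nil
}

func main() {
	a, err := buildOnce("./src")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	b, err := buildOnce("./src")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if !bytes.Equal(a, b) {
		fmt.Printf("NOT reproducible: %x != %x\n", a, b)
		os.Exit(1)
	}
	fmt.Printf("reproducible; binary revision %x\n", a)
}
```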
## Audit Logs

So, you have a cryptographic way to identify a set of source files, and a reproducible translation from source files to binaries, meaning that any specific user can check that any specific binary you gave them is genuinely built from some source revision. However, you can't have every user do a full source build - that would defeat the entire purpose of shipping the binaries in the first place. What you want is some way for users to be confident that the binary you're giving them really did come from a source revision, without them having to actually do the build. How?

The first key tool is a public, append-only cryptographic log, signed by you, of every binary revision that you ship. When clients are going to fetch a new binary from you, they also fetch this log, and check that the revision you're offering them is present in that log.

So far so good, but that's not enough - you (or anyone with your signing keys) can still do this:

1. Generate a malicious binary in any desired way
2. Take the existing append-only (public) cryptographic log, append the malicious binary's hash to it, and then send that modified log, *with* the malicious binary, to only a specific target user

Since your builds are reproducible, if that user happens to actually check, they'll find that the binary they have doesn't correspond to any source revision - but users will in general never bother to check that unless they're very, very paranoid[8]. Since (from their perspective) the append-only log does include the binary's hash, everything seems above board.

Luckily, we can do better. What we really want is to ensure that there is only one single public audit log, and that individual users can't be given a modified version of it. We can do that by using "witnesses", which are third parties[9] that attest that they've seen specific revisions in the log, and perhaps even that they've reproduced a build from source at that revision. When a client is about to use a new binary, it checks not only that the binary's hash appears in the audit log, but that there are witness attestations of that hash as well.

If you are feeling extra spicy, you then include some code in your updater (the code that fetches binaries and checks that they are in the public audit log and properly witnessed) which complains loudly to both you and the witnesses if it ever sees either a binary that isn't in the log, or a log entry that hasn't been properly witnessed. That basically prevents anyone - even you! - from quietly shipping a malicious binary to only certain users.

## Putting It All Together

Your build process then looks like this:

1. Create a fresh environment (chroot, docker container, VM, whatever)
2. Check out a source revision and its dependencies into it
3. Compile and produce artifacts
4. Hash all those artifacts together to make a "binary revision"
5. Make an entry in your append-only log: "binary revision X came from source revision Y". Since your build is reproducible, anyone else with source revision Y can validate this claim.
6. Attach your signature to the new head of your append-only log, and submit it to the witnesses for witnessing.
7. Once enough witnesses (a quorum) have witnessed it, take their signatures and attach them to the log head as well.
8. Publish the new log head.

Now, clients fetching updates do this:

1. Fetch the append-only log, check your signature on the head, and check the witness signatures on the head.
2. Figure out which version they want to install, validate that it appears in the log, and install it.

Note that clients might not always be installing the head version in the log - for example, if you ship a beta version of your software, non-beta clients might be installing the most recent stable version instead. However, every version you ever publish has to appear *somewhere* in the log - just not necessarily at the head.
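Here's a minimal sketch of that client-side check. The `Entry`/`Log` layout, the plain hash chain, and the quorum parameter are all hypothetical simplifications: a production log would be a Merkle tree with inclusion and consistency proofs (along the lines of RFC 9162) so that clients don't have to download every entry, and the keys would be pinned in the updater rather than generated on the fly.

```go
// verifyupdate: toy client-side check - publisher signature on the log
// head, a quorum of witness signatures on the same head, and inclusion of
// the binary about to be installed.
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"fmt"
)

// Entry records the claim "binary revision BinaryHash was built from
// source revision SourceRev".
type Entry struct {
	BinaryHash string
	SourceRev  string
}

// Log is the published append-only log plus the signatures over its head.
type Log struct {
	Entries     []Entry
	HeadSig     []byte            // publisher's signature over the head hash
	WitnessSigs map[string][]byte // witness name -> signature over the head hash
}

// headHash chains every entry into a single head digest.
func headHash(entries []Entry) []byte {
	head := make([]byte, sha256.Size)
	for _, e := range entries {
		next := sha256.Sum256(append(head, []byte(e.BinaryHash+"\x00"+e.SourceRev)...))
		head = next[:]
	}
	return head
}

// verify returns nil only if the head is properly signed and witnessed and
// binaryHash appears somewhere in the log.
func verify(tlog Log, publisher ed25519.PublicKey, witnesses map[string]ed25519.PublicKey,
	quorum int, binaryHash string) error {

	head := headHash(tlog.Entries)
	if !ed25519.Verify(publisher, head, tlog.HeadSig) {
		return fmt.Errorf("publisher signature on log head is invalid")
	}
	good := 0
	for name, pub := range witnesses {
		if sig, ok := tlog.WitnessSigs[name]; ok && ed25519.Verify(pub, head, sig) {
			good++
		}
	}
	if good < quorum {
		return fmt.Errorf("only %d of the required %d witness signatures verified", good, quorum)
	}
	for _, e := range tlog.Entries {
		if e.BinaryHash == binaryHash {
			return nil // logged and witnessed: safe to install
		}
	}
	return fmt.Errorf("binary %s does not appear in the transparency log", binaryHash)
}

func main() {
	// Toy keys and a one-entry log, standing in for real published data.
	pubKey, pubPriv, _ := ed25519.GenerateKey(nil)
	witKey, witPriv, _ := ed25519.GenerateKey(nil)

	tlog := Log{Entries: []Entry{{BinaryHash: "abc123", SourceRev: "rev42"}}}
	head := headHash(tlog.Entries)
	tlog.HeadSig = ed25519.Sign(pubPriv, head)
	tlog.WitnessSigs = map[string][]byte{"witness-1": ed25519.Sign(witPriv, head)}

	trusted := map[string]ed25519.PublicKey{"witness-1": witKey}
	fmt.Println("verify:", verify(tlog, pubKey, trusted, 1, "abc123")) // prints "verify: <nil>"
}
```

The load-bearing detail is that you and the witnesses all sign the *same* head hash, so a log that has been quietly extended for one target user can't carry valid witness signatures unless the witnesses are in on it - which is exactly the collusion case discussed below.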
## What Does This Actually Give You?

Compared to simply signing your binary updates, this technique doesn't actually *prevent* any new attacks - instead, it makes them impossible to execute without leaving a cryptographically-verifiable audit trail. For example:

* Someone who compromises your build system can't ship a new version of your product to anyone without leaving a publicly verifiable record of having done so
* If your witnesses also validate your source revision -> binary revision assertion, someone who compromises your build system can't make a binary change without a corresponding publicly logged source change, which makes it much easier to detect compromised developer credentials
* You can't ship backdoored or malicious binaries to specific users - only to everyone or to no-one. This vastly increases the chance that you will get caught in the process, especially if your malicious binary change requires a malicious source change.

## What Doesn't This Protect Against?

* If you (or someone else) patch your updater to remove the validation logic, you can do an end-run around this protection. To avoid that, your update mechanism itself also has to be protected by binary transparency.
* If a quorum of witnesses collude with you, you could get them to "witness" an entry in the append-only log that you don't show to anyone else, and use that to compromise a specific client. However, that client would end up with cryptographic proof of their collusion with you, if they happened to notice.

## Further Reading

* Mozilla's writeup: https://wiki.mozilla.org/Security/Binary_Transparency
* Google's writeup: https://developers.google.com/android/binary_transparency/pixel
* RFC 9162 (Certificate Transparency v2)

[1]: There are some distros that are fully "from source" and need a host distro to bootstrap a compiled environment.
[2]: They do this because they were required to by web browsers, not because they want to.
[3]: Chasing down changes in system headers is an excellent reason to keep your dependencies as minimal as you can.
[4]: This approach, called "vendoring", is popular in large projects mostly because making sure that dev machines have the right version of every dependency is a big headache, but it does also let you include dependency revisions in your source revision very easily.
[5]: However, the tarball needs to include any needed system headers, and you need to generate the tarball in a deterministic way to avoid problems, which means hardcoded mtimes, file modes, owners/groups, and so on.
[6]: Chromium does this, and in fact has a separate binary revision control system called 'cipd' for fetching compiler toolchains and SDKs by hash: https://chromium.googlesource.com/infra/luci/luci-go/+/master/cipd/
[7]: If you are working on closed-source software for some reason, "anyone else" might be "your coworkers" or "your company's internal security department" instead of "random people on the internet", but the principle is the same.
[8]: If they are very, very paranoid, they probably are building from source themselves anyway.
[9]: Specifically trustworthy, notable third parties, like the EFF or the IETF or someone, instead of Honest Bob's Used Cars and Binary Audit Log Witnessing.