## Merkle Trees

A Merkle tree is a construct for hashing together many different pieces of
data into a single hash.

### Syntax

	H(m): returns a hash of a message m
	a || b: concatenation of a and b

### Wait, Why?

Obviously, if one wishes to hash a single piece of data m_0, one can compute:

	h = H(m_0)

and if one wishes to hash four pieces of data m_0 ... m_3 one can compute:

	h = H(m_0 || m_1 || m_2 || m_3)

However:
* Most computers these days have multiple cores
* H is almost definitionally not parallelizable

so instead we could do this:

	h = H(H(m_0) || H(m_1) || H(m_2) || H(m_3))

allowing us to parallelize the four inner hashes, concatenate the results, and
then hash that. If m_0 through m_3 are large, this parallelizes almost all of
the work!

### The Tree Structure

Suppose that we wanted to verify that the involved data hadn't been corrupted
in transit. We could re-compute:

	H(H(m_0) || H(m_1) || H(m_2) || H(m_3))

and check that against h, but suppose we only wanted to validate m_2 - we'd
be wasting a lot of work. However, if we instead stored:

	H(m_0) || H(m_1) || H(m_2) || H(m_3)

then it would be possible to compute H(m_2) and check it against the stored
value of that chunk.

However, if there's a lot of these data blocks m_0 ... m_n, then we'd need to
store:

	H(m_0) || H(m_1) || ... || H(m_n)

and so we might instead want to store all of those hashes *and*

	h = H(H(m_0) || H(m_1) || ...|| H(m_n))

so that we can check that the H(m_i) values are okay, and still have a single
hash that validates the entire message.

This now forms a two-level tree of hashes - the leaves are the hashes of
individual data blocks, then the root is the hash of all its children put
together.

### More Formally

Let the message be a sequence of n blocks, m_0 ... m_n, each of some fixed
length which is a multiple of H's output size. Call that fixed length the
tree's "block size" bs, and let the tree's "branching factor" bf be bs divided
by the output size of H. So for example, if:

	H = SHA-256,
	bs = 4096,
	bf = 4096 / 256 = 16

Now to merkle(m) a message m, if length(m) <= bs:

	H(m)

if length(m) <= bs * bf, then break m into bf chunks m_0, m_1, ..., m_bf and:

	H(H(m_0) || H(m_1) || ... || H(m_n))

if length(m) <= bs * bf * bf, then break m into bf chunks m_0, m_1, ..., m_bf.
Since each of those chunks has length(m_i) <= bs * bf, recurse on merkle(m).
Recurse as needed, so we eventually arrive at this recursive formulation of
merkle(m):

	if length(m) <= bs:
		return H(m)
	let { m_0, m_1, ..., m_bf } = split(m)
	let mh_i = merkle(m_i) for i in { 0 .. bf }
	return H(mh_0 || ... || mh_bf })

Note that there is some omitted care needed here when m is not a multiple of the
block size - in that case the input needs to be padded to the block size first.
The example code given below doesn't do that.

### A C Implementation

Here is a toy demonstration of this algorithm, in C. It uses a fake hash
function and a deliberately low block size to cause the tree to be deeper,
making it easier to see what's going on.

To run this example, do:

$ sed -e '1,/^>>>/d' -e '/^<<</,$d' < THISFILE | cc -x c -o merkle -
$ ./merkle

>>>
/* merkle.c */

#include <stdint.h>
#include <stdio.h>

typedef uint32_t hashval;

#define HASHSIZE sizeof(hashval)
#define BLOCKSIZE (HASHSIZE * 8)
#define BRANCHFACTOR (BLOCKSIZE / HASHSIZE)

typedef uint32_t block[BRANCHFACTOR];

uint32_t hash(uint8_t *m, size_t len) {
	uint32_t h = 0;
	for (size_t i = 0; i < len; i++)
		h += m[i];
	return h;
}

uint32_t merkle(uint8_t *m, size_t len) {
	if (len <= BLOCKSIZE)
		return hash(m, len);

	uint32_t mh[BRANCHFACTOR];
	size_t stride = len / BRANCHFACTOR;
	for (size_t i = 0; i < BRANCHFACTOR; i++)
		mh[i] = merkle(m + (stride * i), stride);

	return hash((uint8_t *)mh, sizeof(mh));
}

int main() {
	uint8_t data[4096];
	for (size_t i = 0; i < sizeof(data); i++)
		data[i] = i;
	printf("%08x\n", merkle(data, sizeof(data)));
	return 0;
}
<<<

### Paths To The Root

One interesting property of Merkle trees is that we can check any given
individual data block, as long as we know all the hash values on the path up
the tree from that data block to the root, without needing to validate (or
even posess) the entire tree.

For example, let's suppose that we want to validate a data block in a tree of
depth 2. The root hash of the tree will be:

	h = H(b_0 || b_1 || ... || b_bf)

where

	b_i = H(m_{i * bf + 0} || m_{i * bf + 1} || ...)

so if we know all of the values b_0, ..., b_bf, and we know all of the values
m_{i * bf + 0} ... m_{i * bf + bf - 1}, we can check that:

1. The data block itself is correct, since it is covered by b_i, and
2. b_i is correct, since it is hashed into h

without needing any of the other blocks of m.

In the general case, we can produce a "path to the root" from any given leaf
of the tree by producing the requisite other inputs to the hashes at each level.
This technique is the basis of block chains as a data structure.

$#t Merkle Trees
$#s A data structure (a tree of hashes) to allow for efficient validation of
$#s parts of a huge data set.
$#o crypto, computer-science
$#u 2e8cc582-67ec-4aee-aac5-8e716078f947