## Merkle Trees A Merkle tree is a construct for hashing together many different pieces of data into a single hash. ### Syntax H(m): returns a hash of a message m a || b: concatenation of a and b ### Wait, Why? Obviously, if one wishes to hash a single piece of data m_0, one can compute: h = H(m_0) and if one wishes to hash four pieces of data m_0 ... m_3 one can compute: h = H(m_0 || m_1 || m_2 || m_3) However: * Most computers these days have multiple cores * H is almost definitionally not parallelizable so instead we could do this: h = H(H(m_0) || H(m_1) || H(m_2) || H(m_3)) allowing us to parallelize the four inner hashes, concatenate the results, and then hash that. If m_0 through m_3 are large, this parallelizes almost all of the work! ### The Tree Structure Suppose that we wanted to verify that the involved data hadn't been corrupted in transit. We could re-compute: H(H(m_0) || H(m_1) || H(m_2) || H(m_3)) and check that against h, but suppose we only wanted to validate m_2 - we'd be wasting a lot of work. However, if we instead stored: H(m_0) || H(m_1) || H(m_2) || H(m_3) then it would be possible to compute H(m_2) and check it against the stored value of that chunk. However, if there's a lot of these data blocks m_0 ... m_n, then we'd need to store: H(m_0) || H(m_1) || ... || H(m_n) and so we might instead want to store all of those hashes *and* h = H(H(m_0) || H(m_1) || ...|| H(m_n)) so that we can check that the H(m_i) values are okay, and still have a single hash that validates the entire message. This now forms a two-level tree of hashes - the leaves are the hashes of individual data blocks, then the root is the hash of all its children put together. ### More Formally Let the message be a sequence of n blocks, m_0 ... m_n, each of some fixed length which is a multiple of H's output size. Call that fixed length the tree's "block size" bs, and let the tree's "branching factor" bf be bs divided by the output size of H. So for example, if: H = SHA-256, bs = 4096, bf = 4096 / 256 = 16 Now to merkle(m) a message m, if length(m) <= bs: H(m) if length(m) <= bs * bf, then break m into bf chunks m_0, m_1, ..., m_bf and: H(H(m_0) || H(m_1) || ... || H(m_n)) if length(m) <= bs * bf * bf, then break m into bf chunks m_0, m_1, ..., m_bf. Since each of those chunks has length(m_i) <= bs * bf, recurse on merkle(m). Recurse as needed, so we eventually arrive at this recursive formulation of merkle(m): if length(m) <= bs: return H(m) let { m_0, m_1, ..., m_bf } = split(m) let mh_i = merkle(m_i) for i in { 0 .. bf } return H(mh_0 || ... || mh_bf }) Note that there is some omitted care needed here when m is not a multiple of the block size - in that case the input needs to be padded to the block size first. The example code given below doesn't do that. ### A C Implementation Here is a toy demonstration of this algorithm, in C. It uses a fake hash function and a deliberately low block size to cause the tree to be deeper, making it easier to see what's going on. To run this example, do: $ sed -e '1,/^>>>/d' -e '/^<<>> /* merkle.c */ #include #include typedef uint32_t hashval; #define HASHSIZE sizeof(hashval) #define BLOCKSIZE (HASHSIZE * 8) #define BRANCHFACTOR (BLOCKSIZE / HASHSIZE) typedef uint32_t block[BRANCHFACTOR]; uint32_t hash(uint8_t *m, size_t len) { uint32_t h = 0; for (size_t i = 0; i < len; i++) h += m[i]; return h; } uint32_t merkle(uint8_t *m, size_t len) { if (len <= BLOCKSIZE) return hash(m, len); uint32_t mh[BRANCHFACTOR]; size_t stride = len / BRANCHFACTOR; for (size_t i = 0; i < BRANCHFACTOR; i++) mh[i] = merkle(m + (stride * i), stride); return hash((uint8_t *)mh, sizeof(mh)); } int main() { uint8_t data[4096]; for (size_t i = 0; i < sizeof(data); i++) data[i] = i; printf("%08x\n", merkle(data, sizeof(data))); return 0; } <<< ### Paths To The Root One interesting property of Merkle trees is that we can check any given individual data block, as long as we know all the hash values on the path up the tree from that data block to the root, without needing to validate (or even posess) the entire tree. For example, let's suppose that we want to validate a data block in a tree of depth 2. The root hash of the tree will be: h = H(b_0 || b_1 || ... || b_bf) where b_i = H(m_{i * bf + 0} || m_{i * bf + 1} || ...) so if we know all of the values b_0, ..., b_bf, and we know all of the values m_{i * bf + 0} ... m_{i * bf + bf - 1}, we can check that: 1. The data block itself is correct, since it is covered by b_i, and 2. b_i is correct, since it is hashed into h without needing any of the other blocks of m. In the general case, we can produce a "path to the root" from any given leaf of the tree by producing the requisite other inputs to the hashes at each level. This technique is the basis of block chains as a data structure.