One of the central problems in cryptography is "integrity protection", or proving that some given data is as it was when it was generated or validated by some trusted party. This matters especially for things like ensuring that a software package wasn't tampered with, or that an email is not modified since the sender produced it. Usually this is done using asymmetric signing schemes, but these are slow and have large keys and require clever mathematics to implement properly, so another option that is useful in some contexts is symmetric signing schemes - i.e., ones where the same (private) key is used both to apply the protection and to check that the protection is intact. These schemes aren't useful for either of the cases given above: if a user has the required key to check that a software release with a symmetric signature is intact, then they also have the required key to forge a release of that software, which is what security professionals call a "major bummer" when it happens. However, symmetric signatures can be quite useful in two scenarios: 1. The data is being returned to the point of origin but passing through untrusted hands en route, so the secret key is held and used only by one party; 2. The secret key is negotiated as part of some other protocol and used to provide transient protection for some kind of bulk data. An example of a system of the first type might be a login server that issues users "tokens". Assuming we have a symmetric signing operation: sign(key k, message m) verify(key k, message m) The login server might support an operation like "login", which takes a username and password, checks them against its database, and if successful returns: sign(LOGIN_SERVER_KEY, "logged in:" + user + time) which can then be re-validated by the server at a later date using verify() with that same key[1]. An example of a system of the second type is pretty simple to imagine: you and I meet in person, let's say, and establish a shared secret key; I then transmit to you a large file signed with that key and you verify it. Easy! So, how does this symmetric signing actually work in practice? Well, there are quite a few different schemes, but one of the most common and easiest to use, HMAC, involves use of a hash function. Since hash functions are irreversible[2], in theory, if I generate: hash(secretkey + data) nobody else, observing that hash value, can re-derive the secret key, and therefore nobody could produce hash(secretkey + otherdata). Admirably simply and clear... but it doesn't actually provide the security guarantee you might want, and as a result the actual HMAC standard, RFC 2104, has a more complicated scheme involving two nested applications of the hash function. Wait, though. Why doesn't it work? To understand, we need to dig a bit into how most modern hash functions work. Most hash functions are built from what is called a "one-way compression function", which takes a previous state and a block of data and produces a new state. The previous state and the block of data don't have to be the same size, but they often are. For example, here's a one-way compression function with a state size of 4 and a block size of 4: uint32_t compress(uint32_t state, uint32_t block) { return state ^ block; } Obviously this is a garbage compression function for cryptographic purposes but it'll serve to illustrate the point. From this compression function, we can produce an actual hash function, which takes an arbitrary amount if input and produces a fixed-length hash value of it, like so: 1. Pad the input (which is of arbitrary length) to a multiple of the block size; 2. Break the input into blocks of the block size, b[0] ... b[n] 3. Let the initial state s be some fixed value (the "initialization vector") 4. For each block b[i] of input, do: s = compress(s, b[i]) 5. Extract a hash value from the state - this is called "finalization". This construct is neat and simple to implement, and as a bonus it has some pleasant provable security properties. However, it also has some major flaws, including what is called the "length extension attack", which we'll describe here, along with how HMAC prevents it. In step 1, I somewhat breezily instructed you to "pad the input", and your first question, as a security-conscious implementer, ought to have been "pad with what?". Obviously simply padding with any fixed value p is insufficient, because then messages ending or not ending in p will hash to the same value. What MD schemes often do instead is what's called "length padding", in which the length of the input is used as part of the padding. This clever scheme ensures that, eg, hash("AAA") and hash("AAA\0"), if the padding character is \0, are different since their lengths are different. Let's call this function pad() and declare that it adds some proper padding to the input. Now, in C, our finished hash function[3] might look like this: uint32_t hash(uint8_t *message, int len) { uint32_t state = IV; int i; for (i = 0; i < len - (len % 4); i += 4) state = compress(state, *(uint32_t *)(message + i)); state = compress(state, pad(message + i, len)); return state; } Note that a padding block is always added to the end, even if the original input was a multiple of the block size - in that case, the last block compressed in is just the length and padding. It's often tempting, as our example hash does, simply return the internal state as the final hash - e.g., in the now-broken SHA-1, the finalize() function is basically just the identity function. There is one curious property of this construct though: given an input message m, the hash function returns its own internal state after hashing m with pad(m) appended. Given this internal state, it is very possible to append more data and receive a valid hash! For example, if I compute: hash("secretkey" + "message") what is actually returned is basically: h = compress(compress(...(compress(IV, "secr"), "etke"), ..., pad(orig))) it is very easy to then just compute: h' = compress(h, "zomg") I could then append my own, new, padding block as required by the construct: h'' = compress(h', pad(new)) and it will be the case that: h'' = hash("secretkey" + "message" + pad(orig) + "zomg") despite the fact that I don't know secretkey, I have produced an (admittedly oddly-shaped) message apparently symmetrically signed with secretkey. Depending on the actual application this may or may not be a real problem but it's certainly concerning, and makes this signing construct difficult to use safely without knowing more about how it's used. That, of course, is no good in a security system, so what do we do? Well, the real problem here is that the internal state of the hash is returned such that an attacker can simply continue to run the hash computation. We could use a wider hash which we then "narrow" - eg, if I made the internal state a uint64_t in my example and returned only the lower 32 bits of it, someone trying to forge a message would have no way of reconstructing the upper 32 bits of that state and therefore wouldn't be able to extend the hash. This is called the "wide pipe" approach to fixing length extensions, but it really has to be designed into the hash function at birth, and we want to be able to securely use our existing hash functions if we can. Thus, enter HMAC. HMAC addresses the problem of obscuring the hash's internal state by computing: hmac(k, m) = hash((k xor o) + hash((k xor i) + m)) where o and i are constants, chosen just to ensure that the two hashes definitely do start with different bytes. This approach absolutely prevents length extension, because the state of the "inner" hash, which could be length-extended, is hidden in the output of the "outer" hash and can't be recovered, and the input to the "outer" hash is always of exactly fixed length and can't be length-extended. One could theoretically do: hmac(k, m) = hash(k + hash(k + m)) and the result would likely still be secure, but the use of the o and i constants allows HMAC's security proof to be a special case of the security proof of a more general construct called NMAC: nmac(k0, k1, m) = hash(k0 + hash(k1 + m)) You can read more about that, along with all the equations you could ever desire, in a PDF with sha256sum 69013ff255612c4f18e9ea07a7a9dd46a3ee8b71553650292777d2774c2046ad. Anyway, that's what a length extension attack is and how the HMAC nested-hash construct avoids it! [1]: It would be a very useful lie to say that this is how Kerberos works, and through a very specific lens it is, but it uses a symmetric cipher (regrettably often something like DES) to provide both integrity protection and confidentiality. Yikes. [2]: Meaning that given hash(m), you cannot easily re-derive m. This is formally called a "first preimage attack". The value of "easily" is, of course, subjective. [3]: This function obviously is unsuitable for real use. Please do not email me with cryptanalyses of it.