Tarballs

There is a common archive tool, present on all Unix-like systems and with many ports to Windows, named tar(1). As almost everyone uses it these days, this tool has three commands:

    'c': which creates a fresh tar archive from some files
    't': which lists the contents of a tar archive
    'x': which extracts files from a tar archive

and then a bunch of modifiers, of which the most common ones are:

    'f': use a named file for the tar archive, rather than stdin/stdout
    'v': be chatty
    'z': use gzip compression
    'j': use bzip2 compression

It also has two commands that nobody ever uses:

    'r': append files to an existing tar file
    'u': append files to an existing tar file if they are newer than the existing copies in that tar file

Also, tar(1) has a unique syntax among Unix tools, in which both the command and all the modifier flags are supplied as a single un-prefixed argument at the start of its argument list[1] - i.e., one can write:

    tar xvf file.tar

What's the deal with this syntax, and the two other odd commands? Time for a bit of a journey into the past...

# DECtape / magtape

The current man page for gnu tar(1) refers to it as "an archiving utility", but the BSD man page is a bit more honest and admits that it is for manipulating "tape archives". In fact, tar(1) is a descendant of tp(1), which was a prior program for managing tape archives, and tp(1) was in turn a descendant of tap(1), a yet earlier program for managing tape archives. Both tp(1) and tap(1) are now historical footnotes, but thanks to the excellent work of the Unix Heritage Society, we still have access to their man pages and some of their source code.

Both tap(1) and tp(1) were originally written to shuffle data to/from data tapes, which for a long while were the gold standard of durable data storage - early hard disks, even when available, were extremely prone to failure, so data one wished to keep was committed to tape.
There were multiple kinds of tape in wide use but the one that early unixes dealt with most often was called "DECtape". Tapes have a couple of salient physical properties:

1. They can only be efficiently read in order, because seeking around the tape takes time linear in how big the seek offset is - this is in contrast to spinning disks, which have variable but small seek times regardless of distance, and solid-state disks which have no seek times at all.
2. They can only usefully be read and written in whole blocks - this is in common with both spinning disks and solid-state disks, but modern file systems go out of their way to maintain the illusion that individual bytes can be modified.

DECtape drives could read and write in 128-byte blocks, and the early Unix tape devices exposed them as "record-oriented devices", in which successive reads, regardless of length, would read from successive tape blocks - i.e., this:

    read(tapefd, buf, 16);
    read(tapefd, buf, 16);

would read the first 16 bytes of the first block, then the first 16 bytes of the second block[2].

Since tapes were much cheaper than tape drives, and one couldn't simply re-download everything from The Cloud, it quickly became important to be able to keep tape backups of data. The first program for doing this, as far as I know, was tap(1).

# tap(1)

The tap program (as well as a companion program called 'mt', which worked on magtapes instead of DECtapes) was present as early as Unix V1, and already contained some very familiar elements. One invoked it with a 'key', which was a single-letter command and optionally some modifier letters, and then a list of names to operate on. For example, one might do:

    tap t

to list the contents of the tape (or "table" them, literally producing a table of contents for the tape), and then do:

    tap x /home/elly/prog.c

to extract a single item from the tape to disk.
Just doing:

    tap x

would extract the entire contents of the tape into the file system, effectively restoring a backup. To create a tape in the first place, one would do:

    tap r

optionally with a path or paths to include, where "r" means "replace": i.e., any existing entries already on the tape were overwritten with their new versions. By default, both 'x' and 'r' operated on / (the root of the filesystem) so they served to back up or restore the whole filesystem to a tape.

There were also some modifier letters, some of which are very recognizable today:

    'v': indicate which files are being operated on and how
    'c': create a fresh archive - i.e., wipe the tape "directory" first
    'w': prompt for confirmation before every action

and a couple that are archaic:

    'm': create directories while extracting - tar simply always does this today, since not doing it makes little sense, but as of V1 tap(1) contained a warning that this doesn't always work, which is presumably why it's an option
    'f': create fake entries, with only metadata and no data

and most interestingly:

    '0' .. '7': select which tape drive to use

which would later become the 'f' option that we all know and love.

Blessedly, tap(5) from Unix V2 is reproduced in the June 1972 scans of the manual from Bell Labs. The file format is as follows, using 256-word (aka 512-byte) blocks:

Block zero of the tape is not used by tap, so that it can be used as a boot program for bootable tapes - the DECtape drives had explicit support for this, basically by having a special "boot mode" where they simply sent the first block to the host without any special framing.

After that, the next 24 blocks (numbers 1 through 24) contain the tape's directory. The directory has 192 64-byte entries, 8 to a 512-byte block.
Each entry looked like this:

    struct tap_entry {
        char     name[32];
        uint8_t  mode;
        uint8_t  uid;
        uint16_t size;       /* in bytes */
        uint32_t mtime;      /* seconds since epoch */
        uint16_t addr;       /* tape block number, NOT bytes */
        char     unused[20];
        uint16_t checksum;
    };

where the mode has six legal bits:

    0x20: setuid
    0x10: executable
    0x08: owner read
    0x04: owner write
    0x02: non-owner read
    0x01: non-owner write

and the checksum is a simple additive checksum: the entire entry, treated as a vector of uint16_ts, must sum to 0 (modulo 2^16). Every file occupies an integral number of contiguous 512-byte tape blocks, and no fragmentation/etc is allowed. Blocks 25 and up are where the file data starts.

Over the next couple of releases of Unix, tap(1) gained a couple of new features but was substantially unchanged. In particular, it got more useful as an incremental backup tool[3], gaining two new commands:

    'd': delete named files or directories from the tape
    'u': update, replacing existing files if newer than the tape version

# Enter tp(1) & tp(5)

This format has one obvious problem, which is that it predates groups as a concept and so would lose group information. Groups appeared I *believe* in Unix V5, and to support archiving group information, as well as the newer, larger file modes, the format needed to be altered. The tp(5) tape format was very similar to the older tap(5) format, but the directory entries were now:

    struct tp_entry {
        char     name[32];
        uint16_t mode;
        uint8_t  uid;
        uint8_t  gid;
        uint8_t  unused1;
        uint8_t  size_hi;      /* high byte of the 24-bit size */
        uint16_t size_lo;      /* low word of the 24-bit size */
        uint32_t mtime;
        uint16_t addr;
        uint8_t  unused2[16];
        uint16_t checksum;
    };

Many of the fields are as they were, except that there is now a gid field, the mode field is 8 bits wider, and the size field is now 3 bytes (yikes), broken into a "high byte" and a "low word". Also, block zero is explicitly specified as containing a "stand-alone bootstrap program" (!).
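To make the additive checksum on these 64-byte directory entries concrete, here is a minimal sketch in C. It assumes that 16-bit words are stored little-endian, as on the PDP-11 (the format itself left byte order to the host), and that the checksum is the last word of the entry, as in the structs above. The function names are made up for illustration and come from neither tap(1) nor tp(1).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    TAP_BLOCK_SIZE   = 512, /* bytes per tape block */
    TAP_ENTRY_SIZE   = 64,  /* bytes per directory entry */
    TAP_DIR_BLOCKS   = 24,  /* blocks 1 through 24 hold the directory */
    TAP_CHECKSUM_OFF = 62   /* the checksum is the last 16-bit word */
};

/* 8 entries per block times 24 blocks gives the 192-entry limit. */
static_assert((TAP_BLOCK_SIZE / TAP_ENTRY_SIZE) * TAP_DIR_BLOCKS == 192,
              "the directory holds 192 entries");

/* Read one 16-bit word. ASSUMPTION: little-endian byte order. */
static uint16_t tap_word(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Sum all 32 words of an entry, modulo 2^16. */
static uint16_t tap_entry_sum(const unsigned char entry[TAP_ENTRY_SIZE])
{
    uint16_t sum = 0;
    for (size_t i = 0; i < TAP_ENTRY_SIZE; i += 2)
        sum = (uint16_t)(sum + tap_word(entry + i));
    return sum;
}

/* An entry is valid when its words, checksum included, sum to zero. */
static int tap_entry_valid(const unsigned char entry[TAP_ENTRY_SIZE])
{
    return tap_entry_sum(entry) == 0;
}

/* Fill in the checksum word so that the whole entry sums to zero. */
static void tap_entry_seal(unsigned char entry[TAP_ENTRY_SIZE])
{
    entry[TAP_CHECKSUM_OFF] = 0;
    entry[TAP_CHECKSUM_OFF + 1] = 0;
    uint16_t fix = (uint16_t)(0u - tap_entry_sum(entry));
    entry[TAP_CHECKSUM_OFF] = (unsigned char)(fix & 0xff);
    entry[TAP_CHECKSUM_OFF + 1] = (unsigned char)(fix >> 8);
}
```

A writer would call tap_entry_seal() after filling in the other fields, and a reader would reject any entry for which tap_entry_valid() returns 0. Note that a checksum like this catches corruption but does nothing to identify which of the two formats a tape is in.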
By this point, both tap(1) and mt(1) had been unified into tp(1), which knew how to operate on either kind of tape and adjusted its block size internally as needed.

There are three practical problems that both the old tap(5) format and the newer tp(5) format share:

1. There is no way to tell if a given tape is in tap(5) format or tp(5) format, and the two formats are not at all interchangeable - trying to read a tap(5) tape with tp(1) will use the uid as part of the mode, the file size as the uid/gid, etc etc; in modern design terms, neither format has any kind of identifying magic number.
2. The fields that are more than one byte long have unspecified endianness, and in fact it was host-dependent, so tapes from little-endian hosts could not be read on big-endian hosts or vice versa.
3. A more severe problem: the fixed format of having 24 blocks for directory entries imposes an absolute limit of 192 separate files or directories on a tape, regardless of how big it is.

Solving the last two of these problems required another new format, and another new tool to go with it: tar.

# Exit tp, enter tar

Tar first showed up in Unix V7 (I believe) and it worked very much like the earlier tools it was descended from; it took a single-character command and some modifiers, and even preserved the 0 .. 7 suffix to specify which tape drive to use. It did promote 'c' to a command that implied 'r', such that:

    tar c

would behave as

    tp rc

had. It also had the first support for the 'f' option, which allowed specifying a filename to use for the archive, or '-' for stdin/stdout.

Under the hood, though, things were very different. A tar file has no directory at all - instead, each entry is written with a header for that entry followed by the data for that entry.
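Since there is no directory, finding an entry means walking the archive from the front, hopping from one header to the next. A minimal sketch of that arithmetic in C follows, assuming the 512-byte block layout used by V7 tar (one header block, then the file data padded up to a whole number of blocks). The function names are hypothetical and not taken from any tar implementation.

```c
#include <assert.h>
#include <stdint.h>

enum { TAR_BLOCK = 512 }; /* ASSUMPTION: V7-style 512-byte blocks */

/* Number of 512-byte data blocks needed to hold `size` bytes of file data. */
static uint64_t tar_data_blocks(uint64_t size)
{
    return (size + TAR_BLOCK - 1) / TAR_BLOCK;
}

/*
 * Byte offset of the header that follows an entry whose own header
 * starts at `header_off` and whose file data is `size` bytes long.
 */
static uint64_t tar_next_header(uint64_t header_off, uint64_t size)
{
    /* Headers always start on a block boundary. */
    assert(header_off % TAR_BLOCK == 0);
    return header_off + TAR_BLOCK + tar_data_blocks(size) * TAR_BLOCK;
}
```

For example, a 1000-byte file whose header sits at offset 512 occupies two data blocks, so the next header starts at offset 2048. The cost of this design is that listing an archive means reading (or at least skipping over) every entry, which matters little on a tape that has to be read in order anyway.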
In V7 Unix, the header looked like this:

    struct tar_entry {
        char name[100];
        char mode[8];
        char uid[8];
        char gid[8];
        char size[12];
        char mtime[12];
        char chksum[8];
        char linkflag;
        char linkname[100];
    };

which you'll notice is an extremely hefty 257 bytes, compared to the earlier format's 64 bytes per file of overhead. After that were 255 bytes of padding to fill a 512-byte block, then the requisite number of 512-byte data blocks to contain the file's data. There are a couple of very odd things about this format: namely, why the heck is everything suddenly a char array, and what is linkflag/linkname?

The reason for various fields becoming char arrays is deeply unfortunate: to deal with the endianness problem mentioned above, and also the prospect of machines with different word sizes in general, the tar authors decided to store the various metadata values *as strings*, so those fields are all... octal strings, representing much the same thing as they did in tp(1), but now requiring string encoding and decoding at either end. Compared to the modern approach of declaring a fixed byte order, and converting to/from host as needed, this is markedly inferior. To make matters worse, the 100-byte buffers for name and linkname are simultaneously too long a lot of the time, and too short some of the time, but they are at least not as egregiously short as the older 32-byte name buffers.

As for linkname and linkflag, they were used for storing hard links (V7 predates symbolic links, which later tars record in the same fields); linkflag just served to indicate whether linkname was valid.

This format, while there are many retrospectively-silly things about it, resolves two of the three problems we mentioned above. The various numerical metadata is now stored as octal strings so there are no portability problems per se, and the absence of a fixed-size directory means there's no limitation on how many entries the archive can store...
but there's that nagging 100-character length limit still there, and the archives still aren't particularly identifiable.

# ustar

The answer was a POSIX creation, dating to I think some time in the late 80s, called "ustar" (short for "unix standard tar"). The ustar file header now looked like this:

    struct ustar_entry {
        char name[100];
        char mode[8];
        char uid[8];
        char gid[8];
        char size[12];
        char mtime[12];
        char checksum[8];
        char typeflag[1];
        char linkname[100];
        char magic[6];
        char version[2];
        char uname[32];
        char gname[32];
        char devmajor[8];
        char devminor[8];
        char prefix[155];
        char pad[12];
    };

This structure is, in fact, exactly 512 bytes. This neatly fit into the existing 512-byte tar entry format by making use of the formerly-unused padding section. None of the meanings of the old fields were changed except that 'linkflag' became 'typeflag' and gained more valid values, but in particular, every valid old tar archive was a valid ustar archive as well. The new fields are pretty interesting:

* 'magic' is the 6-byte string "ustar", with a null byte after
* 'version' is two ASCII digits for what version of the format this is
* 'uname' & 'gname' are the *names* of the user/group for this entry, which supersede the uid/gid fields if present - this was to allow for, eg, systems which had 'wheel' map to a different uid or similar to have compatible archives
* 'devmajor' & 'devminor' allow for encoding block devices in tar archives
* 'prefix' was prepended to the pathname if present, allowing for full 255-char names

Note that the prefix hack is pretty clever: old tar implementations could read ustar archives and get the end (the most unique part, generally) of too-long filenames, and because of how old tar was written, the new types that ustar added would turn into plain files.

# gnu tar

Regrettably, gnu tar extended the tar format before ustar was standardized[4], so it has an incompatible entry format with different features.
I won't bother relating it here, but it shares *some* of the same extensions (like uname/gname/devmajor/devminor) and had other incompatible ones. Notably, it allowed headers to span multiple blocks, opening up oodles of complex encoding possibilities, and had support for "manifests" (listings of which parts of a directory were included), sparse files, volume names for multipart archives, and so on and so on. The entire thing is somewhat fatiguing to think about, but it does bear mentioning that the magic number they chose for this format is:

    "ustar "

i.e., "ustar", but with a trailing *space* rather than a null. It's pretty clear that they were working off some pre-standardization version of the ustar format, and now just have an incompatible thing.

# pax

A "pax archive" is a valid ustar archive, except that two new types of entries are allowed: 'x' and 'g'. These are used to store metadata, and in particular to store such niceties as UTF-8 group/user names, atime/ctime/mtime with fractional seconds (for some reason), UTF-8 path and link names, much larger sizes, extended attributes, and so on. These are just normal "entries" in the tar file, basically plain files with a different type flag, and in fact even the oldest tar implementation can successfully extract them - it will treat the metadata as an ordinary file, extract it, and the user can inspect it as they desire.

The pax format is basically the current standard for tar, and POSIX no longer specifies the older tar format - pax is now the thing that tar(1) utilities generate and read, although it is actually backwards-compatible all the way to Unix V7.

Phew! That was a long tour through history. Thanks for reading!

Sources:

DECtape: the VT103 user guide, PDF with sha256sum
    a26ee5530c42c7f44e6efb9cd926d677f61a57ef6169835b6b6f85b7d0f843cc
tap(5): the Unix V2 manual, PDF with sha256sum
    d20042daf44fb9420d65ec89742eaf9d18a3aee7e4aafc0b632e138f5eb9f1f7
tp(1) & tp(5): the Unix V5 manual, PDF with sha256sum
    dd792cba23dc0229b417ef3148eb8db91516d7528ad12c36f29bb93435a2c6a1
tar(5): the FreeBSD tar(5) man page

[1]: Some tar implementations, notably gnu tar, also accept a more usual dashed format, like 'tar -cvf ...'.
[2]: Even as of V1, the rfo(4) man page contains an apology for this behavior, and it would disappear in later Unixes.
[3]: V5 Unix also had commands "dump" and "restor", which were specifically designed for incremental backups.
[4]: This has what the teens are calling "big GNU energy".