Tarballs

There is a common archive tool, present on all Unix-like systems and with many
ports to Windows, named tar(1). As almost everyone uses it these days, it has
three commands:

  'c': which creates a fresh tar archive from some files
  't': which lists the contents of a tar archive
  'x': which extracts files from a tar archive

and then a bunch of modifiers, of which the most common ones are:

  'f': use a named file for the tar archive, rather than stdin/stdout
  'v': be chatty
  'z': use gzip compression
  'j': use bzip2 compression

It also has two commands that nobody ever uses:

  'r': append files to an existing tar file
  'u': append files to an existing tar file if they are newer than the existing
       copies in that tar file

Also, tar(1) has an unusual syntax among unix tools, in which both the
command and all the modifier flags are supplied as a single un-prefixed argument
at the start of its argument list[1] - i.e., one can write:

  tar xvf file.tar

What's the deal with this syntax, and the two other odd commands? Time for a bit
of a journey into the past...

# DECtape / magtape

The current man page for gnu tar(1) refers to it as "an archiving utility", but
the BSD man page is a bit more honest and admits that it is for manipulating
"tape archives". In fact, tar(1) is a descendant of tp(1), which was a prior
program for managing tape archives, and tp(1) was in turn a descendant of
tap(1), a yet earlier program for managing tape archives. Both tp(1) and tap(1)
are now historical footnotes, but thanks to the excellent work of the Unix
Heritage Society, we still have access to their man pages and some of their
source code.

Both tap(1) and tp(1) were originally written to shuffle data to/from data
tapes, which for a long while were the gold standard of durable data storage -
early hard disks, even when available, were extremely prone to failure, so data
one wished to keep was committed to tape. There were multiple kinds of tape in
wide use but the one that early unixes dealt with most often was called
"DECtape".

Tapes have a couple of salient physical properties:

1. They can only be efficiently read in order, because seeking around the tape
   takes time linear in how big the seek offset is - this is in contrast to
   spinning disks, which have variable but small seek times regardless of
   distance, and solid-state disks which have no seek times at all.

2. They can only usefully be read and written in whole blocks - this is in
   common with both spinning disks and solid-state disks, but modern file
   systems go out of their way to maintain the illusion that individual bytes
   can be modified.

DECtape drives could read and write in 128-byte blocks, and the early Unix tape
devices exposed them as "record-oriented devices", in which successive reads,
regardless of length, would read from successive tape blocks - i.e., this:

  read(tapefd, buf, 16);
  read(tapefd, buf, 16);

would read the first 16 bytes of the first block, then the first 16 bytes of the
second block[2].
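
To make that behaviour concrete, here is a toy in-memory model of a
record-oriented device. Everything in it (struct rec_dev, rec_read, the number
of blocks) is made up for illustration and is not the real driver code; only the
128-byte block size comes from the text above.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 128   /* block size quoted above */
#define NUM_BLOCKS 4     /* arbitrary, for illustration only */

/* Hypothetical model of a record-oriented tape device. */
struct rec_dev {
  unsigned char blocks[NUM_BLOCKS][BLOCK_SIZE];
  size_t next;           /* index of the next block to be read */
};

/*
 * Copy up to n bytes from the start of the next block into buf, then advance
 * to the following block no matter how few bytes were asked for. Returns the
 * number of bytes copied, or 0 at end of tape.
 */
size_t rec_read(struct rec_dev *dev, void *buf, size_t n) {
  if (dev->next >= NUM_BLOCKS)
    return 0;
  if (n > BLOCK_SIZE)
    n = BLOCK_SIZE;
  memcpy(buf, dev->blocks[dev->next], n);
  dev->next++;           /* the unread rest of the block is skipped */
  return n;
}
```

Two 16-byte calls to rec_read() hand back the starts of blocks 0 and 1, and the
remaining 112 bytes of block 0 are never seen.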

Since tapes were much cheaper than tape drives, and one couldn't simply
re-download everything from The Cloud, it quickly became important to be able to
keep tape backups of data. The first program for doing this, as far as I know,
was tap(1).

# tap(1)

The tap program (as well as a companion program called 'mt', which worked on
magtapes instead of DECtapes) was present as early as Unix V1, and already
contained some very familiar elements. One invoked it with a 'key', which was a
single-letter command and optionally some modifier letters, and then a list of
names to operate on. For example, one might do:

  tap t

to list the contents of the tape (or "table" them, literally producing a table
of contents for the tape), and then do:

  tap x /home/elly/prog.c

to extract a single item from the tape to disk. Just doing:

  tap x

would extract the entire contents of the tape into the file system, effectively
restoring a backup. To create a tape in the first place, one would do:

  tap r

optionally with a path or paths to include, where "r" means "replace": i.e., any
existing entries already on the tape were overwritten with their new versions.
By default, both 'x' and 'r' operated on / (the root of the filesystem) so they
served to back up or restore the whole filesystem to a tape.

There were also some modifier letters, some of which are very recognizable
today:

  'v': indicate which files are being operated on and how
  'c': create a fresh archive - i.e., wipe the tape "directory" first
  'w': prompt for confirmation before every action

and a couple that are archaic:

  'm': create directories while extracting - tar simply always does this today,
       since not doing it makes little sense, but as of V1 tap(1) contained a
       warning that this doesn't always work, which is presumably why it's an
       option
  'f': create fake entries, with only metadata and no data

and most interestingly:

  '0' .. '7': select which tape drive to use

which would later become the 'f' option that we all know and love.
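
As a rough sketch of how such a key could be taken apart (this is not the
original tap source): struct tap_key and parse_tap_key are names I made up, and
I'm assuming the command letter comes first and that the drive defaults to 0
when no digit is given.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parsed form of a tap(1) key such as "xv3". */
struct tap_key {
  char command;          /* 'r', 'x' or 't' */
  bool verbose;          /* 'v' */
  bool create;           /* 'c' */
  bool confirm;          /* 'w' */
  bool make_dirs;        /* 'm' */
  bool fake;             /* 'f' */
  int drive;             /* '0' .. '7'; 0 if none given (an assumption) */
};

/* Parse a key string; returns false if it is not well-formed. */
bool parse_tap_key(const char *key, struct tap_key *out) {
  memset(out, 0, sizeof *out);
  if (key[0] == '\0' || strchr("rxt", key[0]) == NULL)
    return false;
  out->command = key[0];
  for (const char *p = key + 1; *p != '\0'; p++) {
    switch (*p) {
    case 'v': out->verbose = true; break;
    case 'c': out->create = true; break;
    case 'w': out->confirm = true; break;
    case 'm': out->make_dirs = true; break;
    case 'f': out->fake = true; break;
    default:
      if (*p < '0' || *p > '7')
        return false;    /* unknown modifier letter */
      out->drive = *p - '0';
      break;
    }
  }
  return true;
}
```

With this, "xv3" parses as "extract, verbosely, from drive 3", and "rc" as
"replace onto a freshly wiped tape".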

Blessedly, tap(5) from Unix V2 is reproduced in the June 1972 scans of the
manual from Bell Labs. The file format is as follows, using 256-word (aka
512-byte) blocks:

Block zero of the tape is not used by tap, so that it can be used as a boot
program for bootable tapes - the DECtape drives had explicit support for this,
basically by having a special "boot mode" where they simply sent the first block
to the host without any special framing.

After that, the next 24 blocks (numbers 1 through 24) contain the tape's
directory. The directory has 192 64-byte entries, 8 to a 512-byte block. Each
entry looked like this:

  struct tap_entry {
    char name[32];
    uint8_t mode;
    uint8_t uid;
    uint16_t size;          /* in bytes */
    uint32_t mtime;         /* seconds since epoch */
    uint16_t addr;          /* tape block number, NOT bytes */
    char unused[20];
    uint16_t checksum;
  };

where the mode has six legal bits:

  0x20: setuid
  0x10: executable
  0x08: owner read
  0x04: owner write
  0x02: non-owner read
  0x01: non-owner write

and the checksum is a simple additive checksum: the entire entry, treated as a
vector of uint16_ts, must sum to 0. Every file occupies an integral number of
contiguous 512-byte tape blocks, and no fragmentation/etc is allowed.
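
The checksum rule is easy to sketch. The function names below are mine, and the
code sums words in host byte order, which only matches a real tape if the host
agrees with the machine that wrote it; treat it as an illustration of the rule
rather than the original implementation.

```c
#include <stdint.h>
#include <string.h>

#define TAP_ENTRY_BYTES 64
#define TAP_ENTRY_WORDS (TAP_ENTRY_BYTES / 2)

/* Sum a raw 64-byte entry as 32 host-order 16-bit words, modulo 2^16. */
uint16_t tap_word_sum(const unsigned char entry[TAP_ENTRY_BYTES]) {
  uint16_t sum = 0;
  for (int i = 0; i < TAP_ENTRY_WORDS; i++) {
    uint16_t w;
    memcpy(&w, entry + 2 * i, sizeof w);   /* avoids alignment problems */
    sum = (uint16_t)(sum + w);
  }
  return sum;
}

/* An entry is valid when all 32 words, checksum included, sum to zero. */
int tap_entry_valid(const unsigned char entry[TAP_ENTRY_BYTES]) {
  return tap_word_sum(entry) == 0;
}

/* Fill in the final word so that the whole entry sums to zero. */
void tap_set_checksum(unsigned char entry[TAP_ENTRY_BYTES]) {
  uint16_t zero = 0, fix;
  memcpy(entry + TAP_ENTRY_BYTES - 2, &zero, sizeof zero);
  fix = (uint16_t)(0u - tap_word_sum(entry));
  memcpy(entry + TAP_ENTRY_BYTES - 2, &fix, sizeof fix);
}
```

Note that an all-zero entry passes this check too, so the checksum alone can't
tell an empty directory slot from a used one.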

Blocks 25 and up are where the file data starts.
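
The layout arithmetic described above works out as follows. The constant and
function names are hypothetical; the numbers are the ones from the format
description.

```c
#include <stdint.h>

enum {
  TAP_BLOCK_SIZE = 512,
  TAP_ENTRY_SIZE = 64,
  TAP_DIR_START = 1,       /* block 0 is left for the boot program */
  TAP_DIR_BLOCKS = 24,
  TAP_ENTRIES_PER_BLOCK = TAP_BLOCK_SIZE / TAP_ENTRY_SIZE,     /* 8 */
  TAP_MAX_ENTRIES = TAP_DIR_BLOCKS * TAP_ENTRIES_PER_BLOCK,    /* 192 */
  TAP_DATA_START = TAP_DIR_START + TAP_DIR_BLOCKS              /* 25 */
};

/* Tape block holding directory entry number `index` (0 .. 191). */
unsigned tap_dir_block(unsigned index) {
  return TAP_DIR_START + index / TAP_ENTRIES_PER_BLOCK;
}

/* Byte offset of that entry within its block. */
unsigned tap_dir_offset(unsigned index) {
  return (index % TAP_ENTRIES_PER_BLOCK) * TAP_ENTRY_SIZE;
}

/* Number of whole 512-byte blocks a file of `size` bytes occupies. */
unsigned tap_file_blocks(uint16_t size) {
  return ((unsigned)size + TAP_BLOCK_SIZE - 1) / TAP_BLOCK_SIZE;
}
```

Since the size field is only 16 bits, no single file can take up more than 128
blocks.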

Over the next couple of releases of Unix, tap(1) gained a couple of new features
but was substantially unchanged. In particular, it got more useful as an
incremental backup tool[3], gaining two new commands:

  'd': delete named files or directories from the tape
  'u': update, replacing existing files if newer than the tape version

# Enter tp(1) & tp(5)

This format has one obvious problem, which is that it predates groups as a
concept and so would lose group information. Groups appeared I *believe* in Unix
V5, and to support archiving group information, as well as the newer, larger
file modes, the format needed to be altered. The tp(5) tape format was very
similar to the older tap(5) format, but the directory entries were now:

  struct tp_entry {
    char name[32];
    uint16_t mode;
    uint8_t uid;
    uint8_t gid;
    uint8_t unused1;
    uint8_t size_hi;        /* high byte of the 24-bit size */
    uint16_t size_lo;       /* low word of the 24-bit size */
    uint32_t mtime;
    uint16_t addr;
    uint8_t unused2[16];
    uint16_t checksum;
  };

Many of the fields are as they were, except that there is now a gid field, the
mode field is 8 bits wider, and the size field is now 3 bytes (yikes), broken
into a "high byte" and a "low word". Also, block zero is explicitly specified as
containing a "stand-alone bootstrap program" (!).
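
Putting the "high byte" and "low word" back together is a shift and an OR. A
small sketch, with function names of my own choosing:

```c
#include <stdint.h>

/* Combine tp(5)'s "high byte" and "low word" into one 24-bit size. */
uint32_t tp_size(uint8_t size_hi, uint16_t size_lo) {
  return ((uint32_t)size_hi << 16) | size_lo;
}

/*
 * Split a size back into the two stored pieces. Returns 0 on success, or -1
 * if the size does not fit in 24 bits.
 */
int tp_split_size(uint32_t size, uint8_t *size_hi, uint16_t *size_lo) {
  if (size > 0xFFFFFFu)
    return -1;
  *size_hi = (uint8_t)(size >> 16);
  *size_lo = (uint16_t)(size & 0xFFFFu);
  return 0;
}
```

That raises the largest representable file from 65535 bytes under tap(5) to
16777215 bytes, i.e. one byte short of 16 MiB.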

By this point, both tap(1) and mt(1) had been unified into tp(1), which knew how
to operate on either kind of tape and adjusted its block size internally as
needed.

There are three practical problems that both the old tap(5) format and the newer
tp(5) format share:

1. There is no way to tell if a given tape is in tap(5) format or tp(5) format,
   and the two formats are not at all interchangeable - trying to read a tap(5)
   tape with tp(1) will use the uid as part of the mode, the file size as the
   uid/gid, etc etc; in modern design terms, neither format has any kind of
   identifying magic number.

2. The fields that are more than one byte long have unspecified endianness, and
   in fact it was host-dependent, so tapes from little-endian hosts could not be
   read on big-endian hosts or vice versa.

3. A more severe problem: the fixed format of having 24 blocks for directory
   entries imposes an absolute limit of 192 separate files or directories on
   a tape, regardless of how big it is.
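
To illustrate problem 1, here is a sketch that reads the same raw bytes through
both layouts. The byte offsets come from the two structs above; the view
structs and function names are hypothetical, and I'm assuming little-endian
16-bit words for the multi-byte fields.

```c
#include <stdint.h>

/* Just the fields that overlap at byte offsets 32 .. 35 of an entry. */
struct tap_view { uint8_t mode, uid; uint16_t size; };
struct tp_view  { uint16_t mode; uint8_t uid, gid; };

/* Read a 16-bit little-endian value (an assumption about the tape). */
static uint16_t le16(const unsigned char *p) {
  return (uint16_t)(p[0] | (p[1] << 8));
}

/* Interpret bytes 32 .. 35 the way tap(5) lays them out. */
struct tap_view read_as_tap(const unsigned char entry[64]) {
  struct tap_view v = { entry[32], entry[33], le16(entry + 34) };
  return v;
}

/* Interpret the very same bytes the way tp(5) lays them out. */
struct tp_view read_as_tp(const unsigned char entry[64]) {
  struct tp_view v = { le16(entry + 32), entry[34], entry[35] };
  return v;
}
```

A tap(5) entry with mode 0x0c, uid 7 and size 0x1234 comes out of read_as_tp()
as mode 0x070c, uid 0x34 and gid 0x12, and nothing in the bytes themselves says
which reading is the right one.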

Solving the last two of these problems required another new format, and another
new tool to go with it: tar.

# Exit tp, enter tar

Tar first showed up in Unix V7 (I believe) and it worked very much like the
earlier tools it was descended from; it took a single-character command and some
modifiers, and even preserved the 0 .. 7 suffix to specify which tape drive to
use. It did promote 'c' to a command that implied 'r', such that:

  tar c

would behave as

  tp rc

had. It also had the first support for the 'f' option, which allowed specifying
a filename to use for the archive, or '-' for stdin/stdout.

Under the hood, though, things were very different. A tar file has no directory
at all - instead, each entry is written with a header for that entry followed by
the data for that entry. In V7 Unix, the header looked like this:

  struct tar_entry {
    char name[100];
    char mode[8];
    char uid[8];
    char gid[8];
    char size[12];
    char mtime[12];
    char chksum[8];
    char linkflag;
    char linkname[100];
  };

which you'll notice is an extremely hefty 257 bytes, compared to the earlier
format's 64 bytes per file of overhead. After that were 255 bytes of padding to
fill a 512-byte block, then the requisite number of 512-byte data blocks to
contain the file's data.

There are a couple of very odd things about this format: namely, why the heck is
everything suddenly a char array, and what is linkflag/linkname?

The reason for various fields becoming char arrays is deeply unfortunate: to
deal with the endianness problem mentioned above, and also the prospect of
machines with different word sizes in general, the tar authors decided to store
the various metadata values *as strings*, so those fields are all... octal
strings, representing much the same thing as they did in tp(1), but now
requiring string encoding and decoding at either end. Compared to the modern
approach of declaring a fixed byte order, and converting to/from host as needed,
this is markedly inferior. To make matters worse, the 100-byte buffers for name
and linkname are simultaneously too long a lot of the time, and too short some
of the time, but they are at least not as egregiously short as the older 32-byte
name buffers.
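
For a feel of what that string encoding and decoding involves, here is a
minimal sketch. The function names are mine, and the zero padding plus trailing
NUL is just one plausible convention; implementations differed on padding and
terminators, so don't take this as what any particular tar writes.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Decode an octal field of `len` bytes. Leading spaces are skipped and
 * decoding stops at the first byte that is not an octal digit (typically a
 * space or a NUL).
 */
unsigned long tar_get_octal(const char *field, size_t len) {
  unsigned long value = 0;
  size_t i = 0;
  while (i < len && field[i] == ' ')
    i++;
  for (; i < len && field[i] >= '0' && field[i] <= '7'; i++)
    value = value * 8 + (unsigned long)(field[i] - '0');
  return value;
}

/*
 * Encode `value` as zero-padded octal digits followed by a NUL, filling
 * exactly `len` bytes (padding style is an assumption, see above).
 * Returns 0 on success, or -1 if the value does not fit.
 */
int tar_put_octal(char *field, size_t len, unsigned long value) {
  char tmp[32];
  int n;
  if (len < 2 || len > sizeof tmp)
    return -1;
  n = snprintf(tmp, sizeof tmp, "%0*lo", (int)(len - 1), value);
  if (n < 0 || (size_t)n != len - 1)
    return -1;
  memcpy(field, tmp, len);   /* digits plus the terminating NUL */
  return 0;
}
```

Under this convention the 12-byte size field has room for 11 octal digits, so
the largest size it can hold is 8^11 - 1 bytes, one byte short of 8 GiB.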

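As for linkname and linkflag, they were used for storing links: when a file with
several hard links had already been written to the archive under one name, later
names were stored with linkflag set and linkname holding the first one. Symbolic
links did not exist yet in V7; later versions of tar reused the same two fields
for them.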

This format, while there are many retrospectively-silly things about it,
resolves two of the three problems we mentioned above. The various numerical
metadata is now stored as octal strings so there are no portability problems per
se, and the absence of a fixed-size directory means there's no limitation on how
many entries the archive can store... but there's that nagging 100-character
length limit still there, and the archives still aren't particularly
identifiable.
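
With no directory, the only way to find anything is to walk the archive from
the front: read a header, skip that entry's data blocks, repeat. A sketch over
an in-memory buffer follows; tar_walk and tar_visit_fn are hypothetical names,
the size offset is derived from the V7 struct above, and treating a header with
an empty name as the end of the archive is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TAR_BLOCK 512
#define TAR_SIZE_OFFSET 124   /* name[100] + mode[8] + uid[8] + gid[8] */
#define TAR_SIZE_LEN 12

/* Hypothetical callback type: called once per entry found. */
typedef void (*tar_visit_fn)(const char *name, unsigned long size, void *ctx);

/*
 * Walk an archive held in memory, header by header, and return the number of
 * entries seen. Stops at the end of the buffer or at a header whose name is
 * empty. No protection against corrupt size fields: this is only a sketch.
 */
size_t tar_walk(const unsigned char *buf, size_t len,
                tar_visit_fn visit, void *ctx) {
  size_t pos = 0, count = 0;
  while (pos + TAR_BLOCK <= len && buf[pos] != '\0') {
    char name[101], size_str[TAR_SIZE_LEN + 1];
    unsigned long size, data_blocks;

    memcpy(name, buf + pos, 100);
    name[100] = '\0';
    memcpy(size_str, buf + pos + TAR_SIZE_OFFSET, TAR_SIZE_LEN);
    size_str[TAR_SIZE_LEN] = '\0';
    size = strtoul(size_str, NULL, 8);     /* the size is an octal string */

    if (visit != NULL)
      visit(name, size, ctx);
    count++;

    /* Skip the header block plus the data, rounded up to whole blocks. */
    data_blocks = (size + TAR_BLOCK - 1) / TAR_BLOCK;
    pos += TAR_BLOCK * (1 + data_blocks);
  }
  return count;
}
```

Listing or extracting one file therefore costs a pass over everything before
it, which is exactly the access pattern a tape is good at.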

# ustar

The answer was a POSIX creation, dating to POSIX.1-1988, called "ustar" (short
for "unix standard tar"). The ustar file header now looked like this:

  struct ustar_entry {
    char name[100];
    char mode[8];
    char uid[8];
    char gid[8];
    char size[12];
    char mtime[12];
    char checksum[8];
    char typeflag[1];
    char linkname[100];

    char magic[6];
    char version[2];
    char uname[32];
    char gname[32];
    char devmajor[8];
    char devminor[8];
    char prefix[155];
    char pad[12];
  };

This structure is, in fact, exactly 512 bytes. This neatly fit into the existing
512-byte tar entry format by making use of the formerly-unused padding section.
None of the meanings of the old fields were changed except that 'linkflag'
became 'typeflag' and gained more valid values, but in particular, every valid
old tar archive was a valid ustar archive as well.

The new fields are pretty interesting:

* 'magic' is the five characters "ustar" followed by a null byte, 6 bytes in all
* 'version' is two ASCII digits for what version of the format this is
* 'uname' & 'gname' are the *names* of the user/group for this entry, which
  supersede the uid/gid fields if present - this was to allow for, eg, systems
  which had 'wheel' map to a different uid or similar to have compatible
  archives
* 'devmajor' & 'devminor' allow for encoding character and block devices in tar
  archives
* 'prefix' was prepended to the pathname (joined with a '/') if present,
  allowing for names of up to 256 characters (155 + '/' + 100)

Note that the prefix hack is pretty clever: old tar implementations could read
ustar archives and get the end (the most unique part, generally) of too-long
filenames, and because of how old tar was written, the new types that ustar
added would turn into plain files.
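
Reassembling the full name from a raw header could look like the sketch below.
The offsets follow from the struct above (prefix starts at byte 345), the
function names are mine, and I'm assuming the two fields may fill their buffers
completely with no terminating null.

```c
#include <stddef.h>
#include <string.h>

#define USTAR_NAME_OFFSET 0
#define USTAR_NAME_LEN 100
#define USTAR_PREFIX_OFFSET 345
#define USTAR_PREFIX_LEN 155

/* Length of a field that may or may not be NUL-terminated. */
static size_t field_len(const char *field, size_t max) {
  const char *end = memchr(field, '\0', max);
  return end != NULL ? (size_t)(end - field) : max;
}

/*
 * Build the full path of an entry from a raw 512-byte ustar header.
 * `out` needs room for 257 bytes: 155 + '/' + 100 + NUL.
 */
void ustar_full_path(const unsigned char header[512], char out[257]) {
  const char *prefix = (const char *)header + USTAR_PREFIX_OFFSET;
  const char *name = (const char *)header + USTAR_NAME_OFFSET;
  size_t plen = field_len(prefix, USTAR_PREFIX_LEN);
  size_t nlen = field_len(name, USTAR_NAME_LEN);
  size_t pos = 0;

  if (plen > 0) {              /* an empty prefix means "name only" */
    memcpy(out, prefix, plen);
    pos = plen;
    out[pos++] = '/';
  }
  memcpy(out + pos, name, nlen);
  out[pos + nlen] = '\0';
}
```

A reader that only knows the old format looks at the first 100 bytes and
nothing else, which is why it ends up with just the tail of a long path.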

# gnu tar

Regrettably, gnu tar extended the tar format before ustar was standardized[4],
so it has an incompatible entry format with different features. I won't bother
relating it here, but it shares *some* of the same extensions (like
uname/gname/devmajor/devminor) and had other incompatible ones. Notably, it
allowed headers to span multiple blocks, opening up oodles of complex encoding
possibilities, and had support for "manifests" (listings of which parts of a
directory were included), sparse files, volume names for multipart archives, and
so on and so on.

The entire thing is somewhat fatiguing to think about, but it does bear
mentioning that the magic number they chose for this format is:

  "ustar "

i.e., "ustar", but with a trailing *space* rather than a null. It's pretty clear
that they were working off some pre-standardization version of the ustar format,
and now just have an incompatible thing.
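
Telling the three header flavours apart by their magic bytes can be sketched
like this. The enum and function names are hypothetical, and a real reader
would want to check more than six bytes before trusting the answer.

```c
#include <string.h>

#define TAR_MAGIC_OFFSET 257   /* first byte after the V7 header fields */

enum tar_flavor { TAR_OLD, TAR_USTAR, TAR_GNU };

/* Classify a raw 512-byte header by the six bytes of its magic field. */
enum tar_flavor tar_classify(const unsigned char header[512]) {
  const unsigned char *magic = header + TAR_MAGIC_OFFSET;
  if (memcmp(magic, "ustar", 6) == 0)    /* 6 bytes: includes the NUL */
    return TAR_USTAR;
  if (memcmp(magic, "ustar ", 6) == 0)   /* trailing space instead */
    return TAR_GNU;
  return TAR_OLD;                        /* padding, or something else */
}
```

Anything without a recognizable magic falls through to "old", since V7 headers
carry nothing to identify them.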

# pax

A "pax archive" is a valid ustar archive, except that two new types of entries
are allowed: 'x' and 'g'. These are used to store metadata, and in particular to
store such niceties as UTF-8 group/user names, atime/ctime/mtime with fractional
seconds (for some reason), UTF-8 path and link names, much larger sizes,
extended attributes, and so on. These are just normal "entries" in the tar file,
basically plain files with a different type flag, and in fact even the oldest
tar file can successfully extract them - it will treat the metadata as an
ordinary file, extract it, and the user can inspect it as they desire.

The pax format is basically the current standard for tar, and POSIX no longer
specifies the older tar format - pax is now the thing that tar(1) utilities
generate and read, although it is actually backwards-compatible all the way to
Unix V7.

Phew! That was a long tour through history. Thanks for reading!

Sources:
DECtape: the VT103 user guide, PDF with sha256sum
         a26ee5530c42c7f44e6efb9cd926d677f61a57ef6169835b6b6f85b7d0f843cc
tap(5): the Unix V2 manual, PDF with sha256sum
        d20042daf44fb9420d65ec89742eaf9d18a3aee7e4aafc0b632e138f5eb9f1f7
tp(1) & tp(5): the Unix V5 manual, PDF with sha256sum
               dd792cba23dc0229b417ef3148eb8db91516d7528ad12c36f29bb93435a2c6a1
tar(5): the FreeBSD tar(5) man page

[1]: Some tar implementations, notably gnu tar, also accept a more usual dashed
     format, like 'tar -cvf ...'.

[2]: Even as of V1, the rfo(4) man page contains an apology for this behavior,
     and it would disappear in later Unixes.

[3]: V5 Unix also had commands "dump" and "restor", which were specifically
     designed for incremental backups.

[4]: This has what the teens are calling "big GNU energy".

$#t Tarballs
$#s History and design of the tar(1) file format
$#o history, unix
$#u 73bee03c-6d42-4ef7-802f-ea81c10ff377