# Whence Autotools?

If you've ever compiled a program from source (and I hope you have), you've probably encountered the following build instructions:

```
./configure
make
make install
```

and if you were spectacularly unlucky, the program you were building might not have had its configure script in its source tree, in which case you would've instead needed to run something like[1]:

```
./autogen.sh
./configure
make
make install
```

where the autogen script generates the configure script, the configure script generates the makefile, the makefile actually generates the program binaries, and the 'make install' step moves the built binaries, man pages, etc. to where they're supposed to go.

Why does this process have so many steps - and why are there scripts generating other scripts? What's "autogen" autogenning, exactly? What are configure.ac and Makefile.in? To answer those questions we need to travel back through time... to the beginning.

# Machine Code & Assembly

In their original form, programs were simply sequences of bits that were laboriously fed into the machine by hand, either by setting switches, using toggles, or (if you were very lucky) reading them off a deck of punch cards. As persistent storage was developed, programs could be kept for later use - still in the form of machine code.

Of course, machine code was not particularly friendly for human readers or writers, so assembly languages were developed: still specific to a given system, but at least somewhat legible by a human. If you had a program, you could then assemble it like so:

```
as program.s
```

which would produce an "object file". Depending on the system, that object file might be directly runnable, or it might need a further step called "linking" (or sometimes, regrettably, "loading"[2]) which converted the object into an executable file:

```
ld program.o
```

If one then wanted to, one could install it somewhere useful using mv:

```
mv program /bin/program
```

At this time, there were many, many different CPU architectures, since CPUs were simple, slow, and it was relatively easy to build a new one that was just as usable as the current best generation. There was also very little software commonality - extremely often, one would produce a new hardware system, then write a new bespoke OS for it, with new bespoke software, and the users of that system would simply get used to whatever that software did.

Over time, these architectures evolved, as did the software, and a more important shift also happened: people started using computers to do actual work. Admittedly, this work was still very closely related to the computers themselves, and still being done by programmers, but people began to use computers for things other than building operating systems. These people, the first users (who were, again, themselves programmers, but programmers whose goal was not operating system development), were reasonably averse to constantly rewriting their programs every time the CS department invented a cool new architecture. To make matters worse, they wanted to share their programs among themselves... but there were dozens of different system types in use. What to do?

# C

There's one obvious resolution: the high-level language. Instead of writing programs in assembly language, or machine code, one would write them in a language that could be translated *into* the appropriate assembly for the system the program was being built on. There were many, many, many such "high-level languages", almost all of which are now lost to history.
Some (especially FORTRAN and COBOL) have hung on in their own niches, but there was one language that emerged absolutely dominant: C[3].

C, a descendant of B (itself a descendant of BCPL), was a small, relatively weak language, lacking many features that even languages of the time often had. However, it had a single cardinal virtue which outweighed everything else: because it was small, a working C compiler (or "translator" as they were sometimes known) for a new system could be written by one person in a matter of days - and doing so allowed most C programs to run on the new system with relatively little effort. Also, because the language was small, a C compiler could be made to run on even systems with few resources, allowing C to target virtually everything.

It soon became the case that, to build a program and run it, one did this[4]:

```
cc program.c    # this would generate program.s, an assembly translation
                # of program.c to the local system's assembly language
as program.s    # this would assemble that into an object file
ld program.o    # this would link program.o into an executable
mv program /bin/program
```

Since C did have a small standard library of functions, it was usually necessary to have implementations of those in the resulting program - and thus emerged 'libc', the C standard library.

Over time, people wanted to write larger programs that did more things, and it rapidly became untenable to have the entire program as a single C source file; C compilers of the day would usually need to read the entire source file at once and then translate it in memory, which could use a lot of memory, and single large files were also increasingly difficult for humans to navigate. Building programs began to look like:

```
cc prog0.c
cc prog1.c
cc prog2.c
as prog0.s
as prog1.s
as prog2.s
ld prog0.o prog1.o prog2.o   # producing the final executable
mv program /bin/program
```

This rapidly got tiresome and people began to write scripts encapsulating this logic. However, those scripts often wasted a large amount of computer time: if you were working on a single file of a large program, your build script would still re-translate and re-assemble every source file every time. At the time this was a major waste of resources, and so people began modifying their scripts to check the last-modified-time of each source file. If the source file was older than the object file, the reasoning went, then the source file didn't need recompilation.

This kind of works, but there is one major problem: header files. Header files are units of code that are "included" textually in other source files, which means that actually, we need to re-translate and re-assemble a source file if either the source file has changed *or* any of the headers it includes has changed... or if any of the headers *those* headers include have changed, etc etc[5]. This sucked a lot, but header files changed relatively infrequently, so people were mostly kinda okay with doing a full rebuild if they edited a header.
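For concreteness, here's a minimal sketch of that kind of build script - hypothetical file names, a modern shell convenience or two like `-nt` ("newer than"), and, as just described, a cheerful disregard for headers:

```
#!/bin/sh
# Rebuild each object only if its source is newer than it (or the object
# is missing). Header files are ignored entirely - which is exactly the
# problem described above.
for src in prog0.c prog1.c prog2.c; do
    obj="${src%.c}.o"
    if [ ! -f "$obj" ] || [ "$src" -nt "$obj" ]; then
        cc "$src"                      # translate .c -> .s
        as -o "$obj" "${src%.c}.s"     # assemble .s -> .o
    fi
done
ld -o program prog0.o prog1.o prog2.o  # always relink the final program
```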
# Make

However, there is a better way, and here emerges the Makefile. A Makefile specifies a dependency graph, like this:

```
prog: foo.o bar.o
foo.o: foo.c header.h
bar.o: bar.c header.h
```

This file says: if foo.o is older than either foo.c or header.h, rebuild it; if bar.o is older than either bar.c or header.h, rebuild it. Same for prog: if it's older than either foo.o or bar.o, relink it. Coupled with a bit of intelligence in the make program itself, which knows how to turn a .c file into a .o file and how to link .o files into an executable, we now end up with a very simple build process:

```
make
```

Perfect! Let's also put the install logic into the makefile, so that we can automatically rebuild before installing if we've made local changes[6]:

```
install: prog
mv prog /bin/prog
```

and now we can simply run:

```
make install
```

to build all the changed source files, relink the resulting program, and then put it into place.

# Editing the Makefile

Over time, C came to dominate the programming landscape for "general-purpose" use, but there was still very significant variation in operating systems, ranging from "the binary for install(1) is in /sbin instead of /bin" right through to "we speak EBCDIC here". Also, users often wanted to tweak programs, and especially their install process - e.g., a system that was particularly tight on disk space might not wish to install man pages, or an administrator might wish to put a program in /usr/local instead of /usr. A convention emerged of having a stanza of variables at the top of the Makefile, like so:

```
PREFIX=/usr      # where to install
DOCS=yes         # change to no to skip man pages
NETWORKING=no    # change to yes if you have a working network stack
CC=/bin/cc       # set to whatever K&R C compiler you like best
```

and so on. This was considered mostly okay at the time.

# configure

However... while some of these were true configuration options that were essentially arbitrary from the program's point of view (like PREFIX), a lot of them were essentially the Makefile needing to be told about the surrounding environment. Also, as programs grew larger and systems grew more capable, this set of options and environment settings got larger and more difficult to configure accurately. For example, as a system administrator trying to install a program, you might be well aware that you had a SysV IPC implementation... but do you know whether it has that one bug where shmat(2) can segfault the user process if passed a bad argument, and which this program has a workaround for? The top stanzas in makefiles quickly split into two separate parts:

```
# Config options
PREFIX=/usr
DOCS=yes
NETWORKING=no

# Your system
HAS_FAST_MALLOC=no
HAS_WORKING_SOCKETS=yes
NEEDS_SHMAT_WORKAROUND=no
```

The second section here is pretty thorny for you as the user of a program: often, these config options are pretty obscure, and setting them incorrectly can have non-obvious bad results. Many of these, too, can be deduced automatically, or by building a small test program. Here the configure script is born! What configure does is deduce what it can about the surrounding environment and generate a Makefile with the results of those deductions already included. Unlike the Makefile, the configure script only needs to be executed once (ish) in the source tree for a given program, so it can do more expensive checking, and it also has access to the awe-inspiring power of the shell. An example configure script of the era might have included things like:

```
for root in /usr/include /usr/local/include /include; do
    if [ -f "$root/stdio.h" ]; then
        echo "STDIO_PATH=$root/stdio.h"
    fi
done
```

or even:

```
cc -o has-socket-hang has-socket-hang.c
if ./has-socket-hang ; then
    echo "USE_SOCKET_WORKAROUND=yes"
fi
```

and so on.
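Stitched together, a complete configure script of that sort might have looked something like this sketch (Makefile.body and has-socket-hang.c are invented names for illustration; a real one would of course check far more):

```
#!/bin/sh
# Hypothetical configure script of the era: deduce facts about this system,
# then emit a Makefile with those facts baked in.

exec > Makefile                 # everything echoed below lands in Makefile

echo '# Generated by ./configure - do not edit'
echo 'PREFIX=/usr'

# Where does stdio.h live on this system?
for root in /usr/include /usr/local/include /include; do
    if [ -f "$root/stdio.h" ]; then
        echo "STDIO_PATH=$root/stdio.h"
    fi
done

# Does this system's socket implementation hang? Build and run a test program.
cc -o has-socket-hang has-socket-hang.c
if ./has-socket-hang >/dev/null 2>&1; then
    echo 'USE_SOCKET_WORKAROUND=yes'
else
    echo 'USE_SOCKET_WORKAROUND=no'
fi

# Finally, append the hand-written build rules.
cat Makefile.body
```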
As you can no doubt already tell, this gets repetitive fast, so a library of helper functions emerged, and configure scripts began to look like:

```
emit_header_path STDIO_PATH stdio.h
emit_needs_workaround USE_SOCKET_WORKAROUND has-socket-hang.c
```

with a large stanza of helper functions at the top. Sometimes, these would emit a separate makefile called 'config.mk', which would then be included into the main Makefile, and the build steps were therefore:

```
./configure
edit Makefile    # (to set more config opts)
make
make install
```

However, sometimes the legal values for the Makefile config opts actually depend on the state of the environment - for example, if the system has no X11 install, then setting FRONTEND=X11 simply won't work. It became more common for the configure script to include these choices and enforce that they were actually possible, which would look like:

```
./configure --prefix=/usr/local --with-x11 --other-stuff
# don't need to edit the Makefile any more
make
make install
```

This was, in fact, also more or less alright; the configure script would take a (hand-written) template file called "Makefile.in", emit a working Makefile with the correct values substituted in for variables, and then the Makefile would do its usual thing.

It is a sad fact, though, that most programmers would rather chew their own arm off than have to think about system-specific minutiae like this[7]:

```
echo "You don't have Berkeley networking in libc$lib_ext..." >&4
if test -f /usr/lib/libnet$lib_ext; then
    ( (nm $nm_opt /usr/lib/libnet$lib_ext | eval $nm_extract) || \
        ar t /usr/lib/libnet$lib_ext) 2>/dev/null >> libc.list
    if $contains socket libc.list >/dev/null 2>&1; then
        echo "...but the Wollongong group seems to have hacked it in." >&4
```

and so they began to simply copy configure scripts around from other programs, adding their own tests for whatever they cared about. This rapidly led to an explosion of configure script versions of increasing size; for a particular example, the lines above are from the Configure script that shipped with perl 5.004 in 1999, which is almost 11,000 lines of shell. Such a thing should inspire fear in virtually anyone, and to make matters worse, shell is a rather feeble language for actually architecting a significant program. These scripts also got incredibly slow over time, since they were testing for everything anyone had ever cared to detect - even things that were long since obsolete or irrelevant to the program under test.

# Enter Autoconf

A neat solution to this problem is to have a library of reusable "test this system for this behavior" things, and then have the specific program declare which of those things it cares about and test only for those (sketched below). One could write this as a sufficiently intelligent configure script, but there are two problems with that:

1. It would still be absolutely honking massive, and
2. Making a configure script run under all the various shells that were around was within the realm of only the most expert hackers.

A neater approach would be to actually generate the configure script as appropriate for the program, so that it only tests for what the program cares about in its runtime environment. Ideally, the configure script would be generated by the original program author and included with the program, so end users get a (relatively) simple script and do the steps as above. There are a lot of different ways to do this generation, some of which are still around (e.g. metaconfig(1)), but one of them is autoconf.
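For reference, that "library of reusable tests plus a per-program list of which ones to run" shape looks roughly like this in shell pseudocode - a sketch with invented helper and file names, not autoconf's actual macros, which we'll see next:

```
#!/bin/sh
# A shared library of checks at the top, then a short per-program section
# declaring only the checks this particular program cares about.

# --- reusable checks, shared between programs ----------------------------
check_header() {        # check_header NAME header.h  ->  HAS_NAME=yes/no
    if [ -f "/usr/include/$2" ] || [ -f "/usr/local/include/$2" ]; then
        echo "HAS_$1=yes"
    else
        echo "HAS_$1=no"
    fi
}

check_compiles() {      # check_compiles NAME probe.c  ->  NAME=yes/no
    if cc -o conftest "$2" >/dev/null 2>&1; then
        echo "$1=yes"
    else
        echo "$1=no"
    fi
    rm -f conftest
}

# --- this program's declarations ------------------------------------------
check_header   SOCKET_H     sys/socket.h
check_compiles HAS_ANSI_CC  probe-ansi.c
```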
Autoconf is a program written in m4(1), which is a topic for a different time[8]. It takes as input a file called 'configure.ac', which specifies what needs to go in the configure script. This ends up with stanzas that look kind of like this:

```
AC_PROG_CC_C99
if test x"$ac_cv_prog_cc_c99" = "xno"; then
    AC_ERROR([charybdis requires a C99 capable compiler])
fi
```

(taken from the source of the charybdis ircd). The autoconf program turns these into a configure script, which then generates a Makefile from Makefile.in.

In fact, it also became desirable to generate the Makefile template itself from a simpler, higher-level description - this is automake's job - leading to:

```
Makefile.am  -- automake -->  Makefile.in  -- configure -->  Makefile
configure.ac -- autoconf -->  configure
```

where the autoconf and automake steps run on the original author's machine, and their outputs (the Makefile.in and configure files) are distributed with the program. As a result, autoconf provides the "author-time configuration" (declaration of dependencies, what config options are available, etc) and configure provides the "install-time configuration" (testing for those dependencies, setting those config options, etc). Putting it all together, the entire build process looks like this:

```
# On the author's machine:
autoconf                      # generate configure from configure.ac
automake                      # generate Makefile.in from Makefile.am
tar -cvzf prog.tar.gz prog/   # produce a source tarball

# On the user's machine:
tar -xvzf prog.tar.gz         # unpack the source tarball into prog/
cd prog
./configure --prefix=/some/path --with-x11 --whatever-else
make
make install
```

and this is still, in fact, mostly okay - despite having several steps, each step more or less has a purpose.
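To make the division of labour concrete, here is a minimal sketch of what the author-side files might contain for a tiny program (hypothetical project name and contents; a real project would declare more checks of its own):

```
# configure.ac - author-time configuration
AC_INIT([prog], [1.0])
AM_INIT_AUTOMAKE([foreign])
AC_PROG_CC
AC_CHECK_HEADERS([sys/socket.h])
AC_CONFIG_FILES([Makefile])
AC_OUTPUT
```

```
# Makefile.am - what to build and install
bin_PROGRAMS = prog
prog_SOURCES = prog.c
```

Running the author-side steps above over these (in practice usually via autoreconf, which also invokes helpers like aclocal) produces the configure and Makefile.in that ship in the tarball.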
# What We Do Now

As a general thing, this way of configuring and building programs is now considered basically obsolete, even for plain C or C++ programs which might actually still use Make. Why? Three big things have shifted:

1. Virtually all Unix-like systems have converged on being either BSD or Linux, and both systems, compared to old OSes, are quite similar to each other and very high quality[9], which hugely reduces the number of cases that a configure script might need to deal with. For example, nobody needs to spend machine cycles checking for BSD sockets - they are definitely there and definitely work - just like nobody needs to waste any time figuring out exactly how broken your STREAMS implementation is, since you don't have one. Your shell is virtually guaranteed to be roughly a POSIX sh(1) and almost bug-free, and your C compiler is not going to need arcane workarounds applied to the source files to avoid upsetting it. Your OS vendor certainly has not done things like "hack in an entirely fake and non-functional version of signal(2) so they can claim POSIX compliance". Lucky you!

2. Since systems are a lot more homogeneous, the burden of adapting a program has shifted a lot from author to user. For example, if I were to go to you and say "your program doesn't work for me because shmat(2) on my system segfaults if you do this - can you add a workaround?", you would tell me to pound sand and feel entirely justified in doing so, because it would be my fault for using a broken system when high-quality ones are freely available. If I want to hack that workaround in locally, of course I can, but you would most likely reject such a patch out of hand if you received it.[10]

3. Nobody knows or cares about m4(1) any more; it is considered basically a dead language except as used in autotools, and enormously inferior to, e.g., Python.

As a result, sprawling configure scripts are more or less obsolescent, and with them autotools - these days they are generally the sign of an old program, or at least a program with old ancestors. Modern C/C++ programs generally go one of these three routes[11]:

1. They are a single source file that you just compile into a binary;
2. They have a Makefile with a stanza at the top you edit to set PREFIX and such, or use with `make install PREFIX=...`; or
3. They have a very small, handwritten configure script that checks for things that really do still vary system-by-system, like "do you have openssl?", usually using pkg-config instead of guessing paths by hand.

pkg-config deserves a bit of a mention: it is a program that tells you how to include headers for, or link against, libraries referenced by well-known names. It is present on ~every modern Unix system and removes a significant part of why configure scripts existed in the first place, since now I can simply do:

```
ld -o thing thing.o $(pkg-config --libs library)
```

and pkg-config will tell me what file(s) I need to link against to get that library.
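The compile side works the same way: pkg-config can also emit the include flags needed to find a library's headers. For example, using OpenSSL (assuming its development files, and hence its openssl.pc, are installed):

```
cc -c thing.c $(pkg-config --cflags openssl)        # header search paths etc.
cc -o thing thing.o $(pkg-config --libs openssl)    # libraries to link against
```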
Phew! This is a long post already and I still didn't get to talk about what libtool does or the various "smarter" build systems like scons. That will have to be for a different post :) Thanks for reading!

[1]: What would actually happen then is that you'd find out that you had the wrong *version* of autotools, so the m4 programs wouldn't work properly, and you would then rm -rf the source directory.

[2]: I don't know entirely why, but I *think* it is because this predated virtual memory, and as a result one of the steps of producing an executable was literally deciding where in memory it would go at runtime - effectively "loading" the object into memory at that location, even though it wasn't there until it was run.

[3]: The description here of C as "high-level" might strike you as funny, but compared to PDP-1 assembly language, it definitely is.

[4]: These days, the 'cc' program actually does the compile-assemble-link steps for you, so simply running 'cc -o prog prog.c' would suffice.

[5]: Helping Makefiles figure this out is why C compilers have puzzling options starting with -M, which don't compile anything but instead emit the dependency graph in Makefile format (!) - this is much better than having the programmer try to guess what the transitive dependency graph of headers looks like.

[6]: A real Makefile would have a hard tab on the second line, obviously, but it messes up the visual flow so I've elided it.

[7]: And those that *would* rather think about whether your system has the 4.2BSD socket API or the older, crappier 4.1BSD one are mostly occupied writing configure scripts and/or system software in the first place.

[8]: m4 is a fascinating language that I don't yet know, but I would like to learn it.

[9]: This is kind of an interesting philosophical point and arguably an indicator of the fundamental success of the free software movement: high-quality, user-modifiable, free-of-charge implementations are available for practically everything you can think of, to the extent that people generally do not have to put up with Bad Software any more.

[10]: Or, indeed, if I said "your configure script does not work in csh".

[11]: The fourth route involves things like cmake, imake, scons, gn, etc etc, which replace both configure and make. I don't care enough about these to mention them in line.