# Whence Autotools?

If you've ever compiled a program from source (and I hope you have), you've probably encountered the following build instructions:

```
./configure
make
make install
```

and if you were spectacularly unlucky, the program you were building might not have had its configure script in its source tree, in which case you would've instead needed to run something like[1]:

```
./autogen.sh
./configure
make
make install
```

where the autogen script generates the configure script, the configure script generates the makefile, the makefile actually generates the program binaries, and the 'make install' step moves the built binaries, man pages, etc. to where they're supposed to go.

Why does this process have so many steps - and why are there scripts generating other scripts? What's "autogen" autogenning, exactly? What are configure.ac and Makefile.in? To answer those questions we need to travel back through time... to the beginning.

# Machine Code & Assembly

In their original form, programs were simply sequences of bits that were laboriously fed into the machine by hand, either by setting switches, using toggles, or (if you were very lucky) reading them off a deck of punch cards. As persistent storage was developed, programs could be kept for later use - still in the form of machine code.

Of course, machine code was not particularly friendly for human readers or writers, so assembly languages were developed: still specific to a given system, but at least somewhat legible by a human. If you had a program, you could then assemble it like so:

```
as program.s
```

which would produce an "object file". Depending on the system, that object file might be directly runnable, or it might need a further step called "linking" (or sometimes, regrettably, "loading"[2]) which converted the object into an executable file:

```
ld program.o
```

If one then wanted to, one could install it somewhere useful using mv:

```
mv program /bin/program
```

At this time, there were many, many different CPU architectures, since CPUs were simple, slow, and it was relatively easy to build a new one that was just as usable as the current best generation. There was also very little software commonality - extremely often, one would produce a new hardware system, then write a new bespoke OS for it, with new bespoke software, and the users of that system would simply get used to whatever that software did.

Over time, these architectures evolved, as did the software, and a more important shift also happened: people started using computers to do actual work. Admittedly, this work was still very closely related to the computers themselves, and still being done by programmers, but people began to use computers for things other than building operating systems. These people, the first users (who were, again, themselves programmers, but programmers whose goal was not operating system development), were reasonably averse to constantly rewriting their programs every time the CS department invented a cool new architecture. To make matters worse, they wanted to share their programs among themselves... but there were dozens of different system types in use. What to do?

# C

There's one obvious resolution: the high-level language. Instead of writing programs in assembly language, or machine code, one would write them in a language that could be translated *into* the appropriate assembly for the system the program was being built on. There were many, many, many such "high-level languages", almost all of which are now lost to history.
Some (especially FORTRAN and COBOL) have hung on in their own niches, but there was one language that emerged absolutely dominant: C[3].

C, a descendant of B (itself a descendant of BCPL), was a small, relatively weak language, lacking many features that even languages of the time often had. However, it had a single cardinal virtue which outweighed everything else: because it was small, a working C compiler (or "translator" as they were sometimes known) for a new system could be written by one person in a matter of days - and doing so allowed most C programs to run on the new system with relatively little effort. Also, because the language was small, a C compiler could be made to run on even systems with few resources, allowing C to target virtually everything.

It soon became the case that, to build a program and run it, one did this[4]:

```
cc program.c    # this would generate program.s, an assembly translation
                # of program.c to the local system's assembly language
as program.s    # this would assemble that into an object file
ld program.o    # this would link program.o into an executable
mv program /bin/program
```

Since C did have a small standard library of functions, it was usually necessary to have implementations of those in the resulting program - and thus emerged 'libc', the C standard library.

Over time, people wanted to write larger programs that did more things, and it rapidly became untenable to have the entire program as a single C source file; C compilers of the day would usually need to read the entire source file at once and then translate it in memory, which could use a lot of memory, and single large files were also increasingly difficult for humans to navigate. Building programs began to look like:

```
cc prog0.c
cc prog1.c
cc prog2.c
as prog0.s
as prog1.s
as prog2.s
ld prog0.o prog1.o prog2.o   # producing the final executable
mv program /bin/program
```

This rapidly got tiresome and people began to write scripts encapsulating this logic. However, those scripts often wasted a large amount of computer time: if you were working on a single file of a large program, your build script would still re-translate and re-assemble every source file every time. At the time this was a major waste of resources, and so people began modifying their scripts to check the last-modified-time of each source file. If the source file was older than the object file, the reasoning went, then the source file didn't need recompilation.

This kind of works, but there is one major problem: header files. Header files are units of code that are "included" textually in other source files, which means that actually, we need to re-translate and re-assemble a source file if either the source file has changed *or* any of the headers it includes has changed... or if any of the headers *those* headers include have changed, etc etc[5]. This sucked a lot, but header files changed relatively infrequently, so people were mostly kinda okay with doing a full rebuild if they edited a header.
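For concreteness, here's a minimal sketch of that kind of build script - hypothetical file names, a modern shell convenience or two like `-nt` ("newer than"), and, as just described, a cheerful disregard for headers:

```
#!/bin/sh
# Rebuild each object only if its source is newer than it (or the object
# is missing). Header files are ignored entirely - which is exactly the
# problem described above.
for src in prog0.c prog1.c prog2.c; do
    obj="${src%.c}.o"
    if [ ! -f "$obj" ] || [ "$src" -nt "$obj" ]; then
        cc "$src"                      # translate .c -> .s
        as -o "$obj" "${src%.c}.s"     # assemble .s -> .o
    fi
done
ld -o program prog0.o prog1.o prog2.o  # always relink the final program
```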
# Make

However, there is a better way, and here emerges the Makefile. A Makefile specifies a dependency graph, like this:

```
prog: foo.o bar.o
foo.o: foo.c header.h
bar.o: bar.c header.h
```

This file says: if foo.o is older than either foo.c or header.h, rebuild it; if bar.o is older than either bar.c or header.h, rebuild it. Same for prog: if it's older than either foo.o or bar.o, relink it. Coupled with a bit of intelligence in the make program itself, which knows how to turn a .c file into a .o file and how to link .o files into an executable, we now end up with a very simple build process:

```
make
```

Perfect! Let's also put the install logic into the makefile, so that we can automatically rebuild before installing if we've made local changes[6]:

```
install: prog
mv prog /bin/prog
```

and now we can simply run:

```
make install
```

to build all the changed source files, relink the resulting program, and then put it into place.

# Editing the Makefile

Over time, C came to dominate the programming landscape for "general-purpose" use, but there was still very significant variation in operating systems, ranging from "the binary for install(1) is in /sbin instead of /bin" right through to "we speak EBCDIC here". Also, users often wanted to tweak programs, and especially their install process - e.g., a system that was particularly tight on disk space might not wish to install man pages, or an administrator might wish to put a program in /usr/local instead of /usr. A convention emerged of having a stanza of variables at the top of the Makefile, like so:

```
PREFIX=/usr      # where to install
DOCS=yes         # change to no to skip man pages
NETWORKING=no    # change to yes if you have a working network stack
CC=/bin/cc       # set to whatever K&R C compiler you like best
```

and so on. This was considered mostly okay at the time.

# configure

However... while some of these were true configuration options that were essentially arbitrary from the program's point of view (like PREFIX), a lot of them were essentially the Makefile needing to be told about the surrounding environment. Also, as programs grew larger and systems grew more capable, this set of options and environment settings got larger and more difficult to configure accurately. For example, as a system administrator trying to install a program, you might be well aware that you had a SysV IPC implementation... but do you know whether it has that one bug where shmat(2) can segfault the user process if passed a bad argument, and which this program has a workaround for? The top stanzas in makefiles quickly split into two separate parts:

```
# Config options
PREFIX=/usr
DOCS=yes
NETWORKING=no

# Your system
HAS_FAST_MALLOC=no
HAS_WORKING_SOCKETS=yes
NEEDS_SHMAT_WORKAROUND=no
```

The second section here is pretty thorny for you as the user of a program: often, these config options are pretty obscure, and setting them incorrectly can have non-obvious bad results. Many of these, too, can be deduced automatically, or by building a small test program. Here the configure script is born! What configure does is deduce what it can about the surrounding environment and generate a Makefile with the results of those deductions already included. Unlike the Makefile, the configure script only needs to be executed once (ish) in the source tree for a given program, so it can do more expensive checking, and it also has access to the awe-inspiring power of the shell. An example configure script of the era might have included things like:

```
for root in /usr/include /usr/local/include /include; do
    if [ -f "$root/stdio.h" ]; then
        echo "STDIO_PATH=$root/stdio.h"
    fi
done
```

or even:

```
cc -o has-socket-hang has-socket-hang.c
if ./has-socket-hang ; then
    echo "USE_SOCKET_WORKAROUND=yes"
fi
```

and so on.
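Stitched together, a complete configure script of that sort might have looked something like this sketch (Makefile.body and has-socket-hang.c are invented names for illustration; a real one would of course check far more):

```
#!/bin/sh
# Hypothetical configure script of the era: deduce facts about this system,
# then emit a Makefile with those facts baked in.

exec > Makefile                 # everything echoed below lands in Makefile

echo '# Generated by ./configure - do not edit'
echo 'PREFIX=/usr'

# Where does stdio.h live on this system?
for root in /usr/include /usr/local/include /include; do
    if [ -f "$root/stdio.h" ]; then
        echo "STDIO_PATH=$root/stdio.h"
    fi
done

# Does this system's socket implementation hang? Build and run a test program.
cc -o has-socket-hang has-socket-hang.c
if ./has-socket-hang >/dev/null 2>&1; then
    echo 'USE_SOCKET_WORKAROUND=yes'
else
    echo 'USE_SOCKET_WORKAROUND=no'
fi

# Finally, append the hand-written build rules.
cat Makefile.body
```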
As you can no doubt already tell, this gets repetitive fast, so a library of helper functions emerged, and configure scripts began to look like:

```
emit_header_path STDIO_PATH stdio.h
emit_needs_workaround USE_SOCKET_WORKAROUND has-socket-hang.c
```

with a large stanza of helper functions at the top. Sometimes, these would emit a separate makefile called 'config.mk', which would then be included into the main Makefile, and the build steps were therefore:

```
./configure
edit Makefile    # (to set more config opts)
make
make install
```

However, sometimes the legal values for the Makefile config opts actually depend on the state of the environment - for example, if the system has no X11 install, then setting FRONTEND=X11 simply won't work. It became more common for the configure script to include these choices and enforce that they were actually possible, which would look like:

```
./configure --prefix=/usr/local --with-x11 --other-stuff
# don't need to edit the Makefile any more
make
make install
```

This was, in fact, also more or less alright; the configure script would take a (hand-written) template file called "Makefile.in", emit a working Makefile with the correct values substituted in for variables, and then the Makefile would do its usual thing.

It is a sad fact, though, that most programmers would rather chew their own arm off than have to think about system-specific minutiae like this[7]:

```
echo "You don't have Berkeley networking in libc$lib_ext..." >&4
if test -f /usr/lib/libnet$lib_ext; then
    ( (nm $nm_opt /usr/lib/libnet$lib_ext | eval $nm_extract) || \
        ar t /usr/lib/libnet$lib_ext) 2>/dev/null >> libc.list
    if $contains socket libc.list >/dev/null 2>&1; then
        echo "...but the Wollongong group seems to have hacked it in." >&4
```

and so they began to simply copy configure scripts around from other programs, adding their own tests for whatever they cared about. This rapidly led to an explosion of configure script versions of increasing size; for a particular example, the lines above are from the Configure script that shipped with perl 5.004 in 1999, which is almost 11,000 lines of shell. Such a thing should inspire fear in virtually anyone, and to make matters worse, shell is a rather feeble language for actually architecting a significant program. These scripts also got incredibly slow over time, since they were testing for everything anyone had ever cared to detect - even things that were long since obsolete or irrelevant to the program under test.

# Enter Autoconf

A neat solution to this problem is to have a library of reusable "test this system for this behavior" things, and then have the specific program declare which of those things it cares about and test only for those (sketched below). One could write this as a sufficiently intelligent configure script, but there are two problems with that:

1. It would still be absolutely honking massive, and
2. Making a configure script run under all the various shells that were around was within the realm of only the most expert hackers.

A neater approach would be to actually generate the configure script as appropriate for the program, so that it only tests for what the program cares about in its runtime environment. Ideally, the configure script would be generated by the original program author and included with the program, so end users get a (relatively) simple script and do the steps as above. There are a lot of different ways to do this generation, some of which are still around (e.g. metaconfig(1)), but one of them is autoconf.
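For reference, that "library of reusable tests plus a per-program list of which ones to run" shape looks roughly like this in shell pseudocode - a sketch with invented helper and file names, not autoconf's actual macros, which we'll see next:

```
#!/bin/sh
# A shared library of checks at the top, then a short per-program section
# declaring only the checks this particular program cares about.

# --- reusable checks, shared between programs ----------------------------
check_header() {        # check_header NAME header.h  ->  HAS_NAME=yes/no
    if [ -f "/usr/include/$2" ] || [ -f "/usr/local/include/$2" ]; then
        echo "HAS_$1=yes"
    else
        echo "HAS_$1=no"
    fi
}

check_compiles() {      # check_compiles NAME probe.c  ->  NAME=yes/no
    if cc -o conftest "$2" >/dev/null 2>&1; then
        echo "$1=yes"
    else
        echo "$1=no"
    fi
    rm -f conftest
}

# --- this program's declarations ------------------------------------------
check_header   SOCKET_H     sys/socket.h
check_compiles HAS_ANSI_CC  probe-ansi.c
```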
Autoconf is a program written in m4(1), which is a topic for a different time[8]. It takes as input a file called 'configure.ac', which specifies what needs to go in the configure script. This ends up with stanzas that look kind of like this:

```
AC_PROG_CC_C99
if test x"$ac_cv_prog_cc_c99" = "xno"; then
    AC_ERROR([charybdis requires a C99 capable compiler])
fi
```

(taken from the source of the charybdis ircd). The autoconf program turns these into a configure script, which then generates a Makefile from Makefile.in.

In fact, it also became desirable to generate the Makefile template itself from a simpler, higher-level description - this is automake's job - leading to:

```
Makefile.am  -- automake -->  Makefile.in  -- configure -->  Makefile
configure.ac -- autoconf -->  configure
```

where the autoconf and automake steps run on the original author's machine, and their outputs (the Makefile.in and configure files) are distributed with the program. As a result, autoconf provides the "author-time configuration" (declaration of dependencies, what config options are available, etc) and configure provides the "install-time configuration" (testing for those dependencies, setting those config options, etc). Putting it all together, the entire build process looks like this:

```
# On the author's machine:
autoconf                      # generate configure from configure.ac
automake                      # generate Makefile.in from Makefile.am
tar -cvzf prog.tar.gz prog/   # produce a source tarball

# On the user's machine:
tar -xvzf prog.tar.gz         # unpack the source tarball into prog/
cd prog
./configure --prefix=/some/path --with-x11 --whatever-else
make
make install
```

and this is still, in fact, mostly okay - despite having several steps, each step more or less has a purpose.
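To make the division of labour concrete, here is a minimal sketch of what the author-side files might contain for a tiny program (hypothetical project name and contents; a real project would declare more checks of its own):

```
# configure.ac - author-time configuration
AC_INIT([prog], [1.0])
AM_INIT_AUTOMAKE([foreign])
AC_PROG_CC
AC_CHECK_HEADERS([sys/socket.h])
AC_CONFIG_FILES([Makefile])
AC_OUTPUT
```

```
# Makefile.am - what to build and install
bin_PROGRAMS = prog
prog_SOURCES = prog.c
```

Running the author-side steps above over these (in practice usually via autoreconf, which also invokes helpers like aclocal) produces the configure and Makefile.in that ship in the tarball.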
# What We Do Now

As a general thing, this way of configuring and building programs is now considered basically obsolete, even for plain C or C++ programs which might actually still use Make. Why? Three big things have shifted:

1. Virtually all Unix-like systems have converged on being either BSD or Linux, and both systems, compared to old OSes, are quite similar to each other and very high quality[9], which hugely reduces the number of cases that a configure script might need to deal with. For example, nobody needs to spend machine cycles checking for BSD sockets - they are definitely there and definitely work - just like nobody needs to waste any time figuring out exactly how broken your STREAMS implementation is, since you don't have one. Your shell is virtually guaranteed to be roughly a POSIX sh(1) and almost bug-free, and your C compiler is not going to need arcane workarounds applied to the source files to avoid upsetting it. Your OS vendor certainly has not done things like "hack in an entirely fake and non-functional version of signal(2) so they can claim POSIX compliance". Lucky you!

2. Since systems are a lot more homogeneous, the burden of adapting a program has shifted a lot from author to user. For example, if I were to go to you and say "your program doesn't work for me because shmat(2) on my system segfaults if you do this - can you add a workaround?", you would tell me to pound sand and feel entirely justified in doing so, because it would be my fault for using a broken system when high-quality ones are freely available. If I want to hack that workaround in locally, of course I can, but you would most likely reject such a patch out of hand if you received it.[10]

3. Nobody knows or cares about m4(1) any more; it is considered basically a dead language except as used in autotools, and enormously inferior to, e.g., Python.

As a result, sprawling configure scripts are more or less obsolescent, and with them autotools - these days they are generally the sign of an old program, or at least a program with old ancestors. Modern C/C++ programs generally go one of these three routes[11]:

1. They are a single source file that you just compile into a binary;
2. They have a Makefile with a stanza at the top you edit to set PREFIX and such, or use with `make install PREFIX=...`; or
3. They have a very small, handwritten configure script that checks for things that really do still vary system-by-system, like "do you have openssl?", usually using pkg-config instead of guessing paths by hand.

pkg-config deserves a bit of a mention: it is a program that tells you how to include headers for, or link against, libraries referenced by well-known names. It is present on ~every modern Unix system and removes a significant part of why configure scripts existed in the first place, since now I can simply do:

```
ld -o thing thing.o $(pkg-config --libs library)
```

and pkg-config will tell me what file(s) I need to link against to get that library.
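The compile side works the same way: pkg-config can also emit the include flags needed to find a library's headers. For example, using OpenSSL (assuming its development files, and hence its openssl.pc, are installed):

```
cc -c thing.c $(pkg-config --cflags openssl)        # header search paths etc.
cc -o thing thing.o $(pkg-config --libs openssl)    # libraries to link against
```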
Phew! This is a long post already and I still didn't get to talk about what libtool does or the various "smarter" build systems like scons. That will have to be for a different post :) Thanks for reading!

[1]: What would actually happen then is that you'd find out that you had the wrong *version* of autotools, so the m4 programs wouldn't work properly, and you would then rm -rf the source directory.

[2]: I don't know entirely why, but I *think* it is because this predated virtual memory, and as a result one of the steps of producing an executable was literally deciding where in memory it would go at runtime - effectively "loading" the object into memory at that location, even though it wasn't there until it was run.

[3]: The description here of C as "high-level" might strike you as funny, but compared to PDP-1 assembly language, it definitely is.

[4]: These days, the 'cc' program actually does the compile-assemble-link steps for you, so simply running 'cc -o prog prog.c' would suffice.

[5]: Helping Makefiles figure this out is why C compilers have puzzling options starting with -M, which don't compile anything but instead emit the dependency graph in Makefile format (!) - this is much better than having the programmer try to guess what the transitive dependency graph of headers looks like.

[6]: A real Makefile would have a hard tab on the second line, obviously, but it messes up the visual flow so I've elided it.

[7]: And those that *would* rather think about whether your system has the 4.2BSD socket API or the older, crappier 4.1BSD one are mostly occupied writing configure scripts and/or system software in the first place.

[8]: m4 is a fascinating language that I don't yet know, but I would like to learn it.

[9]: This is kind of an interesting philosophical point and arguably an indicator of the fundamental success of the free software movement: high-quality, user-modifiable, free-of-charge implementations are available for practically everything you can think of, to the extent that people generally do not have to put up with Bad Software any more.

[10]: Or, indeed, if I said "your configure script does not work in csh".

[11]: The fourth route involves things like cmake, imake, scons, gn, etc etc, which replace both configure and make. I don't care enough about these to mention them in line.