The m4(1) macro processor

This blog post is a valid m4 program. You can run it with:

  m4 post.txt

or, if you have GNU m4, with its extensions switched off:

  m4 -G post.txt

Also, a headnote: I really dislike m4. I tried to give it a fair shake in
writing this post, but I'm sure this is a pretty negative take on the subject.
Sorry, m4 fans, if you exist.

# Macro Languages & M4

Macro processing languages are mostly historical[1] at this point, subsumed by
either better programming languages or better configuration languages. The
original idea was that one would have a general-purpose "macro processor", which
would be fed a set of macros and a body of text, and would expand those macros
within that text. What distinguishes this from the more familiar approach of
writing a perl/sed/awk/... script to transform a given piece of text is that
macro processors generally take both the macro definitions and their
invocations *from the same place*.

For example, this file is a valid m4 program. If I write:

  define(X, 1)

and then say X, anyone running this file through m4 will see a 1 there. Of
course, I could quote the `X', and then m4 wouldn't expand it for you. If I
wanted to keep the definition itself from taking effect, I'd have to quote
the whole thing:

  `define(Y, 1)'

You can probably immediately see something unpleasant coming up: since the macro
invocations don't have any special form to them, it's really easy to
accidentally invoke a macro in line. For example, if I wrote:

  define(the, zhe)

my following text could get really confusing, with the macro matching random
words in my prose and replacing them with the defined right hand side.
Fortunately, nobody would ever do this, but if they did, they'd need to:

  undefine(`the')

to get it to stop. Note that I had to quote the `the' there - macro arguments
are themselves expanded before the macro ever sees them, so you can't even
reference a macro by name without invoking it unless you quote it.
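
A quick way to check that the cleanup took, sketched here with the standard
`ifdef' macro (which this post doesn't otherwise use), is to ask whether the
name is still bound. The name has to be quoted here too, for the same reason:

  ifdef(`the', `still bound', `gone')

If the `undefine' above ran, that prints the last argument.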

# Redefining Macros

Here's a maybe more realistic example:

  define(`pagenumz', 0)
  define(`pagenum', `define(`pagenumz',incr(pagenumz))Page pagenumz')

then I could write:

  pagenum
  pagenum
  pagenum

and you'd see:

  Page 1
  Page 2
  Page 3

How the heck does this work? Well, like this: every time you say

  pagenum

m4 replaces that with:

  define(`pagenumz',incr(pagenumz))Page pagenumz

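and then rescans that text from left to right. Arguments are expanded while
they are being collected, so if `pagenumz' is 0, the call to `define' is really
read as:

  `define(`pagenumz',incr(0))'

which sets `pagenumz' to 1 and expands to nothing. Only then does m4 reach the
rest of the line, which by now comes out as:

  Page 1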
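
Whether the second argument of `define' is quoted decides *when* `pagenumz'
gets looked up. A small sketch, with the names `pg_early' and `pg_late' made up
for the illustration:

  define(`pg_early', pagenumz)
  define(`pg_late', `pagenumz')
  pagenum
  pg_early pg_late

The unquoted definition froze whatever the counter was at that moment, so once
the counter is bumped `pg_early' is one behind, while `pg_late' looks the value
up afresh each time and agrees with the page number just printed.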

# Real Programming

Okay, neat party trick, but numbering pages isn't exactly heavy wizardry. How
about something a bit cooler? Imagine me rolling up my sleeves:

  define(cons,`[$1;$2]')
  define(car,`substr($1,1,decr(index($1,`;')))')
  define(cdr,`substr($1,incr(index($1,`;')),eval(len($1)-index($1,`;')-2))')
  define(map,`ifelse($2,`nil',`nil',`cons($1(car($2)),map($1,cdr($2)))')')

and now:

  define(example, `cons(1,cons(2,cons(3,nil)))')
  map(incr, example)

if you're running this file with m4, you should have seen:

  [2;[3;[4;nil]]]

because the `map' macro, when invoked, expands to:

  `ifelse($2,`nil',`nil',
     `cons($1(car($2)),
           map($1,cdr($2)))')'

i.e., the usual recursive formulation of `map'. Of course, the list encoding
we're using is bizarre, but that's a macro language for you.
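
The same encoding is enough for a fold as well. As a sketch in the same style,
here is a sum over the list, using a made-up name `lsum' and the `car' and
`cdr' macros from above:

  define(`lsum',`ifelse($1,`nil',0,`eval(car($1)+lsum(cdr($1)))')')
  lsum(example)

which adds up to 6 for the three-element list in `example'.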

# Includes & Diversions

m4, like many macro languages, has a facility for including other files, which
is usually used to build reusable libraries of macro definitions. Included
files generate output just like any other input, but they can avoid doing so by
means of a "diversion", which causes m4 to stash output in a temporary buffer.
For instance, if I do this:

  divert(1)
  Thanks for reading!
  divert(2)
  Goodbye, World!
  divert(0)

you won't see anything, but then I can do:

  undivert(2)

to get one of those buffers back. Anything I haven't undiverted by the end of
this post will get printed at the end anyway, so you can't throw output away
this way - if you want to do that, you need `divert(-1)': output sent to a
negative diversion is simply discarded.
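
A common use, sketched here with a made-up macro name, is to wrap a block of
definitions so that neither stray text nor the newlines between them reach the
output:

  divert(-1)
  define(`quiet_greeting', `hello')
  This line is thrown away.
  divert(0)
  quiet_greeting

Macros are still expanded and defined while output is being discarded, so the
only words that make it out are the ones produced by the last line.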

# What Sucks About It?

I mentioned in the headnote that I really dislike m4, despite trying to like it.
On the surface it has exactly the attributes of something I'd like: weird,
crufty old Unix tool with surprising power. However, it has a couple of really
debilitating problems:

1. It's virtually impossible to get predictable whitespace behavior. You might
   have noticed that in some of the macro definitions above I wrote a lot of
   smashed-together code, which you probably know isn't my usual style. I was
   basically forced to do this because m4 copies *whitespace* into the output
   as well, including in places you might not expect. For instance, if I write:

     define(Z,100)

   you might expect that to output nothing, but m4 actually interprets it as the
   macro definition `define(Z,100)' and then an *unrelated newline*, which it
   obediently outputs, so the act of defining a macro adds whitespace to the
   output. Dealing with this in general is basically impossible; m4 has a nasty
   hack called `dnl' (which discards input up to and including the next
   newline) that you can use like this:

      define(Z,200)dnl

   but that's pretty darn ugly, even by 1970s Unix tool standards. Also, as soon
   as you start including other files, you are essentially guaranteed to end up
   with a bunch of stray spaces or newlines everywhere.

2. Double-expansion, while very powerful, is very hard to reason about. m4
   expands arguments while collecting them, then rescans the result of every
   expansion, so in fact the innocuous `define(Z,200)dnl' I wrote above had the
   earlier definition of `Z' expanded on the *left* and was really read as:

      define(100,200)dnl

   which leaves `Z' untouched and gives no warning. While you *can* avoid
   making this mistake by diligently quoting the left hand side, it's... ugh.

3. Not separating the macro language from the text being processed makes it
   really, really, really easy to make mistakes which there is, again, no
   warning of. Remember above when I defined `the' and had a bunch of words in
   my text replaced? Yeah. Of course, it helps if you hold yourself to only
   defining macros with distinctive names, but the builtins are still lurking
   there, waiting to be invoked with no arguments at any point. Most modern
   approaches to this kind of thing require a distinctive syntax at every macro
   invocation, like Scribble's @foo{}.
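
For the second problem, the diligent version looks like this; it's a sketch of
the same redefinition with the name quoted:

  define(`Z', 200)dnl
  Z

With the quotes in place the redefinition actually lands on `Z', and the second
line comes out as 200.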

# What Doesn't Suck About It?

Well, it's surprisingly powerful, for one thing. It has only a few dozen
builtins and is based *purely* on text rewriting, and you can still throw down
and write some stuff in it. For some reason autoconf is written in it, as well
as the sendmail config system. I'm not sure if that's an endorsement or not,
but you definitely can use it to do things.

Also, the way quotes work is actually pretty nice. Since open and close quotes
have separate characters rather than both being a double-quote, nesting quotes
is really easy and doesn't require any escaping - it's more like balancing
parens in Lisp. I guess that's nice?
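
As a small sketch of that nesting, each round of expansion strips exactly one
level of quotes. The macro names here are made up, and `X' is the macro from
the top of the post:

  define(`quoted_once', `X')
  define(`quoted_twice', ``X'')
  quoted_once quoted_twice

The first comes out as 1, because its body is a bare name that gets expanded
when rescanned; the second comes out as a literal `X', because its body still
carries one level of quotes when it is rescanned.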

I can't seriously recommend you use it these days, though. There are better
templating systems, better text-processing systems, better config systems, and
better programming languages. At this point m4 would be totally lost to history
if not for autoconf and sendmail, and perhaps it should be.

[1]: And not all that historical, to be honest; the history of m4 is
     little documented compared to other early languages and it never had
     widespread adoption. It descended from the sparsely-documented m6 macro
     processor, which was present at least as of V6 Unix, but the man page for
     it there references the "M6 Reference Manual". I have a copy of Bell Labs
     CSTR #2 which I believe is that document, but haven't yet read (or written
     about) it.

$#t m4
$#s A historical overview of the m4(1) macro language
$#o history, unix
$#u b6c57535-a880-4faa-be8f-5236d8e611b3