The m4(1) macro processor

This blog post is a valid m4 program. You can run it with:

    m4 post.txt

or, if you have GNU m4, with:

    m4 -G post.txt

Also, a headnote: I really dislike m4. I tried to give it a fair shake in writing this post, but I'm sure this is a pretty negative take on the subject. Sorry, m4 fans, if you exist.

# Macro Languages & M4

Macro processing languages are mostly historical[1] at this point, subsumed by either better programming languages or better configuration languages. The original idea was that one would have a general-purpose "macro processor", which would be fed a set of macros and a body of text, and would expand those macros within that text. A distinguishing feature of this approach, as opposed to the more traditional approach of writing a perl/sed/awk/... script to do the desired transform on a given piece of text, is that macro processors generally take both the macro definitions and their invocations *from the same place*. For example, this file is a valid m4 program. If I write:

    define(X, 1)

and then say X, and you run this file through m4, you'll see a 1 there. Of course, I could quote the `X', and then m4 wouldn't expand it for you. If I wanted to avoid that definition applying in the first place, I'd have to quote the whole thing:

    `define(Y, 1)'

You can probably immediately see something unpleasant coming up: since the macro invocations don't have any special form to them, it's really easy to accidentally invoke a macro in line. For example, if I wrote:

    define(the, zhe)

my following text could get really confusing, with the macro matching random words in my prose and replacing them with the defined right hand side. Fortunately, nobody would ever do this, but if they did, they'd need to:

    undefine(`the')

to get it to stop. Note that I had to quote the `the' there - macro parameters are actually themselves expanded, so you can't even reference a macro by name without invoking it unless you quote it.
# Redefining Macros

Here's a maybe more realistic example:

    define(`pagenumz', 0)
    define(`pagenum', `define(`pagenumz',incr(pagenumz))Page pagenumz')

then I could write:

    pagenum
    pagenum
    pagenum

and you'd see:

    Page 1
    Page 2
    Page 3

How the heck does this work? Well, like this: every time you say pagenum, m4 replaces that with:

    define(`pagenumz',incr(pagenumz))Page pagenumz

and since that text is rescanned and macros are evaluated as soon as they're seen unquoted, *even inside the arguments of another macro call*, if `pagenumz' is 0, the `define' call becomes:

    define(`pagenumz',incr(0))Page pagenumz

which sets `pagenumz' to 1 before m4 gets to the trailing `pagenumz', so the output is Page 1.

# Real Programming

Okay, neat party trick, but numbering pages isn't exactly heavy wizardry. How about something a bit cooler? Imagine me rolling up my sleeves:

    define(cons,`[$1;$2]')
    define(car,`substr($1,1,decr(index($1,`;')))')
    define(cdr,`substr($1,incr(index($1,`;')),eval(len($1)-index($1,`;')-2))')
    define(map,`ifelse($2,`nil',`nil',`cons($1(car($2)),map($1,cdr($2)))')')

and now:

    define(example, `cons(1,cons(2,cons(3,nil)))')
    map(incr, example)

if you're running this file with m4, you should have seen:

    [2;[3;[4;nil]]]

because the `map' macro, when invoked, expands to:

    `ifelse($2,`nil',`nil',
        `cons($1(car($2)),
              map($1,cdr($2)))')'

i.e., the usual recursive formulation of `map'. Of course, the list encoding we're using is bizarre, but that's a macro language for you.

# Includes & Diversions

m4, like many macro languages, has a facility for including other files, which is usually used to have reusable libraries of macro definitions. These included files also generate output by default, but they can avoid doing so by means of a "diversion", which causes m4 to stash output in a temporary buffer. For example, if I do this:

    divert(1)
    Thanks for reading!
    divert(2)
    Goodbye, World!
    divert(0)

you won't see anything, but then I can do:

    undivert(2)

to get one of those buffers back.
Anything I haven't undiverted by the end of this post will get printed at the end anyway, so you can't throw output away this way - if you want to do that, you need to use `divert(-1)', which is an odd extension that kinda disables output.

# What Sucks About It?

I mentioned in the headnote that I really dislike m4, despite trying to like it. On the surface it has exactly the attributes of something I'd like: weird, crufty old Unix tool with surprising power. However, it has a few really debilitating problems:

1. It's virtually impossible to get predictable whitespace behavior. You might have noticed that in some of the macro definitions above I wrote a lot of smashed-together code, which you probably know isn't my usual style. I was basically forced to do this because m4 expands *whitespace* into the output as well, including in places you might not expect. For example, if I write:

    define(Z,100)

you might expect that to output nothing, but m4 actually interprets it as the macro definition `define(Z,100)' and then an *unrelated newline*, which it obediently outputs, so the act of defining a macro adds whitespace to the output. Dealing with this in general is basically impossible; most m4s have a nasty hack called `dnl' (which discards input up to and including the next newline) that you can use like this:

    define(Z,200)dnl

but that's pretty darn ugly, even by 1970s Unix tool standards. Also, as soon as you start including other files, you are essentially guaranteed to end up with a bunch of stray spaces or newlines everywhere.

2. Double-expansion, while very powerful, is very hard to reason about. m4 expands while reading macro definitions, then again while executing them, so in fact the innocuous `define(Z,200)dnl' I wrote above had the earlier definition of `Z' expanded on the *left* and was really read as:

    define(100,200)dnl

which silently does nothing. While you *can* avoid making this mistake by diligently quoting the left hand side, it's... ugh.

3. Not separating the macro language from the text being processed makes it really, really, really easy to make mistakes that, again, come with no warning. Remember above when I defined `the' and had a bunch of words in my text replaced? Yeah. Of course, if you hold yourself to only defining macros with distinctive names it helps, but the builtins are still lurking there, waiting to be invoked with no arguments at any point. Most modern approaches to this kind of thing require a distinctive invocation of the macro processor, like Scribble's @foo{} syntax.

# What Doesn't Suck About It?

Well, it's surprisingly powerful, for one thing. It has about a dozen primitives and is based *purely* on text rewriting and you can still throw down and write some stuff in it. For some reason autoconf is written in it, as well as the sendmail config system. I'm not sure if that's an endorsement or not, but you definitely can use it to do things.

Also, the way quotes work is actually pretty nice. Since open and close quotes are separate characters rather than both being a double-quote, nesting quotes is really easy and doesn't require any escaping - it's more like balancing parens in Lisp. I guess that's nice?

I can't seriously recommend you use it these days, though. There are better templating systems, better text-processing systems, better config systems, and better programming languages. At this point m4 would be totally lost to history if not for autoconf and sendmail, and perhaps it should be.

[1]: And honestly not all that historical; the history of m4 is little documented compared to other early languages and it never had widespread adoption. It descended from the sparsely-documented m6 macro processor, which was present at least as of V6 Unix; the man page for it there references the "M6 Reference Manual". I have a copy of Bell Labs CSTR #2 which I believe is that document, but haven't yet read (or written about) it.