XML is a significant advance in the state of practice of getting real work done on a computer, providing a clear, unambiguous methodology for separating content from structure, thus allowing the reuse of common structure-processing code.
A significant number of computer programmers have expressed interest in leveraging XML for a number of applications for which "mini-languages" in text files are currently used, such as configuration files. XML would provide a common syntax (simplifying the life of users), a preexisting file reader, and the opportunity for such applications to use more complex structures than they were willing to when they had to do the parsing "by hand".
However, an explicit design constraint on XML is to reject concerns about human-maintainability of XML files. This does not make XML unusable for the above applications, but it should give one pause. Certainly, the first advantage--giving users a common syntax--is a bit of a myth, since different files will still use different XML tags, and users perceive syntax on that level as well as raw character streams.
MacroXML places a macro processor between the raw character stream and an XML reader. This macro processor allows some simple syntactic sugar processing of the character stream, specifically for the purpose of replacing XML-style markup with some other syntactic form of markup (e.g. punctuation-symbol based). Using MacroXML, an application can define an XML stream and a set of MacroXML macros which together yield a character-based file format that doesn't "look" like XML to the user.
If all of the macros are explicitly defined in the file format, any XML program can still process the resultant character stream, after preprocessing. Thus an XML validity checker can be turned into a MacroXML validity checker by hooking up the preprocessor appropriately.
MacroXML's design attempts to satisfy the following constraints:
While MacroXML satisfies all four constraints, one must realize that the constraints trade off against one another. Making the language look like XML requires that the macro syntax resemble XML, which introduces issues that complicate the substitution rules. Making the macros sufficiently powerful to make human users happy may compromise the desire for a straightforward implementation.
A separate document will one day describe the motivations and rationale for the design of MacroXML.
The focus of MacroXML is on simple syntactic sugar through the use of string substitution. Any character string may be defined as a macro which substitutes some other character string.
There are no parameterized macros, although there are three different techniques which can be used to achieve similar results. This represents one of the fundamental compromises of MacroXML; defining a syntax for parsing parameters out of the character stream would be difficult, and introduce a new set of character-escaping problems (unless such macros were forced to use an XML-like syntax, but that goes against the design goals).
When a macro is bound, the "output" string for the macro is not immediately macro substituted. Macro substitution happens when the macro is used.
For ease of encoding literal text, a special kind of macro is available for which substitution of the output string never occurs.
Because MacroXML provides only simple one-to-one macro substitution, the gross structure of the resulting XML file must be represented in the gross structure of the MacroXML file. If the XML file needs a list of items in a certain order, they will almost always be defined in the exact same order in the MacroXML file, simply with more human-friendly markup.
By default, a MacroXML file is passed through unchanged. Two things can appear in the input stream which affect the processing: MacroXML macros and MacroXML control statements.
MacroXML macros can appear anywhere within the character stream, which is considered a raw, uninterpreted stream for the sake of macro substitution. When a macro is found, the trigger ("name") of the macro is replaced with the body of the macro. Macro processing then restarts from the beginning of the substitution.
For example, suppose the trigger "dog" is mapped to "god". Then xdogz will be converted to xgodz. xgodz will not change. xdogogz will initially be converted to xgodogz and then to xgogodz. xdodogz will be converted to xdogodz; because substitution proceeds left to right, the new dog is not changed.
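The four cases above can be checked with a small simulation. This is only a sketch of the substitution rule as described so far, not the MacroXML implementation: the function name expand and the max_steps guard are assumptions made for illustration, control statements are ignored, and the recursion guard described later in this document is not modeled. When several triggers match at one position, the sketch picks the longest, which is the rule this document states later.

```python
def expand(text, macros, max_steps=10_000):
    """Substitute left to right, rescanning from the start of each inserted body.

    `macros` maps trigger strings to body strings. `max_steps` is a
    hypothetical safety limit, not part of MacroXML.
    """
    i = 0
    steps = 0
    while i < len(text):
        # Pick the longest non-empty trigger that matches at position i.
        match = max(
            (t for t in macros if t and text.startswith(t, i)),
            key=len,
            default=None,
        )
        if match is None:
            i += 1
            continue
        text = text[:i] + macros[match] + text[i + len(match):]
        steps += 1
        if steps > max_steps:
            raise RuntimeError("expansion did not terminate")
        # i is left unchanged: scanning restarts at the start of the body.
    return text


macros = {"dog": "god"}
for sample in ("xdogz", "xgodz", "xdogogz", "xdodogz"):
    print(sample, "->", expand(sample, macros))
# xdogz -> xgodz
# xgodz -> xgodz
# xdogogz -> xgogodz
# xdodogz -> xdogodz
```

Because the scan never moves backwards, the "dog" that appears to the left of the substitution point in the last case is never revisited.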
Note that where the C programming language preprocessor only detects macros at identifier boundaries, the MacroXML macro processor does not distinguish any character as unique. One can even define the trigger "<" to output the body ">". (One would, most likely, regret it.)
The C programming language preprocessor introduces a special syntactic construction to distinguish its control commands from the C source text, laying claim to the symbol '#' and using newlines as delimiters.
MacroXML uses an XML-like syntax to introduce control statements. MacroXML control statements are of the form <control-statement control-arguments>. However, MacroXML control statements can appear anywhere in the input character stream, regardless of the apparent XML parsing state.
For example, the macro definition control statement <macro "dog"="god"> might appear in the middle of some "plaintext" XML commands, e.g.
<a href="<macro "dog"="god">index.html">
Note the lack of escaping of the internal quotes, because the preprocessor is unaware that the statement is "inside a string", and the output stream doesn't contain the statement at all.
Again, you shouldn't really want to do this. However, it makes the rules for conversion simple (and, effectively, independent of the rules of XML), and provides a hook for some sophisticated processing.
The interaction between the two systems can be most clearly stated by describing the canonical parsing algorithm:
After a normal macro substitution, the entire text, starting at the beginning of the newly placed body, is reparsed for macros and control statements. After a "literal" macro substitution, the body is only parsed for control statements, up to the end point of the substituted body.
During the reparsing of the expansion of a macro substitution, that particular macro substitution is disabled. (The trigger is not disabled, since the trigger may be bound to a different body without introducing recursion problems.)
MacroXML control statements are designed to be "pseudo-XML", with a syntax that explicitly deviates from traditional XML, yet "looks friendly" with it. In normal usage, a MacroXML file is not compatible with XML; it doesn't make sense to define MacroXML so that XML readers can parse it but ignore the MacroXML control statements, because in normal use the macros are used to define most of the markup.
The fundamental control statement in MacroXML is the macro statement, which defines one or more macro triggers and binds them to one or more macro bodies. Each macro control statement may be assigned a name by which it can be referred to with other control statements.
Here are the standard forms of the macro statement:
<macro "trigger"="body">
    Defines a single macro substitution.
<macro "trigger"="body" ["trigger"="body"]*>
    Defines multiple macro substitutions.
<macro sub=0 "trigger"="body">
    Defines a macro whose body does not undergo macro expansion.
<macro interp=0 "trigger"="body">
    Defines a macro whose body is not scanned for control statements.
<macro bind="name" "trigger"="body" ["trigger"="body"]*>
    Defines an internal name which can be used to refer to this macro binding by other control statements. At any time during parsing, only one macro binding is associated with a given name; the last one bound wins. See below for restrictions on the name.
<macro "trigger"="body1"="body2"="body3">
    Defines a single trigger with three bodies. When the trigger is encountered, body1 will be substituted. On the next triggering, body2 will be substituted. Following triggerings result in body3, body1, body2, etc.
The complete definition:
<macro [bind="name" [defer]] { sub=0 | sub=1 | interp=0 | interp=1 | "trigger"="body"[="body"]* }+ >
The options "sub=0", "sub=1", "interp=0", and "interp=1" can appear anywhere within a macro definition, and affect only the substitutions defined after that option; the most recent option wins. The default is "sub=1", "interp=1". Whitespace may appear only between options and trigger-body sets, within strings, and between the final trigger-body set and the closing greater-than sign.
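To make the grammar concrete, here is a rough sketch of a parser for macro definition statements. It is an illustration under stated assumptions rather than a reference implementation: the names TOKEN and parse_macro and the shape of the returned dictionary are invented here, strings are assumed not to contain a double-quote character, and XML escapes inside strings are not decoded.

```python
import re

# One token of a macro definition: bind="name", defer, sub=N, interp=N,
# or a trigger followed by one or more ="body" parts.
TOKEN = re.compile(
    r'\s*(?:'
    r'bind="(?P<bind>[^"]*)"'
    r'|(?P<defer>defer)'
    r'|(?P<opt>sub|interp)=(?P<val>[01])'
    r'|"(?P<trigger>[^"]*)"(?P<bodies>(?:="[^"]*")+)'
    r')'
)


def parse_macro(statement):
    """Parse a <macro ...> definition into a dictionary (hypothetical layout)."""
    if not (statement.startswith("<macro") and statement.endswith(">")):
        raise ValueError("not a macro statement")
    inner = statement[len("<macro"):-1]
    result = {"bind": None, "defer": False, "rules": []}
    flags = {"sub": 1, "interp": 1}  # defaults stated in the text
    pos = 0
    while inner[pos:].strip():
        m = TOKEN.match(inner, pos)
        if m is None:
            raise ValueError("cannot parse: " + inner[pos:])
        if m.group("bind") is not None:
            result["bind"] = m.group("bind")
        elif m.group("defer"):
            result["defer"] = True
        elif m.group("opt"):
            # An option affects only the trigger-body sets that follow it.
            flags[m.group("opt")] = int(m.group("val"))
        else:
            bodies = re.findall(r'="([^"]*)"', m.group("bodies"))
            result["rules"].append(
                {"trigger": m.group("trigger"), "bodies": bodies, **flags}
            )
        pos = m.end()
    return result


print(parse_macro('<macro bind="rules" defer sub=0 ":"="a" sub=1 "|"="b"="c">'))
```

Each rule records the option values in force when it was defined, which is one way to model "the most recent option wins".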
It is possible to define a macro whose body is a control statement, which will then be parsed and processed. Among other things, this allows for the possibility of a macro whose body defines other macros. To avoid double-quoting problems, the <macro bind=...> syntax is provided to create a "named" macro binding. This binding can then be explicitly referred to, rebinding it, using the syntax
<macro name >
A macro binding can be created without actually enabling the trigger/body substitutions by including the option "defer"; an undeferred macro definition binds the name to the macro definition and immediately instantiates the definition, whereas a deferred macro definition binds the name to the macro definition but does not instantiate it.
Name appears unquoted to avoid the need for double-quoting. MacroXML requires that names used for macro bindings not contain whitespace (even whitespace encoded in strings), nor the characters '<' or '>'. Thus, of the sequences below, only the last is legal; the first three are not:
illegal  <macro bind="<macro" defer "dog"="god"> cat dog <macro <macro> cat dog
illegal  <macro bind="foo bar" "dog"="god"> cat dog <macro foo bar> cat dog
illegal  <macro bind="&quot;foo bar&quot;" "dog"="god"> cat dog <macro "foo bar"> cat dog
legal    <macro bind="&quot;foobar&quot;" "dog"="god"> cat dog <macro "foobar"> cat dog
The usage of bound macro names allows the following sorts of constructs:
<macro bind="bar" "foo"="bar" "Foo"="Bar" "FOO"="BAR">
<macro bind="foo" "foo"="foo" "Foo"="Foo" "FOO"="FOO">
<macro "<bar>"="<macro bar>">
<macro "</bar>"="<macro foo>">
foo foo foobar foo!
<bar> foo foo foobar foo! </bar>
which converts to
foo foo foobar foo!
bar bar barbar bar!
Rebinding a macro binding is semantically identical to reexecuting the original macro binding definition. Specifically, each trigger is again associated with all of its bodies, and triggers with multiple bodies are returned to their initial state: the first expansion of such a trigger after rebinding will generate the first body.
The include control statement causes some external character stream to be inserted into the input text at the location of the control statement, and then processed normally. It resembles the '#include' statement found in the C-preprocessor.
<include "handle-to-text-to-include">
The rules for getting from a "text handle" to an actual character stream depend on the application and on the platform. Some possible handle formats are platform-specific filenames with application-implied include directories, URLs, or a set of application-specific "type" codes which imply particular macro sets or even DTDs.
The use of application-specific includes precludes the use of generic MacroXML tools (or XML tools plus a generic preprocessor), so includes should be used with caution when interoperability is desired.
One convenient use of include is to define a MacroXML language as a wrapper around an XML language. Humans who want to create files simply specify an appropriate include at the start of their file, which defines the wrapper language. Programs outputting files can output the XML directly, with no include statement. As long as that XML doesn't use the tags <macro> or <include>, it will be translated through unaltered.
The best case for MacroXML is in being given the opportunity to design the XML and the MacroXML language at the same time. The second best case is having existing XML, and designing a MacroXML language for it. A bad case is taking an existing non-XML language and attempting to use MacroXML to output XML from it. A worst case (and almost certainly impossible) is to take an existing non-XML language and use MacroXML to output to an existing XML language.
As a trivial example of the worst case, let's write a MacroXML mini-language which converts plaintext to HTML.
In addition to converting special characters like "<" to "&lt;", we will recognize the character "_" as an emphasis character, as in "I _hate_ you".
To do this, we'll use a macro which triggers on "_" and which alternately outputs "<b>" and "</b>":
<macro sub=0 "_"="<b>"="</b>">
This will cause the appropriate substitutions (each example shows the input, the generated HTML, and the rendered text):
I _hate_ you, I _really_ do.
-> I <b>hate</b> you, I <b>really</b> do.
-> I hate you, I really do.
In his book _I Hate God, _Why_ Don't You?_,
-> In his book <b>I Hate God, </b>Why<b> Don't You?</b>,
-> In his book I Hate God, Why Don't You?,
It also causes inappropriate substitutions if "_" is used for things other than emphasis.
number_of_items = size_of_all_items / size_of_one_item;
-> number<b>of</b>items = size<b>of</b>all<b>items / size</b>of<b>one</b>item;
-> numberofitems = sizeofallitems / sizeofoneitem;
Thus, this must be understood as creating a special text minilanguage with this property.
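The alternating behaviour of the "_" macro can be sketched in a few lines. This is only an illustration of a single multi-body trigger defined with sub=0 (so the inserted bodies are never rescanned); the function name expand_cyclic is an assumption made for this example.

```python
def expand_cyclic(text, trigger, bodies):
    """Replace each occurrence of `trigger` with the next body in the cycle."""
    out = []
    count = 0  # how many times the trigger has fired so far
    i = 0
    while i < len(text):
        if text.startswith(trigger, i):
            out.append(bodies[count % len(bodies)])
            count += 1
            i += len(trigger)
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(expand_cyclic("I _hate_ you, I _really_ do.", "_", ["<b>", "</b>"]))
# I <b>hate</b> you, I <b>really</b> do.
print(expand_cyclic("size_of_one_item", "_", ["<b>", "</b>"]))
# size<b>of</b>one<b>item
```

The second call shows the failure mode described above: an odd number of underscores leaves a tag unclosed, and nothing in the macro processor notices.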
We might also define a macro which recognizes two newlines in a row and replaces them with a paragraph tag "<p>", thus catching one standard approach to blocking things up. Alternately we could wrap the entire text in a "<pre>" and "</pre>" pair. However, MacroXML does not provide a mechanism for automatically adding tail segments, so we wouldn't be able to do this without explicitly altering the file. One workaround for this would be to use another file to include the file to be converted:
<macro sub=0 "_"="<b>"="</b>">
... other definitions ...
<html><head><title>File converted by MacroXML</title></head><body>
<include "mytextfile.txt">
</body></html>
Ok, setting that aside, let's proceed with the remaining definitions.
<macro sub=0 "_"="<b>"="</b>">
<macro sub=0 ">"="&amp;gt;">
<macro sub=0 "&"="&amp;amp;">
<macro sub=0 "<"="&amp;lt;">
Note that a clumsy form of double-quoting is necessary because we want to output the string "&lt;" but "&" is an escape character inside strings in macro definitions. (MacroXML uses the exact same string literal rules as XML for consistency, but this leads to this double quoting problem.) Note that this conversion happens when the string is parsed, and the mapping stored in memory is "<" to "&lt;".
The moment we finish the definition of "<", the input text can no longer create any HTML tags, since it can't output a "<"--any "<" in the character stream is immediately converted to "&lt;". Moreover, we can no longer execute any MacroXML control statements, since the "<" to "&lt;" substitution happens before macro control processing. This is intentional in this case, since it's desired that the file to be converted not accidentally invoke any HTML functionality.
One might like to create a MacroXML language for which it's no longer possible to invoke macro control statements, but for which it is still possible to create XML tags. The following macro definition does the trick:
<macro interp=0 "<"="<">
This command binds the character "<" to itself, but guarantees that no macro commands can be detected starting on that "<". Since all commands start with a "<", no command can be detected.
One could use "<macro"="<macro" to disable macros but not includes, or vice versa.
It is also possible to disallow XML tags being coded in the file, but still allow macro definitions. For example:
<macro sub=0 "<macro"="<macro">
<macro sub=0 "<"="&amp;lt;">
Because the longest match is found, the tag "<macro ..." will be passed through unaltered; it is set "sub=0" to prevent the other macro from being invoked on the substituted text. Since (presumably) no other macros match, it will then be scanned as a control statement, and the macro statement matched. If a given "<" is not the beginning of "<macro", then the first trigger isn't invoked, and the second one is.
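The longest-match rule for these two macros can be illustrated as follows. This sketch covers only the substitution step for sub=0 macros (bodies are emitted without rescanning); the later step that recognizes the passed-through "<macro ..." text as a control statement is not modeled, and the function name substitute_literal is an assumption.

```python
def substitute_literal(text, macros):
    """Apply sub=0 style macros, preferring the longest trigger at each position."""
    triggers = sorted(macros, key=len, reverse=True)  # longest first
    out = []
    i = 0
    while i < len(text):
        for trigger in triggers:
            if trigger and text.startswith(trigger, i):
                out.append(macros[trigger])  # body is not rescanned
                i += len(trigger)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# In-memory mappings, i.e. after string escapes have been decoded.
macros = {"<macro": "<macro", "<": "&lt;"}
print(substitute_literal('<b>bold</b> <macro "x"="y">', macros))
# &lt;b>bold&lt;/b> <macro "x"="y">
```

Only the "<" characters that do not begin "<macro" are escaped, so ordinary tags are neutralized while macro definitions survive for the control-statement scan.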
This would allow one to create a MacroXML minilanguage in which the user can define new macros, but does not explicitly create any XML. For example, one might use this with "Bold Text" to allow a user of Bold Text to create convenience macros, e.g.:
<macro "Microsoft"="_MicroSoft_">
But in general, I think Microsoft sucks.
Note that macro substitution is suspended during processing of a control statement, so the ">" is grabbed by "<macro>" before it ever has a chance to be converted to "&gt;" and screw things up.
Yacc is an existing minilanguage, which we are going to attempt to minimally convert to some sort of XML language. As noted before, existing minilanguages pose many challenges, and I'm not going to attempt to do a perfect job. Please don't criticize my XML style; the point is to show how you can get stuff done with MacroXML. Given that the main application for MacroXML is for programmers who are thinking of switching from an existing file format to an XML file format, they should be willing to use MacroXML even if it requires altering their existing file format. The focus here on converting existing file formats is to show that MacroXML is powerful and convenient for humans; I hope that the fact it can be used to emulate existing mini-languages is evidence of that.
Here is a sample Yacc file taken from the Dragon book:
%{
#include <ctype.h>
%}
%token DIGIT
%%
line   : expr '\n'         { printf("%d\n", $1); }
       ;
expr   : expr '+' term     { $$ = $1 + $3; }
       | term
       ;
term   : term '*' factor   { $$ = $1 * $3; }
       | factor
       ;
factor : '(' expr ')'      { $$ = $2; }
       ;
%%
yylex() {
    int c;
    c = getchar();
    if (isdigit(c)) {
        yylval = c - '0';
        return DIGIT;
    }
    return c;
}
The Yacc file format consists of three major sections, delimited by "%%".
<macro "%%"="%%end1 %%start2"="%%end2 %%start3"="%%end3">
<macro "%%start1"="<declarations>">
<macro "%%end1"="</declarations>">
%%start1
Note that the last line of the above causes text to be output to the output stream, namely "<declarations>". I've added support for a final '%%' to appear at the end of the file. Without a terminator, it's impossible to output the necessary end tags.
In the first section, the "%{" and "%}" delimit straight C code, which we will wrap inside the XML <code> and </code>.
<macro sub=0 "%{"="<code>">
<macro sub=0 "%}"="</code>">
The rest of the section uses several other indicators. Because they're not fully parenthesized, we have to leave the XML not fully parenthesized. Similarly, each line can define multiple tokens, with whitespace delimiting them. MacroXML provides no reasonable scheme for handling variably-delimited text (e.g. with varying amounts of whitespace). In general, mini-languages designed for MacroXML should not rely on whitespace. (This does not seem to be a big deal in practice, although it requires a human type in extra punctuation characters.) So we will just prefix token sets with corresponding tags.
<macro "%token"="<token/>">
<macro "%right"="<right/>">
<macro "%left"="<left/>">
<macro "%nonassoc"="<nonassoc/>">
The middle section consists of a set of rules giving a list of possible grammatical expansions for a non-terminal. Each non-terminal has one or more rules, and there is a long list of non-terminals.
When outputting lists in XML, we'd prefer to wrap the entire list, and then wrap each item with a start and end tag, rather than using the prefix tags used in the previous section. If the list is presented as a series of items, with distinguishable punctuation at the beginning and end and between each item, it's straightforward to do this, as follows: the starting punctuation introduces the list tag, and a start tag for the first item; between-item punctuation generates the end tag for the previous item, and a start tag for the next item. On the other hand, if a list just consists of items with terminators, it's impossible for a macro language to determine that there are no more items.
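The start/between/end scheme can be sketched as a plain character mapping. This is an illustration only: the function name wrap_list and its parameters are assumptions, the punctuation is assumed to be single characters, and the tag names are borrowed from the Yacc example in this section.

```python
def wrap_list(text, start=":", between="|", end=";",
              list_tag="translations", item_tag="rule"):
    """Turn punctuated list text into a wrapped list with wrapped items."""
    table = {
        # Opening punctuation: open the list and the first item.
        start: f"<{list_tag}><{item_tag}>",
        # Separator: close the previous item and open the next one.
        between: f"</{item_tag}><{item_tag}>",
        # Closing punctuation: close the last item and the list.
        end: f"</{item_tag}></{list_tag}>",
    }
    return "".join(table.get(ch, ch) for ch in text)


print(wrap_list("expr : expr '+' term | term ;"))
# expr <translations><rule> expr '+' term </rule><rule> term </rule></translations>
```

Note that the mapping needs all three kinds of punctuation. With only a terminator after each item there is no character that can be mapped to the closing list tag, which is the limitation described above.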
<macro bind="rules" defer sub=0 ":"="<translations><rule>" "|"="</rule><rule>" ";"="</rule></translations>">
Unfortunately, there's an exception to this rule; within C statements, which can be intermixed inside the rules, the symbols ": | ;" all have meanings and need to be passed through unaltered. To handle this correctly, MacroXML would need to properly parse C, especially handling nesting of "{}" and handling C strings. While it's possible (but rather insane) to write a C-string parser, MacroXML doesn't provide any facility for parsing nesting (since it assumes nesting is going to be passed through to the XML as true tag nesting).
So we'll fake it, ignoring C strings and nested "{}".
<macro bind="norules" defer sub=0 ":"=":" "|"="|" ";"=";">
<macro bind="startrules" defer "{"="{<macro norules>" "}"="}<macro rules>">
Next, we need to do a similar thing on the overall set of non-terminals. We can recognize the %% at either end as the beginning and ending markers. However, there isn't between-item punctuation, only the terminator ';'. So we'll use this to delimit items in the list, creating an empty item at the end. To do this, we alter the rule for ";" above:
<macro bind="rules" defer sub=0 ":"="<productions><rule>" "|"="</rule><rule>" ";"="</rule></productions></nonterminal><nonterminal>">
And we need to start and end the list, as well as enable the macros.
<macro sub=0 "%%start2"="<macro startrules><macro rules><translations><nonterminal>">
<macro sub=0 "%%end2"="</nonterminal></translations>">
The final section consists of C code, so we want to disable all substitutions. Most of them can't ever appear in C code (e.g. '%token' or '%{' etc.), so there's not much to disable, and we can just use <macro norules> for them:
<macro sub=0 "%%start3"="<support><code><macro norules>">
<macro sub=0 "%%end3"="</code></support></grammar>">
Finally, in every section, it's possible to write C code that uses the symbols "<" and "&", and it's possible to refer to nonterminals like '<' and '&'. Therefore, we force those characters into escaped literals to avoid them being processed as XML characters. (This is why all of the above macros use sub=0, to avoid having this substitution occur on their output.)
<macro sub=0 "<macro"="<macro">
<macro sub=0 ">"="&amp;gt;">
<macro sub=0 "&"="&amp;amp;">
<macro sub=0 "<"="&amp;lt;">
Thus, putting it all together into a complete definition converting most of Yacc to an XML format (with the exception of C code in the productions which contain the character '}', either due to nesting or within strings):
<macro "%%"="%%end1 %%start2"="%%end2 %%start3"="%%end3">
<macro sub=0 "%%start1"="<declarations>" "%%end1"="</declarations>">
<macro sub=0 "%%start2"="<macro startrules><macro rules><translations><nonterminal>">
<macro sub=0 "%%end2"="</nonterminal></translations>">
<macro sub=0 "%%start3"="<support><code><macro norules>" "%%end3"="</code></support></grammar>">
<macro sub=0 "%token"="<token/>">
<macro sub=0 "%right"="<right/>">
<macro sub=0 "%left"="<left/>">
<macro sub=0 "%nonassoc"="<nonassoc/>">
<macro bind="rules" defer sub=0 ":"="<productions><rule>" "|"="</rule><rule>" ";"="</rule></productions></nonterminal><nonterminal>">
<macro bind="norules" defer sub=0 ":"=":" "|"="|" ";"=";">
<macro bind="startrules" defer "{"="{<macro norules>" "}"="}<macro rules>">
<macro sub=0 "<macro"="<macro">
<macro sub=0 ">"="&amp;gt;" "&"="&amp;amp;" "<"="&amp;lt;">
<grammar>
%%start1
The output of this on the sample file, with whitespace reformatted for readability, is shown below. (Note that this was generated by hand, pending completion of the initial implementation of MacroXML.)
<grammar>
<declarations>
  <code>
    #include &lt;ctype.h&gt;
  </code>
  <token/> DIGIT
</declarations>
<translations>
  <nonterminal>
    line
    <productions>
      <rule> expr '\n' { printf("%d\n", $1); } </rule>
    </productions>
  </nonterminal>
  <nonterminal>
    expr
    <productions>
      <rule> expr '+' term { $$ = $1 + $3; } </rule>
      <rule> term </rule>
    </productions>
  </nonterminal>
  <nonterminal>
    term
    <productions>
      <rule> term '*' factor { $$ = $1 * $3; } </rule>
      <rule> factor </rule>
    </productions>
  </nonterminal>
  <nonterminal>
    factor
    <productions>
      <rule> '(' expr ')' { $$ = $2; } </rule>
    </productions>
  </nonterminal>
  <nonterminal>
  </nonterminal>
</translations>
<support>
  <code>
yylex() {
    int c;
    c = getchar();
    if (isdigit(c)) {
        yylval = c - '0';
        return DIGIT;
    }
    return c;
}
And with a terminating '%%' we'd get:
</code> </support> </grammar>
(I hope the above example shows clearly why I don't think XML is a panacea for human-maintained files; but if we assume that the original Yacc format is reasonable, then obviously MacroXML makes a vigorous attempt at providing the best of both worlds--at the expense of introducing complexity in the design of the macro translation itself.)
For simplicity, I left all the text as plain text in the output XML. It would be possible to move the name of the nonterminal inside the nonterminal tag by making the macros that currently output '<nonterminal>' instead output '<nonterminal name="' and making ':' change from outputting '<productions><rule>' to outputting '"><productions><rule>'; but this will embed whitespace, potentially including newlines, in the attribute value.
On the other hand, by introducing new punctuation to the yacc grammar, we can easily require wrapping the nonterminal name when it is defined, e.g. a syntax like ":expr:"; this can be encoded very directly.
So far, we've already seen two sorts of ways macro substitutions can vary with context: using multiple bodies for cyclic macro substitution, and using rebinding to change macro definitions (typically embedded in the middle of a macro substitution that is doing something else).
The former can be used to simulate a very simple, but common, sort of parameterized macro. For example, a very simple HTML anchor can be built using:
<macro sub=0 "%"="<a href="=">"="</a>">
Click %index.html%here% for something great!
An alternative way of creating parameterized macros is to rely on the "lateness" of macro substitution. Instead of using parameter substitution, one uses macro substitution. However, this doesn't allow the nesting of macros (using one macro as a parameter to another), unless they all use different "parameter names".
For a pointless example, take the C macro
#define min(x,y) ((x) < (y) ? (x) : (y))
which is invoked by saying
min(item, cur_min) -> ((item) < (cur_min) ? (item) : (cur_min))
An equivalent parameterless macro would be something like this:
<macro "min"="((minx) < (miny) ? (minx) : (miny))">
To use this macro, one defines the parameters, then invokes the macro:
<macro "minx"="item">
<macro "miny"="cur_min">
min
One can then make this more concise by defining new triggers that define the appropriate macros.
<macro bind="minbindx" ";"=";" "min<"="<macro minbindy><macro &quot;minx&quot;=">
<macro bind="minbindy" "<"="<macro minend><macro &quot;miny&quot;=">
<macro bind="minend" "<"="<" ";"="((minx) < (miny) ? (minx) : (miny))">
min<"item"><"cur_min">;
However, MacroXML deliberately offers no built-in facilities to simplify this, because such constructions are error prone: a mistaken construction can still be perfectly valid MacroXML, and (with the exception of improperly formatted MacroXML control statements) errors are only caught when the output XML turns out to be malformed. Thus, such constructions are probably best left as pure theory.
(The above syntax was chosen because, since macro substitution is disabled during macro definitions, each macro being defined must have the string "> immediately after the definition in the text. There is no way to get around this without allowing macro substitution to occur during processing of control statements, which is problematic. It seems best to make the string context explicit anyway; the above syntax helps imply that the substituted text is interpreted as XML quoted strings, even if it fails to resemble actual XML.)