
Is the Unix shell ready for XML?

by Laird A. Breyer

In this essay, I shall discuss (informally) some prospects for making the traditional Unix command line environment more XML aware. The ideas below also form the basis for the xml-coreutils project.

XML is a widely used tree like format for representing data in textual form. It is not my aim here to advocate a single approach to working with such formats, nor to dismiss greatly successful existing toolsets such as XML parsing libraries, scripting extensions for popular languages, etc. This essay originated as a simple question: why can't I process XML seamlessly on the command line like other text?

I believe that some insight into this problem can be had at a high level. The tools I shall discuss generally do not exist yet, and I am not going to propose exactly how to implement them. The traditional Unix core utilities are lightweight and fast, close to optimal in both memory and CPU usage for their respective tasks, and I expect a usable XML command line suite to offer the same guarantees over time. As you will see, there are enough high level issues to consider before implementing anything.

At this point, I should explain why there is an issue at all. After all, the Unix shell excels at text processing, and XML documents are just text.

Fundamentally, the core shell commands are line oriented. This means that they generally read input in the form of zero or more lines of text, and in turn output zero or more lines of text. The power of the shell is largely due to its ability to take the output of one program and give it as input to another program. In this way, complex results arise out of the specialized abilities of many programs.

Fundamentally also, XML is a tree structured format. While it could be represented as a single long line, and it can also be converted to a line oriented form (e.g. PYX format), this is not how it arises naturally in the wild. A program that reads XML must navigate this structure, and if it writes XML as output, it must make sure to format its output correctly. In XML, this is formally defined by the concepts of well formedness and validity. The existing shell utilities don't know how to navigate XML, and do not enforce output constraints. All this is left to the shell operator or programmer, who can easily forget to output a tag or navigate the XML incorrectly.

As a result, XML processing on the shell is brittle. What might start as properly XML formatted data is passed from program to program, any one of which can break the structure, ending up with a fancy line oriented result and pieces of strings which are tedious to recombine into XML at the end. A seamless shell experience should effortlessly preserve the XML structure from command to command, at least for those commands which are designed to work on XML.

A second problem is that XML is designed for modern international character sets, while the shell tools vary tremendously in their ability to process non ASCII text, and in fact to pass it along undamaged. Any XML shell commands should cope gracefully with these issues, making sure to not damage the data entrusted to them.

I also want to say a quick word about existing approaches. Many modern computer languages offer a more or less closely integrated XML component which turns an XML document into an object, and lets the programmer navigate and modify the object in clean ways, using systems such as DOM or SAX or XSLT etc. This will not change even if the shell becomes XML aware.

What I hope to find out by writing this essay is how far the shell can become a natural environment for interactive, quick and dirty (but well formed!) XML processing, very much like it is for text oriented tasks today.

You might say, what's needed is a new shell with completely new facilities specifically designed for XML and ordinary text. That's of course one way to do it, but the very success of the Unix shell (pick your favourite flavour) makes it unlikely that a new shell would widely replace it on existing systems in a short time. So we are naturally constrained to looking at ways that XML fits into the current paradigm, i.e. as standard Unix programs with a standard input (STDIN), a set of optional parameters (ARGV), and separate outputs for data and out of band error information (STDOUT and STDERR).

Another natural question is to ask what the scope of shell-based XML manipulations ought to be. This is generally a hard question to answer, as it depends on what people who work with XML find useful to do.

We could simply wait until a complete set of tools and operations emerges by natural selection and advocacy, as people gravitate towards what helps them and shun unworkable ideas. This has happened with libraries and APIs such as DOM and SAX, but hasn't generally happened yet at the level of the shell. For natural language processing, a well thought out collection of shell tools worth mentioning is LT XML.

We could alternatively take the XML format as a given, and work out theoretically all the common operations which allow everything to be done (after all, XML is a tree structure, one of the simplest and best studied in computer science). A step along this direction is the XmlStarlet project.

I will do neither of these here, and opt for a shortcut instead: the core Unix utilities have existed for a long time and proved both useful and versatile. Why not take these utilities as the basis for a set of XML utilities? This can be done by first working out just what each core tool does on ordinary text, and then see if and how it makes sense for an XML document.

This then, is my goal in this essay.

The core Unix shell utilities

To begin with, here is a list of core Unix commands. Obviously, I'm interested in commands which process text, so many commands which do something else are simply ignored. I've also listed a few directory and file manipulation commands because XML has a lot in common with Unix file system hierarchies.

No doubt, I've also missed a few or picked some which aren't really important. I made the list by searching the contents of packages called something like coreutils, and by looking through my copy of "Linux In a Nutshell" for interesting candidates; you could do the same. I probably won't get to discuss each command fully anyway.

The list of commands below will be referred to as "coreutils" in this essay. I will then pick a command such as "cat", and discuss a new program called "xml-cat" which tries to do for XML what "cat" does for text. The complete set of XML commands will be referred to as "xml-coreutils". The discussions will be mainly about interoperability.

awk		cat		 cp			csplit
cut		diff		 echo			find
fmt		grep		 iconv			join
ls		mkdir		 mv			paste
printf		rm		 sed			seq
sort		strings		 tr			uniq

The shell universe consists of lists of strings

It's important to realize that (generally speaking) the coreutils commands work within a universe consisting of lists of strings organized into lines, and stay within this universe. Some tools can also work with binary data, and shells can also redefine the exact meaning of lines through their separators, but I'll leave all that aside and concentrate on the basics.

A line is normally a string of text ending with the special '\n' character called the newline. It's not hard to abuse this terminology and in turn think of a string as a single line even if it doesn't end in '\n', and also as a list of lines which happens to have only one element. A list of lines is in turn a list of strings linked together by the '\n' character.

From this mental gymnastics, a simple idea arises: all coreutils commands read lists of strings, either on STDIN or as separate command parameters, and output lists of strings on STDOUT and STDERR, even if some of those lists are strictly speaking empty. Try picking your favourite command and seeing it in this light.

Moreover, lists (of strings) have nice properties: placing two lists (of strings) end on end gives another list (of strings), taking a fragment of some list (of strings) gives another list (of strings), even making a list of lists (of strings) gives simply a list (of strings) in a natural way, etc. In other words, lists form a closed universe for the shell commands to operate in.
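This closure property is easy to demonstrate with real commands. Two lists of lines placed end to end form another list of lines, immediately ready for further processing:

% printf 'a\nb\n' > one.txt
% printf 'c\n' > two.txt
% cat one.txt two.txt
a
b
c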

The XML universe consists of trees

A tree, in computer science terminology, is the next simplest data structure after the list. Whereas a list is one dimensional with a simple ordering, a tree allows more complex hierarchical orderings.

An XML document has a tree structure. Here is a simple document, which I am going to call simple.xml (the name is there so I can refer to it later):

<?xml version="1.0"?>
<root>
	<salutation>
		<greeting>hello</greeting>
	</salutation>
	<transport>
		<engine>
		car
		</engine>
		<engine>bus</engine>
		<muscle>bicycle</muscle>
	</transport>
</root>

While it can always be viewed as a list of individual lines, treating the lines of simple.xml independently doesn't capture the overall structural aspects, such as when and why opening tags such as <transport> must be associated with closing tags such as </transport>. That is, lines don't naturally represent full and complete chunks of information for processing.
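The point is easy to demonstrate with a real command. Searching simple.xml for the word engine with grep gives:

% grep engine simple.xml
		<engine>
		</engine>
		<engine>bus</engine>

Each output line is a perfectly good line, but the collection is no longer a well formed document: the text content car has silently disappeared, and what remains is a forest of fragments rather than a tree.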

Here's a line oriented picture of the simple.xml document, as a shell command might see it. With the visual layout removed, it is hard to see how to make sense of the information. For example, what is the relationship between line 11 (muscle...) and line 7 (engine)?

1.  ?xml version="1.0"?
2.  root
3.  salutation
4.  greeting hello /greeting
5.  /salutation
6.  transport
7.  engine
8.  car
9.  /engine
10. engine bus /engine
11. muscle bicycle /muscle
12. /transport
13. /root

Here's a tree oriented picture of the simple.xml document, again as a shell command might see it. The spaces and line breaks have been removed as before, but the structural information such as closing tags etc. has been summarized in the tree coordinates (the command is tree oriented, so understands tree coordinates). As a result, any tree relationships between nodes are visible. For example, 2.2.3 (muscle) is the sibling of 2.2.1 (engine).

1.              ?xml
2.              root
2.1.            salutation
2.1.1.          greeting
2.1.1.1.        hello
2.2.            transport
2.2.1.          engine
2.2.1.1.        car
2.2.2.          engine
2.2.2.1.        bus
2.2.3.          muscle
2.2.3.1.        bicycle

Thus I'm proposing here to mimic the list processing ability of coreutils, in a tree oriented way. But can this even make sense? Is the tree oriented world of xml-coreutils rich enough to contain the same number and variety of operations that make coreutils so useful? Let's find out.

As data structures, both the tree and the list consist of natural building blocks which are themselves trees or lists respectively. This means that combining lists appropriately gives a list, and combining trees appropriately gives a tree. We need never leave the universe of trees as long as the xml-coreutils follow certain rules.

But staying inside the universe of trees is one thing if you're already in it; there is also the question of how to get there from the universe of lists of strings. The latter is the universe that the Unix shell lives in today.

In coreutils, there is a similar problem, namely: how does one get to lists of strings from nothing? Typical examples of how this works in practice are the cat and echo commands, and I will shortly describe the corresponding xml-cat and xml-echo commands. As a rule, I must be able to communicate instructions for creating such lists, and this occurs on the input side. Moreover, the simplest instruction is to quote an example, i.e. to quote a string (for coreutils) or a tree (for xml-coreutils). Both cat and echo use this method.

However, it is important to remember that there are two sources of input for a shell program, namely STDIN and the command line options, which I'm referring to as ARGV. While STDIN can easily contain XML data, it might contain a consecutive list of XML trees, which strictly speaking is a forest. The command line ARGV, on the other hand, cannot easily contain XML data: in its natural form, it is designed to hold a list of (often small) separate strings.

So to proceed (to define xml-cat and xml-echo), I need a principle for converting a (conceptual) string into a (conceptual) tree, and for converting a (conceptual) list of trees into a (conceptual) tree. The converse, namely converting a textual tree into a string or a list of strings, is much easier, since an XML document is already a single string (containing one or more '\n') as well as a list of strings (none of which contain '\n'). The only complication is that there is no single preferred representation of the document in this way, precisely because line breaks can occur arbitrarily.

One more issue is the question of validity. An XML document is well formed if it looks like a tree, but to be valid it also has to have the correct tag names in all the correct places. So the universe of valid trees is both smaller than, and contained inside, the universe of (well formed) trees. Which of these two universes is more desirable for xml-coreutils?

Because validity is related to meaning, it is not a desirable requirement for xml-coreutils. To see this, let's look at coreutils.

In the line oriented shell universe, validity would mean that all lines must be coherent. For example, if each line was supposed to contain English words to be valid (say), then every coreutils command would have to verify that it didn't introduce a French word by mistake. Otherwise, coreutils would be breaking valid lists. In this case, it would be impossible to construct a French/English dictionary without destroying validity.

Now let's consider xml-coreutils. A valid tree must have certain tags in certain places, just like the previous English vocabulary requirement. This makes it hard to split or combine XML documents which may not have anything meaningful in common. For example, an SVG image document could not be mixed easily with an XML spreadsheet document. Clearly, this is undesirable.

While validity is an unacceptable burden on xml-coreutils, that doesn't mean that a single xml-coreutils command couldn't formally test the validity of an input document. In coreutils, the analogue would be a spell checker.

xml-cat

The cat command is perhaps the simplest one to generalize, as it simply copies the contents of one or more files specified on the command line, or the contents of STDIN, to STDOUT in order. For text files, this gives us a ready source for producing a list of strings suitable for shell processing. In other words, it is an entry point into the shell universe from nowhere (here nowhere stands for something completely outside of the line oriented shell universe, e.g. a file on disk).

Working backwards, I want xml-cat to produce a true XML document which it reads from one or more files, i.e. from nowhere as far as the tree oriented universe is concerned. Clearly, the files it reads should contain well formed XML to begin with, or else xml-cat will have to do all sorts of work to create the XML.

There is no guarantee that any one input file is well formed or properly valid unless it is completely read first. It's often clear if a file claims to be XML, because it starts with "<?xml". However, this string is part of the optional prolog, and isn't 100% reliable. What is true at a minimum is that an XML document must begin with the character '<'.

As I don't want xml-cat to pollute its output with non-XML file contents (since that would take us immediately out of the XML tree universe), it seems natural that xml-cat should refuse to copy all input files which don't start with '<'. Alternatively, it can scan the text and discard everything until it encounters the first '<'. If the next character after that is a valid XML character, then the program considers that the document has started.

This principle means that the operator doesn't need to worry about which files are really XML and which are not, or even if a file only contains a snippet of XML. If during copying xml-cat finds that its input is not well formed, it should stop with an error. This is not much different from being unable to continue due to a disk error, but is necessary to preserve the trust of subsequent processors which expect well formed XML.

It is debatable whether xml-cat, when unable to finish due to well formedness errors in input, should perhaps output the appropriate closing tags to ensure that whatever ends up on STDOUT is well formed XML. It can't erase partial output after all, and subsequent processes must be able to cope. While this is plausible behaviour, there are always going to be situations where xml-cat must end so quickly that it can't finish properly. Thus instances of incomplete XML will always exist in the world of xml-coreutils.

In the world of coreutils, incomplete text files are still text files, and incomplete lines are still lines, so the universe of lists of strings is robust to this particular predicament. In xml-coreutils however, perhaps a fundamental principle should be that incomplete XML input causes whichever program reads it to simply end with an error. This mimics a principle followed by XML parsers generally: when an error occurs, stop immediately.
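A hypothetical session illustrating this principle (the error message and exit status are invented for the sketch):

% printf '<root><unclosed>' | xml-cat
xml-cat: standard input: document ended before it was complete
% echo $?
1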

The major design issue for xml-cat is how to convert several XML documents into a single one. For text files, which are lists of strings after all, this is easy enough, since two lists of strings placed end to end form a single list of strings. But two trees placed end to end represent a forest; to turn this into a tree, there needs to be a common root.

In XML, this can be done by adding a header and a footer. The header consists of a line beginning with "<?xml" followed by an opening root tag such as "<root>"; the footer consists of the matching "</root>". Together, I shall call these the root wrapper.

But is this the best way? After all, I want xml-cat to create a single stream from several XML input files, probably to treat them as a single hierarchical source of data. Moreover, what happens if I xml-cat a single file? I'll end up with an extra root wrapper. And if I xml-cat twice, three or four times using the previous result as input, I'll end up with many extra root nodes, which makes it hard to know at what level the true document exists.

It's possible for xml-cat to copy the first file as-is, and simply remove the existing root wrappers around the second file, third file, etc. This has the nice side effect that xml-cat becomes idempotent, just like cat is in coreutils. While this idea is seductive, it has one small problem, namely that it can destroy the validity of an XML document. This may not matter to our shell utilities, but humans won't like it.

Suppose I xml-cat together two XML documents with different DTDs. If I preserve the root wrapper of the first document and simply splice in the second document, there's no guarantee that the XML tags in the second document are compatible with the DTD. The result is still well formed, but not valid. It follows that I can't keep the first file's root wrapper either, and must replace it with an artificial "<root>" wrapper, so that no DTD is imposed. Note that this change still preserves the idempotent property of xml-cat.

So we see that xml-cat must remove DTDs, but perhaps it should be able to combine documents either way, with the first approach selected by a switch? Here is a prototype usage signature for xml-cat:

xml-cat [OPTION] [FILE]...

	Concatenate FILE(s), or standard input, to standard output.
	If FILE is not in XML format, it is ignored. If a FILE is
	not well formed, xml-cat exits with an error. By default,
	the root wrappers of all FILE(s) are discarded and replaced
	with an artificial root wrapper; an OPTION selects keeping
	the first FILE's wrapper instead.

I will gloss over other technical issues, such as what to do when concatenating several documents which all use different character sets.
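To sketch the intended behaviour, suppose a.xml and b.xml are two small well formed documents (the file names and contents are invented for illustration):

% cat a.xml
<?xml version="1.0"?>
<letter><greeting>hello</greeting></letter>
% cat b.xml
<?xml version="1.0"?>
<memo><farewell>goodbye</farewell></memo>
% xml-cat a.xml b.xml
<?xml version="1.0"?>
<root>
	<greeting>hello</greeting>
	<farewell>goodbye</farewell>
</root>

Each input's root wrapper is discarded in favour of a fresh artificial wrapper, and running the result through xml-cat a second time reproduces it unchanged, which is the idempotence property discussed above.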

xml-echo

In coreutils, the echo command is, like cat, a convenient way of creating lists of strings which can be processed later. While cat opens files given on the command line, echo directly converts one or more strings, given in ARGV, into a list of strings on STDOUT.

Naturally, xml-echo should therefore take a string and produce an XML document. Of course, the existing coreutils echo command can, with some effort, already produce any XML document I care to produce, but this is tedious and error prone, quite the antithesis of what I am aiming for in this essay. Below is an example of what I have in mind (the % represents the shell prompt):

% xml-echo "hello"
<?xml version="1.0"?>
<root>
hello
</root>

Like its coreutils counterpart echo, the command xml-echo becomes truly useful when embedding control characters. In coreutils, I can write

% echo -e "Hello\nThis is a test\nThird line"
Hello
This is a test
Third line

By embedding the '\n' character, I can control the output and generate multiple lines easily. This deceptively simple feature allows the creation of potentially complex structures in the shell's line oriented universe from a single string.

The analogous task for xml-echo is obviously to create potentially complex tree universe structures from a single string. Unlike echo, here a single special character '\n' is insufficient to create general hierarchies.

In XML, individual nodes in a document can be referenced by XPath expressions, which are string expressions very similar to a traditional Unix file path. One way to achieve echo's desired behaviour for xml-echo is by embedding such expressions into the command parameters, much like '\n' is embedded. xml-echo only needs a subset of XPath to work well. An example should illustrate the idea. Here is how to recreate the simple.xml document:

% xml-echo -e "[/root/salutation/greeting]hello" \
            "[../../transport/engine]car[../engine]bus[../muscle]bicycle"
<?xml version="1.0"?>
<root>
	<salutation>
		<greeting>hello</greeting>
	</salutation>
	<transport>
		<engine>car</engine>
		<engine>bus</engine>
		<muscle>bicycle</muscle>
	</transport>
</root>

Firstly, I've surrounded the XPath expressions by square brackets []. Unlike the case of '\n', both the beginning and end must be marked, because otherwise it is hard to tell where the path stops and the echo data begins. Also, unlike '\n' which is often imagined at the end of a line, here the XPath expression within [] is at the beginning of the corresponding data.

Secondly, you'll note that while some paths are absolute (they start with a '/'), other paths are relative (they don't start with '/'). How does xml-echo know the correct tree structure which is being navigated? The answer is it doesn't, instead the tree is constructed from scratch at the same time that the string is being read, and the current path is updated accordingly. The initial path is the root node, and if a path refers to a nonexistent node, it gets created automatically.

Thirdly, by setting the paths apart and starting with an "empty" XML document consisting only of the root node, the behaviour of xml-echo stays compatible with the simpler "hello" example discussed at the beginning of this section.

xml-echo [OPTION]... [STRING]...

	Echo the STRING(s) to standard output in the form of
	an XML document. With option -e, interpret embedded XPath
	expressions as a structural blueprint.

There are other issues which should be thought through at this point, such as how to easily fill in the attributes of tags. I'll leave this to you, and instead talk about the problem of multiple invocations.

In the coreutils environment, it is common to invoke the echo command several times in succession as another way of obtaining multiple lines of output. In xml-coreutils, this is slightly unwieldy, because each command must produce a well formed XML document rather than a snippet. If several xml-echo commands each produce an XML output in succession, then I end up with a list of XML documents rather than a single XML formatted output.

Fortunately, this problem can be addressed easily. The simplest way is to note that xml-cat, which I looked at earlier, already converts multiple XML documents into a single XML document. I originally discussed this for multiple XML files on the command line, but the principle is the same for a list of separate XML documents on STDIN. This is a general way of solving the problem (of combining the outputs of several xml-echo commands) which doesn't exist in coreutils.
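A hypothetical session illustrating this (assuming xml-cat accepts a forest of documents on STDIN, as suggested above):

% { xml-echo "hello"; xml-echo "goodbye"; } | xml-cat
<?xml version="1.0"?>
<root>
hello
goodbye
</root>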

There is another obvious way to prevent a forest from being formed, namely to simply not output the full root wrapper if it is inconvenient. The xml-echo command could have a pair of switches, say -h and -t, which would prevent the header or the footer of the root wrapper (or both) from being printed to STDOUT. Then several invocations of xml-echo could combine their output into a single tree. This is a very bad idea, because it encourages broken XML to be produced. And even if people are careful to always print the header and footer correctly, it quickly becomes a maintenance nightmare in a script.

xml-iconv

The iconv command converts a file's character set encoding into another encoding while preserving the contents as far as possible.

In XML, documents can choose among several encodings for representing content, and the xml-iconv command is there to perform the conversions using knowledge of character sets and entities.

xml-iconv [OPTION] [FILE]

	  Convert the encoding of FILE or STDIN while preserving the content.

Interlude

I've described two commands so far, xml-cat and xml-echo, whose main attraction is to easily create XML documents for processing. In other words, these commands are standard ways of entering the tree universe of XML from within the Unix shell's list of strings universe or even outside of it.

These commands are clearly not the only way, and nothing stops us from creating a whole panoply of conversion commands which take existing files and printouts and turn them into well formed XML.

For example, you can take the date command, whose purpose in coreutils is to print a single line containing the current date and time, and create a corresponding xml-date. Another conversion command might take a legacy HTML file and convert it into well formed XML.

But before we go ahead and rewrite an XML version of every useful program in the world, it is worth noting that with xml-cat and particularly xml-echo, we have a good deal of the existing Unix shell universe at our fingertips. There's no need to write xml-date if a command invocation such as

% xml-echo `date`

will do the trick. Once XML is seamlessly integrated with the Unix shell, there can be several ways to get the same results.

xml-ls

So far, I've discussed simple ways of creating XML documents which are ready to be processed by Unix commands. Now I want to talk about actual processing.

One of the most useful coreutils commands is ls, whose job is to list file names. Strictly speaking, this isn't a text processing task at all. However, it is often useful to create a list of files to process further. For XML documents, an analogous task consists in obtaining a list of nodes.

I've already mentioned XPath expressions in the section on xml-echo. The XPath naming conventions are explicitly modelled on the Unix filesystem naming conventions, so perhaps it makes sense in turn to model certain XML operations on familiar Unix filesystem commands.

To make this idea concrete, let's examine ls more closely. Ignoring the various switches, ls takes one or more directories and file names on its command line, outputting the name of each such file and the names of all the files inside each directory. The various switches print out extra information about each file, which typically doesn't involve opening the file.

Similarly, xml-ls can be given a list of XML file names on the command line, and it will list the first level nodes below the root of each such XML document, collecting them into one single XML output. With switches, the output tree can also contain simple information about the listed nodes. In effect, xml-ls treats each XML document on the command line as if it was a directory given to ls, and if nothing is on the command line, then it looks at what's available on STDIN.

xml-ls can also output deeper nodes when given an XPath expression right after an XML file name on the command line. Since XPath and Unix both use the same separator '/', this might cause some ambiguity for certain file hierarchies. A ':' is typically not present in file paths, so this can serve to separate the two types of paths as is already done by some unrelated Unix commands.

xml-ls [OPTION]... [FILE][:XPATH]...

	List information about the nodes in each FILE to standard output,
	using XPATH to select the nodes to display. If no FILE(s) are 
	present, operate on the XML document in STDIN with the first 
	available XPATH.

Here is an example:

% xml-ls simple.xml:/root/salutation
<?xml version="1.0"?>
<root>
	greeting
</root>

xml-mv, xml-cp, xml-rm, xml-mknode

The coreutils commands mv, cp and rm respectively move (rename), copy and delete files whose paths are listed on the command line. Both mv and cp require at least two arguments, because the last is used as a destination.

By analogy, xml-mv moves a subtree from one XML document to another, xml-cp copies a subtree and xml-rm removes subtrees.

The command line invocations mimic the model of xml-ls, namely:

xml-mv [OPTION]... SOURCE[:XPATH]... DEST[:XPATH]
xml-cp [OPTION]... SOURCE[:XPATH]... DEST[:XPATH]

       Move or copy the nodes specified by XPATH in the XML document
       SOURCE into the XML document DEST at the node XPATH. If SOURCE
       is missing, STDIN is used. If DEST is missing, STDOUT is written. 

xml-rm [OPTION]... [FILE][:XPATH]...

       Remove the nodes of FILE(s) specified by XPATH. If XPATH
       is absent, FILE is emptied (i.e. only the root node is left),
       but not deleted.

Besides being used on XML documents stored in the filesystem, it may also make sense to apply these xml-coreutils commands on an XML document given on STDIN.

Conceptually, one needs only to replace SOURCE with STDIN and DEST with STDOUT in the case of xml-mv and xml-cp, although in practice this may force a full copy of the STDIN document to be kept for referral.

xml-rm too can take its input from STDIN and print a pruned document on STDOUT.

Note that for performance reasons, it is undesirable to ever save a full copy of STDIN just so we can move a subtree. Therefore, the xml-mv, xml-cp and xml-rm commands should really be implemented in a stream friendly way, i.e. not assume that an XML file is randomly seekable.

The xml-mknode command is modeled after mkdir, and creates one or more empty nodes based on an XPath expression.

xml-mknode [OPTION] [FILE]:XPATH...

	   Create a node in FILE corresponding to XPATH. If FILE
	   is absent, copy STDIN to STDOUT while adding the nodes
	   specified by XPATH.

Here is an example:

% xml-cat simple.xml | xml-rm :/root/transport
<?xml version="1.0"?>
<root>
	<salutation>
		<greeting>hello</greeting>
	</salutation>
</root>
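And here is a corresponding sketch for xml-mknode (the rocket node is invented for illustration):

% xml-cat simple.xml | xml-rm :/root/transport | xml-mknode :/root/transport/rocket
<?xml version="1.0"?>
<root>
	<salutation>
		<greeting>hello</greeting>
	</salutation>
	<transport>
		<rocket/>
	</transport>
</root>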

Interlude

We have now seen several commands which accept an XPATH and try to mimic basic directory and file operations. It isn't hard to carry this analogy further and try to simulate the shell concept of working directory.

For example, an xml-coreutils command called xml-cd could set an environment variable called PWXD (for the current working XPATH), which is then used as a default prefix for relative XPATH expressions if and when it makes sense. Another environment variable might contain a list of default XPATH expressions in a similar way to the classic shell variable PATH.

Unfortunately, this simple idea fails in the Unix shell, because processes (such as xml-cd) cannot modify their parent's environment. This therefore requires either tight cooperation with the shell (which shell?) or shared memory programming.

Moreover, such extensions don't actually do anything in the tree oriented XML universe, rather they provide a support role traditionally offered by a shell. This means that they are not really a central part of xml-coreutils, and I won't develop this aspect further in this essay.

xml-seq

This command builds a list of XPath node addresses. This may be particularly useful in for loops, or together with xml-echo, since the latter can use XPath node addresses as control elements. The original coreutils seq command builds a list of integers.
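By analogy with seq, one can imagine an invocation such as the following (the syntax is invented here); the resulting addresses could then be embedded in an xml-echo string, or iterated over in a shell loop:

% xml-seq /list/item 1 3
/list/item[1]
/list/item[2]
/list/item[3]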

xml-find

In coreutils, find searches files in a directory hierarchy and either prints them in a list, or performs other specified actions on each file.

Similarly, xml-find will operate on one or more input XML documents, whether specified as file names on the command line or given on STDIN, looking for nodes and printing them on STDOUT or performing other standard actions.

However, unlike the commands discussed so far, xml-find does not write an XML document on STDOUT, but instead writes a list of XPath expressions. This may seem strange at first: haven't I argued for always staying inside the tree universe of XML? Why should the output of xml-find belong to the list of strings oriented universe of coreutils?

Interlude

I've discussed now several commands which can use an XPATH expression on the command line, and more such programs will be presented below.

While such expressions can often be written a priori, there are cases where it makes more sense to extract a number of suitable XPaths from an existing XML document.

This is a difficult choice. Since xml-coreutils programs do take their input from both STDIN and the command line, it's sometimes necessary to use both the tree format of XML documents and the string format of ARGV to get things done.

Consider the alternative: there are now several xml-coreutils commands which output XML, and there is the original Unix shell, including coreutils, which output lists of strings and even single strings.

This ought to be enough to do serious work, i.e. we can use the shell and coreutils as before, and when dealing with XML data we use the xml-coreutils. Moreover, we can even wrap the line oriented output of many coreutils commands into XML.

The problem that won't go away, however, is that the shell still isn't aware of the XML universe. The interaction is only one way, i.e. we can use the coreutils as building blocks in the tree oriented XML universe, but not the other way around.

So there is a blind spot for generating strings suitable as input to coreutils commands, and more importantly, as command line options. Command line options are naturally string and list oriented, whether I am discussing coreutils or xml-coreutils commands.

So if I want to maximize the power of reusable components, I am forced to have some xml-coreutils commands, such as xml-find, which can understand XML related information and convert it into simple string form. Whether this is called xml-find or xml-somethingelse is unimportant. I like xml-find.

Here is a sample output for xml-find:

% xml-find simple.xml
/root
/root/salutation
/root/salutation/greeting
/root/transport
/root/transport/engine
/root/transport/engine
/root/transport/muscle

It's easy to take this list and use it in ordinary shell expressions. You'll note that an analogue of xml-find for tree universe type output was already discussed: it's xml-ls. See also the xml-strings command defined later.
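For example, using the output shown above (this is ordinary Bourne shell, no new machinery required):

% for node in `xml-find simple.xml`; do echo "visiting $node"; done
visiting /root
visiting /root/salutation
visiting /root/salutation/greeting
visiting /root/transport
visiting /root/transport/engine
visiting /root/transport/engine
visiting /root/transport/muscle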

xml-find [FILE][:XPATH]... [EXPRESSION]

	 Search the nodes of each FILE and each XPATH, or STDIN,
	 and evaluate EXPRESSION on each. If no EXPRESSION is given,
	 the default action of printing the XPath of the node is
	 performed.

There is another reason why xml-find ought to produce line oriented output. In coreutils, find can execute one or more shell commands on every file that it extracts, using the -exec switch. We might want to allow the same functionality in xml-find, i.e. executing shell commands on full XML subtrees. If xml-find had to output well formed XML, then the shell commands which can be executed would have to work together to output well formed XML, and in fact couldn't run in parallel. But this way, the -exec functionality of xml-find is not restricted to XML aware commands, and any type of string output is acceptable.

xml-cut

Like its coreutils counterpart cut, the xml-cut command prints only certain parts of each node. For example, xml-cut can be used to print the input document with all tags containing only certain attributes, or textual data truncated to a certain length, etc.
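For instance, here is a hypothetical invocation truncating all textual data to three characters (the -c syntax is borrowed from cut and invented here):

% xml-cat simple.xml | xml-cut -c 1-3
<?xml version="1.0"?>
<root>
	<salutation>
		<greeting>hel</greeting>
	</salutation>
	<transport>
		<engine>car</engine>
		<engine>bus</engine>
		<muscle>bic</muscle>
	</transport>
</root>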

xml-join

The xml-join command takes two XML files and interleaves compatible subtrees according to the specified level. This is a tree oriented version of the coreutils join command.

xml-csplit

The xml-csplit command is a kind of converse to the xml-cat command. Its purpose is to split a large XML file into a series of smaller ones with identical headers. The splitting is determined by command line options.

xml-paste

This command merges two XML trees, leaf by leaf.

xml-uniq

This command merges sibling nodes that have the same name.

xml-sort

This command sorts sibling nodes. The output is an XML document which resembles the original with rearranged nodes.

Interlude

I have now defined most of the initial commands in xml-coreutils. Some of the commands allow us to create an XML stream or file from "nothing", while others simply operate on existing streams or files.

What I haven't discussed fully is how to turn an XML stream back into a textual list of strings. Obviously, one can always use the coreutils commands on any of our XML streams, since they are just fancy text files. But this is tedious, because the XML markup ends up being more noise than signal, and a better class of tools should save us time and frustration.

Tools like the xml-find command presented earlier already output line oriented strings, and are designed specifically to feed back information into the conventional shell universe with minimal effort.

The commands defined below are other natural ways of stepping out of the universe of trees into the universe of lists of strings. While xml-find had an emphasis on being compatible with command line semantics, below I want to make it easy to output human readable text.

xml-strings

The xml-strings command simply extracts all the text data in an XML document and prints it with minimal formatting. In other words, it removes the XML markup and forms simple paragraphs.

xml-strings [FILE]...

	    Extract and flatten text to standard output.
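Applied to simple.xml, the effect might look like this (a sketch; the exact whitespace handling is an open design detail):

% xml-strings simple.xml
hello car bus bicycle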

xml-printf

In coreutils, printf operates like the C function of the same name, printing to STDOUT a format string containing placeholders which are filled in by evaluating the remaining parameters.

As every modern shell has variable interpolation facilities, and whitespace in XML is generally not structurally significant, a command which simply mimics printf but outputs XML offers little value over the existing xml-echo command. Therefore, in xml-coreutils, the behaviour of xml-printf generalizes in a different direction.

In the C language, the printf function is often used to print the values of program variables and data structures. Since an XML document can be naturally viewed as a complex data structure, it makes sense to define xml-printf as a simple way to print the values of XML nodes.

% xml-printf "%s you!" simple.xml:salutation/greeting
hello you!

Moreover, since this is new behaviour for the Unix shell, and since xml-echo can already construct XML trees, the greatest value is obtained when xml-printf produces free form text rather than XML output.

xml-printf FORMAT [[SOURCE]:XPATH]...

	   Print to STDOUT a formatted string FORMAT where
	   the placeholders are filled by the values of the
	   XPATH expressions, relative to the document in STDIN or
           SOURCE as appropriate.

xml-fmt

The coreutils fmt command rearranges the paragraphs of text in its input to make them easier to read for humans. Similarly, xml-fmt is a pretty printer, whose purpose is to make its input XML documents clean looking without changing their interpretation. This isn't necessary for other xml-coreutils commands, but the formatting can be useful for other shell utilities.

xml-fmt [OPTION]... [FILE]...

	Reformat each XML FILE visually and print the result
	to STDOUT. If no FILE(s) are given, read STDIN.
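For example, a document written on a single line might be reindented like this (a sketch of the intended output):

% echo '<a><b>hello</b></a>' | xml-fmt
<a>
	<b>hello</b>
</a>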

xml-awk

In coreutils, awk is a lightweight scanner which operates on the lines of an input text, one at a time. Each line can be processed through instructions written in the awk programming language. awk's output is again a sequence of lines.

In xml-awk, the awk programming language is largely unchanged, but instead of acting on lines, xml-awk acts on the data of each node. Like awk, xml-awk supports scripting blocks which are executed only when a regular expression matches the data, but xml-awk has extra awareness of the current XPath node, and is also able to access the node's attributes. This allows more sophisticated conditional blocks suited for tree structures.

Unlike awk, xml-awk outputs an XML document. This is to foster reusing the output by other XML tools. It is easily possible to output plain text by simply passing the resulting XML document to xml-strings.

% xml-cat simple.xml | xml-awk '/salutation:{print}'
<?xml version="1.0"?>
<root>
	<greeting>hello</greeting>
</root>

In coreutils awk, an action operates on a single line of text at a time, which is helpfully parsed into several variables named $1, $2, etc. This is probably the biggest reason for awk's power, since the tedious processing of tokens is completely hidden behind very simple objects.

Since an xml-awk action block operates on an XML subtree rather than a single line of text, there will have to be a much richer way of referring to subelements within the subtree, not just relative XPath expressions for nodes and attributes, but possibly a blend with the $1, $2 formalism which allows extracting individual words in freeform text surrounded by XML tags. For example, an expression such as $1.$2 might refer to the second token within the first subtree. This question is much too big and delicate to address here.

xml-grep

The coreutils command grep searches the lines of input documents for string matches and prints the discovered lines if any. The xml-coreutils command xml-grep searches the nodes of input documents for string matches and prints them. The result is again an XML document, which can be fed to another xml-coreutils command etc.
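Applied to simple.xml, a sketch of the intended behaviour (assuming, for illustration, that matching nodes are printed together with enough of their ancestry to remain well formed):

% xml-cat simple.xml | xml-grep bus
<?xml version="1.0"?>
<root>
	<transport>
		<engine>bus</engine>
	</transport>
</root>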

xml-diff

The coreutils diff command is a very useful tool which compares two text files and displays the differences in an intelligent way. An important aspect of this is that diff produces output which can be directly fed into an editor such as ed to convert one file into the other.

Naturally, xml-diff should output an XML representation of the difference between two XML documents. Due to its hierarchical nature, XML is also ideal for adding extra information which might be used to recover one file from the other.

xml-tr

In coreutils, the tr command transliterates certain characters. In xml-coreutils, the xml-tr command has a similar, but more complicated task. It must transliterate characters without modifying the XML tags themselves, respecting the document's character set, and it can optionally transliterate XML tags and attributes, to perform a clever kind of structural surgery.
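A sketch of the default behaviour, which transliterates character data while leaving the markup alone (hypothetical session):

% xml-echo "hello" | xml-tr 'a-z' 'A-Z'
<?xml version="1.0"?>
<root>
HELLO
</root>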

xml-sed

The xml-sed command is the last command I discuss in this essay, and the one I discuss most tentatively. The other commands above already give a bird's eye view of the issues and complications that occur naturally when trying to design xml-coreutils.

The sed command is one of the most powerful commands in coreutils. It reads text files one line at a time and allows arbitrary editing to take place. It stands to reason that its xml-coreutils counterpart xml-sed should therefore allow arbitrary editing of XML subtrees and be particularly useful for simple substitutions. A well designed xml-sed must cope with several issues.

Like xml-tr, there is the question of whether editing occurs in between the XML markup tags, or whether the tags and attributes themselves are to be edited. This is more than mere convenience, since xml-sed must always output a well formed XML document, regardless of the editing operation performed. Well formedness is therefore an invariant.
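As a minimal sketch under these constraints, a substitution restricted to character data might behave as follows (the syntax is borrowed from sed and invented here; the tags pass through untouched, so the output stays well formed):

% xml-cat simple.xml | xml-sed 's/bus/train/'
<?xml version="1.0"?>
<root>
	<salutation>
		<greeting>hello</greeting>
	</salutation>
	<transport>
		<engine>car</engine>
		<engine>train</engine>
		<muscle>bicycle</muscle>
	</transport>
</root>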

Another issue is the coexistence of regular expressions and XPath expressions, which are the natural ways of navigating, respectively, the unstructured embedded data and the XML tree structure. See the discussions of xml-echo and xml-awk above for some ideas. It may well be that designing a usable xml-sed will first require some experience with developing these other commands.

Summary

I've asked in this essay whether it is possible to make the Unix shell XML aware, and I've sketched a possible answer that I've called xml-coreutils.

One of the most important properties of the Unix shell is that it allows the combination of small modular programs through pipes and variable substitution. I've taken this and placed it at the heart of the conception of XML awareness.

While the exact functionality of the various xml-coreutils commands is interesting as well, true power can only be gained by making sure that all these commands both work well together, and work well with the common Unix commands. This is accomplished here by having some xml-coreutils commands take (well formed) XML as input, and produce (well formed) XML as output, having other commands take XML as input and produce simple line and list oriented output, and having other commands take strings as input and create XML as output.

A Unix command line program, when viewed as a modular building block, has two standardized slots for input (the stream oriented STDIN, and the list of options strings, called ARGV) and two standardized slots for output (the STDOUT and STDERR streams). To make the xml-coreutils useful, each command must do the right thing in each slot. Like LEGO building blocks, any two compatible programs should be connectable on any of these slots.

The STDIN and STDOUT slots are suitable for either text oriented or XML oriented data. The coreutils tools assume text oriented data, so there are commands which convert text to XML (e.g. xml-cat) and XML back to text (e.g. xml-strings). Since XML processing tends to require XML type input, text to XML conversion is the exception rather than the rule.

The ARGV slot is only naturally suitable for a list of strings. There are xml-commands which convert ARGV to XML (e.g. xml-echo) and XML to ARGV (e.g. xml-find).

The STDERR slot is nominally suitable for any type of text, but is only really intended for out of band diagnostic information. Such information tends to be small and is not usually part of the subsequent processing flow. The information is sometimes collected in log files containing the error output of several unrelated programs in random order. It therefore makes no sense for xml-commands to output XML data on STDERR. All xml-coreutils commands should output line oriented string data on STDERR.

I've chosen the xml-commands to mimic the Unix coreutils commands in functionality. There is no reason why other kinds of commands can't be invented, except that the coreutils are already proven and familiar. Other interesting commands might include an XML oriented replacement for less(1), and a tool for manipulating (small) XML documents using DOM semantics. A validator is useful too.

Other approaches

The xml-coreutils concept follows the Unix tradition of creating small single purpose tools. There have of course been other projects to fit XML processing into the Unix way of doing things. The projects below have evolved to fit various needs, and can be better or worse adapted to any given project.

The XML shell (XSH) project (xsh.sourceforge.net) follows the complementary idea of extending the shell program itself to be XML aware. This is a powerful way of manipulating the DOM interactively, but has the disadvantage that operators must learn a new shell.

The XmlStarlet XML Shell Toolkit (xmlstar.sourceforge.net) has the same goal as xml-coreutils, but implements a different cross section of commands. Commands output either text or XML as required, but it seems difficult to mix ordinary shell commands as part of the XML processing.

XMLTK (xmltk.sourceforge.net) is another set of command line utilities designed to perform simple operations on XML files. The emphasis is on fast and scalable streaming of documents, achieved in part by compressing and decompressing XML into binary data on the fly. Unfortunately, this is bad for interoperability, since it makes it impossible to casually insert third party filters into a pipeline, and requires other programs to learn to read the compressed binary format.

LT XML (http://www.ltg.ed.ac.uk/software/xml/) is both an XML parsing and manipulation library, and a set of command line tools developed using it. Like XmlStarlet, the tools cover a large number of operations. This toolkit has an emphasis on linguistic processing and SGML applications.

PYX format is a simple line oriented textual representation of XML. Rather than implementing XML aware programs, the idea is to convert the tree like form of an XML document into a list of PYX encoded lines, apply ordinary line oriented Unix filters, taking care to preserve the PYX format, and re-encode the result into XML. While it is simple, this approach puts the burden of structural bookkeeping squarely on the operator.

Perl-XML (perl-xml.sourceforge.net) is a collection of modules and add-ons for the Perl scripting language. These allow Perl scripts to handle XML in a very simple way, while taking advantage of the language's other strengths. However, Perl isn't ideal for connecting many small single purpose programs together, and is slightly awkward for routine interactive use in a shell. Similar remarks apply to other scripting languages.

XSLT is a powerful transformation language for XML documents that can be used on the command line. Unfortunately, it overlaps only partially with the scope of XML awareness as described in this essay.

DOM and SAX are standardized APIs for reading and manipulating XML documents from programming languages. See the remarks regarding Perl-XML.