XML-COREUTILS

NAME
DESCRIPTION
COMMANDS
COMMON UNIFIED COMMAND LINE CONVENTION
XPATH SPECIFICATION
WHITESPACE HANDLING
ECHO-LEAF
AUTHORS
BUGS
SEE ALSO

NAME

xml-coreutils − shell commands for XML processing

DESCRIPTION

xml-coreutils(7) is a collection of shell commands, similar to the traditional core utilities ("coreutils") shipped on many Unix systems, but intended to operate on XML files rather than text files. An important design goal is to keep the "look and feel" of the traditional core utilities, and to minimize the learning curve for experienced shell users.

While the current version of xml-coreutils(7) is likely to have evolved somewhat from its initial design, the fundamental ideas are described in detail in the following essay and tutorial:
@PKGDATADIR@/doc/unix_xml.html
@PKGDATADIR@/doc/xml_coreutils_tutorial.html

This manpage lists the available COMMANDS, and describes the COMMON UNIFIED COMMAND LINE CONVENTION as well as the concept of an ECHO-LEAF.

COMMANDS

The following list of commands is available:

xml-cat(1)

concatenate XML files and print XML on the standard output.

xml-cp(1)

copy nodes from XML files into an XML file.

xml-cut(1)

print selected parts of an XML file as an XML file.

xml-echo(1)

generate an XML file on the standard output.

xml-file(1)

determine type of XML files.

xml-find(1)

search for nodes in XML files and execute actions.

xml-fixtags(1)

convert HTML into XML on the standard output.

xml-fmt(1)

reformat an XML file, writing to the standard output.

xml-grep(1)

print matching fragments as an XML file on the standard output.

xml-head(1)

truncate the parts of an XML document.

xml-less(1)

interactively display an XML file on a terminal.

xml-ls(1)

list the contents of an XML file.

xml-mv(1)

move nodes from XML files to an XML file or the standard output.

xml-printf(1)

format and print data in an XML file to the standard output.

xml-rm(1)

remove nodes from XML files.

xml-sed(1)

stream editor for filtering and transforming an XML file.

xml-strings(1)

print the strings of data in an XML file to the standard output.

xml-unecho(1)

ungenerate an XML file into an xml-echo(1) expression.

xml-wc(1)

print height, depth and number of tags for each XML file.

COMMON UNIFIED COMMAND LINE CONVENTION

Since it is often desirable to work with only a small part of an XML file, most (but not all) commands in xml-coreutils(7) accept both a filename and one or more special strings called XPATHs. An XPATH follows a small subset of the W3C XPath 1.0 syntax and represents a set of nodes within an associated XML file.

A command which uses the common unified command line convention typically has the following synopsis:

xml-command [OPTIONS] [ [FILE]... [:XPATH]... ]...

This indicates that after any OPTIONS, the remainder of the command line consists of zero or more FILE(s), followed by zero or more XPATH(s), followed again by zero or more FILE(s) and zero or more XPATH(s), etc. Each XPATH is preceded by a colon (:), which is not part of the XPATH but serves to distinguish the argument from a generic operating system FILE. This method is unambiguous, since a FILE whose name happens to start with a colon can always be preceded by an absolute or relative path which doesn’t (for example, ./:name instead of :name).

The convention is that every unbroken series of XPATH(s) is associated with each FILE that forms part of the preceding unbroken series of FILE(s). Stated another way, every FILE is unambiguously associated with each of the XPATH(s) within the immediately following unbroken series. It is possible that the first unbroken series of XPATH(s) is not preceded by any FILE, in which case the standard input is taken to be the missing FILE. If the last unbroken series of FILE(s) is not followed by any XPATH, then the special XPATH "/" is assumed for those FILE(s).

In the example below, the command operates on the following associations: (stdin,xp1), (file1,xp1,xp2,xp3), (file2,xp1,xp2,xp3) and (file3,"/"). Each such association may also be called a bundle.

xml-command :xp1 file1 file2 :xp1 :xp2 :xp3 file3

The generic meaning of a bundle such as (file2,xp1,xp2,xp3) is that xml-command is performed on (or using) the set of nodes in file2 which match any one of xp1, xp2 or xp3.
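The grouping rule above can be sketched in a few lines of Python. This is only an illustration of the convention as described in this manpage, not the parser used by the commands themselves; the function name parse_bundles and the STDIN marker are hypothetical, and OPTIONS are assumed to have been removed already. The behaviour for an empty argument list is not specified above, so the sketch simply returns no bundles in that case.

```python
STDIN = "stdin"  # hypothetical marker for the standard input


def parse_bundles(args):
    """Group FILE and :XPATH arguments into (file, [xpaths]) bundles."""
    bundles = []
    files, xpaths = [], []

    def flush():
        # Emit one bundle per FILE in the pending series.
        nonlocal files, xpaths
        if not files and not xpaths:
            return
        pending_files = files or [STDIN]   # XPATHs with no FILE use stdin
        pending_xpaths = xpaths or ["/"]   # FILEs with no XPATH use "/"
        for name in pending_files:
            bundles.append((name, list(pending_xpaths)))
        files, xpaths = [], []

    for arg in args:
        if arg.startswith(":"):
            # The colon marks an XPATH and is not part of it.
            xpaths.append(arg[1:])
        else:
            # A FILE after a series of XPATHs starts a new group.
            if xpaths:
                flush()
            files.append(arg)
    flush()
    return bundles


if __name__ == "__main__":
    line = [":xp1", "file1", "file2", ":xp1", ":xp2", ":xp3", "file3"]
    for bundle in parse_bundles(line):
        print(bundle)
```

Run on the example command line above, the sketch prints the four bundles (stdin,xp1), (file1,xp1,xp2,xp3), (file2,xp1,xp2,xp3) and (file3,"/").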

XPATH SPECIFICATION

An XPATH is a string which represents a subset of XML nodes in an XML document, using syntax similar to W3C XPath 1.0. Only a (very small) part of XPath semantics is actually supported, which includes neither axes, namespaces, functions nor complex predicates.

The XPATH matching algorithm operates on path prefixes. This ensures that whenever a node is selected, all its children will be selected as well. The following examples are normative.

/

selects the whole document.

/*/abc

selects the nodes which are descendants of the tags named abc, which are children of the top level tag.

//abc

selects a tag named abc (and all nodes which are descendants of it) which can occur anywhere in the whole document.

//abc/

selects all nodes which are descendants of a tag named abc which can occur anywhere in the whole document, but does not select the tag abc itself.

/xhtml/body/p/

selects each node which is a descendant of one of the top level <p> nodes within the body of an xhtml document.

/xhtml/body/p/*

selects each node which is a descendant of a tag which is a child of a <p> tag which is a child of a <body> tag which is a child of the root tag <xhtml>.

/abc@def

selects the attribute named def of the top level tag named abc.

//abc@*

selects each of the attributes of any tag named abc within the document.

/*/abc[2]

selects the descendants of the second tag named abc that is a child of the top level tag.
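The idea of matching on path prefixes, mentioned at the start of this section, can be illustrated with a deliberately simplified Python sketch. It is an assumption-laden model rather than the actual matching algorithm: the function name matches_prefix is hypothetical, nodes are represented only by their slash-separated tag paths, and only absolute patterns made of tag names and the * wildcard are handled. The // form, attributes (@), positional predicates ([2]) and the distinction made by a trailing slash are not modelled.

```python
def matches_prefix(pattern, node_path):
    """Return True if the pattern's steps match a prefix of node_path.

    Both arguments are slash-separated tag paths such as "/xhtml/body/p".
    A step of "*" in the pattern matches any single tag name.
    """
    steps = [s for s in pattern.split("/") if s]
    tags = [t for t in node_path.split("/") if t]
    if len(steps) > len(tags):
        return False  # the node is not deep enough to be selected
    return all(step == "*" or step == tag for step, tag in zip(steps, tags))


if __name__ == "__main__":
    print(matches_prefix("/", "/xhtml/body/p"))        # whole document
    print(matches_prefix("/*/abc", "/root/abc/x/y"))   # below /root/abc
    print(matches_prefix("/*/abc", "/root/other"))     # different branch
```

Because only a prefix of the node's path has to match, any node below a selected node is selected too, which is the property stated above.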

WHITESPACE HANDLING

Every xml-coreutils(7) command outputs either well formed XML, or traditional unix text. All whitespace text nodes in input XML documents are preserved verbatim in case they are being output directly as part of an XML document, but no such guarantee is made if the command merely outputs text.

If a command generates its own XML fragments, then indenting is performed using TAB characters, rather than spaces, since this is simpler to process subsequently. The visual layout of XML documents for human consumption is delegated to xml-fmt(1).

ECHO-LEAF

The name echo-leaf refers to a character string of the special form "[PATH]TEXT" that is used by several commands including xml-echo(1) and xml-sed(1). Each echo-leaf represents a minimal XML fragment consisting of a text node and a hierarchical path, delimited by square brackets, which leads to the XML tag surrounding this text node.

A sequence of echo-leaves of an XML file plays a similar role to a sequence of lines of a text file. Whereas it is customary to think of a line ending with a line break specification ’\n’, here we think of a text node as preceded by a path specification ’[PATH]’.

In an echo-leaf, the PATH is optional, but if it is present it must be enclosed in square brackets ([]). The TEXT is optional too.

The PATH contains an absolute or relative path of an XML tag, and can optionally include attribute specifications. The TEXT contains ordinary text and may also contain escaped sequences representing special XML constructs, but not another "[PATH]".

An attribute specification is a string of the form "@NAME=VALUE", where NAME is the name of the attribute and VALUE is the associated string value. VALUE should not be surrounded by quotation marks (neither " nor ’), but if it contains the special characters []@= these must be preceded by a backslash.
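As an illustration of the "[PATH]TEXT" form, the following Python sketch splits an echo-leaf into its PATH and TEXT parts. It is not taken from the xml-coreutils sources; the function name split_echo_leaf is hypothetical, and the sketch assumes that a backslash escapes the next character inside the PATH and that only a leading "[" introduces a PATH.

```python
def split_echo_leaf(leaf):
    """Split "[PATH]TEXT" into (path, text); path is None when absent."""
    if not leaf.startswith("["):
        return None, leaf  # no PATH, the whole string is TEXT
    i = 1
    while i < len(leaf):
        ch = leaf[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "]":
            # First unescaped "]" closes the PATH; the rest is TEXT.
            return leaf[1:i], leaf[i + 1:]
        i += 1
    raise ValueError("unterminated [PATH] in echo-leaf")


if __name__ == "__main__":
    print(split_echo_leaf("[/abc/def@importance=Earnest]hello"))
    print(split_echo_leaf("just text"))
    print(split_echo_leaf("[..]"))
```

The escaped characters are left in place in the returned PATH, so a later step can still tell an escaped "@" or "=" inside a VALUE from one that starts an attribute specification.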

The following PATH examples are normative.

[/abc]

represents a root tag named "abc".

[abc]

represents a tag named "abc" which is a child relative to the current context.

[.]

represents the tag that is currently in context.

[..]

represents the parent of the tag that is currently in context.

[/abc/def@importance=Earnest/../ghi]

represents a tag named "ghi" which is a sibling of a tag named "def" whose attribute "importance" has the value "Earnest", and whose parent tag is the root tag named "abc".
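The PATH examples above can be read as navigation steps relative to a current context. The Python sketch below computes only the location of the tag that a PATH finally refers to; it is a hypothetical model, not the behaviour of xml-echo(1), which may also create the intermediate tags it passes through. The name resolve_path is invented for illustration, the context is assumed to be a list of tag names from the root down, attribute values are assumed to contain no "/" characters, and ".." at the root is assumed to have no effect.

```python
def resolve_path(context, path):
    """Resolve an echo-leaf PATH against a context (list of tag names)."""
    # An absolute PATH starts at the document root, a relative one
    # starts at the current context.
    location = [] if path.startswith("/") else list(context)
    for step in path.split("/"):
        name = step.split("@", 1)[0]  # drop any attribute specifications
        if name in ("", "."):
            continue                  # empty step or current tag
        if name == "..":
            if location:
                location.pop()        # move up to the parent tag
        else:
            location.append(name)     # move down into a child tag
    return location


if __name__ == "__main__":
    print(resolve_path([], "/abc/def@importance=Earnest/../ghi"))
    print(resolve_path(["abc"], "def"))
    print(resolve_path(["abc", "def"], ".."))
```

For the last normative example, the sketch ends at the tag ghi directly below the root tag abc, matching the description of ghi as a sibling of def.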

More details about the typical contents of PATH and TEXT in an echo-leaf can be found in the xml-echo(1) manpage.

AUTHORS

Laird A. Breyer is the original author of this software. The source code (GPLv3 or later) for the latest version is available at the following locations:
http://www.lbreyer.com/gpl.html
http://xml-coreutils.sourceforge.net

BUGS

The xml-coreutils collection is still incomplete, but already usable for limited tasks. Command options and output formats are subject to change without warning prior to v1.0 of the software.

SEE ALSO

xml-cat(1) xml-cp(1) xml-cut(1) xml-echo(1) xml-file(1) xml-find(1) xml-fixtags(1) xml-fmt(1) xml-grep(1) xml-head(1) xml-less(1) xml-ls(1) xml-mv(1) xml-printf(1) xml-rm(1) xml-sed(1) xml-strings(1) xml-unecho(1) xml-wc(1)