Extracting the structure of an XML file

The simplest structural information about an XML file is its type or file format. If this is all you wish to know, use xml-file(1):


% xml-file food.xml xml_coreutils_tutorial.html 
food.xml:                    XML text
xml_coreutils_tutorial.html: HTML text fragment

Just as file(1) uses heuristics to identify a file type from its binary contents, xml-file(1) uses various pieces of data, such as the DOCTYPE and the name of the root tag, to (attempt to) identify the type of an XML file. However, xml-file(1) is not a replacement for file(1), and will output "unrecognized file" if the file is anything other than XML. Moreover, it will not detect a broken (malformed) file if the break occurs below the tags it looks at.
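
The idea of looking only at the first tags can be sketched in a few lines of Python. This is not how xml-file(1) is implemented; it is a minimal illustration, and the helper name sniff_root_tag is made up for this example. It stops at the first start tag, so it cannot notice damage further down the input.

```python
import xml.etree.ElementTree as ET

def sniff_root_tag(text):
    """Return the name of the root tag, or None if the input does not start like XML.

    Hypothetical helper for illustration only, not part of xml-coreutils.
    """
    parser = ET.XMLPullParser(events=("start",))
    parser.feed(text)
    try:
        for _event, element in parser.read_events():
            return element.tag  # stop at the first tag; the rest is never checked
    except ET.ParseError:
        return None  # a parse error occurred before any tag was seen
    return None  # no tag found at all

print(sniff_root_tag("<root><People/></root>"))
print(sniff_root_tag("<root><a></root>"))  # broken below the root tag
print(sniff_root_tag("plain text"))
```

The second call still reports a root tag even though the closing tags do not match, which is the same blind spot described above.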

Every shell user knows how to navigate their home directory using ls(1) and cd(1). In xml-coreutils(7), the command xml-ls(1) lets you navigate and list the "directory" structure of an XML file using XPATHs. Here's an example using the People.xml file we discussed earlier.


% xml-ls People.xml :/
<?xml version="1.0"?>
<root>
	<People>
		<Person/>
	</People>
</root>
% xml-ls People.xml :/People/Person
<?xml version="1.0"?>
<root>
	<Person>
		<Address/>
		<TelNo/>
	</Person>
</root>
% xml-ls People.xml :/People/Person/Address
<?xml version="1.0"?>
<root>
	<Address>
		<LineOne/>
		<LineTwo/>
		<County/>
		<Country/>
	</Address>
</root>
% xml-ls People.xml :/People/Person/Address/Country
<?xml version="1.0"?>
<root>
	<Country>
		Ireland
	</Country>
</root>

The output of xml-ls(1) is XML. This makes sense if you recall that ls(1) prints both directory names and file names together. If we think of a tag as analogous to a directory, then text (such as the string "Ireland" in the last example) is analogous to an ordinary file. To keep the output well-formed XML, some constraints are needed, such as wrapping the output in a root tag. After all, the original doctype is not directly relevant.
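
The directory analogy can be made concrete with a small Python sketch. It is only an illustration of the idea, not the way xml-ls(1) works internally: the function list_node is hypothetical, the sample document is reconstructed from the outputs shown in this tutorial (the real People.xml may contain more), and the result is a plain list rather than XML.

```python
import xml.etree.ElementTree as ET

# People.xml as reconstructed from the outputs in this tutorial (an assumption).
PEOPLE_XML = """<People>
  <Person Name="Fred Davis">
    <Address>
      <LineOne>4 Bushy Street</LineOne>
      <LineTwo>Green Road</LineTwo>
      <County>Mayo</County>
      <Country>Ireland</Country>
    </Address>
    <TelNo>+353 96 45232</TelNo>
  </Person>
</People>"""

def list_node(xml_text, path):
    """List what sits directly under path: child tag names ("directories"),
    or the text content ("file") if the node has no child tags."""
    root = ET.fromstring(xml_text)
    names = path.strip("/").split("/")
    if names[0] != root.tag:
        return []
    node = root
    for name in names[1:]:
        node = node.find(name)  # first child with that tag name
        if node is None:
            return []
    children = [child.tag for child in node]
    if children:
        return children
    text = (node.text or "").strip()
    return [text] if text else []

print(list_node(PEOPLE_XML, "/People/Person"))
print(list_node(PEOPLE_XML, "/People/Person/Address/Country"))
```

The first call lists tag names only, the second returns the text, mirroring the way the xml-ls(1) examples show empty tags for inner nodes and the string "Ireland" for a leaf.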

To extract a structure based upon the presence or absence of textual contents, use xml-grep(1). The output will again be an XML file (so it can be xml-grepped again!), but containing only the structure necessary to access the text. The following examples give an idea of how this works.


% xml-grep 'Green' People.xml 
<?xml version="1.0"?>
<root>
	<Person Name="Fred Davis">
		<Address>
			
			<LineTwo>Green Road</LineTwo>
			
			
		</Address>
		
	</Person>
</root>
% xml-grep -E '(Fred|Ire*)' People.xml 
<?xml version="1.0"?>
<root>
	<Person Name="Fred Davis">
		<Address>
			
			
			
			<Country>Ireland</Country>
		</Address>
		
	</Person>
</root>
% xml-grep -i --subtree 'fReD' People.xml 
<?xml version="1.0"?>
<root>
	<Person Name="Fred Davis">
		<Address>
			<LineOne>4 Bushy Street</LineOne>
			<LineTwo>Green Road</LineTwo>
			<County>Mayo</County>
			<Country>Ireland</Country>
		</Address>
		<TelNo>+353 96 45232</TelNo>
	</Person>
</root>
% xml-grep -v 'o' People.xml 
<?xml version="1.0"?>
<root>
	<Person Name="Fred Davis">
		<Address>
			<LineOne>4 Bushy Street</LineOne>
			
			
			<Country>Ireland</Country>
		</Address>
		<TelNo>+353 96 45232</TelNo>
	</Person>
</root>
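
What "only the structure necessary to access the text" means can be sketched in Python. This is an illustration under assumptions, not the xml-grep(1) implementation: grep_tree is a hypothetical helper, the sample document is reconstructed from the outputs above, only text (not attribute values) is matched, and the result is not wrapped in a root tag.

```python
import re
import xml.etree.ElementTree as ET

# Stand-in for People.xml, reconstructed from the outputs above (an assumption).
PEOPLE_XML = (
    '<People><Person Name="Fred Davis"><Address>'
    "<LineOne>4 Bushy Street</LineOne><LineTwo>Green Road</LineTwo>"
    "<County>Mayo</County><Country>Ireland</Country>"
    "</Address><TelNo>+353 96 45232</TelNo></Person></People>"
)

def grep_tree(element, pattern):
    """Return a pruned copy of element that keeps only the branches leading
    to text matching pattern, or None if nothing at or below it matches."""
    kept = [c for c in (grep_tree(child, pattern) for child in element) if c is not None]
    own_match = bool(element.text and re.search(pattern, element.text))
    if not kept and not own_match:
        return None  # this whole branch is dropped
    pruned = ET.Element(element.tag, dict(element.attrib))
    if own_match:
        pruned.text = element.text
    pruned.extend(kept)
    return pruned

result = grep_tree(ET.fromstring(PEOPLE_XML), "Green")
print(ET.tostring(result, encoding="unicode"))
```

Because the result is itself a tree, it can be passed to grep_tree again, which is the same property that lets xml-grep(1) output be xml-grepped again.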

Last but not least, there is xml-find(1), which we already mentioned earlier. Just as its namesake find(1) traverses a directory tree, looking for interesting files and executing actions on them, xml-find(1) traverses an XML file one node at a time, looking for (selecting) interesting tags and executing actions. This makes xml-find(1) an iterator. Before we can illustrate this properly, we'll build up with a series of rather boring examples.

The simplest action is to search for a tag name and print it:


% xml-find People.xml -name 'Tel*' -print
/People/Person/TelNo
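
The node-by-node traversal behind this example can be sketched in Python. The sketch is not the xml-find(1) implementation: find_paths is a hypothetical helper, the pattern is assumed to be a shell-style glob as in find(1), and the sample document is a cut-down stand-in for People.xml.

```python
import fnmatch
import xml.etree.ElementTree as ET

def find_paths(element, pattern, prefix=""):
    """Yield the path of every tag whose name matches the shell-style pattern."""
    path = prefix + "/" + element.tag
    if fnmatch.fnmatchcase(element.tag, pattern):
        yield path
    for child in element:  # visit the children in document order
        yield from find_paths(child, pattern, path)

# A cut-down stand-in for People.xml (an assumption based on the outputs above).
doc = ET.fromstring(
    "<People><Person><Address/><TelNo>+353 96 45232</TelNo></Person></People>"
)
for path in find_paths(doc, "Tel*"):
    print(path)
```

Writing the traversal as a generator matches the iterator view: each selected tag is handed over one at a time, and the caller decides what action to take.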

The tag name can also be passed to a program (or a script), for example echo:


% xml-find People.xml -name 'Tel*' \
        -exec echo 'The tag is ' '{}' ';'
The tag is  /People/Person/TelNo

If this were a tutorial on find(1), then the placeholder {} would be the name of a file, which the -exec'd program could open and read. This is not possible here, because {} is only a tag name. So xml-find(1) has two more placeholders: {@}, which expands to the list of attributes of the selected tag (if any), and {-}, which expands to the name of a temporary XML file containing everything that belongs to the current node. Thus:


% xml-find People.xml -name 'Tel*' \
        -exec cat '{-}' ';'
<?xml version="1.0"?>
<People>
<Person>
<TelNo>+353 96 45232</TelNo></Person>
</People>

It's time to combine all these ideas into a final example. We'll iterate through the food.xml file using xml-find(1), stopping at each product and printing the data we find with xml-printf(1).


% xml-find food.xml -name 'product' \
        -exec xml-printf 'Price of %-20s: %5.2f\n' \
        {-} ://product ://product@price ';'
Price of Chicken             :  3.00
Price of Lobster             : 11.50
Price of Apple               :  0.20
Price of Milk (2 litres)     :  1.09
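
The column alignment comes from the printf-style format string: %-20s left-justifies the name in a field 20 characters wide, and %5.2f prints the price with two decimals in a field 5 characters wide. A short Python sketch shows the same formatting, assuming the usual printf semantics for these conversions; the helper price_line is made up for this example, and the names and prices are copied from the output above.

```python
# Same conversions as in the xml-printf(1) call, without the trailing newline.
FORMAT = "Price of %-20s: %5.2f"

def price_line(name, price):
    """Format one product line (hypothetical helper for illustration)."""
    return FORMAT % (name, price)

# Names and prices taken from the output shown above.
for name, price in [("Chicken", 3.00), ("Lobster", 11.50), ("Milk (2 litres)", 1.09)]:
    print(price_line(name, price))
```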