Saturday, April 6, 2013

Processing XML with .NET while preserving your sanity


XML faces quite a bit of criticism, ranging from accusations of being too complex to being just one of Microsoft's attempts to take over the world (it's actually an open W3C standard, but never mind). The general consensus, however, is that it does what it says on the tin, which is to transfer data between applications while allowing humans to take a peek in between. The focus of this post is that last part, the one involving human beings (stupid humans, always making programming more difficult than it needs to be!).

Consider this piece of XML:
<?xml version="1.0" encoding="UTF-8"?>
<xmldata>
  <element></element>
</xmldata>
Nice and readable, isn't it? Well, most times at least. Now look at it without the new lines and indentation:
<?xml version="1.0" encoding="UTF-8"?><xmldata><element></element></xmldata>
Much less so, indeed!
Everyone who has had their fair share of XML programming has faced this situation - for parsers the two XML documents are identical, but when we need to look at our data while debugging, the ugly, unformatted XML gets in the way. This is made worse by the .NET Framework's tendency to produce the latter variety of XML by default.

There are, of course, simple ways to make our lives easier by formatting XML the way we want it - it's just less obvious than it should be, so I decided to put this post together and shed some light on a few tricks and subtleties.

Dumping XML to a file & XmlWriter

It's a natural scenario to grab an XML file, do some processing, and save it. Here is a trivial piece of code that does that:
XmlDocument xml = new XmlDocument();
xml.Load(INPUT_PATH);
// ... XML processing code
xml.Save(OUTPUT_PATH);
This of course can easily be adapted to save the XML to a database, over the network or wherever else we need it to be. Most times it would appear to work fine, until you try it on our simple input. Here is what we get:
<?xml version="1.0" encoding="utf-8"?>
<xmldata>
  <element>
  </element>
</xmldata>
Looks almost the same - but not quite: the closing tag is on a new line! Curiously though, when there is an actual value between the tags we don't get a new line added to it, so this can be pretty hard to spot in a complex XML document where most elements contain values. That's actually what caused the bug I was troubleshooting when I got the inspiration for this post. It might not look like a big deal, but whatever sits between the opening and closing tag is our value, and we just had a new line inserted into it - for most applications a new line is quite different from an empty string!
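To reproduce the effect without touching the disk, here is a minimal sketch. The class and method names are made up for this illustration, and it assumes that XmlDocument.Save(TextWriter) applies the same default indentation as the path-based overload, which is worth verifying on your framework version:

```csharp
using System;
using System.IO;
using System.Xml;

static class EmptyElementRepro
{
    // Hypothetical helper: loads the markup and saves it with XmlDocument's
    // default formatting, returning the result as a string.
    public static string SaveWithDefaults(string input)
    {
        XmlDocument xml = new XmlDocument();
        xml.LoadXml(input);
        using (StringWriter text = new StringWriter())
        {
            xml.Save(text);
            return text.ToString();
        }
    }

    static void Main()
    {
        // Empty element: the closing tag is expected on its own line.
        Console.WriteLine(SaveWithDefaults("<xmldata><element></element></xmldata>"));
        // Element with a value: start tag, value and end tag stay on one line.
        Console.WriteLine(SaveWithDefaults("<xmldata><element>value</element></xmldata>"));
    }
}
```

Comparing the two outputs side by side makes the inconsistency between empty and non-empty elements easy to see.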
So then, what do we do about it?

Using XmlWriter

We have a class in the .NET Framework that's meant to give us more control over XML-exporting operations - XmlWriter. Here is the simplest possible way to use it, without supplying any explicit settings:
XmlDocument xml = new XmlDocument();
xml.Load(INPUT_PATH);
// ... XML processing code
// The using block closes the writer (and flushes the file) even if Save throws
using (XmlWriter writer = XmlWriter.Create(OUTPUT_PATH))
{
    xml.Save(writer);
}
And the result:
<?xml version="1.0" encoding="utf-8"?><xmldata><element></element></xmldata>
... i.e. exactly what we were trying to escape from. Looks like XmlWriter is not of much use by itself when it comes to formatting XML for human readability.

XmlWriterSettings.Indent

Luckily, there is the Indent property in XmlWriterSettings, which is false by default but can very easily be set to true:
// ... XML processing code
XmlWriterSettings writerSettings = new XmlWriterSettings();
writerSettings.Indent = true;
// Dispose the writer so the output is flushed and the file is closed
using (XmlWriter writer = XmlWriter.Create(OUTPUT_PATH, writerSettings))
{
    xml.Save(writer);
}
Leading to, finally, a well-formatted XML output!

XmlWriter with its settings object is nice and all, but there is one caveat that you could hit before getting to the Indent property - there is also the NewLineHandling property, which you might be tricked into thinking would achieve our goal. In fact, it only affects the new lines within the actual values between tags and doesn't apply to the new lines in the markup.
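A small sketch of that difference, under a few assumptions: Write is a hypothetical helper, the writer targets a StringWriter instead of a file, and OmitXmlDeclaration is set only to keep the output short.

```csharp
using System;
using System.IO;
using System.Xml;

static class NewLineHandlingDemo
{
    // Hypothetical helper: writes our sample document with the given settings.
    public static string Write(bool indent, NewLineHandling handling)
    {
        XmlDocument xml = new XmlDocument();
        xml.LoadXml("<xmldata><element></element></xmldata>");

        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = indent;
        settings.NewLineHandling = handling;
        settings.OmitXmlDeclaration = true; // demo only: keeps the output short

        using (StringWriter text = new StringWriter())
        {
            using (XmlWriter writer = XmlWriter.Create(text, settings))
            {
                xml.Save(writer);
            }
            return text.ToString();
        }
    }

    static void Main()
    {
        // NewLineHandling alone leaves the markup on a single line.
        Console.WriteLine(Write(false, NewLineHandling.Replace));
        Console.WriteLine(Write(false, NewLineHandling.Entitize));
        // Only Indent = true breaks the markup across lines.
        Console.WriteLine(Write(true, NewLineHandling.Replace));
    }
}
```

Whatever NewLineHandling value is passed, the markup only gets line breaks once Indent is switched on.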

Juggling with XML formats in-memory

But wait a second, this neat solution relies on XmlWriter, which - in this example, at least - only writes to a file. What if we don't actually need to write the file to the file system and would rather have a string, byte array or some other in-memory structure? We have a few ways to achieve this using the same XmlWriter-based logic, presented here from most to least atrocious.
One option is to just save the file and read it back like a text or binary file - this will load the well-formatted XML in memory as string/bytes/whatever. In case this solution looks attractive, I have some advice for you - don't do it. Even with an SSD, read/write operations are expensive, and they are also an unnecessary risk - the disk might be full, we might not have access to the folder, etc. I can think of only one situation where this 'solution' would be advisable - if you actually need the physical files, e.g. for debugging or logging purposes.
Another way to harness XmlWriter for this task is to combine it with .NET's flexible, polymorphic stream architecture and create your XmlWriter around a MemoryStream. Now if using something called MemoryStream just to sanitize your XML doesn't sound like overkill to you, then I guess I can't argue further - if you haven't been featured on TheDailyWTF, you probably soon will be.
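For reference, the MemoryStream variant might look roughly like the sketch below. ToIndentedString is a hypothetical helper name, and the BOM-less UTF8Encoding is my own choice so that the decoded string starts directly with the declaration:

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class MemoryStreamDemo
{
    // Hypothetical helper: indents the document through an XmlWriter that
    // writes into a MemoryStream, then decodes the bytes back into a string.
    public static string ToIndentedString(XmlDocument xml)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = true;
        // UTF-8 without a byte order mark (an assumption made for this sketch).
        settings.Encoding = new UTF8Encoding(false);

        using (MemoryStream stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                xml.Save(writer);
            }
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }

    static void Main()
    {
        XmlDocument xml = new XmlDocument();
        xml.LoadXml("<xmldata><element></element></xmldata>");
        Console.WriteLine(ToIndentedString(xml));
    }
}
```

It works, but it takes a stream, a writer, a settings object and an encoding round trip just to get a string.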
But fear not - there is another way. Enter LINQ...

XElement/XDocument

This solution doesn't actually use LINQ itself - it just taps into the XElement-based infrastructure that LINQ to XML uses to objectify XML documents. Apart from making it possible to use LINQ on XML, these classes also use some of the more recent .NET Framework additions, like object initialization and anonymous types, in order to make dealing with XML in .NET less cumbersome.
Here is how to beautify an XML document in-memory and assign it to a string:
XDocument xDoc = XDocument.Load(INPUT_PATH);
// ... XML processing code
String xmlString = xDoc.ToString();
And here is what we get:
<xmldata>
  <element></element>
</xmldata>
Almost but not quite - it's missing the XML declaration (<?xml version...). Here is how to add it - we just need to change the last line to:
String xmlString = xDoc.Declaration.ToString() + Environment.NewLine + xDoc.ToString();
And voila - we have a well-formatted XML in-memory, in two lines!
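One thing to watch out for: as far as I know, XDocument.Declaration is null when the input has no declaration, so the one-liner would throw in that case. A minimal, self-contained sketch with a guard (ToStringWithDeclaration is a hypothetical helper, and XDocument.Parse stands in for loading from INPUT_PATH):

```csharp
using System;
using System.Xml.Linq;

static class XDocumentDemo
{
    // Hypothetical helper: prepends the declaration only if the document has one.
    public static string ToStringWithDeclaration(XDocument doc)
    {
        string body = doc.ToString();
        if (doc.Declaration == null)
        {
            return body;
        }
        return doc.Declaration.ToString() + Environment.NewLine + body;
    }

    static void Main()
    {
        XDocument xDoc = XDocument.Parse(
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?><xmldata><element></element></xmldata>");
        Console.WriteLine(ToStringWithDeclaration(xDoc));
    }
}
```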

Further notes on XML and strings in .NET

But why did we need to change that line in order to add the XML declaration, why doesn't it get included automatically? Upon further observation, we can see that there is the XDocument.Save(String path) method, which saves the document to disk and does include the declaration - so it starts to look like an unintentional omission to not include it in ToString()?

As it turns out, not only is there a reason for that, but there are in fact at least two good reasons to implement ToString() in such a way, and each of them reveals something interesting about the way .NET handles XML and strings in general.

Handling chunks of XML in-memory

The older XmlElement-based approach is a traditional DOM implementation - it builds a tree of the document in memory and treats every piece of XML as part of a document - even for a single element, a dummy XML document object will be created around it. That's not the case with XElement - an XElement instance can represent just one XML element without any context.
This point of view is taken further by considering the XML declaration to not be a part of the document - it's just a header of the .xml file format that marks the content as valid XML and indicates the version and encoding (although you need to know the encoding in order to read the header that gives you the encoding, but never mind). That's why you only get the XML declaration inserted when you save the thing to a file - before that it's just a document in memory that holds a piece of XML markup.
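A short sketch of both points, with made-up class and method names. It uses the XDocument.Save(TextWriter) overload instead of a file so that it runs anywhere; my understanding is that the declaration is produced by the writer at save time, so its encoding attribute reflects the writer rather than the document:

```csharp
using System;
using System.IO;
using System.Xml.Linq;

static class DeclarationDemo
{
    // Hypothetical helper: saves the document through a TextWriter
    // and returns what was written.
    public static string SaveToString(XDocument doc)
    {
        using (StringWriter text = new StringWriter())
        {
            doc.Save(text);
            return text.ToString();
        }
    }

    static void Main()
    {
        // An XElement stands on its own, with no document around it.
        XElement element = new XElement("element", "value");
        Console.WriteLine(element.ToString());

        // ToString() gives just the markup; Save adds the declaration.
        XDocument doc = new XDocument(new XElement("xmldata", element));
        Console.WriteLine(doc.ToString());
        Console.WriteLine(SaveToString(doc));
    }
}
```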

String encoding in .NET

First, a quick refresher on encodings, as I suspect that developers who work only for the western market don't deal with them in depth on a daily basis. Encodings are ways to map characters to sequences of bits so that they can be stored in binary media. Everyone knows about ASCII, which assigns one byte per character and fits just the Latin alphabet and a bunch of funny symbols, and about Unicode, which covers far more than that.
In the average .NET developer's practice, encodings are used explicitly in order to convert strings to bytes and vice versa. Let's extend our example in order to get that neat XML in a byte array, e.g. to be sent over a socket:
XDocument xDoc = XDocument.Load(INPUT_PATH);
String xmlString = xDoc.Declaration.ToString() + Environment.NewLine + xDoc.ToString();
byte[] xmlBytes = Encoding.UTF8.GetBytes(xmlString);
Now, we can of course call our good friends XmlWriter and MemoryStream, but as was demonstrated we have a better way to deal with the task at hand - the only change from the previous example is the addition of the last line, which uses the UTF8 encoding to convert the sequence of symbols that is represented by our string to a sequence of bytes. I stress this wording because it's crucial for my next point - to understand the situation here, we need to think of strings abstractly. When we have a string object it is of course just a pointer to a region in memory that is filled with bytes, but that's of no concern to our encoding object - it only looks at the sequence of characters, regardless of how they are represented in memory. It then generates the matching bytes for each character to give us our byte array - that's it.
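To make the "same symbols, different bytes" idea concrete, here is a minimal sketch. The sample string is made up for the illustration and contains one non-ASCII character:

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    static void Main()
    {
        // 20 characters, one of which (e with an acute accent) is outside ASCII.
        string text = "<element>\u00e9</element>";

        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        byte[] utf32 = Encoding.UTF32.GetBytes(text);

        Console.WriteLine(text.Length);   // 20 characters
        Console.WriteLine(utf8.Length);   // 21 bytes: 19 ASCII characters + 2 bytes for the accented one
        Console.WriteLine(utf32.Length);  // 80 bytes: 4 bytes per character

        // Decoding with the matching encoding gives back the same sequence of symbols.
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == text);   // True
        Console.WriteLine(Encoding.UTF32.GetString(utf32) == text); // True
    }
}
```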

OK, but what does this have to do with XML and the reason why XDocument doesn't include the XML declaration? Well, it's the same principle - XDocument is a pure soul without a body (i.e. a physical file), and it treats the encoding as a bodily concern - it's a mere physical representation of the information in the XML document. That's why the pure information contained in the XDocument object shouldn't contain the XML header, and with it - the encoding, when in fact it isn't associated with any encoding at all.

.NET does store strings as bytes in memory, so there is one more encoding operation going on all the time - the mapping of our sequence of symbols to the bytes in the managed heap. For this purpose, the CLR uses UTF-16 - hence the 2-byte size of the char datatype and 2 bytes per char in a string. What is a little confusing at first is that System.Text.Encoding has no property named UTF16, although it does have UTF7, UTF8, UTF32 and Unicode - and Unicode is not even an encoding but the overall standard that covers them all. The System.Text.Encoding.Unicode encoding is in fact UTF-16 (little-endian), which gives a glimpse into how the .NET architects treat UTF-16 as the default encoding.
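A few lines are enough to check this for yourself (the class name is made up for this sketch):

```csharp
using System;
using System.Text;

static class Utf16Demo
{
    static void Main()
    {
        // A char is one UTF-16 code unit.
        Console.WriteLine(sizeof(char));                 // 2

        // Encoding.Unicode identifies itself as UTF-16.
        Console.WriteLine(Encoding.Unicode.WebName);     // utf-16

        // Little-endian: the low byte of 'A' (0x41) comes first.
        byte[] bytes = Encoding.Unicode.GetBytes("A");
        Console.WriteLine(BitConverter.ToString(bytes)); // 41-00
    }
}
```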

For more details and fun into how strings, and objects in general, are stored in memory in .NET you can always count on the guru Jon Skeet: http://msmvps.com/blogs/jon_skeet/archive/2011/04/05/of-memory-and-strings.aspx

Bonus: The code

Here is a simple Visual Studio 2010 solution that demonstrates all the code in this post for download

Also, in case you don't like downloading files from strangers - here is the same solution shared on CodePlex:
https://xmldemo.codeplex.com/
