If you're writing text files in an encoding that supports a Byte-Order Mark, you should always try to make sure that you include it, unless you have a protocol in place that precludes you from doing so (such as a legacy application that doesn't know how to deal with them).

One of the reasons you should always remember the BOM is that many applications can use it to try to guess what encoding they should use when trying to read the text you're feeding them.

Encoding detection based on the BOM is not foolproof, but it's better than having nothing at all, particularly in cases cases where simply assuming blindly a predefined encoding such as UTF-8 or UTF-16 might not be an option at all.

One particular case where remembering the BOM is very important is with UTF-8. Let me tell you a story to illustrate why:

We've been working the past few weeks on testing and improving a BizTalk-based solution for a client that some other consulting company had created. One particular piece had been working fine until, suddenly, we started getting an error when BizTalk tried to parse an incoming message with an error stating that "The character is not valid on the specified encoding".

Looking at the message, it was supposed to be UTF-8 encoded, and for the most part looked OK. The character causing trouble was, in fact, a 0xA0 char (non-breaking space) inside an element value. While this was not good, it wasn't clear why it was causing trouble.

Since it was an XML message, we opened it up in Internet Explorer: Yep, that too parsed it incorrectly and got stuck when it reached the problematic character.

Looking a bit further , we found that in this particular case, the original developer had written a piece of code that created a Stream object with the message contents and then fed that to BizTalk. The code looked a bit like this:

public static Stream CreateStream(String msg) {
   MemoryStream stream = new MemoryStream();
   byte[] bytes = Encoding.UTF8.GetBytes(msg);
   stream.Write(bytes, 0, bytes.Length);
   stream.Position = 0;
   return stream;
}

The message text itself was a piece of XML that included an declaration with the encoding attribute specifying UTF-8. This seemed OK, even if the code above just seemed like a pretty uncomfortable way of creating the stream.

However, this gave us a clue: UTF8Encoding.GetBytes() won't give you a BOM. Looking at the message in a binary editor, we validated indeed the message did not have a BOM at all. So we tried replacing the code above to one that simply used a StreamReader object (which uses UTF-8 with a BOM by default), and that fixed the issue right away!

This highlights why the BOM is so important for UTF-8: The basic characters in the set share the same values as the ASCII code. This is generally an advantage, but it can also mean that stuff that's incorrectly coded (such as our example above) might seem to work fine for a while until an unexpected character comes along and everything crumbles down. This is unlike other encoding, such as UTF-16, where things usually blow up right away.

In this particular case, the culprit was really a combination of factors: The lack of a BOM together with the presence of the encoding specification in the XML declaration [1]. I'm not sure why the XML stacks get stuck on a BOM-less UTF-8 file with an encoding declaration, but there you have it. So don't forget the BOM!

[1] I personally thing the encoding specification in the XML declaration is probably the single most stupid idea included in the XML spec. It's just downright evil.

Technorati tags: , ,


Tomas Restrepo

Software developer located in Colombia.