A few days ago Sam Vanhoutte posted on the BizTalk newsgroups about an issue he was having while trying to process Unicode-encoded messages using the BizTalk Framework Disassembler. Here's the tale of what we discovered in the process.
The error happened while trying to process a UTF-16LE encoded XML message using the BizTalk Framework Disassembler component. The message in question was received with no XML declaration (and hence no encoding attribute) and contained no BOM. This caused the operation to fail with the "None of the components at Disassemble stage can recognize the data" error, suggesting that the disassembler couldn't figure out the document encoding.
After looking around a bit using Reflector, I noticed that the BizTalk Framework Disassembler used the XML Disassembler (XmlDasm) underneath. Because of this, I suggested that Sam try using my FixEncoding pipeline component in the decode stage of his pipeline to set the Charset of the message's body part to the correct encoding (UTF-16 Little Endian, code page 1200). It worked, almost. Now the document was being recognized by the disassembler, but the disassembly itself failed with the following error:
System.Xml.XmlException : Name cannot begin with the '.' character, hexadecimal value 0x00. Line 1, position 2.
at Microsoft.BizTalk.Component.NamespaceTranslatorStream.Read(Byte[] buffer, Int32 offset, Int32 count)
at Microsoft.BizTalk.Streaming.MarkableForwardOnlyEventingReadStream.ReadInternal(Byte[] buffer, Int32 offset, Int32 count)
at Microsoft.BizTalk.Streaming.EventingReadStream.Read(Byte[] buffer, Int32 offset, Int32 count)
at System.IO.StreamReader.ReadBuffer(Char[] userBuffer, Int32 userOffset, Int32 desiredChars, Boolean& readToUserBuffer)
at System.IO.StreamReader.Read(Char[] buffer, Int32 index, Int32 count)
at System.Xml.XmlTextReaderImpl.InitTextReaderInput(String baseUriStr, TextReader input)
at System.Xml.XmlTextReaderImpl..ctor(String url, TextReader input, XmlNameTable nt)
at System.Xml.XmlTextReader..ctor(TextReader input)
at Microsoft.BizTalk.Streaming.Utils.GetDocType(MarkableForwardOnlyEventingReadStream stm, Encoding encoding)
at Microsoft.BizTalk.Component.XmlDasmReader.CreateReader(IPipelineContext pipelineContext, IBaseMessageContext messageContext, MarkableForwardOnlyEventingReadStream data, Encoding encoding, Boolean saveEnvelopes, Boolean allowUnrecognizedMessage, Boolean validateDocument, SchemaList envelopeSpecNames, SchemaList documentSpecNames, IFFDocumentSpec docSpecType, SuspendCurrentMessageFunction documentScanner)
at Microsoft.BizTalk.Component.XmlDasmComp.Disassemble2(IPipelineContext pc, IBaseMessage inMsg)
at Microsoft.BizTalk.Component.XmlDasmComp.Disassemble(IPipelineContext pc, IBaseMessage inMsg)
at Microsoft.BizTalk.Component.BtfDasmComp.DoLoad(IPipelineContext pc, IBaseMessage inMsg)
at Microsoft.BizTalk.Component.BtfDasmStateLoad.LoadMessage(IBtfDasmAction act, IPipelineContext pc, IBaseMessage inMsg)
at Microsoft.BizTalk.Component.BtfDasmComp.Disassemble2(IPipelineContext pc, IBaseMessage inMsg)
at Microsoft.BizTalk.Component.BtfDasmComp.Disassemble(IPipelineContext pc, IBaseMessage inMsg)
at Microsoft.Test.BizTalk.PipelineObjects.Stage.Execute(IPipelineContext pipelineContext, IBaseMessage inputMessage)
This was a clear sign that the disassembler was somehow trying to interpret the document using the wrong encoding, even though we were clearly specifying the correct Charset. At this point, I asked Sam to pass on the problematic file to see what I could find out.
The Real Problem
After a lot of digging, I think I've discovered what seems to be a bug in the way the BtfDasmComp component works. It seems that it doesn't correctly decode documents encoded with anything other than UTF-8, unless the .NET Framework's XmlTextReader can figure out the document encoding on its own. The problematic document had neither a BOM nor an XML declaration, so the reader had nothing to go on, and apparently the disassembler was defaulting to interpreting the document as UTF-8, which caused the error.
The question then was why this was happening when we were specifying the correct Charset for the document. The Charset obviously made a difference, since probing now succeeded. Why was the correct encoding being used while probing but not while disassembling?
After spending a couple more hours going through the BizTalk Framework disassembler, I can venture an educated guess as to why the wrong encoding is being used.
The first thing I noticed was that the BtfDasmComp component, just like the XmlDasmComp component, clearly looked specifically for the part's Charset property (IBaseMessagePart.Charset) both before probing and before disassembling the document. So up to here, everything was just fine.
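To see why a UTF-8 default would produce exactly this kind of error, here is a minimal sketch that decodes BOM-less UTF-16LE bytes as UTF-8 and hands the result to an XmlTextReader. It is not the disassembler's code; the class and method names (WrongEncodingDemo, SampleBytes, ParseAsUtf8) and the sample document are made up for illustration. Each UTF-16LE character becomes two UTF-8 characters, the second being a NUL, so the parser sees a NUL right after the opening angle bracket.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class WrongEncodingDemo
{
    // Hypothetical sample: UTF-16LE bytes with no BOM and no XML declaration,
    // mimicking the shape of the problematic message.
    public static byte[] SampleBytes()
    {
        return new UnicodeEncoding(false, false).GetBytes("<root>hello</root>");
    }

    // Deliberately decodes the bytes as UTF-8 and returns the parser error,
    // or null if the document parsed cleanly.
    public static XmlException ParseAsUtf8(byte[] data)
    {
        try
        {
            using (var text = new StreamReader(new MemoryStream(data), Encoding.UTF8))
            using (var xml = new XmlTextReader(text))
            {
                while (xml.Read()) { }
            }
            return null;
        }
        catch (XmlException ex)
        {
            return ex;
        }
    }

    public static void Main()
    {
        XmlException error = ParseAsUtf8(SampleBytes());
        Console.WriteLine(error == null ? "parsed without error" : error.Message);
    }
}
```

The bytes 3C 00 72 00 decode under UTF-8 as '<', NUL, 'r', NUL, which is why the reader complains about a name starting with character 0x00 at position 2.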
However, during disassembly, control eventually lands on the BtfDasmComp.DoLoad() method, where the body part data stream of the message is replaced with an instance of the BTFDasmTranslatorStream class:
Up to here, encoding1 correctly has the encoding created from the value of the part's Charset property. While the code correctly passes an encoding to the new stream, I spotted that the BTFDasmTranslatorStream class is derived from the NamespaceTranslatorStream class.
In one of the constructors for the NamespaceTranslatorStream class, a new XmlTextReader is created to process the document, but no encoding is specified for it, which leaves the reader to work out the encoding of the message stream by itself. This makes no sense, because by this point the disassembler knows exactly what encoding to use. Here's the relevant code:
public NamespaceTranslatorStream(IPipelineContext pipelineContext, Stream data, string oldNamespace, string newNamespace, Encoding encoding) : base(new XmlTextReader(data), encoding)
You can see that while the specified encoding is passed on to the base class (XmlBufferedStreamReader), it is not used in the creation of the XmlTextReader itself. Of course, the encoding cannot be provided directly because the XmlTextReader class doesn't have a constructor that takes an Encoding argument (which I think it should, really), so you need to create a StreamReader object with the correct encoding and construct the XmlTextReader on top of that.
With this, it became clear that getting the messages to process correctly was not going to be possible by simply selecting the proper Charset. Instead, Sam was able to work around the problem by creating a custom pipeline component that actually transcoded the message from UTF-16 to UTF-8, and running it in the decode stage before the disassembler.
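As a rough illustration of that approach (not the actual BizTalk code), the sketch below wraps the stream in a StreamReader built with the known encoding and constructs the XmlTextReader on top of it. The class and method names (ExplicitEncodingDemo, ReadRootName) and the sample document are assumptions made for the example.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class ExplicitEncodingDemo
{
    // Reads the root element name, decoding the stream with the encoding the
    // caller already knows (for example, one derived from the part's Charset).
    public static string ReadRootName(Stream data, Encoding encoding)
    {
        // The StreamReader does the byte-to-character decoding, so the
        // XmlTextReader never has to guess the encoding from raw bytes.
        // A BOM, if present, still takes precedence over the given encoding.
        using (var text = new StreamReader(data, encoding))
        using (var xml = new XmlTextReader(text))
        {
            xml.MoveToContent();
            return xml.LocalName;
        }
    }

    public static void Main()
    {
        // Hypothetical sample: UTF-16LE, no BOM, no XML declaration.
        byte[] bytes = new UnicodeEncoding(false, false).GetBytes("<root>hello</root>");
        Console.WriteLine(ReadRootName(new MemoryStream(bytes), Encoding.Unicode));
    }
}
```

Had the NamespaceTranslatorStream constructor done something along these lines with the encoding it receives, the Charset set on the body part would presumably have been honored during disassembly as well.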
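I don't have Sam's component at hand, but the core of such a transcoding step can be sketched roughly as follows. This is only the stream conversion, under the assumption that the source encoding is known up front; the names (TranscodeDemo, ToUtf8) are hypothetical, and a real pipeline component would also have to implement the BizTalk component interfaces and hand the new stream back on the message's body part.

```csharp
using System;
using System.IO;
using System.Text;

public static class TranscodeDemo
{
    // Re-encodes the source stream as UTF-8 (without a BOM) into a new
    // in-memory stream positioned at the start.
    public static MemoryStream ToUtf8(Stream source, Encoding sourceEncoding)
    {
        var output = new MemoryStream();
        var reader = new StreamReader(source, sourceEncoding);

        // leaveOpen: true keeps the MemoryStream usable after the writer is disposed.
        using (var writer = new StreamWriter(output, new UTF8Encoding(false), 4096, true))
        {
            var buffer = new char[4096];
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                writer.Write(buffer, 0, read);
            }
        }

        output.Position = 0;
        return output;
    }

    public static void Main()
    {
        // Hypothetical sample: UTF-16LE, no BOM, no XML declaration.
        byte[] utf16 = new UnicodeEncoding(false, false).GetBytes("<root>hello</root>");
        MemoryStream utf8 = ToUtf8(new MemoryStream(utf16), Encoding.Unicode);
        Console.WriteLine(Encoding.UTF8.GetString(utf8.ToArray()));
    }
}
```

This sketch buffers the whole message in memory for simplicity, which may not be appropriate for large messages. It also assumes, as in Sam's case, that the document has no XML declaration; a declaration stating encoding="utf-16" would need to be adjusted as well.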