I've been playing for the last couple of days with BizTalk Server 2006, building a custom encoder pipeline component. One of the things I've been trying to do is to find a way to do all encoding operations in a streaming fashion, by building a pass-through stream implementation that only reads and encodes small portions of the message as they are read by the outbound adapter.

One of the options I've been experimenting with is doing partial reads and writes on an intermediate memory stream: instead of reading and encoding the entire body part of the message in a single pass in memory and then returning that from the encoder component, I only read from the original stream as much as the adapter asks for in the Stream.Read() implementation, encode and write that into the intermediate memory stream, and then read back from it and return it to the adapter.
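To make the idea a bit more concrete, here's a rough sketch of what such a pass-through stream could look like. Keep in mind this is a simplified illustration, not the actual component: EncodeChunk() is a made-up placeholder for whatever call the encoding library really exposes, and error handling is left out.

using System;
using System.IO;

// Simplified sketch of a pass-through encoding stream. EncodeChunk()
// stands in for the real encoding library call.
public class EncodingStream : Stream
{
    private readonly Stream source;                              // original message body stream
    private readonly MemoryStream encoded = new MemoryStream();  // encoded bytes not yet returned

    public EncodingStream(Stream source)
    {
        this.source = source;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Hand back leftover encoded bytes from a previous call first.
        if (encoded.Position < encoded.Length)
            return encoded.Read(buffer, offset, count);

        // Read at most 'count' bytes from the original stream.
        byte[] chunk = new byte[count];
        int read = source.Read(chunk, 0, count);
        if (read == 0)
            return 0;                                            // source exhausted

        // Encode that chunk into the intermediate memory stream...
        encoded.SetLength(0);
        EncodeChunk(chunk, read, encoded);                       // placeholder for the library call
        encoded.Position = 0;

        // ...and return as much of it as fits in the caller's buffer. Since
        // the encoder compresses, this will usually be less than 'count'.
        return encoded.Read(buffer, offset, count);
    }

    // Placeholder; the real component calls into the encoding library here.
    private static void EncodeChunk(byte[] data, int length, Stream output)
    {
        output.Write(data, 0, length);
    }

    // The rest of the Stream contract: read-only, forward-only.
    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}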

I realize it sounds a little convoluted, but it's the easiest way to do it with the library I'm using to do the encoding. One of the reasons this works fairly well is that the encoding process does some compression, so whatever I read and encode from the original stream will usually be smaller than what was originally asked for. For example, if someone tried to read 64KB from my custom encoding stream, I might return just a couple of KB even if there's further data in the original stream. Granted, it is not the most efficient implementation, but it ensures I use little memory during the encoding operations.

Note: This is not a violation of the .NET Stream contract; if you're reading from a stream you must be prepared to deal with partial reads. A partial read does not mean that you've reached the end of the stream, nor that a problem was encountered.
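For reference, this is the usual pattern for consuming a stream correctly in the face of partial reads (a plain copy loop, nothing BizTalk-specific about it):

using System.IO;

public static class StreamUtil
{
    // Read() may return fewer bytes than requested; that is neither an error
    // nor end-of-stream. End of stream is signaled by Read() returning 0.
    public static void CopyAll(Stream input, Stream output)
    {
        byte[] buffer = new byte[4096];
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            output.Write(buffer, 0, read);   // only 'read' bytes are valid
        }
    }
}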

Now, I know this works, as I unit tested the encoding stream and component in isolation (using my PipelineTesting library). However, when I went to try my custom pipeline component in a real messaging scenario with the File adapter, it failed miserably: the BizTalk host would pretty much crash (after a huge spike of 100% processor usage) with the following error: "The parameter is incorrect". Hmm, not very useful.

At this point I took out the debugger and attached to BTSNTSvc.exe to try and repro the error. I was able to track my custom pipeline component getting called, and see BizTalk read off my custom stream. That's when I noticed some weird things.

The first thing I noticed was that the file adapter (or is it the BizTalk messaging engine itself?) uses very small buffers to read off the message streams. Indeed, it only reads them in 4KB chunks. That seems rather small to me, particularly considering the fact that the FILE adapter is an unmanaged adapter, so each Read() call into the stream will cause unmanaged<->managed transitions, which are costly. I would've expected it to at least use buffers of 64KB, but maybe there's a good reason for that.

That by itself was not too much of an issue; my component was perfectly capable of dealing with that, and indeed I had unit tests using both 4KB and 64KB buffers (though I believe it is the cause of the poor performance and high CPU usage I noticed). The real problem was that my stream would almost always do a partial read, and the adapter seemed unable to cope with that, as it started asking for weird buffer sizes on consecutive Read() calls.
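Those tests basically boiled down to something like this (a simplified sketch of the idea, not the actual PipelineTesting-based tests; EncodingStream is the class sketched earlier):

using System;
using System.IO;

public class EncodingStreamHarness
{
    public static void Main()
    {
        // Drain the custom stream with both small and large read buffers,
        // the way an adapter might, and make sure it completes cleanly.
        foreach (int bufferSize in new int[] { 4 * 1024, 64 * 1024 })
        {
            Stream source = new MemoryStream(new byte[5 * 1024 * 1024]);   // 5MB of dummy data
            Stream encoding = new EncodingStream(source);

            byte[] buffer = new byte[bufferSize];
            long total = 0;
            int read;
            while ((read = encoding.Read(buffer, 0, buffer.Length)) > 0)
                total += read;

            Console.WriteLine("Buffer size {0}: drained {1} encoded bytes", bufferSize, total);
        }
    }
}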

Here's a small table showing how the adapter called Read() each time (this was on a run with a 5MB file):

Iteration   Buffer Size   Offset   Count   Bytes Read
1           4096          0        4096    38
2           4096          0        4058    383
3           4096          0        3674    47
4           4096          0        3628    50
5           4096          0        3578    50
6           4096          0        3528    50

As you can probably guess by now, what seems to be happening is that if the stream does a partial read, then the next time around the adapter asks for a read of (buffer.Size - bytesRead) length (or close enough). Eventually, that length reaches zero if the stream hasn't been totally consumed, which in this case causes an exception as it is an invalid parameter value. I'm not sure if this is a bug in the file adapter, or if it's simply a side-effect of the way the managed<->unmanaged code interaction happens at the messaging engine, but I thought it was something worth looking at more closely.
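If I had to guess at the adapter-side read loop from that trace, it would look something like the snippet below. To be clear, this is pure speculation reconstructed from the observed call pattern, not actual BizTalk code, but it's enough to reproduce the same calling pattern against the stream outside of BizTalk:

using System;
using System.IO;

public static class AdapterPatternRepro
{
    // Mimics the observed calling pattern: after a partial read, the next
    // call only asks for the remainder of the 4KB buffer instead of
    // starting over with a full buffer.
    public static void Drain(Stream stream)
    {
        byte[] buffer = new byte[4096];
        int count = buffer.Length;
        int iteration = 0;
        while (count > 0)
        {
            int read = stream.Read(buffer, 0, count);
            Console.WriteLine("Iteration {0}: asked for {1}, got {2}", ++iteration, count, read);
            if (read == 0)
                break;          // the stream really is done
            count -= read;      // never reset, so repeated partial reads
                                // shrink it toward zero even though the
                                // stream still has data left
        }
        // Presumably, once that count hits zero (or worse) it gets passed
        // down to the unmanaged side, which would be where "The parameter
        // is incorrect" comes from.
    }
}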

I'm planning on working around this by doing some extra work to do full reads as much as possible and avoid the partial ones (the ones I'm doing now are obviously inefficient, though that's caused partly by the input file being highly compressible). Hopefully that will make this a non-issue.
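One way to do that (again, just a sketch, with the same caveats as the earlier code: EncodeChunk() is a placeholder, and this would replace the Read() implementation shown above) is to loop inside Read() until the caller's buffer is full or the source is exhausted:

public override int Read(byte[] buffer, int offset, int count)
{
    int total = 0;
    // Keep pulling and encoding source chunks until the caller's request is
    // satisfied or the source runs dry, so the adapter rarely sees a
    // partial read.
    while (total < count)
    {
        if (encoded.Position >= encoded.Length)
        {
            byte[] chunk = new byte[count];
            int read = source.Read(chunk, 0, count);
            if (read == 0)
                break;                             // no more input
            encoded.SetLength(0);
            EncodeChunk(chunk, read, encoded);     // placeholder for the library call
            encoded.Position = 0;
        }
        total += encoded.Read(buffer, offset + total, count - total);
    }
    return total;
}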
