The Net.MSMQ Binding in Windows Communication Foundation has some built-in support for detecting and handling "Poison Messages".

One of the most useful ways of handling Poison Messages in queuing systems is to get them out of the way so that following messages can still be processed (if you're not doing in-order processing), but still keep the messages around so that they are not lost.

Unfortunately, this behavior is only supported by WCF in MSMQ 4.0 with the use of a poison sub-queue, which is new in the Vista and WinServer2k8 versions of MSMQ. It's really a shame because, honestly, this could've been implemented on MSMQ 3.0 just by using a separate queue instead of sub-queues.

Instead, we're stuck with the PoisonErrorHandler sample included in the SDK. The sample provides a way to mimic this behavior by introducing an IErrorHandler implementation that detects the poison message and moves it to another queue. To be able to do this, you need to set the binding properties so that the ServiceHost listening to the queue faults when the poison message is detected. The error handler then moves the message to a new queue and starts a new service host instance to resume processing of following messages.

The Sample itself isn't all that bad, but does provide insight into a rather significant feature left out of the way the Net.MSMQ Binding reports poison messages: All it provides the error handler with is the MSMQ LookupId of the poison message. Unfortunately, this LookupId is queue-specific and the exception does not provide information on which queue or at least which endpoint was the one that received the poison message, which is a huge gaping hole in this feature.

Because of that the LookupId is useless without more information provided out-of-band [1]. The sample works around this limitation by adding an entry in the section of the app.config file with the path to the queue where messages are being received and the path to the queue to move poison messages to.

This works OK, until you need a single service host listening on multiple endpoints, and then the sample won't really work anymore because you don't know which of the endpoints caused the error.

Working around the limitation

I spent some time recently playing around with WCF and ran into this same problem, so I tried to find a way of working around this limitation and still use the basic functionality the sample provided.

I found one possible workaround, which seems to work so far in my limited tests. I'm not sure yet how well it will hold, as it seems to me there's always the possibility of some race conditions here, but let me illustrate at least the basic mechanics.

First we keep the basic code of the PoisonErrorHandler class. However, instead of using appSettings to keep track of the queue we're listening to, we just try to find out dynamically which endpoint configured for the service is causing the error, and extract the path to the queue from there based on the URI of the endpoint address.

To do this, in my implementation of IErrorHandler.HandleError() I grab the current OperationContext, reach out to the ServiceHost associated with it, and then iterate over the ChannelDispatchers attached to it. The one that's in the Faulted state is very likely the one that caused the MsmqPoisonMessageException in the first place.

Code could be something like this:

private String FindFaultedQueue() {
   ServiceHostBase host = OperationContext.Current.Host;
   Uri faultedQueue = null;
   // the queue that fired the exception will be the faulted one
   foreach ( ChannelDispatcher chd in host.ChannelDispatchers ) {
      if ( chd.State == CommunicationState.Faulted ) {
         faultedQueue = chd.Listener.Uri;
   // translate URI into a regular MSMQ format name 
   StringBuilder fn = new StringBuilder("FormatName:DIRECT=OS:");
   string path = faultedQueue.PathAndQuery;
   if ( path.StartsWith("/private/") ) {
      path = path.Substring("/private".Length);
   path = path.Substring(1);
   return fn.ToString();

This lacks some error handling and a few other things, but it illustrates the point. Now that we have the FormatName of the queue where the poison message came from, we can try to move it to the poison message  queue. We could handle this by convention: for queue Q1, the corresponding poison message queue could be Q1Poison. The rest of the sample is pretty much the same.

I probably wouldn't use this in production, as it doesn't feel very reliable given the basic limitations imposed by the WCF implementation, but it was interesting to take a stab at. YMMV.

[1] I'll leave to you, dear reader, to wonder about the sense of providing a sample that so clearly points out such a giant hole in a product feature rather than actually fixing it. Oh wait, guess they did fix it by changing MSMQ for Vista and WinServer2k8.

Technorati tags: , , ,

Tomas Restrepo

Software developer located in Colombia. Sr. PFE at Microsoft.