Nicholas Allen has been posting some really great entries on poison message handling in WCF, and specifically on the options the MSMQ binding offers for it. He also mentions that MSMQ v4 (Longhorn server?) will support some new features to make this easier: retry queues. Very interesting stuff.

One thing I wanted to add, hopefully complementing Nicholas' explanations a bit, is the relationship between poison messages and in-order processing.

The term "poison messages" might seem to relate to a kind of malicious attack; that is, something intentional. However, that's not necessarily true. A poison message is, in essence, anything that prevents you from continuing processing by failing repeatedly. It could be something entirely unintentional, even well intended, like receiving a malformed message from a trusted business partner that you still need to process.

If you don't need to guarantee in-order processing of the received messages, then poison messages can be dealt with very efficiently in an asynchronous fashion, like the one Nicholas describes: you move the problem message somewhere else and continue processing the messages on the incoming queue; meanwhile, you deal with the problem message, possibly fix it, and eventually feed it back into the incoming queue for processing.
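To make the asynchronous approach concrete, here is a minimal sketch in Python. It is not WCF or MSMQ code; it only models the idea with in-memory queues. The names `drain`, `parse_order`, and `MAX_ATTEMPTS` are made up for illustration, and I'm assuming a failure shows up as a `ValueError`.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical retry limit before a message counts as poison


def drain(incoming, handler, max_attempts=MAX_ATTEMPTS):
    """Process every message; set aside the ones that keep failing."""
    processed = []
    poison = deque()  # stands in for a separate poison queue
    while incoming:
        message = incoming.popleft()
        for _ in range(max_attempts):
            try:
                processed.append(handler(message))
                break  # handled successfully, move on to the next message
            except ValueError:
                continue  # failed, try again
        else:
            # Every attempt failed: move the message aside and keep going.
            poison.append(message)
    return processed, poison


def parse_order(message):
    # Hypothetical handler: an order is just an integer quantity here.
    return int(message)


incoming = deque(["10", "oops", "30"])
processed, poison = drain(incoming, parse_order)
```

Note that the message after the bad one gets processed right away, which is exactly why this only works when ordering doesn't matter.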

However, if you do need to guarantee in-order processing of received messages, this is not an option. If the poison message was a valid business message (like a malformed order from one of your big customers), you can't simply move the message and continue processing. Instead, the situation needs to be resolved in a synchronous fashion: you must inspect and resolve the problem message before you continue processing.

Normally, doing this requires some manual intervention from an operator or analyst, but it also requires your service to deal proactively with these situations. At the least, a responsible application will suspend all processing when it detects a poison message and raise alerts so that the operator knows a problem has occurred. Notice that it needs to suspend processing, but that doesn't mean it needs to stop receiving messages altogether. Normally, you'll still receive messages and queue them for later processing.
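Here is a rough sketch of that behavior in Python, again using an in-memory queue rather than MSMQ. The class `InOrderProcessor` and its members are hypothetical names, and I'm assuming that a failing message raises `ValueError` and that the alert is just a callback.

```python
from collections import deque


class InOrderProcessor:
    """Hypothetical in-order processor that suspends on a poison message."""

    def __init__(self, handler, max_attempts=3, alert=print):
        self.handler = handler
        self.max_attempts = max_attempts
        self.alert = alert          # callback used to notify the operator
        self.pending = deque()      # received but not yet processed
        self.suspended = False
        self.results = []

    def receive(self, message):
        # Receiving never stops; messages simply pile up while suspended.
        self.pending.append(message)
        self.process_pending()

    def process_pending(self):
        while self.pending and not self.suspended:
            message = self.pending[0]  # peek, so ordering is preserved
            if self._try_handle(message):
                self.pending.popleft()
            else:
                self.suspended = True
                self.alert(f"poison message, processing suspended: {message!r}")

    def _try_handle(self, message):
        for _ in range(self.max_attempts):
            try:
                self.results.append(self.handler(message))
                return True
            except ValueError:
                pass  # retry until the attempts run out
        return False


alerts = []
processor = InOrderProcessor(int, alert=alerts.append)
for message in ["1", "bad", "3"]:
    processor.receive(message)
```

The poison message stays at the head of the queue, and the message that arrived after it is accepted but left waiting behind it.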

The operator or analyst would then inspect the problem message. If it turns out to be unimportant (or if it was indeed a malicious message), he/she can discard it. If he/she determines it is a valid request message, he/she might need to fix it by hand or with appropriate tools (for example, remove an invalid character that was causing problems, or change the message encoding) and then feed it back to the service. In the latter case, you need to provide a way to put the message back into the processing queue in the right position (i.e. at the beginning), and in either case you need to provide a way for the operator to ask the service to resume processing of received messages.
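The operator-facing side could look something like the following Python sketch. `SuspendedService`, `discard_head`, `replace_head`, and `resume` are names I made up; the sketch assumes the service halted with the poison message still at the head of its pending queue, and that failures surface as `ValueError`.

```python
from collections import deque


class SuspendedService:
    """Hypothetical service that halted on the message at the head of `pending`."""

    def __init__(self, pending, handler):
        self.pending = deque(pending)
        self.handler = handler
        self.suspended = True
        self.results = []

    def discard_head(self):
        # The operator decided the problem message is not worth processing.
        return self.pending.popleft()

    def replace_head(self, fixed):
        # The operator repaired the message; it goes back in the same position.
        self.pending.popleft()
        self.pending.appendleft(fixed)

    def resume(self):
        # Restart in-order processing; suspend again if another message fails.
        self.suspended = False
        while self.pending:
            try:
                result = self.handler(self.pending[0])
            except ValueError:
                self.suspended = True
                break
            self.results.append(result)
            self.pending.popleft()
        return not self.suspended


# "1O" has a letter O where a zero was intended.
service = SuspendedService(["1O", "20", "30"], int)
service.replace_head("10")
resumed = service.resume()
```

Because the fixed message is placed back at the front, the messages that queued up behind it are still processed in their original order.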

One possible way to support feeding messages to the service for immediate processing, without having to mess around with the incoming queues, is to provide private alternative endpoints that bypass the normal receive queues. Fortunately, doing this with WCF is easy because you can expose a single service through multiple endpoints using different bindings; for example, you can expose an alternative HTTP-based endpoint besides the main MSMQ-based one. In BizTalk, one would do it by having an alternative receive location (for this purpose a FILE receive location usually works very well).

One thing to keep in mind is that these alternate endpoints must be private, unadvertised, and kept secure, as they could be used by rogue agents to disrupt your in-order processing mechanisms.


Tomas Restrepo

Software developer located in Colombia. Sr. PFE at Microsoft.