WCF: Be careful with allowCookies="true"

I always enjoy a good troubleshooting challenge, and recently my good friend and colleague Martin Jalaf came up with an interesting one.

A customer was load testing a Web application. Like many others, it consisted of a farm of load-balanced servers running an ASP.NET front-end application, which in turn talked to a set of WCF services running on a second server farm. In both cases, an F5 device was used to load balance the HTTP traffic.

When the customer increased the load as part of their tests, they noticed that calls to the WCF services would get slower, sometimes taking upwards of 10 minutes to complete. However, the servers running the WCF services, as well as the SQL Server behind them, showed little CPU usage.

Reviewing a few memory dumps of the front-end servers, all we could see were threads stuck calling the WCF service. Given this, our first suspicion was that WCF throttling was incorrectly configured. However, dumps of the WCF service revealed this was not the case: in fact, only a single call to the WCF service was running at any given time, and the throttling limits were far too high to be causing any issues. This was easy to confirm using the excellent netext debugging extension:

0:000> !wservice
Address State EndPoints BaseAddresses Behaviors Throttled Calls/Max Sessions/Max ConfigName,.NET Type
000000e9229c78d8 Opened 0n1 0n1 0n10 True 0n1/0n2000 0n0/0n400 "WCFSecurity.SecurityService",WCFSecurity.SecurityService

The WCF service was configured with InstanceContextMode=Single and ConcurrencyMode=Multiple, but this did not appear to be related at all. Another unexpected symptom was that, despite the WCF servers sitting behind a load balancer, most requests ended up being processed by the same server. I immediately suspected that the F5 device was configured with session affinity enabled, even though the customer had initially denied it.
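
For reference, that combination corresponds to a service class attributed roughly like this (the contract interface name is my assumption):

using System.ServiceModel;

// One shared service instance, with multiple concurrent calls allowed into it.
[ServiceBehavior(InstanceContextMode = InstanceContextMode.Single,
                 ConcurrencyMode = ConcurrencyMode.Multiple)]
public class SecurityService : ISecurityService {
   // ...
}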

Since the WCF service was exposed using an HTTP-based binding, I asked Martin to check the ASP.NET HttpContext instances, just in case they were somehow relevant. We immediately found that there were hundreds of HttpContext instances in the dump that had yet to complete!

0:000> !whttp
HttpContext Thread Time Out Running Status Verb Url
000000e622a12850 -- 00:00:00 00:20:52 200 POST /WCFSecurity/SecurityService.svc
000000e622a18e58 -- 00:00:00 00:20:47 200 POST /WCFSecurity/SecurityService.svc
000000e622a23c58 -- 00:00:00 00:20:39 200 POST /WCFSecurity/SecurityService.svc
....

Looking closely, we realized that all HTTP calls were stuck in the AcquireRequestState stage of the ASP.NET pipeline: ASP.NET serializes all requests that share a session ID while one of them holds the session with write access, so we were actually observing an instance of the infamous ASP.NET Session Lock issue (see the section titled Concurrent Requests and Session State)!

Session: cq2uoplja4aqu2aoobw3f1pf
HttpContext Thread Poll (ms) Time Path
================ ====== ========= ==== ===================================
000000e622a12850 500 1253 /WCFSecurity/SecurityService.svc
000000e622a18e58 500 1248 /WCFSecurity/SecurityService.svc
000000e622a23c58 500 1240 /WCFSecurity/SecurityService.svc
000000e622a25d50 500 1238 /WCFSecurity/SecurityService.svc
000000e622a27b60 500 1236 /WCFSecurity/SecurityService.svc
000000e622a283f8 500 1236 /WCFSecurity/SecurityService.svc
....

Why was ASP.NET Session being used here? The service was configured with:

<system.serviceModel>
  <serviceHostingEnvironment aspNetCompatibilityEnabled="true" />
</system.serviceModel>

This meant the service was running with access to ASP.NET services such as session state, even though the service itself made no use of them. What was really unexpected was that all requests coming from a single front-end server carried the exact same ASP.NET session cookie, particularly because, by default, the client side of WCF will not accept or send cookies at all.

Given the issues with load-balancing not working as expected, we also checked the cookies collection of the HttpContext, and found out that, indeed, the F5 server had session affinity enabled using cookies:

0x000000e922d5ced0 System.Collections.Specialized.NameObjectCollectionBase+NameObjectEntry
0000 Key : 000000e922d5cc48 "BIGipServerpl_ATV_qa_80" [23] (System.String)
0008 Value : 000000e922d5cc00 (System.Web.HttpCookie)

With all this data, the light bulb went on: the WCF binding used by the ASP.NET front-end application had to have the allowCookies property enabled! We looked at it and, sure enough, it did. However, this only explained why session cookies were present, not why the same session ID was used for all requests.

Digging a bit into the WCF source code, we realized that cookie support is implemented at the ChannelFactory level in the HTTP transport. The ASP.NET front-end application was using standard proxy classes derived from ClientBase&lt;T&gt;, which meant that channel factory caching was enabled by default. As a result, within a given W3WP.EXE process on the front end, every single call to the WCF service ended up reusing the same HTTP cookies.
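
For reference, the offending client-side setting would have looked something like this (the binding name and address are made up):

<system.serviceModel>
   <bindings>
      <basicHttpBinding>
         <binding name="SecurityServiceBinding" allowCookies="true" />
      </basicHttpBinding>
   </bindings>
   <client>
      <endpoint address="http://wcffarm/WCFSecurity/SecurityService.svc"
                binding="basicHttpBinding"
                bindingConfiguration="SecurityServiceBinding"
                contract="ISecurityService" />
   </client>
</system.serviceModel>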

Conclusion

Once the customer configured the binding with allowCookies="false" and fixed the other issues found, application performance improved significantly!

The moral of the story is: be careful when you enable the allowCookies option when calling WCF services over HTTP. In this case, the combination of allowCookies, channel factory caching, aspNetCompatibilityEnabled, and an incorrectly configured load balancer completely killed the application's ability to scale.

 

WCF Messages Not Getting Closed

I spent some time yesterday evening tracking down an issue with a custom WCF transport channel. This particular channel would support the IInputChannel channel shape, but with a twist: it would return a custom implementation of System.ServiceModel.Channels.Message instead of one of the built-in WCF implementations.

There’s nothing wrong with that and it was working just fine in most cases, until I ran into a few scenarios where the OnClose() method of my custom Message class wasn’t being called at all.

After some digging, I discovered that the specific messages this was happening to were not being processed normally by the ServiceHost infrastructure. In particular, they were not being dispatched because no method in the service contract matched the action in the messages, so instead the UnknownMessageReceived event in the ServiceHost instance was being raised.

Our UnknownMessageReceived implementation wasn't closing messages explicitly, but that was easily corrected, so no problem, right? Wrong. It turns out that in WCF 3.0/3.5 (I haven't checked 4.0 yet), if you do not have an event handler attached to UnknownMessageReceived, the messages won't get closed either.
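
The correction itself is trivial; a minimal sketch, assuming you build the host yourself (MyService is a placeholder), looks like this:

ServiceHost host = new ServiceHost(typeof(MyService));
host.UnknownMessageReceived += delegate(object sender, UnknownMessageReceivedEventArgs e) {
   // Inspect or log e.Message as needed, then close it explicitly
   // so any resources held by the message are released.
   e.Message.Close();
};
host.Open();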

This seems like a bad bug to me: since Message implements IDisposable, there's clearly an expectation that it may hold resources that should be released as soon as possible, so never calling Dispose() or Close() will leak resources and potentially cause trouble.

Attaching an event handler to UnknownMessageReceived isn't always an option, and even if it were, there's no reason why WCF itself shouldn't guarantee that messages are closed as soon as they aren't needed.

Using Bindings Without a Binding Configuration

A WCF question came up recently on StackOverflow about using a binding for a custom transport through configuration instead of code. My initial reply was that you needed the right configuration classes associated with your binding, so that you could register the corresponding binding extensions.

The question then evolved into something more specific: once the binding collection element is created and registered as an extension, what if you want to use it on an endpoint by specifying only the binding, and not a bindingConfiguration (like you can do with the standard bindings)? In other words, do this:

<system.serviceModel>
   <extensions>
      <bindingExtensions>
         <add name="myBinding"
              type="MyNamespace.MyBindingCollectionElement, MyBinding" />
      </bindingExtensions>
   </extensions>
   <services>
      <service name="server">
         <endpoint address="my://localhost:8484" binding="myBinding">
         </endpoint>
      </service>
   </services>
</system.serviceModel>

In theory this should just work, but it didn't for the original poster. I had a sample app set up for one of my custom WCF transport channels, and had tested this exact scenario, so I knew something was amiss, but I couldn't spot it at first. That is, until the original poster uploaded a sample app and I could reproduce the problem.

The solution was simple, but unexpected: it turns out that WCF won't recognize your custom binding name unless a <bindings> section exists in your configuration file. Mind you, you don't actually need to add a configuration for your custom binding there at all; an empty bindings section is enough. In other words, just adding this fixes the problem:

<bindings/>

Comparing Text and Binary Serialization in WCF

A question I recently ran into asked about comparing message sizes between the default Text Message Encoder used by most bindings in WCF and the Binary Message Encoder used by the NetTcpBinding, without going through the entire WCF stack and resorting to a protocol analyzer.

Here’s a sample application that just serializes an object using the DataContractSerializer and then runs it through both message encoders and presents the results:


using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Runtime.Serialization;
using System.ServiceModel;
using System.ServiceModel.Channels;
using System.Text;

namespace BinSerializationTest {
   [DataContract]
   class SampleDataContract {
      [DataMember]
      public string Member1;
      [DataMember]
      public int Member2;
      [DataMember]
      public string Member3;
   }

   class Program {
      static void Main(string[] args) {
         try {
            SampleDataContract dc = new SampleDataContract {
               Member1 = "Sample string 1",
               Member2 = 142234,
               Member3 = "Different data here"
            };
            byte[] xml = SerializeToXml(dc);
            Console.WriteLine("DC serialized to XML resulted in {0} bytes", xml.Length);
            byte[] bin = SerializeToBin(dc);
            Console.WriteLine("DC serialized to binary resulted in {0} bytes", bin.Length);
         } catch ( Exception ex ) {
            Console.WriteLine(ex);
         }
      }

      static byte[] SerializeToXml<T>(T obj) {
         Message msg = ObjectToMessage(obj);
         MessageEncodingBindingElement mebe = new TextMessageEncodingBindingElement();
         mebe.MessageVersion = MessageVersion.Soap12;
         return Serialize(msg, mebe.CreateMessageEncoderFactory());
      }
      static byte[] SerializeToBin<T>(T obj) {
         Message msg = ObjectToMessage(obj);
         MessageEncodingBindingElement mebe = new BinaryMessageEncodingBindingElement();
         mebe.MessageVersion = MessageVersion.Soap12;
         return Serialize(msg, mebe.CreateMessageEncoderFactory());
      }
      static byte[] Serialize(Message msg, MessageEncoderFactory factory) {
         MessageEncoder enc = factory.Encoder;
         MemoryStream stream = new MemoryStream();
         enc.WriteMessage(msg, stream);
         return stream.ToArray();
      }
      static Message ObjectToMessage<T>(T obj) {
         DataContractSerializer ser = new DataContractSerializer(typeof(T));
         return Message.CreateMessage(MessageVersion.Soap12, "", obj, ser);
      }
   }
}

Hope it’s useful!

WCF Channel and BeginTryReceive

I'm going over some WCF custom transport channels I worked on recently, and noticed something that immediately made me suspicious that something was wrong: using my custom channel listener, only a single message would be processed at a time, even though several messages were being fed to WCF.

That something was wrong became even more evident when I started messing around with the concurrency and instancing modes: while the settings were being honored, changing them didn't alter message processing at all.

Now, WCF has some pretty important options for throttling message processing on the service side, which are configurable through the ServiceThrottling behavior. Wanting to confirm my fears, I started tweaking the MaxConcurrentInstances setting and noticed that indeed, it didn’t seem to make any difference.
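
If you haven't used it, the throttle is normally adjusted through a service behavior along these lines (the numbers are purely illustrative):

<serviceBehaviors>
   <behavior name="throttled">
      <serviceThrottling maxConcurrentCalls="16"
                         maxConcurrentInstances="8"
                         maxConcurrentSessions="10" />
   </behavior>
</serviceBehaviors>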

Something interesting happened then: I rebuilt my test service class to use the async pattern instead, and what happened? Multiple messages started being processed concurrently, in accordance with the InstanceContextMode and ServiceThrottlingBehavior settings I had configured. Obviously, something was wrong with my channel listener implementation!

Digging further, and after a bunch of different tests and lots of printf debugging, I found out what the problem was:

In my tests, I was using a normal ServiceHost to run my custom channel and the test service implementation. The way the host interacts with the channels is such that WCF was eventually using the BeginTryReceive() and EndTryReceive() methods to receive the messages that were queued in the channels. All that implementation was working perfectly, except for a single detail:

The code that dispatched waiters that had called BeginTryReceive() was firing all the notifications on the same thread. That is, the dispatcher was waking up all waiters by walking the waiter list and synchronously calling the corresponding AsyncCallback for each one.

This, it turns out, is a big mistake: it essentially causes WCF to invoke your service implementation on a single thread, producing exactly the behavior I was seeing of one message processed at a time. Changing the service to use the async pattern meant that WCF could pick up a message, call the service, and pick up the next message right away, which made the issue go away (at least up to a point). That's why I was seeing different behaviors between the two modes.

The solution was simply to make sure the dispatcher would delegate notifying each waiter to a thread pool task, thus making each one independent.
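
In sketch form, the change looked roughly like this (the waiter type and member names are placeholders from my own code):

// Before: calling each waiter's callback synchronously here serialized
// all of WCF's message processing onto the dispatcher thread. Pushing
// each completion to the thread pool makes the waiters independent.
foreach ( PendingReceiveWaiter waiter in waiters ) {
   PendingReceiveWaiter w = waiter; // capture for the anonymous delegate
   ThreadPool.QueueUserWorkItem(delegate { w.Complete(); });
}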

Thinking about this after the fact, I can see it should've been obvious from the start, and it was a really stupid mistake on my part. However, I can't help but wonder if the WCF model gives way too much responsibility to the underlying channel. Just look at how a simple mistake here can completely change the WCF execution model and throw all the throttling off.

A Gotcha Writing Messages

I’ve been working a lot with low-level contracts in Windows Communication Foundation that send and receive raw System.ServiceModel.Channels.Message objects. These are fairly flexible, though the API can be a bit confusing.

One of the things I was doing was making sure I was doing proper streaming of large messages, which in WCF relies on the XmlReader and XmlWriter classes.

What is not very obvious at first is that it is possible to receive a Message object and still do streaming: when the received message is created, it wraps an XmlReader that is only consumed when you ask WCF to write the message out to an XmlWriter.
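
In sketch form (the file paths are made up), reading and re-writing a streamed message looks like this:

// Creating the message just wraps the reader; nothing is consumed yet.
XmlReader reader = XmlReader.Create(File.OpenRead("input.xml"));
Message msg = Message.CreateMessage(MessageVersion.None, null, reader);

// Only as the body is written out does the message pull nodes from
// the reader and push them to the writer, one at a time.
using ( XmlWriter writer = XmlWriter.Create(File.Create("output.xml")) ) {
   XmlDictionaryWriter dictWriter = XmlDictionaryWriter.CreateDictionaryWriter(writer);
   msg.WriteBodyContents(dictWriter);
   dictWriter.Flush();
}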

Internally, the message reads each node from the XmlReader and writes it to the XmlWriter, which works very well for streaming. But it can also trip you up sometimes, as happened to me recently:

My WCF service was receiving a streamed XML file, which went up to the Message-based contract, and then I streamed the message out to a file. With some message payloads, I'd sometimes get the following error:

System.InvalidOperationException: Token Text in state EndRootElement would result in an invalid XML document. Make sure that the ConformanceLevel setting is set to ConformanceLevel.Fragment or ConformanceLevel.Auto if you want to write an XML fragment.

I opened up the original message and, as far as I could see, it was perfectly well-formed XML. Then I opened the file in a hex editor and realized it had a Line Feed (\n) character right at the end, after the closing tag!

You can, of course, fix the error by making sure the source doesn't have any characters after the closing tag. However, the exception also suggests another workaround, which is more useful when you don't control the input, as in a WCF service: set the XmlWriter to the Fragment conformance level, which won't complain about the extra \n at the end.

Here's a simple console application that exhibits the problem:

using System;
using System.IO;
using System.ServiceModel;
using System.ServiceModel.Channels;
using System.Xml;

class Program {
   public static void Main() {
      string xml = "<doc><part1>Start</part1><part2>End</part2></doc>\n";
      MessageEncoder encoder = CreateEncoder();
      Message msg = encoder.ReadMessage(ToStream(xml), Int32.MaxValue);

      try {
         XmlWriterSettings set = new XmlWriterSettings();
         // Uncomment this line to avoid the exception:
         //set.ConformanceLevel = ConformanceLevel.Fragment;
         XmlWriter xmlWriter = XmlWriter.Create(Console.Out, set);
         XmlDictionaryWriter xmlDictWriter =
            XmlDictionaryWriter.CreateDictionaryWriter(xmlWriter);
         msg.WriteBodyContents(xmlDictWriter);
         xmlDictWriter.Flush();
      } catch ( Exception ex ) {
         Console.WriteLine("\n****************");
         Console.WriteLine(ex.ToString());
         Console.WriteLine("****************");
      }
   }

   static Stream ToStream(String text) {
      MemoryStream stream = new MemoryStream();
      StreamWriter writer = new StreamWriter(stream);
      writer.Write(text);
      writer.Flush();
      stream.Position = 0;
      return stream;
   }

   static MessageEncoder CreateEncoder() {
      TextMessageEncodingBindingElement tmebe =
         new TextMessageEncodingBindingElement();
      tmebe.MessageVersion = MessageVersion.None;

      MessageEncoderFactory factory = tmebe.CreateMessageEncoderFactory();
      return factory.Encoder;
   }
}

The Case of the Unused Input Channel

While working on a custom WCF transport channel implementation, I ran into a problem that baffled me for a while. As usual, the problem ended up being my own fault, but it was hard to figure out what was actually going wrong :-).

The problem revolved around the ChannelListener for the service side of the transport, which was being driven by a normal WCF-provided ServiceHost (as opposed to using the channel model directly). After lots of testing and stepping around in the debugger, the puzzling behavior looked like this:

  1. The ServiceHost would open my ChannelListener implementation, which in turn would mean my OnOpen() method would get executed.
  2. The host would then do a BeginAcceptChannel() call on my listener with an infinite timeout. The listener gladly processed the request and returned the corresponding IAsyncResult object.
  3. Meanwhile, the channel listener was busy doing its thing and, within a few hundred milliseconds, would produce a new IInputChannel and signal the WaitHandle returned by the BeginAcceptChannel() call.
  4. The host would wake up after the wait handle was signaled and call the listener's EndAcceptChannel(), which promptly returned the newly queued input channel instance.
  5. … And that was it. Nothing else happened. The host would not call any method on the returned IInputChannel. Not Open(), much less posting a wait or a receive for a message. Obviously, the whole thing would grind to a halt at this point, but no errors were raised, the service host did not fault, and nothing suspicious was logged to the ServiceModel activity traces.

What was going on? Well, it turned out that the OnOpen() implementation in my ChannelListener was blocking instead of returning control right away. I'm not sure if this was an effect of how the async variant of the call works, but the end result was that, as far as the service host was concerned, the listener had not completed initializing, so it just didn't do anything else.

What really confused me while troubleshooting this problem was that, because BeginAcceptChannel() was getting called, it didn't even occur to me to look into the opening of the listener; I had just assumed that if the host was trying to accept a channel, the listener must have opened successfully. Turns out it isn't that clear cut.

Once I fixed the blocking open of the channel listener, the host started calling my input channel methods again and everything went back to normal. Lesson learned.
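
The shape of the fix, as a hedged sketch (StartReceiveLoop() stands in for whatever kicks off your transport's message pump):

protected override void OnOpen(TimeSpan timeout) {
   // Set up transport resources and start the receive loop
   // asynchronously, then return promptly. Blocking here makes the
   // ServiceHost believe the listener never finished opening.
   StartReceiveLoop();
}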

SvcUtil: Endpoint Cannot Be Configured

I’ve been working the past week on some custom WCF transport channels. It’s been pretty instructive, though it can be a bit of a frustrating experience at times.

The past couple of days, I’ve been busy getting metadata/policy export/import working correctly, and ran into an error when using SvcUtil.exe to import my service metadata:

Attempting to download metadata from 'http://localhost:6466/Metadata/' using WS-Metadata Exchange or DISCO.
Error: An error occurred in the tool.

Error: The binding on the service endpoint cannot be configured.
Parameter name: endpoint.Binding

This is a pretty generic error message that, well, doesn't really tell you anything. SvcUtil.exe (and its related Visual Studio feature) is a key piece of the WCF developer toolkit but, unfortunately, it's (a) full of bugs and (b) has horrible error handling, which makes diagnosing metadata import problems incredibly frustrating.

As usual, the problem turned out to be mine: I had registered my IPolicyImportExtension/IWsdlImportExtension implementation in a custom app.config file that I was feeding to SvcUtil through the /svcutilConfig argument. That part worked: the importer was getting invoked and executed correctly.

However, I had forgotten to register the bindingExtension and bindingElementExtension for my custom binding and custom transport binding element, which prevented SvcUtil from serializing the custom binding to the configuration XML.
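
The missing registration amounts to something like this (both type names are hypothetical):

<extensions>
   <bindingExtensions>
      <add name="myBinding"
           type="MyNamespace.MyBindingCollectionElement, MyBinding" />
   </bindingExtensions>
   <bindingElementExtensions>
      <add name="myTransport"
           type="MyNamespace.MyTransportElement, MyBinding" />
   </bindingElementExtensions>
</extensions>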

Once I fixed that, it worked right away. I just wish that SvcUtil could have generated a clearer error message pointing out what the problem actually was.

IRequestChannel and One-Way Contracts

I’ve been working this week on implementing some custom WCF transport channels. The one I was working on initially is a two-way send channel, so, naturally, I decided to begin by having my channel implement the IRequestChannel shape.

This was working fine until I decided to try a one-way operation on top of my two-way channel, which was interesting because I had not implemented the equivalent part of the transport supporting IOutputChannel.

I was expecting it to fail right from the start but, surprisingly enough, it almost worked. That is, the transport channel was called and executed, and things only bombed when the WCF stack ran into the unexpected response message. The error I was seeing was something like this:

System.ServiceModel.ProtocolException: The one-way operation returned a non-null message with Action=''.

Looking around, I realized a couple of things: First of all, the built-in HTTP transports in WCF also implement only IRequestChannel, and yet, somehow, they still manage to work just fine in this case.

The second thing that came to mind was that I was using a custom MessageEncoder. Granted, it was a very rough and crappy encoder, but it was mostly enough for me to test a few things about my custom transport.

That last part sparked something deep in my memory: I seemed to vaguely remember something about encoders that might be related, so I turned right away to the best source for everything WCF related: Nicholas Allen’s blog.

Fortunately, Dr. Nick did not disappoint! In one of his articles about writing a custom POX Message Encoder, he included this little gem:

“One of the properties of the standard message encoders is that they only surface up messages that have a non-empty message body.”

That was indeed what was happening: my custom transport was returning a message from IRequestChannel.Request(), but it had no content at all (something particular to the kind of request I was making over the one-way channel).

Once I changed my custom encoder to ignore the empty message, it worked just fine.
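
The change in my encoder amounted to something like this (a sketch of one ReadMessage overload; DecodeMessage() stands in for the existing decoding logic):

public override Message ReadMessage(ArraySegment<byte> buffer,
                                    BufferManager bufferManager,
                                    string contentType) {
   // Mimic the built-in encoders: an empty body means "no message",
   // which is exactly what a one-way operation expects back.
   if ( buffer.Count == 0 ) {
      bufferManager.ReturnBuffer(buffer.Array);
      return null;
   }
   return DecodeMessage(buffer, bufferManager, contentType);
}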

Net.MSMQ and Poison Messages

The Net.MSMQ Binding in Windows Communication Foundation has some built-in support for detecting and handling "Poison Messages".

One of the most useful ways of handling Poison Messages in queuing systems is to get them out of the way so that following messages can still be processed (if you're not doing in-order processing), but still keep the messages around so that they are not lost.

Unfortunately, this behavior is only supported by WCF in MSMQ 4.0 with the use of a poison sub-queue, which is new in the Vista and WinServer2k8 versions of MSMQ. It's really a shame because, honestly, this could've been implemented on MSMQ 3.0 just by using a separate queue instead of sub-queues.

Instead, we're stuck with the PoisonErrorHandler sample included in the SDK. The sample mimics this behavior by introducing an IErrorHandler implementation that detects the poison message and moves it to another queue. To make this work, you need to set the binding properties so that the ServiceHost listening to the queue faults when a poison message is detected. The error handler then moves the message to a new queue and starts a new service host instance to resume processing of the following messages.
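
On MSMQ 3.0, that means configuring the binding to fault the host once the retries are exhausted, something like this (the retry values are illustrative):

<netMsmqBinding>
   <binding name="poisonAware"
            receiveRetryCount="2"
            maxRetryCycles="2"
            retryCycleDelay="00:00:30"
            receiveErrorHandling="Fault" />
</netMsmqBinding>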

The sample itself isn't all that bad, but it does highlight a rather significant omission in the way the Net.MSMQ Binding reports poison messages: all it provides the error handler with is the MSMQ LookupId of the poison message. Unfortunately, the LookupId is queue-specific, and the exception does not say which queue, or at least which endpoint, received the poison message, which is a huge gaping hole in this feature.

Because of that, the LookupId is useless without more information provided out-of-band [1]. The sample works around this limitation by adding entries to the <appSettings/> section of the app.config file with the path to the queue messages are received on and the path to the queue poison messages should be moved to.

This works OK until you need a single service host listening on multiple endpoints; then the sample won't really work anymore, because you can't tell which of the endpoints caused the error.

Working around the limitation

I spent some time recently playing around with WCF and ran into this same problem, so I tried to find a way of working around this limitation and still use the basic functionality the sample provided.

I found one possible workaround, which has held up so far in my limited tests. I'm not sure yet how well it will hold in general, as it seems to me there's always the possibility of a race condition here, but let me at least illustrate the basic mechanics.

First, we keep the basic code of the PoisonErrorHandler class. However, instead of using appSettings to keep track of the queue we're listening on, we dynamically find out which of the endpoints configured for the service caused the error, and extract the path to the queue from the URI of the endpoint address.

To do this, my implementation of IErrorHandler.HandleError() grabs the current OperationContext, reaches out to the ServiceHost associated with it, and then iterates over the ChannelDispatchers attached to it. The one in the Faulted state is very likely the one that raised the MsmqPoisonMessageException in the first place.

The code could look something like this:

private String FindFaultedQueue() {
   ServiceHostBase host = OperationContext.Current.Host;
   Uri faultedQueue = null;
   // the queue that fired the exception will be the faulted one
   foreach ( ChannelDispatcher chd in host.ChannelDispatchers ) {
      if ( chd.State == CommunicationState.Faulted ) {
         faultedQueue = chd.Listener.Uri;
      }
   }
   // translate URI into a regular MSMQ format name 
   StringBuilder fn = new StringBuilder("FormatName:DIRECT=OS:");
   fn.Append(faultedQueue.Host).Append("\\");
   string path = faultedQueue.PathAndQuery;
   if ( path.StartsWith("/private/") ) {
      fn.Append("private$\\");
      path = path.Substring("/private".Length);
   }
   path = path.Substring(1);
   fn.Append(path);
   return fn.ToString();
}

This lacks error handling and a few other things, but it illustrates the point. Now that we have the FormatName of the queue the poison message came from, we can try to move it to the poison message queue. We can handle this by convention: for queue Q1, the corresponding poison message queue would be Q1Poison. The rest of the sample stays pretty much the same.
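
The move itself can be done with System.Messaging, along these lines (a minimal sketch; the method name and the queue naming convention are mine):

using System.Messaging;

static void MovePoisonMessage(string queueFormatName, long lookupId) {
   using ( MessageQueue source = new MessageQueue(queueFormatName) )
   using ( MessageQueue poison = new MessageQueue(queueFormatName + "Poison") )
   using ( MessageQueueTransaction tx = new MessageQueueTransaction() ) {
      tx.Begin();
      // Pull the poison message out by its queue-specific LookupId and
      // park it in the companion poison queue, inside one transaction
      // so the message can't be lost in between.
      System.Messaging.Message msg = source.ReceiveByLookupId(
         MessageLookupAction.Current, lookupId, tx);
      poison.Send(msg, tx);
      tx.Commit();
   }
}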

I probably wouldn't use this in production, as it doesn't feel very reliable given the basic limitations imposed by the WCF implementation, but it was interesting to take a stab at. YMMV.

[1] I'll leave to you, dear reader, to wonder about the sense of providing a sample that so clearly points out such a giant hole in a product feature rather than actually fixing it. Oh wait, guess they did fix it by changing MSMQ for Vista and WinServer2k8.
