Friday, October 22, 2010

Debugging my debugger - ICorDebug and the ramifications of thread suspension

Thanks

First of all, with this post I feel like I've come of age as a debugger. I have to thank Tess (http://blogs.msdn.com/tess) and various other WinDbg bloggers for starting me down this path in first place. Also, thanks to Mike Stall and the other ICorDebug gurus for such great blogging info. Thanks everyone!

The Need
Now, when I troubleshoot production systems in my job, I often find myself doing the same one debugging task over and over again:
1) attach to process with visual studio
2) set it to break on all CLR exceptions
3) wait
4) catch one, look at the call stack, look at the value of local variables and arguments
5) detach
6) fix something and go back to step 1

This is rather mundane, and one day I decided to see if I could automate this. I did. :) I call it "FirstChanceExceptionWatcher". Here's a rundown of the feature set:

The Solution
=============================================================================
FirstChanceExceptionWatcher project
=============================================================================

Usage:
This windows service captures all exceptions that happen in an IIS process,
and writes them to an XML file, along with a configurable amount of context
(call stack, value of strings in call stack, etc.)


Features:
- Runs as windows service
- Resilient and automated debugging
- Automatic re-attach after app pool recycles
- Automatic re-attach after IISReset
- Automatic re-attach after task manager kill of w3wp or aspnet_wp

- log4net rolling append (keeps x days worth of data, and deletes anything
older automatically)

- Logs to a "log" subfolder, next to FirstChanceExceptionWatcher.exe

- Capture first chance Exception. Write results of .ToString()

- Configurable capture of call stack, with parameter types and values (values only for string
or something with a .ToString() method)

- Configurable capture of local variables within methods that are on the call stack (values
only for string or something with a .ToString() method)


Setup:
- Build .sln in VS 2010
- Copy the contents of bin/Release onto a machine running an ASP.NET application.
- Tip: Make sure the drive has free space for logging. One day of logging
might be a negligible amount of logging (KB's) or a ton of logging (1 GB)
(depending on frequency of exceptions in application)
- Run install.bat
- Start service and stop service as desired
- Monitor log files to learn new information

Configuration:
- REQUIRED: Specify name of IIS app pool to attach to in
FirstChanceExceptionWatcher.config
()
- OPTIONAL: Adjust other settings in FirstChanceExceptionWatcher.config
(descriptions are in .config file)
- OPTIONAL: Adjust log4net rolling append or verbosity in
FirstChanceExceptionWatcher.exe.log4net
(Read about configuring log4net ver 1.2.10.x online for details)

Log files:
MainSvc.xml - Information about windows service (attaching/detaching from processes, OS and environment info, startup/shutdown, etc)
Events.xml - Captures debug information when exceptions occur.
Instr.xml - Writes timing information to measure how much the debugger is impacting the performance of
the system. How long are we in the callback methods doing work?

=============================================================================

Cool! Let's put it to use!

The First "Real" Test
So I installed it on a system where I was experiencing a bizarre threading issue. 1 in 10 times that I ran a certain piece of code, it would fail, all based on how busy the system was. The failure was some sort of a silent exception of an asynchronous long running thread. "If I could just see the exception", I thought to myself, "I could probably fix this in no time".

Well, let's just say the public use of the debugger did not go as planned.

The Adventure (aka WinDbg debugging)
Problem:
After a few minutes of debugging, the ASP.NET web site would hang and stop serving any requests for the users. If I turn off the debugging service, the site would begin to function normally again. If I turn it on, the same thing happens within a few moments. Obviously my debugger was causing some kind of a problem!

Other symptoms:
Perfmon showed typical request execution times for the web site (0-4 seconds). Then, suddenly, while hanging, would go to "---------".

The process ID is taking 40-50% CPU as it sits there and "hangs".

Troubleshooting:
I decided to resort back to my good friend WinDbg to debug my debugger. I took two hang dumps about 8 minutes apart to see what I could learn.

cscript adplus.vbs -hang -quiet -p 1239

I found a thread of particular interest in both dumps:

Taken at 12:17 PM:
OS Thread Id: 0x151c (25)
ESP EIP
199bc574 7c82860c [HelperMethodFrame: 199bc574]
199bc5cc 792e0e30 System.String.ConcatArray(System.String[], Int32)
199bc5e4 7928a99c System.String.Concat(System.String[])
199bc600 799cc6d2 System.Exception.ToString()
199bc614 799cc67c System.Exception.ToString()
199bc8ac 79e71b4c [GCFrame: 199bc8ac]
199bc8c8 79e71b4c [GCFrame: 199bc8c8]
199bcb40 79e71b4c [GCFrame: 199bcb40]
199bcb08 79e71b4c [GCFrame: 199bcb08]
199bcb24 79e71b4c [GCFrame: 199bcb24]
199bcda8 79e71b4c [FuncEvalFrame: 199bcda8]

Taken at 12:25 PM:
OS Thread Id: 0x151c (23)
ESP EIP
199bc574 7c82860c [HelperMethodFrame: 199bc574]
199bc5cc 792e0e30 System.String.ConcatArray(System.String[], Int32)
199bc5e4 7928a99c System.String.Concat(System.String[])
199bc600 799cc6d2 System.Exception.ToString()
199bc614 799cc67c System.Exception.ToString()
199bc8ac 79e71b4c [GCFrame: 199bc8ac]
199bc8c8 79e71b4c [GCFrame: 199bc8c8]
199bcb40 79e71b4c [GCFrame: 199bcb40]
199bcb08 79e71b4c [GCFrame: 199bcb08]
199bcb24 79e71b4c [GCFrame: 199bcb24]
199bcda8 79e71b4c [FuncEvalFrame: 199bcda8]

There are several things to notice about this call stack.

First, the managed thread ID changes (23 vs 25). I have never seen this before and am not sure what causes this, but it is interesting because the underlying operating system thread ID (ox151c) remains the same. So be warned, you should watch both.

Second, I actually know what this call stack is, because of the code I wrote into my debugger. When an exception occurs in my ICorDebug based debugger, it takes the exception object and sets up a function evaluation call on the current thread, calling System.Exception.ToString(). Also notice the phrase "FuncEvalFrame" at the base of the call stack in WinDbg.

Third, I know this is not another ToString() call on a different exception, because the OS thread ID and the address of the exception object itself stays the same (799cc67c).

Observations:
OK, so we know that an exception ToString() call is taking minutes to complete...but why? And why is that chewing the CPU?

Troubleshooting:
I ran !runaway to see what threads have been working the hardest:

Taken at 12:17 PM:
0:000> !runaway
User Mode Time
Thread Time
12:ce4 0 days 0:03:48.062
10:218 0 days 0:03:43.328
13:1194 0 days 0:01:36.562
25:151c 0 days 0:00:14.234
11:ea4 0 days 0:00:00.734
14:82c 0 days 0:00:00.578
2:a5c 0 days 0:00:00.078
4:1614 0 days 0:00:00.046
0:b94 0 days 0:00:00.046
5:164c 0 days 0:00:00.031

Taken at 12:25 PM:
0:000> !runaway
User Mode Time
Thread Time
12:ce4 0 days 0:07:14.031
10:218 0 days 0:03:43.328
13:1194 0 days 0:01:36.562
23:151c 0 days 0:00:14.234
11:ea4 0 days 0:00:00.734
14:82c 0 days 0:00:00.578
2:a5c 0 days 0:00:00.078
4:1614 0 days 0:00:00.046
0:b94 0 days 0:00:00.046
5:164c 0 days 0:00:00.031

Whoa! Thread 12 is obviously working while everyone else sits around. Who's thread 12?

!clrstack shows nothing. It's not a managed thread. So, I run !threads -special:

0:012> !threads -special
ThreadCount: 12
UnstartedThread: 0
BackgroundThread: 9
PendingThread: 0
DeadThread: 1
Hosted Runtime: no
PreEmptive GC Alloc Lock
ID OSID ThreadOBJ State GC Context Domain Count APT Exception
10 1 218 000d02c0 1808228 Enabled 274061c4:27408194 000d7040 0 MTA (Threadpool Worker)
14 2 82c 000ddd70 b228 Enabled 00000000:00000000 000d7040 0 MTA (Finalizer)
15 3 17d8 000f6920 80a228 Enabled 00000000:00000000 000d7040 0 MTA (Threadpool Completion Port)
16 4 10f4 000fa748 1228 Enabled 00000000:00000000 000d7040 0 Ukn
8 5 d90 000ff758 880a228 Enabled 00000000:00000000 000d7040 0 MTA (Threadpool Completion Port)
17 7 14b8 00149350 200b028 Enabled 00000000:00000000 000fae88 0 MTA
18 8 44c 0014c2e0 b028 Enabled 00000000:00000000 000fae88 0 MTA
XXXX c 0 1109d4c0 9820 Enabled 00000000:00000000 000d7040 0 Ukn
25 6 151c 110af340 180b220 Enabled 18cccab0:18cce0f8 000fae88 1 MTA (Threadpool Worker) System.Exception (28098c44) (nested exceptions)
26 b 1774 001365f0 8b228 Enabled 00000000:00000000 000fae88 1 MTA
27 9 1044 00136220 880b228 Enabled 00000000:00000000 000d7040 0 MTA (Threadpool Completion Port)
28 a 56c 3a2bb0e0 880b228 Enabled 00000000:00000000 000d7040 0 MTA (Threadpool Completion Port)

OSID Special thread type
8 d90 IOCompletion
9 9b8 Gate
10 218 ThreadpoolWorker
11 ea4 DbgHelper
12 ce4 GC
13 1194 GC
14 82c Finalizer
15 17d8 Timer
16 10f4 ADUnloadHelper
25 151c ThreadpoolWorker
27 1044 IOCompletion
28 56c IOCompletion

Garbage collection is taking up my CPU. Why can't other threads work?

Because I'm doing this in my code!

(just prior to setting up the func eval and letting it run...)
pController.SetAllThreadsDebugState(CorDebugThreadState.THREAD_SUSPEND, pThread);

(once the eval completes...)
pProcess.SetAllThreadsDebugState(CorDebugThreadState.THREAD_RUN, pThread);

So, until the func eval completes, nothing else (apparently even garbage collection) can complete. Ok, so how do I know that my ToString() call probably needs / triggered GC? Memory allocation. Notice the System.String.Concat method on the call stack.

So, at this moment, I finally fully appreciate and understand what Mike Stall means when he says this:

(Taken from http://blogs.msdn.com/b/jmstall/archive/2006/03/05/writing-funceval-in-a-debugger.aspx )

2) You need to make some policy decisions.
a) Other debug events may come in the meantime and you need to decide how to handle those. For eg, if you hit a breakpoint, do you want to silently ignore it finish the eval, or do you want to enter a nested break state? It's your debugger and you get to decide what policy you want.
b) Also, do you suspend other threads? If no, they will move on you when you eval and thus your eval becomes even more invasive. If yes, you may get deadlocks.

Smoking gun! I think we have it. My debugger is running .ToString(), while blocking all other threads. The .ToString() needs to allocate memory, which triggers a GC (sometimes, not all the time), and that causes .ToString() call to never complete, which means no one else can complete either. And so the GC thread is apparently spinning off doing something unproductive for the rest of eternity.

So, I will be reworking my code to allow other threads to continue on while func evals occur, and find a way to correlate the information in the log file, since the .ToString() result and the logging of the first chance exception may not be near each other in the log file.

Hopefully that will correct the issue.

But it didn't. :(

I found some abort related functions on the eval objects, and tried running those after a certain timeout threshold, but that didn't unlock whatever deadlock state my internals of .NET were in.

So, I eventually involved Microsoft Support, and discovered that there is an undocumented requirement in the ICorDebug interfaces that you can only do one eval per debugger break. So, I reworked my logic to keep a list of all parameters and locals for a given exception, eval the first one, and continue. Once the eval completes, I setup an eval for the next. Then that eval completes, and I set up another, and so on, until all evals are ran for the exception. On that last eval complete event, I simply let the debugger resume, and the code goes about its business of running.

This thing is so stinking cool and useful now! I can attach it anytime I'm at a customer and don't have a debugger handy, but need to gather info about how to reproduce something in a non-prod environment. I love it.

Happy Coding!

No comments:

Post a Comment