Debugging the deadlock but for the scheduler: Part 2 of 2

This is the second part of my Deadlocked Schedulers debugging posts. In the first post, I discussed about the difference between a 17888 and 17884 dump and showed the amount of information that can be extracted from a mini-dump that is generated for the first occurrence of a Deadlocked Scheduler issue. In this post, I shall walk you through a 17884 condition.

The first difference that you will notice with a 17884 message i.e. Deadlocked Schedulers reported in the SQL Errorlog is that no information about the blocking resource is mentioned in the SQL Errorlog unlike the 17888 message. If you inspect the SQL Errorlog, you will see only an error message of the following type:

2010-09-20 14:04:07.78 Server      ***Stack Dump being sent to D:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLServer\MSSQL\LOG\SQLDump0001.txt
2010-09-20 14:04:07.78 Server      *   09/20/10 14:04:07 spid 27276
2010-09-20 14:04:07.78 Server      *
2010-09-20 14:04:07.78 Server      *
Deadlocked Schedulers
2010-09-20 14:04:07.78 Server      *
2010-09-20 14:04:07.78 Server      * *******************************************************************************
2010-09-20 14:04:07.78 Server      * ——————————————————————————-
2010-09-20 14:04:07.78 Server      * Short Stack Dump
2010-09-20 14:04:07.82 Server      Stack Signature for the dump is 0x0000000000000285
External dump process returned no errors.

As mentioned in my previous posts, the dump process returned no errors (Note message in green). This means that I can extract some useful information regarding the issue from the dump. One of the important things about a Deadlocked Schedulers (17884) is that you need to look through the Errorlogs for any other Access Violation or Assertion that occurred prior to the Deadlocked Schedulers event. Always check for previous errors first. The 17884 error messages are often symptoms pointing to a problem. Before investigating the dump, always review overall system health information such as performance information, event logs, SQL Server error logs, and other application logging information. Understanding prior problems on the system may quickly clarify why SQL Server detected and reported a health error. Previous errors should always be addressed before the investigation continues.

The common reasons for a Deadlocked Schedulers are:

  1. All schedulers have encountered a 17883 condition.
  2. All workers are blocked on a critical resource.
  3. All workers are executing a long-running query.

I have very rarely seen #1 being the cause of Deadlocked Schedulers. More often than not it is #2 or #3.

Now when you load the dump using WinDBG, as noted in the debugging a non-yielding scheduler issue, you will note similar things. When you load the dump using WinDBG, you will see the following information:

Loading Dump File [D:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLServer\MSSQL\LOG\SQLDump0001.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are available

Comment: ‘Stack Trace’
Comment: ‘Deadlocked Schedulers’

The above tells you that this is a mini-dump for a deadlocked schedulers condition and the location from where you loaded the dump. Then I use the command to set my symbol path and direct the symbols downloaded from the Microsoft symbol server to a local symbol file cache on my machine.

.sympath srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols

Then I issue a reload command to load the symbols for sqlservr.exe. This can also be done using CTRL+L and providing the complete string above (without .sympath), checking the Reload checkbox and clicking on OK. The only difference here is that the all the public symbols for all loaded modules in the dump will be downloaded from the Microsoft Symbol Server which are available.

.reload /f sqlservr.exe

Next thing is to verify that the symbols were correctly loaded using the lmvm sqlservr command. If the symbols were loaded correctly, you should see the following output. Note the text in green.

0:005> lmvm sqlservr
start end module name

00000000`00530000 00000000`040de000   sqlservr T (pdb symbols)          d:\publicsymbols\sqlservr.pdbD90C4EC67D442D692A791A19FBA638F2\sqlservr.pdb

If symbols were not loaded, then you will see an output as shown below.

0:005> lmvm sqlservr
start end module name
00000000`00530000 00000000`040de000 sqlservr (export symbols) sqlservr.exe

Since I didn’t have any prior errors to the deadlocked schedulers issue in the Errorlog, I don’t have a search string to look for in the dump. This makes it tricky. So I will have to manually look through the callstacks for all the threads in the dump and establish a common pattern. You can use the !uniqstack command to dump out all the unique callstacks in the dump along with the count of the duplicate callstacks.

0:005> !uniqstack

Output excerpt from previous command:

Total threads: 868
Duplicate callstacks: 810 (windbg thread #s follow):
4, 8, 10, … … 865, 866, 867

As you can see above, there are 868 threads in the dump and there are 810 callstacks which are listed as duplicate.

On manual inspection of the threads using the thread ids mentioned in the Duplicate callstacks list (italicized numbers above), I did find a common pattern. There are multiple threads waiting for locks and there are quite a few threads which are executing in parallel.

Two common thread stacks that emerged were:

Waiting on lock:

0:804> kL100
ntdll!ZwSignalAndWaitForSingleObject
kernel32!SignalObjectAndWait
sqlservr!SOS_Scheduler::Switch
sqlservr!SOS_Scheduler::SuspendNonPreemptive
sqlservr!SOS_Scheduler::Suspend
sqlservr!EventInternal<Spinlock<153,1,0> >::Wait
sqlservr!LockOwner::Sleep
sqlservr!lck_lockInternal


sqlservr!process_request
sqlservr!process_commands
sqlservr!SOS_Task::Param::Execute
sqlservr!SOS_Scheduler::RunTask
sqlservr!SOS_Scheduler::ProcessTasks
sqlservr!SchedulerManager::WorkerEntryPoint
sqlservr!SystemThread::RunWorker
sqlservr!SystemThreadDispatcher::ProcessWorker
sqlservr!SchedulerManager::ThreadEntryPoint
msvcr80!endthreadex
msvcr80!endthreadex
kernel32!BaseThreadInitThunk
ntdll!RtlUserThreadStart

Parallel thread:

0:852> kL100

ntdll!ZwWaitForSingleObject
KERNELBASE!WaitForSingleObjectEx
sqlservr!SOS_Scheduler::SwitchContext
sqlservr!SOS_Scheduler::SuspendNonPreemptive
sqlservr!SOS_Scheduler::Suspend
sqlservr!EventInternal<Spinlock<153,1,0> >::Wait
sqlservr!EventInternal<Spinlock<153,1,0> >::WaitAllowPrematureWakeup
sqlservr!XchngWait<Spinlock<38,10,0> >
sqlservr!CXPort::Open
sqlservr!CXTransLocal::Synchronize
sqlservr!CQScanXProducerNew::Open
sqlservr!FnProducerOpen
sqlservr!FnProducerThread
sqlservr!SubprocEntrypoint
sqlservr!SOS_Task::Param::Execute
sqlservr!SOS_Scheduler::RunTask
sqlservr!SOS_Scheduler::ProcessTasks
sqlservr!SchedulerManager::WorkerEntryPoint
sqlservr!SystemThread::RunWorker
sqlservr!SystemThreadDispatcher::ProcessWorker
sqlservr!SchedulerManager::ThreadEntryPoint
msvcr80!endthreadex
msvcr80!endthreadex
kernel32!BaseThreadInitThunk
ntdll!RtlUserThreadStart

The next thing that I shall do is pick one of the common relevant functions from the above callstacks and search for the number of occurrences of these frames in the all the threads present in the dump using the !findstack command.

!findstack sqlservr!LockOwner::Sleep 0

Excerpt from the output of the above command:

Thread 054, 1 frame(s) match
Thread 056, 1 frame(s) match
Thread 059, 1 frame(s) match
Thread 061, 1 frame(s) match
Thread 062, 1 frame(s) match


Thread 799, 1 frame(s) match
Thread 800, 1 frame(s) match
Thread 801, 1 frame(s) match
Thread 802, 1 frame(s) match
Thread 804, 1 frame(s) match
Thread 805, 1 frame(s) match

!findstack sqlservr!FnProducerThread 0

Excerpt from the output of the above command:

Thread 055, 1 frame(s) match
Thread 057, 1 frame(s) match
Thread 066, 1 frame(s) match
Thread 082, 1 frame(s) match
Thread 087, 1 frame(s) match
Thread 093, 1 frame(s) match


Thread 840, 1 frame(s) match
Thread 843, 1 frame(s) match
Thread 844, 1 frame(s) match
Thread 850, 1 frame(s) match
Thread 852, 1 frame(s) match

I found that there were 584 threads (67%) waiting for locks and 166 threads (19%) involved in parallelism. Note that the lines colored in blue are the callstacks which I had mentioned above to determine that there were waits on locks and parallelism. This confirms that the problem is due to concurrent requests waiting for lock(s) on object(s) and some of the queries are executing in parallelism. If a CSS engineer is examining this dump, then they could probably extract additional information like the queries being executed by the threads (if the information is available in the dump) using private symbols. If you were monitoring the server using a monitoring tool which captured the query requests, then you could go back to your monitoring repository and get the queries that the above threads were executing using the thread ids that you have identified.

TIP: The interesting thing to note is the thread IDs that you see above will directly match the OS Thread ID reported in the sys.dm_os_threads os_thread_id column value or the sys.sysprocesses kpid column value. So if you have a monitoring tool that is capturing the output of queries being executed by the active SQL Server sessions, then you can go back to your monitoring repository and dig up the queries which these threads were executing. If such data is not available from the time of the issue, more often that not you will have to end up collecting additional data (profiler traces along with DMV outputs to track blocking chains) for the next problem occurrence!

Also check if the Deadlocked Schedulers matches any KB Article (update) already released. Remember that deadlocked schedulers is a symptom and the not the cause!! So make sure that if you find a KB Article which talks about a Deadlocked Schedulers issue being addressed in a later build, the symptoms match the symptoms that you are experiencing on your server! If in doubt, ask CSS!

If you are stuck and have no clue why the issue happened, then this would be a good time to engage Microsoft CSS Support and provide them with the SQL Errorlogs, the dump file and the windows event logs in .txt format for further analysis.

Reference:
KB 315263 – How to read the small memory dump files that Windows creates for debugging
How To Diagnose and Correct Errors 17883, 17884, 17887, and 17888 whitepaper

Advertisement