Debugging a non-yielding scheduler issue


If you have read the How To Diagnose and Correct Errors 17883, 17884, 17887, and 17888 whitepaper, you would know what a 17883 error reported in SQL Server 2000 means. If you read the “SQL Server 2005 – SchedulerMonitor” section of the whitepaper, you will understand that the way non-yielding messages are reported for a SQL Server worker thread is much more refined than it’s predecessor. We report a non-yielding worker thread as a Non-yielding Scheduler message in the SQL Server Errorlog. Since the whitepaper is very comprehensive in explaining SQL Server non-yielding detection algorithms, I shall not reiterate the same. In this post, I shall show you how to track down the cause of a non-yielding scheduler using public symbols in conjunction with the SQL Server Errorlog and Windows Event Logs.

Below is an excerpt from an Errorlog which is reporting a non-yielding scheduler message.

2011-07-24 22:17:45.850 Server       * BEGIN STACK DUMP:
2011-07-24 22:17:45.850 Server       *   07/24/11 22:17:45 spid 2388
2011-07-24 22:17:45.850 Server       *
2011-07-24 22:17:45.850 Server       * Non-yielding Scheduler
2011-07-24 22:17:45.850 Server       Stack Signature for the dump is 0x00000000000003B6
2011-07-24 22:17:53.380 Server       External dump process return code 0x20000001.
External dump process returned no errors.

2011-07-24 22:17:53.400 Server Process 0:0:0 (0x990) Worker 0x0000000005A6C1A0 appears to be non-yielding on Scheduler 1. Thread creation time: 12955600342903. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 0%. System Idle 99%. Interval: 70119 ms.

There are lines that I have colored differently in the above excerpt whose significance I shall detail later in this thread. The Windows Application Event Log will report this as an Informational message with the EventID = 17883. This is what I have in my application event log for the above reported non-yielding condition:

Process 0:0:0 (0x990) Worker 0x0000000005A6C1A0 appears to be non-yielding on Scheduler 1. Thread creation time: 12955600342903. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 0%%. System Idle 99%%. Interval: 70119 ms.

The above message is the same message reported in the Errorlog excerpt above. Since the Errorlog doesn’t report the MDMP (dump file) number, you will have to track it down using the timestamp when the issue was reported or using the SQL Dumper log in case you are performing post-mortem analysis. The SQL Dumper log name is “SQLDUMPER_ERRORLOG.log” and it is located by default in the SQL Server LOG folder. You will find a message similar to the one shown below which will point you to the correct .mdmp file.

(2A30:2760) 07/24/11 22:17:52, ACTION, SQLDUMPER_UNKNOWN_APP.EXE, MiniDump completed: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\LOG\SQLDump0025.mdmp

Now I need to use Windows Debugging Tools to load the .mdmp file. Note that the non-yielding message in the SQL Server Errorlog reported “External dump process returned no errors.”. This means that the dump generation routine didn’t encounter any errors. The most common error that occurs while generating dumps (due to resource crunch – memory or CPU), is a timeout. This means that the dump generation timed out and it is a delayed dump. So the information contained in the dump file may or may not be useful to perform a complete post-mortem analysis. This is when Microsoft CSS would ask you to collect additional data for the next occurrence if the current data available is not sufficient for a complete post-mortem analysis.

So let’s move on to the good part now… i.e. analysis of the dump. Keep in mind that this is a mini-dump which means that the is limited amount of information stored in the dump and we are using public symbols to analyze the dump. Even with private symbols that Microsoft CSS Engineers have access to, there might not be a whole lot they can decipher from a mini-dump and might request additional data!

When you load the dump using WinDBG, you will see the following information:

Loading Dump File [C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Log\SQLDump0025.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are available

Comment: ‘Stack Trace’
Comment: ‘Non-yielding Scheduler’


Debug session time: Sun Jul 24 22:17:52.000 2011 (UTC + 5:30)

The above tells you that this is a mini-dump for a non-yielding scheduler condition and the location from where you loaded the dump. Then I use the command to set my symbol path and direct the symbols downloaded from the Microsoft symbol server to a local symbol file cache on my machine. It also tells me the time when the dump was generated.

What is a minidump?
A memory dump which contains the memory for the call stack of all threads, the CPU registers and information about which modules are loaded.

.sympath srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols

Then I issue a reload command to load the symbols for sqlservr.exe. This can also be done using CTRL+L and providing the complete string above (without .sympath), checking the Reload checkbox and clicking on OK. The only difference here is that the all the public symbols for all loaded modules in the dump will be downloaded from the Microsoft Symbol Server which are available.

.reload /f sqlservr.exe

Next thing is to verify that the symbols were correctly loaded using the lmvm sqlservr command. If the symbols were loaded correctly, you should see the following output. Note the text in green.

0:019> lmvm sqlservr
start             end                 module name
00000000`00c60000 00000000`04820000   sqlservr   (pdb symbols)          D:\publicsymbols\sqlservr.pdb\6A3527657A3F4090AB2C96F7008A55C52\sqlservr.pdb
    Loaded symbol image file: sqlservr.exe
    Mapped memory image file: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe
    Image path: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe
    Product version:  10.50.1790.0

If symbols were not loaded, then you will see an output as shown below.

0:019> lmvm sqlservr
start             end                 module name
00000000`00c60000 00000000`04820000   sqlservr   (export symbols)       sqlservr.exe
    Loaded symbol image file: sqlservr.exe
    Mapped memory image file: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe
    Image path: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe

Next I shall use the Thread ID noted above (0x990) in the Errorlog excerpt to switch to the thread and get the callstack for the thread which was reported as non-yielding.

0:019> ~~[0x990]s
ntdll!NtWaitForSingleObject+0xa:
00000000`76e06bda c3              ret
0:019> kL100

ntdll!NtWaitForSingleObject
kernel32!WaitForSingleObjectEx
sqlservr!SOS_Scheduler::SwitchContext
sqlservr!SOS_Scheduler::SuspendNonPreemptive
sqlservr!SOS_Scheduler::Suspend


msvcr80!endthreadex
msvcr80!endthreadex
kernel32!BaseThreadInitThunk
ntdll!RtlUserThreadStart

Looking at the callstack above, I can say that this is not the cause of the non-yielding condition (99.99999% true in most cases)!! The reason I can say this is because I find SwitchContext on the thread stack which means that the thread has yielded. The sequence of events is as follows: SQL Server detected a non-yielding worker thread. By the time, SQL Dumper was initiated to capture a memory dump of the SQL Server process, the thread yielded and the processing moved on to the next work request in the queue.

However, if your dump didn’t have the top functions as the ones noted above but was something like the two functions mentioned below, then you definitely have the right callstack and the issue noted below is due to stuck-stalled I/O.

ntdll!NtWriteFileGather
kernel32!WriteFileGather

The non-yielding scheduler message from the dump I investigated which had the above callstack had the following stuck-stalled IO warnings in the SQL Server Errorlog just prior to the dump getting generated:

SQL Server has encountered 218 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\tempdb.mdf] in database [tempdb] (2). 

So now I definitely know that the non-yielding condition was due to faulty disk sub-system and the SQL Server instance at this point is just a victim of the issue!

If you see the the following message in the SQL Server Errorlog along with the dump generation message, then it is definitely a delayed dump and it could have some valuable information missing that might have been pertinent to your analysis.

Timeout while waiting for external dump

Most of the common non-yielding conditions are documented in the whitepaper that I referred to above. If you are unable to get a callstack for the non-yielding issue and the symptoms or sequence of events leading up to the issue doesn’t match any KB Article, then it’s time to call CSS to get the memory dump analyzed with the help of the private symbols that the CSS engineers have access to.

Happy debugging!!

Reference:
KB 315263 – How to read the small memory dump files that Windows creates for debugging