Debugging a non-yielding scheduler issue


If you have read the How To Diagnose and Correct Errors 17883, 17884, 17887, and 17888 whitepaper, you would know what a 17883 error reported in SQL Server 2000 means. If you read the “SQL Server 2005 – SchedulerMonitor” section of the whitepaper, you will understand that the way non-yielding messages are reported for a SQL Server worker thread is much more refined than it’s predecessor. We report a non-yielding worker thread as a Non-yielding Scheduler message in the SQL Server Errorlog. Since the whitepaper is very comprehensive in explaining SQL Server non-yielding detection algorithms, I shall not reiterate the same. In this post, I shall show you how to track down the cause of a non-yielding scheduler using public symbols in conjunction with the SQL Server Errorlog and Windows Event Logs.

Below is an excerpt from an Errorlog which is reporting a non-yielding scheduler message.

2011-07-24 22:17:45.850 Server       * BEGIN STACK DUMP:
2011-07-24 22:17:45.850 Server       *   07/24/11 22:17:45 spid 2388
2011-07-24 22:17:45.850 Server       *
2011-07-24 22:17:45.850 Server       * Non-yielding Scheduler
2011-07-24 22:17:45.850 Server       Stack Signature for the dump is 0x00000000000003B6
2011-07-24 22:17:53.380 Server       External dump process return code 0x20000001.
External dump process returned no errors.

2011-07-24 22:17:53.400 Server Process 0:0:0 (0x990) Worker 0x0000000005A6C1A0 appears to be non-yielding on Scheduler 1. Thread creation time: 12955600342903. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 0%. System Idle 99%. Interval: 70119 ms.

There are lines that I have colored differently in the above excerpt whose significance I shall detail later in this thread. The Windows Application Event Log will report this as an Informational message with the EventID = 17883. This is what I have in my application event log for the above reported non-yielding condition:

Process 0:0:0 (0x990) Worker 0x0000000005A6C1A0 appears to be non-yielding on Scheduler 1. Thread creation time: 12955600342903. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 0%%. System Idle 99%%. Interval: 70119 ms.

The above message is the same message reported in the Errorlog excerpt above. Since the Errorlog doesn’t report the MDMP (dump file) number, you will have to track it down using the timestamp when the issue was reported or using the SQL Dumper log in case you are performing post-mortem analysis. The SQL Dumper log name is “SQLDUMPER_ERRORLOG.log” and it is located by default in the SQL Server LOG folder. You will find a message similar to the one shown below which will point you to the correct .mdmp file.

(2A30:2760) 07/24/11 22:17:52, ACTION, SQLDUMPER_UNKNOWN_APP.EXE, MiniDump completed: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\LOG\SQLDump0025.mdmp

Now I need to use Windows Debugging Tools to load the .mdmp file. Note that the non-yielding message in the SQL Server Errorlog reported “External dump process returned no errors.”. This means that the dump generation routine didn’t encounter any errors. The most common error that occurs while generating dumps (due to resource crunch – memory or CPU), is a timeout. This means that the dump generation timed out and it is a delayed dump. So the information contained in the dump file may or may not be useful to perform a complete post-mortem analysis. This is when Microsoft CSS would ask you to collect additional data for the next occurrence if the current data available is not sufficient for a complete post-mortem analysis.

So let’s move on to the good part now… i.e. analysis of the dump. Keep in mind that this is a mini-dump which means that the is limited amount of information stored in the dump and we are using public symbols to analyze the dump. Even with private symbols that Microsoft CSS Engineers have access to, there might not be a whole lot they can decipher from a mini-dump and might request additional data!

When you load the dump using WinDBG, you will see the following information:

Loading Dump File [C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Log\SQLDump0025.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are available

Comment: ‘Stack Trace’
Comment: ‘Non-yielding Scheduler’


Debug session time: Sun Jul 24 22:17:52.000 2011 (UTC + 5:30)

The above tells you that this is a mini-dump for a non-yielding scheduler condition and the location from where you loaded the dump. Then I use the command to set my symbol path and direct the symbols downloaded from the Microsoft symbol server to a local symbol file cache on my machine. It also tells me the time when the dump was generated.

What is a minidump?
A memory dump which contains the memory for the call stack of all threads, the CPU registers and information about which modules are loaded.

.sympath srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols

Then I issue a reload command to load the symbols for sqlservr.exe. This can also be done using CTRL+L and providing the complete string above (without .sympath), checking the Reload checkbox and clicking on OK. The only difference here is that the all the public symbols for all loaded modules in the dump will be downloaded from the Microsoft Symbol Server which are available.

.reload /f sqlservr.exe

Next thing is to verify that the symbols were correctly loaded using the lmvm sqlservr command. If the symbols were loaded correctly, you should see the following output. Note the text in green.

0:019> lmvm sqlservr
start             end                 module name
00000000`00c60000 00000000`04820000   sqlservr   (pdb symbols)          D:\publicsymbols\sqlservr.pdb\6A3527657A3F4090AB2C96F7008A55C52\sqlservr.pdb
    Loaded symbol image file: sqlservr.exe
    Mapped memory image file: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe
    Image path: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe
    Product version:  10.50.1790.0

If symbols were not loaded, then you will see an output as shown below.

0:019> lmvm sqlservr
start             end                 module name
00000000`00c60000 00000000`04820000   sqlservr   (export symbols)       sqlservr.exe
    Loaded symbol image file: sqlservr.exe
    Mapped memory image file: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe
    Image path: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe

Next I shall use the Thread ID noted above (0x990) in the Errorlog excerpt to switch to the thread and get the callstack for the thread which was reported as non-yielding.

0:019> ~~[0x990]s
ntdll!NtWaitForSingleObject+0xa:
00000000`76e06bda c3              ret
0:019> kL100

ntdll!NtWaitForSingleObject
kernel32!WaitForSingleObjectEx
sqlservr!SOS_Scheduler::SwitchContext
sqlservr!SOS_Scheduler::SuspendNonPreemptive
sqlservr!SOS_Scheduler::Suspend


msvcr80!endthreadex
msvcr80!endthreadex
kernel32!BaseThreadInitThunk
ntdll!RtlUserThreadStart

Looking at the callstack above, I can say that this is not the cause of the non-yielding condition (99.99999% true in most cases)!! The reason I can say this is because I find SwitchContext on the thread stack which means that the thread has yielded. The sequence of events is as follows: SQL Server detected a non-yielding worker thread. By the time, SQL Dumper was initiated to capture a memory dump of the SQL Server process, the thread yielded and the processing moved on to the next work request in the queue.

However, if your dump didn’t have the top functions as the ones noted above but was something like the two functions mentioned below, then you definitely have the right callstack and the issue noted below is due to stuck-stalled I/O.

ntdll!NtWriteFileGather
kernel32!WriteFileGather

The non-yielding scheduler message from the dump I investigated which had the above callstack had the following stuck-stalled IO warnings in the SQL Server Errorlog just prior to the dump getting generated:

SQL Server has encountered 218 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\tempdb.mdf] in database [tempdb] (2). 

So now I definitely know that the non-yielding condition was due to faulty disk sub-system and the SQL Server instance at this point is just a victim of the issue!

If you see the the following message in the SQL Server Errorlog along with the dump generation message, then it is definitely a delayed dump and it could have some valuable information missing that might have been pertinent to your analysis.

Timeout while waiting for external dump

Most of the common non-yielding conditions are documented in the whitepaper that I referred to above. If you are unable to get a callstack for the non-yielding issue and the symptoms or sequence of events leading up to the issue doesn’t match any KB Article, then it’s time to call CSS to get the memory dump analyzed with the help of the private symbols that the CSS engineers have access to.

Happy debugging!!

Reference:
KB 315263 – How to read the small memory dump files that Windows creates for debugging

Advertisements

6 thoughts on “Debugging a non-yielding scheduler issue

  1. Pingback: Debugging the deadlock but for the scheduler: Part 1 of 2 « TroubleshootingSQL

  2. Hi Amit,

    I just applied the valuable tips and tricks that you have made available on this page. Thank you a lot!!

    However, in my case, the tread dump is a bit different, which I am not able to interprete. I know it can be easily resolved by CSS, but I want to understand what exactly is happening here.It will really great if you could provide your inputs –

    Below is the thread dump-

    ~~[0xd54]s
    sqlservr!Spinlock::SpinToAcquireWithExponentialBackoff+0xca:
    00000000`0137a0e1 90 nop
    0:040> kL100
    Child-SP RetAddr Call Site
    00000000`102bdc70 00000000`01377f8c sqlservr!Spinlock::SpinToAcquireWithExponentialBackoff+0xca
    00000000`102bdcd0 00000000`00f01d9f sqlservr!lck_lockInternal+0x8c4
    00000000`102bea60 00000000`013f0474 sqlservr!GetLock+0x1eb
    00000000`102beb30 00000000`013f0331 sqlservr!GET_LOCK+0xa8
    00000000`102beba0 00000000`013f1f4d sqlservr!IndexDataSetSession::AcquireLocksForInsert+0xdd4
    00000000`102beda0 00000000`013f1a30 sqlservr!IndexDataSetSession::InsertSmallRecord+0xf78
    00000000`102c6ff0 00000000`013f2bd8 sqlservr!IndexDataSetSession::InsertRowInternal+0x2e64
    00000000`102c92a0 00000000`013f2ab6 sqlservr!DatasetSession::InsertRow+0x294
    00000000`102c9380 00000000`00f30c86 sqlservr!RowsetNewSS::InsertRow+0x26
    00000000`102c93b0 00000000`00f2f1d7 sqlservr!CMEDScanBase::FInsertSingle+0xd6
    00000000`102c9470 00000000`00f3ae03 sqlservr!CMEDScan::FInsertRow+0x22e
    00000000`102c96b0 00000000`00f3aa92 sqlservr!ECatBitsKatmai::FInsertRowsetColumn+0x35d
    00000000`102c98a0 00000000`00f3f004 sqlservr!CMEDCatKatmaiIndex::AddRowsetColumn+0xd6
    00000000`102c9960 00000000`00f3ef10 sqlservr!CMEDSEIndexAccess::AddRowsetColumn+0x54
    00000000`102c99b0 00000000`00f3e813 sqlservr!HoBtFactory::PersistVisibleMetadata+0x443
    00000000`102c9a60 00000000`0143d3b7 sqlservr!HoBtFactory::CreateHoBt+0x7d0
    00000000`102c9e70 00000000`00f3cf53 sqlservr!SECreateRowset+0x8d9
    00000000`102cbbe0 00000000`00f3cd57 sqlservr!DDLAgent::SECreateRowsets+0x743
    00000000`102cbde0 00000000`01e7a012 sqlservr!CIndexDDL::CreateRowsets+0x817
    00000000`102cbfb0 00000000`014f3fbf sqlservr!CIndexDDL::CreateEmptyHeap+0xca
    00000000`102cbff0 00000000`00f3dcf1 sqlservr!CTableCreate::CreateRelation+0x818
    00000000`102cc3d0 00000000`01f1c116 sqlservr!CTableCreate::LCreateImpl+0xf7d
    00000000`102cc850 00000000`02407e92 sqlservr!CStmtCreateTable::XretCreateTableExecute+0x1aa
    00000000`102cc9c0 00000000`024083eb sqlservr!CXStmtSelectInto::XretSelectIntoExecute+0x186
    00000000`102ccab0 00000000`0138cc76 sqlservr!CXStmtSelectInto::XretExecute+0x12b
    00000000`102ccae0 00000000`0138e1d8 sqlservr!CMsqlExecContext::ExecuteStmts+0x375
    00000000`102ccbf0 00000000`0138da8c sqlservr!CMsqlExecContext::FExecute+0x983
    00000000`102ccd70 00000000`023d486c sqlservr!CSQLSource::Execute+0x7b5
    00000000`102ccea0 00000000`023d41ca sqlservr!CStmtExecProc::XretLocalExec+0x24c
    00000000`102ccf10 00000000`023d21bd sqlservr!CStmtExecProc::XretExecExecute+0x416
    00000000`102cd5c0 00000000`0138cc76 sqlservr!CXStmtExecProc::XretExecute+0x19
    00000000`102cd600 00000000`0138e1d8 sqlservr!CMsqlExecContext::ExecuteStmts+0x375
    00000000`102cd710 00000000`0138da8c sqlservr!CMsqlExecContext::FExecute+0x983
    00000000`102cd890 00000000`023d672e sqlservr!CSQLSource::Execute+0x7b5
    00000000`102cd9c0 00000000`023d2591 sqlservr!CStmtExecStr::XretExecStrExecute+0x57e
    00000000`102ce080 00000000`0138cc76 sqlservr!CXStmtExecStr::XretExecute+0x19
    00000000`102ce0c0 00000000`0138e1d8 sqlservr!CMsqlExecContext::ExecuteStmts+0x375
    00000000`102ce1d0 00000000`0138da8c sqlservr!CMsqlExecContext::FExecute+0x983
    00000000`102ce350 00000000`023d486c sqlservr!CSQLSource::Execute+0x7b5
    00000000`102ce480 00000000`023d41ca sqlservr!CStmtExecProc::XretLocalExec+0x24c
    00000000`102ce4f0 00000000`023d21bd sqlservr!CStmtExecProc::XretExecExecute+0x416
    00000000`102ceba0 00000000`0138cc76 sqlservr!CXStmtExecProc::XretExecute+0x19
    00000000`102cebe0 00000000`0138e1d8 sqlservr!CMsqlExecContext::ExecuteStmts+0x375
    00000000`102cecf0 00000000`0138da8c sqlservr!CMsqlExecContext::FExecute+0x983
    00000000`102cee70 00000000`0138fafe sqlservr!CSQLSource::Execute+0x7b5
    00000000`102cefa0 00000000`013dd862 sqlservr!process_request+0x64b
    00000000`102cf600 00000000`00eebe44 sqlservr!process_commands+0x4e5
    00000000`102cf810 00000000`00eebc1e sqlservr!SOS_Task::Param::Execute+0x12a
    00000000`102cf920 00000000`00eeb923 sqlservr!SOS_Scheduler::RunTask+0x96
    00000000`102cf980 00000000`014bd2c6 sqlservr!SOS_Scheduler::ProcessTasks+0x128
    00000000`102cf9f0 00000000`014bd765 sqlservr!SchedulerManager::WorkerEntryPoint+0x2b6
    00000000`102cfad0 00000000`014bcd2d sqlservr!SystemThread::RunWorker+0xcc
    00000000`102cfb10 00000000`014bd58e sqlservr!SystemThreadDispatcher::ProcessWorker+0x2db
    00000000`102cfbc0 00000000`747337d7 sqlservr!SchedulerManager::ThreadEntryPoint+0x173
    00000000`102cfc60 00000000`74733894 msvcr80!endthreadex+0x47
    00000000`102cfc90 00000000`76e5652d msvcr80!endthreadex+0x104
    00000000`102cfcc0 00000000`76f8c521 kernel32!BaseThreadInitThunk+0xd
    00000000`102cfcf0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

    Like

  3. It seems that the non-yield condition was encountered while your session was trying to acquire a lock. If this is not a delayed dump scenario, this is not a normal condition. Were there other issues reported in the SQL Errorlog (out of memory, non-yielding resource monitor, memory trimming etc.) during the time of the issue when the non-yielding condition was reported?

    What is the build of your SQL Server instance that experienced this issue?

    If you want, you could upload the latest non-yield dump and SQL Errorlog to a share and email me the FTP details. I could take a quick look at it in case that is possible for you.

    Like

  4. Hi Amit,

    Thanks for your reply.

    Below error is present in the errorlog along with the non-yielding scheduler errors-

    Timeout occurred while waiting for latch: class ‘LOG_MANAGER’, id 000000000E280AE0, type 4, Task 0x0000000000399288 : 0, waittime 300, flags 0x1a, owning task 0x00000000003CDDC8. Continuing to wait

    The build is -Microsoft SQL Server 2008 R2 (RTM) – 10.50.1765.0 (X64) Feb 2 2011 17:33:22 Copyright (c) Microsoft Corporation Enterprise Edition (64-bit) on Windows NT 6.1 (Build 7601: Service Pack 1) (Hypervisor)

    I have recently applied the hotfix as per -http://support.microsoft.com/kb/2491214.

    Can you also tell me wha is meant by “Spinlock::SpinToAcquireWithExponentialBackoff” in the stack dump ? Also, is there any place which I can refer to interpret the stack dumps from windbg ?

    Unfortunately, I can uploda the dump as this issue occurred on a client machine.

    Please let me know if you require any further details.

    Like

  5. The SQL instance was on a pretty older version of SQL 2008 R2. It is good that you applied the latest build. The function is used while acquiring a Spinlock in SQL Server. You can read about spinlocks here http://sqlcat.com/sqlcat/b/whitepapers/archive/2011/07/05/diagnosing-and-resolving-latch-contention-on-sql-server.aspx

    There is no single place where you can get the information about the functions that you see in a callstack. This is source code specific information which is available only to a select group of folks within Microsoft who are involved in the development or support of the product.

    The very fact that you have a latch timeout during that time period indicates that the non-yielding condition might be a fallout of the latch timeout. If a memory dump was generated for the latch timeout, then see if this helps: https://troubleshootingsql.com/2011/08/26/debugging-that-latch-timeout/

    To investigate this further in detail would require a support case with Microsoft CSS.

    Like

  6. Thank you Amit 🙂 ..I am planning to apply SP2 on the instance and allocate few more CPUs on the VM. Hope this will resolve the issue !! I will update you in case I face any issues.

    Like

It is always good to hear from you! :)

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s