Deadlocked Schedulers and event notifications with a bit of Powershell


I had written 2 posts recently about Deadlocked Schedulers debugging (Part 1 | Part 2) which basically walked through a 17784 and 17888 condition and the steps that you could take to debug such an issue. A quote from the both these articles:

So if you have a monitoring tool that is capturing the output of queries being executed by the active SQL Server sessions, then you can go back to your monitoring repository and dig up the queries which these threads were executing. If such data is not available from the time of the issue, more often that not you will have to end up collecting additional data for the next problem occurrence!

You will need to collect additional data if there wasn’t a monitoring software collecting the same to determine the set of queries and sequence of events leading upto the issue. CSS Engineers can dig into the dump using the private symbols and probably provide you with additional information which can help you along the right direction. But alas, even they do not have magic wands if you cannot provide them with additional information like blocking output/profiler traces for 17884 and 17888 conditions caused by excessive blocking or parallelism.

SQL Server 2008 Service Pack 2 provided an update which made the FailoverAnalysis.sql script available again. This was available in SQL Server 2005 and now is available for SQL Server 2008 as well. This is particularly useful for SQL Server Failover Clusters where you can configure SQLDumper to generate a dump before the failover occurs and also set the SqlDMVScriptTimeout property to get an output of the DMVs used in the script before a failover happens.

image

You can refer “Cluster failovers and the Sqldumper.exe utility” section in the KB917825 for more details on the above mentioned feature. So what am I doing writing this post??

This post is aimed at SQL Server Standalone instances where you cannot benefit from sqldumper. Deadlocked Schedulers are reported with EventIDs 17884 or 17888. I am going to use this to create an scheduled task on a Event Log entry for these two specific EventIDs. I am big fan of automating repetitive tasks and the benefit of such an automation is that you can get additional data from the problem occurrence without having to just rely on a SQL Server mini-dump memory file and be at the mercy of the Windows Debugging tools. For this post, you can say a small good-bye to your favorite debugging tool and read on for some automation in monitoring!!

However, there are some requirements for this kind of indigenous monitoring to work:

1. Dedicated Administrator Connection needs to be enabled for your SQL Server instance(s) – This is required because during a Deadlocked Schedulers condition, you would have run out of worker threads. So, new connection requests cannot be processed unless and until you establish a connection through DAC. If you are planning to use Event Subscription and remote data gathering, then you need Remote DAC enabled for your SQL instance(s) being monitored. Note: Event subscription is a feature available on Windows Server 2008 and above.

2. You need sysadmin privileges for the account that will be used to collect the output of the FailoverAnalysis.sql script.

3. Powershell and SQLPS needs to be available on the server being monitored. If you don’t want to install Powershell components, then the same functionality can be achieved using SMO and VBScript which I have used extensively in the past till I became a fan of POSH awesomeness!! Smile

Now that I have the pre-requisites out of the way, let’s get down to the good part…

The Powershell Script

I had already mentioned that I will be using a Powershell script to get the output of the FailoverAnalysis.sql script for the SQL instance that is encountering the 17884 condition. The piping feature of Powershell really makes it easy to write compact scripts to perform a wide variety of tasks. The script is divided into three parts.

The first part of the script actually uses the script in the section “Adding the SQL Server Snap-ins to Windows PowerShell” of the article “Running SQL Server PowerShell” to initialize the SQLPS snap-in in the session that you will be running the script.

The second part of the script reads the Application event log for the first occurrence of an informational message with the text substring “have not been picked up by a worker thread in the last 60 seconds” from any SQL Server instance as the source.

The third part of the script figures out the SQL Server instance name and connects to it to execute the FailoverAnalysis.sql script. You can modify the output location of the script output below as appropriate for your environment.

Script filename: DeadlockSchedulerMonitor.ps1


#################################################################################

# Script Name: DeadlockSchedulerMonitor #

# Author: Amit Banerjee #

# Date: 9/1/2011 #

# Description: #

# The script reads the first 17884 (Deadlocked Schedulers) event from the #

# Windows Application Event log and identifies the source. Based on the #

# source name, it connects to the SQL Server instance using a DAC #

# connection and collects the FailoverAnalysis.sql script output. #

#################################################################################

$vMsg = (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + ":Performing SQLPS snap-in initialization`n"

Write-Host $vMsg

# Call the initialization script to load the SQLPS snap-in in the session that you running your script

# Add the SQL Server Provider.

$ErrorActionPreference = "Stop"

$sqlpsreg="HKLM:\SOFTWARE\Microsoft\PowerShell\1\ShellIds\Microsoft.SqlServer.Management.PowerShell.sqlps"

if (Get-ChildItem $sqlpsreg -ErrorAction "SilentlyContinue")

{

Write-Host "SQL Server Provider for Windows PowerShell is not installed.`n"

}

else

{

$item = Get-ItemProperty $sqlpsreg

$sqlpsPath = [System.IO.Path]::GetDirectoryName($item.Path)

}

# Set mandatory variables for the SQL Server provider

Set-Variable -scope Global -name SqlServerMaximumChildItems -Value 0

Set-Variable -scope Global -name SqlServerConnectionTimeout -Value 30

Set-Variable -scope Global -name SqlServerIncludeSystemObjects -Value $false

Set-Variable -scope Global -name SqlServerMaximumTabCompletion -Value 1000

# Load the snapins, type data, format data

Push-Location

cd $sqlpsPath

Add-PSSnapin SqlServerCmdletSnapin100 -ErrorAction SilentlyContinue

Add-PSSnapin SqlServerProviderSnapin100 -ErrorAction SilentlyContinue

Update-TypeData -PrependPath SQLProvider.Types.ps1xml -ErrorAction SilentlyContinue

update-FormatData -prependpath SQLProvider.Format.ps1xml -ErrorAction SilentlyContinue

Pop-Location

$vMsg = (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + ":SQLPS snap-in initialization completed`n"

Write-Host $vMsg

# Local variables to store the event log source and the instance name to which the DAC connection needs to be made

[string] $vSource

[string] $vInstName

[string] $vFileName

$vMsg = (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + ":Reading Application Event Log`n"

Write-Host $vMsg

# Get the source which generated the 17884 message

$vSource = Get-EventLog -LogName "Application" -Message "*have not been picked up by a worker thread in the last 60 seconds*" -Source MSSQL* -Newest 1

$vSource = $vSource.Source.ToString()

$vMsg = (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + ":Connecting to SQL instance and executing script`n"

Write-Host $vMsg

# Check if it is a default instance or named instance and accordingly collect the output

if ($vSource.Equals("MSSQLSERVER"))

{

# Create the filename

$vFileName = "F:\" + (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + (Get-Date).Second.ToString() + "_" + $env:ComputerName.ToString() + "_FailoverAnalysis.OUT"

# Use Invoke-Sqlcmd to get the output of the script

Invoke-Sqlcmd -InputFile "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Install\FailoverAnalysis.sql" -ServerInstance $env:ComputerName -DedicatedAdministratorConnection | Out-File -filepath $vFileName

}

else

{

# Construct the SQL Server name to connect to for a named instance

$vInstName = $env:ComputerName + "\" + $vSource.Split("$")[-1]

# Create the filename

$vFileName = "F:\" + (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + (Get-Date).Second.ToString() + "_" + $vInstName + "_FailoverAnalysis.OUT"

# Use Invoke-Sqlcmd to get the output of the script

Invoke-Sqlcmd -InputFile "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Install\FailoverAnalysis.sql" -ServerInstance $vInstName -DedicatedAdministratorConnection | Out-File -filepath $vFileName

}

$vMsg = (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + ":Script execution completed`n"

Write-Host $vMsg

The Task Scheduler Configuration

The next thing that is needed to be done is create Task Scheduler task to execute the Powershell script above when a 17884 condition is reported.

1. Open the Task Scheduler MMC using Start –> Administrative Tools –> Task Scheduler to get the Task Scheduler UI. Right click on Event Viewer Tasks and click on Create Task (see screenshot below). Provide the appropriate settings, description and name. The salient points to keep in mind here are:
a. Use the Change User or Group option to provide an account which has permissions to log into the SQL Server instance(s) to gather the output of the FailoverAnalysis.sql script.
b. You can use the Configure for drop-down box to configure the task for Windows Server 2003 as well.image

2. Next click on the Triggers tab. Click on New, which will open the New Trigger dialog box. Use the Begin the Task drop down box to select On an event option (see screenshot below). Then select the Custom radio button and click on New Event Filter button which will open the New Event Filter dialog box. Provide the options as shown by the highlighted items in the screenshot. In the event sources drop down list, I selected all the SQL Server instances installed on my server. You can choose to leave the default option in case you plan to install more instances on this server. Using the default option can be beneficial if you choose to export this task to other servers as well. Provide the Advanced Settings option in the New Trigger dialog box. I chose to Stop task if it runs more than 30 minutes. Once that is done, you should see new trigger entry in the Trigger tab as On an Event with a custom event filter and the status as enabled.image

3. Under the Actions tab, click on New. Select Start a Program from the Action drop-down box. In the Program/script text box, provide Powershell.exe. In the Add arguments (optional) text box, provide the following: -File “F:\DeadlockSchedulerMonitor.ps1”. Or you can choose to provide a .cmd file path whose contents are “Powershell.exe -File “F:\DeadlockSchedulerMonitor.ps1” > F:\DeadlockSchedulerMonitorLog.txt”. I went with the second option. F: drive root is where my .ps1 script and .cmd file are present. You would need to modify this accordingly depending on where your script is. Once that is done, your action tab should be similar to the screenshot below.

image

4. I used the default settings for the Conditions and Settings tab. You can modify this as appropriate for your environment.

Once the task is created, all you need is for a 17884 condition to occur for the script output to be generated.

Let’s see the configuration in action

image

I simulated a 17884 condition on my SQL Server instance, see if the newly created task gets fired. (see above Application Log snippet). Looking into the F: drive, I find that the following files were created

20119212197_<sql instance name>_FailoverAnalysis.OUT
201192121918_<sql instance name>_FailoverAnalysis.OUT

The reason I have two files is that this is a two NUMA node box, and the deadlock schedulers condition was reported for both nodes. Note that this script reports first occurrence which is reported at 60 seconds since the message text snippet that is used to fetch the event from the Application log specifically looks for the 60 seconds keyword.

NOTE: You can monitor for 17888 events using the following message text instead of the one used above in the script for 17884 script and create a new task for the same. Message text: *appear deadlocked due to a large number of worker threads waiting*

What next?

Well with automation the possibilities are endless… Some of the quick things that come to my mind are:

1. You can create an event subscription and use the Forwarded Events log to track all 17884 from different servers which have SQL Server instances installed on them.
2. You can modify the powershell script to perform additional post processing or send out email notifications to your DBAs for the occurrence of such events.
3. You could even add additional post processing to the Powershell script to perform additional tasks.

If you have SCOM or any other Event log monitoring software, the only thing that you need to do is setup a custom alert for 17884 or 17888 error messages.

Have fun monitoring and customizing further!

Reference:

Understanding and Using PowerShell Support in SQL Server 2008
Windows Server 2008 Event Subscription with Task Scheduling

Disclaimer: This script and the solution provided is “AS IS” and any deployment that you do using a similar logic described requires due-diligence and testing on your part. The testing that I did for this was on Windows Server 2008 R2 & SQL Server 2008 R2.

Advertisements

Debugging the deadlock but for the scheduler: Part 2 of 2


This is the second part of my Deadlocked Schedulers debugging posts. In the first post, I discussed about the difference between a 17888 and 17884 dump and showed the amount of information that can be extracted from a mini-dump that is generated for the first occurrence of a Deadlocked Scheduler issue. In this post, I shall walk you through a 17884 condition.

The first difference that you will notice with a 17884 message i.e. Deadlocked Schedulers reported in the SQL Errorlog is that no information about the blocking resource is mentioned in the SQL Errorlog unlike the 17888 message. If you inspect the SQL Errorlog, you will see only an error message of the following type:

2010-09-20 14:04:07.78 Server      ***Stack Dump being sent to D:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLServer\MSSQL\LOG\SQLDump0001.txt
2010-09-20 14:04:07.78 Server      *   09/20/10 14:04:07 spid 27276
2010-09-20 14:04:07.78 Server      *
2010-09-20 14:04:07.78 Server      *
Deadlocked Schedulers
2010-09-20 14:04:07.78 Server      *
2010-09-20 14:04:07.78 Server      * *******************************************************************************
2010-09-20 14:04:07.78 Server      * ——————————————————————————-
2010-09-20 14:04:07.78 Server      * Short Stack Dump
2010-09-20 14:04:07.82 Server      Stack Signature for the dump is 0x0000000000000285
External dump process returned no errors.

As mentioned in my previous posts, the dump process returned no errors (Note message in green). This means that I can extract some useful information regarding the issue from the dump. One of the important things about a Deadlocked Schedulers (17884) is that you need to look through the Errorlogs for any other Access Violation or Assertion that occurred prior to the Deadlocked Schedulers event. Always check for previous errors first. The 17884 error messages are often symptoms pointing to a problem. Before investigating the dump, always review overall system health information such as performance information, event logs, SQL Server error logs, and other application logging information. Understanding prior problems on the system may quickly clarify why SQL Server detected and reported a health error. Previous errors should always be addressed before the investigation continues.

The common reasons for a Deadlocked Schedulers are:

  1. All schedulers have encountered a 17883 condition.
  2. All workers are blocked on a critical resource.
  3. All workers are executing a long-running query.

I have very rarely seen #1 being the cause of Deadlocked Schedulers. More often than not it is #2 or #3.

Now when you load the dump using WinDBG, as noted in the debugging a non-yielding scheduler issue, you will note similar things. When you load the dump using WinDBG, you will see the following information:

Loading Dump File [D:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLServer\MSSQL\LOG\SQLDump0001.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are available

Comment: ‘Stack Trace’
Comment: ‘Deadlocked Schedulers’

The above tells you that this is a mini-dump for a deadlocked schedulers condition and the location from where you loaded the dump. Then I use the command to set my symbol path and direct the symbols downloaded from the Microsoft symbol server to a local symbol file cache on my machine.

.sympath srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols

Then I issue a reload command to load the symbols for sqlservr.exe. This can also be done using CTRL+L and providing the complete string above (without .sympath), checking the Reload checkbox and clicking on OK. The only difference here is that the all the public symbols for all loaded modules in the dump will be downloaded from the Microsoft Symbol Server which are available.

.reload /f sqlservr.exe

Next thing is to verify that the symbols were correctly loaded using the lmvm sqlservr command. If the symbols were loaded correctly, you should see the following output. Note the text in green.

0:005> lmvm sqlservr
start end module name

00000000`00530000 00000000`040de000   sqlservr T (pdb symbols)          d:\publicsymbols\sqlservr.pdbD90C4EC67D442D692A791A19FBA638F2\sqlservr.pdb

If symbols were not loaded, then you will see an output as shown below.

0:005> lmvm sqlservr
start end module name
00000000`00530000 00000000`040de000 sqlservr (export symbols) sqlservr.exe

Since I didn’t have any prior errors to the deadlocked schedulers issue in the Errorlog, I don’t have a search string to look for in the dump. This makes it tricky. So I will have to manually look through the callstacks for all the threads in the dump and establish a common pattern. You can use the !uniqstack command to dump out all the unique callstacks in the dump along with the count of the duplicate callstacks.

0:005> !uniqstack

Output excerpt from previous command:

Total threads: 868
Duplicate callstacks: 810 (windbg thread #s follow):
4, 8, 10, … … 865, 866, 867

As you can see above, there are 868 threads in the dump and there are 810 callstacks which are listed as duplicate.

On manual inspection of the threads using the thread ids mentioned in the Duplicate callstacks list (italicized numbers above), I did find a common pattern. There are multiple threads waiting for locks and there are quite a few threads which are executing in parallel.

Two common thread stacks that emerged were:

Waiting on lock:

0:804> kL100
ntdll!ZwSignalAndWaitForSingleObject
kernel32!SignalObjectAndWait
sqlservr!SOS_Scheduler::Switch
sqlservr!SOS_Scheduler::SuspendNonPreemptive
sqlservr!SOS_Scheduler::Suspend
sqlservr!EventInternal<Spinlock<153,1,0> >::Wait
sqlservr!LockOwner::Sleep
sqlservr!lck_lockInternal


sqlservr!process_request
sqlservr!process_commands
sqlservr!SOS_Task::Param::Execute
sqlservr!SOS_Scheduler::RunTask
sqlservr!SOS_Scheduler::ProcessTasks
sqlservr!SchedulerManager::WorkerEntryPoint
sqlservr!SystemThread::RunWorker
sqlservr!SystemThreadDispatcher::ProcessWorker
sqlservr!SchedulerManager::ThreadEntryPoint
msvcr80!endthreadex
msvcr80!endthreadex
kernel32!BaseThreadInitThunk
ntdll!RtlUserThreadStart

Parallel thread:

0:852> kL100

ntdll!ZwWaitForSingleObject
KERNELBASE!WaitForSingleObjectEx
sqlservr!SOS_Scheduler::SwitchContext
sqlservr!SOS_Scheduler::SuspendNonPreemptive
sqlservr!SOS_Scheduler::Suspend
sqlservr!EventInternal<Spinlock<153,1,0> >::Wait
sqlservr!EventInternal<Spinlock<153,1,0> >::WaitAllowPrematureWakeup
sqlservr!XchngWait<Spinlock<38,10,0> >
sqlservr!CXPort::Open
sqlservr!CXTransLocal::Synchronize
sqlservr!CQScanXProducerNew::Open
sqlservr!FnProducerOpen
sqlservr!FnProducerThread
sqlservr!SubprocEntrypoint
sqlservr!SOS_Task::Param::Execute
sqlservr!SOS_Scheduler::RunTask
sqlservr!SOS_Scheduler::ProcessTasks
sqlservr!SchedulerManager::WorkerEntryPoint
sqlservr!SystemThread::RunWorker
sqlservr!SystemThreadDispatcher::ProcessWorker
sqlservr!SchedulerManager::ThreadEntryPoint
msvcr80!endthreadex
msvcr80!endthreadex
kernel32!BaseThreadInitThunk
ntdll!RtlUserThreadStart

The next thing that I shall do is pick one of the common relevant functions from the above callstacks and search for the number of occurrences of these frames in the all the threads present in the dump using the !findstack command.

!findstack sqlservr!LockOwner::Sleep 0

Excerpt from the output of the above command:

Thread 054, 1 frame(s) match
Thread 056, 1 frame(s) match
Thread 059, 1 frame(s) match
Thread 061, 1 frame(s) match
Thread 062, 1 frame(s) match


Thread 799, 1 frame(s) match
Thread 800, 1 frame(s) match
Thread 801, 1 frame(s) match
Thread 802, 1 frame(s) match
Thread 804, 1 frame(s) match
Thread 805, 1 frame(s) match

!findstack sqlservr!FnProducerThread 0

Excerpt from the output of the above command:

Thread 055, 1 frame(s) match
Thread 057, 1 frame(s) match
Thread 066, 1 frame(s) match
Thread 082, 1 frame(s) match
Thread 087, 1 frame(s) match
Thread 093, 1 frame(s) match


Thread 840, 1 frame(s) match
Thread 843, 1 frame(s) match
Thread 844, 1 frame(s) match
Thread 850, 1 frame(s) match
Thread 852, 1 frame(s) match

I found that there were 584 threads (67%) waiting for locks and 166 threads (19%) involved in parallelism. Note that the lines colored in blue are the callstacks which I had mentioned above to determine that there were waits on locks and parallelism. This confirms that the problem is due to concurrent requests waiting for lock(s) on object(s) and some of the queries are executing in parallelism. If a CSS engineer is examining this dump, then they could probably extract additional information like the queries being executed by the threads (if the information is available in the dump) using private symbols. If you were monitoring the server using a monitoring tool which captured the query requests, then you could go back to your monitoring repository and get the queries that the above threads were executing using the thread ids that you have identified.

TIP: The interesting thing to note is the thread IDs that you see above will directly match the OS Thread ID reported in the sys.dm_os_threads os_thread_id column value or the sys.sysprocesses kpid column value. So if you have a monitoring tool that is capturing the output of queries being executed by the active SQL Server sessions, then you can go back to your monitoring repository and dig up the queries which these threads were executing. If such data is not available from the time of the issue, more often that not you will have to end up collecting additional data (profiler traces along with DMV outputs to track blocking chains) for the next problem occurrence!

Also check if the Deadlocked Schedulers matches any KB Article (update) already released. Remember that deadlocked schedulers is a symptom and the not the cause!! So make sure that if you find a KB Article which talks about a Deadlocked Schedulers issue being addressed in a later build, the symptoms match the symptoms that you are experiencing on your server! If in doubt, ask CSS!

If you are stuck and have no clue why the issue happened, then this would be a good time to engage Microsoft CSS Support and provide them with the SQL Errorlogs, the dump file and the windows event logs in .txt format for further analysis.

Reference:
KB 315263 – How to read the small memory dump files that Windows creates for debugging
How To Diagnose and Correct Errors 17883, 17884, 17887, and 17888 whitepaper

Debugging the deadlock but for the scheduler: Part 1 of 2


I had very recently blogged about debugging a non-yielding scheduler condition for a SQL Server instance. In today’s post, I shall talk about debugging a “deadlocked schedulers” issue which is a.k.a the infamous 17884 to you old-school folks. If ALL schedulers are considered stuck and SQL Server is not under active memory pressure, then the 17884 or 17888 condition is reported starting from SQL Server 2005. In this post, I shall show you how to dig into the root cause for a 17888 message reported in the SQL Errorlog. There are subtle differences between a 17888 and 17884 condition detected and reported by the SchedulerMonitor. A 17884 generally tells you that no work is being processed by a particular NUMA Node (for non-NUMA machines, it denotes the entire instance). On the other hand, 17888 tells you that no new work request is being processed on a particular NUMA node by the SQL server instance due to worker threads waiting on particular resource(s). The whitepaper mentioned at the end of this post covers the nitty-gritty details of the detection algorithm.

Description of the error messages from sys.messages catalog

17784 – New queries assigned to process on Node %d have not been picked up by a worker thread in the last %d seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: %d%%. System Idle: %d%%.

17888 – All schedulers on Node %d appear deadlocked due to a large number of worker threads waiting on %ls. Process Utilization %d%%.

When a 17888 is reported, you will find messages of the following nature in the SQL Errorlog:

2008-03-25 14:14:53.37 Server      ***Stack Dump being sent to g:\mssql\data\MSSQL\LOG\SQLDump0006.txt
2008-03-25 14:14:53.38 Server      * BEGIN STACK DUMP:
2008-03-25 14:14:53.38 Server      *   03/25/08 14:14:53 spid 0
2008-03-25 14:14:53.38 Server      *
2008-03-25 14:14:53.38 Server      * Deadlocked Schedulers
2008-03-25 14:14:53.38 Server      * Short Stack Dump
2008-03-25 14:14:53.41 Server      Stack Signature for the dump is 0x00000260
2008-03-25 14:14:56.44 Server      External dump process return code 0x20000001.
External dump process returned no errors.

2008-03-25 14:14:56.44 Server      All schedulers on Node 0 appear deadlocked due to a large number of worker threads waiting on LCK_M_S. Process Utilization 1%.

Again note the differently colored lines and text whose significance, if not self-explanatory at this time, I shall talk about them later in the post. The last message in the above excerpt very clearly tells you that you have a blocking issue on the SQL instance and most of the worker threads on Node 0 are waiting for a LCK_M_S which they cannot acquire due to a conflicting lock(s) held on the object(s) by another session. This time around the SQLDumpXXXX.txt filename is mentioned in the Errorlog unlike the non-yielding scheduler message. So what needs to be done is locate the .mdmp file (in this case SQLDump0006.mdmp) from the same location and use WinDBG to load the dump for analysis.

If you encounter this condition on a SQL Server Failover Cluster instance, then this will result in IsAlive check failures and the SQL service being re-started on the same node or could even result in a failover, if the SQL instance doesn’t respond to the stop request in time. Note the “External dump process returned no errors.” in the above output. This means that the dump generation routine didn’t encounter any errors. Before I move onto the interesting part of looking through the data, let me set expectation here that there is very little information that you can coax a mini-dump for a deadlocked schedulers issue to provide you with while using public symbols. Even with the help of private symbols which CSS engineers have access to, it might not be possible to perform a complete root cause analysis of the issue. The main reason for this is that even if CSS engineers are able to find out the queries being executed on the SQL instance from the mini-dump, the blocking chain origin (in this case because it is clear from the error message excerpt in the Errorlog) cannot be determined from the mini-dump. This invariably ends up requiring collection of SQL Server Perf Stats script output (for capturing blocking information) and profiler traces to get the queries being executed by the head blocker which caused the blocking condition.

Now when you load the dump using WinDBG, as noted in the debugging a non-yielding scheduler issue, you will note similar things:

When you load the dump using WinDBG, you will see the following information:

Loading Dump File [g:\mssql\data\MSSQL\LOG\SQLDump0006.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are available

Comment: ‘Stack Trace’
Comment: ‘Deadlocked Schedulers’

The above tells you that this is a mini-dump for a deadlocked schedulers condition and the location from where you loaded the dump. Then I use the command to set my symbol path and direct the symbols downloaded from the Microsoft symbol server to a local symbol file cache on my machine.

.sympath srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols

Then I issue a reload command to load the symbols for sqlservr.exe. This can also be done using CTRL+L and providing the complete string above (without .sympath), checking the Reload checkbox and clicking on OK. The only difference here is that the all the public symbols for all loaded modules in the dump will be downloaded from the Microsoft Symbol Server which are available.

.reload /f sqlservr.exe

Next thing is to verify that the symbols were correctly loaded using the lmvm sqlservr command. If the symbols were loaded correctly, you should see the following output. Note the text in green.

0:019> lmvm sqlservr
start end module name

01000000 02c0a000   sqlservr T (pdb symbols)          d:\publicsymbols\sqlservr.pdb\389EF554D94A4947846D85FCDC4233382\sqlservr.pdb

If symbols were not loaded, then you will see an output as shown below.

0:005> lmvm sqlservr
start    end        module name
01000000 02c0a000   sqlservr (export symbols) sqlservr.exe

Since I know that there is a blocking issue due to which a deadlocked schedulers condition was reported, I shall use the !findstack debugger command to find out all the callstacks which are potentially waiting for a lock:

!findstack sqlservr!LockOwner::Sleep 2

Output for one of the threads from the above command:

Thread 271, 1 frame(s) match
          00 6db5eef8 7c821fc4 ntdll!KiFastSystemCallRet
          01 6db5eefc 77e43c1d ntdll!NtSignalAndWaitForSingleObject
          02 6db5ef70 010034f2 kernel32!SignalObjectAndWait
          03 6db5ef94 01002172 sqlservr!SOS_Scheduler::Switch
          04 6db5f108 01002cb9 sqlservr!SOS_Scheduler::SwitchContext
          05 6db5f120 01001c75 sqlservr!SOS_Scheduler::SuspendNonPreemptive
          06 6db5f138 01002f25 sqlservr!SOS_Scheduler::Suspend
          07 6db5f15c 01056f6a sqlservr!SOS_Event::Wait
       * 08 6db5f198 01057081 sqlservr!LockOwner::Sleep
          09 6db5f238 0100a75d sqlservr!lck_lockInternal
          10 6db5f2b4 0100d506 sqlservr!GetLock
          11 6db5f324 010432c7 sqlservr!BTreeRow::AcquireLock
          12 6db5f3a4 0100dc00 sqlservr!IndexDataSetSession::GetRowByKeyValue
      ….
      ….
          33 6db5feb0 010b9064 sqlservr!SystemThread::RunWorker
          34 6db5ff10 010b9201 sqlservr!SystemThreadDispatcher::ProcessWorker
          35 6db5ff78 781329bb sqlservr!SchedulerManager::ThreadEntryPoint
          36 6db5ffb0 78132a47 msvcr80!_callthreadstartex
          37 6db5ffb8 77e6608b msvcr80!_threadstartex
          38 6db5ffec 00000000 kernel32!BaseThreadStart

As you can see from the above call-stack, the worker thread is waiting for the lock that it requested for to become available. I found that most of the call-stacks reported in the !findstack output were similar. I found that 183 threads (counted the number of match keyword occurrences in the !findstack output) reported LockOwner::Sleep in their call-stack. This confirms that I have a serious blocking issue either due to large number of concurrent requests asking for the same conflicting lock(s) which are held by other thread(s) or it could be a single long running transaction or uncommitted transaction which is causing all threads to wait for the lock to be released post the commit or completion of the query. In my experience of handling such issues, I have seen the latter cause to be most prevalent.

If you need to find out the total number of threads in the dump, then use the ~* command to get the number of threads.

0:278> ~*
   0  Id: 45c.43c Suspend: 1 Teb: 7ffdd000 Unfrozen
      Start: sqlservr!wmainCRTStartup (02432df8)
      Priority: 0  Priority class: 32  Affinity: ff
   1 Id: 45c.51c Suspend: 1 Teb: 7ffda000 Unfrozen
      Start: msvcr80!_threadstartex (781329e1)
      Priority: 0  Priority class: 32  Affinity: 22



.278  Id: 45c.2b4c Suspend: 1 Teb: 7fea3000 Unfrozen
      Start: msvcr80!_threadstartex (781329e1)
      Priority: 0  Priority class: 32  Affinity: ff

This means that 183 threads out of 278 threads (65% of total worker threads) are waiting for locks!!! In case you got a wait type reported in the 17888 message which is not a LOCK waittype, then what keyword do you search for using the !findstack command?? That would lead to an interesting scenario. You can then use another debugger command !uniqstack to list out the unique call-stacks in the dump.

0:278> !uniqstack

Partial output of the above command:
Processing 279 threads, please wait

.134  Id: 45c.30a8 Suspend: 1 Teb: 7ff36000 Unfrozen
      Start: msvcr80!_threadstartex (781329e1)
      Priority: 0  Priority class: 32  Affinity: 22
ChildEBP RetAddr
6db5eef8 7c821fc4 ntdll!KiFastSystemCallRet
6db5eefc 77e43c1d ntdll!NtSignalAndWaitForSingleObject
6db5ef70 010034f2 kernel32!SignalObjectAndWait
6db5ef94 01002172 sqlservr!SOS_Scheduler::Switch
6db5f108 01002cb9 sqlservr!SOS_Scheduler::SwitchContext
6db5f120 01001c75 sqlservr!SOS_Scheduler::SuspendNonPreemptive
6db5f138 01002f25 sqlservr!SOS_Scheduler::Suspend
6db5f15c 01056f6a sqlservr!SOS_Event::Wait
6db5f198 01057081 sqlservr!LockOwner::Sleep
6db5f238 0100a75d sqlservr!lck_lockInternal
6db5f2b4 0100d506 sqlservr!GetLock
6db5f324 010432c7 sqlservr!BTreeRow::AcquireLock
6db5f3a4 0100dc00 sqlservr!IndexDataSetSession::GetRowByKeyValue
….
….
6db5feb0 010b9064 sqlservr!SystemThread::RunWorker
6db5ff10 010b9201 sqlservr!SystemThreadDispatcher::ProcessWorker
6db5ff78 781329bb sqlservr!SchedulerManager::ThreadEntryPoint
6db5ffb0 78132a47 msvcr80!_callthreadstartex
6db5ffb8 77e6608b msvcr80!_threadstartex
6db5ffec 00000000 kernel32!BaseThreadStart

Total threads: 279
Duplicate callstacks: 219 (windbg thread #s follow):
3, 12, 13, 15, 16, 17, 19, 21, 22, … …. 276, 277, 278

 

With the thread IDs reported in the duplicate call-stacks output (italicized numbers), you will manually (with some soul searching) have to look through a few thread call stacks using the following commands (below) and check what is most common callstack pattern that you notice from the list of duplicate threads. Disclaimer: This is not an easy task unless you do engage in such kind of debugging on a regular basis! And definitely not for the faint hearted and easily irritated!

~[Thread ID]s
kL100

Example:

0:052> ~53s
eax=553985c0 ebx=7ff2ce38 ecx=00000000 edx=00000000 esi=00001bac edi=00000000
eip=7c82ed54 esp=693aeefc ebp=693aef70 iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246
ntdll!KiFastSystemCallRet:
7c82ed54 c3              ret

0:053> kL100

ChildEBP RetAddr
6db5eef8 7c821fc4 ntdll!KiFastSystemCallRet
6db5eefc 77e43c1d ntdll!NtSignalAndWaitForSingleObject
6db5ef70 010034f2 kernel32!SignalObjectAndWait
6db5ef94 01002172 sqlservr!SOS_Scheduler::Switch
6db5f108 01002cb9 sqlservr!SOS_Scheduler::SwitchContext
6db5f120 01001c75 sqlservr!SOS_Scheduler::SuspendNonPreemptive
6db5f138 01002f25 sqlservr!SOS_Scheduler::Suspend
6db5f15c 01056f6a sqlservr!SOS_Event::Wait
6db5f198 01057081 sqlservr!LockOwner::Sleep
6db5f238 0100a75d sqlservr!lck_lockInternal
6db5f2b4 0100d506 sqlservr!GetLock
6db5f324 010432c7 sqlservr!BTreeRow::AcquireLock
6db5f3a4 0100dc00 sqlservr!IndexDataSetSession::GetRowByKeyValue
….
….
6db5feb0 010b9064 sqlservr!SystemThread::RunWorker
6db5ff10 010b9201 sqlservr!SystemThreadDispatcher::ProcessWorker
6db5ff78 781329bb sqlservr!SchedulerManager::ThreadEntryPoint
6db5ffb0 78132a47 msvcr80!_callthreadstartex
6db5ffb8 77e6608b msvcr80!_threadstartex
6db5ffec 00000000 kernel32!BaseThreadStart

TIP: The interesting thing to note is the thread IDs that you see above will directly match the OS Thread ID reported in the sys.dm_os_threads os_thread_id column value or the sys.sysprocesses kpid column value. So if you have a monitoring tool that is capturing the output of queries being executed by the active SQL Server sessions, then you can go back to your monitoring repository and dig up the queries which these threads were executing. If such data is not available from the time of the issue, more often that not you will have to end up collecting additional data for the next problem occurrence!

Also check if the Deadlocked Schedulers matches any KB Article (update) already released. Remember that deadlocked schedulers is a symptom and the not the cause!! So make sure that if you find a KB Article which talks about a Deadlocked Schedulers issue being addressed in a later build, the symptoms match the symptoms that you are experiencing on your server! If in doubt, ask CSS!

If you are stuck and have no clue why the issue happened, then this would be a good time to engage Microsoft CSS Support and provide them with the SQL Errorlogs, the dump file and the windows event logs in .txt format for further analysis.

Reference:
KB 315263 – How to read the small memory dump files that Windows creates for debugging
How To Diagnose and Correct Errors 17883, 17884, 17887, and 17888 whitepaper