Deadlocked Schedulers and event notifications with a bit of Powershell

I had written 2 posts recently about Deadlocked Schedulers debugging (Part 1 | Part 2) which basically walked through a 17784 and 17888 condition and the steps that you could take to debug such an issue. A quote from the both these articles:

So if you have a monitoring tool that is capturing the output of queries being executed by the active SQL Server sessions, then you can go back to your monitoring repository and dig up the queries which these threads were executing. If such data is not available from the time of the issue, more often that not you will have to end up collecting additional data for the next problem occurrence!

You will need to collect additional data if there wasn’t a monitoring software collecting the same to determine the set of queries and sequence of events leading upto the issue. CSS Engineers can dig into the dump using the private symbols and probably provide you with additional information which can help you along the right direction. But alas, even they do not have magic wands if you cannot provide them with additional information like blocking output/profiler traces for 17884 and 17888 conditions caused by excessive blocking or parallelism.

SQL Server 2008 Service Pack 2 provided an update which made the FailoverAnalysis.sql script available again. This was available in SQL Server 2005 and now is available for SQL Server 2008 as well. This is particularly useful for SQL Server Failover Clusters where you can configure SQLDumper to generate a dump before the failover occurs and also set the SqlDMVScriptTimeout property to get an output of the DMVs used in the script before a failover happens.

image

You can refer “Cluster failovers and the Sqldumper.exe utility” section in the KB917825 for more details on the above mentioned feature. So what am I doing writing this post??

This post is aimed at SQL Server Standalone instances where you cannot benefit from sqldumper. Deadlocked Schedulers are reported with EventIDs 17884 or 17888. I am going to use this to create an scheduled task on a Event Log entry for these two specific EventIDs. I am big fan of automating repetitive tasks and the benefit of such an automation is that you can get additional data from the problem occurrence without having to just rely on a SQL Server mini-dump memory file and be at the mercy of the Windows Debugging tools. For this post, you can say a small good-bye to your favorite debugging tool and read on for some automation in monitoring!!

However, there are some requirements for this kind of indigenous monitoring to work:

1. Dedicated Administrator Connection needs to be enabled for your SQL Server instance(s) – This is required because during a Deadlocked Schedulers condition, you would have run out of worker threads. So, new connection requests cannot be processed unless and until you establish a connection through DAC. If you are planning to use Event Subscription and remote data gathering, then you need Remote DAC enabled for your SQL instance(s) being monitored. Note: Event subscription is a feature available on Windows Server 2008 and above.

2. You need sysadmin privileges for the account that will be used to collect the output of the FailoverAnalysis.sql script.

3. Powershell and SQLPS needs to be available on the server being monitored. If you don’t want to install Powershell components, then the same functionality can be achieved using SMO and VBScript which I have used extensively in the past till I became a fan of POSH awesomeness!! Smile

Now that I have the pre-requisites out of the way, let’s get down to the good part…

The Powershell Script

I had already mentioned that I will be using a Powershell script to get the output of the FailoverAnalysis.sql script for the SQL instance that is encountering the 17884 condition. The piping feature of Powershell really makes it easy to write compact scripts to perform a wide variety of tasks. The script is divided into three parts.

The first part of the script actually uses the script in the section “Adding the SQL Server Snap-ins to Windows PowerShell” of the article “Running SQL Server PowerShell” to initialize the SQLPS snap-in in the session that you will be running the script.

The second part of the script reads the Application event log for the first occurrence of an informational message with the text substring “have not been picked up by a worker thread in the last 60 seconds” from any SQL Server instance as the source.

The third part of the script figures out the SQL Server instance name and connects to it to execute the FailoverAnalysis.sql script. You can modify the output location of the script output below as appropriate for your environment.

Script filename: DeadlockSchedulerMonitor.ps1


#################################################################################

# Script Name: DeadlockSchedulerMonitor #

# Author: Amit Banerjee #

# Date: 9/1/2011 #

# Description: #

# The script reads the first 17884 (Deadlocked Schedulers) event from the #

# Windows Application Event log and identifies the source. Based on the #

# source name, it connects to the SQL Server instance using a DAC #

# connection and collects the FailoverAnalysis.sql script output. #

#################################################################################

$vMsg = (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + ":Performing SQLPS snap-in initialization`n"

Write-Host $vMsg

# Call the initialization script to load the SQLPS snap-in in the session that you running your script

# Add the SQL Server Provider.

$ErrorActionPreference = "Stop"

$sqlpsreg="HKLM:\SOFTWARE\Microsoft\PowerShell\1\ShellIds\Microsoft.SqlServer.Management.PowerShell.sqlps"

if (Get-ChildItem $sqlpsreg -ErrorAction "SilentlyContinue")

{

Write-Host "SQL Server Provider for Windows PowerShell is not installed.`n"

}

else

{

$item = Get-ItemProperty $sqlpsreg

$sqlpsPath = [System.IO.Path]::GetDirectoryName($item.Path)

}

# Set mandatory variables for the SQL Server provider

Set-Variable -scope Global -name SqlServerMaximumChildItems -Value 0

Set-Variable -scope Global -name SqlServerConnectionTimeout -Value 30

Set-Variable -scope Global -name SqlServerIncludeSystemObjects -Value $false

Set-Variable -scope Global -name SqlServerMaximumTabCompletion -Value 1000

# Load the snapins, type data, format data

Push-Location

cd $sqlpsPath

Add-PSSnapin SqlServerCmdletSnapin100 -ErrorAction SilentlyContinue

Add-PSSnapin SqlServerProviderSnapin100 -ErrorAction SilentlyContinue

Update-TypeData -PrependPath SQLProvider.Types.ps1xml -ErrorAction SilentlyContinue

update-FormatData -prependpath SQLProvider.Format.ps1xml -ErrorAction SilentlyContinue

Pop-Location

$vMsg = (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + ":SQLPS snap-in initialization completed`n"

Write-Host $vMsg

# Local variables to store the event log source and the instance name to which the DAC connection needs to be made

[string] $vSource

[string] $vInstName

[string] $vFileName

$vMsg = (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + ":Reading Application Event Log`n"

Write-Host $vMsg

# Get the source which generated the 17884 message

$vSource = Get-EventLog -LogName "Application" -Message "*have not been picked up by a worker thread in the last 60 seconds*" -Source MSSQL* -Newest 1

$vSource = $vSource.Source.ToString()

$vMsg = (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + ":Connecting to SQL instance and executing script`n"

Write-Host $vMsg

# Check if it is a default instance or named instance and accordingly collect the output

if ($vSource.Equals("MSSQLSERVER"))

{

# Create the filename

$vFileName = "F:\" + (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + (Get-Date).Second.ToString() + "_" + $env:ComputerName.ToString() + "_FailoverAnalysis.OUT"

# Use Invoke-Sqlcmd to get the output of the script

Invoke-Sqlcmd -InputFile "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Install\FailoverAnalysis.sql" -ServerInstance $env:ComputerName -DedicatedAdministratorConnection | Out-File -filepath $vFileName

}

else

{

# Construct the SQL Server name to connect to for a named instance

$vInstName = $env:ComputerName + "\" + $vSource.Split("$")[-1]

# Create the filename

$vFileName = "F:\" + (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + (Get-Date).Second.ToString() + "_" + $vInstName + "_FailoverAnalysis.OUT"

# Use Invoke-Sqlcmd to get the output of the script

Invoke-Sqlcmd -InputFile "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Install\FailoverAnalysis.sql" -ServerInstance $vInstName -DedicatedAdministratorConnection | Out-File -filepath $vFileName

}

$vMsg = (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + ":Script execution completed`n"

Write-Host $vMsg

The Task Scheduler Configuration

The next thing that is needed to be done is create Task Scheduler task to execute the Powershell script above when a 17884 condition is reported.

1. Open the Task Scheduler MMC using Start –> Administrative Tools –> Task Scheduler to get the Task Scheduler UI. Right click on Event Viewer Tasks and click on Create Task (see screenshot below). Provide the appropriate settings, description and name. The salient points to keep in mind here are:
a. Use the Change User or Group option to provide an account which has permissions to log into the SQL Server instance(s) to gather the output of the FailoverAnalysis.sql script.
b. You can use the Configure for drop-down box to configure the task for Windows Server 2003 as well.image

2. Next click on the Triggers tab. Click on New, which will open the New Trigger dialog box. Use the Begin the Task drop down box to select On an event option (see screenshot below). Then select the Custom radio button and click on New Event Filter button which will open the New Event Filter dialog box. Provide the options as shown by the highlighted items in the screenshot. In the event sources drop down list, I selected all the SQL Server instances installed on my server. You can choose to leave the default option in case you plan to install more instances on this server. Using the default option can be beneficial if you choose to export this task to other servers as well. Provide the Advanced Settings option in the New Trigger dialog box. I chose to Stop task if it runs more than 30 minutes. Once that is done, you should see new trigger entry in the Trigger tab as On an Event with a custom event filter and the status as enabled.image

3. Under the Actions tab, click on New. Select Start a Program from the Action drop-down box. In the Program/script text box, provide Powershell.exe. In the Add arguments (optional) text box, provide the following: -File “F:\DeadlockSchedulerMonitor.ps1”. Or you can choose to provide a .cmd file path whose contents are “Powershell.exe -File “F:\DeadlockSchedulerMonitor.ps1” > F:\DeadlockSchedulerMonitorLog.txt”. I went with the second option. F: drive root is where my .ps1 script and .cmd file are present. You would need to modify this accordingly depending on where your script is. Once that is done, your action tab should be similar to the screenshot below.

image

4. I used the default settings for the Conditions and Settings tab. You can modify this as appropriate for your environment.

Once the task is created, all you need is for a 17884 condition to occur for the script output to be generated.

Let’s see the configuration in action

image

I simulated a 17884 condition on my SQL Server instance, see if the newly created task gets fired. (see above Application Log snippet). Looking into the F: drive, I find that the following files were created

20119212197_<sql instance name>_FailoverAnalysis.OUT
201192121918_<sql instance name>_FailoverAnalysis.OUT

The reason I have two files is that this is a two NUMA node box, and the deadlock schedulers condition was reported for both nodes. Note that this script reports first occurrence which is reported at 60 seconds since the message text snippet that is used to fetch the event from the Application log specifically looks for the 60 seconds keyword.

NOTE: You can monitor for 17888 events using the following message text instead of the one used above in the script for 17884 script and create a new task for the same. Message text: *appear deadlocked due to a large number of worker threads waiting*

What next?

Well with automation the possibilities are endless… Some of the quick things that come to my mind are:

1. You can create an event subscription and use the Forwarded Events log to track all 17884 from different servers which have SQL Server instances installed on them.
2. You can modify the powershell script to perform additional post processing or send out email notifications to your DBAs for the occurrence of such events.
3. You could even add additional post processing to the Powershell script to perform additional tasks.

If you have SCOM or any other Event log monitoring software, the only thing that you need to do is setup a custom alert for 17884 or 17888 error messages.

Have fun monitoring and customizing further!

Reference:

Understanding and Using PowerShell Support in SQL Server 2008
Windows Server 2008 Event Subscription with Task Scheduling

Disclaimer: This script and the solution provided is “AS IS” and any deployment that you do using a similar logic described requires due-diligence and testing on your part. The testing that I did for this was on Windows Server 2008 R2 & SQL Server 2008 R2.

What does cntr_type mean?

Have you ever wondered if the cntr_type column value in the sys.sysperfinfo or sys.dm_os_performance_counters output has a significant meaning or not. Well since the column value is there, it obviously has a meaning. Additionally, if the raw values represented by the output of some of the counter types is considered at face value, then your performance base lining can take a severe beating.

Each cntr_type value meaning can be found from the WMI Performance Counter Type or Windows Server Performance Counter Type documentation on MSDN. 

The common counter types in SQL Server are:
PERF_COUNTER_RAWCOUNT | Decimal | 65536
Raw counter value that does not require calculations, and represents one sample.

PERF_COUNTER_LARGE_RAWCOUNT | Decimal | 65792
Same as PERF_COUNTER_RAWCOUNT, but a 64-bit representation for larger values.

PERF_COUNTER_COUNTER | Decimal | 272696320
Average number of operations completed during each second of the sample interval. NOTE: For "per-second counters", this value is cumulative. The rate value must be calculated by sampling the value at discrete time intervals. The difference between any two successive sample values is equal to the rate for the time interval used. For example, batch requests/sec is a per-second counter, it would show cumulative values.

PERF_COUNTER_BULK_COUNT | Decimal | 272696576
Average number of operations completed during each second of the sample interval. This counter type is the same as the PERF_COUNTER_COUNTER type, but it uses larger fields to accommodate larger values.

PERF_AVERAGE_BULK | Decimal | 1073874176 | Decimal | 537003264
Number of items processed, on average, during an operation. This counter type displays a ratio of the items processed (such as bytes sent) to the number of operations completed, and requires a base property with PERF_AVERAGE_BASE as the counter type.

PERF_LARGE_RAW_BASE | Decimal | 1073939712
Base value found in the calculation of PERF_RAW_FRACTION, 64 bits.

Example:
If you had the following values:
SQLServer:Plan Cache | Cache Hit Ratio | Temporary Tables & Table Variables | 381
SQLServer:Plan Cache | Cache Hit Ratio Base | Temporary Tables & Table Variables | 386
Then the Temp Table/Variable cache hit ratio percentage would be: 98.7% (approx.)

You can use the query below to get the comments for each counter type as discussed above:

select object_name,counter_name,instance_name,cntr_value,
case cntr_type 
	when 65792 then 'Absolute Meaning' 
	when 65536 then 'Absolute Meaning' 
	when 272696576 then 'Per Second counter and is Cumulative in Nature'
	when 1073874176 then 'Bulk Counter. To get correct value, this value needs to be divided by Base Counter value'
	when 537003264 then 'Bulk Counter. To get correct value, this value needs to be divided by Base Counter value' 
end as counter_comments
from sys.dm_os_performance_counters
where cntr_type not in (1073939712)

 

Documentation on MSDN:

WMI Performance Counter Types

http://msdn.microsoft.com/en-us/library/aa394569(VS.85).aspx

SQL Server 2005 BOL Topic

sys.dm_os_performance_counters (Transact-SQL) 

The broad classes of counters are as follows:

Non-computational Counter Types

http://msdn.microsoft.com/en-us/library/aa392713(VS.85).aspx

Basic Algorithm Counter Types

http://msdn.microsoft.com/en-us/library/aa384813(VS.85).aspx

Counter Algorithm Counter Types

http://msdn.microsoft.com/en-us/library/aa389384(VS.85).aspx

Timer Algorithm Counter Types

http://msdn.microsoft.com/en-us/library/aa393909(VS.85).aspx

Precision Timer Algorithm Counter Types

http://msdn.microsoft.com/en-us/library/aa392755(VS.85).aspx

Queue-length Algorithm Counter Types

http://msdn.microsoft.com/en-us/library/aa392905(VS.85).aspx

Base Counter Types

http://msdn.microsoft.com/en-us/library/aa384811(VS.85).aspx

Statistical Counter Types

http://msdn.microsoft.com/en-us/library/aa393663(VS.85).aspx

What can you tell me Default Trace

The default trace functionality introduced in SQL Server 2005 is still a much under appreciated feature of the product. This is somewhat like a black box trace which is analogous to a airline black box can help you in certain post-mortem analysis as a DBA. I am sure a lot of you have a host of in-house scripts that you have in-place or some monitoring/auditing tool that reports some inadvertent/unscrupulous activity on the databases that you administer. Nonetheless, it is always a good idea to be aware of the out-of-the-box features that are available that can make your job a wee bit easier.

Deepak Rangarajan (Blog), a Microsoft MVP, had blogged about the set of events which are captured by this default trace. So, I shall not attempt to duplicate his efforts here but collate a set of issues that CSS has been able to resolve using Default Traces in the past and will continue to do so in the future.

Dude: Where’s my database?
Not really a question that you want to be pondering on if you do not have any monitoring/auditing tools in place. And nor do you want to BING that search term. (Ok, if you prefer even Google it!) The default trace captures the following events: Object:Created, Object:Altered and Object:Deleted which will help you track down inadvertent deletion of databases. SQL Server Management Studio has a standard report called “Schema Changes History” report which can help you track such changes on a SQL instance in the form of a report.

Memory pressure: Where, When?
Ideally this can be tracked by Perfmon data collection and most of the SQL monitoring tools available in the market today track the Target and Total Server Memory counters for SQL instances. When you monitor the delta changes for these counters, you can track the amount of fluctuations that are occurring. The default trace tracks these changes using the Server Memory Change profiler event which will help you decide if there is a large fluctuation in Max Server Memory configuration for the SQL instance. This event gets raised when SQL Server memory usage has increased or decreased by either 1 megabyte (MB) or 5 percent of the maximum server memory (Buffer Pool only), whichever is greater. This is quite useful when tracking memory pressure being experienced for a SQL instance due to factors external to the SQL instance.

Mirrored Database: When did your state change?
All database mirroring state changes for all your mirrored databases are also monitored using the Default Trace using the event: Database mirroring status change. This helps track all database mirroring state changes without having to parse the SQL Errorlogs for the same.

OMG: Someone’s mucking around with my database security?
We have handled multiple issues in the past where we needed to identify various security related audit activity on a SQL instance like new Logins/Users being added, mostly for post-mortem analysis. The default trace “Security Audit” event classes show their usefulness by letting you trace such kind of activity using the default trace.

Full-Text: What are you upto?
The default traces also help trace Full-Text crawl activity which at times have been known to cause performance degradation when very large FT catalogs are involved.

Database: Did you just grow or shrink?
SQL Server databases enabled for Autogrow help prevent the database from going down if the file becomes full but at the same time Autogrow events can cause performance issues like blocking. Additionally, database auto shrink (both log and data files) operations can severely impact database performance. With the help of the default traces, you can can all Auto Grow/Shrink operations for your databases on the instance.

SQL Performance: Now how can I use the default trace for this?
Well the answer is not a definitive one but the default trace can give you an indication if you have a baseline established. The default trace traces the following events: Hash Warning, Missing Column Statistics, Missing Join Predicate and Sort Warning. These events can be used to find out if there is an excessive amount of sorting or hashing that is occurring on your SQL instance (without a baseline, this data is moot point though). You can also identify if a number of your queries are suffering from missing statistics and missing join predicates.

One thing to note here is that the default traces rollover which means that if the default traces are not copied over to an alternate location, they will be overwritten and valuable information required for post-mortem information will be lost.

How to get File Space used information

There are multiple ways to get this done. Starting from SQL Server 2005, you can get this information using the system logs to get the similar kind of information using system catalogs.

This can be achieved using sys.master_files and FileProperty function. The query below will give you the same information as you see in the Object Explorer window when you do the following: Right Click database name –> Tasks –> Shrink –> Files.

EXEC sp_MSforeachdb 'SELECT DB_NAME() as database_name,
name as [File Name],
physical_name as [Physical Name],
size/128.0 as [Total Size in MB],
size/128.0 - CAST (FILEPROPERTY(name,''SpaceUsed'') as int)/128.0 AS [Available Space in MB]. [file_id]
FROM sys.database_files;'

Sleeping session ID: Where art thou?

This has been an age old confusion ever since the Dynamic Management Views changed the way SQL Performance monitoring was done.

A SPID/Session ID will show up as Sleeping/Awaiting Command when it has no active work request that it needs to execute on the Database Engine. Consider the following scenario:

1. You connect to a SQL instance using Management Studio

2. You execute: SELECT @@SPID

3. You then open another SSMS query window and try and lookup the Session ID returned by the output of Step #2 using the DMV: sys.dm_exec_requests.

If you didn’t know of this behavior, then you will be in for a surprise! This is because the Session ID associated with the query in Step #2 has no active work to perform which is why the DMV didn’t report a row for that particular session. However, when you look up this session id using sys.dm_exec_connections or sys.sysprocesses. The reason sys.sysprocesses reports this SPID is because it doesn’t differentiate between a session with/without an active work request. It will report all SPIDs currently connected to the database engine.

The above behavior is expected and by-design.

Bob Dorr has mentioned about this in his post on the CSS SQL Escalation blog and also talks about how this can affect concurrency if such a session has open transactions:

How It Works: What is a Sleeping / Awaiting Command Session