Debugging a non-yielding scheduler issue

If you have read the How To Diagnose and Correct Errors 17883, 17884, 17887, and 17888 whitepaper, you will know what a 17883 error reported in SQL Server 2000 means. If you read the “SQL Server 2005 – SchedulerMonitor” section of the whitepaper, you will understand that the way non-yielding messages are reported for a SQL Server worker thread is much more refined than in its predecessor. A non-yielding worker thread is reported as a Non-yielding Scheduler message in the SQL Server Errorlog. Since the whitepaper is very comprehensive in explaining SQL Server’s non-yielding detection algorithms, I shall not reiterate them here. In this post, I shall show you how to track down the cause of a non-yielding scheduler using public symbols in conjunction with the SQL Server Errorlog and Windows Event Logs.

Below is an excerpt from an Errorlog which is reporting a non-yielding scheduler message.

2011-07-24 22:17:45.850 Server       * BEGIN STACK DUMP:
2011-07-24 22:17:45.850 Server       *   07/24/11 22:17:45 spid 2388
2011-07-24 22:17:45.850 Server       *
2011-07-24 22:17:45.850 Server       * Non-yielding Scheduler
2011-07-24 22:17:45.850 Server       Stack Signature for the dump is 0x00000000000003B6
2011-07-24 22:17:53.380 Server       External dump process return code 0x20000001.
External dump process returned no errors.

2011-07-24 22:17:53.400 Server Process 0:0:0 (0x990) Worker 0x0000000005A6C1A0 appears to be non-yielding on Scheduler 1. Thread creation time: 12955600342903. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 0%. System Idle 99%. Interval: 70119 ms.

A few lines in the above excerpt are of particular interest, and I shall explain their significance later in this post. The Windows Application Event Log will report this as an Informational message with EventID = 17883. This is what I have in my application event log for the above reported non-yielding condition:

Process 0:0:0 (0x990) Worker 0x0000000005A6C1A0 appears to be non-yielding on Scheduler 1. Thread creation time: 12955600342903. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 0%%. System Idle 99%%. Interval: 70119 ms.
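
If you need to hunt for other occurrences of this event across the server, a quick way to do so is to filter the Application log for event ID 17883 with PowerShell. This is just a sketch; the source name "MSSQLSERVER" below is an assumption for a default instance (named instances log under MSSQL$<InstanceName>).

# Sketch: list non-yielding scheduler events (17883) recorded in the Application event log.
# The source "MSSQLSERVER" is assumed for a default instance; adjust for named instances.
Get-EventLog -LogName Application -Source "MSSQLSERVER" |
    Where-Object { $_.EventID -eq 17883 } |
    Select-Object TimeGenerated, Message |
    Format-List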

This event log message is the same as the one reported in the Errorlog excerpt above. Since the Errorlog doesn’t report the dump file (.mdmp) number, you will have to track it down using the timestamp when the issue was reported, or using the SQL Dumper log if you are performing post-mortem analysis. The SQL Dumper log is named “SQLDUMPER_ERRORLOG.log” and it is located by default in the SQL Server LOG folder. You will find a message similar to the one shown below which will point you to the correct .mdmp file.

(2A30:2760) 07/24/11 22:17:52, ACTION, SQLDUMPER_UNKNOWN_APP.EXE, MiniDump completed: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\LOG\SQLDump0025.mdmp
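
If you would rather not eyeball the log manually, the following PowerShell sketch pulls the “MiniDump completed” entries from SQLDUMPER_ERRORLOG.log and lists the dump files in the LOG folder, newest first, so you can match them against the Errorlog timestamp. The folder path below is an assumption based on the default instance shown above.

# Sketch: locate the dump files generated around the time of the non-yielding report.
$logDir = "C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\LOG"

# Entries written by SQL Dumper whenever a minidump is completed
Select-String -Path (Join-Path $logDir "SQLDUMPER_ERRORLOG.log") -Pattern "MiniDump completed"

# Dump files in the LOG folder, newest first
Get-ChildItem -Path $logDir -Filter "SQLDump*.mdmp" |
    Sort-Object LastWriteTime -Descending |
    Select-Object Name, LastWriteTime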

Now I need to use the Windows Debugging Tools to load the .mdmp file. Note that the non-yielding message in the SQL Server Errorlog reported “External dump process returned no errors.”, which means that the dump generation routine didn’t encounter any errors. The most common error that occurs while generating dumps (due to a resource crunch – memory or CPU) is a timeout, which means that the dump generation timed out and it is a delayed dump. In that case the information contained in the dump file may or may not be useful for a complete post-mortem analysis. This is when Microsoft CSS would ask you to collect additional data for the next occurrence if the currently available data is not sufficient for a complete post-mortem analysis.

So let’s move on to the good part now… i.e. analysis of the dump. Keep in mind that this is a mini-dump, which means that there is a limited amount of information stored in the dump, and we are using public symbols to analyze it. Even with the private symbols that Microsoft CSS engineers have access to, there might not be a whole lot they can decipher from a mini-dump, and they might request additional data!

When you load the dump using WinDBG, you will see the following information:

Loading Dump File [C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Log\SQLDump0025.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are available

Comment: 'Stack Trace'
Comment: 'Non-yielding Scheduler'


Debug session time: Sun Jul 24 22:17:52.000 2011 (UTC + 5:30)

The above tells you that this is a mini-dump for a non-yielding scheduler condition, the location from which you loaded the dump, and the time when the dump was generated. Next, I use the following command to set my symbol path and direct the symbols downloaded from the Microsoft symbol server to a local symbol cache on my machine.

What is a minidump?
A memory dump that contains the memory for the call stacks of all threads, the CPU registers, and information about which modules are loaded.

.sympath srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols

Then I issue a reload command to load the symbols for sqlservr.exe. This can also be done using CTRL+L and providing the complete string above (without .sympath), checking the Reload checkbox and clicking on OK. The only difference is that the public symbols for all the modules loaded in the dump, where available, will be downloaded from the Microsoft Symbol Server.

.reload /f sqlservr.exe

The next step is to verify that the symbols were loaded correctly using the lmvm sqlservr command. If the symbols were loaded correctly, you should see the following output. Note the “(pdb symbols)” annotation and the path to the downloaded sqlservr.pdb file.

0:019> lmvm sqlservr
start             end                 module name
00000000`00c60000 00000000`04820000   sqlservr   (pdb symbols)          D:\publicsymbols\sqlservr.pdb\6A3527657A3F4090AB2C96F7008A55C52\sqlservr.pdb
    Loaded symbol image file: sqlservr.exe
    Mapped memory image file: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe
    Image path: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe
    Product version:  10.50.1790.0

If symbols were not loaded, then you will see an output as shown below.

0:019> lmvm sqlservr
start             end                 module name
00000000`00c60000 00000000`04820000   sqlservr   (export symbols)       sqlservr.exe
    Loaded symbol image file: sqlservr.exe
    Mapped memory image file: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe
    Image path: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe

Next, I shall use the Thread ID (0x990) noted in the Errorlog excerpt above to switch to that thread and get the call stack of the worker that was reported as non-yielding.

0:019> ~~[0x990]s
ntdll!NtWaitForSingleObject+0xa:
00000000`76e06bda c3              ret
0:019> kL100

ntdll!NtWaitForSingleObject
kernel32!WaitForSingleObjectEx
sqlservr!SOS_Scheduler::SwitchContext
sqlservr!SOS_Scheduler::SuspendNonPreemptive
sqlservr!SOS_Scheduler::Suspend


msvcr80!endthreadex
msvcr80!endthreadex
kernel32!BaseThreadInitThunk
ntdll!RtlUserThreadStart

Looking at the call stack above, I can say that this is not the cause of the non-yielding condition (true in 99.99999% of cases)!! The reason I can say this is that I find SwitchContext on the thread stack, which means that the thread has yielded. The sequence of events is as follows: SQL Server detected a non-yielding worker thread, but by the time SQL Dumper was initiated to capture a memory dump of the SQL Server process, the thread had yielded and processing had moved on to the next work request in the queue.

However, if the top of your call stack did not contain the functions noted above but instead showed something like the two functions below, then you definitely have the right call stack and the non-yielding condition is due to stuck/stalled I/O.

ntdll!NtWriteFileGather
kernel32!WriteFileGather

The dump I investigated, which had the above call stack, also had the following stuck/stalled I/O warnings in the SQL Server Errorlog just prior to the dump being generated:

SQL Server has encountered 218 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\tempdb.mdf] in database [tempdb] (2). 
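
If you want to correlate the non-yielding scheduler report with these warnings yourself, the following PowerShell sketch searches the Errorlog for the 15-second I/O messages; the ERRORLOG path is an assumption based on the default instance layout. Compare the timestamps on the matching lines with the time of the non-yielding scheduler message.

# Sketch: look for stuck/stalled I/O warnings in the SQL Server Errorlog.
$errorlog = "C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\LOG\ERRORLOG"

Select-String -Path $errorlog -Pattern "I/O requests taking longer than 15 seconds" |
    Select-Object LineNumber, Line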

So now I know that the non-yielding condition was due to a faulty disk sub-system, and the SQL Server instance at this point is just a victim of that issue!

If you see the following message in the SQL Server Errorlog along with the dump generation message, then it is definitely a delayed dump and it could be missing valuable information that might have been pertinent to your analysis.

Timeout while waiting for external dump

Most of the common non-yielding conditions are documented in the whitepaper that I referred to above. If you are unable to get a call stack for the non-yielding issue, and the symptoms or sequence of events leading up to the issue don’t match any KB Article, then it’s time to call Microsoft CSS to get the memory dump analyzed with the help of the private symbols that the CSS engineers have access to.

Happy debugging!!

Reference:
KB 315263 – How to read the small memory dump files that Windows creates for debugging

Canonical ACLs: You said wwwhhat?

A while back I had to debug the SQL Server 2008 setup code to find out why the Registry Consistency Check rule was taking an obnoxiously long time to complete on a particular server. When you apply SQL Server 2008 SP2, you will find that the Registry Consistency Check rule is executed; in this case the setup UI didn’t show any progress for over an hour. When I checked a previous set of setup logs, I found the following failure (from the Detail.txt file):

2011-04-17 15:06:53 Slp: d:14c36b1c9c26b2820a8\x64\FixSqlRegistryKey_x86.exe exit code: 1
2011-04-17 15:06:53 Slp: Rule ‘AclPermissionsFacet’ detection result: AreSqlAclPermissionsCanonical= False
2011-04-17 15:06:53 Slp: Evaluating rule : AclPermissionsFacet
2011-04-17 15:06:53 Slp: Rule running on machine: <machine name>
2011-04-17 15:06:53 Slp: Rule evaluation done : Failed
2011-04-17 15:06:53 Slp: Rule evaluation message: The SQL Server registry keys from a prior installation cannot be modified. To continue, see SQL Server Setup documentation about how to fix registry keys.

In another setup attempt (again from the Detail.txt file), I found that FixSqlRegistryKey_x86.exe did not log any message after the EXE was launched:

2011-04-29 11:42:58 Slp: D:\Microsoft SQL 2008 SP2\x64\FixSqlRegistryKey_x64.exe exit code: 0
2011-04-29 11:42:58 Slp: Launching external tool: D:\Microsoft SQL 2008 SP2\x64\FixSqlRegistryKey_x86.exe
2011-04-29 11:42:58 Slp

If you capture a Process Monitor trace, you will find that FixSqlRegistryKey_*.exe traverses the registry keys pertaining to the SQL Server registry hive. The EXE (FixSqlRegistryKey_*.exe) looks up and fixes all non-canonical ACLs on the registry key and its child keys for the two hives below:

HKLM\Software\Microsoft\Microsoft SQL Server\
HKLM\Software\Microsoft\MSSQLServer\

Did I just read the word Canonical?!? Yes, you did. A canonical ACL must have ACEs sorted according to the following order:

1. Access-denied on the object
2. Access-denied on a child or property
3. Access-allowed on the object
4. Access-allowed on a child or property
5. All inherited ACEs

Reference: http://blogs.msdn.com/b/patcarna/archive/2009/02/09/the-access-control-list-is-not-canonical.aspx
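
To see the ACE ordering for yourself on a given key, a small PowerShell sketch such as the one below (using the first SQL Server hive as an example) lists each access rule with its type and inheritance flag, which is the ordering the canonical rules above describe, and prints the same AreAccessRulesCanonical property used by the script later in this post.

# Sketch: show the ACEs on a registry key along with the canonical-form check.
$key = "HKLM:\Software\Microsoft\Microsoft SQL Server"
$acl = Get-Acl -Path $key

# A canonical DACL lists explicit deny ACEs, then explicit allow ACEs, then inherited ACEs.
$acl.Access | Select-Object IdentityReference, AccessControlType, IsInherited
Write-Host "AreAccessRulesCanonical:" $acl.AreAccessRulesCanonical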

If SQL Setup finds ACLs in non-canonical form, then it will try to fix the ACLs for the registry key and its sub-keys which have ACLs in non-canonical form. If permission issues are encountered while traversing the registry, then this can take a long time to error out or fail with the error message mentioned above.

You have the following options to correct the non-canonical ACLs:

1. Locate non-canonical ACLs in the two hives mentioned above and fix them manually.
2. Re-run the rules.
3. Use a Process Monitor trace to check which registry key the EXE takes a long time to move past, or on which it encounters an access-denied error.

Note: Always use a backing file pointed to a local disk when capturing Process Monitor traces for prolonged periods. Refer to this post for more details.

Once you have fixed the non-canonical ACLs, you should be able to successfully get beyond the registry consistency rule check without having to get lunch and dinner in between!

Since I didn’t want to sit and traverse all the registry keys manually, I wrote a PowerShell script to check whether the ACLs for your SQL Server registry hives and their sub-keys are in canonical form, and to report back any key that is not.


Function fnCheckCanonical
{
    Param ([string] $regvalue)

    Write-Host "Checking hive: " $regvalue

    # Enumerate the key and all of its sub-keys
    $reg = Get-ChildItem $regvalue -Recurse

    foreach ($path in $reg)
    {
        # Get-ChildItem returns HKEY_LOCAL_MACHINE\... names; Get-Acl expects the HKLM: drive
        $regkey = $path.ToString().Replace("HKEY_LOCAL_MACHINE", "HKLM:")

        $acl = Get-Acl -Path $regkey
        if ($acl.AreAccessRulesCanonical)
        {
            Write-Host "Verified key: " $acl.Path
        }
        else
        {
            Write-Host "Reg Key: " $acl.Path
            Write-Host "IsCanonical: " $acl.AreAccessRulesCanonical
            Write-Host "Failed canonical check"
        }
    }
}

fnCheckCanonical HKLM:"Software\Microsoft\Microsoft SQL Server"
fnCheckCanonical HKLM:"Software\Microsoft\MSSQLServer"
fnCheckCanonical HKLM:"SOFTWARE\Wow6432Node\Microsoft\Microsoft SQL Server"
fnCheckCanonical HKLM:"SOFTWARE\Wow6432Node\Microsoft\MSSQLServer"

ACLs don’t change into non-canonical forms just because they are having a bad day, but it can happen when you use scripting from .NET/VBScript/PowerShell or subinacl.exe to incorrectly modify/add/remove ACLs on a registry key. Digging into the root cause of how they actually got into this state is as good as searching for a needle in all the haystacks in the world! Here is an article on how subinacl.exe could arrange ACLs incorrectly on Windows 2000: http://support.microsoft.com/kb/296865

Hope this helps you in fixing your non-canonical ACLs if you are unfortunate enough to run into this issue.


A shot in the arm for sp_purge_data

I had recently worked on a performance issue with the Management Data Warehouse job “mdw_purge_data”. This job calls the stored procedure sp_purge_data, which was taking a long time to complete. This stored procedure just got a shot in the arm in the form of an update released by the SQL Server developer working on it. The T-SQL for the modified stored procedure is available on the SQL Agent MSDN blog. The updated stored procedure will also ship in an upcoming update for SQL Server 2008 R2.

Addition (July 20th, 2011): SQL Server 2008 R2 Service Pack 1 includes various updates for the purge process, so the procedure no longer needs to be modified using the script mentioned in the aforementioned blog post.

A recent discussion on Twitter reminded me that I had this as one of the items to blog about so here it is. Now for some of the interesting technical details.

The poor performance of the stored procedure has been addressed through a refactoring of the stored procedure code, but the issue doesn’t occur on all Management Data Warehouse databases.

The issue occurs when you have large amounts of data stored in [snapshots].[query_stats] for each unique snapshot ID, or when you have orphaned rows in the [snapshots].[notable_query_plan] and [snapshots].[notable_query_text] tables of the Management Data Warehouse. The orphaned rows problem was corrected in the builds mentioned in the KB Article below:

970014    FIX: The Management Data Warehouse database grows very large after you enable the Data Collector feature in SQL Server 2008
http://support.microsoft.com/default.aspx?scid=kb;EN-US;970014

The above KB Article also has a workaround mentioned which allows you to clean-up the orphaned rows using a T-SQL script.

The “Query Statistics” data collection set collects query statistics, T-SQL text, and query plans of most of the statements that affect performance, enabling analysis of poorly performing queries in relation to overall SQL Server Database Engine activity. It makes use of the DMV sys.dm_exec_query_stats to populate the entries in the snapshots.query_stats table. If this DMV returns a large number of rows each time the collection set runs, the number of rows in the Management Data Warehouse table “snapshots.query_stats” can grow very quickly. This will then cause performance issues when the purge job executes on the MDW database.
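
To get a rough feel for whether your MDW database is heading toward this problem, the sketch below counts the rows in the tables discussed above. It uses Invoke-Sqlcmd, which requires the SQL Server PowerShell components, and the server and database names are placeholders for your environment.

# Sketch: row counts for the MDW tables involved in the purge.
$query = @"
SELECT 'snapshots.query_stats' AS table_name, COUNT_BIG(*) AS row_count FROM snapshots.query_stats
UNION ALL
SELECT 'snapshots.notable_query_text', COUNT_BIG(*) FROM snapshots.notable_query_text
UNION ALL
SELECT 'snapshots.notable_query_plan', COUNT_BIG(*) FROM snapshots.notable_query_plan;
"@

# "MyServer" and "MDW" are placeholders for the instance and database hosting the Management Data Warehouse.
Invoke-Sqlcmd -ServerInstance "MyServer" -Database "MDW" -Query $query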

In summary, if you are using the Management Data Warehouse to monitor a highly active server where a lot of ad-hoc or prepared SQL queries are executed, then I would recommend modifying the sp_purge_data stored procedure using the MSDN blog link I mentioned earlier in this post. I have seen a significant performance improvement from this on the specific environment I was working on.

Setup – RPC, please be nice to me

Recently, I had looked into a SQL Server 2008 cluster setup failure where the following information was noted in the Detail.txt file:

2011-02-21 11:58:37 Slp: Configuration action failed for feature SQL_Engine_Core_Inst during timing ConfigRC and scenario ConfigRC.
2011-02-21 11:58:37 Slp: The RPC server is unavailable
2011-02-21 11:58:37 Slp: Configuration action failed for feature SQL_Engine_Core_Inst during timing ConfigRC and scenario ConfigRC.
2011-02-21 11:58:37 Slp: System.ComponentModel.Win32Exception: The RPC server is unavailable
2011-02-21 11:58:37 Slp:    at Microsoft.SqlServer.Configuration.Cluster.ClusterResource.UpgradeResourceDLL(String nodeName, String dllPathName)
2011-02-21 11:58:37 Slp:    at Microsoft.SqlServer.Configuration.SqlEngine.SQLEngineClusterFeature.UpgradeResourceDLL(SQLServiceResource sqlResource)
2011-02-21 11:58:37 Slp:    at Microsoft.SqlServer.Configuration.SqlEngine.SQLEngineClusterFeature.ConfigureSQLEngineResourceType()
2011-02-21 11:58:37 Slp:    at Microsoft.SqlServer.Configuration.SqlEngine.SqlEngineSetupPrivate.Patch_ConfigRC(EffectiveProperties properties)
2011-02-21 11:58:37 Slp:    at Microsoft.SqlServer.Configuration.SqlEngine.SqlEngineSetupPrivate.Patch(ConfigActionTiming timing, Dictionary`2 actionData, PublicConfigurationBase spcb)
2011-02-21 11:58:37 Slp:    at Microsoft.SqlServer.Configuration.SqlConfigBase.SqlFeatureConfigBase.Execute(ConfigActionScenario scenario, ConfigActionTiming timing, Dictionary`2 actionData, PublicConfigurationBase spcbCurrent)
2011-02-21 11:58:37 Slp:    at Microsoft.SqlServer.Configuration.SqlConfigBase.SlpConfigAction.ExecuteAction(String actionId)
2011-02-21 11:58:37 Slp:    at Microsoft.SqlServer.Configuration.SqlConfigBase.SlpConfigAction.Execute(String actionId, TextWriter errorStream)
2011-02-21 11:58:37 Slp: Exception: System.ComponentModel.Win32Exception.
2011-02-21 11:58:37 Slp: Source: Microsoft.SqlServer.Configuration.Cluster.
2011-02-21 11:58:37 Slp: Message: The RPC server is unavailable.

“The RPC server is unavailable” is a pretty generic error and can happen for a number of reasons. One of the common causes that we have seen in CSS while troubleshooting setup issues is incorrect DNS entries. This can happen if the DNS entry for the cluster network name wasn’t configured properly and is pointing to an incorrect IP address.

Another point to keep in mind is Windows Firewall. Make sure that the Firewall configuration allows this RPC call through.

I worked with Shahryar (Twitter) on a similar issue last week and it was identified that a PING request to the cluster name or IP returned no response.
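
A quick sanity check along those lines is shown in the sketch below: resolve the cluster network name from the node where setup is running and ping it. "MySQLClusterName" is a placeholder for your own cluster network name; verify that the address returned matches what you expect for that name.

# Sketch: verify name resolution and basic reachability for the cluster network name.
$clusterName = "MySQLClusterName"

# What does the resolver return for this name?
[System.Net.Dns]::GetHostAddresses($clusterName) | ForEach-Object { $_.IPAddressToString }

# Does the name respond to ping from this node?
Test-Connection -ComputerName $clusterName -Count 2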

If all else fails, then network tracing would help in figuring out what is failing and where.

References:

Error message when you connect to a cluster virtual server by using the named pipes protocol: "The machines selected for remote communication is not available at this time."
http://support.microsoft.com/kb/306985/

Downloading that cumulative update or hotfix…

I recently answered a question on Twitter regarding the availability of Cumulative Update downloads for specific platforms.

Starting with SQL Server 2005, our Release Services team moved to an Incremental Servicing Model to deliver fixes for issues identified in current releases of SQL Server. We now periodically release a Cumulative Update package for a Service Pack which contains all the fixes released since the RTM version of that Service Pack, hence the name “Cumulative Update”.

When you click on the hotfix download link, you will get an option to download the packages for all the platforms.

I know this is quite trivial information but I have answered this question a few times and hence the post.