SQL Server 2005 Trace Flags

SQL Server exposes multiple trace flags which are required to set specific server characteristics or to switch off a particular behavior. Some trace flags are required to enable the fix post the installation of the update.

This is currently a work-in-progress. The trace flags mentioned in the table below have the associated public article links mentioned which provide more information about the trace flag.

If you want a trace flag to be mentioned in the list below, then feel free to leave a comment. Note that only publicly documented trace flags will appear in this list.

Disclaimer
1. Data mentioned below is as of 29th June 2012.
2. Trace flags should be used under the guidance of Microsoft SQL Server support.  They are used in this post for discussion purposes only and may not be supported in future versions.
3. Trace flags for hotfixes should be only enabled if the fix is applicable to the SQL Server instance that you are enabling the trace flag on.

Continue reading

SSAS: I will add you to the existing SQL cluster

You have an existing SQL Server 2005 Failover Cluster which has Database Engine and Full-text Search as the clustered components. Now you suddenly decide that this SQL Server cluster group requires a clustered Analysis Services instance. This will then lead you down a rabbit hole trying to figure out what to do to achieve this unless and until you know where to look.

There is a note in the Books Online content stating the following:

You cannot install Analysis Services to the same cluster group as the Database Engine. You must install Analysis Services to its own group and then, after installation, you can move Analysis Services to the same group as SQL Server.

However the above information is incorrect!!

You CANNOT add a failover cluster instance of Analysis Services to an existing SQL Server failover cluster instance using the Setup GUI!

Adding an Analysis Services failover cluster instance to an existing SQL Server cluster group has probably been a point of consternation for a person attempting SQL setup if you ran into the above mentioned scenario.

Now to the interesting part… achieving the objective using setup parameters and command line setup!! Sounds like fun, eh? The setup command would need to use the following parameters:

start /wait <CD or DVD Drive>\setup.exe
/qb VS=<VSName> – Virtual Server name should be the same as the existing database engine virtual server name
INSTALLVS=
Analysis_Server
INSTANCENAME=<InstanceName> – Instance name should also be the same as the instance name of the database engine. For default instances, use MSSQLSERVER.
ADDLOCAL=Analysis_Server
ADDNODE=<NodeName1, NodeName2,… NodeNameN>
GROUP=<SQL Diskgroup>
IP=<IP,Networkname> – Network name here is the cluster network name. See Gotcha#2 below.
ADMINPASSWORD=<StrongPassword>
SAPWD=<StrongPassword
INSTALLSQLDIR=<InstallationPath>
INSTALLASDATADIR=<Sharedrivepath>
SQLACCOUNT=<domain\user>
SQLPASSWORD=<DomainUserPassword>
AGTACCOUNT=<domain\user>
AGTPASSWORD=<DomainUserPassword>
ASACCOUNT=<domain\user>
ASPASSWORD=<DomainUserPassword>
SQLBROWSERACCOUNT=<domain\user>
SQLBROWSERPASSWORD=<DomainUserPassword>
SQLREPORTING=1
ASCLUSTERGROUP=<YourDomain \ YourDomainGroupName>

There are few gotchas here.

1. If you specify a Virtual Server name other than your existing SQL Server virtual server name for the VS parameter, then you will get a failure while trying to create a new IP resource:

Error Code: -2147019839
Windows Error Text: The cluster IP address is already in use

2. If you specify incorrect parameters for the IP parameter, you could encounter a"network name not found" error. The network name value in the IP parameter is the name of the cluster network that shows up in the Failover Cluster Manager snap-in or Cluster Administrator snap-in. The network name is NOT the name of the Windows Network Interface.

3. You need an additional shared disk. If your Database Engine is using G: drive for the existing instance, you cannot use the G: drive to install the Analysis Services instance even though you are installing into the same cluster group.

References

SQL Server 2005 Setup Parameters
http://technet.microsoft.com/en-us/library/ms144259(SQL.90).aspx

Deadlocked Schedulers and event notifications with a bit of Powershell

I had written 2 posts recently about Deadlocked Schedulers debugging (Part 1 | Part 2) which basically walked through a 17784 and 17888 condition and the steps that you could take to debug such an issue. A quote from the both these articles:

So if you have a monitoring tool that is capturing the output of queries being executed by the active SQL Server sessions, then you can go back to your monitoring repository and dig up the queries which these threads were executing. If such data is not available from the time of the issue, more often that not you will have to end up collecting additional data for the next problem occurrence!

You will need to collect additional data if there wasn’t a monitoring software collecting the same to determine the set of queries and sequence of events leading upto the issue. CSS Engineers can dig into the dump using the private symbols and probably provide you with additional information which can help you along the right direction. But alas, even they do not have magic wands if you cannot provide them with additional information like blocking output/profiler traces for 17884 and 17888 conditions caused by excessive blocking or parallelism.

SQL Server 2008 Service Pack 2 provided an update which made the FailoverAnalysis.sql script available again. This was available in SQL Server 2005 and now is available for SQL Server 2008 as well. This is particularly useful for SQL Server Failover Clusters where you can configure SQLDumper to generate a dump before the failover occurs and also set the SqlDMVScriptTimeout property to get an output of the DMVs used in the script before a failover happens.

image

You can refer “Cluster failovers and the Sqldumper.exe utility” section in the KB917825 for more details on the above mentioned feature. So what am I doing writing this post??

This post is aimed at SQL Server Standalone instances where you cannot benefit from sqldumper. Deadlocked Schedulers are reported with EventIDs 17884 or 17888. I am going to use this to create an scheduled task on a Event Log entry for these two specific EventIDs. I am big fan of automating repetitive tasks and the benefit of such an automation is that you can get additional data from the problem occurrence without having to just rely on a SQL Server mini-dump memory file and be at the mercy of the Windows Debugging tools. For this post, you can say a small good-bye to your favorite debugging tool and read on for some automation in monitoring!!

However, there are some requirements for this kind of indigenous monitoring to work:

1. Dedicated Administrator Connection needs to be enabled for your SQL Server instance(s) – This is required because during a Deadlocked Schedulers condition, you would have run out of worker threads. So, new connection requests cannot be processed unless and until you establish a connection through DAC. If you are planning to use Event Subscription and remote data gathering, then you need Remote DAC enabled for your SQL instance(s) being monitored. Note: Event subscription is a feature available on Windows Server 2008 and above.

2. You need sysadmin privileges for the account that will be used to collect the output of the FailoverAnalysis.sql script.

3. Powershell and SQLPS needs to be available on the server being monitored. If you don’t want to install Powershell components, then the same functionality can be achieved using SMO and VBScript which I have used extensively in the past till I became a fan of POSH awesomeness!! Smile

Now that I have the pre-requisites out of the way, let’s get down to the good part…

The Powershell Script

I had already mentioned that I will be using a Powershell script to get the output of the FailoverAnalysis.sql script for the SQL instance that is encountering the 17884 condition. The piping feature of Powershell really makes it easy to write compact scripts to perform a wide variety of tasks. The script is divided into three parts.

The first part of the script actually uses the script in the section “Adding the SQL Server Snap-ins to Windows PowerShell” of the article “Running SQL Server PowerShell” to initialize the SQLPS snap-in in the session that you will be running the script.

The second part of the script reads the Application event log for the first occurrence of an informational message with the text substring “have not been picked up by a worker thread in the last 60 seconds” from any SQL Server instance as the source.

The third part of the script figures out the SQL Server instance name and connects to it to execute the FailoverAnalysis.sql script. You can modify the output location of the script output below as appropriate for your environment.

Script filename: DeadlockSchedulerMonitor.ps1


#################################################################################

# Script Name: DeadlockSchedulerMonitor #

# Author: Amit Banerjee #

# Date: 9/1/2011 #

# Description: #

# The script reads the first 17884 (Deadlocked Schedulers) event from the #

# Windows Application Event log and identifies the source. Based on the #

# source name, it connects to the SQL Server instance using a DAC #

# connection and collects the FailoverAnalysis.sql script output. #

#################################################################################

$vMsg = (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + ":Performing SQLPS snap-in initialization`n"

Write-Host $vMsg

# Call the initialization script to load the SQLPS snap-in in the session that you running your script

# Add the SQL Server Provider.

$ErrorActionPreference = "Stop"

$sqlpsreg="HKLM:\SOFTWARE\Microsoft\PowerShell\1\ShellIds\Microsoft.SqlServer.Management.PowerShell.sqlps"

if (Get-ChildItem $sqlpsreg -ErrorAction "SilentlyContinue")

{

Write-Host "SQL Server Provider for Windows PowerShell is not installed.`n"

}

else

{

$item = Get-ItemProperty $sqlpsreg

$sqlpsPath = [System.IO.Path]::GetDirectoryName($item.Path)

}

# Set mandatory variables for the SQL Server provider

Set-Variable -scope Global -name SqlServerMaximumChildItems -Value 0

Set-Variable -scope Global -name SqlServerConnectionTimeout -Value 30

Set-Variable -scope Global -name SqlServerIncludeSystemObjects -Value $false

Set-Variable -scope Global -name SqlServerMaximumTabCompletion -Value 1000

# Load the snapins, type data, format data

Push-Location

cd $sqlpsPath

Add-PSSnapin SqlServerCmdletSnapin100 -ErrorAction SilentlyContinue

Add-PSSnapin SqlServerProviderSnapin100 -ErrorAction SilentlyContinue

Update-TypeData -PrependPath SQLProvider.Types.ps1xml -ErrorAction SilentlyContinue

update-FormatData -prependpath SQLProvider.Format.ps1xml -ErrorAction SilentlyContinue

Pop-Location

$vMsg = (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + ":SQLPS snap-in initialization completed`n"

Write-Host $vMsg

# Local variables to store the event log source and the instance name to which the DAC connection needs to be made

[string] $vSource

[string] $vInstName

[string] $vFileName

$vMsg = (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + ":Reading Application Event Log`n"

Write-Host $vMsg

# Get the source which generated the 17884 message

$vSource = Get-EventLog -LogName "Application" -Message "*have not been picked up by a worker thread in the last 60 seconds*" -Source MSSQL* -Newest 1

$vSource = $vSource.Source.ToString()

$vMsg = (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + ":Connecting to SQL instance and executing script`n"

Write-Host $vMsg

# Check if it is a default instance or named instance and accordingly collect the output

if ($vSource.Equals("MSSQLSERVER"))

{

# Create the filename

$vFileName = "F:\" + (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + (Get-Date).Second.ToString() + "_" + $env:ComputerName.ToString() + "_FailoverAnalysis.OUT"

# Use Invoke-Sqlcmd to get the output of the script

Invoke-Sqlcmd -InputFile "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Install\FailoverAnalysis.sql" -ServerInstance $env:ComputerName -DedicatedAdministratorConnection | Out-File -filepath $vFileName

}

else

{

# Construct the SQL Server name to connect to for a named instance

$vInstName = $env:ComputerName + "\" + $vSource.Split("$")[-1]

# Create the filename

$vFileName = "F:\" + (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + (Get-Date).Second.ToString() + "_" + $vInstName + "_FailoverAnalysis.OUT"

# Use Invoke-Sqlcmd to get the output of the script

Invoke-Sqlcmd -InputFile "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Install\FailoverAnalysis.sql" -ServerInstance $vInstName -DedicatedAdministratorConnection | Out-File -filepath $vFileName

}

$vMsg = (Get-Date).Year.ToString() + (Get-Date).Month.ToString() + (Get-Date).Day.ToString() + (Get-Date).Hour.ToString() + (Get-Date).Minute.ToString() + ":Script execution completed`n"

Write-Host $vMsg

The Task Scheduler Configuration

The next thing that is needed to be done is create Task Scheduler task to execute the Powershell script above when a 17884 condition is reported.

1. Open the Task Scheduler MMC using Start –> Administrative Tools –> Task Scheduler to get the Task Scheduler UI. Right click on Event Viewer Tasks and click on Create Task (see screenshot below). Provide the appropriate settings, description and name. The salient points to keep in mind here are:
a. Use the Change User or Group option to provide an account which has permissions to log into the SQL Server instance(s) to gather the output of the FailoverAnalysis.sql script.
b. You can use the Configure for drop-down box to configure the task for Windows Server 2003 as well.image

2. Next click on the Triggers tab. Click on New, which will open the New Trigger dialog box. Use the Begin the Task drop down box to select On an event option (see screenshot below). Then select the Custom radio button and click on New Event Filter button which will open the New Event Filter dialog box. Provide the options as shown by the highlighted items in the screenshot. In the event sources drop down list, I selected all the SQL Server instances installed on my server. You can choose to leave the default option in case you plan to install more instances on this server. Using the default option can be beneficial if you choose to export this task to other servers as well. Provide the Advanced Settings option in the New Trigger dialog box. I chose to Stop task if it runs more than 30 minutes. Once that is done, you should see new trigger entry in the Trigger tab as On an Event with a custom event filter and the status as enabled.image

3. Under the Actions tab, click on New. Select Start a Program from the Action drop-down box. In the Program/script text box, provide Powershell.exe. In the Add arguments (optional) text box, provide the following: -File “F:\DeadlockSchedulerMonitor.ps1”. Or you can choose to provide a .cmd file path whose contents are “Powershell.exe -File “F:\DeadlockSchedulerMonitor.ps1” > F:\DeadlockSchedulerMonitorLog.txt”. I went with the second option. F: drive root is where my .ps1 script and .cmd file are present. You would need to modify this accordingly depending on where your script is. Once that is done, your action tab should be similar to the screenshot below.

image

4. I used the default settings for the Conditions and Settings tab. You can modify this as appropriate for your environment.

Once the task is created, all you need is for a 17884 condition to occur for the script output to be generated.

Let’s see the configuration in action

image

I simulated a 17884 condition on my SQL Server instance, see if the newly created task gets fired. (see above Application Log snippet). Looking into the F: drive, I find that the following files were created

20119212197_<sql instance name>_FailoverAnalysis.OUT
201192121918_<sql instance name>_FailoverAnalysis.OUT

The reason I have two files is that this is a two NUMA node box, and the deadlock schedulers condition was reported for both nodes. Note that this script reports first occurrence which is reported at 60 seconds since the message text snippet that is used to fetch the event from the Application log specifically looks for the 60 seconds keyword.

NOTE: You can monitor for 17888 events using the following message text instead of the one used above in the script for 17884 script and create a new task for the same. Message text: *appear deadlocked due to a large number of worker threads waiting*

What next?

Well with automation the possibilities are endless… Some of the quick things that come to my mind are:

1. You can create an event subscription and use the Forwarded Events log to track all 17884 from different servers which have SQL Server instances installed on them.
2. You can modify the powershell script to perform additional post processing or send out email notifications to your DBAs for the occurrence of such events.
3. You could even add additional post processing to the Powershell script to perform additional tasks.

If you have SCOM or any other Event log monitoring software, the only thing that you need to do is setup a custom alert for 17884 or 17888 error messages.

Have fun monitoring and customizing further!

Reference:

Understanding and Using PowerShell Support in SQL Server 2008
Windows Server 2008 Event Subscription with Task Scheduling

Disclaimer: This script and the solution provided is “AS IS” and any deployment that you do using a similar logic described requires due-diligence and testing on your part. The testing that I did for this was on Windows Server 2008 R2 & SQL Server 2008 R2.

Debugging that latch timeout

My last post of debugging an assertion didn’t have any cool debugging tips since there is not much that you can do with an assertion dump unless you have access to private symbols and sometimes even access to the source code. In this post, I am going to not disappoint and show you some more cool things that the windows debugger can do for you with public symbols for a latch timeout issue.

When you encounter a latch timeout (buffer or non-buffer latch), the first occurrence of it’s type generates a mini-dump. If there are further occurrences of the same latch timeout, then that is reported as an error message in the SQL Errorlog.

Buffer latch timeouts are typically reported using Error: 844 and 845. The common reasons for such errors are documented in a KB Article. For a non-buffer latch timeout, you will get the an 847 error.

Error # Error message template (from sys.messages)
844 Time out occurred while waiting for buffer latch — type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p.  Continuing to wait.
845 Time-out occurred while waiting for buffer latch type %d for page %S_PGID, database ID %d.
846 A time-out occurred while waiting for buffer latch — type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit Id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Not continuing to wait.
847

Timeout occurred while waiting for latch: class ‘%ls’, id %p, type %d, Task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Continuing to wait.

This is what you will see in the SQL Errorlog when a latch timeout occurs.

spid148     Time out occurred while waiting for buffer latchtype 4, bp 0000000832FE1200, page 3:11234374, stat 0x7c20009, database id: 120, allocation unit id: 72057599731367936, task 0x0000000003C4F2E8 : 0, waittime 300, flags 0x1a, owning task 0x0000000003C129B8.  Continuing to wait.
spid148     **Dump thread – spid = 148, PSS = 0x000000044DC17BD0, EC = 0x000000044DC17BE0
spid148     ***Stack Dump being sent to D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0001.txt

spid148     * Latch timeout
spid148     * Input Buffer 84 bytes –
spid148     *             DBCC CHECKDB WITH ALL_ERRORMSGS
External dump process returned no errors.

I have only pasted the relevant portion from the Errorlog for brevity. As I have outlined in my previous blog posts on similar topics, that there is a large opportunity for due diligence that can be done with the help of the Windows Event Logs and the SQL Server Errorlogs before you start spawning off windows debugger to analyze the memory dump on your system. The first few obvious things that you will notice is that SPID 148 encountered the issue while performing a CHECKDB on database ID 120. The timeout occurred while waiting for a buffer latch on a page (Page ID is available in the message above). I don’t see a “Timeout waiting for external dump process” message in the SQL Errorlog which means that I have a good chance of extracting useful information from the mini-dump that was generated by SQLDumper.

Latch timeouts are typically victims of either a system related issue (hardware or drivers or operating system or a previous error encountered by SQL Server). So the next obvious action item would be to look into the SQL Errorlogs and find out if there were any additional errors prior to the latch timeout issue. I see a number of OS Error 1450 reported by the same SPID 148 for the same file handle but different offsets.

spid148     The operating system returned error 1450(Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x0000156bf36000 in file with handle 0x0000000000001358. This is usually a temporary condition and the SQL Server will keep retrying the operation. If the condition persists then immediate action must be taken to correct it.

Additionally, I see prior and post (within 5-15 minutes) the latch timeout issue, multiple other SPIDs reporting the same 1450 error message for different offsets but again on the same file.

spid137     The operating system returned error 1450(Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x000007461f8000 in file with handle 0x0000000000001358. This is usually a temporary condition and the SQL Server will keep retrying the operation. If the condition persists then immediate action must be taken to correct it.

I also see the latch timeout message being reported after every 300 ms for the same page and the database.

spid148     Time out occurred while waiting for buffer latch — type 4, bp 0000000832FE1200, page 3:11234374, stat 0x7c20009, database id: 120, allocation unit id: 72057599731367936, task 0x0000000003C4F2E8 : 0, waittime 82800, flags 0x1a, owning task 0x0000000003C129B8.  Continuing to wait.

Notice the waittime above, it has increased from 300 to 82800!! So the next thing I do is look up issues related to CHECKDB and 1450 error messages on the web using Bing (Yes, I do use BING!!). These are relevant posts related to the above issue.

http://blogs.msdn.com/b/psssql/archive/2008/07/10/sql-server-reports-operating-system-error-1450-or-1452-or-665-retries.aspx
http://blogs.msdn.com/b/psssql/archive/2009/03/04/sparse-file-errors-1450-or-665-due-to-file-fragmentation-fixes-and-workarounds.aspx

As of now, it is quite clear that the issue is related to a possible sparse file issue related to file fragmentation. Now it is time for me to check if there are other threads in the dump waiting on SyncWritePreemptive calls.

Use the location provided in the Errorlog snippet reporting the Latch Timeout message to locate the mini-dump for the issue (in this case SQLDump0001.mdmp).

Now when you load the dump using WinDBG, you will see the following information:

Loading Dump File [D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0001.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are available

Comment: ‘Stack Trace’
Comment: ‘Latch timeout’

This dump file has an exception of interest stored in it.

The above tells you that this is a mini-dump for a Latch Timeout condition and the location from where you loaded the dump. Then I use the command to set my symbol path and direct the symbols downloaded from the Microsoft symbol server to a local symbol file cache on my machine.

.sympath srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols

Then I issue a reload command to load the symbols for sqlservr.exe. This can also be done using CTRL+L and providing the complete string above (without .sympath), checking the Reload checkbox and clicking on OK. The only difference here is that the all the public symbols for all loaded modules in the dump will be downloaded from the Microsoft Symbol Server which are available.

.reload /f sqlservr.exe

Next thing is to verify that the symbols were correctly loaded using the lmvm sqlservr command. If the symbols were loaded correctly, you should see the following output. Note the text in green.

0:005> lmvm sqlservr

start             end                 module name
00000000`01000000 00000000`03668000   sqlservr T (pdb symbols)          d:\publicsymbols\sqlservr.pdb\2A3969D78EE24FD494837AF090F5EDBC2\sqlservr.pdb

 

If symbols were not loaded, then you will see an output as shown below.

0:005> lmvm sqlservr
start end module name
00000000`01000000 00000000`03668000 sqlservr (export symbols) sqlservr.exe

I will use the !findstack command to locate all threads which have the function call SyncWritePreemptive on their callstack.

0:070> !findstack sqlservr!FCB::SyncWritePreemptive 0

Thread 069, 1 frame(s) match
Thread 074, 1 frame(s) match
Thread 076, 1 frame(s) match
Thread 079, 1 frame(s) match
Thread 081, 1 frame(s) match
Thread 082, 1 frame(s) match
Thread 086, 1 frame(s) match
Thread 089, 1 frame(s) match
Thread 091, 1 frame(s) match
Thread 095, 1 frame(s) match
Thread 098, 1 frame(s) match
Thread 099, 1 frame(s) match
Thread 104, 1 frame(s) match
Thread 106, 1 frame(s) match
Thread 107, 1 frame(s) match
Thread 131, 1 frame(s) match
Thread 136, 1 frame(s) match

0:070> ~81s
ntdll!ZwWaitForSingleObject+0xa:
00000000`77ef0a2a c3              ret
0:081> kL100

ntdll!ZwDelayExecution
kernel32!SleepEx
sqlservr!FCB::SyncWritePreemptive
sqlservr!FCB::PullPageToReplica
sqlservr!alloca_probe
sqlservr!BUF::CopyOnWrite
sqlservr!PageRef::PrepareToDirty
sqlservr!RecoveryMgr::DoCOWPreWrites

You could get all the callstacks with the function that you are searching for using the command: !findstack sqlservr!FCB::SyncWritePreemptive 3

If you look at the thread that raised ended up raising the Latch Timeout warning was also performing a CHECKDB.

0:074> .ecxr

0:074> kL100

kernel32!RaiseException
sqlservr!CDmpDump::Dump
sqlservr!CImageHelper::DoMiniDump
sqlservr!stackTrace
sqlservr!LatchBase::DumpOnTimeoutIfNeeded
sqlservr!LatchBase::PrintWarning
sqlservr!alloca_probe
sqlservr!BUF::AcquireLatch


sqlservr!UtilDbccCreateReplica
sqlservr!UtilDbccCheckDatabase
sqlservr!DbccCheckDB
sqlservr!DbccCommand::Execute

If you cannot find the thread which raised the Latch Timeout warning, you could print out all the callstacks in the dump using ~*kL100 and the searching for the function call in blue above. It is quite clear from the callstack above that the thread was also involved in performing a CHECKDB operation as reported in the SQL Errorlog in the input buffer for the Latch Timeout dump.

If case you were not able to identify the issue right off the bat, you need to check the build that you are on and look for issues that were addressed related to Latch Timeouts for the SQL Server release  that you are using. The symptoms section would have sufficient amount of information for you to compare with your current symptoms and scenario and determine if the KB Article that you are looking at is applicable in your case.

Now is the time, when you need to have some context about the operations that were occurring on the server to actually determine what the actual issue is. Based on what I heard from the system administrators that there was a CHECKDB being executed on the database while the application was executing DML operations on the database. Additionally, the volume on which the disk resides on has fragmentation and the database in question is large (>750GB).

Based on the two MSDN blog posts that I mentioned above, it is quite clear that it is possible to run into sparse file limitations when there is high amount of fragmentation on the disks or if there are a large number of concurrent changes occurring on the database when a CHECKDB is executing on it. A number of Windows and SQL Server updates along with workarounds to run CHECKDB on such databases is mentioned in the second blog post mentioned above. On a separate note, this is not an issue with CHECKDB! It is limitation that you are hitting with sparse files on the OS layer. Remember CHECKDB, starting from SQL Server 2005, creates an internal snapshot (makes sparse file) to execute the consistency check. Paul Randal’s tweet made me add this line to call this out explicitly!

As always… if you are still stuck, contact Microsoft CSS with the mini-dump file, SQL Errorlog and the Windows Event Logs. It might be quite possible that CSS might ask you to collect additional data as most Latch Timeout issues are generally an after-effect of a previous issue. In this case, it was the OS Error 1450.

Well… That’s it for today! Hope this is helpful for the next time you encounter a Latch Timeout issue.

Additional References

Four stages of NTFS File Growth
KB 315263 – How to read the small memory dump files that Windows creates for debugging

 

Debugging that Assert condition: Maybe Not

Last week I had shown how to debug non-yielding scheduler and deadlocked schedulers memory dumps. In this post, I shall talk about Assertions. When an assert condition check fails in the SQL Server code base, a mini-dump of the SQL Server process is created which is found by default in the SQL Server LOG folder.

An assert is basically a predicate (true-false condition) put in a program’s code by the developer which he/she thinks should always evaluate to TRUE. If this fails, then a the assert failure code written by the developer will be executed.

This is one of those scenarios where you will NOT actually need to debug the assert dump and still be able to achieve a lot without opening a debugger. So if you were hoping for some more cool debugging steps, I will have to disappoint you till my next post! Whenever an assert condition fails, the message is logged in the SQL Server Errorlog along with the failing assert condition.

Let me walk you through an example. You would see messages similar to the one below in the Windows Application Event log when an Assertion check fails:

MSSQLSERVER    Error    (2)    3624    N/A    <server name> A system assertion check has failed. Check the SQL Server error log for details. Typically, an assertion failure is caused by a software bug or data corruption. To check for database corruption, consider running DBCC CHECKDB. If you agreed to send dumps to Microsoft during setup, a mini dump will be sent to Microsoft. An update might be available from Microsoft in the latest Service Pack or in a QFE from Technical Support.
MSSQLSERVER    Error    (2)    17066    N/A    <server name> "SQL Server Assertion: File: <""logmgr.cpp"">, line=<line number> Failed Assertion = ‘!(minLSN.m_fSeqNo < lfcb->lfcb_fSeqNo)‘. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted."

So if you are monitoring your application event logs, 17066 or 3624 is the error number that you need to monitor for. Once you are aware that an exception occurred, you can check the SQL Server Errorlog from the time period when the issue was reported and see what errors were reported. You will find the same verbiage reported in the SQL Errorlog as in the application logs. (See errorlog excerpt below)

Location: "logmgr.cpp" <line number>
Expression: !(minLSN.m_fSeqNo < lfcb->lfcb_fSeqNo)


While acting as a mirroring partner for database ‘<Database name>’, server instance ‘<database name>’ encountered error 3624, status 1, severity 20. Database mirroring will be suspended.  Try to resolve the error and resume mirroring.

Notice the additional errorlog entry reported above pertaining to Database Mirroring. I find that the database mirroring setup for an user database encountered an assertion which was reported earlier in the SQL Errorlog. The build number for the SQL Server instance in question is 5000 (SQL Server 2005). The first thing you should do is search for the assert expression [!(minLSN.m_fSeqNo < lfcb->lfcb_fSeqNo)] on Bing (Or any other search engine if you are really persistent on using a particular search engine 😉 ).

Additionally, as the error message points out, run a CHECKDB on the database in question and find out if there are any inconsistencies reported. If yes, then the first task is to restore the last known good backup or correct the inconsistencies before doing anything else.

Getting back to the search results, you will find that you get KB981273 and KB2403218 in the search results. Pay close attention to the symptoms section in the KB Article as fixes for assertion failures are very specific. If the symptoms do not match, then the KB Article you found might not apply to your situation.

Symptoms from KB981273:

In Microsoft SQL Server 2005, when you restart the SQL Server service, error messages that resemble the following are logged in the SQL Server Errorlog file.

Symptoms from KB2403218:

Consider the following scenario:

  • You create a database mirroring session between two instances of SQL Server 2005 and SQL Server 2008 and SQL Server 2008 R2 by using the High-Performance (asynchronous) mode. One instance is the principal SQL Server that contains the principal database, and the other instance is the mirror SQL Server that contains the mirror database.
  • A long delay occurs between the principal and mirror databases. For example, one of the following delays occurs:
    • The redo process of the mirror database is slower than the transfer of the transaction from the principal database to the mirror database.
    • A transaction that requires a long time runs on the principal database. For example, an ALTER INDEX REBUILD query runs on the principal database.

In this scenario, the session is suspended, and a mini-dump file is generated in the log folder on the mirror server.

As you can see that KB2403218 is what is applicable in this scenario where as the symptoms described in the first KB Article doesn’t match the issue! Next check the release vehicle for the fix for the SQL Server release that you are using and the releases affected by this issue. You will find that there are two builds for SQL Server 2005 which contain the fix for this issue:

2438344 Cumulative update package 13 for SQL Server 2005 Service Pack 3 (Build: 4315)
2489409 Cumulative update package 2 for SQL Server 2005 Service Pack 4 (Build: 5254)

The interesting point to note here is that the SQL Server version for the instance in question is above the build that had the fix for Cumulative Update #13 for SQL Server 2005 Service Pack 3 which had the issue. However, this fix was present for Service Pack 4 (Build: 5000) but was included for the Cumulative Update #2 for Service Pack 4. So, now you would need to either apply CU2 or the latest Cumulative Update package for Service Pack 4 to resolve this issue.

As always, if in doubt contact Microsoft CSS with assertion mini-dump, SQL Errorlogs and the Windows Event logs. CSS Engineers might request for additional data like profiler traces, filtered dump from the next occurrence or maybe even a repro if the available data is not sufficient to perform a complete root cause analysis and identify the problem.

To summarize:
1. Just because an assert condition matches a KB Article, it doesn’t mean that you have identified your issue.
2. Verify that the symptoms match the condition that you are facing as assert conditions might be common for two completely different issues.
3. Even though you might be on a higher build than the build version for a particular release vehicle, you still might not have the fix on the Service Pack level that you are on as the fix might have been included in a later cumulative update.
4. If you cannot find a KB Article match, don’t’ have any inconsistencies in the database and are able to reproduce the scenario consistently, then check if you can reproduce this on the latest available build for the SQL Server release that you are using or contact CSS.

Debugging is not always a necessity to investigate an issue!! Smile