Debugging that latch timeout

My last post on debugging an assertion didn't have any cool debugging tips, since there is not much you can do with an assertion dump unless you have access to private symbols and sometimes even to the source code. In this post, I am not going to disappoint, and will show you some more cool things that the Windows debugger can do for you with public symbols for a latch timeout issue.

When you encounter a latch timeout (buffer or non-buffer latch), the first occurrence of its type generates a mini-dump. Further occurrences of the same latch timeout are reported only as error messages in the SQL Errorlog.

Buffer latch timeouts are typically reported using errors 844 and 845. The common reasons for such errors are documented in a KB Article. For a non-buffer latch timeout, you will get an 847 error.

Error number – Error message template (from sys.messages):

844 – Time out occurred while waiting for buffer latch — type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p.  Continuing to wait.
845 – Time-out occurred while waiting for buffer latch type %d for page %S_PGID, database ID %d.
846 – A time-out occurred while waiting for buffer latch — type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit Id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Not continuing to wait.
847 – Timeout occurred while waiting for latch: class ‘%ls’, id %p, type %d, Task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Continuing to wait.
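If you want to look these templates up on your own instance, a quick query against sys.messages is enough (a minimal sketch; 1033 is the language_id for English):

-- List the latch timeout related message templates (errors 844 through 847).
SELECT message_id, severity, text
FROM sys.messages
WHERE message_id IN (844, 845, 846, 847)
  AND language_id = 1033;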

This is what you will see in the SQL Errorlog when a latch timeout occurs.

spid148     Time out occurred while waiting for buffer latch — type 4, bp 0000000832FE1200, page 3:11234374, stat 0x7c20009, database id: 120, allocation unit id: 72057599731367936, task 0x0000000003C4F2E8 : 0, waittime 300, flags 0x1a, owning task 0x0000000003C129B8.  Continuing to wait.
spid148     **Dump thread – spid = 148, PSS = 0x000000044DC17BD0, EC = 0x000000044DC17BE0
spid148     ***Stack Dump being sent to D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0001.txt

spid148     * Latch timeout
spid148     * Input Buffer 84 bytes –
spid148     *             DBCC CHECKDB WITH ALL_ERRORMSGS
External dump process returned no errors.

I have only pasted the relevant portion from the Errorlog for brevity. As I have outlined in my previous blog posts on similar topics, there is a large amount of due diligence that can be done with the help of the Windows Event Logs and the SQL Server Errorlogs before you start spawning off the Windows debugger to analyze the memory dump on your system. The first few obvious things that you will notice are that SPID 148 encountered the issue while performing a CHECKDB on database ID 120, and that the timeout occurred while waiting for a buffer latch on a page (the Page ID is available in the message above). I don’t see a “Timeout waiting for external dump process” message in the SQL Errorlog, which means that I have a good chance of extracting useful information from the mini-dump that was generated by SQLDumper.

Latch timeouts are typically victims of either a system-related issue (hardware, drivers, or the operating system) or a previous error encountered by SQL Server. So the next obvious action item is to look into the SQL Errorlogs and find out whether there were any additional errors prior to the latch timeout. I see a number of OS error 1450 messages reported by the same SPID 148 for the same file handle but different offsets.

spid148     The operating system returned error 1450(Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x0000156bf36000 in file with handle 0x0000000000001358. This is usually a temporary condition and the SQL Server will keep retrying the operation. If the condition persists then immediate action must be taken to correct it.

Additionally, both before and after the latch timeout (within 5-15 minutes of it), I see multiple other SPIDs reporting the same 1450 error message for different offsets, but again on the same file.

spid137     The operating system returned error 1450(Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x000007461f8000 in file with handle 0x0000000000001358. This is usually a temporary condition and the SQL Server will keep retrying the operation. If the condition persists then immediate action must be taken to correct it.

I also see the latch timeout message being reported repeatedly for the same page and the same database, with an ever-increasing wait time.

spid148     Time out occurred while waiting for buffer latch — type 4, bp 0000000832FE1200, page 3:11234374, stat 0x7c20009, database id: 120, allocation unit id: 72057599731367936, task 0x0000000003C4F2E8 : 0, waittime 82800, flags 0x1a, owning task 0x0000000003C129B8.  Continuing to wait.

Notice the waittime above: it has increased from 300 to 82800! So the next thing I do is look up issues related to CHECKDB and 1450 error messages on the web using Bing (yes, I do use Bing!). These are the relevant posts related to the above issue.

http://blogs.msdn.com/b/psssql/archive/2008/07/10/sql-server-reports-operating-system-error-1450-or-1452-or-665-retries.aspx
http://blogs.msdn.com/b/psssql/archive/2009/03/04/sparse-file-errors-1450-or-665-due-to-file-fragmentation-fixes-and-workarounds.aspx

At this point, it is quite clear that the issue is likely a sparse file limitation brought on by file fragmentation. Now it is time for me to check whether there are other threads in the dump waiting on SyncWritePreemptive calls.

Use the location provided in the Errorlog snippet reporting the Latch Timeout message to locate the mini-dump for the issue (in this case SQLDump0001.mdmp).

Now when you load the dump using WinDBG, you will see the following information:

Loading Dump File [D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0001.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are available

Comment: ‘Stack Trace’
Comment: ‘Latch timeout’

This dump file has an exception of interest stored in it.

The above tells you that this is a mini-dump for a Latch Timeout condition and the location from where you loaded the dump. Then I use the following command to set my symbol path and direct the symbols downloaded from the Microsoft symbol server to a local symbol file cache on my machine.

.sympath srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols

Then I issue a reload command to load the symbols for sqlservr.exe. This can also be done using CTRL+L and providing the complete string above (without .sympath), checking the Reload checkbox and clicking on OK. The only difference is that public symbols for all the loaded modules in the dump, where available, will then be downloaded from the Microsoft Symbol Server.

.reload /f sqlservr.exe

Next, verify that the symbols were correctly loaded using the lmvm sqlservr command. If the symbols were loaded correctly, you should see output like the following. Note the (pdb symbols) annotation and the path to the downloaded sqlservr.pdb file.

0:005> lmvm sqlservr

start             end                 module name
00000000`01000000 00000000`03668000   sqlservr T (pdb symbols)          d:\publicsymbols\sqlservr.pdb\2A3969D78EE24FD494837AF090F5EDBC2\sqlservr.pdb

 

If symbols were not loaded, then you will see an output as shown below.

0:005> lmvm sqlservr
start end module name
00000000`01000000 00000000`03668000 sqlservr (export symbols) sqlservr.exe

I will use the !findstack command to locate all threads which have the function call SyncWritePreemptive on their callstack.

0:070> !findstack sqlservr!FCB::SyncWritePreemptive 0

Thread 069, 1 frame(s) match
Thread 074, 1 frame(s) match
Thread 076, 1 frame(s) match
Thread 079, 1 frame(s) match
Thread 081, 1 frame(s) match
Thread 082, 1 frame(s) match
Thread 086, 1 frame(s) match
Thread 089, 1 frame(s) match
Thread 091, 1 frame(s) match
Thread 095, 1 frame(s) match
Thread 098, 1 frame(s) match
Thread 099, 1 frame(s) match
Thread 104, 1 frame(s) match
Thread 106, 1 frame(s) match
Thread 107, 1 frame(s) match
Thread 131, 1 frame(s) match
Thread 136, 1 frame(s) match

0:070> ~81s
ntdll!ZwWaitForSingleObject+0xa:
00000000`77ef0a2a c3              ret
0:081> kL100

ntdll!ZwDelayExecution
kernel32!SleepEx
sqlservr!FCB::SyncWritePreemptive
sqlservr!FCB::PullPageToReplica
sqlservr!alloca_probe
sqlservr!BUF::CopyOnWrite
sqlservr!PageRef::PrepareToDirty
sqlservr!RecoveryMgr::DoCOWPreWrites

You could get the full callstacks of all threads containing the function that you are searching for by using verbosity level 3 with the same command: !findstack sqlservr!FCB::SyncWritePreemptive 3

If you look at the thread that ended up raising the Latch Timeout warning, you will find that it was also performing a CHECKDB.

0:074> .ecxr

0:074> kL100

kernel32!RaiseException
sqlservr!CDmpDump::Dump
sqlservr!CImageHelper::DoMiniDump
sqlservr!stackTrace
sqlservr!LatchBase::DumpOnTimeoutIfNeeded
sqlservr!LatchBase::PrintWarning
sqlservr!alloca_probe
sqlservr!BUF::AcquireLatch


sqlservr!UtilDbccCreateReplica
sqlservr!UtilDbccCheckDatabase
sqlservr!DbccCheckDB
sqlservr!DbccCommand::Execute

If you cannot find the thread which raised the Latch Timeout warning, you can print out all the callstacks in the dump using ~*kL100 and then search the output for the LatchBase::DumpOnTimeoutIfNeeded and LatchBase::PrintWarning frames shown above. It is quite clear from the callstack above that the thread was also involved in performing a CHECKDB operation, as reported in the SQL Errorlog in the input buffer for the Latch Timeout dump.

In case you were not able to identify the issue right off the bat, you need to check the build that you are on and look for issues related to Latch Timeouts that were addressed for the SQL Server release that you are using. The symptoms section of such a KB Article should have enough information for you to compare against your current symptoms and scenario and determine whether the article is applicable in your case.

Now is the time when you need some context about the operations that were occurring on the server to determine what the actual issue is. Based on what I heard from the system administrators, a CHECKDB was being executed on the database while the application was executing DML operations against it. Additionally, the volume on which the database files reside is fragmented, and the database in question is large (>750 GB).

Based on the two MSDN blog posts that I mentioned above, it is quite clear that it is possible to run into sparse file limitations when there is a high amount of fragmentation on the disks, or when there are a large number of concurrent changes occurring on the database while CHECKDB is executing against it. A number of Windows and SQL Server updates, along with workarounds for running CHECKDB on such databases, are listed in the second blog post above. On a separate note, this is not an issue with CHECKDB! It is a limitation of sparse files that you are hitting at the OS layer. Remember that CHECKDB, starting from SQL Server 2005, creates an internal snapshot (which uses a sparse file) to execute the consistency check. Paul Randal’s tweet made me add this line to call this out explicitly!
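If you cannot defragment the volume or reduce the concurrent activity, a couple of workarounds along the lines of what those posts discuss are sketched below. The database, logical file and snapshot names are placeholders, so treat this purely as an illustration:

-- Workaround 1: let CHECKDB take table locks instead of creating the internal
-- snapshot (this reduces concurrency on the database while the check runs).
DBCC CHECKDB (N'YourDatabase') WITH ALL_ERRORMSGS, TABLOCK;

-- Workaround 2: create your own database snapshot on a volume with little
-- fragmentation and run the consistency check against that snapshot instead.
CREATE DATABASE YourDatabase_Check ON
    (NAME = N'YourDatabase_Data', FILENAME = N'E:\CheckSnapshots\YourDatabase_Check.ss')
AS SNAPSHOT OF YourDatabase;
DBCC CHECKDB (N'YourDatabase_Check') WITH ALL_ERRORMSGS;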

As always… if you are still stuck, contact Microsoft CSS with the mini-dump file, SQL Errorlog and the Windows Event Logs. It is quite possible that CSS will ask you to collect additional data, as most Latch Timeout issues are generally an after-effect of a previous issue. In this case, it was the OS error 1450.

Well… That’s it for today! Hope this is helpful for the next time you encounter a Latch Timeout issue.

Additional References

Four stages of NTFS File Growth
KB 315263 – How to read the small memory dump files that Windows creates for debugging

 


Debugging that Assert condition: Maybe Not

Last week I had shown how to debug non-yielding scheduler and deadlocked schedulers memory dumps. In this post, I shall talk about Assertions. When an assert condition check fails in the SQL Server code base, a mini-dump of the SQL Server process is created which is found by default in the SQL Server LOG folder.

An assert is basically a predicate (a true-false condition) put in a program’s code by the developer, which he/she thinks should always evaluate to TRUE. If it fails, then the assert failure code written by the developer will be executed.

This is one of those scenarios where you will NOT actually need to debug the assert dump and still be able to achieve a lot without opening a debugger. So if you were hoping for some more cool debugging steps, I will have to disappoint you till my next post! Whenever an assert condition fails, the message is logged in the SQL Server Errorlog along with the failing assert condition.

Let me walk you through an example. You would see messages similar to the one below in the Windows Application Event log when an Assertion check fails:

MSSQLSERVER    Error    (2)    3624    N/A    <server name> A system assertion check has failed. Check the SQL Server error log for details. Typically, an assertion failure is caused by a software bug or data corruption. To check for database corruption, consider running DBCC CHECKDB. If you agreed to send dumps to Microsoft during setup, a mini dump will be sent to Microsoft. An update might be available from Microsoft in the latest Service Pack or in a QFE from Technical Support.
MSSQLSERVER    Error    (2)    17066    N/A    <server name> "SQL Server Assertion: File: <""logmgr.cpp"">, line=<line number> Failed Assertion = ‘!(minLSN.m_fSeqNo < lfcb->lfcb_fSeqNo)‘. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted."

So if you are monitoring your application event logs, 17066 and 3624 are the error numbers that you need to monitor for. Once you are aware that an assertion failure occurred, you can check the SQL Server Errorlog from the time period when the issue was reported and see what errors were logged. You will find the same verbiage reported in the SQL Errorlog as in the application logs (see the errorlog excerpt further below).
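If you would rather not open the Errorlog in a text editor, you can also search it from T-SQL. The search-string parameters of sp_readerrorlog are undocumented, so treat this as a convenience sketch rather than a supported interface:

-- Search the current SQL Server Errorlog (log 0, type 1 = SQL Server) for assertion entries.
EXEC sp_readerrorlog 0, 1, N'Assertion';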

Location: "logmgr.cpp" <line number>
Expression: !(minLSN.m_fSeqNo < lfcb->lfcb_fSeqNo)


While acting as a mirroring partner for database ‘<Database name>’, server instance ‘<server instance name>’ encountered error 3624, status 1, severity 20. Database mirroring will be suspended.  Try to resolve the error and resume mirroring.

Notice the additional errorlog entry reported above pertaining to Database Mirroring. I find that the database mirroring setup for a user database encountered the assertion which was reported earlier in the SQL Errorlog. The build number for the SQL Server instance in question is 5000 (SQL Server 2005). The first thing you should do is search for the assert expression [!(minLSN.m_fSeqNo < lfcb->lfcb_fSeqNo)] on Bing (or any other search engine if you are really persistent on using a particular search engine 😉 ).

Additionally, as the error message points out, run a CHECKDB on the database in question and find out if there are any inconsistencies reported. If yes, then the first task is to restore the last known good backup or correct the inconsistencies before doing anything else.

Getting back to the search, you will find that KB981273 and KB2403218 show up in the results. Pay close attention to the symptoms section in each KB Article, as fixes for assertion failures are very specific. If the symptoms do not match, then the KB Article you found might not apply to your situation.

Symptoms from KB981273:

In Microsoft SQL Server 2005, when you restart the SQL Server service, error messages that resemble the following are logged in the SQL Server Errorlog file.

Symptoms from KB2403218:

Consider the following scenario:

  • You create a database mirroring session between two instances of SQL Server 2005, SQL Server 2008 or SQL Server 2008 R2 by using the High-Performance (asynchronous) mode. One instance is the principal SQL Server that contains the principal database, and the other instance is the mirror SQL Server that contains the mirror database.
  • A long delay occurs between the principal and mirror databases. For example, one of the following delays occurs:
    • The redo process of the mirror database is slower than the transfer of the transaction from the principal database to the mirror database.
    • A transaction that requires a long time runs on the principal database. For example, an ALTER INDEX REBUILD query runs on the principal database.

In this scenario, the session is suspended, and a mini-dump file is generated in the log folder on the mirror server.

As you can see, KB2403218 is applicable in this scenario, whereas the symptoms described in the first KB Article do not match the issue! Next, check the release vehicle containing the fix for the SQL Server release that you are using and the releases affected by this issue. You will find that there are two builds for SQL Server 2005 which contain the fix:

2438344 Cumulative update package 13 for SQL Server 2005 Service Pack 3 (Build: 4315)
2489409 Cumulative update package 2 for SQL Server 2005 Service Pack 4 (Build: 5254)

The interesting point to note here is that the instance in question (build 5000, i.e. Service Pack 4) is on a higher build than the Service Pack 3 Cumulative Update #13 build (4315) which contains the fix. However, the fix was not present in Service Pack 4 RTM (build 5000); for that branch it was only included in Cumulative Update #2 for Service Pack 4 (build 5254). So you would need to apply either CU2 or the latest Cumulative Update package for Service Pack 4 to resolve this issue.
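Checking the build you are actually on is a one-liner from the instance itself; something along these lines (in this example, 9.00.5000 would indicate SP4 RTM and 9.00.5254 would indicate SP4 CU2):

-- Current build and servicing level of the instance.
SELECT SERVERPROPERTY('ProductVersion') AS ProductVersion,
       SERVERPROPERTY('ProductLevel') AS ProductLevel;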

As always, if in doubt contact Microsoft CSS with the assertion mini-dump, SQL Errorlogs and the Windows Event logs. CSS Engineers might request additional data such as profiler traces, a filtered dump from the next occurrence, or maybe even a repro if the available data is not sufficient to perform a complete root cause analysis and identify the problem.

To summarize:
1. Just because an assert condition matches a KB Article, it doesn’t mean that you have identified your issue.
2. Verify that the symptoms match the condition that you are facing as assert conditions might be common for two completely different issues.
3. Even though you might be on a higher build than the one mentioned for a particular release vehicle, you still might not have the fix at the Service Pack level that you are on, as the fix might only have been included in a later cumulative update for that Service Pack.
4. If you cannot find a KB Article match, don’t have any inconsistencies in the database and are able to reproduce the scenario consistently, then check if you can reproduce this on the latest available build for the SQL Server release that you are using, or contact CSS.

Debugging is not always a necessity to investigate an issue!

Debugging the deadlock but for the scheduler: Part 2 of 2

This is the second part of my Deadlocked Schedulers debugging posts. In the first post, I discussed the difference between a 17888 and a 17884 dump and showed the amount of information that can be extracted from a mini-dump generated for the first occurrence of a Deadlocked Scheduler issue. In this post, I shall walk you through a 17884 condition.

The first difference that you will notice with a 17884 message (i.e. Deadlocked Schedulers) is that, unlike the 17888 message, no information about the blocking resource is mentioned in the SQL Errorlog. If you inspect the SQL Errorlog, you will see only an error message of the following type:

2010-09-20 14:04:07.78 Server      ***Stack Dump being sent to D:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLServer\MSSQL\LOG\SQLDump0001.txt
2010-09-20 14:04:07.78 Server      *   09/20/10 14:04:07 spid 27276
2010-09-20 14:04:07.78 Server      *
2010-09-20 14:04:07.78 Server      *
Deadlocked Schedulers
2010-09-20 14:04:07.78 Server      *
2010-09-20 14:04:07.78 Server      * *******************************************************************************
2010-09-20 14:04:07.78 Server      * ——————————————————————————-
2010-09-20 14:04:07.78 Server      * Short Stack Dump
2010-09-20 14:04:07.82 Server      Stack Signature for the dump is 0x0000000000000285
External dump process returned no errors.

As mentioned in my previous posts, the dump process returned no errors (note the “External dump process returned no errors.” message above). This means that I can extract some useful information regarding the issue from the dump. One of the important things about Deadlocked Schedulers (17884) is that you need to look through the Errorlogs for any Access Violation or Assertion that occurred prior to the Deadlocked Schedulers event. Always check for previous errors first. The 17884 error messages are often symptoms pointing to a problem. Before investigating the dump, always review overall system health information such as performance information, event logs, SQL Server error logs, and other application logging information. Understanding prior problems on the system may quickly clarify why SQL Server detected and reported a health error. Previous errors should always be addressed before the investigation continues.

The common reasons for a Deadlocked Schedulers are:

  1. All schedulers have encountered a 17883 condition.
  2. All workers are blocked on a critical resource.
  3. All workers are executing a long-running query.

I have very rarely seen #1 being the cause of Deadlocked Schedulers. More often than not it is #2 or #3.
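If you happen to catch the server while it is still in this state and can connect through the Dedicated Admin Connection, a quick look at sys.dm_os_schedulers usually tells you whether work is piling up because all the workers are tied up. A rough sketch of that kind of check:

-- Workers versus queued work per visible scheduler; a growing work_queue_count with
-- active_workers_count close to current_workers_count points to tied-up workers.
SELECT scheduler_id,
       current_tasks_count,
       runnable_tasks_count,
       current_workers_count,
       active_workers_count,
       work_queue_count,
       pending_disk_io_count
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE';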

Now when you load the dump using WinDBG, as noted in the debugging a non-yielding scheduler post, you will note similar things; you will see the following information:

Loading Dump File [D:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLServer\MSSQL\LOG\SQLDump0001.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are available

Comment: ‘Stack Trace’
Comment: ‘Deadlocked Schedulers’

The above tells you that this is a mini-dump for a deadlocked schedulers condition and the location from where you loaded the dump. Then I use the following command to set my symbol path and direct the symbols downloaded from the Microsoft symbol server to a local symbol file cache on my machine.

.sympath srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols

Then I issue a reload command to load the symbols for sqlservr.exe. This can also be done using CTRL+L and providing the complete string above (without .sympath), checking the Reload checkbox and clicking on OK. The only difference is that public symbols for all the loaded modules in the dump, where available, will then be downloaded from the Microsoft Symbol Server.

.reload /f sqlservr.exe

Next, verify that the symbols were correctly loaded using the lmvm sqlservr command. If the symbols were loaded correctly, you should see output like the following. Note the (pdb symbols) annotation and the path to the downloaded sqlservr.pdb file.

0:005> lmvm sqlservr
start end module name

00000000`00530000 00000000`040de000   sqlservr T (pdb symbols)          d:\publicsymbols\sqlservr.pdb\D90C4EC67D442D692A791A19FBA638F2\sqlservr.pdb

If symbols were not loaded, then you will see an output as shown below.

0:005> lmvm sqlservr
start end module name
00000000`00530000 00000000`040de000 sqlservr (export symbols) sqlservr.exe

Since I didn’t have any errors prior to the deadlocked schedulers issue in the Errorlog, I don’t have a search string to look for in the dump. This makes it tricky. So I will have to manually look through the callstacks of all the threads in the dump and establish a common pattern. You can use the !uniqstack command to dump out all the unique callstacks in the dump along with the count of the duplicate callstacks.

0:005> !uniqstack

Output excerpt from previous command:

Total threads: 868
Duplicate callstacks: 810 (windbg thread #s follow):
4, 8, 10, … … 865, 866, 867

As you can see above, there are 868 threads in the dump and there are 810 callstacks which are listed as duplicate.

On manual inspection of the threads using the thread ids mentioned in the Duplicate callstacks list (the windbg thread numbers shown above), I did find a common pattern. There are multiple threads waiting for locks, and there are quite a few threads which are executing in parallel.

Two common thread stacks that emerged were:

Waiting on lock:

0:804> kL100
ntdll!ZwSignalAndWaitForSingleObject
kernel32!SignalObjectAndWait
sqlservr!SOS_Scheduler::Switch
sqlservr!SOS_Scheduler::SuspendNonPreemptive
sqlservr!SOS_Scheduler::Suspend
sqlservr!EventInternal<Spinlock<153,1,0> >::Wait
sqlservr!LockOwner::Sleep
sqlservr!lck_lockInternal


sqlservr!process_request
sqlservr!process_commands
sqlservr!SOS_Task::Param::Execute
sqlservr!SOS_Scheduler::RunTask
sqlservr!SOS_Scheduler::ProcessTasks
sqlservr!SchedulerManager::WorkerEntryPoint
sqlservr!SystemThread::RunWorker
sqlservr!SystemThreadDispatcher::ProcessWorker
sqlservr!SchedulerManager::ThreadEntryPoint
msvcr80!endthreadex
msvcr80!endthreadex
kernel32!BaseThreadInitThunk
ntdll!RtlUserThreadStart

Parallel thread:

0:852> kL100

ntdll!ZwWaitForSingleObject
KERNELBASE!WaitForSingleObjectEx
sqlservr!SOS_Scheduler::SwitchContext
sqlservr!SOS_Scheduler::SuspendNonPreemptive
sqlservr!SOS_Scheduler::Suspend
sqlservr!EventInternal<Spinlock<153,1,0> >::Wait
sqlservr!EventInternal<Spinlock<153,1,0> >::WaitAllowPrematureWakeup
sqlservr!XchngWait<Spinlock<38,10,0> >
sqlservr!CXPort::Open
sqlservr!CXTransLocal::Synchronize
sqlservr!CQScanXProducerNew::Open
sqlservr!FnProducerOpen
sqlservr!FnProducerThread
sqlservr!SubprocEntrypoint
sqlservr!SOS_Task::Param::Execute
sqlservr!SOS_Scheduler::RunTask
sqlservr!SOS_Scheduler::ProcessTasks
sqlservr!SchedulerManager::WorkerEntryPoint
sqlservr!SystemThread::RunWorker
sqlservr!SystemThreadDispatcher::ProcessWorker
sqlservr!SchedulerManager::ThreadEntryPoint
msvcr80!endthreadex
msvcr80!endthreadex
kernel32!BaseThreadInitThunk
ntdll!RtlUserThreadStart

The next thing that I shall do is pick one of the common relevant functions from the above callstacks and search for the number of occurrences of that frame in all the threads present in the dump using the !findstack command.

!findstack sqlservr!LockOwner::Sleep 0

Excerpt from the output of the above command:

Thread 054, 1 frame(s) match
Thread 056, 1 frame(s) match
Thread 059, 1 frame(s) match
Thread 061, 1 frame(s) match
Thread 062, 1 frame(s) match


Thread 799, 1 frame(s) match
Thread 800, 1 frame(s) match
Thread 801, 1 frame(s) match
Thread 802, 1 frame(s) match
Thread 804, 1 frame(s) match
Thread 805, 1 frame(s) match

!findstack sqlservr!FnProducerThread 0

Excerpt from the output of the above command:

Thread 055, 1 frame(s) match
Thread 057, 1 frame(s) match
Thread 066, 1 frame(s) match
Thread 082, 1 frame(s) match
Thread 087, 1 frame(s) match
Thread 093, 1 frame(s) match


Thread 840, 1 frame(s) match
Thread 843, 1 frame(s) match
Thread 844, 1 frame(s) match
Thread 850, 1 frame(s) match
Thread 852, 1 frame(s) match

I found that there were 584 threads (67%) waiting for locks and 166 threads (19%) involved in parallelism. Note that LockOwner::Sleep and FnProducerThread are the frames from the two callstacks I showed above, which I used to determine that there were waits on locks and on parallelism. This confirms that the problem is due to concurrent requests waiting for lock(s) on object(s), with some of the queries executing in parallel. If a CSS engineer is examining this dump, then they could probably extract additional information, such as the queries being executed by the threads (if the information is available in the dump), using private symbols. If you were monitoring the server using a monitoring tool which captured the query requests, then you could go back to your monitoring repository and get the queries that the above threads were executing using the thread ids that you have identified.
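If the instance is still responsive when you get to it, the same story usually shows up in the wait statistics too. Keep in mind that sys.dm_os_wait_stats is cumulative since the last restart (or manual clear), so deltas taken around the time of the issue matter more than the raw totals; this is only a rough corroborating check:

-- Lock waits and parallelism (CXPACKET) waits accumulated on the instance.
SELECT wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE N'LCK[_]M[_]%' OR wait_type = N'CXPACKET'
ORDER BY wait_time_ms DESC;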

TIP: The interesting thing to note is that the thread IDs that you see above will directly match the OS Thread ID reported in the sys.dm_os_threads os_thread_id column value or the sys.sysprocesses kpid column value. So if you have a monitoring tool that is capturing the queries being executed by the active SQL Server sessions, then you can go back to your monitoring repository and dig up the queries which these threads were executing. If such data is not available from the time of the issue, more often than not you will end up having to collect additional data (profiler traces along with DMV outputs to track blocking chains) for the next problem occurrence!
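If you do not have such a monitoring tool, the query it would be capturing periodically looks roughly like the sketch below: it ties the OS thread id (the os_thread_id value mentioned in the tip) to the session and the text of the batch it is executing. Treat it as an illustration of the mapping rather than a polished monitoring script:

-- Map OS thread ids to sessions and the batch text they are executing.
SELECT t.os_thread_id,
       ta.session_id,
       r.wait_type,
       st.text AS batch_text
FROM sys.dm_os_threads AS t
JOIN sys.dm_os_workers AS w ON w.thread_address = t.thread_address
JOIN sys.dm_os_tasks AS ta ON ta.worker_address = w.worker_address
JOIN sys.dm_exec_requests AS r ON r.session_id = ta.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st;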

Also check if the Deadlocked Schedulers condition matches any KB Article (update) already released. Remember that deadlocked schedulers is a symptom and not the cause!! So if you find a KB Article which talks about a Deadlocked Schedulers issue being addressed in a later build, make sure the symptoms match the symptoms that you are experiencing on your server! If in doubt, ask CSS!

If you are stuck and have no clue why the issue happened, then this would be a good time to engage Microsoft CSS Support and provide them with the SQL Errorlogs, the dump file and the Windows event logs in .txt format for further analysis.

Reference:
KB 315263 – How to read the small memory dump files that Windows creates for debugging
How To Diagnose and Correct Errors 17883, 17884, 17887, and 17888 whitepaper

Debugging the deadlock but for the scheduler: Part 1 of 2

I had very recently blogged about debugging a non-yielding scheduler condition for a SQL Server instance. In today’s post, I shall talk about debugging a “deadlocked schedulers” issue, also known as the infamous 17884 to you old-school folks. If ALL schedulers are considered stuck and SQL Server is not under active memory pressure, then the 17884 or 17888 condition is reported, starting from SQL Server 2005. In this post, I shall show you how to dig into the root cause of a 17888 message reported in the SQL Errorlog. There are subtle differences between a 17888 and a 17884 condition detected and reported by the SchedulerMonitor. A 17884 generally tells you that no work is being processed by a particular NUMA node (for non-NUMA machines, it denotes the entire instance). On the other hand, a 17888 tells you that no new work request is being processed on a particular NUMA node by the SQL Server instance due to worker threads waiting on particular resource(s). The whitepaper mentioned at the end of this post covers the nitty-gritty details of the detection algorithm.

Description of the error messages from sys.messages catalog

17884 – New queries assigned to process on Node %d have not been picked up by a worker thread in the last %d seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: %d%%. System Idle: %d%%.

17888 – All schedulers on Node %d appear deadlocked due to a large number of worker threads waiting on %ls. Process Utilization %d%%.

When a 17888 is reported, you will find messages of the following nature in the SQL Errorlog:

2008-03-25 14:14:53.37 Server      ***Stack Dump being sent to g:\mssql\data\MSSQL\LOG\SQLDump0006.txt
2008-03-25 14:14:53.38 Server      * BEGIN STACK DUMP:
2008-03-25 14:14:53.38 Server      *   03/25/08 14:14:53 spid 0
2008-03-25 14:14:53.38 Server      *
2008-03-25 14:14:53.38 Server      * Deadlocked Schedulers
2008-03-25 14:14:53.38 Server      * Short Stack Dump
2008-03-25 14:14:53.41 Server      Stack Signature for the dump is 0x00000260
2008-03-25 14:14:56.44 Server      External dump process return code 0x20000001.
External dump process returned no errors.

2008-03-25 14:14:56.44 Server      All schedulers on Node 0 appear deadlocked due to a large number of worker threads waiting on LCK_M_S. Process Utilization 1%.

Again, note the wait type (LCK_M_S) and the dump file name called out in the excerpt above; I shall talk about their significance later in the post if it is not self-explanatory at this time. The last message in the above excerpt very clearly tells you that you have a blocking issue on the SQL instance and that most of the worker threads on Node 0 are waiting for a LCK_M_S lock which they cannot acquire due to conflicting lock(s) held on the object(s) by another session. This time around the SQLDumpXXXX.txt filename is mentioned in the Errorlog, unlike the non-yielding scheduler message. So what needs to be done is to locate the .mdmp file (in this case SQLDump0006.mdmp) in the same location and use WinDBG to load the dump for analysis.

If you encounter this condition on a SQL Server Failover Cluster instance, it will result in IsAlive check failures and the SQL service being re-started on the same node, or could even result in a failover if the SQL instance doesn’t respond to the stop request in time. Note the “External dump process returned no errors.” line in the above output. This means that the dump generation routine didn’t encounter any errors. Before I move onto the interesting part of looking through the data, let me set the expectation here that there is very little information that you can coax out of a mini-dump for a deadlocked schedulers issue while using public symbols. Even with the help of private symbols, which CSS engineers have access to, it might not be possible to perform a complete root cause analysis of the issue. The main reason is that even if CSS engineers are able to find out the queries being executed on the SQL instance from the mini-dump, the origin of the blocking chain (which, in this case, we already know is lock blocking from the error message excerpt in the Errorlog) cannot be determined from the mini-dump. This invariably ends up requiring collection of the SQL Server Perf Stats script output (for capturing blocking information) and profiler traces to get the queries being executed by the head blocker which caused the blocking condition.
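The Perf Stats script essentially samples the blocking-related DMVs on a schedule. To give you an idea of the kind of data it captures, a bare-bones sample of the blocking picture (run repeatedly while the blocking is in progress) would look something like this:

-- Who is blocked, what they are waiting on, and which session is blocking them.
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       r.wait_resource,
       st.text AS batch_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
WHERE ISNULL(r.blocking_session_id, 0) <> 0;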

Now when you load the dump using WinDBG, as noted in the debugging a non-yielding scheduler post, you will see the following information:

Loading Dump File [g:\mssql\data\MSSQL\LOG\SQLDump0006.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are available

Comment: ‘Stack Trace’
Comment: ‘Deadlocked Schedulers’

The above tells you that this is a mini-dump for a deadlocked schedulers condition and the location from where you loaded the dump. Then I use the following command to set my symbol path and direct the symbols downloaded from the Microsoft symbol server to a local symbol file cache on my machine.

.sympath srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols

Then I issue a reload command to load the symbols for sqlservr.exe. This can also be done using CTRL+L and providing the complete string above (without .sympath), checking the Reload checkbox and clicking on OK. The only difference is that public symbols for all the loaded modules in the dump, where available, will then be downloaded from the Microsoft Symbol Server.

.reload /f sqlservr.exe

Next, verify that the symbols were correctly loaded using the lmvm sqlservr command. If the symbols were loaded correctly, you should see output like the following. Note the (pdb symbols) annotation and the path to the downloaded sqlservr.pdb file.

0:019> lmvm sqlservr
start end module name

01000000 02c0a000   sqlservr T (pdb symbols)          d:\publicsymbols\sqlservr.pdb\389EF554D94A4947846D85FCDC4233382\sqlservr.pdb

If symbols were not loaded, then you will see an output as shown below.

0:005> lmvm sqlservr
start    end        module name
01000000 02c0a000   sqlservr (export symbols) sqlservr.exe

Since I know that there is a blocking issue due to which a deadlocked schedulers condition was reported, I shall use the !findstack debugger command to find out all the callstacks which are potentially waiting for a lock:

!findstack sqlservr!LockOwner::Sleep 2

Output for one of the threads from the above command:

Thread 271, 1 frame(s) match
          00 6db5eef8 7c821fc4 ntdll!KiFastSystemCallRet
          01 6db5eefc 77e43c1d ntdll!NtSignalAndWaitForSingleObject
          02 6db5ef70 010034f2 kernel32!SignalObjectAndWait
          03 6db5ef94 01002172 sqlservr!SOS_Scheduler::Switch
          04 6db5f108 01002cb9 sqlservr!SOS_Scheduler::SwitchContext
          05 6db5f120 01001c75 sqlservr!SOS_Scheduler::SuspendNonPreemptive
          06 6db5f138 01002f25 sqlservr!SOS_Scheduler::Suspend
          07 6db5f15c 01056f6a sqlservr!SOS_Event::Wait
       * 08 6db5f198 01057081 sqlservr!LockOwner::Sleep
          09 6db5f238 0100a75d sqlservr!lck_lockInternal
          10 6db5f2b4 0100d506 sqlservr!GetLock
          11 6db5f324 010432c7 sqlservr!BTreeRow::AcquireLock
          12 6db5f3a4 0100dc00 sqlservr!IndexDataSetSession::GetRowByKeyValue
      ….
      ….
          33 6db5feb0 010b9064 sqlservr!SystemThread::RunWorker
          34 6db5ff10 010b9201 sqlservr!SystemThreadDispatcher::ProcessWorker
          35 6db5ff78 781329bb sqlservr!SchedulerManager::ThreadEntryPoint
          36 6db5ffb0 78132a47 msvcr80!_callthreadstartex
          37 6db5ffb8 77e6608b msvcr80!_threadstartex
          38 6db5ffec 00000000 kernel32!BaseThreadStart

As you can see from the above call-stack, the worker thread is waiting for the lock that it requested to become available. I found that most of the call-stacks reported in the !findstack output were similar. I found that 183 threads (counted as the number of “match” keyword occurrences in the !findstack output) reported LockOwner::Sleep in their call-stack. This confirms that I have a serious blocking issue: either a large number of concurrent requests are asking for the same conflicting lock(s) held by other thread(s), or a single long-running or uncommitted transaction is causing all the threads to wait for the lock to be released once the query commits or completes. In my experience of handling such issues, the latter cause is the most prevalent.

If you need to find out the total number of threads in the dump, then use the ~* command to list them all.

0:278> ~*
   0  Id: 45c.43c Suspend: 1 Teb: 7ffdd000 Unfrozen
      Start: sqlservr!wmainCRTStartup (02432df8)
      Priority: 0  Priority class: 32  Affinity: ff
   1 Id: 45c.51c Suspend: 1 Teb: 7ffda000 Unfrozen
      Start: msvcr80!_threadstartex (781329e1)
      Priority: 0  Priority class: 32  Affinity: 22



.278  Id: 45c.2b4c Suspend: 1 Teb: 7fea3000 Unfrozen
      Start: msvcr80!_threadstartex (781329e1)
      Priority: 0  Priority class: 32  Affinity: ff

This means that 183 of the 279 threads in the dump (roughly 65%) are waiting for locks!!! In case the wait type reported in the 17888 message is not a LOCK wait type, what keyword do you search for using the !findstack command? That leads to a more interesting scenario. You can then use another debugger command, !uniqstack, to list out the unique call-stacks in the dump.

0:278> !uniqstack

Partial output of the above command:
Processing 279 threads, please wait

.134  Id: 45c.30a8 Suspend: 1 Teb: 7ff36000 Unfrozen
      Start: msvcr80!_threadstartex (781329e1)
      Priority: 0  Priority class: 32  Affinity: 22
ChildEBP RetAddr
6db5eef8 7c821fc4 ntdll!KiFastSystemCallRet
6db5eefc 77e43c1d ntdll!NtSignalAndWaitForSingleObject
6db5ef70 010034f2 kernel32!SignalObjectAndWait
6db5ef94 01002172 sqlservr!SOS_Scheduler::Switch
6db5f108 01002cb9 sqlservr!SOS_Scheduler::SwitchContext
6db5f120 01001c75 sqlservr!SOS_Scheduler::SuspendNonPreemptive
6db5f138 01002f25 sqlservr!SOS_Scheduler::Suspend
6db5f15c 01056f6a sqlservr!SOS_Event::Wait
6db5f198 01057081 sqlservr!LockOwner::Sleep
6db5f238 0100a75d sqlservr!lck_lockInternal
6db5f2b4 0100d506 sqlservr!GetLock
6db5f324 010432c7 sqlservr!BTreeRow::AcquireLock
6db5f3a4 0100dc00 sqlservr!IndexDataSetSession::GetRowByKeyValue
….
….
6db5feb0 010b9064 sqlservr!SystemThread::RunWorker
6db5ff10 010b9201 sqlservr!SystemThreadDispatcher::ProcessWorker
6db5ff78 781329bb sqlservr!SchedulerManager::ThreadEntryPoint
6db5ffb0 78132a47 msvcr80!_callthreadstartex
6db5ffb8 77e6608b msvcr80!_threadstartex
6db5ffec 00000000 kernel32!BaseThreadStart

Total threads: 279
Duplicate callstacks: 219 (windbg thread #s follow):
3, 12, 13, 15, 16, 17, 19, 21, 22, … …. 276, 277, 278

 

With the thread IDs reported in the duplicate call-stacks output (the numbers listed above), you will manually (with some soul searching) have to look through a few thread call stacks using the commands below and check what the most common callstack pattern is among the duplicate threads. Disclaimer: This is not an easy task unless you engage in this kind of debugging on a regular basis! And it is definitely not for the faint-hearted or easily irritated!

~[Thread ID]s
kL100

Example:

0:052> ~53s
eax=553985c0 ebx=7ff2ce38 ecx=00000000 edx=00000000 esi=00001bac edi=00000000
eip=7c82ed54 esp=693aeefc ebp=693aef70 iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246
ntdll!KiFastSystemCallRet:
7c82ed54 c3              ret

0:053> kL100

ChildEBP RetAddr
6db5eef8 7c821fc4 ntdll!KiFastSystemCallRet
6db5eefc 77e43c1d ntdll!NtSignalAndWaitForSingleObject
6db5ef70 010034f2 kernel32!SignalObjectAndWait
6db5ef94 01002172 sqlservr!SOS_Scheduler::Switch
6db5f108 01002cb9 sqlservr!SOS_Scheduler::SwitchContext
6db5f120 01001c75 sqlservr!SOS_Scheduler::SuspendNonPreemptive
6db5f138 01002f25 sqlservr!SOS_Scheduler::Suspend
6db5f15c 01056f6a sqlservr!SOS_Event::Wait
6db5f198 01057081 sqlservr!LockOwner::Sleep
6db5f238 0100a75d sqlservr!lck_lockInternal
6db5f2b4 0100d506 sqlservr!GetLock
6db5f324 010432c7 sqlservr!BTreeRow::AcquireLock
6db5f3a4 0100dc00 sqlservr!IndexDataSetSession::GetRowByKeyValue
….
….
6db5feb0 010b9064 sqlservr!SystemThread::RunWorker
6db5ff10 010b9201 sqlservr!SystemThreadDispatcher::ProcessWorker
6db5ff78 781329bb sqlservr!SchedulerManager::ThreadEntryPoint
6db5ffb0 78132a47 msvcr80!_callthreadstartex
6db5ffb8 77e6608b msvcr80!_threadstartex
6db5ffec 00000000 kernel32!BaseThreadStart

TIP: The interesting thing to note is that the thread IDs that you see above will directly match the OS Thread ID reported in the sys.dm_os_threads os_thread_id column value or the sys.sysprocesses kpid column value. So if you have a monitoring tool that is capturing the queries being executed by the active SQL Server sessions, then you can go back to your monitoring repository and dig up the queries which these threads were executing. If such data is not available from the time of the issue, more often than not you will end up having to collect additional data for the next problem occurrence!

Also check if the Deadlocked Schedulers condition matches any KB Article (update) already released. Remember that deadlocked schedulers is a symptom and not the cause!! So if you find a KB Article which talks about a Deadlocked Schedulers issue being addressed in a later build, make sure the symptoms match the symptoms that you are experiencing on your server! If in doubt, ask CSS!

If you are stuck and have no clue why the issue happened, then this would be a good time to engage Microsoft CSS Support and provide them with the SQL Errorlogs, the dump file and the Windows event logs in .txt format for further analysis.

Reference:
KB 315263 – How to read the small memory dump files that Windows creates for debugging
How To Diagnose and Correct Errors 17883, 17884, 17887, and 17888 whitepaper

Debugging a non-yielding scheduler issue

If you have read the How To Diagnose and Correct Errors 17883, 17884, 17887, and 17888 whitepaper, you would know what a 17883 error reported in SQL Server 2000 means. If you read the “SQL Server 2005 – SchedulerMonitor” section of the whitepaper, you will understand that the way non-yielding messages are reported for a SQL Server worker thread is much more refined than in its predecessor. We report a non-yielding worker thread as a Non-yielding Scheduler message in the SQL Server Errorlog. Since the whitepaper is very comprehensive in explaining the SQL Server non-yielding detection algorithms, I shall not reiterate them here. In this post, I shall show you how to track down the cause of a non-yielding scheduler using public symbols in conjunction with the SQL Server Errorlog and Windows Event Logs.

Below is an excerpt from an Errorlog which is reporting a non-yielding scheduler message.

2011-07-24 22:17:45.850 Server       * BEGIN STACK DUMP:
2011-07-24 22:17:45.850 Server       *   07/24/11 22:17:45 spid 2388
2011-07-24 22:17:45.850 Server       *
2011-07-24 22:17:45.850 Server       * Non-yielding Scheduler
2011-07-24 22:17:45.850 Server       Stack Signature for the dump is 0x00000000000003B6
2011-07-24 22:17:53.380 Server       External dump process return code 0x20000001.
External dump process returned no errors.

2011-07-24 22:17:53.400 Server Process 0:0:0 (0x990) Worker 0x0000000005A6C1A0 appears to be non-yielding on Scheduler 1. Thread creation time: 12955600342903. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 0%. System Idle 99%. Interval: 70119 ms.

There are a few items in the above excerpt (the thread id 0x990, the dump return code, and the “External dump process returned no errors.” line) whose significance I shall detail later in this post. The Windows Application Event Log will report this as an Informational message with EventID = 17883. This is what I have in my application event log for the above reported non-yielding condition:

Process 0:0:0 (0x990) Worker 0x0000000005A6C1A0 appears to be non-yielding on Scheduler 1. Thread creation time: 12955600342903. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 0%%. System Idle 99%%. Interval: 70119 ms.

The above message is the same message reported in the Errorlog excerpt above. Since the Errorlog doesn’t report the MDMP (dump file) number, you will have to track it down using the timestamp when the issue was reported or using the SQL Dumper log in case you are performing post-mortem analysis. The SQL Dumper log name is “SQLDUMPER_ERRORLOG.log” and it is located by default in the SQL Server LOG folder. You will find a message similar to the one shown below which will point you to the correct .mdmp file.

(2A30:2760) 07/24/11 22:17:52, ACTION, SQLDUMPER_UNKNOWN_APP.EXE, MiniDump completed: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\LOG\SQLDump0025.mdmp

Now I need to use the Windows Debugging Tools to load the .mdmp file. Note that the non-yielding message in the SQL Server Errorlog reported “External dump process returned no errors.” This means that the dump generation routine didn’t encounter any errors. The most common error that occurs while generating dumps (due to a resource crunch on memory or CPU) is a timeout, which means that the dump generation timed out and it is a delayed dump. In that case the information contained in the dump file may or may not be useful to perform a complete post-mortem analysis. This is when Microsoft CSS would ask you to collect additional data for the next occurrence if the current data available is not sufficient for a complete post-mortem analysis.

So let’s move on to the good part now, i.e. analysis of the dump. Keep in mind that this is a mini-dump, which means that there is a limited amount of information stored in the dump, and we are using public symbols to analyze it. Even with the private symbols that Microsoft CSS Engineers have access to, there might not be a whole lot they can decipher from a mini-dump, and they might request additional data!

When you load the dump using WinDBG, you will see the following information:

Loading Dump File [C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Log\SQLDump0025.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are available

Comment: ‘Stack Trace’
Comment: ‘Non-yielding Scheduler’


Debug session time: Sun Jul 24 22:17:52.000 2011 (UTC + 5:30)

The above tells you that this is a mini-dump for a non-yielding scheduler condition, the location from where you loaded the dump, and the time when the dump was generated. Then I use the following command to set my symbol path and direct the symbols downloaded from the Microsoft symbol server to a local symbol file cache on my machine.

What is a minidump?
A memory dump which contains the memory for the call stack of all threads, the CPU registers and information about which modules are loaded.

.sympath srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols

Then I issue a reload command to load the symbols for sqlservr.exe. This can also be done using CTRL+L and providing the complete string above (without .sympath), checking the Reload checkbox and clicking on OK. The only difference is that public symbols for all the loaded modules in the dump, where available, will then be downloaded from the Microsoft Symbol Server.

.reload /f sqlservr.exe

Next, verify that the symbols were correctly loaded using the lmvm sqlservr command. If the symbols were loaded correctly, you should see output like the following. Note the (pdb symbols) annotation and the path to the downloaded sqlservr.pdb file.

0:019> lmvm sqlservr
start             end                 module name
00000000`00c60000 00000000`04820000   sqlservr   (pdb symbols)          D:\publicsymbols\sqlservr.pdb\6A3527657A3F4090AB2C96F7008A55C52\sqlservr.pdb
    Loaded symbol image file: sqlservr.exe
    Mapped memory image file: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe
    Image path: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe
    Product version:  10.50.1790.0

If symbols were not loaded, then you will see an output as shown below.

0:019> lmvm sqlservr
start             end                 module name
00000000`00c60000 00000000`04820000   sqlservr   (export symbols)       sqlservr.exe
    Loaded symbol image file: sqlservr.exe
    Mapped memory image file: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe
    Image path: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe

Next I shall use the Thread ID noted above (0x990) in the Errorlog excerpt to switch to the thread and get the callstack for the thread which was reported as non-yielding.

0:019> ~~[0x990]s
ntdll!NtWaitForSingleObject+0xa:
00000000`76e06bda c3              ret
0:019> kL100

ntdll!NtWaitForSingleObject
kernel32!WaitForSingleObjectEx
sqlservr!SOS_Scheduler::SwitchContext
sqlservr!SOS_Scheduler::SuspendNonPreemptive
sqlservr!SOS_Scheduler::Suspend


msvcr80!endthreadex
msvcr80!endthreadex
kernel32!BaseThreadInitThunk
ntdll!RtlUserThreadStart

Looking at the callstack above, I can say that this is not the cause of the non-yielding condition (99.99999% true in most cases)!! The reason I can say this is that I find SwitchContext on the thread stack, which means that the thread has yielded. The sequence of events is as follows: SQL Server detected a non-yielding worker thread, but by the time SQL Dumper was initiated to capture a memory dump of the SQL Server process, the thread had yielded and processing had moved on to the next work request in the queue.

However, if the top functions in your dump were not the ones noted above but instead something like the two functions mentioned below, then you definitely have the right callstack, and the issue is due to stuck/stalled I/O.

ntdll!NtWriteFileGather
kernel32!WriteFileGather

The non-yielding scheduler message from the dump I investigated which had the above callstack had the following stuck-stalled IO warnings in the SQL Server Errorlog just prior to the dump getting generated:

SQL Server has encountered 218 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\tempdb.mdf] in database [tempdb] (2). 

So now I definitely know that the non-yielding condition was due to a faulty disk sub-system, and the SQL Server instance at this point is just a victim of the issue!
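If you suspect the disk sub-system while the instance is still up, you can also look at the I/O that SQL Server currently has outstanding. A rough check (joining on the file handle to resolve the database and file) would be something along these lines:

-- I/O requests SQL Server has issued and is still waiting on, and for how long.
SELECT vfs.database_id,
       vfs.file_id,
       pio.io_type,
       pio.io_pending,
       pio.io_pending_ms_ticks
FROM sys.dm_io_pending_io_requests AS pio
JOIN sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    ON pio.io_handle = vfs.file_handle;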

If you see the following message in the SQL Server Errorlog along with the dump generation message, then it is definitely a delayed dump and it could be missing valuable information that might have been pertinent to your analysis.

Timeout while waiting for external dump

Most of the common non-yielding conditions are documented in the whitepaper that I referred to above. If you are unable to get a callstack for the non-yielding issue and the symptoms or sequence of events leading up to the issue don’t match any KB Article, then it’s time to call CSS to get the memory dump analyzed with the help of the private symbols that the CSS engineers have access to.

Happy debugging!!

Reference:
KB 315263 – How to read the small memory dump files that Windows creates for debugging