Debugging the deadlock but for the scheduler: Part 1 of 2

Posted on August 15, 2011 by Amit Banerjee

I had very recently blogged about debugging a non-yielding scheduler condition for a SQL Server instance. In today’s post, I shall talk about debugging a “deadlocked schedulers” issue which is a.k.a the infamous 17884 to you old-school folks. If ALL schedulers are considered stuck and SQL Server is not under active memory pressure, then the 17884 or 17888 condition is reported starting from SQL Server 2005. In this post, I shall show you how to dig into the root cause for a 17888 message reported in the SQL Errorlog. There are subtle differences between a 17888 and 17884 condition detected and reported by the SchedulerMonitor. A 17884 generally tells you that no work is being processed by a particular NUMA Node (for non-NUMA machines, it denotes the entire instance). On the other hand, 17888 tells you that no new work request is being processed on a particular NUMA node by the SQL server instance due to worker threads waiting on particular resource(s). The whitepaper mentioned at the end of this post covers the nitty-gritty details of the detection algorithm.

Description of the error messages from sys.messages catalog

17784 – New queries assigned to process on Node %d have not been picked up by a worker thread in the last %d seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: %d%%. System Idle: %d%%.

17888 – All schedulers on Node %d appear deadlocked due to a large number of worker threads waiting on %ls. Process Utilization %d%%.

When a 17888 is reported, you will find messages of the following nature in the SQL Errorlog:

2008-03-25 14:14:53.37 Server      ***Stack Dump being sent to g:\mssql\data\MSSQL\LOG\SQLDump0006.txt
2008-03-25 14:14:53.38 Server      * BEGIN STACK DUMP:
2008-03-25 14:14:53.38 Server      *   03/25/08 14:14:53 spid 0
2008-03-25 14:14:53.38 Server      *
2008-03-25 14:14:53.38 Server      * Deadlocked Schedulers
2008-03-25 14:14:53.38 Server      * Short Stack Dump
2008-03-25 14:14:53.41 Server      Stack Signature for the dump is 0x00000260
2008-03-25 14:14:56.44 Server      External dump process return code 0x20000001.
External dump process returned no errors.

2008-03-25 14:14:56.44 Server      All schedulers on Node 0 appear deadlocked due to a large number of worker threads waiting on LCK_M_S. Process Utilization 1%.

Again note the differently colored lines and text whose significance, if not self-explanatory at this time, I shall talk about them later in the post. The last message in the above excerpt very clearly tells you that you have a blocking issue on the SQL instance and most of the worker threads on Node 0 are waiting for a LCK_M_S which they cannot acquire due to a conflicting lock(s) held on the object(s) by another session. This time around the SQLDumpXXXX.txt filename is mentioned in the Errorlog unlike the non-yielding scheduler message. So what needs to be done is locate the .mdmp file (in this case SQLDump0006.mdmp) from the same location and use WinDBG to load the dump for analysis.

If you encounter this condition on a SQL Server Failover Cluster instance, then this will result in IsAlive check failures and the SQL service being re-started on the same node or could even result in a failover, if the SQL instance doesn’t respond to the stop request in time. Note the “External dump process returned no errors.” in the above output. This means that the dump generation routine didn’t encounter any errors. Before I move onto the interesting part of looking through the data, let me set expectation here that there is very little information that you can coax a mini-dump for a deadlocked schedulers issue to provide you with while using public symbols. Even with the help of private symbols which CSS engineers have access to, it might not be possible to perform a complete root cause analysis of the issue. The main reason for this is that even if CSS engineers are able to find out the queries being executed on the SQL instance from the mini-dump, the blocking chain origin (in this case because it is clear from the error message excerpt in the Errorlog) cannot be determined from the mini-dump. This invariably ends up requiring collection of SQL Server Perf Stats script output (for capturing blocking information) and profiler traces to get the queries being executed by the head blocker which caused the blocking condition.

Now when you load the dump using WinDBG, as noted in the debugging a non-yielding scheduler issue, you will note similar things:

When you load the dump using WinDBG, you will see the following information:

Loading Dump File [g:\mssql\data\MSSQL\LOG\SQLDump0006.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are available

Comment: ‘Stack Trace’
Comment: ‘Deadlocked Schedulers’

The above tells you that this is a mini-dump for a deadlocked schedulers condition and the location from where you loaded the dump. Then I use the command to set my symbol path and direct the symbols downloaded from the Microsoft symbol server to a local symbol file cache on my machine.

.sympath srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols

Then I issue a reload command to load the symbols for sqlservr.exe. This can also be done using CTRL+L and providing the complete string above (without .sympath), checking the Reload checkbox and clicking on OK. The only difference here is that the all the public symbols for all loaded modules in the dump will be downloaded from the Microsoft Symbol Server which are available.

.reload /f sqlservr.exe

Next thing is to verify that the symbols were correctly loaded using the lmvm sqlservr command. If the symbols were loaded correctly, you should see the following output. Note the text in green.

0:019> lmvm sqlservr
start end module name

01000000 02c0a000 sqlservr T (pdb symbols) d:\publicsymbols\sqlservr.pdb\389EF554D94A4947846D85FCDC4233382\sqlservr.pdb

If symbols were not loaded, then you will see an output as shown below.

0:005> lmvm sqlservr
start end module name
01000000 02c0a000 sqlservr (export symbols) sqlservr.exe

Since I know that there is a blocking issue due to which a deadlocked schedulers condition was reported, I shall use the !findstack debugger command to find out all the callstacks which are potentially waiting for a lock:

!findstack sqlservr!LockOwner::Sleep 2

Output for one of the threads from the above command:

Thread 271, 1 frame(s) match
          00 6db5eef8 7c821fc4 ntdll!KiFastSystemCallRet
          01 6db5eefc 77e43c1d ntdll!NtSignalAndWaitForSingleObject
          02 6db5ef70 010034f2 kernel32!SignalObjectAndWait
          03 6db5ef94 01002172 sqlservr!SOS_Scheduler::Switch
          04 6db5f108 01002cb9 sqlservr!SOS_Scheduler::SwitchContext
          05 6db5f120 01001c75 sqlservr!SOS_Scheduler::SuspendNonPreemptive
          06 6db5f138 01002f25 sqlservr!SOS_Scheduler::Suspend
          07 6db5f15c 01056f6a sqlservr!SOS_Event::Wait
       * 08 6db5f198 01057081 sqlservr!LockOwner::Sleep
        09 6db5f238 0100a75d sqlservr!lck_lockInternal
          10 6db5f2b4 0100d506 sqlservr!GetLock
          11 6db5f324 010432c7 sqlservr!BTreeRow::AcquireLock
          12 6db5f3a4 0100dc00 sqlservr!IndexDataSetSession::GetRowByKeyValue
      ….
      ….
          33 6db5feb0 010b9064 sqlservr!SystemThread::RunWorker
          34 6db5ff10 010b9201 sqlservr!SystemThreadDispatcher::ProcessWorker
          35 6db5ff78 781329bb sqlservr!SchedulerManager::ThreadEntryPoint
          36 6db5ffb0 78132a47 msvcr80!_callthreadstartex
          37 6db5ffb8 77e6608b msvcr80!_threadstartex
          38 6db5ffec 00000000 kernel32!BaseThreadStart

As you can see from the above call-stack, the worker thread is waiting for the lock that it requested for to become available. I found that most of the call-stacks reported in the !findstack output were similar. I found that 183 threads (counted the number of match keyword occurrences in the !findstack output) reported LockOwner::Sleep in their call-stack. This confirms that I have a serious blocking issue either due to large number of concurrent requests asking for the same conflicting lock(s) which are held by other thread(s) or it could be a single long running transaction or uncommitted transaction which is causing all threads to wait for the lock to be released post the commit or completion of the query. In my experience of handling such issues, I have seen the latter cause to be most prevalent.

If you need to find out the total number of threads in the dump, then use the ~* command to get the number of threads.

0:278> ~*
   0 Id: 45c.43c Suspend: 1 Teb: 7ffdd000 Unfrozen
      Start: sqlservr!wmainCRTStartup (02432df8)
      Priority: 0 Priority class: 32 Affinity: ff
   1 Id: 45c.51c Suspend: 1 Teb: 7ffda000 Unfrozen
      Start: msvcr80!_threadstartex (781329e1)
      Priority: 0 Priority class: 32 Affinity: 22
…
…
…
.278 Id: 45c.2b4c Suspend: 1 Teb: 7fea3000 Unfrozen
      Start: msvcr80!_threadstartex (781329e1)
      Priority: 0 Priority class: 32 Affinity: ff

This means that 183 threads out of 278 threads (65% of total worker threads) are waiting for locks!!! In case you got a wait type reported in the 17888 message which is not a LOCK waittype, then what keyword do you search for using the !findstack command?? That would lead to an interesting scenario. You can then use another debugger command !uniqstack to list out the unique call-stacks in the dump.

0:278> !uniqstack

Partial output of the above command:
Processing 279 threads, please wait

.134 Id: 45c.30a8 Suspend: 1 Teb: 7ff36000 Unfrozen
Start: msvcr80!_threadstartex (781329e1)
Priority: 0 Priority class: 32 Affinity: 22
ChildEBP RetAddr
6db5eef8 7c821fc4 ntdll!KiFastSystemCallRet
6db5eefc 77e43c1d ntdll!NtSignalAndWaitForSingleObject
6db5ef70 010034f2 kernel32!SignalObjectAndWait
6db5ef94 01002172 sqlservr!SOS_Scheduler::Switch
6db5f108 01002cb9 sqlservr!SOS_Scheduler::SwitchContext
6db5f120 01001c75 sqlservr!SOS_Scheduler::SuspendNonPreemptive
6db5f138 01002f25 sqlservr!SOS_Scheduler::Suspend
6db5f15c 01056f6a sqlservr!SOS_Event::Wait
6db5f198 01057081 sqlservr!LockOwner::Sleep
6db5f238 0100a75d sqlservr!lck_lockInternal
6db5f2b4 0100d506 sqlservr!GetLock
6db5f324 010432c7 sqlservr!BTreeRow::AcquireLock
6db5f3a4 0100dc00 sqlservr!IndexDataSetSession::GetRowByKeyValue
….
….
6db5feb0 010b9064 sqlservr!SystemThread::RunWorker
6db5ff10 010b9201 sqlservr!SystemThreadDispatcher::ProcessWorker
6db5ff78 781329bb sqlservr!SchedulerManager::ThreadEntryPoint
6db5ffb0 78132a47 msvcr80!_callthreadstartex
6db5ffb8 77e6608b msvcr80!_threadstartex
6db5ffec 00000000 kernel32!BaseThreadStart

Total threads: 279
Duplicate callstacks: 219 (windbg thread #s follow):
3, 12, 13, 15, 16, 17, 19, 21, 22, … …. 276, 277, 278

With the thread IDs reported in the duplicate call-stacks output (italicized numbers), you will manually (with some soul searching) have to look through a few thread call stacks using the following commands (below) and check what is most common callstack pattern that you notice from the list of duplicate threads. Disclaimer: This is not an easy task unless you do engage in such kind of debugging on a regular basis! And definitely not for the faint hearted and easily irritated!

~[Thread ID]s
kL100

Example:

0:052> ~53s
eax=553985c0 ebx=7ff2ce38 ecx=00000000 edx=00000000 esi=00001bac edi=00000000
eip=7c82ed54 esp=693aeefc ebp=693aef70 iopl=0         nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000             efl=00000246
ntdll!KiFastSystemCallRet:
7c82ed54 c3              ret

0:053> kL100

ChildEBP RetAddr
6db5eef8 7c821fc4 ntdll!KiFastSystemCallRet
6db5eefc 77e43c1d ntdll!NtSignalAndWaitForSingleObject
6db5ef70 010034f2 kernel32!SignalObjectAndWait
6db5ef94 01002172 sqlservr!SOS_Scheduler::Switch
6db5f108 01002cb9 sqlservr!SOS_Scheduler::SwitchContext
6db5f120 01001c75 sqlservr!SOS_Scheduler::SuspendNonPreemptive
6db5f138 01002f25 sqlservr!SOS_Scheduler::Suspend
6db5f15c 01056f6a sqlservr!SOS_Event::Wait
6db5f198 01057081 sqlservr!LockOwner::Sleep
6db5f238 0100a75d sqlservr!lck_lockInternal
6db5f2b4 0100d506 sqlservr!GetLock
6db5f324 010432c7 sqlservr!BTreeRow::AcquireLock
6db5f3a4 0100dc00 sqlservr!IndexDataSetSession::GetRowByKeyValue
….
….
6db5feb0 010b9064 sqlservr!SystemThread::RunWorker
6db5ff10 010b9201 sqlservr!SystemThreadDispatcher::ProcessWorker
6db5ff78 781329bb sqlservr!SchedulerManager::ThreadEntryPoint
6db5ffb0 78132a47 msvcr80!_callthreadstartex
6db5ffb8 77e6608b msvcr80!_threadstartex
6db5ffec 00000000 kernel32!BaseThreadStart

TIP: The interesting thing to note is the thread IDs that you see above will directly match the OS Thread ID reported in the sys.dm_os_threads os_thread_id column value or the sys.sysprocesses kpid column value. So if you have a monitoring tool that is capturing the output of queries being executed by the active SQL Server sessions, then you can go back to your monitoring repository and dig up the queries which these threads were executing. If such data is not available from the time of the issue, more often that not you will have to end up collecting additional data for the next problem occurrence!

Also check if the Deadlocked Schedulers matches any KB Article (update) already released. Remember that deadlocked schedulers is a symptom and the not the cause!! So make sure that if you find a KB Article which talks about a Deadlocked Schedulers issue being addressed in a later build, the symptoms match the symptoms that you are experiencing on your server! If in doubt, ask CSS!

If you are stuck and have no clue why the issue happened, then this would be a good time to engage Microsoft CSS Support and provide them with the SQL Errorlogs, the dump file and the windows event logs in .txt format for further analysis.

Reference:
KB 315263 – How to read the small memory dump files that Windows creates for debugging
How To Diagnose and Correct Errors 17883, 17884, 17887, and 17888 whitepaper

del.icio.us Tags: Deadlocked Schedulers,SQL Server 2005,SQL Server 2008,Debugging,Stack Dump,17884,17888

Debugging a non-yielding scheduler issue

Posted on August 13, 2011 by Amit Banerjee

If you have read the How To Diagnose and Correct Errors 17883, 17884, 17887, and 17888 whitepaper, you would know what a 17883 error reported in SQL Server 2000 means. If you read the “SQL Server 2005 – SchedulerMonitor” section of the whitepaper, you will understand that the way non-yielding messages are reported for a SQL Server worker thread is much more refined than it’s predecessor. We report a non-yielding worker thread as a Non-yielding Scheduler message in the SQL Server Errorlog. Since the whitepaper is very comprehensive in explaining SQL Server non-yielding detection algorithms, I shall not reiterate the same. In this post, I shall show you how to track down the cause of a non-yielding scheduler using public symbols in conjunction with the SQL Server Errorlog and Windows Event Logs.

Below is an excerpt from an Errorlog which is reporting a non-yielding scheduler message.

2011-07-24 22:17:45.850 Server       * BEGIN STACK DUMP:
2011-07-24 22:17:45.850 Server       *   07/24/11 22:17:45 spid 2388
2011-07-24 22:17:45.850 Server       *
2011-07-24 22:17:45.850 Server       * Non-yielding Scheduler
2011-07-24 22:17:45.850 Server       Stack Signature for the dump is 0x00000000000003B6
2011-07-24 22:17:53.380 Server       External dump process return code 0x20000001.
External dump process returned no errors.

2011-07-24 22:17:53.400 Server Process 0:0:0 (0x990) Worker 0x0000000005A6C1A0 appears to be non-yielding on Scheduler 1. Thread creation time: 12955600342903. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 0%. System Idle 99%. Interval: 70119 ms.

There are lines that I have colored differently in the above excerpt whose significance I shall detail later in this thread. The Windows Application Event Log will report this as an Informational message with the EventID = 17883. This is what I have in my application event log for the above reported non-yielding condition:

Process 0:0:0 (0x990) Worker 0x0000000005A6C1A0 appears to be non-yielding on Scheduler 1. Thread creation time: 12955600342903. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 0%%. System Idle 99%%. Interval: 70119 ms.

The above message is the same message reported in the Errorlog excerpt above. Since the Errorlog doesn’t report the MDMP (dump file) number, you will have to track it down using the timestamp when the issue was reported or using the SQL Dumper log in case you are performing post-mortem analysis. The SQL Dumper log name is “SQLDUMPER_ERRORLOG.log” and it is located by default in the SQL Server LOG folder. You will find a message similar to the one shown below which will point you to the correct .mdmp file.

(2A30:2760) 07/24/11 22:17:52, ACTION, SQLDUMPER_UNKNOWN_APP.EXE, MiniDump completed: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\LOG\SQLDump0025.mdmp

Now I need to use Windows Debugging Tools to load the .mdmp file. Note that the non-yielding message in the SQL Server Errorlog reported “External dump process returned no errors.”. This means that the dump generation routine didn’t encounter any errors. The most common error that occurs while generating dumps (due to resource crunch – memory or CPU), is a timeout. This means that the dump generation timed out and it is a delayed dump. So the information contained in the dump file may or may not be useful to perform a complete post-mortem analysis. This is when Microsoft CSS would ask you to collect additional data for the next occurrence if the current data available is not sufficient for a complete post-mortem analysis.

So let’s move on to the good part now… i.e. analysis of the dump. Keep in mind that this is a mini-dump which means that the is limited amount of information stored in the dump and we are using public symbols to analyze the dump. Even with private symbols that Microsoft CSS Engineers have access to, there might not be a whole lot they can decipher from a mini-dump and might request additional data!

When you load the dump using WinDBG, you will see the following information:

Loading Dump File [C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Log\SQLDump0025.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are available

Comment: ‘Stack Trace’
Comment: ‘Non-yielding Scheduler’

…
…

Debug session time: Sun Jul 24 22:17:52.000 2011 (UTC + 5:30)

The above tells you that this is a mini-dump for a non-yielding scheduler condition and the location from where you loaded the dump. Then I use the command to set my symbol path and direct the symbols downloaded from the Microsoft symbol server to a local symbol file cache on my machine. It also tells me the time when the dump was generated.

What is a minidump?
A memory dump which contains the memory for the call stack of all threads, the CPU registers and information about which modules are loaded.

.sympath srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols

.reload /f sqlservr.exe

Next thing is to verify that the symbols were correctly loaded using the lmvm sqlservr command. If the symbols were loaded correctly, you should see the following output. Note the text in green.

0:019> lmvm sqlservr
start             end                 module name
00000000`00c60000 00000000`04820000   sqlservr   (pdb symbols)          D:\publicsymbols\sqlservr.pdb\6A3527657A3F4090AB2C96F7008A55C52\sqlservr.pdb
    Loaded symbol image file: sqlservr.exe
    Mapped memory image file: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe
    Image path: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe
    Product version: 10.50.1790.0

If symbols were not loaded, then you will see an output as shown below.

0:019> lmvm sqlservr
start             end                 module name
00000000`00c60000 00000000`04820000   sqlservr   (export symbols)       sqlservr.exe
    Loaded symbol image file: sqlservr.exe
    Mapped memory image file: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe
    Image path: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\sqlservr.exe

Next I shall use the Thread ID noted above (0x990) in the Errorlog excerpt to switch to the thread and get the callstack for the thread which was reported as non-yielding.

0:019> ~~[0x990]s
ntdll!NtWaitForSingleObject+0xa:
00000000`76e06bda c3 ret
0:019> kL100

ntdll!NtWaitForSingleObject
kernel32!WaitForSingleObjectEx
sqlservr!SOS_Scheduler::SwitchContext
sqlservr!SOS_Scheduler::SuspendNonPreemptive
sqlservr!SOS_Scheduler::Suspend
…
…
msvcr80!endthreadex
msvcr80!endthreadex
kernel32!BaseThreadInitThunk
ntdll!RtlUserThreadStart

Looking at the callstack above, I can say that this is not the cause of the non-yielding condition (99.99999% true in most cases)!! The reason I can say this is because I find SwitchContext on the thread stack which means that the thread has yielded. The sequence of events is as follows: SQL Server detected a non-yielding worker thread. By the time, SQL Dumper was initiated to capture a memory dump of the SQL Server process, the thread yielded and the processing moved on to the next work request in the queue.

However, if your dump didn’t have the top functions as the ones noted above but was something like the two functions mentioned below, then you definitely have the right callstack and the issue noted below is due to stuck-stalled I/O.

ntdll!NtWriteFileGather
kernel32!WriteFileGather

The non-yielding scheduler message from the dump I investigated which had the above callstack had the following stuck-stalled IO warnings in the SQL Server Errorlog just prior to the dump getting generated:

SQL Server has encountered 218 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\tempdb.mdf] in database [tempdb] (2).

So now I definitely know that the non-yielding condition was due to faulty disk sub-system and the SQL Server instance at this point is just a victim of the issue!

If you see the the following message in the SQL Server Errorlog along with the dump generation message, then it is definitely a delayed dump and it could have some valuable information missing that might have been pertinent to your analysis.

Timeout while waiting for external dump

Most of the common non-yielding conditions are documented in the whitepaper that I referred to above. If you are unable to get a callstack for the non-yielding issue and the symptoms or sequence of events leading up to the issue doesn’t match any KB Article, then it’s time to call CSS to get the memory dump analyzed with the help of the private symbols that the CSS engineers have access to.

Happy debugging!!

Reference:
KB 315263 – How to read the small memory dump files that Windows creates for debugging

del.icio.us Tags: Non-yielding Scheduler,SQL Server 2005,SQL Server 2008,Debugging,Stack Dump

Wiki: Demystifying SQL Server Exception Dumps

Posted on March 9, 2011 by Amit Banerjee

Recently, I had replied to a MSDN post regarding an Access Violation reported by a SQL Server instance. In this post, I shall elucidate on troubleshooting memory dumps generated by SQL Server exceptions and what can be done without the availability of private symbols for debugging.

I had written a post earlier on when a memory dump is generated in SQL Server. If a single Errorlog has multiple stack dumps from multiple exceptions, then the first thing to do is to investigate the first access violation reported in the series of dumps reported. It is also possible that an exception (eg:access violation) could be followed by another exception/assertion or a non-yielding condition which may be a side effect of the earlier access violation.

When SQL encounters an access violation, there will be SQLDumpXXXX.mdmp, SQLDumpXXXX.txt and SQLDumpXXXX.log files created in the LOG folder associated with the exception. The SQL Errorlog will report the occurrence of the exception as well.

The .mdmp file is the memory dump which would be required by CSS to troubleshoot the issue.
The .txt file is the symptom dump file which contains environment information along with other information depending on the nature of the exception.
The .log file contains an extract from the SQL Errorlog when the exception was encountered. This helps in looking at what messages were reported in the Errorlog prior to the issue. Helps during post-mortem analysis if the current Errorlogs have already been recycled.

There can be two types of exceptions: one that is fatal enough to cause a SQL service termination or one that encounters the exception which terminates the thread which was encountered the issue.

Now what can be done in such a situation without having to look into memory dump file. The first step would be to check if the issue is a recurring behavior due to a particular operation to establish some pattern or if there is not identifiable pattern at all. If you look into the SQL Errorlog, you might be able to identify the input buffer which cause the exception (unless it’s an exception which terminated the SQL service, one such example is shown below).

Server * BEGIN STACK DUMP:
Server * spid 5728
Server * ex_handle_except encountered exception C0000005 – Server terminating

If the above snippet contains the T-SQL query which raised the exception, then you would want to execute the same query from a Management Studio query window and check if the exception re-occurs. If yes, then you have a valid repro scenario.

Example snippet from an Errorlog where the Access Violation didn’t cause a service termination:

*   Exception Address = 0042B866
*   Exception Code    = c0000005 EXCEPTION_ACCESS_VIOLATION
*   Access Violation occurred writing address 67192000
* Input Buffer 38 bytes –
* exec sp_updatestats

Next check if you can spot any external components loaded into SQL Server address space. I shall revert back to my MSDN post where I had spotted an anti-virus filter DLL loaded in SQL Server process address space.

Server * bcrypt 000007FEFC900000 000007FEFC921FFF 00022000
Server * PCTLsp64 000000000B1E0000 000000000B251FFF 00072000
Server * mswsock 000007FEFC840000 000007FEFC893FFF 00054000

Based on a few Bing searches, I was able to determine that the PCTLsp64 was part of a virus monitoring program from PC Tools. The next step would be to check if this external component can be moved out of SQL Server process address space. The fact that the DLL is loaded in SQL process address space doesn’t make the DLL a culprit. However, when troubleshooting such issues, it is important to eliminate all unknown or third party elements that could prevent us from narrowing down to the culprit. Other DLLs which can be commonly found in SQL Server process address space would be provider DLLs when linked servers are being used on the SQL Server instance. Additionally execution of Extended Stored Procedures (XSPs) could also load external DLLs into SQL Server process address space depending on the way the XSP DLL was coded. An easy way to identifying a XSP dll is searching for the word “using” in the SQL Server Errorlog. SQL Server reports the first instance of a XSP dll being loaded into SQL Server process address space.

Using ‘xpstar.dll’ version ‘2009.100.1600’ to execute extended stored procedure ‘xp_readerrorlog’

If you do find an external component loaded in SQL process address space, check the callstack reported in the Errorlog below the list of modules printed below to determine if the functions of the exception stack are associated with external component that you identified. For the issue reported in the MSDN post, I did find this:

Module(sqlservr+00000000017A954C)
Module(kernel32+0000000000099380)
Module(ntdll+0000000000096228)
Module(ntdll+0000000000014F48)
Module(ntdll+0000000000034F6D)
Module(ntdll+0000000000015B2C)
Module(ntdll+000000000004F638)
Module(ntdll+000000000004C8F4)
Module(ntdll+000000000004C822)
Module(PCTLsp64+00000000000040AD)
Module(PCTLsp64+0000000000004701)
Module(PCTLsp64+0000000000005434)

Now this makes my story a bit more compelling of removing the external component out of SQL Server process address space. Once that is done and if you have a repro condition which causes the issue, then it would be good to re-run your repro and check if the issue is actually resolved. Not all the exceptions are the same even though they are reported in the same manner.

If all the above hasn’t helped in narrowing down the issue, then check if you are on the latest available cumulative update for the version of SQL Server that you are using. If not, then it might be a good idea to update your instance to the latest available build to ensure that you are not hitting a known issue. If this is possible, then apply the latest cumulative update package and check if the issue re-occurs. I am cognizant of the fact that this might be an easy option due to change management restrictions on production environments. For such a dependency, my recommendation would be engage CSS to assist with a thorough investigation of the root cause of the issue.

As you can see from the steps outline above, you can look into a few things before you need to start poking into the exception memory dump using a Debugging tool.

Happy debugging!

Technorati Tags: Troubleshooting,SQL Server,Debugging

Troubleshooting that darn backup error

Posted on February 9, 2011 by Amit Banerjee

Backup and restore operations even though they are supposed to be a no-brainer, it can leave you wanting to tear your hair out. Considering the different elements that influence the outcome of a successful backup operation like backup media, consistency of the backup, network issues (if backing up to a UNC path), consistency of the database being backed up etc.

I shall attempt to outline a set of troubleshooting steps specifically for non-native backup/restore related errors for SQL Server databases.

Any BACKUP command that completes successfully or with errors has associated messages logged in the SQL Server Errorlog.

A successful backup from the SQL Server Errorlog:

Database backed up. Database: adventureworks, creation date(time): 2010/05/28(03:00:18), pages dumped: 17291, first LSN: 67:7733:56, last LSN: 67:7761:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {‘F:\adventureworks.bak’}).

You will see that the backup completion message has the database name, the number of pages that were contained in the backup, the date the backup was created and the number of devices and their type (in this case files). So you can reconstruct the sequence of backups even if your backup information history is missing from the MSDB system tables and you just have your SQL Server Errorlogs to play with. If you want to get real fancy about this, then you could spin up a nifty little Powershell or VBScript to parse through the SQL Errorlogs and provide the backup sequence to you as well.

Now, let’s look at a failed backup message from the SQL Errorlog.

Error: 18204, Severity: 16, State: 1.
BackupDiskFile::CreateMedia: Backup device ‘R:\adventureworks.bak’ failed to create. Operating system error 3(The system cannot find the path specified.).
Error: 3041, Severity: 16, State: 1.
BACKUP failed to complete the command BACKUP DATABASE adventureworks.

You will notice above that the Operating System error code associated with the backup failure is reported in the error message. More often than not, the error message is self explanatory. If the error code doesn’t throw back an error text along with it, you can get the windows error code associated with the error code using net helpmsg <error number> from a command prompt window.

In the above case, there wasn’t any R: drive on my server. Now that I have finished stating the most obvious of troubleshooting methodologies for SQL native backups which was done to set the context for the next part of this post. The troubleshooting methodology for non-native backups.

Non-native SQL backups of databases use one of the following methods:

1. Use of APIs exposed through SQLVDI.dll to perform a VDI based backup
2. Use of VSS framework to perform a snapshot backup of the SQL Server database

Note: In this post, I shall not be addressing any storage level backup solutions that have options or features to handles SQL database backups.

This post is aimed at discussing the tackling of VSS/VDI related errors that you face while performing a database backup.

If a VDI backup fails, then you can try the following sequence of troubleshooting steps:

1. Attempt a backup to the same location using native SQL backup from a query window or SQLCMD. This might not always be possible as some of the backup tools do not allow any other backup application to connect to the backup share apart from the backup tool’s agents.
2. The next thing that you can do is to ensure that your SQLVDI.DLL is updated to the latest build. For this you can use the SQL Server Backup Simulator available on Code Gallery using the “Validate VDI Installation” option in the tool. This check in the tool will perform basic checks like current DLL version and additional checks based on root causes of common scenarios that CSS has seen in the past for VDI backup failures. Additional information on the usage of the tool is available here.
3. Perform a backup of the database using the tool for which the backup is failing (to the same destination if possible). The parameter tweaks at this point are not available in the current version like striped backups, changes to MAXTRANSFERSIZE/BUFFER COUNT etc. but if the backup from this tool is successful, then you know for a fact that the SQLVDI APIs are working as expected. Then it is either an additional piece of logic in the backup software that is failing or there is an external factors like network/backup media or there is a resource crunch on the SQL instance on which the database resides.

Common errors that you might see during a VDI backup operation failure:

Error message 1
BackupVirtualDeviceFile::ClearError: failure on backup device ‘VDI_ DeviceID ‘. Operating system error 995(The I/O operation has
been aborted because of either a thread exit or an application request.).

Error message 2
Error: 18210, Severity: 16, State: 1.
BackupMedium::ReportIoError: write failure on backup device ‘VDI_ DeviceID ‘. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.)

Error message 3
Error: 18210, Severity: 16, State: 1.
BackupVirtualDeviceFile::RequestDurableMedia: Flush failure on backup device ‘VDI_ DeviceID. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.)

The above errors are mostly due to outdated versions of SQLVDI.DLL on SQL Server 2000 or SQL Server 2005 instances. The “Validate VDI Installation” option in the Backup Simulator should be able to identify any known issues with DLL version mismatch for you.

If you have an application that uses Volume Shadow Service to backup your SQL database and that backup fails, then you your troubleshooting steps would be a bit different. The tell-tale sign of a VSS Snapshot backup is in the I/O Frozen message being written into the SQL Errorlogs for the databases being backed up. In such a failure scenario, take a backup of the database using Windows Server Backup Admin (Windows Server 2008 and above) or NTBackup utility to perform a backup of the database files involved. If this backup is successful, it means that a snapshot backup is successfully completing for the database. Now depending on your backup application’s implementation logic, this might be a resource crunch or a backup storage media incompatibility or even a resource crunch.

Common errors associated with VSS backups:

Issue# 1
Error: 3041, Severity: 16, State: 1.
Backup BACKUP failed to complete the command BACKUP DATABASE <database name>. Check the backup application log for detailed messages.
SubprocessMgr::EnqueueSubprocess: Limit on ‘Max worker threads’ reached.
This error message is raised when your backup application uses VSS to backup a large number of databases simultaneously. More information around this can be found in this post Volume Shadow barfs with 0x80040e14 code

Issue #2
While performing a Bare Metal backup you might run into issues if your SQL Server binaries are located on a different drive other than the C: drive. More information on the same is mentioned here: Location of SQL Binaries can flip out Bare Metal backups

Issue #3

Event Type:    Error
Event Source:    VSS
Event Category:    None
Event ID:    6005
Description:
Sqllib error: Database %1 is stored on multiple volumes, only some of which are being shadowed.

The above error occurs when you backup only one of the volumes that contains SQL database data/log files which are spread over multiple volumes. This is explained in detailed in this post: Why is VSS complaining about SQL Data files

More information on how VSS/VDI work is present in the blog post by Sudarshan: INFORMATIONAL- SHEDDING LIGHT on VSS & VDI Backups in SQL Server

Previous posts related to Volume Shadow Backups
IO Frozen messages while taking NT Backup for SQL databases

Previous posts related to VDI backups
SQL VDI backup fails with 0×80770007

Technorati Tags: Backups,data recovery,Restore,VDI,VSS

Wiki: SQL Database Engine Startup Failures for Clusters

Posted on December 5, 2010 by Amit Banerjee

I had recently written a Wiki article on my blog for a troubleshooting startup methodology for SQL Server startup failures for Standalone instances. In this article, I shall cover a similar roadmap for troubleshooting database engine startup failures for clustered SQL Server instances. Some of the steps are going to be the same as troubleshooting start-up failures for stand-along instance but there are some minute differences. So here goes…

The first thing that you need to do is to try and bring the SQL Server resource online from the Failover Cluster Administrator in case you are using a Windows Server 2003 cluster or below. If you are using a Windows Server 2008 or above, you would need to use the Failover Cluster Management Snap-in. If the SQL Server resource fails to come online, then check the SQL Server Errorlogs for any errors that were reported.

Locate the SQL Server Errorlog location and check if any errors were observed in the SQL Server Errorlog. This can be done by looking up the –e startup parameter value for the SQL Server instance in question. If you are using SQL Server 2005 and above, then right-click on the Service Name in Configuration Manager –> Properties –> Advanced Tab. Alternatively, you can use a VBScript (or Powershell) to fetch this information. Here is an example for obtaining the Startup Parameters using WMI and VBScript.

Verify that the files pointed to by the –d (master data file location), –e (Errorlog location) and –l (master log file location) actually exist and have the correct file paths. Starting from SQL Server 2005, the Errorlog will have the startup parameters printed in it when the SQL instance starts up. If you find no errors in the SQL Errorlogs, look into the Windows Application Event logs and ensure that there are no errors related to the cluster resource for the SQL instance or the cluster group (applicable for Windows 2003 clusters) that the SQL resource is a part of.

If the Windows Application Event logs, don’t provide any useful errors, then look at any warnings or errors pertaining to the SQL Server resource in the Windows Cluster logs.

Refer the following post to generate the Cluster log on a Windows Server 2008 Failover cluster or higher. For a Windows Server 2003 Cluster, the default location of the cluster logs is: <system driver>:\<windows folder>\Cluster\Cluster.log.

Now try to start the SQL Server instance using command prompt using the sqlservr.exe to check if the service comes online without bringing any user database online using -T3608.

For a default SQL Server instance:

sqlservr.exe –c –m –T3608

For a named instance of SQL Server:

sqlservr.exe –c –m –T3608 –s<instance name>

If the above fails, then the console window should provide you with an error message as to why the SQL instance fails to start. If the above succeeds, then try and start the service using the Configuration Manager.

Note: Make sure that you are not falling prey to a Desktop Heap exhaustion issue.

If you determine that the SQL Server registry entries have an erroneous entry, then you would need to follow the steps mentioned in the KB Article below to rectify the problem (due to Registry Synch operations that happen on a cluster):
http://support.microsoft.com/kb/953504

You might also find that the SQL Server resource entries in the Cluster Registry hive (HKLM\Cluster\Resources\GUID) are missing. You would then have to re-create the same using the steps mentioned in the KB Article below (applicable for SQL Server 2000 and SQL Server 2005 only):
http://support.microsoft.com/kb/810056

NOTE: Remember that modifying registry keys incorrectly on a Windows Server can have disastrous effects. Make sure that you take a registry backup before making any registry modifications.

Apart from the following common startup failure root causes, there are an additional few which are applicable to clustered SQL Server instances only (in BOLD below):

The most common root causes for service startup failures are:

Access denied on the folders for the SQL Server Service Account on the DATA folder which contains the SQL Server system databases or due to some other process (like anti-virus holding a lock on the SQL database files)
Insufficient permissions granted to the SQL Server Service account.
File not found errors due to either accidental deletion of files or disk failures.
System database files having inconsistencies prevent SQL Server startup.
Password of the service account was changed but not updated on the server that has the SQL instance installed.
Startup parameters have incorrect file path locations
The dependent resources for the SQL Server clustered instance are failing to come online. (Eg. Shared disks on which the database files reside on)
Incorrect (read:messed up) Registry entries in the CLUSTER registry hive for the concerned SQL instance which have been checkpointed already and due to which the instance fails to start.
Operating system/Resource crunch/Environment related issues on one or more nodes due to which the instance comes online on one node but not on the other.

Additional references:
T-SQL Tuesday: Common SQL Server myths – Series I–Setup and Services
SQL SERVER 2000 CLuster FAILS TO COME ONLINE ON OTHER NODE
SQL SERVER 2005 RESOURCE FAILS TO COME ONLINE ON CLUSTER AFTER CHANGING THE SAN DRIVE TO ADD MORE DISK SPACE
Sql Server 2005 Resource Taking long time to come online on Windows Cluster with Resource Manager Creation Failed reported in Errorlog
Sql Server Resource Fails to come online on Cluster with Error Error Locating Server/Instance Specified [xFFFFFFFF]
SQL Server 2008 service fails to start, after applying Service Pack 1

Versions that this roadmap applies to:
SQL Server 2000, SQL Server 2005, SQL Server 2008, SQL Server 2008 R2

Technorati Tags: SQL Startup Failures,Database Engine,Failover Cluster

TroubleshootingSQL

Explaining the bits and bytes of data in an AI world

Category Archives: Wikis

Debugging the deadlock but for the scheduler: Part 1 of 2

Debugging a non-yielding scheduler issue

Wiki: Demystifying SQL Server Exception Dumps

Troubleshooting that darn backup error

Wiki: SQL Database Engine Startup Failures for Clusters

TroubleshootingSQL

Explaining the bits and bytes of data in an AI world

Share this post:

Share this post:

Share this post:

Share this post:

Share this post: