Handling Deadlocked Schedulers is a piece of cake now


I had written walkthroughs (Part 1 | Part 2) on how to troubleshoot a Deadlocked Schedulers condition for SQL Server instances. Deadlocked Schedulers is a condition where all your SQL Server worker threads are exhausted and no new work requests are being picked up by the SQL Server instance.

Starting from SQL Server 2012, the System Health extended events session tracks deadlocked schedulers condition using the extended event (scheduler_monitor_deadlock_ring_buffer_recorded). The session tracks other useful events which makes it easy to trace back the series of events which led to the deadlocked schedulers condition!

I will be using the Extended Events UI in SQL Server 2012 management studio to show how the target file of the System Health session can be used to track deadlocked schedulers condition experienced by your SQL Server instance.

Continue reading

Advertisements

Fluffy in an Availability Group Failover Scenario


Over the past month or so, I have been dealing with a lot of questions around the troubleshooting failover scenarios for Availability Groups. So I decided that it is now time for me to pen down a post on the data to be collected and analysis options for digging into the root cause for an Availability Group. I did have time on my hands and decided to induce a Hollywood element into this post as well. The availability group name that I would be using in this post is named as Fluffy. Fluffy has two secondary Availability Replicas: one synchronous and the other one an asynchronous replica.

As you can see in the screenshot below, I had initiated a failover for my Availability Group and the AlwaysOn
Extended Events sessions shows a state change. The Extended Events session writes to a target file (.xel) which is present in the SQL Server LOG folder.

The Extended Event session runs by default when an Availability Group is configured on the SQL Server instance. The following extended events are captured by the Event Session:

  • sqlserver.alwayson_ddl_executed,
  • sqlserver.availability_group_lease_expired,
  • sqlserver.availability_replica_automatic_failover_validation,
  • sqlserver.availability_replica_manager_state_change,
  • sqlserver.availability_replica_state_change,
  • sqlserver.error_reported

Note that the Extended Events session will only track the state changes for the local replica. The Extended Events session is NOT a global store for all the state change events for all replicas!

The previous set of logs that you collect from the SQL Server failover cluster instances like the SQL Errorlog, Cluster log and Windows Event logs are still applicable for root cause analysis for failovers. However, now you have additional logs in the SQL Server LOG folder which can assist with a root cause analysis for failover issues. The screenshot below shows two new files that would be of interest when analyzing SQL Server failovers namely, the AlwaysOn_health_* and <server name>_<instance name>_SQLDIAG_* logs. The first set of files are the AlwaysOn Extended Events logs and the second set of logs are called the Failover Cluster Instance Diagnostics Log.

We already saw from the above screenshot what the AlwaysOn Extended Events health session can track. Now, let’s see what the Failover Cluster Instance Diagnostics Log collects. There will be multiple informational messages about the activities performed against the Availability Group. Additionally, there will be messages pertaining to the sp_server_diagnostics data (component_health_resultset) collection and the Availability Group state change (availability_group_state_change).

The T-SQL query below can help you fetch the state change information for your SQL Server instance. Again, this is specific to the instance from which you fetched the failover cluster instance diagnostics log:

select object_name,cast(event_data as xml) as xmldata
from sys.fn_xe_file_target_read_file('<file name/path>', null, null, null)
where object_name = 'availability_group_state_change'

A snippet of the XML data retrieved using the above query for the manual failover that I had done is shown below:

<data name=”target_state“>
<
value>2</value>
<text>Online</text>
</data>
<data name=”failure condition level“>
<value>3</value>
<text >SYSTEM_UNHEALTHY</text>
</data>


<data name=”availability_group_name”>
<value>FLUFFY</value>
</data>

</event>


In summary, the following sets of logs need to be collected from all the Availability Replicas:

  1. SQL Server Errorlog from the time of the failure
  2. Windows Application and System Event logs from the time of the failure
  3. All the Failover Cluster Instance Diagnostics log (upto a maximum of 10 rollover .xel files by default)
  4. All the AlwaysOn Extended Event session log files (upto a maximum of 4 rollover .xel files by default)
  5. System Health Session Extended Event session files (optional as the component health state information is present in #4)
  6. Windows Cluster log

There are some useful queries in the Books Online topic for the failover cluster instance diagnostics log to parsing through the collected data.

Happy troubleshooting!!

P.S. The above blog post was created using a lab environment provided by SQL Server Virtual Labs. This is an online environment which allows you to create virtual machines to practice various SQL Server scenarios. The lab that I used was “SQL Server 2012: AlwaysOn Availability Groups (SQL 142).

Debugging that latch timeout


My last post of debugging an assertion didn’t have any cool debugging tips since there is not much that you can do with an assertion dump unless you have access to private symbols and sometimes even access to the source code. In this post, I am going to not disappoint and show you some more cool things that the windows debugger can do for you with public symbols for a latch timeout issue.

When you encounter a latch timeout (buffer or non-buffer latch), the first occurrence of it’s type generates a mini-dump. If there are further occurrences of the same latch timeout, then that is reported as an error message in the SQL Errorlog.

Buffer latch timeouts are typically reported using Error: 844 and 845. The common reasons for such errors are documented in a KB Article. For a non-buffer latch timeout, you will get the an 847 error.

Error # Error message template (from sys.messages)
844 Time out occurred while waiting for buffer latch — type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p.  Continuing to wait.
845 Time-out occurred while waiting for buffer latch type %d for page %S_PGID, database ID %d.
846 A time-out occurred while waiting for buffer latch — type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit Id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Not continuing to wait.
847

Timeout occurred while waiting for latch: class ‘%ls’, id %p, type %d, Task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Continuing to wait.

This is what you will see in the SQL Errorlog when a latch timeout occurs.

spid148     Time out occurred while waiting for buffer latchtype 4, bp 0000000832FE1200, page 3:11234374, stat 0x7c20009, database id: 120, allocation unit id: 72057599731367936, task 0x0000000003C4F2E8 : 0, waittime 300, flags 0x1a, owning task 0x0000000003C129B8.  Continuing to wait.
spid148     **Dump thread – spid = 148, PSS = 0x000000044DC17BD0, EC = 0x000000044DC17BE0
spid148     ***Stack Dump being sent to D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0001.txt

spid148     * Latch timeout
spid148     * Input Buffer 84 bytes –
spid148     *             DBCC CHECKDB WITH ALL_ERRORMSGS
External dump process returned no errors.

I have only pasted the relevant portion from the Errorlog for brevity. As I have outlined in my previous blog posts on similar topics, that there is a large opportunity for due diligence that can be done with the help of the Windows Event Logs and the SQL Server Errorlogs before you start spawning off windows debugger to analyze the memory dump on your system. The first few obvious things that you will notice is that SPID 148 encountered the issue while performing a CHECKDB on database ID 120. The timeout occurred while waiting for a buffer latch on a page (Page ID is available in the message above). I don’t see a “Timeout waiting for external dump process” message in the SQL Errorlog which means that I have a good chance of extracting useful information from the mini-dump that was generated by SQLDumper.

Latch timeouts are typically victims of either a system related issue (hardware or drivers or operating system or a previous error encountered by SQL Server). So the next obvious action item would be to look into the SQL Errorlogs and find out if there were any additional errors prior to the latch timeout issue. I see a number of OS Error 1450 reported by the same SPID 148 for the same file handle but different offsets.

spid148     The operating system returned error 1450(Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x0000156bf36000 in file with handle 0x0000000000001358. This is usually a temporary condition and the SQL Server will keep retrying the operation. If the condition persists then immediate action must be taken to correct it.

Additionally, I see prior and post (within 5-15 minutes) the latch timeout issue, multiple other SPIDs reporting the same 1450 error message for different offsets but again on the same file.

spid137     The operating system returned error 1450(Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x000007461f8000 in file with handle 0x0000000000001358. This is usually a temporary condition and the SQL Server will keep retrying the operation. If the condition persists then immediate action must be taken to correct it.

I also see the latch timeout message being reported after every 300 ms for the same page and the database.

spid148     Time out occurred while waiting for buffer latch — type 4, bp 0000000832FE1200, page 3:11234374, stat 0x7c20009, database id: 120, allocation unit id: 72057599731367936, task 0x0000000003C4F2E8 : 0, waittime 82800, flags 0x1a, owning task 0x0000000003C129B8.  Continuing to wait.

Notice the waittime above, it has increased from 300 to 82800!! So the next thing I do is look up issues related to CHECKDB and 1450 error messages on the web using Bing (Yes, I do use BING!!). These are relevant posts related to the above issue.

http://blogs.msdn.com/b/psssql/archive/2008/07/10/sql-server-reports-operating-system-error-1450-or-1452-or-665-retries.aspx
http://blogs.msdn.com/b/psssql/archive/2009/03/04/sparse-file-errors-1450-or-665-due-to-file-fragmentation-fixes-and-workarounds.aspx

As of now, it is quite clear that the issue is related to a possible sparse file issue related to file fragmentation. Now it is time for me to check if there are other threads in the dump waiting on SyncWritePreemptive calls.

Use the location provided in the Errorlog snippet reporting the Latch Timeout message to locate the mini-dump for the issue (in this case SQLDump0001.mdmp).

Now when you load the dump using WinDBG, you will see the following information:

Loading Dump File [D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0001.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are available

Comment: ‘Stack Trace’
Comment: ‘Latch timeout’

This dump file has an exception of interest stored in it.

The above tells you that this is a mini-dump for a Latch Timeout condition and the location from where you loaded the dump. Then I use the command to set my symbol path and direct the symbols downloaded from the Microsoft symbol server to a local symbol file cache on my machine.

.sympath srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols

Then I issue a reload command to load the symbols for sqlservr.exe. This can also be done using CTRL+L and providing the complete string above (without .sympath), checking the Reload checkbox and clicking on OK. The only difference here is that the all the public symbols for all loaded modules in the dump will be downloaded from the Microsoft Symbol Server which are available.

.reload /f sqlservr.exe

Next thing is to verify that the symbols were correctly loaded using the lmvm sqlservr command. If the symbols were loaded correctly, you should see the following output. Note the text in green.

0:005> lmvm sqlservr

start             end                 module name
00000000`01000000 00000000`03668000   sqlservr T (pdb symbols)          d:\publicsymbols\sqlservr.pdb\2A3969D78EE24FD494837AF090F5EDBC2\sqlservr.pdb

 

If symbols were not loaded, then you will see an output as shown below.

0:005> lmvm sqlservr
start end module name
00000000`01000000 00000000`03668000 sqlservr (export symbols) sqlservr.exe

I will use the !findstack command to locate all threads which have the function call SyncWritePreemptive on their callstack.

0:070> !findstack sqlservr!FCB::SyncWritePreemptive 0

Thread 069, 1 frame(s) match
Thread 074, 1 frame(s) match
Thread 076, 1 frame(s) match
Thread 079, 1 frame(s) match
Thread 081, 1 frame(s) match
Thread 082, 1 frame(s) match
Thread 086, 1 frame(s) match
Thread 089, 1 frame(s) match
Thread 091, 1 frame(s) match
Thread 095, 1 frame(s) match
Thread 098, 1 frame(s) match
Thread 099, 1 frame(s) match
Thread 104, 1 frame(s) match
Thread 106, 1 frame(s) match
Thread 107, 1 frame(s) match
Thread 131, 1 frame(s) match
Thread 136, 1 frame(s) match

0:070> ~81s
ntdll!ZwWaitForSingleObject+0xa:
00000000`77ef0a2a c3              ret
0:081> kL100

ntdll!ZwDelayExecution
kernel32!SleepEx
sqlservr!FCB::SyncWritePreemptive
sqlservr!FCB::PullPageToReplica
sqlservr!alloca_probe
sqlservr!BUF::CopyOnWrite
sqlservr!PageRef::PrepareToDirty
sqlservr!RecoveryMgr::DoCOWPreWrites

You could get all the callstacks with the function that you are searching for using the command: !findstack sqlservr!FCB::SyncWritePreemptive 3

If you look at the thread that raised ended up raising the Latch Timeout warning was also performing a CHECKDB.

0:074> .ecxr

0:074> kL100

kernel32!RaiseException
sqlservr!CDmpDump::Dump
sqlservr!CImageHelper::DoMiniDump
sqlservr!stackTrace
sqlservr!LatchBase::DumpOnTimeoutIfNeeded
sqlservr!LatchBase::PrintWarning
sqlservr!alloca_probe
sqlservr!BUF::AcquireLatch


sqlservr!UtilDbccCreateReplica
sqlservr!UtilDbccCheckDatabase
sqlservr!DbccCheckDB
sqlservr!DbccCommand::Execute

If you cannot find the thread which raised the Latch Timeout warning, you could print out all the callstacks in the dump using ~*kL100 and the searching for the function call in blue above. It is quite clear from the callstack above that the thread was also involved in performing a CHECKDB operation as reported in the SQL Errorlog in the input buffer for the Latch Timeout dump.

If case you were not able to identify the issue right off the bat, you need to check the build that you are on and look for issues that were addressed related to Latch Timeouts for the SQL Server release  that you are using. The symptoms section would have sufficient amount of information for you to compare with your current symptoms and scenario and determine if the KB Article that you are looking at is applicable in your case.

Now is the time, when you need to have some context about the operations that were occurring on the server to actually determine what the actual issue is. Based on what I heard from the system administrators that there was a CHECKDB being executed on the database while the application was executing DML operations on the database. Additionally, the volume on which the disk resides on has fragmentation and the database in question is large (>750GB).

Based on the two MSDN blog posts that I mentioned above, it is quite clear that it is possible to run into sparse file limitations when there is high amount of fragmentation on the disks or if there are a large number of concurrent changes occurring on the database when a CHECKDB is executing on it. A number of Windows and SQL Server updates along with workarounds to run CHECKDB on such databases is mentioned in the second blog post mentioned above. On a separate note, this is not an issue with CHECKDB! It is limitation that you are hitting with sparse files on the OS layer. Remember CHECKDB, starting from SQL Server 2005, creates an internal snapshot (makes sparse file) to execute the consistency check. Paul Randal’s tweet made me add this line to call this out explicitly!

As always… if you are still stuck, contact Microsoft CSS with the mini-dump file, SQL Errorlog and the Windows Event Logs. It might be quite possible that CSS might ask you to collect additional data as most Latch Timeout issues are generally an after-effect of a previous issue. In this case, it was the OS Error 1450.

Well… That’s it for today! Hope this is helpful for the next time you encounter a Latch Timeout issue.

Additional References

Four stages of NTFS File Growth
KB 315263 – How to read the small memory dump files that Windows creates for debugging