SQL Server 2012: Debugging Deadlocked Schedulers

I had blogged about debugging deadlocked schedulers earlier for SQL Server 2008 R2 and below releases. Since there have been some fundamental changes in SQL Server 2012, I thought it would be a good idea to post about debugging the same scenario on SQL Server 2012 instances.

Older posts:
Debugging Deadlocked Schedulers Part 1
Debugging Deadlocked Schedulers Part 2

Continue reading

Handling Deadlocked Schedulers is a piece of cake now

I had written walkthroughs (Part 1 | Part 2) on how to troubleshoot a Deadlocked Schedulers condition for SQL Server instances. Deadlocked Schedulers is a condition where all your SQL Server worker threads are exhausted and no new work requests are being picked up by the SQL Server instance.

Starting from SQL Server 2012, the System Health extended events session tracks deadlocked schedulers condition using the extended event (scheduler_monitor_deadlock_ring_buffer_recorded). The session tracks other useful events which makes it easy to trace back the series of events which led to the deadlocked schedulers condition!

I will be using the Extended Events UI in SQL Server 2012 management studio to show how the target file of the System Health session can be used to track deadlocked schedulers condition experienced by your SQL Server instance.

Continue reading

Fluffy in an Availability Group Failover Scenario

Over the past month or so, I have been dealing with a lot of questions around the troubleshooting failover scenarios for Availability Groups. So I decided that it is now time for me to pen down a post on the data to be collected and analysis options for digging into the root cause for an Availability Group. I did have time on my hands and decided to induce a Hollywood element into this post as well. The availability group name that I would be using in this post is named as Fluffy. Fluffy has two secondary Availability Replicas: one synchronous and the other one an asynchronous replica.

As you can see in the screenshot below, I had initiated a failover for my Availability Group and the AlwaysOn
Extended Events sessions shows a state change. The Extended Events session writes to a target file (.xel) which is present in the SQL Server LOG folder.

The Extended Event session runs by default when an Availability Group is configured on the SQL Server instance. The following extended events are captured by the Event Session:

  • sqlserver.alwayson_ddl_executed,
  • sqlserver.availability_group_lease_expired,
  • sqlserver.availability_replica_automatic_failover_validation,
  • sqlserver.availability_replica_manager_state_change,
  • sqlserver.availability_replica_state_change,
  • sqlserver.error_reported

Note that the Extended Events session will only track the state changes for the local replica. The Extended Events session is NOT a global store for all the state change events for all replicas!

The previous set of logs that you collect from the SQL Server failover cluster instances like the SQL Errorlog, Cluster log and Windows Event logs are still applicable for root cause analysis for failovers. However, now you have additional logs in the SQL Server LOG folder which can assist with a root cause analysis for failover issues. The screenshot below shows two new files that would be of interest when analyzing SQL Server failovers namely, the AlwaysOn_health_* and <server name>_<instance name>_SQLDIAG_* logs. The first set of files are the AlwaysOn Extended Events logs and the second set of logs are called the Failover Cluster Instance Diagnostics Log.

We already saw from the above screenshot what the AlwaysOn Extended Events health session can track. Now, let’s see what the Failover Cluster Instance Diagnostics Log collects. There will be multiple informational messages about the activities performed against the Availability Group. Additionally, there will be messages pertaining to the sp_server_diagnostics data (component_health_resultset) collection and the Availability Group state change (availability_group_state_change).

The T-SQL query below can help you fetch the state change information for your SQL Server instance. Again, this is specific to the instance from which you fetched the failover cluster instance diagnostics log:

select object_name,cast(event_data as xml) as xmldata
from sys.fn_xe_file_target_read_file('<file name/path>', null, null, null)
where object_name = 'availability_group_state_change'

A snippet of the XML data retrieved using the above query for the manual failover that I had done is shown below:

<data name=”target_state“>
<
value>2</value>
<text>Online</text>
</data>
<data name=”failure condition level“>
<value>3</value>
<text >SYSTEM_UNHEALTHY</text>
</data>


<data name=”availability_group_name”>
<value>FLUFFY</value>
</data>

</event>


In summary, the following sets of logs need to be collected from all the Availability Replicas:

  1. SQL Server Errorlog from the time of the failure
  2. Windows Application and System Event logs from the time of the failure
  3. All the Failover Cluster Instance Diagnostics log (upto a maximum of 10 rollover .xel files by default)
  4. All the AlwaysOn Extended Event session log files (upto a maximum of 4 rollover .xel files by default)
  5. System Health Session Extended Event session files (optional as the component health state information is present in #4)
  6. Windows Cluster log

There are some useful queries in the Books Online topic for the failover cluster instance diagnostics log to parsing through the collected data.

Happy troubleshooting!!

P.S. The above blog post was created using a lab environment provided by SQL Server Virtual Labs. This is an online environment which allows you to create virtual machines to practice various SQL Server scenarios. The lab that I used was “SQL Server 2012: AlwaysOn Availability Groups (SQL 142).

Why did the restore fail on the log shipped secondary database

I was doing some random testing for a different log shipping related scenario, when I ran into a restore failure reported by the log shipping restore job. A snippet of the error message is shown below:

*** Error: Could not apply log backup file ‘<backup file name>’ to secondary database ‘<database name>’.(Microsoft.SqlServer.Management.LogShipping) ***
*** Error: The operation cannot be performed on a database with database snapshots or active DBCC replicas.

The above error is reported as Error #5094. The error message is shown in bold above. Now this error occurred because I had created a database snapshot on the log shopped secondary database which was operating in standby (warm standby) mode. In such an event, you can only perform a log shipping restore on the secondary server after the database snapshot is dropped.

This also applies to the active DBCC replicas as well. Which means a DBCC CHECK being executed on the standby database could also cause a log shipping restore to fail. This issue would never occur if the secondary database was in no recovery mode since a database snapshot cannot be created for this log shipping operating mode.

The Ring_Buffer_Exception reports the error as:

<Error>5094</Error>
<Severity>16</Severity>
<State>2</State>

The idea of putting this blog post together was to ensure that you were aware of this behavior before you created database snapshots on your standby log shipped secondary databases for reporting purposes.

Concurrent backups: What is the big deal?

You are probably wondering why am I writing a blog post on concurrent backups! Well this is a topic that brings up a few questions during my customer visits. Apart from that the biggest reason, I want to talk about concurrent backups is because of backup softwares that you have active in your environments.

A lot of the backup softwares available in the market have the option of scheduling backups of the entire system. This involves backing up of SQL Server databases using the interfaces exposed through sqlvdi.dll. Ain’t your software really smart? However, sometimes the right hand doesn’t really talk to the left hand! It is quite possible that your server admins have scheduled backups of your systems around the same time you have scheduled a maintenance plan to take a backup of the SQL Server instance.

If you initiate two differential or full database backups concurrently, then you will experience a blocking condition as shown in the table below. SPID 53 was taking a VDI full-backup using the SQL Server Backup Simulator tool where as SPID 55 was also taking a native full-backup of the same database. In such a scenario, you will find that the backup which was running first will complete and only then will the second backup will continue. The second concurrent backup will be blocked during the entire duration of the first backup.

spid blocked waittime lastwaittype waitresource cmd dbid
53 0 6302 ASYNC_IO_COMPLETION   BACKUP DATABASE 8
53 0 0 MISCELLANEOUS   BACKUP DATABASE 8
53 0 412 BACKUPBUFFER   BACKUP DATABASE 8
53 0 412 BACKUPIO   BACKUP DATABASE 8
55 53 2386 LCK_M_U DB: 8 [BULKOP_BACKUP_DB] BACKUP DATABASE 8

This issue doesn’t manifest itself when you are you taking a volume snapshot backup of the database. The wait resource will always be DB: <dbid> [BULKOP_BACKUP_DB] with a wait type of LCK_M_U (update lock) for the second session that is running the concurrent backup operation. Concurrent log backups are allowed though starting from SQL Server 2005.

One of the situations where this can pose to be problematic is when the application that installed the database creates a maintenance plan during the application setup process. This can cause problems if you have concurrent full or differential backups scheduled from another software during the same time when your backup maintenance plan is running. Another big pitfall of concurrent backups running is that sometimes, the same job is responsible for both full and log backups of the database. If your log backup job doesn’t kick-off because your database backup is still being blocked, then you could run into a bloated transaction log issue for a database in full or bulk-logged recovery model. My colleague, Arvind, has also blogged about the same issue a few years back.

The following T-SQL script can help you determine if you have overlapping backups being taken for a SQL Server 2012 function. This script can also work for releases lower than SQL Server 2012 if you remove the LAG function as that T-SQL enhancement was introduced in SQL Server 2012.

select backup_start_date, backup_finish_date,

LAG(backup_start_date, 1) OVER (PARTITION BY database_name ORDER BY backup_start_date) as previous_start_date,

CASE datediff(mi,(LAG(backup_start_date, 1) OVER (PARTITION BY database_name ORDER BY backup_start_date)),backup_start_date) WHEN 0 THEN 'OVERLAPPING' ELSE '-' END as [Status] ,

datediff(mi,backup_start_date,backup_finish_date) as backup_time_mins, type

from msdb..backupset

where database_name = '<database name>' -- Replace with appropriate database name

and type <> 'L'

order by backup_start_date