Failover | TroubleshootingSQL

Over the past month or so, I have been dealing with a lot of questions around the troubleshooting failover scenarios for Availability Groups. So I decided that it is now time for me to pen down a post on the data to be collected and analysis options for digging into the root cause for an Availability Group. I did have time on my hands and decided to induce a Hollywood element into this post as well. The availability group name that I would be using in this post is named as Fluffy. Fluffy has two secondary Availability Replicas: one synchronous and the other one an asynchronous replica.

As you can see in the screenshot below, I had initiated a failover for my Availability Group and the AlwaysOn
Extended Events sessions shows a state change. The Extended Events session writes to a target file (.xel) which is present in the SQL Server LOG folder.

The Extended Event session runs by default when an Availability Group is configured on the SQL Server instance. The following extended events are captured by the Event Session:

sqlserver.alwayson_ddl_executed,
sqlserver.availability_group_lease_expired,
sqlserver.availability_replica_automatic_failover_validation,
sqlserver.availability_replica_manager_state_change,
sqlserver.availability_replica_state_change,
sqlserver.error_reported

Note that the Extended Events session will only track the state changes for the local replica. The Extended Events session is NOT a global store for all the state change events for all replicas!

The previous set of logs that you collect from the SQL Server failover cluster instances like the SQL Errorlog, Cluster log and Windows Event logs are still applicable for root cause analysis for failovers. However, now you have additional logs in the SQL Server LOG folder which can assist with a root cause analysis for failover issues. The screenshot below shows two new files that would be of interest when analyzing SQL Server failovers namely, the AlwaysOn_health_* and <server name>_<instance name>_SQLDIAG_* logs. The first set of files are the AlwaysOn Extended Events logs and the second set of logs are called the Failover Cluster Instance Diagnostics Log.

We already saw from the above screenshot what the AlwaysOn Extended Events health session can track. Now, let’s see what the Failover Cluster Instance Diagnostics Log collects. There will be multiple informational messages about the activities performed against the Availability Group. Additionally, there will be messages pertaining to the sp_server_diagnostics data (component_health_resultset) collection and the Availability Group state change (availability_group_state_change).

The T-SQL query below can help you fetch the state change information for your SQL Server instance. Again, this is specific to the instance from which you fetched the failover cluster instance diagnostics log:

select object_name,cast(event_data as xml) as xmldata
from sys.fn_xe_file_target_read_file('<file name/path>', null, null, null)
where object_name = 'availability_group_state_change'

A snippet of the XML data retrieved using the above query for the manual failover that I had done is shown below:

<data name=”target_state“>
<value>2</value>
<text>Online</text>
</data>
<data name=”failure condition level“>
<value>3</value>
<text >SYSTEM_UNHEALTHY</text>
</data>
…
…
<data name=”availability_group_name”>
<value>FLUFFY</value>
</data>
…
</event>

In summary, the following sets of logs need to be collected from all the Availability Replicas:

SQL Server Errorlog from the time of the failure
Windows Application and System Event logs from the time of the failure
All the Failover Cluster Instance Diagnostics log (upto a maximum of 10 rollover .xel files by default)
All the AlwaysOn Extended Event session log files (upto a maximum of 4 rollover .xel files by default)
System Health Session Extended Event session files (optional as the component health state information is present in #4)
Windows Cluster log

There are some useful queries in the Books Online topic for the failover cluster instance diagnostics log to parsing through the collected data.

Happy troubleshooting!!

P.S. The above blog post was created using a lab environment provided by SQL Server Virtual Labs. This is an online environment which allows you to create virtual machines to practice various SQL Server scenarios. The lab that I used was “SQL Server 2012: AlwaysOn Availability Groups (SQL 142)“.

I have heard this question multiple times:

How do I generate a condition which will cause an Automatic Failover of a SQL Server Mirrored database?

First of all, you would need to setup mirroring under Full Safety mode with a Witness instance.

The only condition that triggers an automatic database failover would be if the Mirror and Witness instance are unable to connect to the Principal server instance.

Figure: Synchronized database mirroring session

As per the TechNet article, Automatic Failover would be under the following condition:

When safety is FULL, if neither the mirror nor the witness can see the principal, the mirror server can form a quorum with the witness and change its role to become a new principal server, assuming a the mirroring session was SYNCHRONIZED when the principal went away.

So, the condition above is self explanatory.

How can this be done?

1. If your Witness and Mirror are on different boxes or on the same box and the Principal instance is on a different box. Then:

Stop the Database Mirroring endpoint using the following command on the Principal instance: ALTER ENDPOINT <mirroring endpoint name> STATE = STOPPED. Note: This would put the database mirroring into a disconnected state. To bring database mirroring out of disconnected state, you need to start the endpoint using the following command: ALTER ENDPOINT <mirroring endpoint name> STATE = STARTED
Plug out the network cable for the Principal box provided the Principal and (Mirror & Witness) instances are on two different boxes.
Block all TCP traffic on the port that the server has established the Mirroring session.
Pause the Principal Instance service (Have not tried this one yet but theoretically it should work!)

2. Attach a Visual Studio Debugger or Windows Debugger to the Principal Server instance. Do not hit GO to continue the execution of the process being debugged.

THE ABOVE METHODS SHOULD ONLY BE TRIED ON A TEST ENVIRONMENT. THESE ARE NOT SOMETHING THAT SHOULD BE ATTEMTPED ON A PRODUCTION BOX.

This is just an informational article for simulating automatic database failover for testing any alert jobs or any other business logic that you need to kick off during an automatic failover.

What messages would be present in the SQL Server Errorlogs?

When the Automatic Failover is initiated, it would have the following messages printed in the SQL Server Errorlog of the Mirror instance:

2010-04-19 22:08:03.590 spid15s      Error: 1479, Severity: 16, State: 1.
2010-04-19 22:08:03.590 spid15s      The mirroring connection to “TCP://<FQDN>:5022” has timed out for database “dbMirror” after 10 seconds without a response. Check the service and network connections.
2010-04-19 22:08:03.590 spid15s      Database mirroring is inactive for database ‘dbMirror’. This is an informational message only. No user action is required.
2010-04-19 22:08:03.590 spid15s      The mirrored database “dbMirror” is changing roles from “MIRROR” to “PRINCIPAL” due to Auto Failover.
2010-04-19 22:08:03.720 spid15s      Recovery is writing a checkpoint in database ‘dbMirror’ (15). This is an informational message only. No user action is required.

The Witness instance would have the following messages in the Errorlog:

2010-04-19 22:08:03.360 spid25s Error: 1479, Severity: 16, State: 1.
2010-04-19 22:08:03.360 spid25s The mirroring connection to “TCP://<FQDN>:5022” has timed out for database “dbMirror” after 10 seconds without a response. Check the service and network connections.

On the Principal server, the system event logs might report that the SQL Server service is unresponsive based on the Service Timeout value configured on the machine (Default is 30 seconds):

Log Name:      System
Source:        Service Control Manager
Date:          4/19/2010 10:04:02 PM
Event ID:      7011
Task Category: None
Level:         Error
Keywords:      Classic
Description:
A timeout (30000 milliseconds) was reached while waiting for a transaction response from the MSSQLSERVER service.

After the Principal instance comes back online, the following messages would be printed in it’s Errorlog:

2010-04-19 22:28:19.190 spid26s      Error: 1474, Severity: 16, State: 1.
2010-04-19 22:28:19.190 spid26s      Database mirroring connection error 4 ‘An error occurred while receiving data: ’64(The specified network name is no longer available.)’.’ for ‘TCP://<FQDN>:5023’.
2010-04-19 22:28:19.200 spid27s      Error: 1474, Severity: 16, State: 1.
2010-04-19 22:28:19.200 spid27s      Database mirroring connection error 4 ‘An error occurred while receiving data: ’64(The specified network name is no longer available.)’.’ for ‘TCP://<FQDN>:5024’.
2010-04-19 22:28:19.360 spid20s      Database mirroring is inactive for database ‘dbMirror’. This is an informational message only. No user action is required.
2010-04-19 22:28:19.490 spid35s      Starting up database ‘dbMirror’.
2010-04-19 22:28:19.530 spid20s      The mirrored database “dbMirror” is changing roles from “PRINCIPAL” to “MIRROR” due to Role Syncronization.
2010-04-19 22:28:19.590 spid35s      Bypassing recovery for database ‘dbMirror’ because it is marked as a mirror database, which cannot be recovered. This is an informational message only. No user action is required.
2010-04-19 22:28:19.810 spid20s      Database mirroring is active with database ‘dbMirror’ as the mirror copy. This is an informational message only. No user action is required.
2010-04-19 22:28:45.440 spid20s      Database mirroring is inactive for database ‘dbMirror’. This is an informational message only. No user action is required.

If you notice above, you will see that even though the Principal was not available during the failover, it assumed the role of the Mirror after it came back online. This Role Synchronization is what keeps the Principal and the Mirror database status updated to prevent Split Brain scenarios.

References:

SQL Server Best Practices Article

Database Mirroring Best Practices and Performance

TroubleshootingSQL

Explaining the bits and bytes of data in an AI world

Tag Archives: Failover

Fluffy in an Availability Group Failover Scenario

How to generate an Automatic Failover for a Mirrored Database

TroubleshootingSQL

Explaining the bits and bytes of data in an AI world

Share this post:

Share this post: