Finding the root cause of Cluster Failovers

We get quite a few requests for root cause analysis of Cluster Failovers. Failovers mostly happen because the IsAlive check fails for the SQL Server resource, after which one of the following two things happens:
1. SQL Server service restarts on the same cluster node
2. SQL Server resource fails over to a member cluster node

So, to look into the possible root causes of a cluster failover, a SQL version of the MPS Reports capture is required from the node on which SQL Server is currently active. Of the data requested by the PSS Engineer, the following files are of utmost importance:
1. All the SQL Server ERRORLOGs
2. Windows Event Logs (System/Application)
3. Cluster Log

In the SQL Server ERRORLOGs, we check for any errors or tell-tale signs that point to why the IsAlive check failed for the SQL Server resource. After that, we look into the cluster log and the Windows Event Logs to correlate the events that occurred on the server around the time of the failover.
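The correlation step can be sketched in a few lines of Python. This is only an illustration, not a PSS tool: it assumes each ERRORLOG line starts with a timestamp in the form "YYYY-MM-DD HH:MM:SS.cc" followed by a source column and the message, and that events from the other logs have already been turned into (time, source, message) tuples in the same time zone. The names parse_errorlog_line and build_timeline are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed ERRORLOG line layout: "YYYY-MM-DD HH:MM:SS.cc <source> <message>"
ERRORLOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{2})\s+(\S+)\s+(.*)$"
)


def parse_errorlog_line(line):
    """Return (time, "ERRORLOG", message), or None if the line has no timestamp."""
    match = ERRORLOG_LINE.match(line.strip())
    if match is None:
        return None
    when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
    return when, "ERRORLOG", match.group(3)


def build_timeline(events, failover_time, window_minutes=5):
    """Keep events within +/- window_minutes of the failover, oldest first.

    events is any iterable of (time, source, message) tuples; all times
    are assumed to be in the same time zone.
    """
    window = timedelta(minutes=window_minutes)
    nearby = [e for e in events if abs(e[0] - failover_time) <= window]
    return sorted(nearby, key=lambda e: e[0])
```

Feeding the parsed ERRORLOG lines together with the cluster log and event log entries into build_timeline gives a single ordered view of what each component reported around the failover.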

The cluster log rolls over, and the SQL Server ERRORLOGs can also roll over very quickly if a job is in place to recycle the log once it reaches a certain size. It is therefore a very good idea to save the cluster log and the SQL Server ERRORLOG(s) right after the failover, so that valuable data from the problem time period is not overwritten.
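As a rough sketch of how the logs could be saved right after a failover, the following Python function copies every file matching a set of name patterns into a new time-stamped folder. The function name archive_logs, the folder naming scheme and the default "ERRORLOG*" pattern are assumptions made for illustration; the actual log locations depend on your instance and Windows version.

```python
import shutil
import time
from pathlib import Path


def archive_logs(log_dir, archive_root, patterns=("ERRORLOG*",)):
    """Copy matching log files from log_dir into a new time-stamped folder.

    Returns the list of copied file paths. The originals are left in place.
    """
    stamp = time.strftime("%Y%m%d_%H%M%S")
    dest = Path(archive_root) / f"failover_{stamp}"
    dest.mkdir(parents=True, exist_ok=True)

    copied = []
    for pattern in patterns:
        for src in sorted(Path(log_dir).glob(pattern)):
            if src.is_file():
                # copy2 also preserves the file's modification time
                shutil.copy2(src, dest / src.name)
                copied.append(dest / src.name)
    return copied
```

The function would be called once for the SQL Server log folder and once for the folder holding the cluster log, passing a pattern that matches the cluster log file name in the second call.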

Sometimes a post-mortem analysis gives us a hypothesis of what happened but doesn’t paint the complete picture, because data from the period when the problem happened is missing. Depending on the nature of the problem, the PSS Engineer might ask you to do one of the following for the next problem occurrence:
1. Capture a lightweight PSSDIAG round the clock with file rollover, so that we can track what sort of events were happening on the SQL instance right before the failover.
2. Or capture a filtered dump of the SQL Server process during the problem period, if there is heavy blocking on the server or if the failover occurred due to memory dump(s) on the server.
3. Or capture a round-the-clock Perfmon log, if there was possible external memory pressure on the server.

NOTE: Keep in mind that the timestamps in the cluster logs are always in GMT. So you need to add/subtract the time difference between your time zone and GMT when analyzing the cluster logs.
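The time adjustment from the note above can be sketched in Python as follows. It assumes the cluster log timestamp looks like "2010/05/01-12:34:56.789" and that you supply your own offset from GMT in hours; the function name cluster_time_to_local is hypothetical, and a fixed offset like this does not account for daylight saving changes.

```python
from datetime import datetime, timedelta, timezone


def cluster_time_to_local(stamp, utc_offset_hours):
    """Convert a GMT cluster log timestamp to a fixed local offset.

    stamp is assumed to look like "2010/05/01-12:34:56.789".
    utc_offset_hours is the local offset from GMT, e.g. 5.5 or -8.
    """
    gmt_time = datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S.%f")
    gmt_time = gmt_time.replace(tzinfo=timezone.utc)
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return gmt_time.astimezone(local_zone)
```

For example, cluster_time_to_local("2010/05/01-12:34:56.789", 5.5) gives 18:04:56 on the same day, which can then be compared directly with ERRORLOG and event log entries recorded in local time.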

