I had recently written a Wiki article on my blog for a troubleshooting startup methodology for SQL Server startup failures for Standalone instances. In this article, I shall cover a similar roadmap for troubleshooting database engine startup failures for clustered SQL Server instances. Some of the steps are going to be the same as troubleshooting start-up failures for stand-along instance but there are some minute differences. So here goes…
The first thing that you need to do is to try and bring the SQL Server resource online from the Failover Cluster Administrator in case you are using a Windows Server 2003 cluster or below. If you are using a Windows Server 2008 or above, you would need to use the Failover Cluster Management Snap-in. If the SQL Server resource fails to come online, then check the SQL Server Errorlogs for any errors that were reported.
Locate the SQL Server Errorlog location and check if any errors were observed in the SQL Server Errorlog. This can be done by looking up the –e startup parameter value for the SQL Server instance in question. If you are using SQL Server 2005 and above, then right-click on the Service Name in Configuration Manager –> Properties –> Advanced Tab. Alternatively, you can use a VBScript (or Powershell) to fetch this information. Here is an example for obtaining the Startup Parameters using WMI and VBScript.
Verify that the files pointed to by the –d (master data file location), –e (Errorlog location) and –l (master log file location) actually exist and have the correct file paths. Starting from SQL Server 2005, the Errorlog will have the startup parameters printed in it when the SQL instance starts up. If you find no errors in the SQL Errorlogs, look into the Windows Application Event logs and ensure that there are no errors related to the cluster resource for the SQL instance or the cluster group (applicable for Windows 2003 clusters) that the SQL resource is a part of.
If the Windows Application Event logs, don’t provide any useful errors, then look at any warnings or errors pertaining to the SQL Server resource in the Windows Cluster logs.
Refer the following post to generate the Cluster log on a Windows Server 2008 Failover cluster or higher. For a Windows Server 2003 Cluster, the default location of the cluster logs is: <system driver>:\<windows folder>\Cluster\Cluster.log.
Now try to start the SQL Server instance using command prompt using the sqlservr.exe to check if the service comes online without bringing any user database online using -T3608.
For a default SQL Server instance:
sqlservr.exe –c –m –T3608
For a named instance of SQL Server:
sqlservr.exe –c –m –T3608 –s<instance name>
If the above fails, then the console window should provide you with an error message as to why the SQL instance fails to start. If the above succeeds, then try and start the service using the Configuration Manager.
Note: Make sure that you are not falling prey to a Desktop Heap exhaustion issue.
If you determine that the SQL Server registry entries have an erroneous entry, then you would need to follow the steps mentioned in the KB Article below to rectify the problem (due to Registry Synch operations that happen on a cluster):
You might also find that the SQL Server resource entries in the Cluster Registry hive (HKLM\Cluster\Resources\GUID) are missing. You would then have to re-create the same using the steps mentioned in the KB Article below (applicable for SQL Server 2000 and SQL Server 2005 only):
NOTE: Remember that modifying registry keys incorrectly on a Windows Server can have disastrous effects. Make sure that you take a registry backup before making any registry modifications.
Apart from the following common startup failure root causes, there are an additional few which are applicable to clustered SQL Server instances only (in BOLD below):
The most common root causes for service startup failures are:
- Access denied on the folders for the SQL Server Service Account on the DATA folder which contains the SQL Server system databases or due to some other process (like anti-virus holding a lock on the SQL database files)
- Insufficient permissions granted to the SQL Server Service account.
- File not found errors due to either accidental deletion of files or disk failures.
- System database files having inconsistencies prevent SQL Server startup.
- Password of the service account was changed but not updated on the server that has the SQL instance installed.
- Startup parameters have incorrect file path locations
- The dependent resources for the SQL Server clustered instance are failing to come online. (Eg. Shared disks on which the database files reside on)
- Incorrect (read:messed up) Registry entries in the CLUSTER registry hive for the concerned SQL instance which have been checkpointed already and due to which the instance fails to start.
- Operating system/Resource crunch/Environment related issues on one or more nodes due to which the instance comes online on one node but not on the other.
T-SQL Tuesday: Common SQL Server myths – Series I–Setup and Services
SQL SERVER 2000 CLuster FAILS TO COME ONLINE ON OTHER NODE
SQL SERVER 2005 RESOURCE FAILS TO COME ONLINE ON CLUSTER AFTER CHANGING THE SAN DRIVE TO ADD MORE DISK SPACE
Sql Server 2005 Resource Taking long time to come online on Windows Cluster with Resource Manager Creation Failed reported in Errorlog
Sql Server Resource Fails to come online on Cluster with Error Error Locating Server/Instance Specified [xFFFFFFFF]
SQL Server 2008 service fails to start, after applying Service Pack 1
Versions that this roadmap applies to:
SQL Server 2000, SQL Server 2005, SQL Server 2008, SQL Server 2008 R2