Wiki: SQL Database Engine Startup Failures for Clusters

I had recently written a Wiki article on my blog for a troubleshooting startup methodology for SQL Server startup failures for Standalone instances. In this article, I shall cover a similar roadmap for troubleshooting database engine startup failures for clustered SQL Server instances. Some of the steps are going to be the same as troubleshooting start-up failures for stand-along instance but there are some minute differences. So here goes…

The first thing that you need to do is to try and bring the SQL Server resource online from the Failover Cluster Administrator in case you are using a Windows Server 2003 cluster or below. If you are using a Windows Server 2008 or above, you would need to use the Failover Cluster Management Snap-in. If the SQL Server resource fails to come online, then check the SQL Server Errorlogs for any errors that were reported.

Locate the SQL Server Errorlog location and check if any errors were observed in the SQL Server Errorlog. This can be done by looking up the –e startup parameter value for the SQL Server instance in question. If you are using SQL Server 2005 and above, then right-click on the Service Name in Configuration Manager –> Properties –> Advanced Tab. Alternatively, you can use a VBScript (or Powershell) to fetch this information. Here is an example for obtaining the Startup Parameters using WMI and VBScript.

Verify that the files pointed to by the –d (master data file location), –e (Errorlog location) and –l (master log file location) actually exist and have the correct file paths. Starting from SQL Server 2005, the Errorlog will have the startup parameters printed in it when the SQL instance starts up. If you find no errors in the SQL Errorlogs, look into the Windows Application Event logs and ensure that there are no errors related to the cluster resource for the SQL instance or the cluster group (applicable for Windows 2003 clusters) that the SQL resource is a part of.

If the Windows Application Event logs, don’t provide any useful errors, then look at any warnings or errors pertaining to the SQL Server resource in the Windows Cluster logs.

Refer the following post to generate the Cluster log on a Windows Server 2008 Failover cluster or higher. For a Windows Server 2003 Cluster, the default location of the cluster logs is: <system driver>:\<windows folder>\Cluster\Cluster.log.

Now try to start the SQL Server instance using command prompt using the sqlservr.exe to check if the service comes online without bringing any user database online using -T3608.

For a default SQL Server instance:

sqlservr.exe –c –m –T3608

For a named instance of SQL Server:

sqlservr.exe –c –m –T3608 –s<instance name>

If the above fails, then the console window should provide you with an error message as to why the SQL instance fails to start. If the above succeeds, then try and start the service using the Configuration Manager.

Note: Make sure that you are not falling prey to a Desktop Heap exhaustion issue.

If you determine that the SQL Server registry entries have an erroneous entry, then you would need to follow the steps mentioned in the KB Article below to rectify the problem (due to Registry Synch operations that happen on a cluster):
http://support.microsoft.com/kb/953504

You might also find that the SQL Server resource entries in the Cluster Registry hive (HKLM\Cluster\Resources\GUID) are missing. You would then have to re-create the same using the steps mentioned in the KB Article below (applicable for SQL Server 2000 and SQL Server 2005 only):
http://support.microsoft.com/kb/810056

NOTE: Remember that modifying registry keys incorrectly on a Windows Server can have disastrous effects. Make sure that you take a registry backup before making any registry modifications.

Apart from the following common startup failure root causes, there are an additional few which are applicable to clustered SQL Server instances only (in BOLD below):

The most common root causes for service startup failures are:

  1. Access denied on the folders for the SQL Server Service Account on the DATA folder which contains the SQL Server system databases or due to some other process (like anti-virus holding a lock on the SQL database files)
  2. Insufficient permissions granted to the SQL Server Service account.
  3. File not found errors due to either accidental deletion of files or disk failures.
  4. System database files having inconsistencies prevent SQL Server startup.
  5. Password of the service account was changed but not updated on the server that has the SQL instance installed.
  6. Startup parameters have incorrect file path locations
  7. The dependent resources for the SQL Server clustered instance are failing to come online. (Eg. Shared disks on which the database files reside on)
  8. Incorrect (read:messed up) Registry entries in the CLUSTER registry hive for the concerned SQL instance which have been checkpointed already and due to which the instance fails to start.
  9. Operating system/Resource crunch/Environment related issues on one or more nodes due to which the instance comes online on one node but not on the other.

Additional references:
T-SQL Tuesday: Common SQL Server myths – Series I–Setup and Services
SQL SERVER 2000 CLuster FAILS TO COME ONLINE ON OTHER NODE
SQL SERVER 2005 RESOURCE FAILS TO COME ONLINE ON CLUSTER AFTER CHANGING THE SAN DRIVE TO ADD MORE DISK SPACE
Sql Server 2005 Resource Taking long time to come online on Windows Cluster with Resource Manager Creation Failed reported in Errorlog
Sql Server Resource Fails to come online on Cluster with Error Error Locating Server/Instance Specified [xFFFFFFFF]
SQL Server 2008 service fails to start, after applying Service Pack 1

Versions that this roadmap applies to:
SQL Server 2000, SQL Server 2005, SQL Server 2008, SQL Server 2008 R2

Wiki: SQL Database Engine Startup Failures

The startup failures for SQL Server can be classified into two types:

1. Service startup failures for stand-alone instances

2. Services startup failures for clustered instances

I shall document a road-map for a step-by-step troubleshooter for handling SQL Server startup failures for both stand-alone and clustered instances. By no means, is this the only way to troubleshoot startup failures but for a novice SQL Server user (hopefully, you don’t attempt the steps below for the first time on a production instance), this would be a good starting point in determining the root cause of the startup failures.

Services startup failures for stand-alone instances

If you are using SQL Server 2000, then try and start the SQL instance from the Services Manager and check if the service starts up. If you are using SQL Server 2005 or above, then try and start the service using SQL Server Configuration Manager.SQL Server Startup Paramaters

If the service fails to start, locate the SQL Server Errorlog location and check if any errors were observed in the SQL Server Errorlog. This can be done by looking up the –e startup parameter value for the SQL Server instance in question. If you are using SQL Server 2005 and above, then right-click on the Service Name in Configuration Manager –> Properties –> Advanced Tab (Refer Pic for more details). If you want to act smart in front of your co-worker Smile with tongue out, then you can use a VBScript (or Powershell) to fetch this information. Here is an example for obtaining the Startup Parameters using WMI and VBScript.

Verify that the files pointed to by the –d (master data file location), –e (Errorlog location) and –l (master log file location) actually exist and have the correct file paths. Starting from SQL Server 2005, the Errorlog will have the startup parameters printed in it when the SQL instance starts up.

Once you have located the current ERRORLOG (file with the same name will be present by the same name in the LOG folder), check for any errors present in the SQL Server Errorlog. An example of an error would be:

2010-11-20 07:50:58.250 Server Error: 17113, Severity: 16, State: 1.
2010-11-20 07:50:58.250 Server Error 2(The system cannot find the file specified.) occurred while opening file ‘C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\master.mdf’ to obtain configuration information at startup. An invalid startup option might have caused the error. Verify your startup options, and correct or remove them if necessary.

The above error message clearly indicates that the master.mdf file was not found in the location that it was expected to be found. If your system database files are not consistent, the SQL Server instance will not startup.

Note: If you see that the timestamp in the SQL Server Errorlog file doesn’t reflect the current system time when you attempted to start the SQL instance, then it is highly likely that the path to the SQL Errorlog is incorrect. If that is true, then the SQL Server instance will fail to start and a pretty unhelpful error message will be shown to you. It is always one of the two:

“The service failed to respond in a timely fashion”

OR

“The service failed to start”

The most common root causes for service startup failures are:

  1. Access denied on the folders for the SQL Server Service Account on the DATA folder which contains the SQL Server system databases or due to some other process (like anti-virus holding a lock on the SQL database files)
  2. Insufficient permissions granted to the SQL Server Service account.
  3. File not found errors due to either accidental deletion of files or disk failures.
  4. System database files having inconsistencies prevent SQL Server startup.
  5. Password of the service account was changed but not updated on the server that has the SQL instance installed.
  6. Startup parameters have incorrect file path locations.
  7. The SQL Server service is disabled in Service Control Manager.

Find out if the SQL Server service account has all the necessary permissions required to start the SQL Server service. This would cover the ACLs, Security Privileges for the Windows Account, and Windows Registry permissions. Refer the Books Online article which documents all the permissions required by a SQL Server service account.

Even if that doesn’t help, then check the Windows System and Application Event logs and check for any errors related to the SQL Server instance or service that is currently failing to start.

Note: Make sure that you are not falling prey to a Desktop Heap exhaustion issue.

Still stuck!! Wow!! Aren’t we in a soup. No errors in the Errorlog (very very rare case scenario) or in the Windows Event Logs.

Then you need to try and start the service using command prompt in console mode. There is yet another Books Online article which explains how to start the SQL Server instance using sqlservr.exe located in the BINN folder.

Example for Default instance:

sqlservr.exe –c –m -T3608

Example for Named instance:

sqlservr.exe –c –m –s <instance name> -T3608

Check if the command prompt window spews out any helpful error message which will point you to the root cause of the issue. The Trace Flag 3608 prevents any database startup apart from the master database. This will help you determine if the SQL instance is starting up correctly and the problem might lie with the other system databases.

During my course of troubleshooting, I have also encountered some pretty weird startup failures. Of the top of my head, here are a few which I could find links to:

By Balmukund Lakhani (Blog)

http://blogs.msdn.com/b/blakhani/archive/2009/11/24/sql-server-2005-express-setup-failure-with-error-an-error-occurred-during-encryption.aspx

The famous (infuriating) TDSSNICLIENT client initialization failures during startup. The SQL Protocols team has documented a bunch of them here:

http://blogs.msdn.com/b/sql_protocols/archive/2006/04/28/585835.aspx

The case of the obnoxious locks value:

https://troubleshootingsql.com/2011/04/01/the-case-of-the-obnoxious-locks-value/

In case you have any more interesting startup failures that you find blog posts about, feel free to leave a comment for this post.

Startup Failures for Clustered Instances in another blog post.

Versions that this road-map applies to:

SQL Server 2000, SQL Server 2005, SQL Server 2008, SQL Server 2008 R2

Additional references:
T-SQL Tuesday: Common SQL Server myths – Series I–Setup and Services

How to fetch all Login Failed messages from default traces

I was recently investigating a login failed issue on a SQL instance and required to quickly find out all the Login Failed events that had occurred on the server. I could have scoured the SQL Errorlogs but I didn’t want to do this due to the bloated size of the Errorlogs. An easier way would be to read the Default Traces (available from SQL Server 2005 and above) and look for all the Audit Login Failed events (event class = 20) using fn_trace_gettable function.

The query that can be used to get this information is given below:


declare @path varchar(1000)

select @path = path from sys.traces where id = 1

select trc.Spid, TextData, State, Error, StartTime, EventSubClass, LoginName, NTDomainName, NTUserName, ClientProcessID, HostName, ApplicationName

from fn_trace_gettable(@path,DEFAULT) trc

where EventClass = 20

order by StartTime desc

Approach to SQL Performance issues

Performance degradation can happen due to multiple reasons. The main bottlenecks that would affect performance are:

1. Memory: Physical or Virtual – Your server could be experiencing a physical crunch which would point us to the Available Physical RAM or a crunch on the Virtual Address Space. This would involve finding out the queries which are consuming high memory during compilation or execution and narrow down on the single biggest contributor and work towards rectifying the problem with the query.

2. Disk IO: There could be an issue with SQL Server queries running slowly due to large amount of IO being done by them or a disk performance issue.

3. CPU: It could be because the queries executing on your server are consuming high amount of CPU time. In such a case, it be statistics that are out-of-date or data that has increased which introduces data skew.

4. Blocking: The queries executing on the server could be blocked and waiting for a particular resource and thus increasing the query duration.

How to perform post-mortem analysis for SQL performance problems?

If you are facing a performance degradation on the SQL instance after an upgrade, we would need to analyze the PSSDIAG collected to find out the top bottlenecks. There are two ways to do this:

1. Comparison Analysis: You identify the a query batch or stored procedure which is a problem and provide us a baseline (w.r.t. normal execution time for the query in question). We capture a “fast” and a “slow” run and we perform a comparison between the two runs and we tell you what is biggest bottleneck: Memory, Disk, CPU or Database/Environment Settings like database statistics, query SET options, difference in parameters etc. Based on what is identified as the culprit, you will be able to identify/suggest action items to help remove the culprit.

2. Bottleneck Analysis: In case you do not have a baseline, you can perform a bottleneck analysis and then stack rank them for you in case there is more than one bottleneck. Based on the analysis, you will be able to identify the biggest bottleneck and can implement action items to mitigate the single largest bottleneck.

If you need to perform live troubleshooting on the server, then you can make use of DMVs if you are on SQL Server 2005 or above or use Performance Dashboard (mentioned below).

RML Utilities Download:

http://www.microsoft.com/downloads/details.aspx?FamilyId=7EDFA95A-A32F-440F-A3A8-5160C8DBE926&displaylang=en

More Info on how to use the RML Utilities:

http://blogs.msdn.com/psssql/archive/2007/12/18/rml-utilities-for-microsoft-sql-server-released.aspx

http://sqlcat.com/technicalnotes/archive/2008/02/01/precision-performance-for-microsoft-sql-server-using-rml-utilities-9-0.aspx

If you need to analyze blocking data on the server, then you can use SQL Nexus for the same.

SQL Nexus Download Link:

http://www.codeplex.com/sqlnexus

Another option would be to use SQL Server 2005 Performance Dashboard in case you are testing your application on SQL Server 2005. This tool lets you view your server status without running PSSDIAG and provides reports to identify Long Running queries and also identify which queries are consuming the highest amount of resources (IO/CPU/Memory).

Performance Dashboard:

http://www.microsoft.com/downloads/details.aspx?FamilyId=1d3a4a0d-7e0c-4730-8204-e419218c1efc

Performance Dashboard for SQL Server 2008

http://blogs.msdn.com/b/sqlserverfaq/archive/2010/05/27/sql-server-performance-dashboard-reports-in-ssms-introduction-install-datediff-error-amp-modified-for-sql-2008.aspx

Did you know about the –m startup parameter

I am sure –m is startup parameter that most number of DBAs have used at one point of time or the other during a disaster recovery scenario involving an issue with system databases or even while moving system database files.

There is a hidden gem about the –m startup parameter which most of us are not aware of or do not use often when required. That is the Client Application Name filter along with the –m parameter. I have gotten frustrated multiple times when I started the SQL instance in single user mode and a sysadmin connection latched onto the only connection (read: SQL Performance monitoring app, CRM apps, or any other 24X7 LOB apps).

If you want to start SQL Server in Single User mode using –m startup parameter, then you can now prevent other applications from connecting to the instance by specifying the restricted Client Application Name which can only connect to the instance. Note that this is a case sensitive parameter. Before specifying the Client Application Name filter, check what the program_name is showing your application name in the sys.dm_exec_sessions DMV output.

If you are doing this for SQL instance maintenance operations, then specify the filter as “SQLCMD” or “Microsoft SQL Server Management Studio – Query” as you should be able to run your scripts from either of the two applications SQLCMD or SSMS.

No more tearing your hair out to figure out who took up the sole connection to the SQL instance when you don’t want that lone wolf to be in demand by other applications.