Wiki: SQL Database Engine Startup Failures

The startup failures for SQL Server can be classified into two types:

1. Service startup failures for stand-alone instances

2. Service startup failures for clustered instances

I shall document a road-map for step-by-step troubleshooting of SQL Server startup failures for both stand-alone and clustered instances. By no means is this the only way to troubleshoot startup failures, but for a novice SQL Server user (hopefully you are not attempting the steps below for the first time on a production instance), this is a good starting point for determining the root cause of the startup failure.

Services startup failures for stand-alone instances

If you are using SQL Server 2000, try to start the SQL instance from the SQL Server Service Manager and check whether the service starts. If you are using SQL Server 2005 or above, try to start the service using SQL Server Configuration Manager.

If the service fails to start, locate the SQL Server Errorlog and check whether any errors were reported in it. The Errorlog location can be found by looking up the –e startup parameter value for the SQL Server instance in question. For SQL Server 2005 and above, right-click the service name in Configuration Manager –> Properties –> Advanced tab and inspect the Startup Parameters value. If you want to act smart in front of your co-worker, you can use a VBScript (or PowerShell) script to fetch this information. Here is an example of obtaining the startup parameters using WMI and VBScript.
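A minimal VBScript sketch of that WMI lookup follows. Note that the WMI namespace is version-specific (ComputerManagement for SQL Server 2005, ComputerManagement10 for SQL Server 2008/2008 R2), so adjust it for your build:

```vbscript
' Sketch: list the startup parameters of all SQL Server services on this box via WMI.
' Namespace is version-specific:
'   ComputerManagement   -> SQL Server 2005
'   ComputerManagement10 -> SQL Server 2008 / 2008 R2
Set objWMI = GetObject("winmgmts:\\.\root\Microsoft\SqlServer\ComputerManagement10")
Set colProps = objWMI.ExecQuery( _
    "SELECT * FROM SqlServiceAdvancedProperty " & _
    "WHERE PropertyName = 'STARTUPPARAMETERS'")
For Each objProp In colProps
    WScript.Echo objProp.ServiceName & " : " & objProp.PropertyStrValue
Next
```

Run it with cscript.exe from a command prompt on the server hosting the SQL instance.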

Verify that the files pointed to by the –d (master data file location), –e (Errorlog location) and –l (master log file location) parameters actually exist and that the file paths are correct. Starting with SQL Server 2005, the Errorlog has the startup parameters printed in it when the SQL instance starts up.
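For reference, the startup parameters of a default SQL Server 2008 R2 instance installed to the default location typically look like the following (your drive letters and paths may differ):

```
-dC:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\master.mdf
-eC:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Log\ERRORLOG
-lC:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\mastlog.ldf
```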

Once you have located the current Errorlog (a file named ERRORLOG, with no extension, in the LOG folder), check it for any errors. An example of an error would be:

2010-11-20 07:50:58.250 Server Error: 17113, Severity: 16, State: 1.
2010-11-20 07:50:58.250 Server Error 2(The system cannot find the file specified.) occurred while opening file ‘C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\master.mdf’ to obtain configuration information at startup. An invalid startup option might have caused the error. Verify your startup options, and correct or remove them if necessary.

The above error message clearly indicates that the master.mdf file was not found in the location where it was expected. If your system database files are not consistent, the SQL Server instance will not start up.

Note: If the timestamps in the SQL Server Errorlog file don’t reflect the system time at which you attempted to start the SQL instance, it is highly likely that the path to the SQL Errorlog is incorrect. If so, the SQL Server instance will fail to start and a pretty unhelpful error message will be shown to you. It is always one of these two:

“The service failed to respond in a timely fashion”

OR

“The service failed to start”

The most common root causes for service startup failures are:

  1. Access denied for the SQL Server service account on the DATA folder which contains the SQL Server system databases, or some other process (such as anti-virus software) holding a lock on the SQL database files.
  2. Insufficient permissions granted to the SQL Server Service account.
  3. File not found errors due to either accidental deletion of files or disk failures.
  4. System database files having inconsistencies that prevent SQL Server startup.
  5. Password of the service account was changed but not updated on the server that has the SQL instance installed.
  6. Startup parameters have incorrect file path locations.
  7. The SQL Server service is disabled in Service Control Manager.

Find out whether the SQL Server service account has all the permissions required to start the SQL Server service. This covers the file-system ACLs, the security privileges for the Windows account, and the Windows registry permissions. Refer to the Books Online article which documents all the permissions required by a SQL Server service account.

If that doesn’t help, check the Windows System and Application Event Logs for any errors related to the SQL Server instance or service that is failing to start.

Note: Make sure that you are not falling prey to a Desktop Heap exhaustion issue.

Still stuck!! Wow!! Aren’t we in a soup: no errors in the Errorlog (a very rare scenario) or in the Windows Event Logs.

Then you need to try and start the service using command prompt in console mode. There is yet another Books Online article which explains how to start the SQL Server instance using sqlservr.exe located in the BINN folder.

Example for Default instance:

sqlservr.exe -c -m -T3608

Example for Named instance:

sqlservr.exe -c -m -s <instance name> -T3608

Check whether the command prompt window spews out any helpful error message that points you to the root cause of the issue. Trace flag 3608 prevents SQL Server from starting and recovering any database except master. This helps you determine whether the SQL instance itself starts correctly and the problem lies with one of the other system databases.

During my course of troubleshooting, I have also encountered some pretty weird startup failures. Off the top of my head, here are a few which I could find links to:

By Balmukund Lakhani (Blog)

http://blogs.msdn.com/b/blakhani/archive/2009/11/24/sql-server-2005-express-setup-failure-with-error-an-error-occurred-during-encryption.aspx

The famous (infuriating) TDSSNICLIENT client initialization failures during startup. The SQL Protocols team has documented a bunch of them here:

http://blogs.msdn.com/b/sql_protocols/archive/2006/04/28/585835.aspx

The case of the obnoxious locks value:

https://troubleshootingsql.com/2011/04/01/the-case-of-the-obnoxious-locks-value/

In case you have any more interesting startup failures that you find blog posts about, feel free to leave a comment for this post.

Startup failures for clustered instances will be covered in another blog post.

Versions that this road-map applies to:

SQL Server 2000, SQL Server 2005, SQL Server 2008, SQL Server 2008 R2

Additional references:
T-SQL Tuesday: Common SQL Server myths – Series I–Setup and Services

My Transaction Log is Growing!!!

The transaction log growth issue… What do we do?? A few things NOT TO do when this happens:
1. Delete the LDF file of the database
2. Create a new LDF file if there is no space on the disk drives and the SQL Server ERRORLOG is reporting 900* messages (typically 9002: transaction log full). (Personal opinion: I have seen a lot of situations where this has gone horribly wrong in production environments.)
3. Take the database offline
4. Restart the SQL Server service

Believe me, the above options do get exercised sometimes in production scenarios.

Now that the above steps are avoided, what is to be done? A simple set of steps can be used to shrink the T-LOG of the database in question, as mentioned in the KB articles below. My troubleshooting steps involve the following:
1. Check the current SQL Server ERRORLOG for any 900* error messages related to the T-LOG
2. If you have enough disk space and the database is not in SIMPLE recovery, take a backup of the transaction log so that most of the currently active virtual log files can be freed for reuse
3. Use DBCC SQLPERF (LOGSPACE) to find out the current used percentage of the transaction log for the database in question
4. If it’s SQL Server 2005 or later, check the log_reuse_wait_desc column in the sys.databases catalog view. This will tell you why the database transaction log is not being reused.
5. Find out if there are any open transactions using DBCC OPENTRAN command. KILL the transaction if you can ascertain that this open transaction can be redone or recovered later.
6. Try shrinking the T-LOG using DBCC SHRINKFILE. On SQL Server 2000/2005, if you cannot take a log backup, you can use the BACKUP LOG WITH TRUNCATE_ONLY command, but be advised that this truncates your transaction log (the option was removed in SQL Server 2008). Alternatively, if the database is in SIMPLE recovery, you can try issuing a CHECKPOINT in the database, which allows the inactive portion of the log to be truncated.
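The diagnostic steps above can be sketched in T-SQL as follows. YourDB, the logical log file name YourDB_log, and the backup path are placeholders; run each statement only after you have read the caveats above:

```sql
DBCC SQLPERF (LOGSPACE);              -- step 3: log size and used % for every database

SELECT name, log_reuse_wait_desc      -- step 4 (SQL Server 2005+): why the log
FROM sys.databases                    --         cannot be reused
WHERE name = 'YourDB';

DBCC OPENTRAN ('YourDB');             -- step 5: oldest active transaction, if any

-- step 2: back up the log (FULL/BULK_LOGGED recovery) to free inactive virtual log files
BACKUP LOG YourDB TO DISK = 'D:\Backups\YourDB_log.trn';

DBCC SHRINKFILE (YourDB_log, 1024);   -- step 6: shrink the log file to ~1 GB (size is in MB)
```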

** I know a lot of people frown on the shrink and truncate options but when you are strapped for time, this might be your ONLY option.

If you are truncating the log, then you need to take a full backup and then continue your log backups (if the FULL/BULK_LOGGED recovery model is used), since truncation breaks your log chain. You should always try to find the RCA for the issue (if possible and if you get the chance), because you don’t want the same thing (as Bruce Willis says in Die Hard) to happen to you twice!!
Another important thing is to make sure that your autogrowth considerations are taken into account. The following article should help you decide what needs to be done when setting the autogrowth of a database file: http://support.microsoft.com/kb/315512/en-us

For more information, see the following topics in SQL Server Books Online:
• Shrinking the transaction log
• DBCC SHRINKFILE (Transact-SQL)
• Truncating the transaction log

Useful articles
How to use the DBCC SHRINKFILE statement to shrink the transaction log file in SQL Server 2005
http://support.microsoft.com/kb/907511/en-us
INF: Shrinking the Transaction Log in SQL Server 2000 with DBCC SHRINKFILE
http://support.microsoft.com/kb/272318/en-us
INF: How to Shrink the SQL Server 7.0 Transaction Log
http://support.microsoft.com/kb/256650/en-us
How to stop the transaction log of a SQL Server database from growing unexpectedly
http://support.microsoft.com/kb/873235/en-us
A transaction log grows unexpectedly or becomes full on a computer that is running SQL Server
http://support.microsoft.com/kb/317375/en-us

Common Reasons for T-LOG growth
1. Wrong autogrowth values (Mostly high percentage values)

2. Index rebuild operations for database done infrequently with database in full recovery model or with log shipping/mirroring enabled

3. Something failing in your replication topology

4. BULK insert/update operations

5. Long running transactions

In case the above-mentioned steps and articles do not help you, please contact Microsoft SQL Support.

Wikis: Disk Contention Issues

I had written this post a while back and decided to add a bit of more context w.r.t. Perfmon and other DMVs that can be used to fetch information.

The old recommendation was that Average Disk Queue Length (a Perfmon counter in the Logical/Physical Disk objects) should stay below 2 * the number of spindles backing the disk. So if your disk had 10 spindles, values above 20 would be a cause for concern. But on a SAN or in virtualized environments, this particular counter can be misleading, so what was valid a few years ago has become somewhat of an urban legend. I personally am not a big fan of calling disk performance bad based on this counter alone, without knowing what the underlying physical disk structure looks like.

To get a complete picture of your I/O performance, you might want to check the following Perfmon counters too (keep in mind I/O vendor logging; and if you are using virtual machines, logging I/O calls at the virtualization bus layer can become equally important in some cases):
Physical Disk:
Avg. Disk sec/Write
Avg. Disk sec/Read
Avg. Disk sec/Transfer

Logical Disk:
Avg. Disk sec/Write
Avg. Disk sec/Read
Avg. Disk sec/Transfer

Also, you would want to find out whether a lot of split I/Os (the Split IO/Sec Perfmon counter) are occurring on the server. If so, you most probably have disk-level fragmentation on the drives storing the SQL database files. That doesn’t mean you should blindly start defragmenting the disk drives on which SQL Server database files reside.

For disks on which SQL Server data (MDF, NDF) and log (LDF) files reside, we recommend that these counters (Avg. Disk sec/Read, Write, Transfer) show values below 30 milliseconds, i.e. 0.030, since the counters report seconds, and that they do not show frequent spikes or prolonged response times above that threshold. If they are constantly showing values of 0.5 and above, then we have a serious disk contention issue on our hands: your disk is roughly taking half a second to perform an I/O operation, which on a database server would be a performance bottleneck. However, a low average value with very frequent spikes in these counters during the timeframe of your performance issue is also not a good sign and would require further investigation.

The other counters of interest in Perfmon are:

Disk Bytes/Sec
Disk Read Bytes/Sec
Disk Write Bytes/Sec

You can drill down into the above counters by logical disk or physical disk to determine which disk is receiving the highest amount of I/O activity. As always with performance issues, what counts as a good value *depends* on the throughput of the disks. Depending on how much I/O these disks are receiving and their rated throughput, you will be able to determine whether the disks are capable of handling the load.

The disk contention could be due to these reasons:
1. There is an underlying H/W issue. This could mean that some driver or firmware needs an upgrade (essentially all your firmware/drivers should always be on the latest builds available) or it could be a problem with the SAN configuration or it could be that this particular SAN is not dedicated to SQL Server.
2. OR it could be that the disk contention is due to the queries executing on SQL Server performing an excessive amount of reads because of sub-optimal plans. Sub-optimal plans would be generated if your indexes are out of shape or if your statistics are skewed. This would require a statistics update with a full scan. (Please refer to my statistics update post.) You should be able to find out which process is driving the I/O by looking at the Process performance object in Perfmon and the I/O Data Bytes counter. There is no fixed value for good and bad performance for this counter: the amount of I/O being pushed through should be compared with the response time of the disks and the I/O thresholds which the disk can handle.
3. OR it could be that the disk activity on the disks has increased over a period of time due to increase in data or addition of new databases. In such a case, it would be a good idea to provide a separate disk or storage area for some of the databases as the current activity might be too much for the disk to handle. Sometimes, load balancing for physical storage also helps in alleviating a disk contention issue.
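The statistics update mentioned in point 2 can be done along the lines below (dbo.YourTable is a placeholder table name):

```sql
UPDATE STATISTICS dbo.YourTable WITH FULLSCAN;  -- full-scan statistics update for one table
-- or, sampled but database-wide:
EXEC sp_updatestats;
```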

Also, it would be a good idea to check if you are receiving any stuck/stalled IO warnings for any particular database in the SQL Server ERRORLOGs. (Reference: Stuck Stalled IO post)

What kind of waits would you monitor when looking at wait statistics while troubleshooting disk contention? The most common waits you would see in the sys.sysprocesses or sys.dm_exec_requests output are PAGEIOLATCH_XX waits (XX = EX or SH most commonly) or WRITELOG wait types. Keep in mind that the wait duration for these wait types should be non-zero: it could either keep increasing over a period of time for a particular request, or show non-zero values which repeatedly drop to zero and climb again, which means there are multiple requests waiting on I/O to complete. If user transactions experience WRITELOG wait types for long durations, that has a cascading effect: SQL Server follows the Write-Ahead Logging (WAL) protocol, which means the log is written in sequential order, so if one transaction is waiting for a log write operation to complete, other concurrent transactions on the same database will also experience delays.
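On SQL Server 2005 and above, a quick sketch of such a check against sys.dm_exec_requests would be:

```sql
-- Sketch: currently executing requests that are waiting on disk I/O
SELECT r.session_id, r.wait_type, r.wait_time AS wait_time_ms,
       r.last_wait_type, DB_NAME(r.database_id) AS database_name
FROM sys.dm_exec_requests AS r
WHERE r.wait_type LIKE N'PAGEIOLATCH%'
   OR r.wait_type = N'WRITELOG';
```

Sample this query a few times over the problem period to see whether the wait times keep climbing for the same requests or keep resetting across many requests.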

You can identify the queries performing a high amount of I/O by collecting Profiler traces and tracking the reads/writes done by the statements and batches executing on the server during that time. You only need the Statement and Batch Completed events if you are not encountering any attentions. Refer to my post on Profiler for the set of Profiler events to be avoided.

Addition: May 5, 2011:

You can also use the query below to find out which database is receiving the maximum amount of I/O and on which databases the I/O stalls are being experienced. This information is retrieved from sys.dm_io_virtual_file_stats.


SELECT DB_NAME (f.database_id) AS database_name, f.database_id, f.name AS logical_file_name, f.[file_id], f.type_desc,
       CAST (CASE WHEN LEFT (LTRIM (f.physical_name), 2) = '\\'
                  THEN LEFT (LTRIM (f.physical_name), CHARINDEX ('\', LTRIM (f.physical_name), CHARINDEX ('\', LTRIM (f.physical_name), 3) + 1) - 1)
                  WHEN CHARINDEX ('\', LTRIM (f.physical_name), 3) > 0
                  THEN UPPER (LEFT (LTRIM (f.physical_name), CHARINDEX ('\', LTRIM (f.physical_name), 3) - 1))
                  ELSE f.physical_name END AS nvarchar(255)) AS logical_disk,
       fs.num_of_reads, fs.num_of_bytes_read, fs.io_stall_read_ms, fs.num_of_writes, fs.num_of_bytes_written,
       fs.io_stall_write_ms, fs.size_on_disk_bytes
FROM sys.dm_io_virtual_file_stats (default, default) AS fs
INNER JOIN sys.master_files AS f ON fs.database_id = f.database_id AND fs.[file_id] = f.[file_id]

*The above query is the same query that is used by Management Data Warehouse to collect a subset of the data using the “Server Activity” collection set.

If you want to drill down further into the current low-level I/O, locking, latching, and access method activity for each partition of a table or index in the database identified by the above query, then you can use the query below to get that information for a particular database, and also identify whether there is a missing index recommendation for that table. This information is retrieved using sys.dm_db_index_operational_stats. Keep in mind that the data returned by sys.dm_db_index_operational_stats exists only as long as the metadata cache object that represents the heap or index is available. This data is neither persistent nor transactionally consistent, which means you cannot use these counters to determine whether an index has been used or not, or when the index was last used.

-- Shows objects which have page_io_latch_wait_count > 0
select db_name(d.database_id) as database_name,
       quotename(object_schema_name(d.object_id, d.database_id)) + N'.' + quotename(object_name(d.object_id, d.database_id)) as object_name,
       d.database_id,
       d.object_id,
       d.page_io_latch_wait_count,
       d.page_io_latch_wait_in_ms,
       d.range_scans,
       d.index_lookups,
       case when mid.database_id is null then 'N' else 'Y' end as missing_index_identified
from (select database_id,
             object_id,
             row_number() over (partition by database_id order by sum(page_io_latch_wait_in_ms) desc) as row_number,
             sum(page_io_latch_wait_count) as page_io_latch_wait_count,
             sum(page_io_latch_wait_in_ms) as page_io_latch_wait_in_ms,
             sum(range_scan_count) as range_scans,
             sum(singleton_lookup_count) as index_lookups
      from sys.dm_db_index_operational_stats(NULL, NULL, NULL, NULL)
      where page_io_latch_wait_count > 0
      group by database_id, object_id) as d
left join (select distinct database_id, object_id from sys.dm_db_missing_index_details) as mid
       on mid.database_id = d.database_id and mid.object_id = d.object_id
where d.row_number <= 20
and d.database_id = DB_ID('AdventureWorks') -- Replace database name with the database that you are interested in
-- Comment out the line below if you want to look at I/O for system tables as well
and d.object_id not in (select object_id from sys.objects where is_ms_shipped = 1)

*The query above is a variation of the query that is used by the “IO Statistics” report present in the SQL Server Performance Dashboard.

If you are interested in index usage statistics for that database, you can get those from sys.dm_db_index_usage_stats.

Example:


select OBJECT_NAME(object_id, database_id) as ObjectName, object_id, index_id, user_seeks, user_scans, user_lookups, user_updates,
       last_user_seek, last_user_scan, last_user_lookup, last_user_update
from sys.dm_db_index_usage_stats
where database_id = DB_ID('AdventureWorks') -- Replace database name with the database that you are interested in
and (user_lookups <> 0 or user_scans <> 0 or user_seeks <> 0)
order by user_scans desc

Related Articles

Sometimes, it’s also worthwhile checking the activity on tempdb and finding out if the contention is due to tempdb activity. The links below could be helpful:

Troubleshooting Log Shipping Issues

Log Shipping is a feature in SQL Server by which you can ship transaction log backups to a different server and restore the backups onto a standby database for disaster recovery or reporting purposes.

One of the major fundamental differences between SQL Server 2000 and SQL Server 2005 log shipping is that SQL Server 2005 uses linked servers to communicate between the primary and the monitor, and between the secondary and the monitor. When you have a remote monitor server, the log shipping jobs update information about the backup, copy and restore jobs via linked server queries.

So, if you find that your log shipping reports are defunct, then two additional things that you can do apart from the basic log shipping troubleshooting steps are:

1. Check if remote connections are enabled on the primary, secondary and monitor. For SQL Server 2000, check connectivity between the instances.

2. Check whether a linked server using the same authentication settings as your log shipping setup can be created between the primary and the monitor, or the secondary and the monitor, depending on which part of your log shipping setup is broken. This again applies to SQL Server 2005.
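A quick T-SQL sketch of these two checks (MYMONITORSERVER is a placeholder linked server name pointing to the monitor instance):

```sql
EXEC sp_configure 'remote access';          -- run_value = 1 means remote connections are enabled

-- Test connectivity through the linked server (raises an error on failure):
EXEC sp_testlinkedserver N'MYMONITORSERVER';
```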

Basic Log Shipping Troubleshooting Steps

1. Look into the SQL Server ERRORLOGs, Application/System Event Logs for any errors related to log shipping. Check the job history of the log shipping Backup/Copy/Restore jobs for any errors.

2. Check the Backup, Copy & Restore job history details for any errors.

3. If you are using SQL Server 2005, check what details are being displayed in the Log Shipping Report under Standard Reports in Management Studio.

4. If you want to take your troubleshooting to the next level, you can even look into the log shipping metadata by querying the log shipping tables on the primary/secondary/monitor (if configured).

Addendum: April 26th, 2011

The log shipping configuration information can be found using the following methods:

1. Standard Reports – Transaction Log Shipping Status (right click Server Name in Management Studio-> Reports -> Standard Reports -> Transaction Log Shipping Status)
2. Use the Stored Procedure which is called by the above report: EXEC master..sp_help_log_shipping_monitor

Query to check log shipping job errors using the MSDB log shipping system tables


--List of Log Shipping jobs
SELECT * FROM msdb.dbo.sysjobs WHERE category_id = 6
SELECT * FROM [msdb].[dbo].[sysjobhistory]
where [message] like '%Operating system error%'
order by [run_date] , [run_time]

SELECT * FROM [msdb].[dbo].[log_shipping_monitor_error_detail]
where [message] like '%Operating system error%'

SELECT * FROM [msdb].[dbo].[restorehistory] 

Known issues with Log Shipping

1. You might find that the last backed up/copied/restored files do not reflect correctly in the log shipping reports when you use a remote monitor server. In such a scenario, check if the following issue documented in the blog post below is applicable in your case:
http://blogs.msdn.com/b/sqlserverfaq/archive/2009/03/27/transaction-log-shipping-status-report-for-monitor-server-will-not-pull-up-information-if-alias-is-used-for-monitor-server.aspx

The last copied and last restored file will show up as NULL if the monitor instance is not on the same box as the secondary instance. The last backed up file will show up as NULL if the monitor instance is not on the same box as the primary instance, when the SELECT @@SERVERNAME value was not used as the monitor server name while configuring the log shipping monitor.

2. If “REMOTE ACCESS” (sp_configure will show if it is enabled or not) is not enabled or the LOG SHIPPING LINKED SERVER (to the monitor server) is not working for the primary and secondary servers, then last backup file/last copy file/last restored file information will not get populated if a remote monitor server instance is being used. The easiest way to identify this issue would be to capture a profiler trace (on primary instance when the backup job is running and on the secondary instance when the copy/restore job is running). The profiler trace will report errors if an update operation pertaining to the log shipping monitor tables fails provided all “Errors and Warnings” profiler events are captured.

3. Another issue that you could run into while using Log Shipping is Orphaned Users if you have Database Users on the Primary Database mapped to SQL Authenticated Logins. This happens because the SIDs of the SQL Authenticated Users on the Primary and Secondary instance would be different. I documented the workaround to this issue in the following blog post: http://blogs.msdn.com/b/sqlserverfaq/archive/2009/04/13/orphaned-users-with-database-mirroring-and-log-shipping.aspx

4. When you are scripting out an existing log shipping configuration, ensure that you have Cumulative Update Package 9 for SQL Server 2005 Service Pack 2 applied to Management Studio. If that is already done, then use one of the options mentioned in the More Information section of the KB article below:
955693 FIX: In SQL Server 2005, the file information about the transaction log that was last copied and the file information about the transaction log that was last restored are missing
http://support.microsoft.com/default.aspx?scid=kb;EN-US;955693

5. If you have configured log shipping in STANDBY mode on SQL Server 2008 and the destination folder for the T-LOG backups is on a remote server (on which the SQL Server/SQL Agent service account is not a local administrator), then the restore job will fail every time with the following error:

2008-12-12 14:44:58.53 *** Error: During startup of warm standby database ‘testdb’ (database ID 7), its standby file (‘<UNC path of the TUF file>’) was inaccessible to the RESTORE statement. The operating system error was ‘5(Access is denied.)’.

TUF = Transaction Undo File, which is required for applying the next T-LOG backup. This issue is fixed in the cumulative update mentioned in the KB article below:

FIX: Error message when you use log shipping in SQL Server 2008: “During startup of warm standby database ‘<Database Name>’ (database ID <N>), its standby file (‘<File Name>’) was inaccessible to the RESTORE statement”
http://support.microsoft.com/kb/962008

6. Log shipping restore will fail if there is a snapshot or an active DBCC replica on the secondary database on which the restore is being done.
https://troubleshootingsql.com/2012/09/12/why-did-the-restore-fail-on-the-log-shipped-secondary-database/

Addition: September 12, 2012

Special cases
In case you need to speed up the transaction log restore for your log shipping secondary database in standby mode, then follow the steps mentioned in this post.

In case you need to move your secondary log shipped database files to a new physical location, then you can use the steps mentioned in this post.

Finding out root cause for Cluster Failovers

We do get quite a few issues regarding root cause analysis for cluster failovers. Failovers mostly happen due to the IsAlive check failing for the SQL Server resource, after which one of the following happens:
1. SQL Server service restarts on the same cluster node
2. SQL Server resource fails over to a member cluster node

So, to look into the possible root causes of a cluster failover, a SQL version of the MPS Reports capture is required from the node on which SQL was active at the time. From the data requested by the PSS engineer, the following files are of utmost importance:
1. All the SQL Server ERRORLOGs
2. Windows Event Logs (System/Application)
3. Cluster Log

Based on the SQL Server ERRORLOGs, we check for any errors or tell-tale signs pointing to why the IsAlive check failed for the SQL Server resource. After that, we look into the cluster log and the Windows Event Logs to correlate events around the failover time on the server.

Since the cluster log rolls over, and the SQL Server ERRORLOGs can also roll over very quickly if a job is in place to recycle them after a certain size, it is a very good idea to save the cluster log and the SQL Server ERRORLOG(s) right after the failover, to prevent them from rolling over and overwriting valuable data from the problem time period.

Sometimes a post-mortem analysis provides a hypothesis of what happened but doesn’t paint the complete picture, due to lack of data from the period when the problem happened. Based on the nature of the problem, the PSS engineer might ask you to do the following for the next problem occurrence:
1. Capture a light-weight PSSDIAG round the clock with file rollover so that we can track what sort of events were happening on the SQL instance right before the failover.
2. Or capture a filtered dump of the SQL process during the problem period if there is heavy blocking on the server, or if the failover occurred due to memory dump(s) being generated on the server
3. OR a round the clock Perfmon log if there were possible external memory pressure on the server.

NOTE: Keep in mind that cluster log timestamps are always in GMT. So you need to add/subtract the time difference between your time zone and GMT when analyzing the cluster logs.