Wiki: Demystifying SQL Server Exception Dumps

Recently, I had replied to a MSDN post regarding an Access Violation reported by a SQL Server instance. In this post, I shall elucidate on troubleshooting memory dumps generated by SQL Server exceptions and what can be done without the availability of private symbols for debugging.

I had written a post earlier on when a memory dump is generated in SQL Server. If a single Errorlog has multiple stack dumps from multiple exceptions, then the first thing to do is to investigate the first access violation reported in the series of dumps reported. It is also possible that an exception (eg:access violation) could be followed by another exception/assertion or a non-yielding condition which may be a side effect of the earlier access violation.

When SQL encounters an access violation, there will be  SQLDumpXXXX.mdmp, SQLDumpXXXX.txt and SQLDumpXXXX.log files created in the LOG folder associated with the exception. The SQL Errorlog will report the occurrence of the exception as well.

  1. The .mdmp file is the memory dump which would be required by CSS to troubleshoot the issue.
  2. The .txt file is the symptom dump file which contains environment information along with other information depending on the nature of the exception.
  3. The .log file contains an extract from the SQL Errorlog when the exception was encountered. This helps in looking at what messages were reported in the Errorlog prior to the issue. Helps during post-mortem analysis if the current Errorlogs have already been recycled.

There can be two types of exceptions: one that is fatal enough to cause a SQL service termination or one that encounters the exception which terminates the thread which was encountered the issue.

Now what can be done in such a situation without having to look into memory dump file. The first step would be to check if the issue is a recurring behavior due to a particular operation to establish some pattern or if there is not identifiable pattern at all. If you look into the SQL Errorlog, you might be able to identify the input buffer which cause the exception (unless it’s an exception which terminated the SQL service, one such example is shown below).

Server * BEGIN STACK DUMP:
Server * spid 5728
Server * ex_handle_except encountered exception C0000005 – Server terminating

If the above snippet contains the T-SQL query which raised the exception, then you would want to execute the same query from a Management Studio query window and check if the exception re-occurs. If yes, then you have a valid repro scenario.

Example snippet from an Errorlog where the Access Violation didn’t cause a service termination:

*   Exception Address = 0042B866
*   Exception Code    = c0000005 EXCEPTION_ACCESS_VIOLATION
*   Access Violation occurred writing address 67192000
* Input Buffer 38 bytes –
*  exec sp_updatestats

Next check if you can spot any external components loaded into SQL Server address space. I shall revert back to my MSDN post where I had spotted an anti-virus filter DLL loaded in SQL Server process address space.

Server * bcrypt 000007FEFC900000 000007FEFC921FFF 00022000
Server * PCTLsp64 000000000B1E0000 000000000B251FFF 00072000
Server * mswsock 000007FEFC840000 000007FEFC893FFF 00054000

Based on a few Bing searches, I was able to determine that the PCTLsp64 was part of a virus monitoring program from PC Tools. The next step would be to check if this external component can be moved out of SQL Server process address space. The fact that the DLL is loaded in SQL process address space doesn’t make the DLL a culprit. However, when troubleshooting such issues, it is important to eliminate all unknown or third party elements that could prevent us from narrowing down to the culprit. Other DLLs which can be commonly found in SQL Server process address space would be provider DLLs when linked servers are being used on the SQL Server instance. Additionally execution of Extended Stored Procedures (XSPs) could also load external DLLs into SQL Server process address space depending on the way the XSP DLL was coded. An easy way to identifying a XSP dll is searching for the word “using” in the SQL Server Errorlog. SQL Server reports the first instance of a XSP dll being loaded into SQL Server process address space.

Using ‘xpstar.dll’ version ‘2009.100.1600’ to execute extended stored procedure ‘xp_readerrorlog’

If you do find an external component loaded in SQL process address space, check the callstack reported in the Errorlog below the list of modules printed below to determine if the functions of the exception stack are associated with external component that you identified. For the issue reported in the MSDN post, I did find this:

Module(sqlservr+00000000017A954C)
Module(kernel32+0000000000099380)
Module(ntdll+0000000000096228)
Module(ntdll+0000000000014F48)
Module(ntdll+0000000000034F6D)
Module(ntdll+0000000000015B2C)
Module(ntdll+000000000004F638)
Module(ntdll+000000000004C8F4)
Module(ntdll+000000000004C822)
Module(PCTLsp64+00000000000040AD)
Module(PCTLsp64+0000000000004701)
Module(PCTLsp64+0000000000005434)

Now this makes my story a bit more compelling of removing the external component out of SQL Server process address space. Once that is done and if you have a repro condition which causes the issue, then it would be good to re-run your repro and check if the issue is actually resolved. Not all the exceptions are the same even though they are reported in the same manner.

If all the above hasn’t helped in narrowing down the issue, then check if you are on the latest available cumulative update for the version of  SQL Server that you are using. If not, then it might be a good idea to update your instance to the latest available build to ensure that you are not hitting a known issue. If this is possible, then apply the latest cumulative update package and check if the issue re-occurs. I am cognizant of the fact that this might be an easy option due to change management restrictions on production environments. For such a dependency, my recommendation would be engage CSS to assist with a thorough investigation of the root cause of the issue.

As you can see from the steps outline above, you can look into a few things before you need to start poking into the exception memory dump using a Debugging tool.

Happy debugging!

Wiki: SQL Database Engine Startup Failures for Clusters

I had recently written a Wiki article on my blog for a troubleshooting startup methodology for SQL Server startup failures for Standalone instances. In this article, I shall cover a similar roadmap for troubleshooting database engine startup failures for clustered SQL Server instances. Some of the steps are going to be the same as troubleshooting start-up failures for stand-along instance but there are some minute differences. So here goes…

The first thing that you need to do is to try and bring the SQL Server resource online from the Failover Cluster Administrator in case you are using a Windows Server 2003 cluster or below. If you are using a Windows Server 2008 or above, you would need to use the Failover Cluster Management Snap-in. If the SQL Server resource fails to come online, then check the SQL Server Errorlogs for any errors that were reported.

Locate the SQL Server Errorlog location and check if any errors were observed in the SQL Server Errorlog. This can be done by looking up the –e startup parameter value for the SQL Server instance in question. If you are using SQL Server 2005 and above, then right-click on the Service Name in Configuration Manager –> Properties –> Advanced Tab. Alternatively, you can use a VBScript (or Powershell) to fetch this information. Here is an example for obtaining the Startup Parameters using WMI and VBScript.

Verify that the files pointed to by the –d (master data file location), –e (Errorlog location) and –l (master log file location) actually exist and have the correct file paths. Starting from SQL Server 2005, the Errorlog will have the startup parameters printed in it when the SQL instance starts up. If you find no errors in the SQL Errorlogs, look into the Windows Application Event logs and ensure that there are no errors related to the cluster resource for the SQL instance or the cluster group (applicable for Windows 2003 clusters) that the SQL resource is a part of.

If the Windows Application Event logs, don’t provide any useful errors, then look at any warnings or errors pertaining to the SQL Server resource in the Windows Cluster logs.

Refer the following post to generate the Cluster log on a Windows Server 2008 Failover cluster or higher. For a Windows Server 2003 Cluster, the default location of the cluster logs is: <system driver>:\<windows folder>\Cluster\Cluster.log.

Now try to start the SQL Server instance using command prompt using the sqlservr.exe to check if the service comes online without bringing any user database online using -T3608.

For a default SQL Server instance:

sqlservr.exe –c –m –T3608

For a named instance of SQL Server:

sqlservr.exe –c –m –T3608 –s<instance name>

If the above fails, then the console window should provide you with an error message as to why the SQL instance fails to start. If the above succeeds, then try and start the service using the Configuration Manager.

Note: Make sure that you are not falling prey to a Desktop Heap exhaustion issue.

If you determine that the SQL Server registry entries have an erroneous entry, then you would need to follow the steps mentioned in the KB Article below to rectify the problem (due to Registry Synch operations that happen on a cluster):
http://support.microsoft.com/kb/953504

You might also find that the SQL Server resource entries in the Cluster Registry hive (HKLM\Cluster\Resources\GUID) are missing. You would then have to re-create the same using the steps mentioned in the KB Article below (applicable for SQL Server 2000 and SQL Server 2005 only):
http://support.microsoft.com/kb/810056

NOTE: Remember that modifying registry keys incorrectly on a Windows Server can have disastrous effects. Make sure that you take a registry backup before making any registry modifications.

Apart from the following common startup failure root causes, there are an additional few which are applicable to clustered SQL Server instances only (in BOLD below):

The most common root causes for service startup failures are:

  1. Access denied on the folders for the SQL Server Service Account on the DATA folder which contains the SQL Server system databases or due to some other process (like anti-virus holding a lock on the SQL database files)
  2. Insufficient permissions granted to the SQL Server Service account.
  3. File not found errors due to either accidental deletion of files or disk failures.
  4. System database files having inconsistencies prevent SQL Server startup.
  5. Password of the service account was changed but not updated on the server that has the SQL instance installed.
  6. Startup parameters have incorrect file path locations
  7. The dependent resources for the SQL Server clustered instance are failing to come online. (Eg. Shared disks on which the database files reside on)
  8. Incorrect (read:messed up) Registry entries in the CLUSTER registry hive for the concerned SQL instance which have been checkpointed already and due to which the instance fails to start.
  9. Operating system/Resource crunch/Environment related issues on one or more nodes due to which the instance comes online on one node but not on the other.

Additional references:
T-SQL Tuesday: Common SQL Server myths – Series I–Setup and Services
SQL SERVER 2000 CLuster FAILS TO COME ONLINE ON OTHER NODE
SQL SERVER 2005 RESOURCE FAILS TO COME ONLINE ON CLUSTER AFTER CHANGING THE SAN DRIVE TO ADD MORE DISK SPACE
Sql Server 2005 Resource Taking long time to come online on Windows Cluster with Resource Manager Creation Failed reported in Errorlog
Sql Server Resource Fails to come online on Cluster with Error Error Locating Server/Instance Specified [xFFFFFFFF]
SQL Server 2008 service fails to start, after applying Service Pack 1

Versions that this roadmap applies to:
SQL Server 2000, SQL Server 2005, SQL Server 2008, SQL Server 2008 R2

Wiki: SQL Database Engine Startup Failures

The startup failures for SQL Server can be classified into two types:

1. Service startup failures for stand-alone instances

2. Services startup failures for clustered instances

I shall document a road-map for a step-by-step troubleshooter for handling SQL Server startup failures for both stand-alone and clustered instances. By no means, is this the only way to troubleshoot startup failures but for a novice SQL Server user (hopefully, you don’t attempt the steps below for the first time on a production instance), this would be a good starting point in determining the root cause of the startup failures.

Services startup failures for stand-alone instances

If you are using SQL Server 2000, then try and start the SQL instance from the Services Manager and check if the service starts up. If you are using SQL Server 2005 or above, then try and start the service using SQL Server Configuration Manager.SQL Server Startup Paramaters

If the service fails to start, locate the SQL Server Errorlog location and check if any errors were observed in the SQL Server Errorlog. This can be done by looking up the –e startup parameter value for the SQL Server instance in question. If you are using SQL Server 2005 and above, then right-click on the Service Name in Configuration Manager –> Properties –> Advanced Tab (Refer Pic for more details). If you want to act smart in front of your co-worker Smile with tongue out, then you can use a VBScript (or Powershell) to fetch this information. Here is an example for obtaining the Startup Parameters using WMI and VBScript.

Verify that the files pointed to by the –d (master data file location), –e (Errorlog location) and –l (master log file location) actually exist and have the correct file paths. Starting from SQL Server 2005, the Errorlog will have the startup parameters printed in it when the SQL instance starts up.

Once you have located the current ERRORLOG (file with the same name will be present by the same name in the LOG folder), check for any errors present in the SQL Server Errorlog. An example of an error would be:

2010-11-20 07:50:58.250 Server Error: 17113, Severity: 16, State: 1.
2010-11-20 07:50:58.250 Server Error 2(The system cannot find the file specified.) occurred while opening file ‘C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\master.mdf’ to obtain configuration information at startup. An invalid startup option might have caused the error. Verify your startup options, and correct or remove them if necessary.

The above error message clearly indicates that the master.mdf file was not found in the location that it was expected to be found. If your system database files are not consistent, the SQL Server instance will not startup.

Note: If you see that the timestamp in the SQL Server Errorlog file doesn’t reflect the current system time when you attempted to start the SQL instance, then it is highly likely that the path to the SQL Errorlog is incorrect. If that is true, then the SQL Server instance will fail to start and a pretty unhelpful error message will be shown to you. It is always one of the two:

“The service failed to respond in a timely fashion”

OR

“The service failed to start”

The most common root causes for service startup failures are:

  1. Access denied on the folders for the SQL Server Service Account on the DATA folder which contains the SQL Server system databases or due to some other process (like anti-virus holding a lock on the SQL database files)
  2. Insufficient permissions granted to the SQL Server Service account.
  3. File not found errors due to either accidental deletion of files or disk failures.
  4. System database files having inconsistencies prevent SQL Server startup.
  5. Password of the service account was changed but not updated on the server that has the SQL instance installed.
  6. Startup parameters have incorrect file path locations.
  7. The SQL Server service is disabled in Service Control Manager.

Find out if the SQL Server service account has all the necessary permissions required to start the SQL Server service. This would cover the ACLs, Security Privileges for the Windows Account, and Windows Registry permissions. Refer the Books Online article which documents all the permissions required by a SQL Server service account.

Even if that doesn’t help, then check the Windows System and Application Event logs and check for any errors related to the SQL Server instance or service that is currently failing to start.

Note: Make sure that you are not falling prey to a Desktop Heap exhaustion issue.

Still stuck!! Wow!! Aren’t we in a soup. No errors in the Errorlog (very very rare case scenario) or in the Windows Event Logs.

Then you need to try and start the service using command prompt in console mode. There is yet another Books Online article which explains how to start the SQL Server instance using sqlservr.exe located in the BINN folder.

Example for Default instance:

sqlservr.exe –c –m -T3608

Example for Named instance:

sqlservr.exe –c –m –s <instance name> -T3608

Check if the command prompt window spews out any helpful error message which will point you to the root cause of the issue. The Trace Flag 3608 prevents any database startup apart from the master database. This will help you determine if the SQL instance is starting up correctly and the problem might lie with the other system databases.

During my course of troubleshooting, I have also encountered some pretty weird startup failures. Of the top of my head, here are a few which I could find links to:

By Balmukund Lakhani (Blog)

http://blogs.msdn.com/b/blakhani/archive/2009/11/24/sql-server-2005-express-setup-failure-with-error-an-error-occurred-during-encryption.aspx

The famous (infuriating) TDSSNICLIENT client initialization failures during startup. The SQL Protocols team has documented a bunch of them here:

http://blogs.msdn.com/b/sql_protocols/archive/2006/04/28/585835.aspx

The case of the obnoxious locks value:

https://troubleshootingsql.com/2011/04/01/the-case-of-the-obnoxious-locks-value/

In case you have any more interesting startup failures that you find blog posts about, feel free to leave a comment for this post.

Startup Failures for Clustered Instances in another blog post.

Versions that this road-map applies to:

SQL Server 2000, SQL Server 2005, SQL Server 2008, SQL Server 2008 R2

Additional references:
T-SQL Tuesday: Common SQL Server myths – Series I–Setup and Services