Wiki: Demystifying SQL Server Exception Dumps

Recently, I replied to an MSDN post regarding an Access Violation reported by a SQL Server instance. In this post, I shall walk through troubleshooting memory dumps generated by SQL Server exceptions and what can be done without the availability of private symbols for debugging.

I had written an earlier post on when a memory dump is generated in SQL Server. If a single Errorlog has stack dumps from multiple exceptions, the first thing to do is to investigate the first access violation reported in the series of dumps. It is also possible that an exception (e.g. an access violation) could be followed by another exception/assertion or a non-yielding condition which may be a side effect of the earlier access violation.

When SQL Server encounters an access violation, SQLDumpXXXX.mdmp, SQLDumpXXXX.txt and SQLDumpXXXX.log files associated with the exception are created in the LOG folder. The SQL Errorlog will report the occurrence of the exception as well.

  1. The .mdmp file is the memory dump which would be required by CSS to troubleshoot the issue.
  2. The .txt file is the symptom dump file which contains environment information along with other information depending on the nature of the exception.
  3. The .log file contains an extract from the SQL Errorlog from the time the exception was encountered. This helps in looking at what messages were reported in the Errorlog prior to the issue, and it aids post-mortem analysis if the current Errorlogs have already been recycled.

There can be two types of exceptions: one that is fatal enough to terminate the SQL Server service, and one that only terminates the thread which encountered the issue.

Now, what can be done in such a situation without having to look into the memory dump file? The first step is to check whether the issue is a recurring behavior tied to a particular operation, to establish a pattern, or whether there is no identifiable pattern at all. If you look into the SQL Errorlog, you might be able to identify the input buffer which caused the exception (unless it is an exception which terminated the SQL service; one such example is shown below).

Server * spid 5728
Server * ex_handle_except encountered exception C0000005 – Server terminating

If the Errorlog snippet contains the T-SQL query (input buffer) which raised the exception, then you would want to execute the same query from a Management Studio query window and check if the exception re-occurs. If yes, then you have a valid repro scenario.

Example snippet from an Errorlog where the Access Violation didn’t cause a service termination:

*   Exception Address = 0042B866
*   Exception Code    = c0000005 EXCEPTION_ACCESS_VIOLATION
*   Access Violation occurred writing address 67192000
* Input Buffer 38 bytes –
*  exec sp_updatestats

Next, check if you can spot any external components loaded into the SQL Server address space. Returning to the MSDN post mentioned above, this is where I had spotted an anti-virus filter DLL loaded in the SQL Server process address space.

Server * bcrypt 000007FEFC900000 000007FEFC921FFF 00022000
Server * PCTLsp64 000000000B1E0000 000000000B251FFF 00072000
Server * mswsock 000007FEFC840000 000007FEFC893FFF 00054000

Based on a few Bing searches, I was able to determine that PCTLsp64 was part of a virus monitoring program from PC Tools. The next step would be to check if this external component can be moved out of the SQL Server process address space. The fact that the DLL is loaded in the SQL process address space doesn’t make the DLL a culprit. However, when troubleshooting such issues, it is important to eliminate all unknown or third-party elements that could prevent us from narrowing down to the culprit. Other DLLs commonly found in the SQL Server process address space are provider DLLs when linked servers are used on the SQL Server instance. Additionally, execution of Extended Stored Procedures (XSPs) can also load external DLLs into the SQL Server process address space, depending on how the XSP DLL was coded. An easy way to identify an XSP DLL is to search for the word “Using” in the SQL Server Errorlog. SQL Server reports the first instance of an XSP DLL being loaded into the SQL Server process address space:

Using ‘xpstar.dll’ version ‘2009.100.1600’ to execute extended stored procedure ‘xp_readerrorlog’
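As a sketch, such entries can also be searched for directly from T-SQL using the undocumented xp_readerrorlog procedure (the parameters assumed below are log number, log type where 1 = SQL Errorlog, and a search string; behavior can vary across versions):

```sql
-- Sketch: search the current SQL Errorlog (log 0, type 1 = SQL Errorlog)
-- for XSP load messages, which begin with the word "Using".
EXEC master..xp_readerrorlog 0, 1, N'Using';
```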

If you do find an external component loaded in the SQL process address space, check the callstack reported in the Errorlog below the list of loaded modules to determine if the functions on the exception stack are associated with the external component that you identified. For the issue reported in the MSDN post, I did find this:


Now this makes a more compelling case for removing the external component from the SQL Server process address space. Once that is done, and if you have a repro condition which causes the issue, it would be good to re-run your repro and check whether the issue is actually resolved. Not all exceptions are the same even though they are reported in the same manner.

If all of the above hasn’t helped in narrowing down the issue, check whether you are on the latest available cumulative update for the version of SQL Server that you are using. If not, it might be a good idea to update your instance to the latest available build to ensure that you are not hitting a known issue. If this is possible, apply the latest cumulative update package and check whether the issue re-occurs. I am cognizant of the fact that this might not be an easy option due to change management restrictions on production environments. In that case, my recommendation would be to engage CSS to assist with a thorough investigation of the root cause of the issue.
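To check the build you are currently on before comparing it against the latest cumulative update, a quick query such as the following can be used (SERVERPROPERTY and the property names below are documented):

```sql
-- Current build, service pack level and edition of the instance
SELECT SERVERPROPERTY('ProductVersion') AS Build,
       SERVERPROPERTY('ProductLevel') AS Level,
       SERVERPROPERTY('Edition') AS Edition;
```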

As you can see from the steps outlined above, there are a few things you can look into before you need to start poking into the exception memory dump with a debugging tool.

Happy debugging!


T-SQL Tuesday #016 : Aggregate Functions

This month’s T-SQL Tuesday is being hosted by Jes Borland (Twitter | Blog) and the topic for this month is “Aggregate Functions”. What really caught my eye was the play on T-SQL functions, which accentuates why Jes justifies her Twitter handle of grrrl_geek:


Let’s SUM(thoughts), COUNT(ways we’ve done things), and set MAX(awesome) on our posts!

So at this point, I shall SELECT Content FROM PastExperience WHERE Type = ‘Aggregate Functions’ and pen it together in this blog post!

Aggregate functions have been a great way to slice-and-dice data stored in any RDBMS product and present it in the form the business wants to look at it. I have seen, and also written, multiple T-SQL queries in the past which use the HAVING clause along with aggregate functions to filter data out of the final dataset. I was very excited about the use of GROUPING together with ROLLUP for the set of canned queries that I had for aggregating analysis for SQL Performance cases while using SQL Nexus/RML Utilities.

The query below gives me an aggregation (sum, maximum, minimum and average) of the CPU/Duration/Reads/Writes of all batches which completed execution during the time the profiler traces were captured.

select a.hashid, b.OrigText,
SUM(a.duration) as SumDuration, SUM(a.Reads) as SumReads, SUM(a.Writes) as SumWrites, SUM(a.CPU) as SumCPU,
MAX(a.duration) as MAXDuration, MAX(a.Reads) as MAXReads, MAX(a.Writes) as MAXWrites, MAX(a.CPU) as MAXCPU,
MIN(a.duration) as MINDuration, MIN(a.Reads) as MINReads, MIN(a.Writes) as MINWrites, MIN(a.CPU) as MINCPU,
AVG(a.duration) as AVGDuration, AVG(a.Reads) as AVGReads, AVG(a.Writes) as AVGWrites, AVG(a.CPU) as AVGCPU,
GROUPING(a.hashid) AS [GroupNo]
from readtrace.tblBatches a
inner join readtrace.tblUniqueBatches b
on a.HashID = b.HashID
where EndTime is not null
group by a.hashid, b.OrigText WITH ROLLUP

With the above query, I get all the queries with their average, minimum, maximum and total statistics, along with a rollup of the reads, writes, duration and CPU.
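As a side note, the GroupNo column returned by GROUPING() can also be used to isolate just the rollup (summary) rows. A minimal sketch against the same tables:

```sql
-- GROUPING(a.hashid) = 1 marks rows produced by ROLLUP, so this
-- keeps only the grand-total row across all batches.
select SUM(a.duration) as SumDuration, SUM(a.CPU) as SumCPU
from readtrace.tblBatches a
where a.EndTime is not null
group by a.hashid WITH ROLLUP
having GROUPING(a.hashid) = 1
```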

COMPUTE is another not-so-commonly used T-SQL construct which can really help with data analysis, especially from my perspective, since I spend a significant part of my day looking into SQL Performance issues. The query below will give me all the long running batches along with the duration/CPU information for each run, with a computed sub-total for each such aggregated row.

select a.spid,b.origtext,a.duration, a.cpu
from readtrace.tblBatches a
inner join readtrace.tblUniqueBatches b
on a.HashID = b.HashID
where EndTime is not null
order by a.duration,a.spid
compute sum(duration) , sum(cpu)  BY a.duration,a.spid;

An example output is:


Another example where aggregation helps is identifying the amount of CPU used by queries which were captured using profiler traces. Why is this important, you ask? It is very important when I need to compare the CPU usage time reported by SQL Server (captured using Perfmon/WMI etc.) with the actual time the user queries spent executing on CPU. The Statement, RPC, Batch and SP completed events capture the CPU consumption for each request; aggregating them together gives you the amount of time spent by user queries while executing on CPU. The approach below uses UNION and SUM to figure out the total percentage of CPU used by the queries which completed during the profiler trace capture. The only parameter value that you need to provide is the number of CPUs, i.e. the number of logical CPUs being used by the SQL instance. The reason I emphasize the word “completed” is that the T-SQL batch below works on the presumption that the captured profiler traces contain completion events for all queries that were executing during the timeframe of the data collection. This batch would provide incorrect results if you have a long running CPU-driven query which continues executing even after the profiler trace capture is stopped. It is ideal for scenarios where a profiler trace was started, a workload executed on the server, and then the profiler trace was stopped. Under such a scenario, it would help you identify the percentage CPU usage for all queries executed by the workload.

DECLARE @number_of_cpus INTEGER
DECLARE @MinTime DATETIME, @MaxTime DATETIME, @ms_diff BIGINT

SET @number_of_cpus = 2  --Change this as necessary

SELECT @MinTime = MIN(ISNULL(u.MinTime, '12-31-9999'))
FROM (
SELECT MIN(StartTime) AS MinTime
FROM ReadTrace.tblBatches
UNION ALL
SELECT MIN(EndTime) AS MinTime
FROM ReadTrace.tblBatches
UNION ALL
SELECT MIN(StartTime) AS MinTime
FROM ReadTrace.tblStatements
UNION ALL
SELECT MIN(EndTime) AS MinTime
FROM ReadTrace.tblStatements
) AS u

SELECT @MaxTime = MAX(ISNULL(u.MaxTime, '01-01-1900'))
FROM (
SELECT MAX(StartTime) AS MaxTime
FROM ReadTrace.tblBatches
UNION ALL
SELECT MAX(EndTime) AS MaxTime
FROM ReadTrace.tblBatches
UNION ALL
SELECT MAX(StartTime) AS MaxTime
FROM ReadTrace.tblStatements
UNION ALL
SELECT MAX(EndTime) AS MaxTime
FROM ReadTrace.tblStatements
) AS u

IF (@MinTime = '12-31-9999' OR @MaxTime = '01-01-1900')
RAISERROR ('ERROR: The trace does not contain appropriate columns to determine the min/max time values for events in the trace.',
16, 1)

SET @ms_diff = DATEDIFF(ms, @MinTime, @MaxTime) * @number_of_cpus

SELECT @MinTime AS MinTime, @MaxTime AS MaxTime, @ms_diff AS total_CPU_milliseconds, @number_of_cpus AS NumCpus

SELECT (CAST(SUM(cpu) AS FLOAT) / @ms_diff) * 100 AS cpu_percentage,
SUM(cpu) AS sum_cpu, COUNT(*) AS executions, ub.normtext
INTO #temp_cpu_percentages
FROM ReadTrace.tblBatches AS b
INNER JOIN ReadTrace.tblUniqueBatches AS ub
ON b.HashID = ub.HashId
GROUP BY ub.normtext
HAVING SUM(cpu) > 0
ORDER BY sum_cpu DESC

SELECT SUM(t.cpu_percentage) AS SQL_cpu_usage_percent
FROM #temp_cpu_percentages AS t

SELECT t.*
FROM #temp_cpu_percentages AS t
ORDER BY t.cpu_percentage DESC

DROP TABLE #temp_cpu_percentages

What was the objective of this post?

1. Point out a few not-so-commonly used pieces of T-SQL functionality (GROUPING/COMPUTE/ROLLUP) which can be quite helpful.

2. Show a few ways of aggregating, slicing and dicing data imported into a SQL Nexus database for quick analysis.

Books Online references:

GROUP BY (Transact-SQL)

COMPUTE (Transact-SQL)

What does cntr_type mean?

Have you ever wondered whether the cntr_type column value in the sys.sysperfinfo or sys.dm_os_performance_counters output has a significant meaning? Well, since the column value is there, it obviously has a meaning. Additionally, if the raw values of some of the counter types are taken at face value, then your performance baselining can take a severe beating.

The meaning of each cntr_type value can be found in the WMI Performance Counter Types or Windows Server Performance Counter Types documentation on MSDN.

The common counter types in SQL Server are:

PERF_COUNTER_RAWCOUNT | Decimal | 65536
Raw counter value that does not require calculations, and represents one sample.

PERF_COUNTER_LARGE_RAWCOUNT | Decimal | 65792
Same as PERF_COUNTER_RAWCOUNT, but a 64-bit representation for larger values.

PERF_COUNTER_COUNTER | Decimal | 272696320
Average number of operations completed during each second of the sample interval. NOTE: For "per-second counters", this value is cumulative. The rate value must be calculated by sampling the value at discrete time intervals. The difference between any two successive sample values is equal to the rate for the time interval used. For example, Batch Requests/sec is a per-second counter; it shows cumulative values.

PERF_COUNTER_BULK_COUNT | Decimal | 272696576
Average number of operations completed during each second of the sample interval. This counter type is the same as the PERF_COUNTER_COUNTER type, but it uses larger fields to accommodate larger values.

PERF_AVERAGE_BULK | Decimal | 1073874176
Number of items processed, on average, during an operation. This counter type displays a ratio of the items processed (such as bytes sent) to the number of operations completed, and requires a base property with PERF_AVERAGE_BASE as the counter type.

PERF_LARGE_RAW_FRACTION | Decimal | 537003264
Fraction counter that displays a ratio of the raw counter value to its corresponding base counter value.

PERF_LARGE_RAW_BASE | Decimal | 1073939712
Base value found in the calculation of PERF_LARGE_RAW_FRACTION, 64 bits.
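Since the per-second counter types above are cumulative, the actual rate has to be derived from two samples. A minimal sketch (the 10-second delay and the Batch Requests/sec counter are just illustrative choices):

```sql
-- Sample a cumulative per-second counter twice and compute the rate
-- as the delta divided by the elapsed seconds.
DECLARE @v1 BIGINT, @v2 BIGINT;
SELECT @v1 = cntr_value FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec';
WAITFOR DELAY '00:00:10';
SELECT @v2 = cntr_value FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec';
SELECT (@v2 - @v1) / 10.0 AS batch_requests_per_sec;
```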

If you had the following values:
SQLServer:Plan Cache | Cache Hit Ratio | Temporary Tables & Table Variables | 381
SQLServer:Plan Cache | Cache Hit Ratio Base | Temporary Tables & Table Variables | 386
Then the Temp Table/Variable cache hit ratio percentage would be approximately 98.7% (381 divided by 386).
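The arithmetic is simply the raw value divided by its base counter value; using the example values above:

```sql
-- 381 / 386 is roughly 0.987, i.e. about 98.7%
SELECT CAST(381 AS FLOAT) / 386 * 100 AS cache_hit_ratio_percent;
```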

You can use the query below to get the comments for each counter type as discussed above:

select object_name, counter_name, instance_name, cntr_value,
case cntr_type
	when 65792 then 'Absolute Meaning'
	when 65536 then 'Absolute Meaning'
	when 272696320 then 'Per Second counter and is Cumulative in Nature'
	when 272696576 then 'Per Second counter and is Cumulative in Nature'
	when 1073874176 then 'Bulk Counter. To get correct value, this value needs to be divided by Base Counter value'
	when 537003264 then 'Bulk Counter. To get correct value, this value needs to be divided by Base Counter value'
end as counter_comments
from sys.dm_os_performance_counters
where cntr_type not in (1073939712)
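Taking this one step further, the ratio counters can be resolved automatically by joining each one with its base counter. The sketch below assumes the "&lt;counter name&gt; base" naming convention SQL Server uses for base counter rows (the exact casing can vary, and comparisons depend on your collation):

```sql
-- Join each ratio counter (1073874176, 537003264) to its base counter
-- (1073939712) and compute the effective percentage; NULLIF avoids
-- a divide-by-zero when the base value is 0.
SELECT c.object_name, c.counter_name, c.instance_name,
       CAST(c.cntr_value AS FLOAT) * 100 / NULLIF(b.cntr_value, 0) AS effective_percent
FROM sys.dm_os_performance_counters AS c
INNER JOIN sys.dm_os_performance_counters AS b
    ON b.object_name = c.object_name
    AND b.instance_name = c.instance_name
    AND RTRIM(b.counter_name) = RTRIM(c.counter_name) + ' base'
WHERE c.cntr_type IN (1073874176, 537003264)
    AND b.cntr_type = 1073939712;
```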


Documentation on MSDN:

WMI Performance Counter Types

SQL Server 2005 BOL Topic

sys.dm_os_performance_counters (Transact-SQL) 

The broad classes of counters are as follows:

Non-computational Counter Types

Basic Algorithm Counter Types

Counter Algorithm Counter Types

Timer Algorithm Counter Types

Precision Timer Algorithm Counter Types

Queue-length Algorithm Counter Types

Base Counter Types

Statistical Counter Types

Setup – RPC, please be nice to me

Recently, I had looked into a SQL Server 2008 cluster setup failure where the following information was noted in the Detail.txt file:

2011-02-21 11:58:37 Slp: Configuration action failed for feature SQL_Engine_Core_Inst during timing ConfigRC and scenario ConfigRC.
2011-02-21 11:58:37 Slp: The RPC server is unavailable
2011-02-21 11:58:37 Slp: Configuration action failed for feature SQL_Engine_Core_Inst during timing ConfigRC and scenario ConfigRC.
2011-02-21 11:58:37 Slp: System.ComponentModel.Win32Exception: The RPC server is unavailable
2011-02-21 11:58:37 Slp:    at Microsoft.SqlServer.Configuration.Cluster.ClusterResource.UpgradeResourceDLL(String nodeName, String dllPathName)
2011-02-21 11:58:37 Slp:    at Microsoft.SqlServer.Configuration.SqlEngine.SQLEngineClusterFeature.UpgradeResourceDLL(SQLServiceResource sqlResource)
2011-02-21 11:58:37 Slp:    at Microsoft.SqlServer.Configuration.SqlEngine.SQLEngineClusterFeature.ConfigureSQLEngineResourceType()
2011-02-21 11:58:37 Slp:    at Microsoft.SqlServer.Configuration.SqlEngine.SqlEngineSetupPrivate.Patch_ConfigRC(EffectiveProperties properties)
2011-02-21 11:58:37 Slp:    at Microsoft.SqlServer.Configuration.SqlEngine.SqlEngineSetupPrivate.Patch(ConfigActionTiming timing, Dictionary`2 actionData, PublicConfigurationBase spcb)
2011-02-21 11:58:37 Slp:    at Microsoft.SqlServer.Configuration.SqlConfigBase.SqlFeatureConfigBase.Execute(ConfigActionScenario scenario, ConfigActionTiming timing, Dictionary`2 actionData, PublicConfigurationBase spcbCurrent)
2011-02-21 11:58:37 Slp:    at Microsoft.SqlServer.Configuration.SqlConfigBase.SlpConfigAction.ExecuteAction(String actionId)
2011-02-21 11:58:37 Slp:    at Microsoft.SqlServer.Configuration.SqlConfigBase.SlpConfigAction.Execute(String actionId, TextWriter errorStream)
2011-02-21 11:58:37 Slp: Exception: System.ComponentModel.Win32Exception.
2011-02-21 11:58:37 Slp: Source: Microsoft.SqlServer.Configuration.Cluster.
2011-02-21 11:58:37 Slp: Message: The RPC server is unavailable.

“The RPC server is unavailable” is a pretty generic error and can happen for a lot of reasons. One of the common causes that we have seen in CSS while troubleshooting setup issues is incorrect DNS entries. This can happen if the DNS entry for the cluster network name wasn’t configured properly and is pointing to an incorrect IP address.

Another point to keep in mind is the Windows Firewall. Make sure that the firewall configuration allows this RPC call through.

I worked with Shahryar (Twitter) on a similar issue last week, where it was identified that a PING request to the cluster name or IP returned no response.

If all else fails, then network tracing would help in figuring out what is failing and where.


Related KB article: Error message when you connect to a cluster virtual server by using the named pipes protocol: "The machines selected for remote communication is not available at this time."