How to retrieve data from Showplan XML for Query Compile Event

Posted on December 30, 2009 by Amit Banerjee

The Showplan XML for Query Compile profiler event in SQL Server can be very helpful when you see a lot of SPIDs in SQL Server 2005 waiting for Resource Semaphore Query Compile waits.

RESOURCE_SEMAPHORE_QUERY_COMPILE waits:

A large number of waits of type RESOURCE_SEMAPHORE_QUERY_COMPILE indicates a large number of concurrent compiles. In SQL Server 2005 we limit the number of compiles than can start at any given time to prevent the following scenario: a large number of compilations are initiated and memory partially allocated. A large percentage of queries land up being suspended (waiting) for resources to complete compilation, eventually timing out and releasing memory allocated to them.

select * from sys.dm_exec_requests 

where wait_duration_ms <> 0 order by wait_duration_ms desc

If you see a large number of Session IDs waiting on this wait type, then it is time to identify which queries on your server are causing large amount of memory. In this post, I shall not talk in detail about what conditions lead to this kind of wait. However, some information can be found in this blog post by the SQL Dev team:

http://blogs.msdn.com/sqlprogrammability/archive/2007/01/21/2-0-diagnosing-plan-cache-related-performance-problems-and-suggested-solutions.aspx

You can also use this query to identify all the queries on the server instance waiting during Compilation:

select sp.*, st.text from sys.sysprocesses sp CROSS APPLY sys.dm_exec_sql_text(sp.sql_handle) AS st WHERE sp.lastwaittype LIKE 'RESOURCE_SEMAPHORE_QUERY_COMPILE%' ORDER BY sp.waittime DESC;

In brief, the two conditions that can lead to this issue are:

1. Multiple large ad-hoc queries being submitted (from the application) to the engine which leads to a compilation every time.

2. A single large query being submitted to the engine which is throttling the execution of all other queries.

You can use a server side profiler trace to collect a trace with the Showplan XML for Query Compile Event.

Once you have the profiler trace with you, you need to load the profiler trace into a database table using the query below.

select identity(int,1,1) as rownumber,* into xmltraceall from fn_trace_gettable ('<location of the trace file>',default where eventclass = 168

The Eventclass filter will only import the Showplan XML for Query Compile events into the database table. After that would need to run the query below to get the compile memory statistics for the queries running on your instance.

select databasename,objectname,sum(convert(int,substring(textdata, charindex('compilememory', textdata)+len('CompileMemory="'), charindex ('"', textdata, charindex('compilememory', textdata)+len('CompileMemory="'))-(charindex('compilememory', textdata)+len('CompileMemory="'))))) as Total_Compiled_Memory, 

avg(convert(int,substring(textdata, charindex('compilememory', textdata)+len('CompileMemory="'),charindex ('"', textdata, charindex('compilememory', textdata)+len('CompileMemory="'))-(charindex('compilememory', textdata)+len('CompileMemory="'))))) as Avg_Compiled_Memory, 

max(convert(int,substring(textdata, charindex('compilememory', textdata)+len('CompileMemory="'),charindex ('"', textdata, charindex('compilememory', textdata)+len('CompileMemory="'))-(charindex('compilememory', textdata)+len('CompileMemory="'))))) as Max_Compiled_Memory, 

min(convert(int,substring(textdata, charindex('compilememory', textdata)+len('CompileMemory="'), charindex ('"', textdata, charindex('compilememory', textdata)+len('CompileMemory="'))-(charindex('compilememory', textdata)+len('CompileMemory="'))))) as Min_Compiled_Memory, 

count(*) as Counts 

from xmltraceall 

group by databasename,objectname 

order by 3 desc

Alternatively, on the server, you can use the query below to get details of the queries causing havoc on your SQL Server instance:

WITH XMLNAMESPACES ('http://schemas.microsoft.com/sqlserver/2004/07/showplan' AS sp) Select stmt.stmt_details.value('(./sp:QueryPlan/@CompileMemory)[1]', 'bigint') as compile_memory_kb, 

stmt.stmt_details.value('(./sp:QueryPlan/@CompileTime)[1]', 'bigint') as compile_time_ms, 

stmt.stmt_details.value('(./sp:QueryPlan/@CompileCPU)[1]', 'bigint') as compile_cpu_ms, 

stmt.stmt_details.value('(./sp:QueryPlan/@CachedPlanSize)[1]', 'bigint') as CachePlanSize, 

p.objname, replace (replace (stmt.stmt_details.value('@StatementText', 'nvarchar(max)'), CHAR(10), ' '), CHAR(13), ' ') as query_text 

from (select cast (t.query_plan as Xml) showplan,t.dbid ,t.objectid, OBJECT_NAME (t.objectid, t.dbid) 'objname' from sys.dm_exec_cached_plans p1 cross apply sys.dm_exec_text_query_plan (plan_handle,DEFAULt,DEFAULT) t) as p 

cross apply showplan.nodes('//sp:StmtSimple') as stmt (stmt_details) 

order by 3 desc, 1 desc

However, please be advised that this is quite an expensive query and can cause additional performance issues on the server. I would suggest using the server side profiler trace method.

Performance issues due to USERSTORE_TOKENPERM

Posted on December 30, 2009 by Amit Banerjee

On versions of SQL Server 2005, a known issue can degrade query performance on 32-bit and 64-bit systems with 2 gigabytes (GB) or more of memory. When you execute queries under the context of a login that is not a member of the sysadmin fixed server role, you might encounter performance degradation symptoms that arise from a large and growing Security Token cache. These issues can include performance degradation of queries, high CPU usage for the SQL Server process, and a sharp increase in worker threads and SQL user connections. Systems that have less than 2 GB of memory should not experience these issues because normal memory usage requirements keep the Security Token cache from growing too large.

If you are facing the symptoms below, then it is a good time to discuss applying SP3 with CU1 with your management or apply the workarounds mentioned in the articles below. The most common workaround is having a job running at periodic intervals and cleaning up the TOKENPERM cache using DBCC FREESYSTEMCACHE (‘TokenAndPermUserStore’).

Symptoms

A large number of users running queries on the SQL instance and intermittently you find that the duration of queries increases intermittently
Most of the queries in the environment exhibit this pattern for random parameters and not any specific query/proc with any specific parameter
DBCC MEMORYSTATUS output for the SQL instance shows that the memory usage for SQL is gradually increasing over a period of time
High CPU during the problem period
DBCC FREEPROCCACHE or DBCC FREESYSTEMCACHE fixes the issue temporarily

Related Articles

KB article 927396: Queries take a longer time to finish running when the size of the TokenAndPermUserStore cache grows in SQL Server 2005

KB article 933564: FIX: A gradual increase in memory consumption for the USERSTORE_TOKENPERM cache store occurs in SQL Server 2005

KB article 937745: FIX: You may receive error messages when you try to log in to an instance of SQL Server 2005 and SQL Server handles many concurrent connections

KB Article 959823: How to customize the quota for the TokenAndPermUserStore cache store in SQL Server 2005 Service Pack 3

http://blogs.msdn.com/psssql/archive/2008/06/16/query-performance-issues-associated-with-a-large-sized-security-cache.aspx

Wikis: Disk Contention Issues

Posted on December 30, 2009 by Amit Banerjee

I had written this post a while back and decided to add a bit of more context w.r.t. Perfmon and other DMVs that can be used to fetch information.

It is said that the recommendation was that Average Disk Queue Length (perfmon counter in Logical/Physical Disk object) for disks be less than 2 * number of spindles on the disk. So if your disk had 10 spindles, then values above 20 woud be a cause of concern. But on a SAN or on virtualized environments, this particular counter can be misleading at times. So what was valid a few years has become somewhat of an urban legend. I personally am not a very big fan of calling a disk performance bad just based on this counter alone and without knowing how the underlying physical disk structure looks like.

To get a complete picture of your I/O performance, you might want to the check the following counters under perfmon too (keep in mind IO vendor logging and if you are using Virtual Machines, logging I/O calls at the virtualization bus layer can become equally important in some cases):
Physical Disk:
Avg. Disk Secs/Write
Avg. Disk Secs/Read
Avg. Disk Secs/Transfer

Logical Disk:
Avg. Disk Secs/Write
Avg. Disk Secs/Read
Avg. Disk Secs/Transfer

Also, you would want to find out if there are a lot of Split I/O (another Perfmon disk counter) occurring on the server. If that is the case, then most probably, you have disk level fragmentation on the drives storing SQL database files. That doesn’t mean you start de-fragmenting your disk drives on which SQL Server database files reside on.

For disks on which SQL Server Data (MDF,NDF) and Log (LDF) files reside, we recommend that these counters (Avg Disk Secs/Read,Write,Transfer) show values less that 30 milli-seconds or they do not represent frequent spikes or prolonged response times above 30 milli-seconds. If they are constantly showing values of 0.5 and above, then we have a serious disk contention issue on our hands. What this means is that your disk is roughly taking 0.5 second to perform an I/O operation which on database server environments would be a performance bottleneck. However, a low average value and very frequent spikes for the counters mentioned above which occur during the the timeframe of your performance issues is also not a good sign and would require a further investigation.

The other counters of interest in Perfmon are:

Disk Bytes/Sec
Disk Read Bytes/Sec
Disk Write Bytes/Sec

You can drill down into the above counters on the basis of logical disk or physical disk to determine which disk is receiving the highest amount of I/O activity. As always with performance issues, what is a good value *depends* on the throughput of the disks. Depending on how much I/O these disks are receiving and the throughput of the disks, you will be able to determine if the disks are capable of hanlding the load.

The disk contention could be due to these reasons:
1. There is an underlying H/W issue. This could mean that some driver or firmware needs an upgrade (essentially all your firmware/drivers should always be on the latest builds available) or it could be a problem with the SAN configuration or it could be that this particular SAN is not dedicated to SQL Server.
2. OR it could be that the disk contention is due to the fact that the queries executing on SQL Server are performing excessive amount of reads due to sub-optimal plans. Sub-optimal plans would be generated if your indexes are out of shape or if your statistics are skewed. This would require a statistics update with a full scan. (Please refer my statistics update post.) You should be able to find out which process is driving IO by looking at the Process performance object in Perfmon and the I/O Data Bytes counter. There is no fixed value for good and bad performance for this counter. The amount of I//O being pushed through should be compared with the response time of the disks and the I/O thresholds which the disk can handle.
3. OR it could be that the disk activity on the disks has increased over a period of time due to increase in data or addition of new databases. In such a case, it would be a good idea to provide a separate disk or storage area for some of the databases as the current activity might be too much for the disk to handle. Sometimes, load balancing for physical storage also helps in alleviating a disk contention issue.

Also, it would be a good idea to check if you are receiving any stuck/stalled IO warnings for any particular database in the SQL Server ERRORLOGs. (Reference: Stuck Stalled IO post)

What kind of waits would you monitor for when looking at wait statistics while troubleshooting disk contention. The most common waits that you would see in the sys.sysprocesses or sys.dm_exec_requests DMV output are PAGEIOLATCH_XX waits (XX = EX or SH most commonly) or WRITELOG under the wait types. Keep in mind, that the wait duration for these wait types should be non-zero. It could either keep increasing over a period of time for a particular request. Or it could show non-zero values which are constantly dropping to zero and increasing which means that there are multiple requests which are waiting on IO to complete. If you experience WRITELOG waittypes for high durations for user transactions, then that would have a cascading effect as SQL Server follows a Write-Ahead-Logging (WAL) protocol which means that the log is written into in sequential order. If one transaction is waiting for a log write operation to complete, then another concurrent transaction on the same database would also experience delays.

You can identify the queries performing high amount of I/O by collecting profiler traces and tracking the reads/writes done by the statements and batches executing on the server during that time. You only need Statement and Batch Completed events in case you are not encountering any attentions. Refer my post on profiler for the set of profiler events to be avoided.

Addition: May 5, 2011:

You can also use the query below to find out which database is receiving the maximum amount of I/O and on which database are the IO stalls being experienced. This information is retrieved from sys.dm_io_virtual_file_stats.


SELECT DB_NAME (f.database_id) AS database_name, f.database_id, f.name AS logical_file_name, f.[file_id], f.type_desc,

CAST (CASE WHEN LEFT (LTRIM (f.physical_name), 2) = '\\' THEN LEFT (LTRIM (f.physical_name), CHARINDEX ('\', LTRIM (f.physical_name), CHARINDEX ('\', LTRIM (f.physical_name), 3) + 1) - 1)

WHEN CHARINDEX ('\', LTRIM(f.physical_name), 3) > 0 THEN UPPER (LEFT (LTRIM (f.physical_name), CHARINDEX ('\', LTRIM (f.physical_name), 3) - 1))

ELSE f.physical_name END AS nvarchar(255)) AS logical_disk,

fs.num_of_reads, fs.num_of_bytes_read, fs.io_stall_read_ms, fs.num_of_writes, fs.num_of_bytes_written,

fs.io_stall_write_ms, fs.size_on_disk_bytes

FROM sys.dm_io_virtual_file_stats (default, default) AS fs

INNER JOIN sys.master_files AS f ON fs.database_id = f.database_id AND fs.[file_id] = f.[file_id]

*The above query is the same query that is used by Management Data Warehouse to collect a sub-set of the data using the “Server Activity” collection set.

If you want to drill-down further into the current low-level I/O, locking, latching, and access method activity for each partition of a table or index in the database identified in the above query, then you can use the query below to get that information for a particular database(s) and also identify if their is a missing index recommended for that table. The information for this is retrieved using sys.dm_db_index_operational_stats. Keep in mind that the data returned by sys.dm_db_index_operational_stats exists only as long as the metadata cache object that represents the heap or index is available. This data is neither persistent nor transactionally consistent. This means you cannot use these counters to determine whether an index has been used or not, or when the index was last used.

-- Shows objects which have page_io_latch_wait_count > 0

select db_name(d.database_id) as database_name,

quotename(object_schema_name(d.object_id, d.database_id)) + N'.' + quotename(object_name(d.object_id, d.database_id)) as object_name,

d.database_id,

d.object_id,

d.page_io_latch_wait_count,

d.page_io_latch_wait_in_ms,

d.range_scans,

d.index_lookups,

case when mid.database_id is null then 'N' else 'Y' end as missing_index_identified

from (select

database_id,

object_id,

row_number() over (partition by database_id order by sum(page_io_latch_wait_in_ms) desc) as row_number,

sum(page_io_latch_wait_count) as page_io_latch_wait_count,

sum(page_io_latch_wait_in_ms) as page_io_latch_wait_in_ms,

sum(range_scan_count) as range_scans,

sum(singleton_lookup_count) as index_lookups

from sys.dm_db_index_operational_stats(NULL, NULL, NULL, NULL)

where page_io_latch_wait_count > 0

group by database_id, object_id ) as d

left join (select distinct database_id, object_id from sys.dm_db_missing_index_details) as mid on mid.database_id = d.database_id and mid.object_id = d.object_id

where d.row_number <= 20 and d.database_id = DB_ID('AdventureWorks') -- Replace database name with the database that you are interested in

-- Comment out the line below if you want to look at I/O for system tables as well

and d.object_id not in (select object_id from sys.objects where is_ms_shipped = 1)

*The query above is a variation of the query that is used by the “IO Statistics” report present in SQL Server Performance Dashboard.

If you are interested in index usage statistics for that database, then you do so using sys.dm_db_index_usage_stats.

Example:


select OBJECT_NAME(object_id,database_id) as ObjectName, object_id, index_id, user_seeks, user_scans, user_lookups, user_updates, last_user_seek, last_user_scan, last_user_lookup, last_user_update

from sys.dm_db_index_usage_stats

where database_id = DB_ID('AdventureWorks') -- Replace database name with the database that you are interested in

and (user_lookups <> 0 or user_scans <> 0 or user_seeks <> 0)

order by user_scans desc

Related Articles

Sometimes, it’s also worthwhile checking the activity on tempdb and finding out if the contention is due to tempdb activity. The links below could be helpful:

Missing Indexes Feature in SQL Server 2005

Posted on December 30, 2009 by Amit Banerjee

Indexes are essential in making sure that your queries have efficient query plans and for SELECTS, you don’t end up doing searches on HEAPS.

One of the biggest improvements in SQL Server 2005 is that it tracks all the transactions happening on the server and makes a list of indexes which could prove beneficial for those queries. Of course, one index might be beneficial for one query but detrimental to another. So, it is highly essential that we test the feasibility of implementing these indexes on a production environment before rolling out changes to our indexes.

The missing indexes feature is on by default. No controls are provided to turn the feature on or off, or to reset any of the tables returned when the dynamic management objects are queried. When SQL Server is restarted, all of the missing index information is dropped. This feature can only be disabled if an instance of SQL Server is started by using the -x argument (one of the startup parameters) with the sqlservr command-prompt utility. However, along with the missing indexes, a lot of other useful events for performance troubleshooting will not be captured which is why this startup option is not recommended.

So, the first thing I would do if a query or a set of queries are running slowly in SQL Server, I would query the DMVs related to these missing indexes and find out if there are an indexes related to the tables on which those slow running queries are executing. This feature becomes highly useful when the following conditions are true:
1. NO CPU bottleneck
2. NO Blocking on the server
3. NO Disk bottleneck

You can view a list of all the missing indexes using Performance Dashboard which can be used with SQL Server 2005 Service Pack 2 and above.

You can get the CREATE INDEX commands for the missing indexes recommended on your instance using the query mentioned here.

If you want to test an automatic determination of the index candidates for your workload, then have a look at the post below:
http://blogs.msdn.com/b/queryoptteam/archive/2006/06/01/613516.aspx

Related Links for Missing Indexes
About Missing Indexes
http://msdn2.microsoft.com/en-us/library/ms345524.aspx
Finding Missing Indexes
http://msdn2.microsoft.com/en-us/library/ms345417.aspx
Limitations of this feature
http://msdn2.microsoft.com/en-us/library/ms345485.aspx

Stuck Stalled IO Messages

Posted on December 30, 2009 by Amit Banerjee

Sometimes, we see the following messages in the SQL Server 2000 (Service Pack 4 and above) and SQL Server 2005 ERRORLOGs:

2004-11-11 00:21:25.26 spid1 SQL Server has encountered 192 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [E:\db_data2.ndf] in database [user_db] (7). The OS file handle is 0x00000000000074D4. The offset of the latest long IO is:
0x00000000022000″.

A given I/O request that becomes stalled or stuck is only reported one time. If the message reports 10 I/Os as stalled, these 10 reports will not occur again. If the next message reports 15 I/Os stalled, it indicates that 15 new I/O requests have become stalled.
Most of these messages if occur frequently in a SQL Server, then it could be due to one of the following:

1. There is an underlying hardware issue where in your firmware and disk drivers need to be upgraded. This is mostly done by the Internal Storage Team or preferably by the Hardware Vendor
2. Or the disks are overwhelmed with the IO Requests being posted on the risks. In this case, you would want to go ahead and make sure that if possible some of the data files and log files would need to be moved to a different drive
3. Or it could be due to bad plans SQL Server ends up performing extra number reads than required

Also, monitoring the disks using the following counters under performance monitor under the would be helpful (under the performance objects – Logical Disk and Physical Disk):
1. Average Disk Queue Length
2. Average Disk Secs\Transfer
3. Average Disk Secs\ Write
4. Average Disk Secs\ Reads
The last three counters should never show a prolonged peak period which shows values of 30 milli-seconds and above. Please refer my blog article Setting up Perfmon Logs on how to setup a perfmon trace.

One important point that needs to be kept is the amount of I/O being performed by SQL Server. You can find out the amount of I/O being driven by SQL Server by using the Process performance object with the SQL Server instance (sqlservr#) that you are concerned with and looking at the values for I/O Data Bytes value. It might be a case where your queries have sub-optimal plans due to which they are performing an excessive amount of reads/writes which is causing the disk to be throttled.

If you find out that this is actually an issue with the disk sub-system, then it is always a good idea to involve the disk vendor/or your internal storage team to find out if all the drivers/firmware are updated with the latest patches.

Addition: May 26, 2011:
If you want to get your hands dirty and debug a Stuck-Stalled IO request, then Bob Dorr, Principal Escalation Engineer at Microsoft, has a blog post on this:
http://blogs.msdn.com/b/psssql/archive/2008/03/03/how-it-works-debugging-sql-server-stalled-or-stuck-i-o-problems-root-cause.aspx

Helpful Articles:
SQL IO Basics

SQL Stuck Stalled IO KB Article

Stuck Stalled IO Messages

SQL Server 2005 IO Basics

Working with TempDB in SQL Server 2005

Troubleshooting Performance Problems (section on I/O)

Physical Database Storage

Storage Top 10 Best Practices

Ruling Out Disk-Bound Problems