24HOP Session: How to be a Ninja – Troubleshooting SQL PERF on Azure VMs

I have been a bit quiet on this blog, but that is partly because I have moved into a new role and a new country. I am now part of the SQL Server Product Group [b|t] and based out of Redmond.

I am really excited about this. Sourabh Agarwal [b|t] and I will be presenting a preview of our SQL PASS Summit 2015 pre-con session "How to be a Ninja: Troubleshooting SQL performance on Azure Virtual Machines". Yes, we are starting the initiation program for becoming a SQL performance troubleshooting NINJA on SQL Server!

Troubleshooting is an art, but the tricks of the trade changed with the advent of Azure Virtual Machines. Performance troubleshooting is different and at the same time very similar to what you have been used to with SQL Server. SQL Server performance on Azure VMs can be a sore point for many, as the host troubleshooting entry points are limited and knowledge of the internal workings is scarce.

In this session, we will show you the best practices you should know for SQL Server instances running on Azure Virtual Machines! We will share tips on automating the implementation of all these best practices during deployment, making it a single one-click operation. This session will be a precursor to our pre-con, where we will go the whole nine yards and detail how to automate deployments from scratch, implement best practices automatically, and analyze performance issues magically!

We hope you can join us for this session online, and we do hope to see you during our pre-con! The 24HOP lineup is full of great sessions from great speakers in the SQL Family. See the full list here. I would recommend looking through the list and signing up for the ones that interest you. This will also give you a preview of what you can expect at the PASS Summit this year.

This 24 Hours of PASS: Summit 2015 Preview event takes place over 24 hours, beginning September 17, 2015, 12:00 UTC. Featuring 24 webcasts delivered over 24 hours, this event provides a glimpse into the unparalleled content on offer at PASS Summit 2015, October 27-30, in Seattle, WA.

WHEN: September 17th at 8AM PDT (3PM GMT)
WHERE: ONLINE
Facebook Event for our session: https://www.facebook.com/events/938656286172663/
Registration link for the event: http://www.sqlpass.org/24hours/2015/summitpreview/Registration.aspx

[UPDATE] September 29th, 2015

Thank you for the feedback that you shared after the session. It is always great to know what people liked in the session, and even better to know where we need to improve. This helps ensure that our next iteration has the necessary tweaks. We received 90% positive feedback overall, and we thank everyone who attended for that!

The replies to the questions from the session are available below.

Q. Regarding the performance fixes shipped as best practices (hotfixes/CUs), do we have separate hotfixes (.msi/.msp) for the Azure environment compared to the on-premises environment?
A. The SQL Server installation bits that you would run on Azure VMs and on virtualized/physical on-premises environments are the same. So there isn’t a different set of fixes that needs to be run on Azure VMs.

Q. Are these Cheat Sheets available online?
A. The cheat sheets are available in the presentation PDF on the 24HOP site.

Q. Does using "Lock Pages in Memory" mean that the total memory allocated by the SQL Server process is not visible in Windows Task Manager?
A. Task Manager is not a good place to look for allocated memory when you want to find allocations made after the Lock Pages in Memory privilege is enabled for the SQL Server service account. You could either look at the Total Server Memory perfmon counter or the memory DMVs to track SQL Server memory usage. Additional reference: https://msdn.microsoft.com/en-us/library/ms176018.aspx
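As a minimal sketch of the DMV route (assuming SQL Server 2008 or later, where sys.dm_os_process_memory is available), the following shows both the locked page allocations and the Total Server Memory counter queried from T-SQL:

-- locked_page_allocations_kb is non-zero when Lock Pages in Memory is in effect
SELECT physical_memory_in_use_kb,
       locked_page_allocations_kb
FROM sys.dm_os_process_memory;

-- The Total Server Memory perfmon counter, exposed through a DMV
SELECT cntr_value AS total_server_memory_kb
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Total Server Memory (KB)';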

Q. Why are you disabling caching on the log file drive?
A. This is due to the IO patterns that the SQL Server transaction log file receives and the way Azure storage is structured. We have seen in tests that SQL Server transaction log performance is best when write caching is disabled for the disks that host transaction log files. We will talk about this in detail during our pre-con session.

Q. For data warehousing workloads, do you recommend the Lock Pages in Memory setting on an on-premises/Azure VM hosting SQL Server?
A. For on-premises workloads, we recommend you test and ascertain the needs before enabling the Lock Pages in Memory (LPIM) privilege. For Azure VM workloads, the first important task is to pick the machine with the right SKU. We recommend enabling LPIM to prevent paging to the local disk on the rack, which can negatively affect performance.

Q. Why are there different storage options based on Windows version? Is there any dependency on SQL versions?
A. There aren’t different storage options based on the Windows version. The different storage options are based on the performance tier that you want to be on; the choice is agnostic of both Windows and SQL Server version and release.

Q. Can you point me to resources on Azure Storage?
A. The Azure storage documentation is a good place to start for this. We will talk about this in detail in the IaaS introduction part of our pre-con.

If we have missed any question, please leave your question in the comment section of this post and we will answer it.

Lastly, we loved the notes that Matt Penny [t] took during our session. A screenshot of that is shown below. Thank you Matt! The 24HOP session presentation is attached on the session page.

(Screenshot: Matt Penny’s session notes)

Default Trace – Performance Issues

The default trace in SQL Server 2005 and above tracks multiple events that can be significantly useful for finding areas of improvement. The events that I will be concentrating on are:

1. Missing Column Statistics – This event class indicates that column statistics that could have been useful to the optimizer are not available, which means an incorrect cardinality estimate could occur. This can cause the optimizer to choose a less efficient query plan than expected. You will not see this event produced unless the option to auto-create statistics is turned off.

2. Missing Join Predicate – This event class indicates that a query is being executed that has no join predicate. (A join predicate is the ON search condition for a joined table in a FROM clause.) This could result in a long-running query. This event is produced only if both sides of the join return more than one row.

3. Sort Warnings – This event class indicates that sort operations did not fit into memory. This does not include sort operations involving the creation of indexes, only sort operations within a query (such as an ORDER BY clause used in a SELECT statement). The EventSubClass field in this event shows whether it was a single pass or a multiple pass. A single pass (EventSubClass = 1) means that when the sort table was written to disk, only a single additional pass over the data was required to obtain sorted output. A multiple pass (EventSubClass = 2) means that multiple passes over the data were required to obtain sorted output. A multiple pass is an enemy of query performance.

4. Hash Warnings – This event class can be used to monitor when a hash recursion or cessation of hashing (hash bailout) has occurred during a hashing operation. Hash recursion (EventSubClass = 0) occurs when the build input does not fit into available memory, resulting in the split of the input into multiple partitions that are processed separately. Hash bailout (EventSubClass = 1) occurs when a hashing operation reaches its maximum recursion level and shifts to an alternate plan to process the remaining partitioned data. Hash bailout usually occurs because of skewed data. Another enemy of performance!

5. Server Memory Change – This event class occurs when Microsoft SQL Server memory usage has increased or decreased. You can even determine the current memory usage after the increase or decrease.

6. Log File Auto Grow – This event class indicates that the log file grew automatically. This event is not triggered if the log file is grown explicitly through ALTER DATABASE. Frequent log file growths are not good for performance.

7. Data File Auto Grow – This event class indicates that the data file grew automatically. This event is not triggered if the data file is grown explicitly by using the ALTER DATABASE statement.
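If you want to look at these events directly, a minimal sketch along the following lines reads them out of the default trace; it assumes the default trace is enabled (it is by default) and maps event IDs to names via sys.trace_events rather than hard-coding them:

DECLARE @path nvarchar(260);
SELECT @path = [path] FROM sys.traces WHERE is_default = 1;

SELECT te.name AS event_name,
       t.DatabaseName,
       t.EventSubClass,
       t.StartTime
FROM sys.fn_trace_gettable(@path, DEFAULT) AS t
INNER JOIN sys.trace_events AS te
    ON t.EventClass = te.trace_event_id
WHERE te.name IN ('Missing Column Statistics', 'Missing Join Predicate',
                  'Sort Warnings', 'Hash Warning', 'Server Memory Change',
                  'Log File Auto Grow', 'Data File Auto Grow')
ORDER BY t.StartTime DESC;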

Since this information is already available in the default trace, I decided to use my Default Trace Statistics Power View Excel sheet to track it graphically. And this is what I got (see screenshot 1)!

(Screenshot 1: DefaultTrace_PerfIssues)

So what is the above Excel sheet displaying?

1. The first column chart shows the Data and Log file auto-grow events per database.

2. The first matrix in the middle of the Excel sheet shows the number of Sort Warnings and Hash Warnings, with drill-down capabilities for each database to see the EventSubClass fields.

3. The second matrix shows the Missing Column Statistics and the Missing Join Predicate events for each database. The drill-down capability gives the name of the missing column statistics.

4. The line graph shows the change in memory for the SQL Server database engine.

Happy monitoring!

Previous posts in this series:

Schema Changes History Report

A shot in the arm for sp_purge_data

I recently worked on a performance issue with the Management Data Warehouse job “mdw_purge_data”. This job calls the stored procedure sp_purge_data, which was taking a long time to complete. This stored procedure just got a shot in the arm with an update released by the SQL Server developer working on it. The T-SQL for the modified stored procedure is available on the SQL Agent MSDN blog. The actual stored procedure will be available in an upcoming update for SQL Server 2008 R2.

Addition: July 20th, 2011: SQL Server 2008 R2 Service Pack 1 has various updates for the purge process, which remove the need to modify the procedure using the script mentioned in the aforementioned blog post.

A recent discussion on Twitter reminded me that I had this as one of the items to blog about so here it is. Now for some of the interesting technical details.

The poor performance of the stored procedure has been addressed through a refactoring of the stored procedure code, but the issue doesn’t occur on all Management Data Warehouse databases.

The issue occurs when you have large amounts of data stored in [snapshots].[query_stats] for each unique snapshot ID, or when you have orphaned rows in the [snapshots].[notable_query_plan] and [snapshots].[notable_query_text] tables of the Management Data Warehouse. The orphaned-row growth was corrected in the builds mentioned in the KB article below:

970014    FIX: The Management Data Warehouse database grows very large after you enable the Data Collector feature in SQL Server 2008
http://support.microsoft.com/default.aspx?scid=kb;EN-US;970014

The above KB article also mentions a workaround which allows you to clean up the orphaned rows using a T-SQL script.
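As a rough illustration of the kind of check involved (this is not the KB script, and the key columns are assumptions based on the MDW snapshot tables mirroring sys.dm_exec_query_stats), counting the orphaned rows could look like this:

-- Run in the Management Data Warehouse database.
-- Assumption: notable_query_plan is keyed on plan_handle and
-- notable_query_text on sql_handle, matching snapshots.query_stats.
SELECT
    (SELECT COUNT(*)
     FROM snapshots.notable_query_plan AS qp
     WHERE NOT EXISTS (SELECT 1 FROM snapshots.query_stats AS qs
                       WHERE qs.[plan_handle] = qp.[plan_handle])) AS orphaned_plans,
    (SELECT COUNT(*)
     FROM snapshots.notable_query_text AS qt
     WHERE NOT EXISTS (SELECT 1 FROM snapshots.query_stats AS qs
                       WHERE qs.[sql_handle] = qt.[sql_handle])) AS orphaned_texts;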

The “Query Statistics” data collection set collects query statistics, T-SQL text, and query plans of most of the statements that affect performance, enabling analysis of poorly performing queries in relation to overall SQL Server Database Engine activity. It makes use of the DMV sys.dm_exec_query_stats to populate the entries in the snapshots.query_stats table. If this DMV returns a large number of rows each time the collection set runs, the number of rows in the actual Management Data Warehouse table “snapshots.query_text” can grow very quickly. This will then cause performance issues when the purge job executes on the MDW database.

In summary, if you are using Management Data Warehouse to monitor a highly active server where a lot of ad-hoc or prepared SQL queries are being executed, then I would recommend you modify the sp_purge_data stored procedure using the MSDN blog link I mentioned earlier in this post. I have seen significant performance improvement due to this in the specific environment that I was working on.

QnA: Performance Dashboard Webcast

I had an overflow of questions during the webcast today and wasn’t able to answer a few as we ran short of time. I shall attempt to address the most common questions asked during the webcast.

How do I configure SQL Server Performance Dashboards?
SQL Server Performance Dashboard was originally released for SQL Server 2005. Setup instructions are available with the download link. However, the same install can be tweaked to work for SQL Server 2008 and SQL Server 2008 R2 using the blog post by Sudarshan: Configuring Performance Dashboard for SQL Server 2008
Download link: SQL Server 2005 Performance Dashboard

The SQL Server instance being monitored must be running SQL Server 2005 SP2 or later. After completing the installation, you must:
1. Run the Setup.sql file on each instance of SQL Server 200x that you wish to monitor with the SQL Server 2005 Performance Dashboard Reports.
2. Open the performance_dashboard_main.rdl file with the Custom Reports functionality new to Management Studio in SQL Server 2005 Service Pack 2.

Does Performance Dashboard work for SQL Server 2000?
No. Performance Dashboard uses DMVs, which are only available in SQL Server 2005 and higher, to generate the data for the reports.

Where can I get the presentation and webcast recording?
The presentation and webcast recording for the entire webcast series will be made available on the SQLServerFAQ MSDN blog. I shall post an update once both are available.

What is the performance impact of running these reports?
If your server is unresponsive or experiencing high CPU usage on all processors, then these reports will also face issues while retrieving diagnostic data from the DMVs. However, the overall impact of running these reports on a SQL Server instance is very minimal compared to any other performance diagnostic tool.

Can these reports be used on SQL Azure or any other RDBMS?
No, they cannot be used to monitor SQL Azure databases or any other non-SQL Server RDBMS environment.

What permissions are required for using Performance Dashboard?
You need to have VIEW SERVER STATE permission to view all the information exposed by Performance Dashboard as most of the information is collected from DMVs.
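For instance (a minimal sketch; the login name is a placeholder), granting the permission looks like this:

-- Grant the server-level permission needed by the Performance Dashboard reports
GRANT VIEW SERVER STATE TO [DOMAIN\PerfDashboardUser];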

Can these reports be deployed to a local shared repository?
You can use Visual Studio to create custom reports similar to the Performance Dashboard reports, or modify the existing reports as per your needs and deploy them. Be advised that this will be a customized solution and not supported by CSS w.r.t. performance issues while running these reports or configuration issues.

Lastly, thank you for attending the session! And for your patience.

In case there are more questions, please feel free to add a comment and I shall reply back accordingly.

Cheers!

 

Webcast: Understanding Performance Bottlenecks using Performance Dashboard

I will be delivering a webcast on “Understanding Performance Bottlenecks using Performance Dashboard”. The idea of this webcast is to demonstrate the use of tools offered by Microsoft to analyze performance bottlenecks while the issue is happening. This will be a good follow-up to the recent write-up that I did for SSWUG on tackling SQL performance issues using tools provided by Microsoft.

Time: 2:30PM IST – 3:45PM IST
Date: 4th May, 2011

Here is the session abstract:

This session will demonstrate how to troubleshoot a SQL Server performance issue using out-of-the-box features without having to collect diagnostic data for post-mortem analysis.

Apart from this session, there are other sessions being delivered as part of this webcast series.

Managing and Optimizing Resources for SQL Server (Date: May 2, 2011 Time: 2:30 pm – 3:45 pm IST)

Optimizing and Tuning Full Text Search for SQL Server (Date: May 3, 2011 Time: 2:30 pm – 3:45 pm IST)

Cool Tools to have for SQL Server DBA (Date: May 5, 2011 Time: 2:30 pm – 3:45 pm IST)

Learn Underappreciated Features of SQL Server to Improve Productivity (Date: May 6, 2011 Time: 2:30 pm – 3:45 pm IST)

If you are interested, here is the registration link for signing up: http://virtualtechdays.com/SQLServer2008R2/

 

T-SQL Tuesday #17: It’s all about APPLYcation this time

It’s time for another round of T-SQL Tuesday, and this round of the revolving blog party is being hosted by Matt Velic [Blog | Twitter].

APPLY – that is the topic for this month’s T-SQL Tuesday! The APPLY operator was added to the T-SQL repertoire, and it has reduced the use of cursors in a large number of the diagnostic scripts that CSS uses to collect data while working on SQL performance issues. In this blog post, I shall share a few examples of such queries that we use to collect data while working on SQL performance cases.

TOP Query Plan Statistics

The following query gives you a list of the SQL batches/procedures with their CPU usage, query/batch duration, and physical reads rank. This query helps identify the top CPU/duration/read consuming queries by making use of system DMVs. The output below is useful for the following reasons:

1. I get the use count of the procedure/batch. If the batch is called multiple times but the use count of a Compiled Proc cached object is only 1, then the plan is not being reused. This tells me that I need to look at the reasons behind the lack of plan reuse.

2. I get the total and average resource usage statistics for each of the queries listed in the output.

3. A quick glance at the output gives me an idea of the most expensive queries on the instance w.r.t. reads and/or CPU and/or query duration.


SELECT
    LEFT(p.cacheobjtype + ' (' + p.objtype + ')', 35) AS cacheobjtype,
    p.usecounts,
    p.size_in_bytes/1024 AS size_in_kb,
    PlanStats.total_worker_time/1000 AS tot_cpu_ms,
    PlanStats.total_elapsed_time/1000 AS tot_duration_ms,
    PlanStats.total_physical_reads,
    PlanStats.total_logical_writes,
    PlanStats.total_logical_reads,
    PlanStats.CpuRank,
    PlanStats.PhysicalReadsRank,
    PlanStats.DurationRank,
    LEFT(CASE WHEN pa.value = 32767 THEN 'ResourceDb'
              ELSE ISNULL(DB_NAME(CONVERT(sysname, pa.value)), CONVERT(sysname, pa.value))
         END, 40) AS dbname,
    sql.objectid,
    CONVERT(nvarchar(50), CASE WHEN sql.objectid IS NULL THEN NULL
              ELSE REPLACE(REPLACE(sql.[text], CHAR(13), ' '), CHAR(10), ' ') END) AS procname,
    REPLACE(REPLACE(SUBSTRING(sql.[text], PlanStats.statement_start_offset/2 + 1,
              CASE WHEN PlanStats.statement_end_offset = -1
                   THEN LEN(CONVERT(nvarchar(max), sql.[text]))
                   ELSE PlanStats.statement_end_offset/2 - PlanStats.statement_start_offset/2 + 1
              END), CHAR(13), ' '), CHAR(10), ' ') AS stmt_text
FROM
(
    SELECT
        stat.plan_handle,
        statement_start_offset,
        statement_end_offset,
        stat.total_worker_time,
        stat.total_elapsed_time,
        stat.total_physical_reads,
        stat.total_logical_writes,
        stat.total_logical_reads,
        ROW_NUMBER() OVER (ORDER BY stat.total_worker_time DESC) AS CpuRank,
        ROW_NUMBER() OVER (ORDER BY stat.total_physical_reads DESC) AS PhysicalReadsRank,
        ROW_NUMBER() OVER (ORDER BY stat.total_elapsed_time DESC) AS DurationRank
    FROM sys.dm_exec_query_stats stat
) AS PlanStats
INNER JOIN sys.dm_exec_cached_plans p
    ON p.plan_handle = PlanStats.plan_handle
OUTER APPLY sys.dm_exec_plan_attributes(p.plan_handle) pa
OUTER APPLY sys.dm_exec_sql_text(p.plan_handle) AS sql
WHERE (PlanStats.CpuRank < 50
    OR PlanStats.PhysicalReadsRank < 50
    OR PlanStats.DurationRank < 50)
  AND pa.attribute = 'dbid'
ORDER BY tot_cpu_ms DESC;


Top Queries with Similar Query Hash and Query Plan Hash

 

SELECT TOP 10
    query_plan_hash,
    query_hash,
    COUNT(DISTINCT query_plan_hash) AS 'distinct query_plan_hash count',
    SUM(execution_count) AS 'execution_count',
    SUM(total_worker_time) AS 'total_worker_time',
    SUM(total_elapsed_time) AS 'total_elapsed_time',
    SUM(total_logical_reads) AS 'total_logical_reads',
    MAX(REPLACE(REPLACE(SUBSTRING(st.[text], qs.statement_start_offset/2 + 1,
            CASE WHEN qs.statement_end_offset = -1
                 THEN LEN(CONVERT(nvarchar(max), st.[text]))
                 ELSE qs.statement_end_offset/2 - qs.statement_start_offset/2 + 1
            END), CHAR(13), ' '), CHAR(10), ' ')) AS sample_statement_text,
    MIN(CAST(query_plan AS varchar(max))) AS 'ShowPlan XML'
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS sp
GROUP BY query_plan_hash, query_hash
ORDER BY SUM(total_worker_time) DESC;

This query can help you identify queries that have the same query plan hash.

The SQL Server Books Online topic “Finding and Tuning Similar Queries by Using Query and Query Plan Hashes” has more information on this topic. The query hash feature was added in SQL Server 2008, which made it easier to troubleshoot performance issues caused by ad-hoc queries that differ only in literal values. RML Utilities does a similar task by creating a query hash, but now, if you are troubleshooting on the server, you can do this using DMVs without having to capture a profiler trace.

The right operand supplied to the APPLY operator is a function of one or more column values present in the left operand. So basically, the right operand is a table-valued expression that is evaluated once for each row that appears in the left operand. CROSS APPLY and OUTER APPLY are the two flavors of the APPLY operator. If I wanted to simulate an APPLY operation without the operator itself, it would require the use of temporary tables or table variables.
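As a minimal illustration of that evaluate-once-per-row behavior (a sketch using standard DMVs, not one of the CSS scripts), the following fetches the statement text for each currently executing request:

-- sys.dm_exec_sql_text is evaluated once per row of sys.dm_exec_requests
SELECT r.session_id,
       r.status,
       st.[text] AS current_statement
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
WHERE r.session_id > 50;  -- skip most system sessions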

I use the APPLY operator a lot while parsing XML data like Process Monitor traces or XML query plans, which makes life a lot easier and saves me from writing a huge amount of T-SQL code.
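For example, here is a hedged sketch of shredding a cached showplan with APPLY and the XML nodes() method; the choice of the top CPU consumer and the attributes pulled out are just illustrative:

-- Shred the physical operators out of the plan of the top CPU-consuming query
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT op.node.value('@PhysicalOp', 'nvarchar(60)') AS physical_op,
       op.node.value('@EstimateRows', 'float') AS estimated_rows
FROM (SELECT TOP (1) qp.query_plan
      FROM sys.dm_exec_query_stats AS qs
      CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
      WHERE qp.query_plan IS NOT NULL
      ORDER BY qs.total_worker_time DESC) AS p
CROSS APPLY p.query_plan.nodes('//RelOp') AS op(node);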

Profiler events for different scenarios

A few moons back, I blogged about the high frequency events in SQL Server. In this post, I shall talk about the common scenarios for which profiler traces are collected for SQL performance issues, and what kind of events need to be captured in the server-side traces that are configured for the data collection.

While configuring a profiler trace, always include all data columns for each event, as the data columns do not add as much overhead as the number of events being captured does.

In this post, I shall cover two of the most common issues that CSS collects profiler data for addressing SQL Performance issues: Attentions/Timeouts and High CPU usage/Long Running queries.

Attentions or Timeouts

Attentions or timeouts can occur due to a client-side application timeout setting (this is most often the case) or due to a database-engine-side issue (like a failure to acquire memory, a problem in the server-side connectivity layer, etc.). During this time, you will have a lot of queries starting but not completing successfully. If you are dealing with such a scenario, you would need to capture the Starting events using a profiler trace to find out which statements didn’t complete and encountered the Attention event.

The events that need to be captured are:

All Database Events

Log File Auto Shrink
Data File Auto Shrink
Log File Auto Grow
Data File Auto Grow

All Errors and Warnings

Attention (Important for tracking the timeouts)
ErrorLog
EventLog
Exception
Hash Warning
Execution Warnings
Sort Warnings
Missing Column Statistics
Missing Join Predicate
Exchange Spill Event
Blocked process report
User Error Message
Background Job Error
Bitmap Warning
Database Suspect Data Page
CPU threshold exceeded

Lock:Deadlock Chain and Lock:Timeout (timeout > 0)

Performance: Auto Stats

Showplan All or Showplan XML (As execution plan will not be available for queries which didn’t complete)

Showplan Statistics Profile or Showplan XML Statistics Profile (For comparing estimated and actual query plan)

Security Audit: Audit Login, Audit Login Failed and Audit Logout

Stored Procedures events:

RPC:Starting
SP:Recompile
SP:Starting
SP:Completed
SP:StmtStarting
SP:StmtCompleted
RPC:Completed

Transactions: SQLTransaction (required for tracing back the transaction if there are nested transactions). DTCTransaction is required only if DTC transactions are getting timed out.

TSQL events:

SQL:BatchCompleted
SQL:BatchStarting
SQL:StmtStarting
SQL:StmtCompleted
SQL:StmtRecompile

Cursor events: CursorRecompile and CursorImplicitConversion (Required for checking if the requested cursor is being implicitly converted or if the cursor is being frequently recompiled).

The above events should be sufficient to track down any query timeouts and their common root causes.
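If you script such a trace on the server side rather than through the Profiler UI, a minimal sketch looks like the following; the file path is a placeholder, the Attention columns shown are illustrative, and a complete trace would add sp_trace_setevent calls for every event and data column in the list above:

-- Create a server-side trace capturing the Attention event (event ID 16)
DECLARE @traceid int, @maxsize bigint, @on bit;
SELECT @maxsize = 512, @on = 1;

EXEC sp_trace_create @traceid OUTPUT, 2, N'D:\Traces\timeouts', @maxsize;  -- 2 = file rollover

EXEC sp_trace_setevent @traceid, 16, 12, @on;  -- Attention: SPID
EXEC sp_trace_setevent @traceid, 16, 14, @on;  -- Attention: StartTime
EXEC sp_trace_setevent @traceid, 16, 15, @on;  -- Attention: EndTime

EXEC sp_trace_setstatus @traceid, 1;  -- start the trace
SELECT @traceid AS traceid;  -- keep this handy to stop and close the trace later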

High CPU usage/Long Running queries

Another set of issues commonly dealt with while troubleshooting SQL performance is long-running queries or queries causing high CPU usage on the server. In such cases, the above set of events would definitely be useful. If you are certain that the profiler trace can be captured for the entire duration of the query execution, then for highly transactional servers you can omit the capture of the following events:

SQL:BatchStarting
SQL:StmtStarting
SP:StmtStarting
SP:Starting
RPC:Starting

The above logic of configuring a profiler trace will also help you capture trace data when you are trying to track down queries which are causing excessive amount of IO on the SQL instance.

Reference for capturing blocking information on a SQL instance: SQL 2000 Blocker Script output/SQL 2005 PerfStats

Keep in mind that the profiler data captured may not be sufficient for root cause analysis if you are not capturing SQL blocking information and performance monitor data in conjunction with the profiler traces.

If you need additional events like deadlock or lock events, refer to my previous post to find out whether they fall in the high frequency event category and when to capture them.

TIP: As much as possible, avoid adding text filters to your profiler trace configuration, as filter evaluation can lead to high CPU usage in highly active OLTP environments.

I hope the above information will help you reduce excessive data collection while troubleshooting SQL performance issues.