24HOP Session: How to be a Ninja – Troubleshooting SQL PERF on Azure VMs


I have been a bit quiet on this blog but that is partly due to the fact that I have moved into a new role and a new country. I am now a part of the SQL Server Product Group [b|t] and based out of Redmond.

I am actually really excited about this. Sourabh Agarwal [b|t] and I are going to be presenting a preview to our SQL PASS Summit 2015 pre-con session "How to be a Ninja: Troubleshooting SQL performance on Azure Virtual Machines". Yes, we are starting the initiation program of becoming a SQL performance troubleshooting NINJA on SQL Server!

Troubleshooting is an art but the tricks of the trade changed with the advent of Azure Virtual Machines. Performance troubleshooting is different and at the same time very similar to what you have been used to for SQL Server. SQL Server performance on Azure VMs can be a sore point for many as the host troubleshooting entry points are limited and the knowledge of the internal workings scarce.

In this session, we will show you what best practices should be known for SQL Server instances running on Azure Virtual Machines! We will talk about tips on automating the implementation of all these best practices during deployment making this a single one-click deployment. This session will be a pre-cursor to our pre-con where we will go the whole nine yards and detail how to automate deployments from scratch, implement best practices automatically and analyze performance issues magically!

We hope you can join us for this session online and we do hope to see you during our pre-con! The 24Hop sessions are full of great sessions from great speakers in the SQL Family. See the full list here. I would recommend looking through the list and signing up for the ones that you are interested in. This will also give you a preview of what you can expect in the SQL PASS Summit this year.

This 24 Hours of PASS: Summit 2015 Preview event takes place over 24 hours, beginning September 17, 2015, 12:00 UTC. Featuring 24 webcasts delivered over 24 hours, this event provides a glimpse into the unparalleled content on offer at PASS Summit 2015, October 27-30, in Seattle, WA.

WHEN: September 17th at 8AM PST (3PM GMT)
WHERE: ONLINE
Facebook Event for our session: https://www.facebook.com/events/938656286172663/
Registration link for the event: http://www.sqlpass.org/24hours/2015/summitpreview/Registration.aspx

[UPDATE] September 29th, 2015

Thank you for the feedback that you shared after the session. It is always great to know what people liked in the session and even better to know where we need to improve. This helps ensure that our next iteration has the necessary tweaks. We received an overall 90% positive feedback and we thank everyone who attended for that!

The replies to the questions from the session are available below.

Q. Regarding the performance fixes as best practices(hotfixes/CU), do we have separate hotfixes(.msi/msp) for azure environment when compared to on premises environment?
A. The SQL Server installation bits that you would run on Azure VMs and on virtualized/physical on-premises environment are the same. So there aren’t any different set of fixes that need to run on Azure VMs.

Q. Are these Cheat Sheets available online?
A. The cheat sheets are available in the presentation PDF on the 24HOP site.

Q. Is using "Lock Pages in Memory" lead to that total allocated memory amount of SQL Server process is not seen in Windows Task Manager?
A. Task manager is not a good place to look for allocated memory when you want to find out allocations made after enabling Lock Pages in Memory privilege for the SQL Server service account. You could either look at Total Server Memory perfmon counter or the memory DMVs to track SQL Server memory usage. Additional reference: https://msdn.microsoft.com/en-us/library/ms176018.aspx 

Q. Why are you disabling caching on the log file drive?
A. This is due to the IO patterns that the SQL Server transaction log file receives and how Azure storage is structured. We have seen in tests that the performance for SQL Server transaction log is best when write caching is disabled for disks which hosts transaction log files. We will talk about this in detail during our pre-con session.

Q. For Datawarehousing workloads, do you recommend lock pages in memory setting on on-premise/azure VM hosting SQL Server?
A. For on-premise workloads, we recommend you test and ascertain the needs before enabling Lock Pages in Memory (LPIM) privilege. For Azure VM workloads, the first important task is to pick the machine with the right SKU. We recommend enabling LPIM to prevent paging to the local disk on the rack which can negatively affect performance.

Q. Why are there different storage options based on Windows version? Is there any dependency on SQL versions?
A. There aren’t different storage options based on Windows version. The different storage options are based on the performance tier that you want to be on. It is Windows and SQL version and release agnostic.

Q. Can you let me know the resources on Azure Storage?
A. The Azure storage documentation is a good place to start for this. We will talk about this in detail in the IaaS introduction part of our pre-con.

If we have missed any question, please leave your question in the comment section of this post and we will answer it.

Lastly, we loved the notes that Matt Penny [t] took during our session. A screenshot of that is shown below. Thank you Matt! J The 24HOP session presentation is attached on the session page.

Notes

Advertisements

Default Trace–Performance Issues


There are multiple events that a default trace in SQL Server 2005 and above tracks which can be significantly useful for finding out areas of improvement. The events that I will be concentrating on are:

1. Missing Column Statistics – This event class indicates that column statistics that could have been useful for the optimizer are not available due to which an incorrect cardinality estimation could occur. This can cause the optimizer to choose a less efficient query plan than expected. You will not see this event produced unless the option to auto-create statistics is turned off.

2. Missing Join Predicate – This event class indicates that a query is being executed that has no join predicate. (A join predicate is the ON search condition for a joined table in a FROM clause.) This could result in a long-running query. This event is produced only if both sides of the join return more than one row.

3. Sort Warnings – This event class indicates that sort operations do not fit into memory. This does not include sort operations involving the creation of indexes, only sort operations within a query (such as an ORDER BY clause used in a SELECT statement). The EventSubClass field in this event shows whether this was a single pass or a multiple pass. A single pass (EventSubClass = 1) is when the sort table was written to disk, only a single additional pass over the data was required to obtain sorted output. A multiple pass (EventSubClass = 2) is when the sort table was written to disk, multiple passes over the data were required to obtain sorted output. A multiple pass is an enemy of query performance.

4. Hash Warnings – This event class can be used to monitor when a hash recursion or cessation of hashing (hash bailout) has occurred during a hashing operation.  Hash recursion (EventSubClass = 0) occurs when the build input does not fit into available memory, resulting in the split of input into multiple partitions that are processed separately. Hash bailout (EventSubClass = 1) occurs when a hashing operation reaches its maximum recursion level and shifts to an alternate plan to process the remaining partitioned data. Hash bailout usually occurs because of skewed data. Another enemy of performance!

5. Server Memory Change – This event class occurs when Microsoft SQL Server memory usage has increased or decreased. You can even determine what is the current memory usage after the increase or decrease.

6. Log File Auto Grow – This event class indicates that the log file grew automatically. This event is not triggered if the log file is grown explicitly through ALTER DATABASE. Frequent log file growths are not food for performance.

7. Data File Auto Grow – This event class indicates that the data file grew automatically. This event is not triggered if the data file is grown explicitly by using the ALTER DATABASE statement.

Since this information is already available in the default trace, I decided to use my Default Trace Statistics Power View Excel sheet to track this information graphically. And this is what I got (see screenshot 1)!

DefaultTrace_PerfIssues

So what is the above Excel sheet displaying?

1. The information available in the first column chart will show the Data and Log file grow events per database.

2. The first matrix in the middle of the Excel sheet shows the number of Sort Warnings and Hash Warnings with drill-down capabilities for each database to see the EventSubClass fields.

3. The second matrix shows the Missing Column Statistics and the Missing Join Predicate events for each database. The drill-down capability gives the name of the column statistics that was missing.

4. The line graph shows the change in memory for the SQL Server database engine.

Happy monitoring!

Previous posts in this series:

Schema Changes History Report

A shot in the arm for sp_purge_data


I had recently worked on a performance issue with the Management Data Warehouse job “mdw_purge_data”. This job calls a stored procedure sp_purge_data which was taking a long time to complete. This stored procedure just got a shot in the arm with an update to it released by the SQL Server Developer working on it. The T-SQL for the modified stored procedure is available on the SQL Agent MSDN blog. The actual Stored Procedure will be available in an upcoming update for SQL Server 2008 R2.

Addition: July 20th, 2011: SQL Server 2008 R2 Service Pack 1 has various updates for the purge process which doesn’t require modification of the procedure using the script mentioned in the aforementioned blog post.

A recent discussion on Twitter reminded me that I had this as one of the items to blog about so here it is. Now for some of the interesting technical details.

The poor performance of the stored procedure has been addressed through a refactoring of the Stored Procedure code but the issue doesn’t occur on all Management Database Warehouse databases.

The issue occurs when you have large amounts of data stored in the [snapshots].[query_stats] for each unique snapshot ID. If you have orphaned rows in the ‘ [snapshots].[notable_query_plan] ‘ and ‘ [snapshots].[notable_query_text] ‘ tables of the Management Data Warehouse. This was corrected in the builds mentioned in the KB Article below:

970014    FIX: The Management Data Warehouse database grows very large after you enable the Data Collector feature in SQL Server 2008
http://support.microsoft.com/default.aspx?scid=kb;EN-US;970014

The above KB Article also has a workaround mentioned which allows you to clean-up the orphaned rows using a T-SQL script.

The “Query Statistics” data collection set collects query statistics, T-SQL text, and query plans of most of the statements that affect performance. Enables analysis of poor performing queries in relation to overall SQL Server Database Engine activity. It makes use of DMVs sys.dm_exec_query_stats to populate the entries in the snapshots.query_stats table. If this DMV returns a large number of rows for each time the collection set runs, the number of rows in the actual Management Data Warehouse table “snapshots.query_text” can grow very quickly. This will then cause performance issues when the actual purge job executes on the MDW database.

In summary, if you are using Management Data Warehouse to monitor a highly active server where there are a lot of ad-hoc queries or prepared SQL queries being executed, then I would recommend you to modify you the sp_purge_data stored procedure using the MSDN blog link, I mentioned earlier in my post. I have seen significant performance improvement due to this on the specific environment that I was working on.