Wikis: Disk Contention Issues

I had written this post a while back and decided to add a bit of more context w.r.t. Perfmon and other DMVs that can be used to fetch information.

It is said that the recommendation was that Average Disk Queue Length (perfmon counter in Logical/Physical Disk object) for disks be less than 2 * number of spindles on the disk. So if your disk had 10 spindles, then values above 20 woud be a cause of concern. But on a SAN or on virtualized environments, this particular counter can be misleading at times. So what was valid a few years has become somewhat of an urban legend. I personally am not a very big fan of calling a disk performance bad just based on this counter alone and without knowing how the underlying physical disk structure looks like.

To get a complete picture of your I/O performance, you might want to the check the following counters under perfmon too (keep in mind IO vendor logging and if you are using Virtual Machines, logging I/O calls at the virtualization bus layer can become equally important in some cases):
Physical Disk:
Avg. Disk Secs/Write
Avg. Disk Secs/Read
Avg. Disk Secs/Transfer

Logical Disk:
Avg. Disk Secs/Write
Avg. Disk Secs/Read
Avg. Disk Secs/Transfer

Also, you would want to find out if there are a lot of Split I/O (another Perfmon disk counter) occurring on the server. If that is the case, then most probably, you have disk level fragmentation on the drives storing SQL database files. That doesn’t mean you start de-fragmenting your disk drives on which SQL Server database files reside on.

For disks on which SQL Server Data (MDF,NDF) and Log (LDF) files reside, we recommend that these counters (Avg Disk Secs/Read,Write,Transfer) show values less that 30 milli-seconds or they do not represent frequent spikes or prolonged response times above 30 milli-seconds. If they are constantly showing values of 0.5 and above, then we have a serious disk contention issue on our hands. What this means is that your disk is roughly taking 0.5 second to perform an I/O operation which on database server environments would be a performance bottleneck. However, a low average value and very frequent spikes for the counters mentioned above which occur during the the timeframe of your performance issues is also not a good sign and would require a further investigation.

The other counters of interest in Perfmon are:

Disk Bytes/Sec
Disk Read Bytes/Sec
Disk Write Bytes/Sec

You can drill down into the above counters on the basis of logical disk or physical disk to determine which disk is receiving the highest amount of I/O activity. As always with performance issues, what is a good value *depends* on the throughput of the disks. Depending on how much I/O these disks are receiving and the throughput of the disks, you will be able to determine if the disks are capable of hanlding the load.

The disk contention could be due to these reasons:
1. There is an underlying H/W issue. This could mean that some driver or firmware needs an upgrade (essentially all your firmware/drivers should always be on the latest builds available) or it could be a problem with the SAN configuration or it could be that this particular SAN is not dedicated to SQL Server.
2. OR it could be that the disk contention is due to the fact that the queries executing on SQL Server are performing excessive amount of reads due to sub-optimal plans. Sub-optimal plans would be generated if your indexes are out of shape or if your statistics are skewed. This would require a statistics update with a full scan. (Please refer my statistics update post.) You should be able to find out which process is driving IO by looking at the Process performance object in Perfmon and the I/O Data Bytes counter. There is no fixed value for good and bad performance for this counter. The amount of I//O being pushed through should be compared with the response time of the disks and the I/O thresholds which the disk can handle.
3. OR it could be that the disk activity on the disks has increased over a period of time due to increase in data or addition of new databases. In such a case, it would be a good idea to provide a separate disk or storage area for some of the databases as the current activity might be too much for the disk to handle. Sometimes, load balancing for physical storage also helps in alleviating a disk contention issue.

Also, it would be a good idea to check if you are receiving any stuck/stalled IO warnings for any particular database in the SQL Server ERRORLOGs. (Reference: Stuck Stalled IO post)

What kind of waits would you monitor for when looking at wait statistics while troubleshooting disk contention. The most common waits that you would see in the sys.sysprocesses or sys.dm_exec_requests DMV output are PAGEIOLATCH_XX waits (XX = EX or SH most commonly) or WRITELOG under the wait types. Keep in mind, that the wait duration for these wait types should be non-zero. It could either keep increasing over a period of time for a particular request. Or it could show non-zero values which are constantly dropping to zero and increasing which means that there are multiple requests which are waiting on IO to complete. If you experience WRITELOG waittypes for high durations for user transactions, then that would have a cascading effect as SQL Server follows a Write-Ahead-Logging (WAL) protocol which means that the log is written into in sequential order. If one transaction is waiting for a log write operation to complete, then another concurrent transaction on the same database would also experience delays.

You can identify the queries performing high amount of I/O by collecting profiler traces and tracking the reads/writes done by the statements and batches executing on the server during that time. You only need Statement and Batch Completed events in case you are not encountering any attentions. Refer my post on profiler for the set of profiler events to be avoided.

Addition: May 5, 2011:

You can also use the query below to find out which database is receiving the maximum amount of I/O and on which database are the IO stalls being experienced. This information is retrieved from sys.dm_io_virtual_file_stats.


SELECT DB_NAME (f.database_id) AS database_name, f.database_id, f.name AS logical_file_name, f.[file_id], f.type_desc,

CAST (CASE WHEN LEFT (LTRIM (f.physical_name), 2) = '\\' THEN LEFT (LTRIM (f.physical_name), CHARINDEX ('\', LTRIM (f.physical_name), CHARINDEX ('\', LTRIM (f.physical_name), 3) + 1) - 1)

WHEN CHARINDEX ('\', LTRIM(f.physical_name), 3) > 0 THEN UPPER (LEFT (LTRIM (f.physical_name), CHARINDEX ('\', LTRIM (f.physical_name), 3) - 1))

ELSE f.physical_name END AS nvarchar(255)) AS logical_disk,

fs.num_of_reads, fs.num_of_bytes_read, fs.io_stall_read_ms, fs.num_of_writes, fs.num_of_bytes_written,

fs.io_stall_write_ms, fs.size_on_disk_bytes

FROM sys.dm_io_virtual_file_stats (default, default) AS fs

INNER JOIN sys.master_files AS f ON fs.database_id = f.database_id AND fs.[file_id] = f.[file_id]

*The above query is the same query that is used by Management Data Warehouse to collect a sub-set of the data using the “Server Activity” collection set.

If you want to drill-down further into the current low-level I/O, locking, latching, and access method activity for each partition of a table or index in the database identified in the above query, then you can use the query below to get that information for a particular database(s) and also identify if their is a missing index recommended for that table. The information for this is retrieved using sys.dm_db_index_operational_stats. Keep in mind that the data returned by sys.dm_db_index_operational_stats exists only as long as the metadata cache object that represents the heap or index is available. This data is neither persistent nor transactionally consistent. This means you cannot use these counters to determine whether an index has been used or not, or when the index was last used.

-- Shows objects which have page_io_latch_wait_count > 0

select db_name(d.database_id) as database_name,

quotename(object_schema_name(d.object_id, d.database_id)) + N'.' + quotename(object_name(d.object_id, d.database_id)) as object_name,

d.database_id,

d.object_id,

d.page_io_latch_wait_count,

d.page_io_latch_wait_in_ms,

d.range_scans,

d.index_lookups,

case when mid.database_id is null then 'N' else 'Y' end as missing_index_identified

from (select

database_id,

object_id,

row_number() over (partition by database_id order by sum(page_io_latch_wait_in_ms) desc) as row_number,

sum(page_io_latch_wait_count) as page_io_latch_wait_count,

sum(page_io_latch_wait_in_ms) as page_io_latch_wait_in_ms,

sum(range_scan_count) as range_scans,

sum(singleton_lookup_count) as index_lookups

from sys.dm_db_index_operational_stats(NULL, NULL, NULL, NULL)

where page_io_latch_wait_count > 0

group by database_id, object_id ) as d

left join (select distinct database_id, object_id from sys.dm_db_missing_index_details) as mid on mid.database_id = d.database_id and mid.object_id = d.object_id

where d.row_number <= 20 and d.database_id = DB_ID('AdventureWorks') -- Replace database name with the database that you are interested in

-- Comment out the line below if you want to look at I/O for system tables as well

and d.object_id not in (select object_id from sys.objects where is_ms_shipped = 1)

*The query above is  a variation of the query that is used by the “IO Statistics” report present in SQL Server Performance Dashboard.

If you are interested in index usage statistics for that database, then you do so using sys.dm_db_index_usage_stats.

Example:


select OBJECT_NAME(object_id,database_id) as ObjectName, object_id, index_id, user_seeks, user_scans, user_lookups, user_updates, last_user_seek, last_user_scan, last_user_lookup, last_user_update

from sys.dm_db_index_usage_stats

where database_id = DB_ID('AdventureWorks') -- Replace database name with the database that you are interested in

and (user_lookups <> 0 or user_scans <> 0 or user_seeks <> 0)

order by user_scans desc

Related Articles

Sometimes, it’s also worthwhile checking the activity on tempdb and finding out if the contention is due to tempdb activity. The links below could be helpful:

Missing Indexes Feature in SQL Server 2005

Indexes are essential in making sure that your queries have efficient query plans and for SELECTS, you don’t end up doing searches on HEAPS.

One of the biggest improvements in SQL Server 2005 is that it tracks all the transactions happening on the server and makes a list of indexes which could prove beneficial for those queries. Of course, one index might be beneficial for one query but detrimental to another. So, it is highly essential that we test the feasibility of implementing these indexes on a production environment before rolling out changes to our indexes.

The missing indexes feature is on by default. No controls are provided to turn the feature on or off, or to reset any of the tables returned when the dynamic management objects are queried. When SQL Server is restarted, all of the missing index information is dropped. This feature can only be disabled if an instance of SQL Server is started by using the -x argument (one of the startup parameters) with the sqlservr command-prompt utility. However, along with the missing indexes, a lot of other useful events for performance troubleshooting will not be captured which is why this startup option is not recommended.

So, the first thing I would do if a query or a set of queries are running slowly in SQL Server, I would query the DMVs related to these missing indexes and find out if there are an indexes related to the tables on which those slow running queries are executing. This feature becomes highly useful when the following conditions are true:
1. NO CPU bottleneck
2. NO Blocking on the server
3. NO Disk bottleneck

You can view a list of all the missing indexes using Performance Dashboard which can be used with SQL Server 2005 Service Pack 2 and above.

You can get the CREATE INDEX commands for the missing indexes recommended on your instance using the query mentioned here.

If you want to test an automatic determination of the index candidates for your workload, then have a look at the post below:
http://blogs.msdn.com/b/queryoptteam/archive/2006/06/01/613516.aspx

Related Links for Missing Indexes
About Missing Indexes
http://msdn2.microsoft.com/en-us/library/ms345524.aspx
Finding Missing Indexes
http://msdn2.microsoft.com/en-us/library/ms345417.aspx
Limitations of this feature
http://msdn2.microsoft.com/en-us/library/ms345485.aspx

WMI Script for Changing SQL Server 2000 and 2005 passwords

Our domain security policy requires us to change our Domain Account passwords every once in a while. This security feature is a good practice but the change of password requires me to change my SQL Server component startup accounts to be changed also as most of them run under my Domain Account.

Tired of accomplishing this task which requires me to type out my user name and password (twice for the reconfirmation 🙂 )  brought forward the need for automating this task. Once I did this, I thought it would be prudent for me to share the same with the SQL community. WMI is quite powerful and a number of tasks can be achieved through it which you shall see in my future blog posts. I have created a category in my blog completely for WMI.

Addition: October 24th, 2011

If you are using SQL Server 2008 R2, then you need to have this fix applied so that SetServiceAccount can work:

2397020 FIX: “Access is denied. [0x80070005]” error message occurs in SSCM when you try to change the password of an account of SQL Server 2008 R2 in Windows Vista or in a later version of Windows
http://support.microsoft.com/default.aspx?scid=kb;EN-US;2397020

If you use SetServiceAccount to change the password, then this will result in a SQL Server service restart. The script provided on the Technet Gallery is for changing both the service account and the password. If only the password needs to be changed for the service account, then use SetServiceAccountPassword. This will change the password and will not automatically restart your SQL Server service.

Download the file: pwdchange file and change the name to pwdchange.vbs. This can be executed from command prompt using the following command (provided you have Windows Script Host which almost all Windows Servers have):

cscript pwdchange.vbs

The script has comments written to help you understand the same. This script makes use of the following namespaces:

\root\Microsoft\SqlServer\ComputerManagement (SQL Server 2005)

\root\CIMV2 (SQL Server 2000)

The reason I do not use SQL Server 2000 WMI namespace as it is not installed by default with a SQL Server 2000 default installation. It has to be installed separately. So, I decided to use CIMV2 which has Win32_Service for achieving the same purpose. In the script you would have to provide the following details:

strAccount1 = “Account1” ‘ For alias@domain.com
strAccount2 = “Account2” ‘ For domain\alias naming convention
strAccount3 = “Account1” ‘ To cover any other naming convention
strAlias = “Alias” ‘ Provide the alias
strDomain = “DomainName” ‘ Provide the domain name
strPassword = “Set Password Here” ‘ Put the new password here

SQL Server is not configured for Remote Connections

This is one of the most common error messages when users use default settings to connect to a SQL Server Express Instance.

ERROR
Sqlcmd: Error: Microsoft SQL Native Client: An error has occurred while establishing a connection to the server. When connecting to SQL Server 2005, this failure may be caused by the fact that under the default settings SQL Server does not allow remote connections.


Please refer the following article to make sure that your SQL Express Instance is configured to accept Remote Connections:
http://support.microsoft.com/kb/914277

Also, make sure from the SQL Server Configuration Manager that TCP/IP and Named Pipes are enabled and you have VIA protocol disabled.

Another thing you might want to check is if SQL Server Browser Service is up and running.

Federated Databases

This is something that not many people are aware of. Hence, I thought about putting up a brief overview of this side of SQL Server.

To achieve the high levels of performance required by the largest Web sites, a multitier system typically balances the processing load for each tier across multiple servers. SQL Server 2005 shares the database processing load across a group of servers by horizontally partitioning the data in a SQL Server database. These servers are managed independently, but cooperate to process the database requests from the applications; such a cooperative group of servers is called a federation.

A federated database tier can achieve very high levels of performance only if the application sends each SQL statement to the member server that has most of the data required by the statement. This is referred to as collocating the SQL statement with the data required by the statement. Collocating SQL statements with the required data is not a requirement unique to federated servers. It is also required in clustered systems.

Federated Server Tier

There is one instance of SQL Server on each member server.

Each member server has a member database. The data is spread through the member databases.

The tables from the original database are horizontally partitioned into member tables. There is one member table per member database, and distributed partitioned views are used to make it appear as if there was a full copy of the original table on each member server.

The application layer must be able to collocate SQL statements on the member server that contains most of the data referenced by the statement.

Backing Up and Restoring Federated Database Servers

In a federated-database-server tier that is built by using distributed partitioned views, the member servers form one logical unit. Therefore, you must coordinate the recovery of the member databases to make sure that they remain synchronized correctly.

SQL Server 2005 does not require that you coordinate backups across member servers. Backups can be independently taken from each database, without regard for the state of the other member databases. Because the backups do not have to be synchronized, there is no processing overhead for synchronization and no blockage of running tasks.

The most important aspect of recovering a set of member databases is the same as recovering any other database: Plan and test the recovery procedures before you put the databases into production. You must set up processes to restore all the databases to the same logical point in time. SQL Server includes features to support the recovery of all member databases to the same point in time.

Pros

1. Federated servers if implemented correctly are a great way to load balance a database server environment and is very similar to a database farm implementation.

2. This would greatly allow you to distribute the load on your servers based on any of the following criteria:

a. Geographic location

b. Traffic in terms of users

c. Traffic in terms of transactions

d. Database table size

3. Also, federated servers give you the option on partitioning data across servers with the help of distributed partitioned views

4. This gives you the option of horizontally partitioning the data across various servers which ultimately leads to greater throughput

5. It lets you control the traffic coming in and also helps in maintaining the load thresholds across the entire setup

6. Furthermore, if we have a middle tier in the entire setup, then design changes in the federated server environment will not affect the client side applications in any manner as they would be connecting to the middle tier and the middle tier will connect with the database server

Cons

1. One of the major drawbacks is disaster recovery. If one of the member server fails, there needs to be a failback plan in place which could cause minimum hindrance to the normal operations while implementation.

2. Also, if distributed partitioning is being implemented, then rules and constraints need to be strong enough to prevent any sort on inconsistencies from arising due to data modification

3. Also, if the performance of one of the member servers takes a hit, most of the distributed partitioned views would also take a hit in terms of operations done on them

4. The middle tier should be designed in such a robust manner that there is no ambiguity in resolving which server in the environment needs to targeted based on the query coming into the server

5. The backup/restore scenarios need to be designed and planned in such a way that all the member servers are all synchronized at all times

Useful Articles

The following whitepaper and TechNet articles should help a great deal here:

Planning, Implementing, and Administering Scaleout Solutions with SQL Server 2005 Solution

Scaling Out SQL Server with Data Dependent Routing

Designing Data Tier Components and Passing Data Through Tiers