Sometimes, we see the following messages in the SQL Server 2000 (Service Pack 4 and above) and SQL Server 2005 ERRORLOGs:
2004-11-11 00:21:25.26 spid1 SQL Server has encountered 192 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [E:\db_data2.ndf] in database [user_db] (7). The OS file handle is 0x00000000000074D4. The offset of the latest long IO is:
A given I/O request that becomes stalled or stuck is only reported one time. If the message reports 10 I/Os as stalled, these 10 reports will not occur again. If the next message reports 15 I/Os stalled, it indicates that 15 new I/O requests have become stalled.
Most of these messages if occur frequently in a SQL Server, then it could be due to one of the following:
1. There is an underlying hardware issue where in your firmware and disk drivers need to be upgraded. This is mostly done by the Internal Storage Team or preferably by the Hardware Vendor
2. Or the disks are overwhelmed with the IO Requests being posted on the risks. In this case, you would want to go ahead and make sure that if possible some of the data files and log files would need to be moved to a different drive
3. Or it could be due to bad plans SQL Server ends up performing extra number reads than required
Also, monitoring the disks using the following counters under performance monitor under the would be helpful (under the performance objects – Logical Disk and Physical Disk):
1. Average Disk Queue Length
2. Average Disk Secs\Transfer
3. Average Disk Secs\ Write
4. Average Disk Secs\ Reads
The last three counters should never show a prolonged peak period which shows values of 30 milli-seconds and above. Please refer my blog article Setting up Perfmon Logs on how to setup a perfmon trace.
One important point that needs to be kept is the amount of I/O being performed by SQL Server. You can find out the amount of I/O being driven by SQL Server by using the Process performance object with the SQL Server instance (sqlservr#) that you are concerned with and looking at the values for I/O Data Bytes value. It might be a case where your queries have sub-optimal plans due to which they are performing an excessive amount of reads/writes which is causing the disk to be throttled.
If you find out that this is actually an issue with the disk sub-system, then it is always a good idea to involve the disk vendor/or your internal storage team to find out if all the drivers/firmware are updated with the latest patches.
Addition: May 26, 2011:
If you want to get your hands dirty and debug a Stuck-Stalled IO request, then Bob Dorr, Principal Escalation Engineer at Microsoft, has a blog post on this:
SQL IO Basics
SQL Stuck Stalled IO KB Article
Working with TempDB in SQL Server 2005
Troubleshooting Performance Problems (section on I/O)
Pingback: Debugging a non-yielding scheduler issue « TroubleshootingSQL