If you have ever worked in testing on database projects, you would have probably done testing by trying to narrow down the data set involved which is nearly representative of the actual production data set or used a sub-set of the queries which are representative of the workload which is expected or is active on the production environment.
Now this brings me to the reason of this post. It is quite often during testing that we forget that SQL Server executes all DML operations by default in autocommit mode. This means that all individual statements are committed if they complete successfully. If you want to avoid this behavior, then you either need to set IMPLICIT_TRANSACTIONS setting to ON for your database connection or use a BEGIN TRANSACTION command before you execute your query.
SQL Server follows Write Ahead Logging protocol which means no data modifications are written to disk before the associated log record is written to disk. This maintains the ACID properties for a transaction. (when you involve disk-caching, that opens up another can of worms but that discussion is outside the scope of this blog post)
My belief is that if you are comparing execution times between two different environments, then you should be doing that on exactly the same hardware configuration and if that is not feasible, you should at-least match the CPU, physical RAM and disk sub-system on both sides. I had already documented in an earlier blog post why RAM/CPU configurations are important when comparison execution statistics between two different SQL Server environments. Ideally, you should have the same exact environment as your test environment including the workload (i.e. number of users, connections etc.). However, we all know that Utopia is not a place where we live in and hence the toned down test scripts, scaled down/up environments, shared resources and I could keep lamenting!!
In the last month, I dealt with two such issues where in a T-SQL batch performing a large number of inserts on a database table was being used to compare the performance between two different environments. Since I get called in to fix a problem and no-one generally calls CSS to tell us that their server is running as expected… the problem invariably happened to be that a bigger, beefier server was taking a longer time to execute the same batch. I shall demonstrate where not knowing about the WAL protocol can cause such test mechanisms to be skewed and take you down the proverbial rabbit-hole!
The script that I shall be using for demonstrating the pitfalls of this behavior is pretty simple:
declare @cntr int = 1
while @cntr <= 1000000
insert into tblInserts (SNo,RowVal) values(@cntr,’Record ID ‘ +CAST(@cntr as varchar(7)))
set @cntr += 1
The script inserts 1 million rows into a database table (a heap table) using a while loop. During the time of the insert, I shall capture various performance counter values during the execution along with wait statistics for the query.
Modification: September 9th, 2011: Based on Kendra’s (Twitter) feedback, changing the sub-headings. Test 1 makes use of auto-commit mode of SQL Server which is the default and Test 2 can be achieved either by using implicit transaction mode or performing an explicit transaction (BEGIN…COMMIT).
Test 1: With AutoCommit mode
For two iterations the above script on an average took 8 minutes and 30 seconds to complete. When I looked into the wait statistics captured (at 5 second intervals), I don’t see anything standing out other than a few non-zero wait durations for WRITELOGs during certain periods. The non-zero wait times (in milli-seconds) are shown below with the highest value being captured being 10 milli-seconds. The Average Disk Secs/(Read/Write) don’t show me any outstanding values during the data capture to indicate that the disk was a bottleneck, then why does the data below show so many WRITELOG waits. Keep in mind that the total amount of time waited for as indicated by the data below is also not a significantly large amount. So why is it taking over eight minutes to insert the data??
To explain the query, I would need to fall back of SQL Server Performance Monitor counters (sampled at 5 second intervals). On analysis of the performance monitor counters, I found the following:
a. SQLServer:Databases: Log Bytes Flushed/sec showed that on an average 3.1 MB of log bytes were being flushed every 5 seconds for the database on which I was performing the inserts.
b. SQLServer:DatabasesLog Flushes/sec showed that about 6000 log flushes were occurring for this database every 5 seconds on an average.
c. SQLServer:Wait Statistics: Log write waits shows me that on an average there were about 9000+ waits started per second. However, the Cumulative wait time (ms) per second for the same counter shows me negligible values and the Average wait time (ms) value is nearly zero through the data collection period.
So how can I explain where the time is being spent?? Now I decided to run another test using implicit transactions.
I ran two iterations of the above T-SQL batch within BEGIN TRANSACTION…COMMIT block and the average duration was 14 seconds! Whattttt!??!?! Yes.. And all this can be simply explained using the same counters that I looked above. This time around the performance monitor data told me the following story:
a. SQLServer:Databases: Log Bytes Flushed/sec showed that on an average 26 MB of log bytes were being flushed every 5 seconds for the database on which I was performing the inserts.
b. SQLServer:DatabasesLog Flushes/sec showed that about 468 log flushes were occurring for this database every 5 seconds on an average.
c. SQLServer:Wait Statistics: Log write waits shows me that on an average there were about 23(approx.)+ waits started per second.
If you look at the Performance Monitor graphs for the disk statistics that I captured for a single run for Test 1 (screenshot above) and Test 2 (screenshot on the left), you will see that the %Idle Time for the disk, on which the database files resided on (in my case F: drive), shows was busy 50% of the time (see blue squiggly above) during the test and the value climbed back to ~100% after the test completed. On the contrary, the %Idle Time for the same disk has a very zig-zag pattern (see black squiggly on left) during Test 2 which indicates that the disk was sporadically busy and it was not a continuous pattern as observed in Test 1.
The Avg. Disk Sec/Write counter shows absolutely no latency which means that the there is no problem in terms of latency on the disks.
During Test 1, data was being written to the disk at the rate of 907 KB/s where as during Test 2, the write rate was 5MB/s which was determined by monitoring the Disk Write Bytes/sec counter.
The reason for this difference is that for every insert in Test 1 is followed by an autocommit. This means that you need to flush the log buffer for each insert. However in Test 2, the log buffer was being flushed much lesser but at the same time more data was being written to the disk for each log flush. Since SQL Server follows a WAL protocol, till the earlier log flush is completed, it cannot commit/move onto the next transaction.
If you are running a similar system with two different sets of hardware, then you would start having to chase down disk speeds/RPMs between the two servers. For servers which have disks performing optimally, this behavior is quite difficult to gather without the right set of data.
With SQL Server 2008, tracking down the waits for a single session is much, much easier with the help of XEvents. But that shall be a story for another day… errrr.. post!!
Moral of the story: If you are using a test query similar to the one shown above to test/benchmark performance and the query batch is not indicative of your actual workload, then you would probably not be able to gauge or establish an accurate performance benchmark for the queries executing on your system. And REMEMBER that SQL Server runs in auto-commit mode by default!