Performance benchmarking: Explicit vs Implicit Transactions

Benchmarking… the single most important word in performance tuning. Without an established benchmark, you really don’t know when to call it a day!

If you have ever tested database projects, you have probably done so by narrowing the data set down to something nearly representative of the actual production data, or by using a subset of queries representative of the workload that is expected or currently active in the production environment.

This brings me to the reason for this post. Quite often during testing we forget that SQL Server executes all DML operations in autocommit mode by default, which means each individual statement is committed as soon as it completes successfully. If you want to avoid this behavior, you either need to set the IMPLICIT_TRANSACTIONS setting to ON for your database connection or use a BEGIN TRANSACTION command before you execute your query. The sketch below illustrates the difference.
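To make the difference concrete, here is a minimal sketch (using the tblInserts table from the test script further down; the single-row values are made up purely for illustration):

-- Autocommit (the default): every statement is its own transaction
insert into tblInserts (SNo, RowVal) values (1, 'Record ID 1')

-- Explicit transaction: the statements inside commit together as one unit
begin transaction
insert into tblInserts (SNo, RowVal) values (2, 'Record ID 2')
insert into tblInserts (SNo, RowVal) values (3, 'Record ID 3')
commit transaction

-- Implicit transaction mode: SQL Server opens the transaction for you,
-- but you still have to issue the COMMIT yourself
set implicit_transactions on
insert into tblInserts (SNo, RowVal) values (4, 'Record ID 4')
commit transaction
set implicit_transactions off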

SQL Server follows the Write-Ahead Logging (WAL) protocol, which means no data modification is written to disk before the associated log record has been written to disk. This is what maintains the ACID properties of a transaction. (Disk caching opens up another can of worms, but that discussion is outside the scope of this blog post.)

My belief is that if you are comparing execution times between two different environments, you should do so on exactly the same hardware configuration, and if that is not feasible, you should at least match the CPU, physical RAM and disk sub-system on both sides. I had already documented in an earlier blog post why RAM/CPU configurations are important when comparing execution statistics between two different SQL Server environments. Ideally, your test environment should mirror production exactly, including the workload (i.e. number of users, connections etc.). However, we all know that Utopia is not where we live, and hence the toned-down test scripts, scaled down/up environments, shared resources… and I could keep lamenting!!

In the last month, I dealt with two such issues wherein a T-SQL batch performing a large number of inserts on a database table was being used to compare performance between two different environments. Since I get called in to fix problems, and no one generally calls CSS to tell us that their server is running as expected… the problem invariably was that a bigger, beefier server was taking longer to execute the same batch. I shall demonstrate how not knowing about the WAL protocol can skew such test mechanisms and take you down the proverbial rabbit hole!

The script that I shall be using for demonstrating the pitfalls of this behavior is pretty simple:

declare @cntr int = 1
while @cntr <= 1000000
begin
    insert into tblInserts (SNo, RowVal) values (@cntr, 'Record ID ' + CAST(@cntr as varchar(7)))
    set @cntr += 1
end

The script inserts 1 million rows into a database table (a heap) using a WHILE loop. During the insert, I captured various performance counter values along with wait statistics for the query, along the lines of the sketch below.
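The capture itself was nothing fancy; a minimal sketch of the kind of polling query used (the session id of 51 and the 5 second interval are placeholders, adjust them for your own test) would be:

-- Poll the waits for the session running the insert loop every 5 seconds
while (1 = 1)
begin
    select getdate() as [Runtime], [session_id], [status], [wait_type],
           [wait_time], [last_wait_type]
    from sys.dm_exec_requests
    where [session_id] = 51 -- session executing the insert batch

    waitfor delay '00:00:05'
end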

Modification: September 9th, 2011: Based on Kendra’s (Twitter) feedback, changing the sub-headings. Test 1 makes use of auto-commit mode of SQL Server which is the default and Test 2 can be achieved either by using implicit transaction mode or performing an explicit transaction (BEGIN…COMMIT).

Test 1: With AutoCommit mode

Over two iterations, the above script took 8 minutes and 30 seconds on average to complete. When I looked into the wait statistics captured (at 5 second intervals), nothing stood out other than a few non-zero wait durations for WRITELOG during certain periods. The non-zero wait times (in milliseconds) are shown below, the highest captured value being 10 milliseconds. The Avg. Disk Sec/(Read/Write) counters didn’t show any outstanding values during the data capture to indicate that the disk was a bottleneck, so why does the data below show so many WRITELOG waits? Keep in mind that the total time waited, as indicated by the data below, is also not significant. So why is it taking over eight minutes to insert the data?

Runtime Session_id Status Wait_type Wait_time Last_Wait_Type
12:46 AM 51 suspended WRITELOG 10 WRITELOG
12:46 AM 51 suspended WRITELOG 7 WRITELOG
12:47 AM 51 suspended WRITELOG 1 WRITELOG
12:47 AM 51 suspended WRITELOG 2 WRITELOG
12:47 AM 51 suspended WRITELOG 3 WRITELOG
12:47 AM 51 suspended WRITELOG 2 WRITELOG
12:48 AM 51 suspended WRITELOG 7 WRITELOG
12:49 AM 51 suspended WRITELOG 8 WRITELOG
12:49 AM 51 suspended WRITELOG 1 WRITELOG
12:49 AM 51 suspended WRITELOG 10 WRITELOG
12:49 AM 51 suspended WRITELOG 6 WRITELOG
12:50 AM 51 suspended WRITELOG 2 WRITELOG
12:51 AM 51 suspended WRITELOG 3 WRITELOG
12:51 AM 51 suspended WRITELOG 4 WRITELOG
12:51 AM 51 suspended WRITELOG 1 WRITELOG
12:51 AM 51 suspended WRITELOG 2 WRITELOG
12:52 AM 51 suspended WRITELOG 5 WRITELOG
12:52 AM 51 suspended WRITELOG 7 WRITELOG
12:53 AM 51 suspended WRITELOG 1 WRITELOG
12:53 AM 51 suspended WRITELOG 8 WRITELOG

To explain the query duration, I had to fall back on SQL Server Performance Monitor counters (sampled at 5 second intervals). On analyzing them, I found the following:
a. SQLServer:Databases: Log Bytes Flushed/sec showed that on average 3.1 MB of log bytes were being flushed every 5 seconds for the database on which I was performing the inserts.
b. SQLServer:Databases: Log Flushes/sec showed that about 6,000 log flushes were occurring for this database every 5 seconds on average.
c. SQLServer:Wait Statistics: Log write waits showed that on average 9,000+ waits were started per second. However, the Cumulative wait time (ms) per second for the same counter showed negligible values, and the Average wait time (ms) value was nearly zero throughout the data collection period.

So how do I explain where the time is being spent? I decided to run another test, this time with the inserts inside a single transaction.

Test 2: Without AutoCommit mode

I ran two iterations of the above T-SQL batch within a BEGIN TRANSACTION…COMMIT block and the average duration was 14 seconds! Whattttt!??!?! Yes… and all of it can be explained using the same counters that I looked at above. This time around the performance monitor data told me the following story:
a. SQLServer:Databases: Log Bytes Flushed/sec showed that on average 26 MB of log bytes were being flushed every 5 seconds for the database on which I was performing the inserts.
b. SQLServer:Databases: Log Flushes/sec showed that about 468 log flushes were occurring for this database every 5 seconds on average.
c. SQLServer:Wait Statistics: Log write waits showed that only about 23 waits (approx.) were started per second.
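Dividing the average log bytes flushed per 5-second sample by the average number of flushes in the same sample gives a rough per-flush size, which is where the difference really shows up. A quick back-of-the-envelope calculation using the averages quoted above:

-- Approximate log bytes written per flush, from the 5-second averages above
select (3.1 * 1024.0) / 6000 as [Test1_KB_per_flush],  -- ~0.5 KB per flush
       (26.0 * 1024.0) / 468 as [Test2_KB_per_flush]   -- ~57 KB per flush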

If you look at the Performance Monitor graphs for the disk statistics that I captured for a single run of Test 1 (screenshot above) and Test 2 (screenshot on the left), you will see that the %Idle Time for the disk on which the database files resided (in my case the F: drive) shows the disk was busy about 50% of the time during Test 1 (see the blue squiggly above), and the value climbed back to ~100% after the test completed. By contrast, the %Idle Time for the same disk shows a very zig-zag pattern during Test 2 (see the black squiggly on the left), which indicates that the disk was only sporadically busy rather than continuously busy as in Test 1.

The Avg. Disk Sec/Write counter shows virtually no latency, which means there is no problem in terms of latency on the disks.

During Test 1, data was being written to the disk at a rate of 907 KB/s, whereas during Test 2 the write rate was 5 MB/s (as determined by monitoring the Disk Write Bytes/sec counter).

The reason for this difference is that every insert in Test 1 is followed by an autocommit, which means the log buffer has to be flushed for each insert. In Test 2, however, the log buffer was flushed far less frequently, and more data was written to disk with each log flush. Since SQL Server follows the WAL protocol, it cannot commit and move on to the next transaction until the earlier log flush has completed.
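For reference, the Test 2 run was simply the same loop wrapped in an explicit transaction; a sketch of what was executed looks like this:

begin transaction

declare @cntr int = 1
while @cntr <= 1000000
begin
    insert into tblInserts (SNo, RowVal) values (@cntr, 'Record ID ' + CAST(@cntr as varchar(7)))
    set @cntr += 1
end

commit transaction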

If you run a similar test on two different sets of hardware, you end up chasing down disk speeds/RPMs between the two servers. On servers whose disks are performing optimally, this behavior is quite difficult to spot without the right set of data.

With SQL Server 2008, tracking down the waits for a single session is much, much easier with the help of XEvents. But that shall be a story for another day… errrr.. post!!

Moral of the story: If you are using a test query similar to the one shown above to test/benchmark performance and the query batch is not indicative of your actual workload, then you would probably not be able to gauge or establish an accurate performance benchmark for the queries executing on your system. And REMEMBER that SQL Server runs in auto-commit mode by default!

HAPPY TESTING!


PREEMPTIVE_OS_GETPROCADDRESS: Are you confusing me?

PREEMPTIVE_OS_GETPROCADDRESS is a new wait type introduced in SQL Server 2008 to track the time taken by GetProcAddress to load the entry point in the DLL when extended stored procedure calls are made by the SQL Server instance. However, the way this wait type tracks wait time has a little quirk, which Rob Dorr has documented in a blog post.

Based on a discussion on the #sqlhelp thread on Twitter, I felt the need to demonstrate this quirk with an example. I used the following command to list the sub-directories on my C: drive:

exec master..xp_dirtree 'C:\'

Next I monitored the wait statistics for the session executing this Extended Stored procedure using the T-SQL script below with a 1 second delay:

SET NOCOUNT ON
WHILE (1=1)
BEGIN
    SELECT [session_id], [status], [command], [start_time], [cpu_time], [reads], [writes], [wait_resource], [wait_time], [last_wait_type], [wait_type]
    FROM sys.dm_exec_requests
    WHERE [session_id] = 53 -- Session ID that is executing the extended stored procedure

    WAITFOR DELAY '00:00:01'
    RAISERROR ('', 10, 1, N'Waits');
END

The output of this script was as follows:

session_id status        command          start_time              cpu_time    reads    writes      wait_time   last_wait_type                        wait_type
53         runnable      EXECUTE          2011-07-20 00:59:44.330 62          0        0           712         PREEMPTIVE_OS_GETPROCADDRESS          PREEMPTIVE_OS_GETPROCADDRESS
53         running       EXECUTE          2011-07-20 00:59:44.330 234         0        0           1713        PREEMPTIVE_OS_GETPROCADDRESS          PREEMPTIVE_OS_GETPROCADDRESS
53         runnable      EXECUTE          2011-07-20 00:59:44.330 452         0        0           2714        PREEMPTIVE_OS_GETPROCADDRESS          PREEMPTIVE_OS_GETPROCADDRESS

As you can see above, the wait time keeps increasing even though the session is clearly performing work, as evidenced by the following:
a. The results window in SSMS for the session executing the extended stored procedure kept churning out results.
b. Using Process Monitor, I could trace the file system activity on the C: drive for the SQL Server process.

You could run into this issue while using any extended stored procedure (system or user) on a SQL Server 2008 or SQL Server 2008 R2 instance. Since I know the session is performing work, the only other things I would check are whether there are resource bottlenecks with respect to memory or I/O, especially if the extended stored procedure is doing file system related work. Most of the cases where I have seen this wait type cause confusion are when the extended stored procedure is performing disk-related activity. In such cases, Process Monitor, along with the information mentioned above, will put your doubts to rest and confirm that the procedure is actually doing work and is not stuck or merely perceived as “hung”.

T-SQL Tuesday #20 Wrap-up and a few smiles

I had the privilege of hosting this month’s T-SQL Tuesday, a recurring blog party and the brainchild of MVP Adam Machanic (blog|twitter). It goes without saying that the posts this time were a wonderful read, and it’s very encouraging to see such brilliant engagement from the community. A big “thank you” to all the people who took the time to pen a post for this month’s topic: “T-SQL Best Practices”. Whether you are a newcomer to the world of T-SQL or a seasoned pro, the insights shared in these posts are definitely a good read if you work with SQL Server.

Below is a round-up of all the posts from this month (in chronological order of comments appearing on my invitation post):

1. Jason Bacani [Blog | Twitter] writes about why T-SQL formatting is both good practice and a courtesy: it makes life easier for the person supporting the code you have written. I feel motivated to work on code only if I like looking at it.

2. Rob Farley [Blog | Twitter], a popular SQL Server MVP, talks about why understanding query plans can help in writing better and more efficient T-SQL code. Totally agree with Rob here and from experience I can say that this goes a long way in writing efficient T-SQL code.

3. Dirk Wegener [Blog] talks about why including comments to explain your code logic is important and reiterates why T-SQL formatting matters, in his first T-SQL Tuesday post. Way to go, Dirk, and we look forward to your participation in the future.

4. Noel McKinney [Blog | Twitter] talks about coding best practices which are quite pertinent in today’s development world. I completely agree that everything shouldn’t be done through T-SQL. Sometimes exploring other avenues may be the right answer!

5. Pinal Dave [Blog | Twitter], a friend and now a colleague at Microsoft, talks about practices that can be avoided in environments generally and not just from a T-SQL perspective.

6. Thomas Rushton [Blog | Twitter] talks about coding best practices, some of which apply to the world outside T-SQL as well. Absolutely loved the quote at the end of your post!

7. Rich Brown [Blog | Twitter] shows with an example how a developer managed to get into trouble by using an ORDER BY construct for an INSERT statement.

8. Sebastian Meine [Blog | Twitter] shows why an interface layer is required between the application and database. This is quite difficult to achieve for various reasons (sometimes it may just be ignorance), but if you can pull it off, it really does save a lot of time during a post-deployment crisis.

9. Muthukkumaran Kaliyamoorthy [Blog | Twitter] talks about a list of DOs and DON’Ts while writing T-SQL code.

10. Nick Haslam [Blog | Twitter] talks about how multiple database files can help improve the throughput of your DML queries for large data warehouses, backed by some TPC-H test numbers that he shares in the same post.

11. Marcos Freccia [Blog | Twitter] talks about a recent performance problem he resolved by avoiding an implicit convert operation in the query plan with the help of a small change in his T-SQL query. An example of how following best practices can help performance.

12. Matt Velic [Blog | Twitter] talks about best practices to follow while writing T-SQL code from a readability, formatting and code clarity standpoint. Supporting and fixing code is much faster if it’s easy to read!

13. Robert Matthew Cook [Blog | Twitter] shows how not following simple best practices led to a cascading effect where a stored procedure failed to execute. He goes the extra mile to explain why the issue happened!

14. Erin Stellato [Blog | Twitter] explains why following T-SQL coding standards and adding comments to your code can help save painful hours of searching and brain-wracking during the time of a crisis.

15. Josh Feierman [Blog | Twitter] talks about what can be done in the current scheme of things, with the business angle kept in mind. What he describes are ground realities, and he suggests how to counter them so we can be efficient at what we do without losing sleep while handling a critical issue.

16. Aaron Bertrand [Blog | Twitter], a popular MVP, talks about some of the best practices that should be followed while writing T-SQL code. He provides examples on what kind of issues you could run into when such practices are not followed.

17. Jack Vamvas [Blog | Twitter] gives an example and shows how filtered indexes come to the rescue when T-SQL code which doesn’t follow best practices cannot be tuned due to support and license agreements.

18. Gabriel Villa [Blog | Twitter] talks about T-SQL coding best practices which would help in better management of the code base in the post-deployment phase. This would definitely lead to less consumption of aspirins for the team managing and supporting the code.

19. Nancy Hidy Wilson [Blog | Twitter] talks about adding proper comments to your T-SQL code in a crisp and concise post.

20. Ricardo [Blog | Twitter] reiterates why comments in the code are necessary, but notes they don’t necessarily need to be short stories. Additionally, he mentions testing T-SQL code to ensure that it works in production and not just on the development box you are using to write the code. The post is in Spanish, but Bing Translate helped me read an English version of it, and it did put a smile on my face.

21. Colleen Morrow [Blog | Twitter] explains with an example why using functions in the predicate of a T-SQL query can be bad karma for your code.

22. Carlos Bossy [Blog | Twitter] explains in his post why T-SQL code made to function in a procedural manner is probably not the best approach.

23. Jes Borland [Blog | Twitter] shares her top three T-SQL best practices and how keeping it simple is the best idea!

Once again, my sincere thanks to all the SQL folks that contributed to this month’s T-SQL Tuesday! And also, my thanks to Adam for letting me host this virtual blog party for this month!

Since a major part of my daily job involves troubleshooting SQL Server performance related issues, I come across T-SQL code which, from a best practices standpoint, leaves a lot to be desired. So the next few lines are my 2 cents based on what I have seen in my experience, though I’m afraid it may turn out to be a dollar’s worth of rambling!

1. If your company has established coding standards/formats, follow them – This keeps things consistent across the board, and the standards and formats were, in the majority of cases, put in place with good intentions. If none are defined, then define a set for yourself and follow them!
2. Use some form of source control – It could be as simple as adding comments in the procedure code itself. It really saves time, and helps keep blood pressure levels normal, when you are trying to find out why and when a small change was made to the code that is now the cause of the performance bottleneck!
3. Add comments/description to your code – I have yet to come across a production application that has a single person in charge of production, support, maintenance and version upgrades. More often than not, your code will be supported and maintained by someone else. Help that person understand why that piece of code is there and what it is supposed to do. What is obvious to you may not be obvious to someone else. So, rather than having a DBA curse you at an unearthly hour, mention (a single line is usually sufficient) why a best practice was not followed.
4. T-SQL is not the answer for everything – There are some things that can probably be done much better through code in some other language like C#. Don’t be afraid to explore such an option. At the end of the day, SQL Server is a database platform and NOT an application server. (*ducks and runs*)
5. It doesn’t hurt to look under the hood – It’s not strictly necessary to know how the query optimizer works, how to read a query plan, or how indexes should be designed in order to write T-SQL code. But if you understand some of this, if not most of it, you will find yourself writing much more efficient T-SQL than you did before you knew the under-the-hood workings of the database engine.

Do not misunderstand the above points to mean that I write the best T-SQL code that can be written. All I am saying is: write T-SQL code which is easy to read, decipher and support! Even then, as a developer you might not become a DBA’s best friend overnight, but you will have taken a giant step towards getting there and bridging the chasm that exists between the two worlds!

While reading Thomas Rushton’s [Blog | Twitter] submission this month I came across the quote mentioned below. Going forward, this is going to be my punch-line when I talk about best practices. If you are one of those individuals for whom a visual aid helps, then the above picture should aid in following best practices while writing T-SQL code.

Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live.

*courtesy Jeff Atwood (blog|twitter)

In case you want to keep getting updates on what is happening on my blog, then you can do so using one of the icons below (Facebook page, Twitter or RSS Feeds).


Disclaimer: I have provided the correct links to your blog, twitter accounts and posts to the best of my knowledge. In case there are discrepancies, please leave a comment and I shall make the necessary corrections. Such an error was purely unintentional!

Canonical ACLs: You said wwwhhat?

A while back I had to debug the SQL Server 2008 setup code to find out why the Registry Consistency Check rule was taking an obnoxiously long time to complete on a particular server. When you apply SQL Server 2008 SP2, the Registry Consistency Check rule is executed. In this case the setup UI didn’t show any progress for over an hour. When I checked a previous set of setup logs, I found the following failure (from the Detail.txt file):

2011-04-17 15:06:53 Slp: d:14c36b1c9c26b2820a8\x64\FixSqlRegistryKey_x86.exe exit code: 1
2011-04-17 15:06:53 Slp: Rule 'AclPermissionsFacet' detection result: AreSqlAclPermissionsCanonical= False
2011-04-17 15:06:53 Slp: Evaluating rule : AclPermissionsFacet
2011-04-17 15:06:53 Slp: Rule running on machine: <machine name>
2011-04-17 15:06:53 Slp: Rule evaluation done : Failed
2011-04-17 15:06:53 Slp: Rule evaluation message: The SQL Server registry keys from a prior installation cannot be modified. To continue, see SQL Server Setup documentation about how to fix registry keys.

In another setup attempt (again from the Detail.txt file), I found that FixSqlRegistryKey_x86.exe did not log any message after the EXE was launched:

2011-04-29 11:42:58 Slp: D:\Microsoft SQL 2008 SP2\x64\FixSqlRegistryKey_x64.exe exit code: 0
2011-04-29 11:42:58 Slp: Launching external tool: D:\Microsoft SQL 2008 SP2\x64\FixSqlRegistryKey_x86.exe
2011-04-29 11:42:58 Slp

If you capture a Process Monitor trace, you will find that FixSqlRegistryKey_*.exe traverses the registry keys pertaining to the SQL Server registry hive. The EXE looks up and fixes all non-canonical ACLs on the registry keys and their child keys for the two hives below:

HKLM\Software\Microsoft\Microsoft SQL Server\
HKLM\Software\Microsoft\MSSQLServer\

Did I just read the word Canonical?!? Yes, you did. A canonical ACL must have ACEs sorted according to the following order:

1. Access-denied on the object
2. Access-denied on a child or property
3. Access-allowed on the object
4. Access-allowed on a child or property
5. All inherited ACEs

Reference: http://blogs.msdn.com/b/patcarna/archive/2009/02/09/the-access-control-list-is-not-canonical.aspx

If SQL Setup finds ACLs in non-canonical form, it will try to fix the ACLs for the registry keys and their sub-keys that are affected. If permission issues are encountered while traversing the registry, this might take a long time before it errors out or fails with the error message mentioned above.

You have the following options to correct the non-canonical ACLs:

1. Locate the non-canonical ACLs in the two hives mentioned above and fix them manually.
2. Re-run the rules.
3. Use a Process Monitor trace to check which registry key the EXE takes a long time to move past, or on which it encounters an access-denied error.

Note: Always use a backing file pointed to a local disk when capturing Process Monitor traces for prolonged periods. Refer to this post for more details.

Once you have fixed the non-canonical ACLs, you should be able to successfully get beyond the registry consistency rule check without having to get lunch and dinner in between!

Since I didn’t want to sit and traverse all the registry keys manually, I wrote a PowerShell script to check whether the ACLs for your SQL Server registry hive and its sub-keys are in canonical form, and to report back any key that is not.


Function fnCheckCanonical
{
    Param ([string] $regvalue)

    # Enumerate the key and all of its sub-keys
    $reg = Get-ChildItem $regvalue -Recurse
    Write-Host "Checking hive : " $regvalue

    foreach ($path in $reg)
    {
        # Get-Acl expects the HKLM: provider path rather than the HKEY_LOCAL_MACHINE prefix
        $regkey = $path.ToString().Replace("HKEY_LOCAL_MACHINE", "HKLM:")

        $acl = Get-Acl -Path $regkey
        if ($acl.AreAccessRulesCanonical)
        {
            Write-Host "Verified key: " $acl.Path.ToString()
        }
        else
        {
            Write-Host "Reg Key: " $acl.Path.ToString()
            Write-Host "IsCanonical: " $acl.AreAccessRulesCanonical.ToString()
            Write-Host "Failed canonical check"
        }
    }
}

fnCheckCanonical HKLM:"Software\Microsoft\Microsoft SQL Server"
fnCheckCanonical HKLM:"Software\Microsoft\MSSQLServer"
fnCheckCanonical HKLM:"SOFTWARE\Wow6432Node\Microsoft\Microsoft SQL Server"
fnCheckCanonical HKLM:"SOFTWARE\Wow6432Node\Microsoft\MSSQLServer"

ACLs don’t change into non-canonical form just because they are having a bad day; it can happen when you use scripting from .NET/VBScript/PowerShell or subinacl.exe to incorrectly modify/add/remove ACLs on a registry key. Digging into the root cause of how they got into that state is as good as searching for a needle in all the haystacks in the world! Here is an article on how subinacl.exe could arrange ACLs incorrectly on Windows 2000: http://support.microsoft.com/kb/296865

Hope this helps you in fixing your non-canonical ACLs if you are unfortunate enough to run into this issue.


Statistics Auto Update may not be automatically the best choice

I wrote a post a while back explaining how DBREINDEX and ALTER INDEX treat statistics updates differently for user-created and auto-created statistics on a database table. In this post, I shall talk about how auto-update statistics, which is normally left ON on most databases that I have worked on, can sometimes play “spoil sport” with the all-important cardinality estimation during the plan generation phase. The auto-update statistics option of a database affects all table statistics: index, user-created and auto-created.

As usual, my favorite way of illustrating a SQL Server behavior is to use an example (a sketch of the setup appears after the plan snippet below). I created a table with two integer columns and inserted a million rows into it. There are two indexes on the table: a clustered index on the primary key (column a) and a non-clustered index on the second column (column b). DBCC SHOW_STATISTICS for the index on column b shows the following:

Name         Updated              Rows                 Rows Sampled
idx_b        Jul  8 2011  6:34AM  1000000              482999              

If you look at the data above, you will notice that Rows and Rows Sampled differ, which means the statistics were updated with less than 100% sampling. There are only 3 distinct values in this particular column. The following query returned 26,000 rows, but the estimated row count was way off.

select a,b from dbo.tblStats where b = 4

Plan snippet:

Rows      EstimateRows
26000     2952.387
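For anyone who wants to play along at home, here is a rough sketch of the setup described above. The post does not show the exact table definition or the data load I used, so the DDL and the skew in column b below are assumptions; the only requirement is a large table where b has very few distinct values and an uneven spread.

create table dbo.tblStats (a int not null primary key clustered, b int)
create nonclustered index idx_b on dbo.tblStats (b)

-- Load 1 million rows; roughly 26,000 of them get b = 4 (assumed skew)
declare @i int = 1
while @i <= 1000000
begin
    insert into dbo.tblStats (a, b)
    values (@i, case when @i % 38 = 0 then 4 when @i % 2 = 0 then 5 else 6 end)
    set @i += 1
end

dbcc show_statistics ('dbo.tblStats', 'idx_b')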

Now, when I updated the statistics on the table with a fullscan and generated a new plan for the same query… voila… I got the correct estimated rows, and the histogram also showed the correct data distribution. In my scenario, the table had 400,000 records, after which I ran a WHILE loop to insert 600,000 additional records. Until then, the statistics had been updated with a fullscan; after that, an auto-update statistics operation reduced the sampling rate. This is because when auto-update statistics kicks in, it uses a very small sampling percentage: the default sampling rate is a function of the table size, so the larger the table, the lower the sampling percentage. This is not necessarily a bad thing. For example, if an auto-stats operation kicks in during a query execution and asynchronous auto-update statistics is not set for the database, you don’t want the statistics update to sample the entire table, update the statistics histogram and only then execute the query. The default sampling allows for quick statistics gathering even on large tables. The same default sampling is used when a CREATE STATISTICS command is executed on the table.
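The fullscan update referred to above is a one-liner (the object and statistics names are the ones from this example):

update statistics dbo.tblStats idx_b with fullscan

-- Rows and Rows Sampled in the statistics header should now match
dbcc show_statistics ('dbo.tblStats', 'idx_b')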

So why is this important to be aware of when designing your statistics update strategy for VLDBs or OLTP environments? Queries in production environments can involve JOINs and other complex constructs, which means the cardinality estimate of the rows to be fetched determines which operators are used in the plan. Over- or under-estimating can prove detrimental to the performance of the query plan.

Moral of the story: Statistics for a table need to be updated with a fullscan if there have been a significant number of changes made to the table data.

The frequency of statistics updates has to be determined based on the number of modifications your frequently used tables receive and on when the instance has some free time to handle the statistics update operations. This is the famous “It Depends” answer. Even though a statistics update is an online operation, it does consume resources and can potentially cause blocking while the table is being scanned to figure out the data distribution. It is not always necessary to rebuild indexes to update statistics. Read the Books Online article “Using Statistics to Improve Query Performance” for options such as Asynchronous Statistics Update, the NORECOMPUTE option etc. and how they can be pertinent to your situation if you have large databases where statistics cannot be updated very frequently.
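When deciding on that frequency, it also helps to know how stale your statistics currently are. Here is a small sketch using STATS_DATE and sys.stats (available from SQL Server 2005 onwards); the table name is the one from the example above:

-- When were the statistics on the table last updated?
select s.name as stats_name,
       stats_date(s.object_id, s.stats_id) as last_updated
from sys.stats as s
where s.object_id = object_id('dbo.tblStats')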

Starting with SQL Server 2005, do not go by the row modification counter value, as statistics updates are triggered based on column modification counters rather than row modifications.

There are edge-case scenarios where a 100% sampling might not be best choice because of a skew in data distribution, but that is a story for another day!

My colleague Jack wrote some posts on interesting issues that he ran into with data distribution in table and the statistics sampling:
Parallel Index Creation performance issue
http://blogs.msdn.com/b/psssql/archive/2009/03/18/be-aware-of-parallel-index-creation-performance-issues.aspx
Why does this query consumes so much CPU?
http://blogs.msdn.com/b/psssql/archive/2011/04/29/why-does-this-query-consumes-so-much-cpu.aspx
Sampling can produce less accurate statistics if the data is not evenly distributed
http://blogs.msdn.com/b/psssql/archive/2010/07/09/sampling-can-produce-less-accurate-statistics-if-the-data-is-not-evenly-distributed.aspx

Reference:
Statistics used by the Query Optimizer in SQL Server 2005
http://technet.microsoft.com/en-us/library/cc966419.aspx