Statistics Auto Update may not be automatically the best choice

I had written a post a while back stating how DBREINDEX and ALTER INDEX treat the statistics update differently for user-created and auto-created statistics on a database table. In this post, I shall talk about how auto-update statistics, which is normally left ON on most databases that I have worked on, can sometimes play “spoilsport” with the all-important cardinality estimation during the plan generation phase. The auto-update statistics option of a database affects all table statistics: index, user-created and auto-created.

As usual, my favorite way of illustrating a SQL Server behavior is to use an example. I created a table with two integer columns and inserted a million rows into it. There are two indexes on the table: a clustered index on the primary key (column a) and a non-clustered index on the second column (column b). The DBCC SHOW_STATISTICS output for the index on column b shows the following:

Name         Updated              Rows                 Rows Sampled
idx_b        Jul  8 2011  6:34AM  1000000              482999              

If you look at the data above, you will notice that Rows and Rows Sampled differ, which means that the statistics were updated with less than 100% sampling. There are only 3 distinct values for this particular column. The following query returned 26,000 rows, but the estimated number of rows was way off.

select a,b from dbo.tblStats where b = 4

Plan snippet:

Rows      EstimateRows
26000     2952.387
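
For reference, here is a minimal sketch of the setup described above. The table and index names match the ones used in this post, but the exact insert pattern and the data skew are my assumptions:

-- Minimal repro sketch (the data distribution below is an assumption)
CREATE TABLE dbo.tblStats
(
    a INT IDENTITY(1,1) NOT NULL CONSTRAINT pk_tblStats PRIMARY KEY CLUSTERED,
    b INT NOT NULL
);
GO

-- Insert 1,000,000 rows with only 3 distinct values in column b
-- (every 40th row gets b = 4, i.e. 25,000 rows; the post had roughly 26,000)
INSERT INTO dbo.tblStats (b)
SELECT CASE WHEN x.n % 40 = 0 THEN 4
            WHEN x.n % 2  = 0 THEN 2
            ELSE 1
       END
FROM (SELECT TOP (1000000)
             ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
      FROM sys.all_columns AS c1
      CROSS JOIN sys.all_columns AS c2) AS x;
GO

CREATE NONCLUSTERED INDEX idx_b ON dbo.tblStats (b);
GO

-- Inspect the statistics header for the non-clustered index
DBCC SHOW_STATISTICS ('dbo.tblStats', 'idx_b') WITH STAT_HEADER;

Note that building the index like this creates its statistics with a fullscan; further modifications are needed before an auto-update with the default sampling rate kicks in, as described below.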

Now, when I updated the statistics with a fullscan on the table and generated a new plan for the same query and… voila… I had the correct estimated rows, and the histogram also showed me the correct data distribution. In my scenario, I had 400,000 records in the table, after which I ran a WHILE loop to insert 600,000 additional records. Until then, the statistics had been updated with a fullscan. After that, an auto-update statistics operation messed up the sampling rate. This is because when auto-update statistics kicks in, it uses a very small sample: the default sampling rate is a function of table size, which means that the larger the table, the lower the sampling percentage will be.

Now this is not necessarily a bad thing. For example, if an auto-update statistics operation kicks in during query execution and asynchronous auto-update statistics is not enabled for the database, you don’t want the statistics update operation to sample the entire table, update the statistics histogram and then execute the query. So the default sampling behavior allows for quick statistics gathering even on large tables. The default sampling is also used when a CREATE STATISTICS command is executed on the table.
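
To illustrate, this is roughly what the fullscan update and a sampled update look like, using the table and index names from the example above:

-- Update the statistics for the non-clustered index with a 100% sample
UPDATE STATISTICS dbo.tblStats idx_b WITH FULLSCAN;

-- Verify that Rows and Rows Sampled now match in the statistics header
DBCC SHOW_STATISTICS ('dbo.tblStats', 'idx_b') WITH STAT_HEADER;

-- A sampled update, by contrast, looks like this; an auto-update uses its own
-- default sampling rate, which shrinks as the table grows
UPDATE STATISTICS dbo.tblStats idx_b WITH SAMPLE 10 PERCENT;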

Now, why is this important to be aware of when designing your statistics update strategy for VLDBs or OLTP environments? Queries in production environments can involve JOIN operations and other complex constructs, which means that the cardinality estimate of the rows to be fetched determines which operators are used in the plan. Over- and under-estimating can prove detrimental to the performance of the query plan.
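
As a hedged illustration (dbo.tblOrders is a hypothetical table that is not part of the original example), a skewed estimate on the filter column can push the optimizer towards the wrong join strategy:

-- If the estimate for b = 4 is ~2,952 rows instead of the actual 26,000, the
-- optimizer may favour a nested loops join with lookups where a hash or merge
-- join would have been cheaper for the real row count.
SELECT s.a, s.b, o.OrderDate
FROM dbo.tblStats AS s
JOIN dbo.tblOrders AS o            -- hypothetical table
    ON o.CustomerId = s.a
WHERE s.b = 4;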

Moral of the story: Statistics for a table need to be updated with a fullscan if there have been a significant number of changes made to the table data.

The frequency of statistics updates has to be determined based on the number of updates that your frequently used tables receive and on when the instance has some free time to handle the statistics update operations. This is the famous “It Depends” answer. Even though it is an online operation, it does consume resources and can potentially cause blocking while the scan is being done on the table to figure out the data distribution. It is not always required to rebuild indexes to update statistics. Read the Books Online article “Using Statistics to Improve Query Performance” for options like Asynchronous Statistics Update, the NoRecompute option, etc., and how they can be pertinent to your situation in case you have large databases where statistics updates cannot be done very frequently.
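
For example, two of the options called out in that Books Online topic look like this (the database name is a placeholder):

-- Asynchronous auto-update: queries no longer wait for a statistics refresh
ALTER DATABASE [YourDatabase] SET AUTO_UPDATE_STATISTICS_ASYNC ON;

-- Update a table with a fullscan during a maintenance window and opt it out of
-- further automatic updates; re-enable later by updating without NORECOMPUTE
UPDATE STATISTICS dbo.tblStats WITH FULLSCAN, NORECOMPUTE;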

Do not go by the row modification counter value starting from SQL Server 2005, as statistics updates are tracked based on column modification counters rather than row modifications.
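
If you do look at the legacy counter, keep in mind that from SQL Server 2005 onwards it is only an approximation derived from the internal column modification counters. A quick sketch:

-- rowmodctr in the sys.sysindexes compatibility view is derived from the
-- per-column modification counters in SQL Server 2005 and later, so treat it
-- as an approximation rather than an exact count of modified rows
SELECT OBJECT_NAME(si.id) AS table_name,
       si.name            AS stats_name,
       si.rowmodctr
FROM sys.sysindexes AS si
WHERE OBJECT_NAME(si.id) = 'tblStats';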

There are edge-case scenarios where a 100% sampling might not be the best choice because of a skew in the data distribution, but that is a story for another day!

My colleague Jack wrote some posts on interesting issues that he ran into with data distribution in tables and statistics sampling:
Parallel Index Creation performance issue
http://blogs.msdn.com/b/psssql/archive/2009/03/18/be-aware-of-parallel-index-creation-performance-issues.aspx
Why does this query consumes so much CPU?
http://blogs.msdn.com/b/psssql/archive/2011/04/29/why-does-this-query-consumes-so-much-cpu.aspx
Sampling can produce less accurate statistics if the data is not evenly distributed
http://blogs.msdn.com/b/psssql/archive/2010/07/09/sampling-can-produce-less-accurate-statistics-if-the-data-is-not-evenly-distributed.aspx

Reference:
Statistics used by the Query Optimizer in SQL Server 2005
http://technet.microsoft.com/en-us/library/cc966419.aspx

Why can I not produce the same plan with a copy of the production database?

This question has been asked multiple times for SQL Server 2005/2008 environments. Well, the answer is that SQL Server has become a bit smarter when it comes to the Query Optimizer: it now takes the hardware environment and the database session state into account as well. (Sometimes the optimizer may be a bit too smart for its own good 🙂 but that is a discussion not being addressed in this post.)

To determine an optimal query plan, SQL Server 2005 and 2008 use the following information:

  1. The database metadata – the table statistics should hold the same information, i.e. the same data distribution.
  2. The hardware environment – are the physical memory (RAM) and the number of processors identical on the two machines? (See the sketch after this list.)
  3. The database session state – for example, the SET options in effect for the session.
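
To sanity-check items 2 and 3, you can compare the two servers with something along these lines (the memory column is named physical_memory_in_bytes on SQL Server 2005/2008; later versions rename it):

-- Hardware environment: CPU count and memory visible to SQL Server
SELECT cpu_count,
       hyperthread_ratio,
       physical_memory_in_bytes
FROM sys.dm_os_sys_info;

-- Database session state: the SET options in effect for the current session
DBCC USEROPTIONS;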

Typically, you must simulate all of this information if you want to reproduce the behavior of the query optimizer on a test system. If you are lucky, you might end up with the same plan even without 2 & 3 being satisfied. In scenarios where you don’t, simulating 2 & 3 would be a good next step before running off to solve the question:

Why is Server A generating a better plan than Server B?

With the advent of virtualization, simulating the physical memory and the CPUs is not as big a deal as it used to be. I just thought I would answer this question because many times I have been asked to explain why the same database backup doesn’t produce the same plan as Server A. The first option is always to create a statistics-only copy of your database and see if you can reproduce the plan that you see on your production server. If yes, then you can safely proceed to the next step of troubleshooting the bad plan, i.e. find the most expensive part of the plan and take the necessary steps to tune it, like adding covering indexes, defining statistics, rearranging the join order, adding query hints, etc.
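
One simple way to do that comparison, as a sketch, is to capture the estimated plan for the query in question on the production server and on the statistics-only copy and compare the two:

-- Capture the estimated (compile-time) plan without executing the query
SET SHOWPLAN_XML ON;
GO
SELECT a, b FROM dbo.tblStats WHERE b = 4;   -- substitute the query under investigation
GO
SET SHOWPLAN_XML OFF;
GO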

Very often the SQL CSS team will ask for a statistics clone of your database so that they can reproduce the issue in-house. This clone does not contain any data from the tables, only the database metadata. So, in case you need to troubleshoot a performance issue where you suspect the query plan to be the culprit, you can also take a statistics clone and use it on a test server to check whether you can reproduce the so-called “bad” plan. The reason I mention a test environment is that sometimes it is not possible to troubleshoot a query performance issue on a production server. And generating a database statistics clone for a few tables is much faster than a backup/restore of the entire database.

You can use the information mentioned in the article below to create a statistics clone of your database:

How to generate a script of the necessary database metadata to create a statistics-only database in SQL Server 2005 and in SQL Server 2008
http://support.microsoft.com/?kbid=914288