I had written a post a while back stating how DBREINDEX and ALTER INDEX treat the statistics update differently for user and auto-created statistics on a database table. In this post, I shall talk about how auto-update statistics which normally is left ON on most databases that I have worked on can sometimes choose to play “spoil sport” for the all important cardinality estimation during the plan generation phase. The auto-update statistics option of a database affects all table statistics: index, user created and auto-created.
As usual, my favorite way to illustrating a SQL Server behavior is to use an example. I created a table with two integer columns and inserted a million rows in the table. There are two indexes on the table: a clustered index on the primary key (column a) and a non-clustered index on the second column (column b). The DBCC SHOW_STATISTICS output for the index on column b has the following output:
Name Updated Rows Rows Sampled
idx_b Jul 8 2011 6:34AM 1000000 482999
If you look at the data above, you will notice that the rows and rows sampled have a difference which means that the statistics were updated with a less than a 100% sampling. There are only 3 distinct values for this particular column. The following query returned 26,000 rows but the estimated rows was way off.
select a,b from dbo.tblStats where b = 4
Now when I updated the statistics with a fullscan on the table and generated a new plan for the same query and…. voila…. I have the correct estimate rows and the histogram also shows me the correct data distribution. In my scenario, I had 400,000 records in the table after which I ran a WHILE loop to insert 600,000 additional records in the table. Till then the statistics were updated with a fullscan. After that an auto-update statistics messed up the sampling rate. This is because when auto update statistics kicks in it uses a very small percent as the default sampling rate used is a function of table size. This means that larger the table, the lower the sampling percentage will be. Now this is not necessarily a bad thing. Eg. If an auto stats operation kicks in during a query execution and auto update stats asynchronously is not set for the database, you don’t want the statistics update operation to sample the entire table, update the statistics histogram and then execute the query. So the default sampling functionality allows for quick statistics gathering even for large tables. The default sampling is used even when a CREATE STATISTICS command is executed on the table.
Now why is this important to be aware when designing your statistics update strategy for VLDBs or OLTP environments. The queries in production environments can involve JOIN operations and other complex constructs which means that the cardinality estimation of the rows to be fetched will determine which operator is to be used. Over and under estimating can prove detrimental to the performance of query plan.
Moral of the story: Statistics for a table need to be updated with a fullscan if there have been a significant number of changes made to the table data.
The frequency of statistics update would have to be determined based on the number of updates that your frequently used tables receive and when the instance has some free time to handle the statistics update operations. This is the famous “It Depends” answer. Even though it is an online, it does consume resources and can potentially cause blocking when the scan is being done on the table to figure out the data distribution. It is not always required to rebuild indexes to update statistics. Read the Books Online article “Using Statistics to Improve Query Performance” for multiple options like Asynchronous Statistics Update, NoRecompute option etc. and how they can be pertinent to your situation in case you have large databases where statistics update cannot be done very frequently.
Do not go by the row modification counter value starting from SQL Server 2005 as the statistics updates are handled based on column modification counters than row modifications.
There are edge-case scenarios where a 100% sampling might not be best choice because of a skew in data distribution, but that is a story for another day!
My colleague Jack wrote some posts on interesting issues that he ran into with data distribution in table and the statistics sampling:
Parallel Index Creation performance issue
Why does this query consumes so much CPU?
Sampling can produce less accurate statistics if the data is not evenly distributed
Statistics used by the Query Optimizer in SQL Server 2005