Statistics Auto Update may not be automatically the best choice

I had written a post a while back about how DBCC DBREINDEX and ALTER INDEX treat the statistics update differently for user-created and auto-created statistics on a database table. In this post, I shall talk about how auto-update statistics, which is normally left ON on most databases I have worked on, can sometimes play "spoil sport" with the all-important cardinality estimation during the plan generation phase. The auto-update statistics option of a database affects all statistics on a table: index, user-created and auto-created.

As usual, my favorite way to illustrate a SQL Server behavior is with an example. I created a table with two integer columns and inserted a million rows into it. There are two indexes on the table: a clustered index on the primary key (column a) and a non-clustered index on the second column (column b). The DBCC SHOW_STATISTICS output for the index on column b shows the following:

Name         Updated              Rows                 Rows Sampled
idx_b        Jul  8 2011  6:34AM  1000000              482999              

If you look at the data above, you will notice that Rows and Rows Sampled differ, which means that the statistics were updated with less than a 100% sampling rate. There are only 3 distinct values in this particular column. The following query returned 26,000 rows, but the estimated row count was way off.

select a,b from dbo.tblStats where b = 4

Plan snippet:

Rows      EstimateRows
26000     2952.387

Now, when I updated the statistics with a fullscan on the table and generated a new plan for the same query and…. voila…. I had the correct estimated row count, and the histogram also showed the correct data distribution. In my scenario, I had 400,000 records in the table, after which I ran a WHILE loop to insert 600,000 additional records. Until then, the statistics had been updated with a fullscan. After that, an auto-update statistics operation messed up the sampling rate. This happens because when auto-update statistics kicks in, it uses a default sampling rate that is a function of the table size: the larger the table, the lower the sampling percentage. Now, this is not necessarily a bad thing. For example, if an auto-update of statistics kicks in during a query execution and the asynchronous auto-update statistics option is not set for the database, you don't want the statistics update to sample the entire table and rebuild the histogram before the query can execute. So the default sampling allows for quick statistics gathering even on large tables. The default sampling is used even when a CREATE STATISTICS command is executed on the table.
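For reference, here is a minimal sketch of the fullscan fix described above, using the table and index names from this example:

-- Update all statistics on the table with a 100% sample
UPDATE STATISTICS dbo.tblStats WITH FULLSCAN

-- Verify: Rows and Rows Sampled in the header output should now match
DBCC SHOW_STATISTICS ('dbo.tblStats', 'idx_b') WITH STAT_HEADER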

Why is this important to be aware of when designing the statistics update strategy for VLDBs or OLTP environments? Queries in production environments can involve JOINs and other complex constructs, which means that the cardinality estimate of the rows to be fetched determines which operators the plan uses. Over- and under-estimation can both prove detrimental to query plan performance.

Moral of the story: Statistics for a table need to be updated with a fullscan if there have been a significant number of changes made to the table data.

The frequency of statistics updates has to be determined based on the number of modifications your frequently used tables receive and on when the instance has some free time to handle the update operations. This is the famous "It Depends" answer. Even though it is an online operation, it does consume resources and can potentially cause blocking while the table is scanned to determine the data distribution. It is not always necessary to rebuild indexes to update statistics. Read the Books Online article "Using Statistics to Improve Query Performance" for options like Asynchronous Statistics Update, the NORECOMPUTE option, etc., and how they can be pertinent to your situation if you have large databases where a statistics update cannot be done very frequently.
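For instance, the NORECOMPUTE option mentioned there can be combined with a manual fullscan so that you own the maintenance of that statistics object entirely; a sketch, with a hypothetical table name:

-- Update with a 100% sample and opt this table's statistics out of auto-update
UPDATE STATISTICS dbo.tblOrders WITH FULLSCAN, NORECOMPUTE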

Do not go by the row modification counter value from SQL Server 2005 onwards, as statistics updates are triggered based on column modification counters rather than row modifications.

There are edge-case scenarios where a 100% sampling might not be the best choice because of a skew in data distribution, but that is a story for another day!

My colleague Jack wrote some posts on interesting issues he ran into with data distribution in tables and statistics sampling:
Parallel Index Creation performance issue
http://blogs.msdn.com/b/psssql/archive/2009/03/18/be-aware-of-parallel-index-creation-performance-issues.aspx
Why does this query consumes so much CPU?
http://blogs.msdn.com/b/psssql/archive/2011/04/29/why-does-this-query-consumes-so-much-cpu.aspx
Sampling can produce less accurate statistics if the data is not evenly distributed
http://blogs.msdn.com/b/psssql/archive/2010/07/09/sampling-can-produce-less-accurate-statistics-if-the-data-is-not-evenly-distributed.aspx

Reference:
Statistics used by the Query Optimizer in SQL Server 2005
http://technet.microsoft.com/en-us/library/cc966419.aspx


DBREINDEX and ALTER INDEX differences

I had written about update statistics a long time back and had been meaning to write a clarification on the difference in behavior with respect to table statistics being updated when rebuilding indexes using different T-SQL commands. I am sure you are all aware of why statistics on a table need to be updated with a fullscan to ensure that the optimizer doesn't fall prey to sub-optimal plans. However, there is a little-known fact about how statistics are affected when a DBCC DBREINDEX is issued on a table versus an ALTER INDEX…REBUILD ALL.

I created a table with 4 integer columns and inserted 100,000 records into the table using the script below:

-- Create table
create table tblUpdStats (a int primary key,b int,c int, d int)
-- Insert 100,000 rows in the table
set nocount on
declare @var int = 1
while @var<=100000
begin
insert into tblUpdStats values(@var,@var+1,@var+2,@var+3)
set @var+=1
end
-- Create a non-clustered index on column b
create index idx_b on tblUpdStats(b)
-- Create a user-defined statistics on column c
create statistics stat_c on tblUpdStats(c)
-- Execute the following query so that the database engine creates an auto-created stats object on column d
select d from tblUpdStats where d > 50000
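
The header rows below were captured per statistics object, with a command along these lines for each of the four objects on the table:

dbcc show_statistics ('tblUpdStats', 'idx_b') with stat_header
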
Name                            Updated              Rows    Rows Sampled
------------------------------- -------------------- ------- ------------
idx_b                           May 13 2011  8:03PM  100000  100000
PK__tblUpdSt__3BD0198E558AAF1E  May 13 2011  8:03PM  100000  100000
stat_c                          May 13 2011  8:03PM  100000  100000
_WA_Sys_00000004_53A266AC       May 13 2011  8:03PM  100000  100000

I took a snapshot of the DBCC SHOW_STATISTICS output for the statistics associated with the non-clustered index, the clustered index, the user-created statistics and the auto-created statistics. Rows and Rows Sampled are equal for all of them, which means the statistics were updated with a fullscan. I then inserted another 100,000 rows into the same table and executed the following command to rebuild the indexes on the table:

alter index all on tblUpdStats rebuild

I took another snapshot of the DBCC SHOW_STATISTICS output; notice the difference below:

Name                            Updated              Rows    Rows Sampled
------------------------------- -------------------- ------- ------------
idx_b                           May 13 2011  8:19PM  200000  200000
PK__tblUpdSt__3BD0198E558AAF1E  May 13 2011  8:19PM  200000  200000
stat_c                          May 13 2011  8:03PM  100000  100000
_WA_Sys_00000004_53A266AC       May 13 2011  8:03PM  100000  100000

You will see that the user-created and auto-created statistics do not reflect the change in the number of rows. However, when I execute a DBCC DBREINDEX on the table, I see a different picture altogether (see the SHOW_STATISTICS output below):

dbcc dbreindex ('tblUpdStats','')

Name                            Updated              Rows    Rows Sampled
------------------------------- -------------------- ------- ------------
idx_b                           May 13 2011  8:31PM  200000  200000
PK__tblUpdSt__3BD0198E558AAF1E  May 13 2011  8:31PM  200000  200000
stat_c                          May 13 2011  8:31PM  200000  200000
_WA_Sys_00000004_53A266AC       May 13 2011  8:31PM  200000  200000

Looking at the output, the Updated column now shows the same date and time for all four objects, and Rows Sampled equals Rows, indicating that a fullscan was performed even for the user-created and auto-created statistics on the table.

To summarize, ALTER INDEX…REBUILD ALL will not update statistics that were manually created or auto-created by the Database Engine; it only updates the statistics associated with the indexes defined on the table. A DBCC DBREINDEX on the table, on the other hand, will also update the user-created and auto-created statistics, using the default sampling rate (which matters mostly for very large tables, where the default sample is small). Note that if you have a maintenance plan job which performs an index rebuild on SQL Server 2005 or above, it uses an ALTER INDEX…REBUILD command.
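One way to verify which statistics objects a rebuild touched on your own tables is to query sys.stats (available from SQL Server 2005 onwards); a minimal sketch using the table from this example:

select s.name as stats_name,
       stats_date(s.object_id, s.stats_id) as last_updated,
       s.auto_created, s.user_created
from sys.stats as s
where s.object_id = object_id('dbo.tblUpdStats')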

How to check database statistics last updated date/time & Percent Rows modified in SQL Server

Well, here goes my first post – I will keep it simple. As a DBA you've probably heard people talking about statistics ALL the time. You probably have jobs/maintenance plans set up to update statistics. But if you ever ran into an issue with slow query performance and wanted to do a quick check on the statistics for ALL indexes in the database, you would use either DBCC SHOW_STATISTICS or sp_autostats – but those work per index/table. Here is a way to do this for the entire database using a T-SQL query:

-- For SQL 2005/SQL 2008

select a.id as 'ObjectID', isnull(a.name,'Heap') as 'IndexName', b.name as 'TableName',
a.rowmodctr as 'RowModCtr',
stats_date (a.id, a.indid) as stats_last_updated_time
from sys.sysindexes as a
inner join sys.objects as b
on a.id = b.object_id
where b.type = 'U'

Note: rowmodctr values displayed above are not entirely reliable from SQL 2005 onwards, but you can rely on the last updated time to decide whether the statistics need to be updated.

Many of you might have a lot of tables, or large databases, where you want to update statistics using UPDATE STATISTICS or sp_updatestats only for the tables that have large row modifications. Here is a script that outputs each index with significant row modifications; look at the "ModifiedPercent" column. You can pipe the results to a temp table and choose to update statistics on only those tables.

select
schemas.name as table_schema,
tbls.name as table_name,
i.name as index_name,
i.id as table_id,
i.indid as index_id,
i.rowmodctr as modifiedRows,
(select max(rowcnt) from sysindexes i2 where i.id = i2.id and i2.indid < 2) as rowcnt,
convert(DECIMAL(18,8), convert(DECIMAL(18,8),i.rowmodctr) / convert(DECIMAL(18,8),(select max(rowcnt) from sysindexes i2 where i.id = i2.id and i2.indid < 2))) as ModifiedPercent,
stats_date( i.id, i.indid ) as lastStatsUpdate
from sysindexes i
inner join sysobjects tbls on i.id = tbls.id
inner join sysusers schemas on tbls.uid = schemas.uid
inner join information_schema.tables tl
on tbls.name = tl.table_name
and schemas.name = tl.table_schema
and tl.table_type='BASE TABLE'
where 0 < i.indid and i.indid < 255
and table_schema <> 'sys'
and i.rowmodctr <> 0
and i.status not in (8388704,8388672)
and (select max(rowcnt) from sysindexes i2 where i.id = i2.id and i2.indid < 2) > 0

-- For SQL 2000

SELECT 'Index Name' = i.name, 'Table Name' = o.name,
'Statistics Date' = STATS_DATE(i.id, i.indid)
FROM sysobjects o, sysindexes i
where o.id = i.id

You can put this in a cursor to run against every database if you want to, and play around with it. I will post another script which will help you decide what criteria you want to pick when updating statistics. Stay tuned….
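If you prefer not to write the cursor yourself, the undocumented sp_MSforeachdb procedure can run the check across all databases; a sketch (treat the procedure as unsupported):

exec sp_MSforeachdb 'use [?];
select db_name() as DatabaseName, o.name as TableName, i.name as IndexName,
stats_date(i.id, i.indid) as StatsLastUpdated
from sysindexes as i
inner join sysobjects as o on i.id = o.id
where o.type = ''U'' and i.indid between 1 and 254'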

Addition: February 4, 2012: An updated version of the script is available here.

How to update statistics for a VLDB

Statistics are something the SQL Server optimizer depends on to generate a plan. The more outdated your statistics are, the greater the chances that your query optimizer will end up with a sub-optimal plan.
So what do you do when you have a database that is highly transactional in nature and quite large in size? You might be in a scenario where the database is online 24x7 and there is no downtime window in which you can update your statistics with a FULLSCAN.

Before we get into my opinion about updating statistics on a database, let's make sure we understand what I mean by statistics being "up-to-date": the sampling rate used while updating the statistics was 100%, i.e. they were updated with a FULLSCAN. Contrary to popular belief, statistics updated through the AUTO UPDATE STATISTICS property of a database are not updated with a FULLSCAN. The sole reason is this: suppose a table has 1 GB of data and a query triggers an auto-update of a statistic on that table; if the sampling were 100%, your query would take an eternity.

So, if you have a large database:
1. Ensure that AUTO UPDATE STATISTICS is turned ON for the database. This provides some relief.
2. Identify which tables need their statistics updated with a fullscan. Only the tables that receive the maximum number of queries need this.
3. Next, identify a period when database usage is low. Use this window to update the statistics of the other tables which did not qualify for the list in #2.
4. Use a script to update the statistics of the tables identified in step #2, possibly based on rowmodctr values. (ROWMODCTR counts the total number of inserted, deleted, or updated rows since the last time statistics were updated for the table. In SQL Server 2005, this is not always going to be helpful [Ref: http://msdn2.microsoft.com/en-us/library/ms190283.aspx], but in SQL 2000 it can be used as a deciding factor.)

Sample Script
============
Let’s say I had identified 4 tables in STEP 2:
CREATE TABLE UPD_STATS_TBL (tblname varchar(10), dt_updated datetime, rowmodctr bigint) 

SELECT * FROM UPD_STATS_TBL 

tblname  dt_updated               rowmodctr
-------  -----------------------  ---------
TBL1     2007-10-24 21:13:46.123  20000
TBL2     2007-10-23 21:13:46.123  400000
TBL3     2007-10-22 21:13:46.123  508000
TBL4     2007-10-24 20:13:46.123  87000
 

**************************************************  

CREATE PROC UPD_STATS_DB
AS
DECLARE @tbl varchar(10)
DECLARE @stmt varchar(100)

-- Refresh the tracking table with the current rowmodctr for each table
UPDATE upd_stats_tbl
SET rowmodctr = (SELECT MAX(rowmodctr) FROM sys.sysindexes WHERE id = OBJECT_ID(tblname))

-- Pick the table that crossed the change threshold and was updated longest ago
-- (for SQL Server 2005, you would have to use the STATS_DATE function instead
-- of rowmodctr values; refer to my statistics update post for more details)
SELECT TOP 1 @tbl = tblname
FROM UPD_STATS_TBL
WHERE rowmodctr > 10000
ORDER BY dt_updated

-- Guard against the case where no table qualifies
IF @tbl IS NOT NULL
BEGIN
    SET @stmt = 'UPDATE STATISTICS' + SPACE(1) + @tbl + SPACE(1) + 'WITH FULLSCAN'
    EXEC (@stmt)

    UPDATE upd_stats_tbl
    SET dt_updated = GETDATE()
    WHERE tblname = @tbl
END

**************************************************
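A usage sketch: each execution updates the single most-stale qualifying table, so the procedure is meant to be run repeatedly, for example from a SQL Agent job scheduled during the low-usage window identified in step 3:

EXEC UPD_STATS_DB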
NOTE: You could refine this down to individual statistics on a table if you wanted to.

Statistics Update

This is an important database maintenance task. It keeps statistics up-to-date, which in turn ensures that the query optimizer doesn't end up with sub-optimal plans. The two ways to monitor whether statistics need an update are:
1. STATS_DATE function

2. rowmodctr value in the sysindexes (For SQL 2000) and sys.sysindexes (For SQL 2005) output

The STATS_DATE function gives you the date when the statistics were last updated. The rowmodctr value tells you how many updates, inserts and deletes have taken place in the data of the column the index is associated with. However, this counter is no longer dependable on SQL Server 2005 and higher versions, which track column modifications on tables and, based on those, update statistics when the AUTO UPDATE STATISTICS property of the database is set.

The frequency of statistics update depends on the following:
1. Do BULK operations happen on the database tables in question?
2. Is there an off-peak period where a statistics update with full scan can be done?
3. How often is the data modified in the database tables in question?

Statistics update can be a resource-intensive operation depending on the size of the table. If the data in the tables in question changes very rarely, then a statistics update with a fullscan can be done during a maintenance window. A statistics update is always an ONLINE operation and doesn't require the database to be in an OFFLINE mode.

If BULK operations happen on a table, the statistics have a tendency to get skewed afterwards. The best option is to perform a statistics update on that table after the bulk operation, before it is used by the application again.

If the database is updated very frequently and is very large in size, then you need to determine which tables are modified most often. A statistics update can then be performed for only those tables, and a statistics update for the entire database can be done with a lower sampling rate, like 20-30%, depending on what is suitable. A suitable sampling rate can be determined by comparing historical data and finding out what works for your needs.
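For example, a lower-sampling update for a single table might look like this (table name hypothetical):

UPDATE STATISTICS dbo.tblOrders WITH SAMPLE 25 PERCENT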

Another option is to enable the AUTO UPDATE STATISTICS option for the database, but it needs to be monitored to see whether it is a boon or a bane. AUTO UPDATE STATISTICS is good when it keeps statistics relatively fresh and makes sure they don't fall too far out of place. The downside shows up when statistics become outdated too frequently: auto-update statistics then fires all the time, and every update causes the cached plans that depend on the affected statistics to be recompiled. This can become a serious bottleneck if too many auto-update statistics events fire in a short time span.
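The database-level switches themselves are plain ALTER DATABASE options; a sketch with a hypothetical database name:

ALTER DATABASE SalesDB SET AUTO_UPDATE_STATISTICS ON
-- Optional, SQL Server 2005 and later: let queries compile against the old
-- statistics while the update runs in the background
ALTER DATABASE SalesDB SET AUTO_UPDATE_STATISTICS_ASYNC ON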

The following methods can be used to update the statistics:
1. UPDATE STATISTICS WITH FULLSCAN (or a lower sampling rate). Please refer to the following article for further details: http://msdn2.microsoft.com/en-us/library/aa260645(SQL.80).aspx

2. sp_updatestats

3. A maintenance plan or SQL Agent job to update the statistics
A script along the lines of the one below helps determine how badly the statistics for an index have drifted. rowmodctr counts the total number of inserted, deleted, or updated rows since the last time statistics were updated for the table; as noted earlier, this counter is dependable on SQL Server 2000 only.
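A minimal sketch against sysindexes (SQL Server 2000):

select object_name(i.id) as TableName, i.name as IndexName,
       i.rowmodctr as RowsModified,
       stats_date(i.id, i.indid) as StatsLastUpdated
from sysindexes as i
where i.indid between 1 and 254
and objectproperty(i.id, 'IsUserTable') = 1
order by i.rowmodctr desc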

A DBCC DBREINDEX would update the statistics associated with those indexes, but it would not update the AUTO CREATED STATISTICS. The DBCC SHOW_STATISTICS command can also help you determine the condition of the statistics for a particular index: if the "Rows Sampled" and "Rows" values are equal, the last update used a 100% sampling rate.

Modification: May 13, 2011:

I had earlier stated that a DBCC DBREINDEX wouldn't update the auto-created statistics, but this is incorrect. The current behavior is that auto-created statistics and manually created column statistics are not updated only when ALTER INDEX…REBUILD ALL is used. When you use DBCC DBREINDEX, it updates both the manually created and auto-created statistics, with the default sampling rate.


Statistics for INDEX ‘pk_customers’.

Updated             Rows  Rows Sampled  Steps  Density       Average key length
------------------  ----  ------------  -----  ------------  ------------------
Jun 23 2007 5:03PM  91    91            91     1.0989011E-2  10.0


The following articles could be helpful:
UPDATE STATISTICS
http://msdn2.microsoft.com/en-us/library/aa260645(SQL.80).aspx
Statistics used by the Query Optimizer in SQL Server 2005
http://www.microsoft.com/technet/prodtechnol/sql/2005/qrystats.mspx