Cache files–Why are you increasing in size?

Management Data Warehouse (MDW) provides a mechanism to collect Perfmon and DMV outputs and store them in a SQL Server database so that you can troubleshoot performance issues after the fact. This is particularly helpful for intermittent performance issues on your SQL Server instances, since it is rarely possible to have someone monitoring the system 24x7 to catch the problem while it is occurring.

Management Data Warehouse provides two ways of collecting data:

1. Cached mode – Collection Set data is stored in a .cache file in the cache file directory and uploaded to the MDW database based on the upload schedule that you defined for the Collection Set.
2. Non-cached mode – This method of collecting data doesn’t depend on a local cache. The Collection Set collects its data and uploads it directly, either on demand or based on a schedule. (The query below shows which mode each configured collection set is using.)
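
To see which mode each of your configured collection sets is using, you can query the data collector catalog views in msdb; a minimal sketch (collection_mode 0 is cached, 1 is non-cached):

-- List the configured collection sets and their collection mode.
-- collection_mode: 0 = cached, 1 = non-cached
SELECT collection_set_id,
       name,
       collection_mode,
       days_until_expiration,
       is_running
FROM msdb.dbo.syscollector_collection_sets;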

[Screenshot from Books Online: run-time processing of the MDW collection and upload jobs]

The screenshot (above) from Books Online of the run-time processing of the MDW jobs shows how the data is uploaded to the MDW database. When you have a highly transactional system with a large number of concurrent connections to your SQL Server instance, or a lot of processes running on the Windows server, the amount of data collected by the System Data Collection Sets can be quite large for a single iteration, especially for the Server Activity system data collection set.

So what can happen if the .cache files become very large? Let’s say your upload schedule is set to every 15 minutes. The data collected by the Collection Set accumulates in the cache file, and the scheduled upload job then uses this cache file to upload the data using the SSIS upload package(s).

Now what happens if you miss an upload schedule because the upload job ran longer than 15 minutes? Your .cache file now has two sets of data waiting to be uploaded. This cumulative build-up of successive collections across missed upload schedules becomes a real problem once the upload job takes more than 60 minutes: the SSIS package(s) are hard-coded to time out after 60 minutes of execution, and you are then stuck with cache data in your cache folder that can never be uploaded.
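
A quick way to check whether you are heading towards this situation is to look at the run durations of the data collector upload jobs in the SQL Agent job history. The sketch below assumes the default collection_set_%_upload job names; run_duration is stored as an HHMMSS-encoded integer:

-- Upload job runs that took 45 minutes or longer, i.e. getting close to the
-- 60-minute SSIS package timeout (run_duration is encoded as HHMMSS).
SELECT j.name AS job_name,
       h.run_date,
       h.run_time,
       h.run_duration
FROM msdb.dbo.sysjobs AS j
INNER JOIN msdb.dbo.sysjobhistory AS h
        ON h.job_id = j.job_id
WHERE j.name LIKE N'collection_set%upload'
  AND h.step_id = 0   -- job outcome rows only
  AND ( (h.run_duration / 10000) * 3600
      + ((h.run_duration / 100) % 100) * 60
      +  (h.run_duration % 100) ) >= 45 * 60
ORDER BY h.run_date DESC, h.run_time DESC;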

When the SSIS package times out after 60 minutes of execution, you will get the following error in the job history for the job associated with the upload package:

Server Activity,Error,4861,Server Activity,,Error,The thread “ExecMasterPackage” has timed out 3600 seconds after being signaled to stop.,Upload,17/05/2011 13:11:37,,17/05/2011 14:11:38,3601,<server name>,,,

So what are your options here?

Option 1
1. Change the Server Activity collection to non-cached mode – This ensures that the data is uploaded as soon as it is collected, without using an interim cache file (see the sketch after this list).
2. Increase the collection interval – The collection interval defaults to 15 seconds. Increasing it to 30 seconds or more reduces the volume of data collected between the scheduled upload intervals.
3. Modify the sp_purge_data stored procedure so that the purge job runs faster – An updated version of sp_purge_data is available here. This ensures that the purge job completes faster.
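
Here is a sketch of how the first two changes could be scripted with the data collector stored procedures in msdb. The collection item name below is an assumption; check syscollector_collection_items on your instance, and stop the collection set before changing it:

-- Look up the Server Activity collection set.
DECLARE @set_id int;
SELECT @set_id = collection_set_id
FROM msdb.dbo.syscollector_collection_sets
WHERE name = N'Server Activity';

-- Stop the collection set before changing its configuration.
EXEC msdb.dbo.sp_syscollector_stop_collection_set @collection_set_id = @set_id;

-- 1. Switch the collection set to non-cached mode (collection_mode = 1).
EXEC msdb.dbo.sp_syscollector_update_collection_set
     @collection_set_id = @set_id,
     @collection_mode = 1;

-- 2. Increase the collection interval (in seconds) of a collection item.
--    The item name here is an assumption; verify it in syscollector_collection_items.
DECLARE @item_id int;
SELECT @item_id = collection_item_id
FROM msdb.dbo.syscollector_collection_items
WHERE collection_set_id = @set_id
  AND name = N'Server Activity - Performance Counters';

EXEC msdb.dbo.sp_syscollector_update_collection_item
     @collection_item_id = @item_id,
     @frequency = 30;

-- Restart the collection set.
EXEC msdb.dbo.sp_syscollector_start_collection_set @collection_set_id = @set_id;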

Option 2
Evaluate the perfmon counters being collected by the Server Activity collection set. Work out which counters are actually required, and then create a custom collection set that captures only that data using non-cached collection mode.

A custom perfmon collection set makes sense here because the default Server Activity collection set collects and uploads data for a large number of perfmon counters.

Below is a custom collection set created for Performance Monitor counters. I have shown how to collect both server-level perfmon counters and SQL Server instance-specific perfmon counters.

use msdb;

Begin Transaction
Begin Try

Declare @collection_set_id_1 int
Declare @collection_set_uid_2 uniqueidentifier

-- Create the collection set in non-cached mode (@collection_mode = 1)
EXEC [dbo].[sp_syscollector_create_collection_set]
    @name=N'Perfmon Counter Collection Set',
    @collection_mode=1,
    @description=N'Collects Perfmon Counters',
    @target=N'',
    @logging_level=0,
    @days_until_expiration=7,
    @proxy_name=N'',
    @schedule_name=N'CollectorSchedule_Every_5min',
    @collection_set_id=@collection_set_id_1 OUTPUT,
    @collection_set_uid=@collection_set_uid_2 OUTPUT

Select @collection_set_id_1, @collection_set_uid_2

-- Look up the collector type used for performance counters
Declare @collector_type_uid_3 uniqueidentifier
Select @collector_type_uid_3 = collector_type_uid
From [dbo].[syscollector_collector_types]
Where name = N'Performance Counters Collector Type';

Declare @collection_item_id_4 int

-- Add a collection item listing the perfmon counters to capture.
-- $(INSTANCE) resolves to the instance-specific counter object name.
EXEC [dbo].[sp_syscollector_create_collection_item]
    @name=N'Logical Disk Collection and SQL Server CPU',
    @parameters=N'<ns:PerformanceCountersCollector xmlns:ns="DataCollectorType">
<PerformanceCounters Objects="LogicalDisk" Counters="Avg. Disk Bytes/Read" Instances="*" />
<PerformanceCounters Objects="LogicalDisk" Counters="Avg. Disk Bytes/Write" Instances="*" />
<PerformanceCounters Objects="LogicalDisk" Counters="Avg. Disk sec/Read" Instances="*" />
<PerformanceCounters Objects="LogicalDisk" Counters="Avg. Disk sec/Write" Instances="*" />
<PerformanceCounters Objects="LogicalDisk" Counters="Disk Read Bytes/sec" Instances="*" />
<PerformanceCounters Objects="LogicalDisk" Counters="Disk Write Bytes/sec" Instances="*" />
<PerformanceCounters Objects="Process" Counters="% Privileged Time" Instances="*" />
<PerformanceCounters Objects="Process" Counters="% Processor Time" Instances="*" />
<PerformanceCounters Objects="Process" Counters="% User Time" Instances="*" />
<PerformanceCounters Objects="$(INSTANCE):Buffer Manager" Counters="Page life expectancy" Instances="*" />
</ns:PerformanceCountersCollector>',
    @collection_item_id=@collection_item_id_4 OUTPUT,
    @frequency=5,   -- collection frequency in seconds
    @collection_set_id=@collection_set_id_1,
    @collector_type_uid=@collector_type_uid_3

select @collection_item_id_4

Commit Transaction;
End Try
Begin Catch
    Rollback Transaction;
    DECLARE @ErrorMessage NVARCHAR(4000);
    DECLARE @ErrorSeverity INT;
    DECLARE @ErrorState INT;
    DECLARE @ErrorNumber INT;
    DECLARE @ErrorLine INT;
    DECLARE @ErrorProcedure NVARCHAR(200);
    SELECT @ErrorLine = ERROR_LINE(),
           @ErrorSeverity = ERROR_SEVERITY(),
           @ErrorState = ERROR_STATE(),
           @ErrorNumber = ERROR_NUMBER(),
           @ErrorMessage = ERROR_MESSAGE(),
           @ErrorProcedure = ISNULL(ERROR_PROCEDURE(), '-');
    RAISERROR (14684, @ErrorSeverity, 1, @ErrorNumber, @ErrorSeverity, @ErrorState, @ErrorProcedure, @ErrorLine, @ErrorMessage);
End Catch;
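
Once the script above succeeds, the collection set still has to be started. A small follow-up sketch, looking the set up by the name used above:

-- Look up the newly created collection set by name and start it.
DECLARE @id int;
SELECT @id = collection_set_id
FROM msdb.dbo.syscollector_collection_sets
WHERE name = N'Perfmon Counter Collection Set';

EXEC msdb.dbo.sp_syscollector_start_collection_set @collection_set_id = @id;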

Reference:
How to: Create a Custom Collection Set That Uses the Generic T-SQL Query Collector Type
Using SQL Server 2008 Management Data Warehouse for database monitoring in my application

DBREINDEX and ALTER INDEX differences

I wrote about update statistics a long time back and had been meaning to write a clarification on the difference in behavior with respect to table statistics being updated when rebuilding indexes using different T-SQL commands. I am sure you are aware of why statistics on a table need to be kept up to date (ideally with a fullscan) so that the optimizer doesn’t fall prey to sub-optimal plans. However, there is a little-known difference in how statistics are affected when a DBCC DBREINDEX is issued on a table versus an ALTER INDEX…REBUILD ALL.

I created a table with 4 integer columns and inserted 100,000 records into the table using the script below:

-- Create table
create table tblUpdStats (a int primary key,b int,c int, d int)
-- Insert 100,000 rows in the table
set nocount on
declare @var int = 1
while @var<=100000
begin
insert into tblUpdStats values(@var,@var+1,@var+2,@var+3)
set @var+=1
end
-- Create a non-clustered index on column b
create index idx_b on tblUpdStats(b)
-- Create a user-defined statistics on column c
create statistics stat_c on tblUpdStats(c)
-- Execute the following query so that the database engine auto-creates statistics on column d
select d from tblUpdStats where d > 50000

Name                            Updated              Rows    Rows Sampled
------------------------------- -------------------- ------- ------------
idx_b                           May 13 2011  8:03PM  100000  100000
PK__tblUpdSt__3BD0198E558AAF1E  May 13 2011  8:03PM  100000  100000
stat_c                          May 13 2011  8:03PM  100000  100000
_WA_Sys_00000004_53A266AC       May 13 2011  8:03PM  100000  100000
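
One way (a sketch) to gather this kind of snapshot is to combine STATS_DATE with DBCC SHOW_STATISTICS … WITH STAT_HEADER for each statistic:

-- Last update date for every statistics object on the table.
SELECT s.name AS stats_name,
       STATS_DATE(s.object_id, s.stats_id) AS updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('tblUpdStats');

-- Rows and Rows Sampled come from the statistics header, per statistic.
DBCC SHOW_STATISTICS ('tblUpdStats', 'idx_b') WITH STAT_HEADER;
DBCC SHOW_STATISTICS ('tblUpdStats', 'stat_c') WITH STAT_HEADER;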

I took a snapshot of the DBCC SHOW_STATISTICS output for the statistics associated with the non-clustered index, the clustered index, the user-created statistics and the auto-created statistics. The Rows and Rows Sampled values are all equal, which means the statistics were updated with a fullscan. I then inserted another 100,000 rows into the same table and executed the following command to rebuild the indexes on the table:

alter index all on tblUpdStats rebuild

I took a snapshot of the DBCC SHOW_STATISTICS output again; notice the difference below:

Name                            Updated              Rows     Rows Sampled
------------------------------- -------------------- -------- ------------
idx_b                           May 13 2011  8:19PM  200000   200000
PK__tblUpdSt__3BD0198E558AAF1E  May 13 2011  8:19PM  200000   200000
stat_c                          May 13 2011  8:03PM  100000   100000
_WA_Sys_00000004_53A266AC       May 13 2011  8:03PM  100000   100000

You will see that the user-created and auto-created statistics do not reflect the change in the number of rows. However, when I execute a DBCC DBREINDEX on the table, I see a different picture altogether (see the SHOW_STATISTICS output below).

dbcc dbreindex ('tblUpdStats','')

Name                            Updated              Rows     Rows Sampled
------------------------------- -------------------- -------- ------------
idx_b                           May 13 2011  8:31PM  200000   200000
PK__tblUpdSt__3BD0198E558AAF1E  May 13 2011  8:31PM  200000   200000
stat_c                          May 13 2011  8:31PM  200000   200000
_WA_Sys_00000004_53A266AC       May 13 2011  8:31PM  200000   200000

Looking at the output now, the Updated column shows the same date for all the statistics objects, and Rows Sampled equals Rows in every case, indicating that a fullscan also occurred for the user-created and auto-created statistics on the table.

To summarize, ALTER INDEX…REBUILD ALL will not update statistics that were created manually or auto-created by the Database Engine; it only updates the statistics associated with the indexes defined on the table. DBCC DBREINDEX on the table, on the other hand, will also update the user-created and auto-created statistics, using the default sampling rate (the default sampling comes into play for very large tables). Note that if you have a maintenance plan job which performs an index rebuild operation on SQL Server 2005 or later, it uses the ALTER INDEX…REBUILD command.
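
If your maintenance relies on ALTER INDEX…REBUILD, one way to close the gap (a sketch, using the demo table above) is to follow the rebuild with an UPDATE STATISTICS call restricted to column statistics:

-- Refresh only the column statistics (user-created and auto-created)
-- that ALTER INDEX ... REBUILD ALL leaves untouched.
UPDATE STATISTICS tblUpdStats WITH FULLSCAN, COLUMNS;

-- Verify the statistics header afterwards.
DBCC SHOW_STATISTICS ('tblUpdStats', 'stat_c') WITH STAT_HEADER;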

A shot in the arm for sp_purge_data

I recently worked on a performance issue with the Management Data Warehouse job “mdw_purge_data”. This job calls the stored procedure sp_purge_data, which was taking a long time to complete. The stored procedure has just received a shot in the arm in the form of an update released by the SQL Server developer who owns it. The T-SQL for the modified stored procedure is available on the SQL Agent MSDN blog, and the updated stored procedure itself will ship in an upcoming update for SQL Server 2008 R2.

Addition, July 20th, 2011: SQL Server 2008 R2 Service Pack 1 contains various updates for the purge process, which make it unnecessary to modify the procedure using the script from the aforementioned blog post.

A recent discussion on Twitter reminded me that I had this as one of the items to blog about so here it is. Now for some of the interesting technical details.

The poor performance of the stored procedure has been addressed through a refactoring of the stored procedure code, but the issue doesn’t occur on all Management Data Warehouse databases.

The issue occurs when you have large amounts of data stored in [snapshots].[query_stats] for each unique snapshot ID, and it is compounded if you have orphaned rows in the [snapshots].[notable_query_plan] and [snapshots].[notable_query_text] tables of the Management Data Warehouse. The orphaned-row problem was corrected in the builds mentioned in the KB article below:

970014    FIX: The Management Data Warehouse database grows very large after you enable the Data Collector feature in SQL Server 2008
http://support.microsoft.com/default.aspx?scid=kb;EN-US;970014

The above KB article also describes a workaround that allows you to clean up the orphaned rows using a T-SQL script.

The “Query Statistics” data collection set collects query statistics, T-SQL text, and query plans of most of the statements that affect performance, enabling analysis of poorly performing queries in relation to overall SQL Server Database Engine activity. It uses the sys.dm_exec_query_stats DMV to populate the entries in the snapshots.query_stats table. If this DMV returns a large number of rows each time the collection set runs, the number of rows in the Management Data Warehouse snapshot tables can grow very quickly, which in turn causes performance issues when the purge job executes against the MDW database.
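
If you want to check whether your MDW database is heading in that direction, a quick sketch (assuming the standard MDW schema, where the snapshot tables carry a snapshot_id column) is to look at how many rows each snapshot contributes to snapshots.query_stats:

-- Run in the Management Data Warehouse database.
-- Steadily growing row counts per snapshot are what slow sp_purge_data down.
SELECT qs.snapshot_id,
       COUNT(*) AS rows_in_snapshot
FROM snapshots.query_stats AS qs
GROUP BY qs.snapshot_id
ORDER BY qs.snapshot_id;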

In summary, if you are using Management Data Warehouse to monitor a highly active server where a lot of ad-hoc or prepared SQL queries are being executed, then I would recommend modifying the sp_purge_data stored procedure using the MSDN blog link I mentioned earlier in this post. I have seen significant performance improvement from this on the specific environment I was working on.

T-SQL Tuesday #016 : Aggregate Functions

This month’s T-SQL Tuesday is being hosted by Jes Borland (Twitter | Blog) and the topic for this month is “Aggregate Functions”. What really caught my eye was the play on T-SQL functions that shows exactly why Jes lives up to her Twitter handle of grrrl_geek:

Let’s SUM(thoughts), COUNT(ways we’ve done things), and set MAX(awesome) on our posts!

So at this point, I shall SELECT Content FROM PastExperience WHERE Type = ‘Aggregate Functions’ and pen it together in this blog post!

Aggregate functions have long been a great way to slice and dice data stored in any RDBMS product and present it in the form the business wants to look at. I have seen, and written, plenty of T-SQL queries that use the HAVING clause along with aggregate functions to filter data out of the final dataset. I was also quite taken with the GROUPING function combined with ROLLUP for the set of canned queries I use to aggregate analysis for SQL performance cases with SQL Nexus/RML Utilities.

The query below gives me the SUM of the CPU/Duration/Reads/Writes for all the batches which completed execution while the profiler traces were being captured.


select a.hashid, b.OrigText,
SUM(a.duration) as SumDuration, SUM(a.Reads) as SumReads, SUM(a.Writes) as SumWrites, SUM(a.CPU) as SumCPU,
MAX(a.duration) as MaxDuration, MAX(a.Reads) as MaxReads, MAX(a.Writes) as MaxWrites, MAX(a.CPU) as MaxCPU,
MIN(a.duration) as MinDuration, MIN(a.Reads) as MinReads, MIN(a.Writes) as MinWrites, MIN(a.CPU) as MinCPU,
AVG(a.duration) as AvgDuration, AVG(a.Reads) as AvgReads, AVG(a.Writes) as AvgWrites, AVG(a.CPU) as AvgCPU,
GROUPING(a.hashid) AS [GroupNo]
from readtrace.tblBatches a
inner join readtrace.tblUniqueBatches b
on a.HashID = b.HashID
where EndTime is not null
group by a.hashid, b.OrigText WITH ROLLUP

With the above query, I get every query’s average, minimum, maximum and total statistics, along with a rollup of the reads, writes, duration and CPU.

COMPUTE is another not-so-commonly used piece of T-SQL which can really help with data analysis, especially from my perspective, since I spend a significant part of my day looking into SQL performance issues. The query below gives me all the long-running batches along with the duration/CPU information for each run, with a computed sub-total for each group.

select a.spid,b.origtext,a.duration, a.cpu
from readtrace.tblBatches a
inner join readtrace.tblUniqueBatches b
on a.HashID = b.HashID
where EndTime is not null
order by a.duration,a.spid
compute sum(duration) , sum(cpu)  BY a.duration,a.spid;

An example output is:

[Screenshot: example COMPUTE BY output]
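
COMPUTE…BY is a deprecated construct, so it is worth noting that a similar detail-plus-subtotal view can be produced with window aggregates; a sketch against the same tables:

-- Detail rows with a per-group sub-total computed as a window aggregate,
-- mirroring the COMPUTE ... BY a.duration, a.spid query above.
select a.spid, b.origtext, a.duration, a.cpu,
       sum(a.duration) over (partition by a.duration, a.spid) as group_sum_duration,
       sum(a.cpu)      over (partition by a.duration, a.spid) as group_sum_cpu
from readtrace.tblBatches a
inner join readtrace.tblUniqueBatches b
       on a.HashID = b.HashID
where EndTime is not null
order by a.duration, a.spid;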

Another area where aggregation helps is identifying the amount of CPU used by queries captured in profiler traces. Why is this important, you ask? It matters when I need to compare the CPU usage reported by SQL Server (captured using Perfmon/WMI etc.) with the actual time the user queries spent executing on CPU. The Statement, RPC, Batch and SP completed events capture the CPU consumption for each request, so aggregating them gives you the total time user queries spent executing on CPU. The approach below uses UNION and SUM to work out the total percentage of CPU used by the queries which completed during the profiler trace capture. The only parameter value you need to provide is the number of CPUs, i.e. the number of logical CPUs being used by the SQL instance.

The reason I emphasize queries which completed is that the T-SQL batch below works on the presumption that the captured profiler traces contain completion events for all the queries that were executing during the data collection timeframe. The batch would produce incorrect results if you had a long-running CPU-driven query which continued executing after the profiler trace capture was stopped. It is ideal for scenarios where a profiler trace was started, a workload was executed on the server, and then the trace was stopped; in that case it helps you identify the percentage of CPU used by all queries executed by the workload.

DECLARE @MinTime DATETIME
DECLARE @MaxTime DATETIME
DECLARE @ms_diff BIGINT
DECLARE @number_of_cpus INTEGER

SET @number_of_cpus = 2  --Change this as necessary

SELECT
@MinTime = MIN(ISNULL(u.MinTime, '12-31-9999'))
FROM
(
SELECT MIN(StartTime) AS MinTime
FROM ReadTrace.tblBatches
UNION ALL
SELECT MIN(EndTime) AS MinTime
FROM ReadTrace.tblBatches
UNION ALL
SELECT MIN(StartTime) AS MinTime
FROM ReadTrace.tblStatements
UNION ALL
SELECT MIN(EndTime) AS MinTime
FROM ReadTrace.tblStatements
) AS u

SELECT
@MaxTime = MAX(ISNULL(u.MaxTime, '01-01-1900'))
FROM
(
SELECT MAX(StartTime) AS MaxTime
FROM ReadTrace.tblBatches
UNION ALL
SELECT MAX(EndTime) AS MaxTime
FROM ReadTrace.tblBatches
UNION ALL
SELECT MAX(StartTime) AS MaxTime
FROM ReadTrace.tblStatements
UNION ALL
SELECT MAX(EndTime) AS MaxTime
FROM ReadTrace.tblStatements
) AS u

IF (@MinTime = '12-31-9999' OR @MaxTime = '01-01-1900')
BEGIN
RAISERROR ('ERROR: The trace does not contain appropriate columns to determine the min/max time values for events in the trace.',
16, 1)
RETURN
END

SELECT
@ms_diff = DATEDIFF(ms, @MinTime, @MaxTime) * @number_of_cpus
FROM
ReadTrace.tblBatches AS b

SELECT @MinTime AS MinTime, @MaxTime AS MaxTime, @ms_diff AS total_CPU_milliseconds, @number_of_cpus AS NumCpus

SELECT
(CAST(SUM(cpu) AS FLOAT) / @ms_diff) * 100 AS cpu_percentage,
SUM(cpu) AS sum_cpu, COUNT(*) AS executions, ub.normtext
INTO
#temp_cpu_percentages
FROM
ReadTrace.tblBatches AS b
INNER JOIN ReadTrace.tblUniqueBatches AS ub
ON b.HashID = ub.HashId
GROUP BY
ub.normtext
HAVING
SUM(cpu) > 0
ORDER BY
sum_cpu DESC


SELECT SUM(t.cpu_percentage) AS SQL_cpu_usage_percent
FROM #temp_cpu_percentages AS t

SELECT TOP 10 *
FROM #temp_cpu_percentages AS t
ORDER BY t.cpu_percentage DESC

DROP TABLE #temp_cpu_percentages

What was the objective of this post?

1. Point out a few not-so-commonly used pieces of T-SQL functionality (GROUPING, ROLLUP, COMPUTE) which can be quite helpful.

2. Show a few ways of aggregating, slicing and dicing data imported into a SQL Nexus database for quick analysis.

Books Online references:

GROUP BY (Transact-SQL)

COMPUTE (Transact-SQL)

What does cntr_type mean?

Have you ever wondered whether the cntr_type column value in the sys.sysperfinfo or sys.dm_os_performance_counters output has any significant meaning? Well, since the column is there, it obviously has a meaning. More importantly, if the raw values reported for some of these counter types are taken at face value, your performance baselining can take a severe beating.

The meaning of each cntr_type value can be found in the WMI Performance Counter Types or Windows Server Performance Counter Type documentation on MSDN.

The common counter types in SQL Server are:
PERF_COUNTER_RAWCOUNT | Decimal | 65536
Raw counter value that does not require calculations, and represents one sample.

PERF_COUNTER_LARGE_RAWCOUNT | Decimal | 65792
Same as PERF_COUNTER_RAWCOUNT, but a 64-bit representation for larger values.

PERF_COUNTER_COUNTER | Decimal | 272696320
Average number of operations completed during each second of the sample interval. NOTE: for "per-second" counters, this value is cumulative. The rate must be calculated by sampling the value at discrete time intervals; the difference between any two successive sample values equals the rate for the interval used. For example, Batch Requests/sec is a per-second counter, so it shows a cumulative value.

PERF_COUNTER_BULK_COUNT | Decimal | 272696576
Average number of operations completed during each second of the sample interval. This counter type is the same as the PERF_COUNTER_COUNTER type, but it uses larger fields to accommodate larger values.

PERF_AVERAGE_BULK | Decimal | 1073874176
Number of items processed, on average, during an operation. This counter type displays a ratio of the items processed (such as bytes sent) to the number of operations completed, and requires a base property with PERF_AVERAGE_BASE as the counter type.

PERF_LARGE_RAW_FRACTION | Decimal | 537003264
Ratio of a subset to its set, displayed as a percentage (for example, a cache hit ratio). The displayed value is the counter value divided by its matching base counter of type PERF_LARGE_RAW_BASE.

PERF_LARGE_RAW_BASE | Decimal | 1073939712
Base value used as the denominator in the calculation of PERF_LARGE_RAW_FRACTION and PERF_AVERAGE_BULK counters, 64 bits.

Example:
If you had the following values:
SQLServer:Plan Cache | Cache Hit Ratio | Temporary Tables & Table Variables | 381
SQLServer:Plan Cache | Cache Hit Ratio Base | Temporary Tables & Table Variables | 386
Then the temp table/variable cache hit ratio would be 381 / 386 ≈ 98.7%.
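
The same calculation can be done directly against the DMV by joining the ratio counter to its base counter; a sketch using the Plan Cache example above:

-- Cache hit ratio = ratio counter value / matching base counter value.
SELECT v.instance_name,
       CAST(100.0 * v.cntr_value / NULLIF(b.cntr_value, 0) AS decimal(5,1)) AS cache_hit_ratio_percent
FROM sys.dm_os_performance_counters AS v
INNER JOIN sys.dm_os_performance_counters AS b
        ON b.object_name   = v.object_name
       AND b.instance_name = v.instance_name
WHERE v.object_name LIKE '%:Plan Cache%'
  AND v.counter_name = 'Cache Hit Ratio'
  AND b.counter_name = 'Cache Hit Ratio Base';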

You can use the query below to get the comments for each counter type as discussed above:

select object_name,counter_name,instance_name,cntr_value,
case cntr_type 
	when 65792 then 'Absolute Meaning' 
	when 65536 then 'Absolute Meaning' 
	when 272696576 then 'Per Second counter and is Cumulative in Nature'
	when 1073874176 then 'Bulk Counter. To get correct value, this value needs to be divided by Base Counter value'
	when 537003264 then 'Bulk Counter. To get correct value, this value needs to be divided by Base Counter value' 
end as counter_comments
from sys.dm_os_performance_counters
where cntr_type not in (1073939712)
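
For the cumulative “per second” counters (cntr_type 272696576), the query above only flags them; the actual rate has to be derived from two samples taken a known interval apart. A sketch:

-- Sample a cumulative per-second counter twice and report the delta per second.
DECLARE @first TABLE (counter_name nchar(128), cntr_value bigint);

INSERT INTO @first (counter_name, cntr_value)
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Batch Requests/sec';

WAITFOR DELAY '00:00:10';   -- 10-second sample interval

SELECT c.counter_name,
       (c.cntr_value - f.cntr_value) / 10.0 AS batches_per_second
FROM sys.dm_os_performance_counters AS c
INNER JOIN @first AS f
        ON f.counter_name = c.counter_name
WHERE c.counter_name = N'Batch Requests/sec';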

 

Documentation on MSDN:

WMI Performance Counter Types

http://msdn.microsoft.com/en-us/library/aa394569(VS.85).aspx

SQL Server 2005 BOL Topic

sys.dm_os_performance_counters (Transact-SQL) 

The broad classes of counters are as follows:

Non-computational Counter Types

http://msdn.microsoft.com/en-us/library/aa392713(VS.85).aspx

Basic Algorithm Counter Types

http://msdn.microsoft.com/en-us/library/aa384813(VS.85).aspx

Counter Algorithm Counter Types

http://msdn.microsoft.com/en-us/library/aa389384(VS.85).aspx

Timer Algorithm Counter Types

http://msdn.microsoft.com/en-us/library/aa393909(VS.85).aspx

Precision Timer Algorithm Counter Types

http://msdn.microsoft.com/en-us/library/aa392755(VS.85).aspx

Queue-length Algorithm Counter Types

http://msdn.microsoft.com/en-us/library/aa392905(VS.85).aspx

Base Counter Types

http://msdn.microsoft.com/en-us/library/aa384811(VS.85).aspx

Statistical Counter Types

http://msdn.microsoft.com/en-us/library/aa393663(VS.85).aspx