Profiler: Why do you collect so much data?

This has always been a classic Catch-22. Should I include that event? But that will bloat my profiler trace size very quickly! And then either your gut feeling takes over and you make a split-second decision, or you make the decision based on your SQL knowledge (plus gut feeling) to include or exclude that event. The decision is always between collecting too much data at the cost of further degrading server performance, or collecting too little data so that you need another round of data collection to figure out the root cause of the performance issue you are troubleshooting.

At the outset, let me make it clear that I am not *perfect* at configuring a profiler trace for every performance issue that you face. But I will say that I have handled enough SQL performance issues and looked at enough profiler traces (I actually do this to earn a living) to help outline a set of DOs and DON'Ts for capturing profiler traces when diagnosing performance issues.

In this post, I shall outline some Profiler event categories and events which generate more noise than anything else. These events are required for troubleshooting some very specific scenarios (which is why they were introduced) and can be done away with when you are capturing data to evaluate general server performance or to identify the resource-consuming queries on your instance.

High frequency event categories

  • Broker – None of the events in the Broker Category are required unless you are troubleshooting an issue related to Database Mail or Service Broker in SQL Server 2005 or above. However, the Broker event category can generate a lot of noise if you have Service Broker/Database Mail configured on the server and are using it heavily.
  • Cursors – If the client application or provider that connects to your SQL Server instance uses server-side cursors, then this event category can lead to an unmanageable amount of collected data which will take you ages to sift through. The sp_cursor* calls will be captured either way in your Stored Procedure event classes, which makes it a moot point to track Cursor: Open, Close, Execute through this event category for general performance issues. The only useful events in this event category are CursorImplicitConversion (which lets you track implicit conversions when the SQL engine doesn't support the requested cursor type) and CursorRecompile (which helps you track down T-SQL cursor recompiles due to schema changes).
  • Locks – Now this event category is a whale if you want my frank opinion. Lock:Acquired and Lock:Released event classes (if captured) can make your data analysis more difficult than scaling Mt. Everest! If you need to track the number of locks for each session, then it is best done outside the profiler using DMVs like sys.dm_tran_locks. However, this event category has some useful event classes like:
    1. Lock: Deadlock and Lock: Deadlock Chain – Helps a lot when troubleshooting deadlock issues
    2. Lock: Timeout and Lock: Timeout (timeout > 0) – Help when troubleshooting timeout-related issues due to lock blocking.
    3. Lock: Escalation – This one is debatable. It helps track down lock escalations, but on servers with a high number of lock escalations this can become a high frequency event.
  • OLEDB – Unless you know, based on the symptoms of the issue, that OLEDB calls need to be traced, you shouldn't be collecting any event from this event category except OLEDB Errors. If your OLEDB provider is returning a large number of errors, you will be able to track them down using the HRESULT obtained from this event class.
  • Query Notifications – Again the same logic mentioned above for Broker event category applies to Query Notifications event category too.
  • Scans – Now this might be useful when you are doing a repro of sorts on a test or development environment, but on a production server this is a strict NO-NO! You really should be able to figure out from the plans whether there was a table/index scan without having to capture a separate noisy event in the profiler traces.
  • Security Audit – A very good event category when troubleshooting security or permissions related issues on a SQL instance. For performance related issues, this is just plain *noise*! The events under this category are fired nineteen-to-the-dozen on a SQL instance, and capturing them would just bloat your profiler trace size rather than do anything useful! The two really, and I mean really, useful events in this category are Audit Login and Audit Logout, which help track the session's connection settings and other attributes like host name, user name etc.
  • Transactions – This event category also has a lot of noise events which would aid in bloating profiler trace size rather than doing anything useful. In particular, the Transaction Manager (TM:*) event classes are not to be captured unless you are troubleshooting an issue where Begin/Commit/Rollback operations take a long time to complete. The SQLTransaction event is quite useful for troubleshooting deadlock related issues as it helps track transaction begin and commit/rollback and nest levels for a particular session. The DTCTransaction event class is useful for troubleshooting DTC transaction related issues and tracking the different states of a DTC transaction.
  • Performance Event Category – This is one of the most widely used event categories in CSS for troubleshooting query performance issues, because it helps you capture query plans. There are a bunch of different event classes in this category which help you capture query plans; I shall divide them into two groups based on the data to be obtained from these events:
    1. Compile-time details: Showplan All, Showplan XML and Showplan Text (occur during query execution but don't contain run-time details like rows returned), and Showplan XML For Query Compile and Showplan All For Query Compile (show you the compile-time query plan). These events can be very useful when you want to capture a query plan for timeout-related issues, as the other events which show you the execution-time query plans may not be generated when you are dealing with query timeouts. So if you are not facing any timeouts and need the query plans with run-time details, you need not capture any of the events mentioned above. The Query Compile event classes (especially the XML event) are required when you are trying to nail down compilation-related issues and want to find out the resources (CPU/Memory) consumed during query compilation.
    2. Run-time details: Showplan Statistics Profile and Showplan XML Statistics Profile show you the query plan with run-time details in text and XML formats respectively. Based on whether you prefer sifting through text or XML, you could capture either one. Note that XML events are not the best when it comes to keeping down the size of the captured profiler traces.
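    As suggested in the Locks bullet above, per-session lock counts are better obtained outside the profiler. A minimal sketch of such a DMV query (SQL Server 2005 and above; the grouping columns are chosen purely for illustration):

    ```sql
    -- Count locks per session using sys.dm_tran_locks instead of
    -- tracing the chatty Lock:Acquired / Lock:Released event classes
    select request_session_id,
           resource_type,
           request_mode,
           request_status,
           count(*) as lock_count
    from sys.dm_tran_locks
    group by request_session_id, resource_type, request_mode, request_status
    order by lock_count desc
    ```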

    There are some other high frequency event classes which are mentioned below:

    1. SP:CacheMiss, SP:CacheInsert, SP:CacheRemove, SP:CacheHit – These events are helpful when tracking plan caching issues, but if you are not dealing with a specific issue related to the procedure cache, then you are going to be dealing with a lot of noise in the profiler traces if these events are captured.
    2. TSQL: Prepare SQL, TSQL: Exec Prepared SQL, TSQL: Unprepare SQL – Useful for troubleshooting specific server-side cursor calls, but otherwise plain noise and should be avoided.

    *TIP: Always check the “Show all Columns” option in the Profiler UI when configuring the trace, as some of the event data columns capture valuable information about the event which is essential to troubleshooting a performance issue, like ObjectID, ObjectName etc.

    The events to be captured using SQL Profiler for common issues like High CPU usage, excessive I/O, timeouts etc. will be covered through a series of posts later.

    To summarize, the profiler can be a friendly tool when run using a Server Side trace with the right set of events being captured.
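    To illustrate the server-side trace option mentioned above, here is a minimal sketch using the documented sp_trace_* procedures. The trace file path and maximum file size are assumptions; event 12 is SQL:BatchCompleted, and the column IDs are taken from the sp_trace_setevent documentation:

    ```sql
    -- Minimal server-side trace: capture SQL:BatchCompleted (event 12)
    -- with a lean set of data columns. Path and size are assumptions.
    declare @traceid int, @maxsize bigint
    set @maxsize = 512
    exec sp_trace_create @traceid output, 0, N'D:\Traces\PerfTrace', @maxsize
    -- Columns: 1=TextData, 12=SPID, 13=Duration, 16=Reads, 17=Writes, 18=CPU
    exec sp_trace_setevent @traceid, 12, 1, 1
    exec sp_trace_setevent @traceid, 12, 12, 1
    exec sp_trace_setevent @traceid, 12, 13, 1
    exec sp_trace_setevent @traceid, 12, 16, 1
    exec sp_trace_setevent @traceid, 12, 17, 1
    exec sp_trace_setevent @traceid, 12, 18, 1
    exec sp_trace_setstatus @traceid, 1   -- start the trace
    ```

    Running the trace server-side like this avoids the rowset overhead of the Profiler UI while capturing exactly the events you chose.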

    T-SQL Tuesday #13 – What the Business Says Is Not What the Business Wants

    This is my second post for T-SQL Tuesday. This time around, T-SQL Tuesday is being hosted by Steve Jones (Twitter | Blog), The Voice of the DBA on SQLServerCentral.com. The topic is not a technical one, but it is nonetheless very important in the grander scheme of things.

    I have seen multiple scenarios where the teams involved in deploying solutions/projects work in silos. There is the management team which decides what the final outcome of the project should look like and what the end deliverables are. Then there is the set of developers who can throw together a gazillion lines of code (in multiple languages) to put together a front-end, and maybe a middle-tier, which talks to a database server, fetches data and presents it to the end-user. And finally a set of DBAs who are always guilty of the post-deployment issues “till the database is proven innocent”, as most applications hit a database to fetch data at one time or another.

    There are some key things to consider and contemplate about before you start writing code and deploying your solutions.

    What’s the business?


    Knowing your business is like knowing your backyard. The bigger it is, the more lethargic you tend to become. Business acumen is not something that is missing in individuals; it has more to do with “I want to” versus “I don’t want to”! The positive side of knowing enough about your business is that it can make your life easier. It will prevent you from spending additional hours troubleshooting database performance issues because you deployed something that was not needed for the business and was unfriendly to your current database environment. So unless you have a fair idea of why you are implementing what you are implementing, you are going to invite trouble later if not sooner!

    Requirement Analysis: The Who, What and How

    If you have a background in Software Design or have been part of a project, then you will know what Requirement Analysis (RA) is. However, with time-to-market commitments becoming tighter and tighter in a competitive environment, RA becomes a non-entity or a formality, with developers developing without knowing why they are doing what they are doing. A definite recipe for disaster! Unless you understand WHO is going to be using what you are developing, WHAT purpose it is going to serve and HOW they will be using it, you will not be in a position to explain why a module or feature may be overkill or completely unnecessary. Figuring out the Who, What and How requires a certain understanding of the business (my first point), and without this you are bound to face challenges!

    Good-to-have and Must-have

    You must have heard this a million times: what is a must-have as opposed to a good-to-have? Without knowing this, you might not be able to land a project where the end deliverables are met. It is never a good idea to over-commit. Timelines and deliverables need to be frozen before you start developing, which points back to the planning phase. Just because the business asks for it doesn’t mean that the business needs it. Find out if it is going to serve the end-goal. What is being prioritized may not be a must-have but actually a good-to-have. Must-haves in a project are non-negotiable (if the right RA has been done) and the good-to-have is, well… (the phrase is self-explanatory)

    The shops where there are smart people who can look beyond their egos and work efficiently in a cross-team environment usually don’t land in a soup post-deployment, using band-aids to fix a “gone-to-hell” deployment.

    Conclusion: Cohesion in communication among teams is required to formulate a concise design goal that achieves or solves a specific business need. Without this, you will be back to the drawing board in no time after spending a substantial amount of effort and $$ in a futile exercise.

    Note: The above post is not penned to finger-point at any set of individuals or a group. It is just an attempt to help understand the need for cross collaboration between business groups. Additionally, it is equally pertinent to know your backyard before you deploy something. This is applicable to all business groups and not just DBAs!

    Volume Shadow barfs with 0x80040e14 code

    I was replying to an MSDN forum thread recently and found that a VSS backup was failing due to a large number of database files residing on a particular volume.

    I re-created the issue on my box by limiting Max Worker Threads to 128 on my SQL Server 2008 instance. After making the change to the number of worker threads, I created 99 databases on the D: drive of my server using the following script:

    set nocount on
    declare @cntr int = 1, @sql varchar(8000)
    while @cntr < 100
    begin
    set @sql = 'CREATE DATABASE [db'+cast(@cntr as varchar(5))+'] ON PRIMARY
    ( NAME = N''db'+cast(@cntr as varchar(5))+''', FILENAME = N''D:\DBFiles\db'+cast(@cntr as varchar(5))+'.mdf'')
    LOG ON
    ( NAME = N''db'+cast(@cntr as varchar(5))+'_log'', FILENAME = N''D:\DBFiles\db'+cast(@cntr as varchar(5))+'.ldf'')'
    exec(@sql)
    set @cntr = @cntr + 1
    end
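    If you repro this on your own box, a matching cleanup sketch (assuming the db1..db99 names created by the script above):

    ```sql
    -- Drop the 99 test databases created by the repro script
    set nocount on
    declare @cntr int = 1
    while @cntr < 100
    begin
    exec('DROP DATABASE [db'+cast(@cntr as varchar(5))+']')
    set @cntr = @cntr + 1
    end
    ```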

    I then attempted to backup the DBFiles folder on the D: drive using Windows Server Backup and it failed.


    After that I looked into the SQL Server Errorlog and found the following errors:

    2010-12-06 03:39:15.280 Backup       Error: 3041, Severity: 16, State: 1.
    2010-12-06 03:39:15.280 Backup       BACKUP failed to complete the command BACKUP DATABASE db68. Check the backup application log for detailed messages.

    2010-12-06 03:39:15.370 spid53       Error: 18210, Severity: 16, State: 1.
    2010-12-06 03:39:15.370 spid53       BackupVirtualDeviceFile::PrepareToFreeze: failure on backup device ‘{DB840D1B-DD17-4D73-83F1-7390F676DEE4}1’. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).

     

    I had already documented earlier that the infamous 995 error message could be a problem with the sqlvdi.dll but in this case, it is not!

    The next thing I did was to pick the errors from the Application Event logs and I found the following errors repeated multiple times:

    Error    12/6/2010 3:39:15 AM    SQLVDI    1    None    SQLVDI: Loc=SignalAbort. Desc=Client initiates abort. ErrorCode=(0). Process=3920. Thread=11604. Client. Instance=SQL2008. VD=Global\{DB840D1B-DD17-4D73-83F1-7390F676DEE4}1_SQLVDIMemoryName_0.

    Error    12/6/2010 3:39:15 AM    SQLWRITER    24583    None    Sqllib error: OLEDB Error encountered calling ICommandText::Execute. hr = 0x80040e14. SQLSTATE: 42000, Native Error: 3013

    Error state: 1, Severity: 16

    Source: Microsoft SQL Server Native Client 10.0

    Error message: BACKUP DATABASE is terminating abnormally.

    SQLSTATE: 42000, Native Error: 3224

    Error state: 1, Severity: 16

    Source: Microsoft SQL Server Native Client 10.0

    Error message: Cannot create worker thread.

    If you look closely at the above errors, you will find that one message stands out like a sore thumb: “Cannot create worker thread”. After this issue occurs, the SqlServerWriter shows the following state when you execute vssadmin list writers from a command prompt window:

    Writer name: ‘SqlServerWriter’
       Writer Id: {a65faa63-5ea8-4ebc-9dbd-a0c4db26912a}
       Writer Instance Id: {9075f235-fdee-4445-851b-a69c54bd8b33}
       State: [8] Failed
       Last error: Non-retryable error

    Ensure that you do not have any memory related errors in the SQL Server Errorlogs, or any DBCC MEMORYSTATUS output printed in the Errorlog around the time of the backup failure. If there aren’t any, then the above issue occurs due to the SQL instance hitting its worker thread limit.

    The reason for this is explained by Rob Dorr (Principal SQL Server Escalation Engineer at Microsoft) in his blog post:

    The controller and database worker come from the worker pool.  So if your max worker thread is configured for 255 the answer would be a max of 255 / 2 = 127 databases.    You probably won’t get to 127 with this configuration as you have system workers (lazy writer, checkpoint, log writer, and others) that already are using workers.

    On my SQL instance, I have 120 databases and need about 240 worker threads to complete this operation. However, I have only 128 worker threads configured for my SQL instance. If you have the Dedicated Administrator Connection enabled for the SQL instance, then you can take this troubleshooting further and execute a select against the DMV sys.dm_os_schedulers (for SQL Server 2005 or above) to get an output of the following type:

    status                 counts
    ---------------------- --------
    HIDDEN ONLINE          168
    VISIBLE ONLINE         8
    VISIBLE ONLINE (DAC)   1

    Query used:

    select status, count(*) as counts
    from sys.dm_os_schedulers
    group by status

    As you can see, there are about 168 hidden online schedulers created for performing the snapshot backups. But the SQL instance in question doesn’t have sufficient worker threads to facilitate this operation.
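    You can also compare the configured worker thread ceiling against the workers currently instantiated. A sketch (sys.dm_os_sys_info and sys.dm_os_workers are available on SQL Server 2005 and above):

    ```sql
    -- Configured ceiling for worker threads on this instance
    select max_workers_count from sys.dm_os_sys_info
    -- Workers currently instantiated
    select count(*) as current_workers from sys.dm_os_workers
    ```

    If current_workers is close to max_workers_count when the VSS backup starts, the snapshot has no headroom to spawn its controller and database workers.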

    What can you do in such situations?

    1. Increase the number of worker threads to account for the additional worker threads required to complete the VSS backups. This can be counter-productive, as it can lead to non-Buffer Pool memory contention on 32-bit instances of SQL Server. The higher the number of max worker threads on your SQL Server instance, the more non-BPool memory is consumed for the thread stacks. This results in depletion of the available Buffer Pool memory.

    2. Change the physical location of the SQL Server database files so that all the database files do not reside on the same volume.

    3. Perform a VSS Backup with SQL Server VSS Writer service turned OFF. However, you cannot recover any SQL Server database files from such a Volume Shadow Backup.

    4. Use SQL Native backups or other VDI backup tools (non-Snapshot) to perform SQL Server database backups.

    Note: This also happens when you select any file on a drive which has a high number of SQL Server database files. This is because the SQL Server VSS Writer tries to enumerate the database files for the SQL Server instance that reside on the volume that has been selected for the Snapshot backup.

    Location of SQL Binaries can flip out Bare Metal backups

    Bare Metal recovery allows system administrators to recover a system from failure within a short period of time and also manage system state backups in a more efficient manner. However, there is a slight catch when you have SQL Server binaries installed in a non-default location.

    I recently worked on an issue where Windows Backup Admin was listing a non-system drive as a critical component when trying to perform a bare-metal backup using the following command:

    wbadmin start backup -allcritical -backupTarget:<any existing drive name>:

    When I looked into the contents of the drive, I found that the drive only had SQL Server database files and binaries on it. This is the output that I got:

    wbadmin start backup -allcritical -backupTarget:S:
    wbadmin 1.0 – Backup command-line tool
    (C) Copyright 2004 Microsoft Corp.

    Retrieving volume information…
    This will back up volume OS(C:),New Disk(D:),New Disk(E:) to S:.
    Do you want to start the backup operation?
    [Y] Yes [N] No N

    The operation ended before completion.

    I then decided to use DiskShadow to show me all the critical components marked by the VSS Writers. I got the following output:

    WRITER “System Writer”:

    – Volumes affected by this component:

    – \\?\Volume{2407f815-f9c4-11df-aef7-806e6f6e6963}\ [C:\]
    – \\?\Volume{22c3ff31-f965-11df-a20b-00155df1c42a}\ [D:\]
    – \\?\Volume{0342683d-f96a-11df-8a1f-00155df1c42a}\ [E:\]

    This is because the SQL Server program binaries were installed on the D: and E: drives. And these are the paths that show up as critical as per the System Writer:

    – d:\program files\microsoft sql server\100\dts\binn
    – d:\program files\microsoft sql server\msas10.mssqlserver\olap\bin
    – d:\program files\microsoft sql server\msrs10.mssqlserver\reporting services\reportserver\bin
    – d:\program files\microsoft sql server\mssql10.mssqlserver\mssql\binn
    – e:\program files\microsoft sql server\msas10.inst01\olap\bin
    – e:\program files\microsoft sql server\msrs10.inst01\reporting services\reportserver\bin
    – e:\program files\microsoft sql server\mssql10.inst01\mssql\binn

    DiskShadow command used: list writers detailed

    From http://msdn.microsoft.com/en-us/library/aa384630(VS.85).aspx

    A volume is a critical volume if it contains system state information. The boot and system volumes are included automatically. The requester must include all volumes that contain system-critical components reported by writers, such as the volumes that contain the Active Directory. System-critical components are marked as “not selectable for backup.” In VSS, “not selectable” means “not optional.” Thus, the requester is required to back them up as part of system state.

    Based on the above description, it is clear that I cannot perform a Bare Metal backup without including D: and E: drives.

    Note: The above is just an example. There might be other system-critical components which are marked as not selectable for backup by other VSS Writers. VSS works on Volume Shadow concept. So, if a particular drive in a volume is marked as critical, then the entire volume would need to be backed up.

    As I mentioned earlier, for system critical components, the “Is selectable: FALSE” is set (VSS_CF_NOT_SYSTEM_STATE flag is set for non-system components). In the above example, the drive on which the SQL binaries exist is marked as critical by the System Writer.

    Wiki: SQL Database Engine Startup Failures for Clusters

    I had recently written a Wiki article on my blog outlining a troubleshooting methodology for SQL Server startup failures for standalone instances. In this article, I shall cover a similar roadmap for troubleshooting database engine startup failures for clustered SQL Server instances. Some of the steps are the same as troubleshooting startup failures for standalone instances, but there are some minute differences. So here goes…

    The first thing that you need to do is to try and bring the SQL Server resource online from the Cluster Administrator if you are using a Windows Server 2003 cluster or below. If you are using Windows Server 2008 or above, you would need to use the Failover Cluster Management snap-in. If the SQL Server resource fails to come online, then check the SQL Server Errorlogs for any errors that were reported.

    Locate the SQL Server Errorlog location and check if any errors were observed in the SQL Server Errorlog. This can be done by looking up the –e startup parameter value for the SQL Server instance in question. If you are using SQL Server 2005 and above, then right-click on the Service Name in Configuration Manager –> Properties –> Advanced Tab. Alternatively, you can use a VBScript (or Powershell) to fetch this information. Here is an example for obtaining the Startup Parameters using WMI and VBScript.

    Verify that the files pointed to by the –d (master data file location), –e (Errorlog location) and –l (master log file location) actually exist and have the correct file paths. Starting from SQL Server 2005, the Errorlog will have the startup parameters printed in it when the SQL instance starts up. If you find no errors in the SQL Errorlogs, look into the Windows Application Event logs and ensure that there are no errors related to the cluster resource for the SQL instance or the cluster group (applicable for Windows 2003 clusters) that the SQL resource is a part of.

    If the Windows Application Event logs don’t provide any useful errors, then look at any warnings or errors pertaining to the SQL Server resource in the Windows Cluster logs.

    Refer to the following post to generate the Cluster log on a Windows Server 2008 Failover Cluster or higher. For a Windows Server 2003 cluster, the default location of the cluster log is: <system drive>:\<windows folder>\Cluster\Cluster.log.
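    On Windows Server 2008 and above, the cluster log can also be generated on demand from an elevated command prompt. A sketch (by default the file is written to %windir%\Cluster\Reports on each node):

    ```
    REM Generate cluster.log from the in-memory cluster trace session
    cluster log /g
    ```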

    Now try to start the SQL Server instance from a command prompt using sqlservr.exe, with trace flag 3608 so that no database other than master is recovered, to check if the service comes online.

    For a default SQL Server instance:

    sqlservr.exe –c –m –T3608

    For a named instance of SQL Server:

    sqlservr.exe –c –m –T3608 –s<instance name>

    If the above fails, then the console window should provide you with an error message as to why the SQL instance fails to start. If the above succeeds, then try and start the service using the Configuration Manager.

    Note: Make sure that you are not falling prey to a Desktop Heap exhaustion issue.

    If you determine that the SQL Server registry entries have an erroneous entry, then you would need to follow the steps mentioned in the KB Article below to rectify the problem (due to Registry Synch operations that happen on a cluster):
    http://support.microsoft.com/kb/953504

    You might also find that the SQL Server resource entries in the Cluster Registry hive (HKLM\Cluster\Resources\GUID) are missing. You would then have to re-create the same using the steps mentioned in the KB Article below (applicable for SQL Server 2000 and SQL Server 2005 only):
    http://support.microsoft.com/kb/810056

    NOTE: Remember that modifying registry keys incorrectly on a Windows Server can have disastrous effects. Make sure that you take a registry backup before making any registry modifications.

    Apart from the common startup failure root causes, there are an additional few which are applicable to clustered SQL Server instances only (items 7, 8 and 9 below):

    The most common root causes for service startup failures are:

    1. Access denied for the SQL Server service account on the DATA folder which contains the SQL Server system databases, or some other process (like an anti-virus) holding a lock on the SQL database files.
    2. Insufficient permissions granted to the SQL Server Service account.
    3. File not found errors due to either accidental deletion of files or disk failures.
    4. System database files having inconsistencies prevent SQL Server startup.
    5. Password of the service account was changed but not updated on the server that has the SQL instance installed.
    6. Startup parameters have incorrect file path locations
    7. The dependent resources for the SQL Server clustered instance are failing to come online. (Eg. Shared disks on which the database files reside on)
    8. Incorrect (read: messed up) registry entries in the CLUSTER registry hive for the concerned SQL instance which have already been checkpointed, due to which the instance fails to start.
    9. Operating system/Resource crunch/Environment related issues on one or more nodes due to which the instance comes online on one node but not on the other.

    Additional references:
    T-SQL Tuesday: Common SQL Server myths – Series I–Setup and Services
    SQL SERVER 2000 CLuster FAILS TO COME ONLINE ON OTHER NODE
    SQL SERVER 2005 RESOURCE FAILS TO COME ONLINE ON CLUSTER AFTER CHANGING THE SAN DRIVE TO ADD MORE DISK SPACE
    Sql Server 2005 Resource Taking long time to come online on Windows Cluster with Resource Manager Creation Failed reported in Errorlog
    Sql Server Resource Fails to come online on Cluster with Error Error Locating Server/Instance Specified [xFFFFFFFF]
    SQL Server 2008 service fails to start, after applying Service Pack 1

    Versions that this roadmap applies to:
    SQL Server 2000, SQL Server 2005, SQL Server 2008, SQL Server 2008 R2