Troubleshooting that darn backup error

Backup and restore operations even though they are supposed to be a no-brainer, it can leave you wanting to tear your hair out. Considering the different elements that influence the outcome of a successful backup operation like backup media, consistency of the backup, network issues (if backing up to a UNC path), consistency of the database being backed up etc.

I shall attempt to outline a set of troubleshooting steps specifically for non-native backup/restore related errors for SQL Server databases.

Any BACKUP command that completes successfully or with errors has associated messages logged in the SQL Server Errorlog.

A successful backup from the SQL Server Errorlog:

Database backed up. Database: adventureworks, creation date(time): 2010/05/28(03:00:18), pages dumped: 17291, first LSN: 67:7733:56, last LSN: 67:7761:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {‘F:\adventureworks.bak’}).

You will see that the backup completion message has the database name, the number of pages that were contained in the backup, the date the backup was created and the number of devices and their type (in this case files). So you can reconstruct the sequence of backups even if your backup information history is missing from the MSDB system tables and you just have your SQL Server Errorlogs to play with. If you want to get real fancy about this, then you could spin up a nifty little Powershell or VBScript to parse through the SQL Errorlogs and provide the backup sequence to you as well.

Now, let’s look at a failed backup message from the SQL Errorlog.

Error: 18204, Severity: 16, State: 1.
BackupDiskFile::CreateMedia: Backup device ‘R:\adventureworks.bak’ failed to create.
Operating system error 3(The system cannot find the path specified.).
Error: 3041, Severity: 16, State: 1.
BACKUP failed to complete the command BACKUP DATABASE adventureworks.

You will notice above that the Operating System error code associated with the backup failure is reported in the error message. More often than not, the error message is self explanatory. If the error code doesn’t throw back an error text along with it, you can get the windows error code associated with the error code using net helpmsg <error number> from a command prompt window.

In the above case, there wasn’t any R: drive on my server. Now that I have finished stating the most obvious of troubleshooting methodologies for SQL native backups which was done to set the context for the next part of this post. The troubleshooting methodology for non-native backups.

Non-native SQL backups of databases use one of the following methods:

1. Use of APIs exposed through SQLVDI.dll to perform a VDI based backup
2. Use of VSS framework to perform a snapshot backup of the SQL Server database

Note: In this post, I shall not be addressing any storage level backup solutions that have options or features to handles SQL database backups.

This post is aimed at discussing the tackling of VSS/VDI related errors that you face while performing a database backup.

If a VDI backup fails, then you can try the following sequence of troubleshooting steps:

1. Attempt a backup to the same location using native SQL backup from a query window or SQLCMD. This might not always be possible as some of the backup tools do not allow any other backup application to connect to the backup share apart from the backup tool’s agents.
2. The next thing that you can do is to ensure that your SQLVDI.DLL is updated to the latest build. For this you can use the SQL Server Backup Simulator available on Code Gallery using the “Validate VDI Installation” option in the tool. This check in the tool will perform basic checks like current DLL version and additional checks based on root causes of common scenarios that CSS has seen in the past for VDI backup failures. Additional information on the usage of the tool is available here.
3. Perform a backup of the database using the tool for which the backup is failing (to the same destination if possible). The parameter tweaks at this point are not available in the current version like striped backups, changes to MAXTRANSFERSIZE/BUFFER COUNT etc. but if the backup from this tool is successful, then you know for a fact that the SQLVDI APIs are working as expected. Then it is either an additional piece of logic in the backup software that is failing or there is an external factors like network/backup media or there is a resource crunch on the SQL instance on which the database resides.


Common errors that you might see during a VDI backup operation failure:

Error message 1
BackupVirtualDeviceFile::ClearError: failure on backup device ‘VDI_ DeviceID ‘. Operating system error 995(The I/O operation has
been aborted because of either a thread exit or an application request.).

Error message 2
Error: 18210, Severity: 16, State: 1.
BackupMedium::ReportIoError: write failure on backup device ‘VDI_ DeviceID ‘. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.)

Error message 3
Error: 18210, Severity: 16, State: 1. 
BackupVirtualDeviceFile::RequestDurableMedia: Flush failure on backup device ‘VDI_ DeviceID. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.)

The above errors are mostly due to outdated versions of SQLVDI.DLL on SQL Server 2000 or SQL Server 2005 instances. The “Validate VDI Installation” option in the Backup Simulator should be able to identify any known issues with DLL version mismatch for you.


If you have an application that uses Volume Shadow Service to backup your SQL database and that backup fails, then you your troubleshooting steps would be a bit different. The tell-tale sign of a VSS Snapshot backup is in the I/O Frozen message being written into the SQL Errorlogs for the databases being backed up. In such a failure scenario, take a backup of the database using Windows Server Backup Admin (Windows Server 2008 and above) or NTBackup utility to perform a backup of the database files involved. If this backup is successful, it means that a snapshot backup is successfully completing for the database. Now depending on your backup application’s implementation logic, this might be a resource crunch or a backup storage media incompatibility or even a resource crunch.

Common errors associated with VSS backups:

Issue# 1
Error: 3041, Severity: 16, State: 1.
Backup      BACKUP failed to complete the command BACKUP DATABASE <database name>. Check the backup application log for detailed messages.
SubprocessMgr::EnqueueSubprocess: Limit on ‘Max worker threads’ reached.
This error message is raised when your backup application uses VSS to backup a large number of databases simultaneously. More information around this can be found in this post Volume Shadow barfs with 0x80040e14 code

Issue #2
While performing a Bare Metal backup you might run into issues if your SQL Server binaries are located on a different drive other than the C: drive. More information on the same is mentioned here: Location of SQL Binaries can flip out Bare Metal backups

Issue #3

Event Type:    Error
Event Source:    VSS
Event Category:    None
Event ID:    6005
Description:
Sqllib error: Database %1 is stored on multiple volumes, only some of which are being shadowed.

The above error occurs when you backup only one of the volumes that contains SQL database data/log files which are spread over multiple volumes. This is explained in detailed in this post: Why is VSS complaining about SQL Data files


More information on how VSS/VDI work is present in the blog post by Sudarshan: INFORMATIONAL- SHEDDING LIGHT on VSS & VDI Backups in SQL Server

Previous posts related to Volume Shadow Backups
IO Frozen messages while taking NT Backup for SQL databases

Previous posts related to VDI backups
SQL VDI backup fails with 0×80770007

Two starts, one recompile and a confusing place

I used to be regular viewer of the TV Series “Two Guys, A Girl and A Pizza Place” co-starring Ryan Reynolds. The blog post title is a play on that name. In the past week, I have heard about this phenomenon of “one ghost Stmt: Starting” event more than thrice which is what triggered this blog post.

When you are analyzing such a trace file using RML Utilities, you will find an entry in the tblStatements table showing a Statement for a particular Batch Sequence (Column name: BatchSeq) which has a Start Time but no End Time. Additionally the AttnSeq column for that row would have a NULL value indicating that the statement didn’t timeout nor was it cancelled due to a Server/Client side error.

So what is this ghost entry all about. If you manually dig through the profiler trace, you will find that that the Statement in question has a Recompile event associated with it. This is by-design. When a Stored Procedure statement encounters a Recompile event, it would have a SP: StmtStarting event associated with it but no corresponding SP: StmtCompleted event. This can lead to confusion during trace file analysis.

Here is an example illustrating the same. I have a Stored Procedure whose definition is as follows:

CREATE PROC usp_recompile
AS
SELECT * FROM tbl1 OPTION (RECOMPILE)

The Recompile hint is added to forcefully induce a Recompile for the Stored Procedure when it is executed. If you look into the profiler events captured below during the Stored Procedure execution, you will see that there is a SP:StmtStarting event followed by a SP:Recompile and a SP:StmtRecompile event. Then there is another SP:StmtStarting event indicating that the recompile operation is complete and the start of the execution of the SP statement.

image

Hope this helps clear up the confusion around this topic.

SQL Server Backup Simulator: Cumulative update

The SQL Server Backup Simulator was launched on Microsoft Code Gallery to aid and assist in troubleshooting backup issues related to SQLVDI. Thank you to everyone who has evangelized the tool and provided feedback on the same.

The current release is marked as v1.2.

We released two minor updates for the tool to incorporate the following changes:

  1. SQL Backup Simulator does not report FileVersion instead it reports ProductVersion while reading the sqlvdi.dll information – There was an issue that we uncovered while validating the sqlvdi DLL version which was corrected in v1.1 and is now part of the v1.2 release.

  2. COPY_ONLY feature while simulating backup – The COPY_ONLY feature was added to ensure that the any full backups done for a database by the tool doesn’t break an existing backup chain for a database.

The latest executable can be downloaded from: https://github.com/Microsoft/tigertoolbox/releases/latest

What can you tell me Default Trace

The default trace functionality introduced in SQL Server 2005 is still a much under appreciated feature of the product. This is somewhat like a black box trace which is analogous to a airline black box can help you in certain post-mortem analysis as a DBA. I am sure a lot of you have a host of in-house scripts that you have in-place or some monitoring/auditing tool that reports some inadvertent/unscrupulous activity on the databases that you administer. Nonetheless, it is always a good idea to be aware of the out-of-the-box features that are available that can make your job a wee bit easier.

Deepak Rangarajan (Blog), a Microsoft MVP, had blogged about the set of events which are captured by this default trace. So, I shall not attempt to duplicate his efforts here but collate a set of issues that CSS has been able to resolve using Default Traces in the past and will continue to do so in the future.

Dude: Where’s my database?
Not really a question that you want to be pondering on if you do not have any monitoring/auditing tools in place. And nor do you want to BING that search term. (Ok, if you prefer even Google it!) The default trace captures the following events: Object:Created, Object:Altered and Object:Deleted which will help you track down inadvertent deletion of databases. SQL Server Management Studio has a standard report called “Schema Changes History” report which can help you track such changes on a SQL instance in the form of a report.

Memory pressure: Where, When?
Ideally this can be tracked by Perfmon data collection and most of the SQL monitoring tools available in the market today track the Target and Total Server Memory counters for SQL instances. When you monitor the delta changes for these counters, you can track the amount of fluctuations that are occurring. The default trace tracks these changes using the Server Memory Change profiler event which will help you decide if there is a large fluctuation in Max Server Memory configuration for the SQL instance. This event gets raised when SQL Server memory usage has increased or decreased by either 1 megabyte (MB) or 5 percent of the maximum server memory (Buffer Pool only), whichever is greater. This is quite useful when tracking memory pressure being experienced for a SQL instance due to factors external to the SQL instance.

Mirrored Database: When did your state change?
All database mirroring state changes for all your mirrored databases are also monitored using the Default Trace using the event: Database mirroring status change. This helps track all database mirroring state changes without having to parse the SQL Errorlogs for the same.

OMG: Someone’s mucking around with my database security?
We have handled multiple issues in the past where we needed to identify various security related audit activity on a SQL instance like new Logins/Users being added, mostly for post-mortem analysis. The default trace “Security Audit” event classes show their usefulness by letting you trace such kind of activity using the default trace.

Full-Text: What are you upto?
The default traces also help trace Full-Text crawl activity which at times have been known to cause performance degradation when very large FT catalogs are involved.

Database: Did you just grow or shrink?
SQL Server databases enabled for Autogrow help prevent the database from going down if the file becomes full but at the same time Autogrow events can cause performance issues like blocking. Additionally, database auto shrink (both log and data files) operations can severely impact database performance. With the help of the default traces, you can can all Auto Grow/Shrink operations for your databases on the instance.

SQL Performance: Now how can I use the default trace for this?
Well the answer is not a definitive one but the default trace can give you an indication if you have a baseline established. The default trace traces the following events: Hash Warning, Missing Column Statistics, Missing Join Predicate and Sort Warning. These events can be used to find out if there is an excessive amount of sorting or hashing that is occurring on your SQL instance (without a baseline, this data is moot point though). You can also identify if a number of your queries are suffering from missing statistics and missing join predicates.

One thing to note here is that the default traces rollover which means that if the default traces are not copied over to an alternate location, they will be overwritten and valuable information required for post-mortem information will be lost.

Profiler events for different scenarios

A few moons back I had blogged about the high frequency events in SQL Server. In this post, I shall talk about the common scenarios where profiler traces are collected for SQL Performance issues and what kind of events need to be captured in the server side traces that are configured for the data collection.

While configuring a profiler trace, always include all data columns for that event as that doesn’t have an overhead as much as the number of events being captured during the trace capture.

In this post, I shall cover two of the most common issues that CSS collects profiler data for addressing SQL Performance issues: Attentions/Timeouts and High CPU usage/Long Running queries.

Attentions or Timeouts

Attentions or Timeouts can occur due to client side application timeout setting (mostly this is the case) or due to database engine side issue (like failure to acquire memory, problem on the server side connectivity layer etc.). During this time, you will have a lot of queries starting but not completing successfully. If you are dealing with such a scenario, then you would need to capture the Starting events using a profiler trace to find out which statements didn’t complete and encountered the Attention event.

The events that need to be captured are:

All Database Events

Log File Auto Shrink
Data File Auto Shrink
Log File Auto Grow
Data File Auto Grow

All Errors and Warnings

Attention (Important for tracking the timeouts)
ErrorLog
EventLog
Exception
Hash Warning
Execution Warnings
Sort Warnings
Missing Column Statistics
Missing Join Predicate
Exchange Spill Event
Blocked process report
User Error Message
Background Job Error
Bitmap Warning
Database Suspect Data Page
CPU threshold exceeded

Lock:Deadlock Chain and Lock:Timeout (timeout > 0)

Performance: Auto Stats

Showplan All or Showplan XML (As execution plan will not be available for queries which didn’t complete)

Showplan Statistics Profile or Showplan XML Statistics Profile (For comparing estimated and actual query plan)

Security Audit: Audit Login, Audit Login Failed and Audit Logout

Stored Procedures events:

RPC:Starting
SP:Recompile
SP:Starting
SP:Completed
SP:StmtStarting
SP:StmtCompleted
RPC:Completed

Transactions: SQLTransaction is required for tracing back the transaction if there are nested transactions). DTCTransaction is required only if DTC transactions are getting timed out.

TSQL events:

SQL:BatchCompleted
SQL:BatchStarting
SQL:StmtStarting
SQL:StmtCompleted
SQL:StmtRecompile

Cursor events: CursorRecompile and CursorImplicitConversion (Required for checking if the requested cursor is being implicitly converted or if the cursor is being frequently recompiled).

The above events should be sufficient to track down any query timeouts and the common root causes for the same.

High CPU usage/Long Running queries

Another set of issues that are commonly dealt with while troubleshooting SQL Performance issues are Long Running Queries or Queries causing high CPU usage on the server. In such a case, the above set of events would be definitely be useful. If you are certain that the profiler trace can be captured for the entire duration of the query execution, then for highly transactional servers, you can omit the capture of the following events:

SQL: BatchStarting
SQL: StmtStarting
SP: StmtStarting
SP: Starting
RPC: Starting

The above logic of configuring a profiler trace will also help you capture trace data when you are trying to track down queries which are causing excessive amount of IO on the SQL instance.

Reference for capturing blocking information on a SQL instance: SQL 2000 Blocker Script output/SQL 2005 PerfStats

Keep in mind that the profiler data captured may not be sufficient for root cause analysis if you are not capturing SQL blocking information and performance monitor data in conjunction with the profiler traces.

In case you need additional events like deadlock events or lock events, then refer to my previous post and find out if they fall in the high frequency event category and when to capture the particular event.

TIP: As much as possible, avoid adding Text filters to your profiler trace configuration as it can lead to high CPU usage for highly active OLTP environments during filter evaluation.

Hope that the above information will help you reduce the excessive amount of data collection for troubleshooting SQL performance issues.