T-SQL Tuesday #19: Disasters and Recovery

This month’s revolving blog party a.k.a. T-SQL Tuesday is being hosted by Allen Kin (blog | twitter). The topic for this month is Disasters and Recovery. The creation of fail-safe mechanisms is probably one of the most important facet of any IT administrator role in today’s world where online transactions have become synonymous to daily lives. When creating a DR strategy, you need to keep three things in mind:

1. RPO Recovery Point Objective
2. RTORecovery Time Objective
3. SLA – Service Level Agreements

Mike Walsh has already documented about the above three points in an earlier post on SQL University DBA Week. So what am I going to talk about in this post….. Well, since a major part of my daily job involves working on critical situations which sometimes involve disaster recovery, I will talk about some key but simple points that you should be aware of while restoring your database[s] in case your server encounters a disaster.

1. Always have backups – This point no matter how many times reiterated is still less! You should always have backups of your databases. You should store your backups on a separate media which is not the same as the disk drives which store the database files. This ensures that you don’t have a single point of failure. I have seen a lot of scenarios where the backups and the database files are stored on the same drive. Once the drive goes BOOM!!… You are left with zilch!! A bad scenario to be in!

2. Test your backups – Just taking regular backups doesn’t ensure that you will be safe when a disaster strikes. You need to restore your backups and ensure that the backups can be restored successfully. If you have an automated DR strategy in place, then it is always good to perform dry-runs to ensure that your team is well versed with the recovery process when the need arises. You don’t want to be grappling with your restore scripts during a crisis situation. The next nugget of information is to ensure that a DBCC CHECKDB on the restored database completes without any errors. Just because the restore was successful, doesn’t mean that the database is consistent!

3. Know your environment – An application doesn’t just depend on your database[s]. There might be customized connection settings, connection aliases, specific logins, database users, linked servers etc. which need to be kept handy in case you need to bring a new environment online which was a clone of your previous disaster ridden system. I have seen multiple times where the databases have been restored successfully but the logins and linked specific to the application are missing. So now you have an environment which has the application databases but other specifics pertaining to the application’s functioning are missing.

4. System databases need to be backed up also – System databases do need to be backed up as well. Eg. Without the master database backup in a disaster scenario, you will be missing the necessary logins that your application needs to login to the user database.

5. Benchmarking is very important – As I mentioned earlier, a dry-run is very important. This is primarily due to the fact that if you do not know how much time a restore is going to take, you cannot define your RTO and adhere to your agreed SLAs. A classic situation is that the application needs to be up within 4 hours but since no once tested the entire restore cycle, no one knows how long it will take to restore the set of full/differential/log backups that are available.

6. Have multiple points of failure – This is mostly considered as a good to have but in critical environments, I consider this as a must-have! A simple implementation of this would be redundancy. Keep two copies of your database backups. If one set of database backups are inconsistent, you have a redundant set of backups to fall back on. A decision taken to disk space by reducing the number of redundant copies can look very daft when you are not able to bring a production system online due to the unavailability of consistent backups.

7. Never rely on REPAIR ALLOW DATA LOSS as your savior – The REPAIR ALLOW DATA LOSS option provided with CHECKDB should always and always be your last resort! This means that when all else fails, then you resort to repair options. This repair option should never be your first option for recovering from a disaster because as the name states it always results in data loss!!

8. Know how long a CHECKDB takes to complete on the database – If you do not run CHECKDB regularly on the database for which you are creating a DR strategy, then you are inviting trouble. Always run periodic CHECKDB on your databases and note the time taken so that you have a fair estimate on how long a CHECKDB should take to complete successfully on the given database.

9. Redundant database copies – A lot of environments use Database Mirroring, Log Shipping and Replication to maintain duplicate copies of the database. If you are using any of these features to maintain copies of the existing databases, then you need to note two things: first being the latency between the primary and secondary copies. This will define your RPO as the average latency will be the amount of data loss that you should be prepared to deal with and this will also define RPO to some measure as the time taken to recover the missing data would be defined by latency. Another point to keep in mind is that if you decide to use one of the alternate database copy of the database as the new production database, then you need to ensure that you avoid certain gotchas. Example: Orphaned users for SQL Authenticated logins when you use log shipping or database mirroring.

10. Keep in mind the additional SQL Server features being used – If you are using replication or mirroring or log shipping on the primary database being recovered, then you need to account for additional steps before restoring the databases as a simple restore of a database backup for such a database will not do. Eg. Special considerations need to be followed for restoring replicated databases.

For the non-technical aspects, a disaster recovery plan should include the following:

Disaster recovery plan types include the following (from Books Online):

  1. A list of people to be contacted if a disaster occurs
  2. Information about who owns the administration of the plan
  3. A checklist of required tasks for each recovery scenario. To help you review how disaster recovery progressed, initial each task as it is completed, and indicate the time when it finished on the checklist.

The above points might seem like basics but it would be surprising that they don’t get religiously followed on some production environments!

Troubleshooting that darn backup error

Backup and restore operations even though they are supposed to be a no-brainer, it can leave you wanting to tear your hair out. Considering the different elements that influence the outcome of a successful backup operation like backup media, consistency of the backup, network issues (if backing up to a UNC path), consistency of the database being backed up etc.

I shall attempt to outline a set of troubleshooting steps specifically for non-native backup/restore related errors for SQL Server databases.

Any BACKUP command that completes successfully or with errors has associated messages logged in the SQL Server Errorlog.

A successful backup from the SQL Server Errorlog:

Database backed up. Database: adventureworks, creation date(time): 2010/05/28(03:00:18), pages dumped: 17291, first LSN: 67:7733:56, last LSN: 67:7761:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {‘F:\adventureworks.bak’}).

You will see that the backup completion message has the database name, the number of pages that were contained in the backup, the date the backup was created and the number of devices and their type (in this case files). So you can reconstruct the sequence of backups even if your backup information history is missing from the MSDB system tables and you just have your SQL Server Errorlogs to play with. If you want to get real fancy about this, then you could spin up a nifty little Powershell or VBScript to parse through the SQL Errorlogs and provide the backup sequence to you as well.

Now, let’s look at a failed backup message from the SQL Errorlog.

Error: 18204, Severity: 16, State: 1.
BackupDiskFile::CreateMedia: Backup device ‘R:\adventureworks.bak’ failed to create.
Operating system error 3(The system cannot find the path specified.).
Error: 3041, Severity: 16, State: 1.
BACKUP failed to complete the command BACKUP DATABASE adventureworks.

You will notice above that the Operating System error code associated with the backup failure is reported in the error message. More often than not, the error message is self explanatory. If the error code doesn’t throw back an error text along with it, you can get the windows error code associated with the error code using net helpmsg <error number> from a command prompt window.

In the above case, there wasn’t any R: drive on my server. Now that I have finished stating the most obvious of troubleshooting methodologies for SQL native backups which was done to set the context for the next part of this post. The troubleshooting methodology for non-native backups.

Non-native SQL backups of databases use one of the following methods:

1. Use of APIs exposed through SQLVDI.dll to perform a VDI based backup
2. Use of VSS framework to perform a snapshot backup of the SQL Server database

Note: In this post, I shall not be addressing any storage level backup solutions that have options or features to handles SQL database backups.

This post is aimed at discussing the tackling of VSS/VDI related errors that you face while performing a database backup.

If a VDI backup fails, then you can try the following sequence of troubleshooting steps:

1. Attempt a backup to the same location using native SQL backup from a query window or SQLCMD. This might not always be possible as some of the backup tools do not allow any other backup application to connect to the backup share apart from the backup tool’s agents.
2. The next thing that you can do is to ensure that your SQLVDI.DLL is updated to the latest build. For this you can use the SQL Server Backup Simulator available on Code Gallery using the “Validate VDI Installation” option in the tool. This check in the tool will perform basic checks like current DLL version and additional checks based on root causes of common scenarios that CSS has seen in the past for VDI backup failures. Additional information on the usage of the tool is available here.
3. Perform a backup of the database using the tool for which the backup is failing (to the same destination if possible). The parameter tweaks at this point are not available in the current version like striped backups, changes to MAXTRANSFERSIZE/BUFFER COUNT etc. but if the backup from this tool is successful, then you know for a fact that the SQLVDI APIs are working as expected. Then it is either an additional piece of logic in the backup software that is failing or there is an external factors like network/backup media or there is a resource crunch on the SQL instance on which the database resides.


Common errors that you might see during a VDI backup operation failure:

Error message 1
BackupVirtualDeviceFile::ClearError: failure on backup device ‘VDI_ DeviceID ‘. Operating system error 995(The I/O operation has
been aborted because of either a thread exit or an application request.).

Error message 2
Error: 18210, Severity: 16, State: 1.
BackupMedium::ReportIoError: write failure on backup device ‘VDI_ DeviceID ‘. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.)

Error message 3
Error: 18210, Severity: 16, State: 1. 
BackupVirtualDeviceFile::RequestDurableMedia: Flush failure on backup device ‘VDI_ DeviceID. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.)

The above errors are mostly due to outdated versions of SQLVDI.DLL on SQL Server 2000 or SQL Server 2005 instances. The “Validate VDI Installation” option in the Backup Simulator should be able to identify any known issues with DLL version mismatch for you.


If you have an application that uses Volume Shadow Service to backup your SQL database and that backup fails, then you your troubleshooting steps would be a bit different. The tell-tale sign of a VSS Snapshot backup is in the I/O Frozen message being written into the SQL Errorlogs for the databases being backed up. In such a failure scenario, take a backup of the database using Windows Server Backup Admin (Windows Server 2008 and above) or NTBackup utility to perform a backup of the database files involved. If this backup is successful, it means that a snapshot backup is successfully completing for the database. Now depending on your backup application’s implementation logic, this might be a resource crunch or a backup storage media incompatibility or even a resource crunch.

Common errors associated with VSS backups:

Issue# 1
Error: 3041, Severity: 16, State: 1.
Backup      BACKUP failed to complete the command BACKUP DATABASE <database name>. Check the backup application log for detailed messages.
SubprocessMgr::EnqueueSubprocess: Limit on ‘Max worker threads’ reached.
This error message is raised when your backup application uses VSS to backup a large number of databases simultaneously. More information around this can be found in this post Volume Shadow barfs with 0x80040e14 code

Issue #2
While performing a Bare Metal backup you might run into issues if your SQL Server binaries are located on a different drive other than the C: drive. More information on the same is mentioned here: Location of SQL Binaries can flip out Bare Metal backups

Issue #3

Event Type:    Error
Event Source:    VSS
Event Category:    None
Event ID:    6005
Description:
Sqllib error: Database %1 is stored on multiple volumes, only some of which are being shadowed.

The above error occurs when you backup only one of the volumes that contains SQL database data/log files which are spread over multiple volumes. This is explained in detailed in this post: Why is VSS complaining about SQL Data files


More information on how VSS/VDI work is present in the blog post by Sudarshan: INFORMATIONAL- SHEDDING LIGHT on VSS & VDI Backups in SQL Server

Previous posts related to Volume Shadow Backups
IO Frozen messages while taking NT Backup for SQL databases

Previous posts related to VDI backups
SQL VDI backup fails with 0×80770007

How to convert a LiteSpeed backup into a native SQL backup

I know all you folks out there are probably more proficient at playing around with LiteSpeed backups than I am. Recently, I had to play around with LiteSpeed backups on a regular basis while troubleshooting a particular case for a customer. I knew that there was a utility called Extractor for LiteSpeed which can convert the LiteSpeed backups into native SQL backups. 

Parameters: 

-F|–BackupFile {path}        Path of the backup or restore file devices 

-N|–BackupIndex  {n}       File number (If multiple backups appended within BackupFile(s))

-E|–MTFFile {path}            Path of the Microsoft Tape Format file(s) to be written.  Specify either one or the correct number of files.  If only one file is named, set will be created with index suffixes. If no files listed, Extractor will print number of MTF files to be written.

-K|–Key {key}                 String used to decrypt database backup file(s) Only required if backup is encrypted.
-I|–Overwrite                   Overwrite MTFFile(s). Equivalent of INIT in native SQLServer backups

-L|–LogLevel {n}              Logging option

Example: 

Extractor.exe -FC:\LiteSpeedTran.BKP -EC:\NativeBackup.BKP -N1  

NOTE: To use Extractor utility, you do not need LiteSpeed to be installed on the box. (But the licensing policies and EULA still apply if you are using Extractor) 

How to get backup header information for LiteSpeed backups 

If you have LiteSpeed installed on your box, then you can executed the following XSP command to get a RESTORE HEADERONLY output for a LiteSpeed backup file. This will give a similar output that the T-SQL command RESTORE HEADERONLY provides: 


exec master.dbo.xp_restore_headeronly @filename =  'C:\LiteSpeedTran.BKP'

Hope this information is helpful for individuals whose working knowledge is at not an EXPERT level with this product. I am a LiteSpeed noob!

Why you shouldn’t NUKE your transaction log file?

I have seen multiple scenarios where DBAs will delete the T-LOG file of a SQL Server database because it is eating up disk space.

THIS IS WRONG AND SHOULDN’T BE DONE UNDER ANY CIRCUMSTANCE!

The transaction log of a SQL Server maintains information about all logged activity that occurs in a database. During a database startup, a recovery phase is initiated which performs the Analysis, Redo and Undo phases to bring the database into a consistent state. The complete process is dependant on the availability of the database transaction log. Without this, you are being exposed to the loss of transaction consistency. Sounds like a fancy term doesn’t it? Well all it means is that you can now have data in your database tables which shouldn’t be there.

Let me elaborate on this with an example. I created a small test database which has a simple table containing two columns (record, amount). I inserted 1000 rows in the table. The T-SQL commands used are:

create database dbLogTest
go
use dbLogTest
go
create table tblTest (record int, amount bigint)
go
set nocount on
declare @cntr int
set @cntr = 1
while (@cntr <= 1000)
begin
insert into tblTest values (@cntr, @cntr * 100)
set @cntr = @cntr + 1
end

A SELECT of the first 9 records in the table looks like this:

image

Now I start a transaction in the database to update the first 10 records with an increment of 100. I do NOT commit the transaction and something happens to my SQL Server transaction log file. (Maybe someone deleted the file when SQL was offline!!) After I restart the SQL Server service and try to use the database, I get the following error in the SQL Server ERRORLOG:

2010-05-04 18:20:25.740 spid19s      Starting up database ‘dbLogTest’.
2010-05-04 18:20:26.300 spid19s      Error: 17207, Severity: 16, State: 1.
2010-05-04 18:20:26.300 spid19s      FileMgr::StartLogFiles: Operating system error 2(The system cannot find the file specified.) occurred while creating or opening file ‘D:\Database Files\dbLogTest_log.LDF’. Diagnose and correct the operating system error, and retry the operation.
2010-05-04 18:20:26.700 spid19s      File activation failure. The physical file name "D:\Database Files\dbLogTest_log.LDF" may be incorrect.
2010-05-04 18:20:26.700 spid19s      The log cannot be rebuilt because the database was not cleanly shut down.

Since the transaction log file is not available, I cannot start the database. Now I will put the database in EMERGENCY mode and run a CHECKDB on the database. The CHECKDB output states that the database has no consistency errors. So the data file is intact and there are no errors.

CHECKDB found 0 allocation errors and 0 consistency errors in database ‘dbLogTest’.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

You would now expect that the data in the data file is consistent and can be exported out and merged with your database backup (if one exists). Before we pass the verdict of “Everything is OK”, let’s have a look at the first 9 records in the tblTest table which we modified before killing the SQL Server service.

image

And now you have all values incremented by 100 even though you didn’t commit the transaction. This is because a CHECKPOINT on the database would have flushed all the dirty buffers to the database file. SQL Server follows WAL (Write Ahead Logging) for the Transaction Log. The database transaction log would have had the information that the database has an open transaction which was not committed and had modified data. If the transaction log file was present, the database recovery process would have rolled back the above changes and you would have data which is supposed to be present in the database. The transaction log is the heart of the transactional consistency of a SQL Server database. If your database was not cleanly shutdown, you will not be able to rebuild the transaction log with the ATTACH_REBUILD_LOG clause for CREATE DATABASE. The only option would be Export the data into another database with the same schema.

Whenever CSS is left with no other option other than to rebuild the transaction log to get the database online, we do explain in detail the ramifications of how rebuilding the transaction log can affect your transactional consistency. We never recommend rebuilding the transaction log. So, the best option is always to restore a last known database backup and reconcile the remaining portion of the data (since the last backup till date) manually. It is a manual operation but something that cannot be compromised on when you are handling databases in a financial institution or critical systems where a mishap like the one described above could prove disastrous.

A CREATE DATABASE with ATTACH_REBUILD_LOG for the database fails with the following error:

File activation failure. The physical file name "D:\Database Files\dbLogTest_log.LDF" may be incorrect.
The log cannot be rebuilt because the database was not cleanly shut down.
Msg 1813, Level 16, State 2, Line 1
Could not open new database ‘dbLogTest’. CREATE DATABASE is aborted.

This post was an attempt in explaining how a transaction log file loss can affect you adversely. So, the takeaways here are:

1. Ensure that you have a tried and tested disaster recovery plan in place to recover from hardware/software failures that may occur in your environment

2. Never ever DELETE your transaction log file

Rebuilding the transaction log of a SQL Server database is one of the TOP 2 worst things that you can do as a DBA. The second one is running CHECKDB with REPAIR_ALLOW_DATA_LOSS. I have already documented about Repair Allow Data Loss in a previous post of mine.

Read more about these in Paul Randal’s posts on MSDN:

Ta da! Emergency mode repair

When should you rebuild the transaction log?

A snippet from the above post:

Product Support will recommend rebuilding the transaction log only if something has corrupted it and you have no backups (plus gently reminding you to get a backup strategy). Before they do so they’ll make sure you’re aware of the possibly consequences of doing it and advise you to run DBCC CHECKDB to determine whether the database is in a structurally and transitionally consistent state. If not, you’ll most likely need to run REPAIR_ALLOW_DATA_LOSS and then begin the process of working out what data was lost (and start dealing with irate customer?)

There is a no excuse for not having a backup strategy and there is definitely no excuse for running the above NOT recommended steps when you have other options.

I have found various MVPs stressing the same point. Here is one such example from Gail [Blog].

How to find out how many objects of different types are there in a SQL database

I have sometimes found the need on data corruption cases to compare the number of objects exported to the destination database with the source database to find out which objects got exported and which didn’t. You would need to run this script against the source and destination database and compare the output. 

One of the drawbacks of the “Generate Scripts” option in SQL Server Management Studio is that SSMS scripts out the Primary Key constraints along with the CREATE TABLE script. So, if the table is already created, the primary key will not be created if you use the script that was generated by the SSMS Generate Scripts Wizard

Script 

select 

CASE xtype 

WHEN 'C' THEN 'CHECK constraint' 

WHEN 'D' THEN 'Default or DEFAULT constraint ' 

WHEN 'F' THEN 'FOREIGN KEY constraint ' 

WHEN 'L' THEN 'Log ' 

WHEN 'FN' THEN 'Scalar function ' 

WHEN 'IF' THEN 'In-lined table-function ' 

WHEN 'P' THEN 'Stored procedure ' 

WHEN 'PK' THEN 'PRIMARY KEY constraint (type is K) ' 

WHEN 'RF' THEN 'Replication filter stored procedure' 

WHEN 'S' THEN 'System table ' 

WHEN 'TF' THEN 'Table function ' 

WHEN 'TR' THEN 'Trigger ' 

WHEN 'U' THEN 'User table ' 

WHEN 'UQ' THEN 'UNIQUE constraint (type is K) ' 

WHEN 'V' THEN 'View ' 

WHEN 'X' THEN 'Extended stored procedure' 

ELSE 'UNKNOWN' 

END, count(*) as counts 

from sys.sysobjects 

group by CASE xtype 

WHEN 'C' THEN 'CHECK constraint' 

WHEN 'D' THEN 'Default or DEFAULT constraint ' 

WHEN 'F' THEN 'FOREIGN KEY constraint ' 

WHEN 'L' THEN 'Log ' 

WHEN 'FN' THEN 'Scalar function ' 

WHEN 'IF' THEN 'In-lined table-function ' 

WHEN 'P' THEN 'Stored procedure ' 

WHEN 'PK' THEN 'PRIMARY KEY constraint (type is K) ' 

WHEN 'RF' THEN 'Replication filter stored procedure' 

WHEN 'S' THEN 'System table ' 

WHEN 'TF' THEN 'Table function ' 

WHEN 'TR' THEN 'Trigger ' 

WHEN 'U' THEN 'User table ' 

WHEN 'UQ' THEN 'UNIQUE constraint (type is K) ' 

WHEN 'V' THEN 'View ' 

WHEN 'X' THEN 'Extended stored procedure' 

ELSE 'UNKNOWN' 

END

CHECKDB with Repair Allow Data Loss: Should I shoot myself?

Over the past years, I have seen multiple DBAs execute DBCC CHECKDB with REPAIR ALLOW DATA LOSS to repair the SQL Server database corruption in their database without understanding repercussions. Even when you are working with CSS on such issues, CSS engineers actually explain what the command does and also send across a disclaimer stating why this command should NOT be executed on a database and what the repercussions are. Alas due to a lack of a sound database backup strategy, a lot of DBAs are forced to resort to losing data. In a nutshell, Repair Allow Data Loss tries to first repair the corruption in the page but if it cannot, then it checks if it can drop the page and repair the error and hence the name Allow Data Loss! I had a question for all the DBAs who advocate the use of this command:

How would you like your Payroll DBA to run the command? (Good enough food for thought!!)

The only way to compare how much data you lost would be to perform a CHECKDB before running the CHECKDB with REPAIR ALLOW DATA LOSS and after the CHECKDB has completed. The row counts returned by the two CHECKDB outputs would have to be compared. If CHECKDB is failing due to system catalog corruption, then you are plain out-of-luck.

What most people fail to understand that CHECKDB can verify integrity of the database based on the SQL database rules imposed on it like referential integrity, data length, record structures, page structures, header correctness etc. But these are database integrity rules that we are aware of. What about the business logic rules that you imposed while designing the database.

Eg: Let’s say you had a database table called Employee which depended on a Department ID valued to be present in the Department table. No foreign key relationship exists between the two tables. You experienced corruption in the database and used Repair Allow Data Loss command along with CHECKDB which dropped a few pages from the Department table. However, CHECKDB cannot know that a Department ID is missing in the Department table because you didn’t create a Foreign Key between the two tables. So when your application now tries to retrieve the Department ID of the Employee, it will BARF on you! This is the kind of scenario we specifically want to avoid which is why we advise against the use of data loss commands.

SQL Server wouldn’t know what data is supposed to present in the table rows. The only thing that SQL can verify while you are retrieving data or checking integrity of the database is that the column data and column datatype match. I have heard questions like “Why can’t SQL Server figure out if the value is correct?”. Well, would you know why a person’s name is Uncle Sam and not Uncle Bill? 🙂 So, how would SQL Server know that the FirstName column (an example) was supposed to store Sam and not Bam. As per SQL Server, both strings match the column datatype, hence from a structural and logical integrity, the data is correct.