SQL on Azure VM BPCheck

After a long hiatus, I am back in the blogosphere and have decided to dedicate a series to running best practices checks using PowerShell on Azure virtual machines deployed with the new deployment model: ARM (Azure Resource Manager). If you have worked with the classic deployment model, then you will need to unlearn a few things and re-learn a little more. However, the learning curve ain’t that steep. Think of an Azure resource group as a container for all the cloud components that make up your solution. If you had a SQL Server instance running on an Azure virtual machine, you would be using compute, networking and storage components in Azure, which would together be encapsulated in a resource group.

I had previously done a series of posts around running best practices checks on Azure virtual machines running SQL Server. Some of those checks are still valid today as they only dealt with the SQL Server instance, and you will see components of those scripts reused. Some of the checks are now much easier due to the way ARM deployments are managed in Azure and the endpoints that Azure PowerShell 1.0 exposes.

The PowerShell scripts available in the GitHub repository are listed below. I will write a post on each of these scripts to explain what it does and what to expect as its output.

  • Get-AllocationUnitCheck.ps1 – Checks if the allocation unit size for the disks attached to the VM is 64K
  • Get-DBProperties.ps1 – Checks if any database has AUTO CLOSE or AUTO SHRINK enabled
  • Get-FilesOnTemp.ps1 – Checks to see if any database files are hosted on the temporary drive
  • Get-IFI.ps1 – Checks to see if the SQL Server service account has instant file initialization security privileges
  • Get-LPIM.ps1 – Checks to see if Lock Pages in Memory privilege is granted to the SQL Server service account
  • Get-OSFilesDB.ps1 – Checks to see if database files are hosted on the OS drive
  • Get-StorageAccountBP.ps1 – Checks to see if the storage account has replication enabled
  • Get-VMSize.ps1 – Checks if the right virtual machine tier is being used
  • Temporary Drive.ps1 – Finds out the temporary drive on the virtual machine
  • Get-Backups.ps1 – Finds out if any backups are being taken to local disk

The PowerShell scripts are available in the GitHub repository SqlOnAzureVM. Since these scripts are now on GitHub, please feel free to pull them and enhance them as per your needs.
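To give a flavour of what these checks boil down to under the hood, the AUTO CLOSE/AUTO SHRINK check in Get-DBProperties.ps1 is essentially a query against sys.databases along the lines of the sketch below (an illustration of the check, not necessarily the exact query used in the script):

-- List databases which have AUTO CLOSE or AUTO SHRINK enabled
SELECT name, is_auto_close_on, is_auto_shrink_on
FROM sys.databases
WHERE is_auto_close_on = 1
   OR is_auto_shrink_on = 1;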

Azure Storage and SQL Server – Part 5

In this post, we will start checking whether the SQL Server database files follow the storage best practices recommended for Azure. In Part 4 of this series, I had written about how to identify the temporary drive on an Azure virtual machine. I am going to reuse that snippet of code for the file layout check performed here.

The script performs the following operations:

1. Finds out the temporary drive letter.

2. Determines whether the Azure virtual machine is a D-Series machine.

3. Fetches the database file locations from the SQL Server instance using the Invoke-Sqlcmd cmdlet.

4. Uses the above information to determine whether any database files are present on the temporary drive.

If this is a D-Series virtual machine, the script will not report a warning when it finds tempdb files on the temporary drive, as that configuration is supported. For any other series, the script will report an issue, since hosting database files on the temporary drive is not supported outside the D-Series.

The PowerShell script is available below:


# Find out the IDE drives on the machine which have a SCSI target = 1
$Name = Get-WmiObject -Class Win32_DiskDrive -Filter "InterfaceType = `"IDE`" and SCSITargetId = 1" | Select-Object Name

# Find the partition id corresponding to the IDE temporary drive
$Antecedent = "*" + $Name.Name.Replace("\","").Replace(".","") + "*"
$Dependent = Get-WmiObject -Class Win32_DiskDriveToDiskPartition | Where-Object {$_.Antecedent -like $Antecedent} | Select-Object Dependent

# Find the logical disk which corresponds to the temporary drive partition
$Antecedent = "*" + $Dependent.Dependent.Split("`"")[1] + "*"
$TempDrive = (Get-WmiObject -Class Win32_LogicalDiskToPartition | Where-Object {$_.Antecedent -like $Antecedent} | Select-Object Dependent).Dependent.Split("`"")[1]

Write-Host "INFO: Temporary drive on the machine is:" $TempDrive -ForegroundColor Red

# Variable to track if this is a D-Series VM
[boolean] $DSeries = $false

# Use Get-AzureVM to get the instance size property
$InstanceSize = Get-AzureVM -ServiceName "<azure service name>" -Name "<azure vm name>" | Select-Object InstanceSize

# Find out if this is a D-Series VM
if ($InstanceSize.InstanceSize -like "*_D*")
{
    $DSeries = $true
}

# Get the location of the database files from the system catalog
$dbfiles = Invoke-Sqlcmd -ServerInstance "." -Database "master" -Query "select physical_name, name, db_name(database_id) as dbname, type_desc from sys.master_files"

# Loop through the files to determine if any user or system database files are on the temporary drive (other than tempdb on D-Series)
foreach ($file in $dbfiles)
{
    # Tempdb can be hosted on the temporary drive on D-Series VMs
    if ($DSeries -eq $true)
    {
        if ($file.physical_name.Substring(0,2) -eq $TempDrive -and $file.dbname -ne "tempdb")
        {
            Write-Host "ISSUE:" $file.dbname "(Filename:" $file.name " Location:" $file.physical_name") found on temporary drive" $TempDrive -ForegroundColor Red
        }
    }
    # Report any database files on the temporary drive for non-D-Series VMs
    else
    {
        if ($file.physical_name.Substring(0,2) -eq $TempDrive)
        {
            Write-Host "ISSUE:" $file.dbname "(Filename:" $file.name " Location:" $file.physical_name") found on temporary drive" $TempDrive -ForegroundColor Red
        }
    }
}

Please keep in mind that the above script has not been tested for mount points.

Previous posts in the series

Azure Storage and SQL Server – Part 1
https://troubleshootingsql.com/2014/11/10/azure-storage-for-sql-server/
Azure Storage and SQL Server – Part 2
https://troubleshootingsql.com/2014/11/11/azure-storage-and-sql-server-part-2
Azure Storage and SQL Server – Part 3
https://troubleshootingsql.com/2014/11/12/azure-storage-and-sql-server-part-3
Azure Storage and SQL Server – Part 4
https://troubleshootingsql.com/2014/11/13/azure-storage-and-sql-server-part-4/

References

Azure Storage
http://azure.microsoft.com/en-us/services/storage/

Azure Subscription and Service Limits, Quotas, and Constraints
http://azure.microsoft.com/en-us/documentation/articles/azure-subscription-service-limits/#storagelimits

* This blog post has been written based on the service details available on 10th November, 2014.

Azure Storage and SQL Server – Part 3

In the last post of this series, I had talked about how to determine the number of page blobs present in a particular storage account and whether the Azure storage service limits will be exceeded based on the current configuration. In this post, I shall talk about something a bit closer to the virtual machine which is running your SQL Server workload. The script shown below determines the IOPS supported by the data disks attached to the virtual machine and checks whether all the data disks belong to the same storage account. It is recommended that your data disks belong to a single Azure storage account; if they do not, a warning is flagged in the output, as seen in the screenshot.

The first part of the PowerShell script uses the Get-AzureVM cmdlet to find out the details of the Azure virtual machine which is running your workload. The output of this cmdlet is used as the input for the Get-AzureDataDisk cmdlet, which allows me to find all the data disks attached to the Azure virtual machine. Then a simple calculation (500 IOPS per data disk) shows how much I/O workload the virtual machine can handle.

The next part of the script performs an important task: determining whether the data disks attached to the virtual machine belong to the same storage account. If not, it flags a warning. Typically, for high performance VMs, it is recommended that a single storage account be used for hosting the data disks of the virtual machine.

The PowerShell script and a screenshot of the output are shown below.


# Temporary variables
$AccountName = ""
$fail = 0

# Get the list of data disks for the Azure virtual machine
$DataDisks = Get-AzureVM -ServiceName "<Azure Service Name>" -Name "<Azure VM Name>" | Get-AzureDataDisk

Write-Host "INFO: This virtual machine has" $DataDisks.Count "data disks which can support a total of" ($DataDisks.Count*500) "IOPS" -ForegroundColor Green

# Check if there is a data disk which belongs to a different storage account
foreach ($disk in $DataDisks)
{
    if ($AccountName -eq "")
    {
        $AccountName = $disk.MediaLink.AbsoluteUri.Substring(8,$disk.MediaLink.AbsoluteUri.IndexOf(".")-8)
    }

    if ($AccountName -ne $disk.MediaLink.AbsoluteUri.Substring(8,$disk.MediaLink.AbsoluteUri.IndexOf(".")-8))
    {
        $fail = 1
    }
}

# Show a warning if the data disks span multiple storage accounts
if ($fail -eq 1)
{
    Write-Host "WARNING: Multiple storage accounts found for disks attached to the virtual machine" -ForegroundColor Red
}
else
{
    Write-Host "PASSED: All disks attached to the virtual machine belong to a single storage account" -ForegroundColor Green
}

[Screenshot: output of the script]

Previous posts in the series

Azure Storage and SQL Server – Part 1
https://troubleshootingsql.com/2014/11/10/azure-storage-for-sql-server/

Azure Storage and SQL Server – Part 2
https://troubleshootingsql.com/2014/11/11/azure-storage-and-sql-server-part-2

References

Azure Storage
http://azure.microsoft.com/en-us/services/storage/

Azure Subscription and Service Limits, Quotas, and Constraints
http://azure.microsoft.com/en-us/documentation/articles/azure-subscription-service-limits/#storagelimits

* This blog post has been written based on the service details available on 10th November, 2014.

Azure Storage and SQL Server – Part 2

In the first post of this series, I had talked about the cheat sheet which documents the storage best practices for SQL Server workloads running on Azure virtual machines. In this post, I am going to show how to automate additional checks for your virtual machine and storage configuration: specifically, how to determine if you have more than 40 disks in a storage account, since beyond that point they would not be able to sustain more than 20,000 IOPS as per the storage limits for Azure storage accounts.

The first part of the PowerShell script uses the Get-AzureStorageContainer cmdlet, which allows me to fetch all the containers in a particular storage account; the storage account is specified as a parameter at the beginning of the script. The storage account context is created using the New-AzureStorageContext cmdlet. This context is required for all the subsequent commands to ensure that each cmdlet executes against the correct storage account, which is especially important if you have more than one storage account associated with your subscription.

The next part of the script uses the Get-AzureStorageBlob cmdlet to determine the number of blobs in a container. I used a foreach loop to walk through the contents of each container. The reason I am using local variables is to minimize the network calls to the Azure Storage service when filtering for both types of blobs in each container.

The last if-else block determines if you have more than 40 disks in the storage account. If yes, then it flags an issue, because if all the disks are active they will not be able to scale to their potential of 500 IOPS each. This is due to the fact that the 20,000 IOPS per storage account limit will be divided across the disks available in the storage account.

The PowerShell script and a screenshot of the output are shown below.

# Assign the storage account name
$StorageAccount = "<storage account name>"

# Get the storage account key as this is needed for creating the storage context
$StorageKey = Get-AzureStorageKey $StorageAccount | %{ $_.Primary }

# Create a new storage context for use in the next sections of the code block
$StorageContext = New-AzureStorageContext -StorageAccountName $StorageAccount -StorageAccountKey $StorageKey

# Find out all the containers in the storage account
$Containers = Get-AzureStorageContainer -Context $StorageContext

# Local variables for tracking the page and block blobs
$PageBlob = 0
$BlockBlob = 0

# Get the blobs in each container and count each type of blob present for the summary output
foreach ($name in $Containers)
{
    $temp = Get-AzureStorageBlob -Container $name.Name -Context $StorageContext

    $Blob = $temp | Where-Object {$_.BlobType -eq "PageBlob"}
    $PageBlob += $Blob.Count

    $Blob = $temp | Where-Object {$_.BlobType -eq "BlockBlob"}
    $BlockBlob += $Blob.Count
}

Write-Host "INFO: There are" $Containers.Count "containers in the storage account:" $StorageAccount "which have" $PageBlob "Page Blob(s) and" $BlockBlob "Block Blob(s)" -ForegroundColor Green

# Determine if the 20K IOPS limit will be crossed due to having 40+ disks
if ($PageBlob -le 40)
{
    Write-Host "INFO: This storage account has" $PageBlob "disk(s) which can support a total of" ($PageBlob*500) "IOPS" -ForegroundColor Green
}
else
{
    Write-Host "ISSUE: This storage account has" $PageBlob "disk(s) which can ONLY support a total of 20,000 IOPS" -ForegroundColor Red
}

[Screenshot: output of the script]

Previous post in the series

Azure Storage and SQL Server – Part 1
https://troubleshootingsql.com/2014/11/10/azure-storage-for-sql-server/

 

References

Azure Storage
http://azure.microsoft.com/en-us/services/storage/

Azure Subscription and Service Limits, Quotas, and Constraints
http://azure.microsoft.com/en-us/documentation/articles/azure-subscription-service-limits/#storagelimits

* This blog post has been written based on the service details available on 9th November, 2014.

T-SQL Tuesday #21: It’s easy to be lazy

This month’s revolving blog party is being hosted by the creator himself, Adam Machanic (blog|twitter). The topic Adam chose this month is about the forgettable practices/habits of the past that can be avoided. This is a topic I could write long rants about… but I shall limit myself to three areas where I have managed to chuck some bad habits and thereby reduce the amount of time I used to waste on these avoidable habits!

Your code needs some English!

One of the practices pertaining to T-SQL coding that was brought out in unison by community members during last month’s T-SQL Tuesday was adding relevant comments to your code. I shall take this a notch higher and talk about adding comments for two reasons:

a. To remember why a change/modification was made to the code
b. To help someone supporting your code understand the logic behind what you have written

More often than not, it’s convenience and sometimes sheer laziness that has prevented this. It happens especially during crisis situations where a quick change to the code resolves the situation, but you forget to mention in the file why and when the change was made. This little indiscretion, of which I have been a victim as well as a perpetrator, has cost me a lot of precious time in the past. So, don’t be lazy… add a few English lines to your code to help the person supporting it!

Simplicity in source control can prevent hair loss

Not maintaining any form of version control for your source code is quite common when a single individual manages the entire project or tool development. But I still believe that even simple forms of source control can save you a lot of painful hours and wishing you had the pill that Bradley Cooper had in Limitless. You don’t need version control software or a VSTS setup to manage your code. You could use something as simple as taking a backup of your code, or naming your files v1, v1.2 and so on, before making a huge change to your application code that cannot be captured with comments without writing a small essay. This is another form of laziness which has cost me hours of coding time in the past. So be smart and maintain your code in such a way that you can revert to a previous version in a matter of minutes!!

Documentation is an important key to saving time

I am sure all of you are well aware that documentation is necessary, even though it can be mundane at times. The entire reason I set up this blog was to document those unique quirks about SQL Server and the unique solutions I arrived at by looking at disparate pieces of information. I must admit that I do not have a photographic memory, and I need documentation that I can find again with a keyword search. I use the search on my blog to find my old posts!! If I didn’t document a unique solution, I would end up spending the same X hours working on the same issue the next time I encountered it. The value of documentation is that it saves time!

A wise man/woman learns from his/her mistakes. But a wiser man/woman learns from other people’s mistakes.

Performance benchmarking: Explicit vs Implicit Transactions

Benchmarking… The single most important word in performance tuning. Without an established benchmark, you really don’t know when to actually call it a day!

If you have ever worked on testing for database projects, you have probably tested by trying to narrow down to a data set that is nearly representative of the actual production data set, or by using a subset of queries that is representative of the workload which is expected or active in the production environment.

Now this brings me to the reason for this post. Quite often during testing we forget that SQL Server executes all DML operations in autocommit mode by default. This means that each individual statement is committed if it completes successfully. If you want to avoid this behavior, then you either need to set the IMPLICIT_TRANSACTIONS setting to ON for your database connection or use a BEGIN TRANSACTION command before you execute your query.
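To make the distinction concrete, here is a minimal sketch of the three modes (dbo.SomeTable is just a placeholder table):

-- Autocommit (the default): every statement is its own transaction and commits on success
INSERT INTO dbo.SomeTable (Col1) VALUES (1);

-- Implicit transactions: SQL Server opens the transaction for you, but you must commit it yourself
SET IMPLICIT_TRANSACTIONS ON;
INSERT INTO dbo.SomeTable (Col1) VALUES (2);
COMMIT TRANSACTION;
SET IMPLICIT_TRANSACTIONS OFF;

-- Explicit transaction: you open and commit the transaction yourself
BEGIN TRANSACTION;
INSERT INTO dbo.SomeTable (Col1) VALUES (3);
COMMIT TRANSACTION;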

SQL Server follows the Write-Ahead Logging (WAL) protocol, which means no data modifications are written to disk before the associated log record is written to disk. This maintains the ACID properties of a transaction. (When you involve disk caching, that opens up another can of worms, but that discussion is outside the scope of this blog post.)

My belief is that if you are comparing execution times between two different environments, then you should be doing that on exactly the same hardware configuration; if that is not feasible, you should at least match the CPU, physical RAM and disk sub-system on both sides. I had already documented in an earlier blog post why RAM/CPU configurations are important when comparing execution statistics between two different SQL Server environments. Ideally, your test environment should be an exact replica of production, including the workload (i.e. number of users, connections etc.). However, we all know that Utopia is not a place we live in, and hence the toned-down test scripts, scaled down/up environments, shared resources, and I could keep lamenting!!

In the last month, I dealt with two such issues where a T-SQL batch performing a large number of inserts into a database table was being used to compare performance between two different environments. Since I get called in to fix problems, and no one generally calls CSS to tell us that their server is running as expected… the problem invariably happened to be that a bigger, beefier server was taking longer to execute the same batch. I shall demonstrate how not knowing about the WAL protocol can skew such test mechanisms and take you down the proverbial rabbit hole!

The script that I shall be using for demonstrating the pitfalls of this behavior is pretty simple:

declare @cntr int = 1
while @cntr <= 1000000
begin
    insert into tblInserts (SNo, RowVal) values (@cntr, 'Record ID ' + CAST(@cntr as varchar(7)))
    set @cntr += 1
end

The script inserts 1 million rows into a database table (a heap table) using a while loop. While the inserts run, I shall capture various performance counter values along with wait statistics for the query.
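The definition of tblInserts is not shown in the original script; any similar two-column heap will do, for example something along these lines (an assumption on my part):

-- Minimal heap table matching the insert statement above (assumed definition)
CREATE TABLE tblInserts
(
    SNo    int         NOT NULL,
    RowVal varchar(20) NOT NULL
);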

Modification: September 9th, 2011: Based on Kendra’s (Twitter) feedback, changing the sub-headings. Test 1 makes use of auto-commit mode of SQL Server which is the default and Test 2 can be achieved either by using implicit transaction mode or performing an explicit transaction (BEGIN…COMMIT).

Test 1: With AutoCommit mode

For two iterations, the above script took 8 minutes and 30 seconds on average to complete. When I looked into the wait statistics captured (at 5 second intervals), I didn’t see anything standing out other than a few non-zero wait durations for WRITELOG during certain periods. The non-zero wait times (in milliseconds) are shown below, with the highest captured value being 10 milliseconds. The Average Disk Secs/(Read/Write) counters didn’t show any outstanding values during the data capture to indicate that the disk was a bottleneck, so why does the data below show so many WRITELOG waits? Keep in mind that the total amount of wait time indicated by the data below is also not significant. So why is it taking over eight minutes to insert the data?

[Screenshot: Performance Monitor data captured during Test 1]

Runtime Session_id Status Wait_type Wait_time Last_Wait_Type
12:46 AM 51 suspended WRITELOG 10 WRITELOG
12:46 AM 51 suspended WRITELOG 7 WRITELOG
12:47 AM 51 suspended WRITELOG 1 WRITELOG
12:47 AM 51 suspended WRITELOG 2 WRITELOG
12:47 AM 51 suspended WRITELOG 3 WRITELOG
12:47 AM 51 suspended WRITELOG 2 WRITELOG
12:48 AM 51 suspended WRITELOG 7 WRITELOG
12:49 AM 51 suspended WRITELOG 8 WRITELOG
12:49 AM 51 suspended WRITELOG 1 WRITELOG
12:49 AM 51 suspended WRITELOG 10 WRITELOG
12:49 AM 51 suspended WRITELOG 6 WRITELOG
12:50 AM 51 suspended WRITELOG 2 WRITELOG
12:51 AM 51 suspended WRITELOG 3 WRITELOG
12:51 AM 51 suspended WRITELOG 4 WRITELOG
12:51 AM 51 suspended WRITELOG 1 WRITELOG
12:51 AM 51 suspended WRITELOG 2 WRITELOG
12:52 AM 51 suspended WRITELOG 5 WRITELOG
12:52 AM 51 suspended WRITELOG 7 WRITELOG
12:53 AM 51 suspended WRITELOG 1 WRITELOG
12:53 AM 51 suspended WRITELOG 8 WRITELOG

To explain this, I would need to fall back on SQL Server Performance Monitor counters (sampled at 5 second intervals). On analysis of the performance monitor counters, I found the following:
a. SQLServer:Databases: Log Bytes Flushed/sec showed that on average 3.1 MB of log bytes were being flushed every 5 seconds for the database on which I was performing the inserts.
b. SQLServer:Databases: Log Flushes/sec showed that about 6000 log flushes were occurring for this database every 5 seconds on average.
c. SQLServer:Wait Statistics: Log write waits showed that on average about 9000+ waits were started per second. However, the Cumulative wait time (ms) per second for the same counter showed negligible values, and the Average wait time (ms) value was nearly zero throughout the data collection period.

So how can I explain where the time is being spent?? Now I decided to run another test without relying on autocommit.

Test 2: Without AutoCommit mode

[Screenshot: Performance Monitor data captured during Test 2]

I ran two iterations of the above T-SQL batch within a BEGIN TRANSACTION…COMMIT block, and the average duration was 14 seconds! Whattttt!??!?! Yes… and all of this can be explained using the same counters that I looked at above. This time around the performance monitor data told me the following story:
a. SQLServer:Databases: Log Bytes Flushed/sec showed that on average 26 MB of log bytes were being flushed every 5 seconds for the database on which I was performing the inserts.
b. SQLServer:Databases: Log Flushes/sec showed that about 468 log flushes were occurring for this database every 5 seconds on average.
c. SQLServer:Wait Statistics: Log write waits showed that on average only about 23 waits were started per second.
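For reference, the Test 2 batch is simply the same loop from above wrapped in a single explicit transaction, along the lines of the sketch below:

-- Test 2: the same one million inserts inside one explicit transaction
BEGIN TRANSACTION;

declare @cntr int = 1
while @cntr <= 1000000
begin
    insert into tblInserts (SNo, RowVal) values (@cntr, 'Record ID ' + CAST(@cntr as varchar(7)))
    set @cntr += 1
end

COMMIT TRANSACTION;

The log buffer is now flushed when it fills up or at the final commit, rather than once for every single-row insert, which is exactly what the drop in Log Flushes/sec shown above reflects.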

If you look at the Performance Monitor graphs for the disk statistics that I captured for a single run of Test 1 (screenshot above) and Test 2 (screenshot on the left), you will see that the %Idle Time for the disk on which the database files resided (the F: drive in my case) shows the disk was busy about 50% of the time during Test 1 (see the blue squiggly above), with the value climbing back to ~100% after the test completed. On the contrary, the %Idle Time for the same disk has a very zig-zag pattern during Test 2 (see the black squiggly on the left), which indicates that the disk was only sporadically busy rather than continuously busy as observed in Test 1.

The Avg. Disk Sec/Write counter shows absolutely no latency, which means that there is no problem in terms of latency on the disks.

During Test 1, data was being written to the disk at a rate of 907 KB/s, whereas during Test 2 the write rate was 5 MB/s, as determined by monitoring the Disk Write Bytes/sec counter.

The reason for this difference is that every insert in Test 1 is followed by an autocommit. This means that the log buffer has to be flushed for each insert. In Test 2, however, the log buffer was flushed far less often, but more data was written to disk with each log flush. Since SQL Server follows the WAL protocol, it cannot commit and move on to the next transaction until the earlier log flush has completed.

If you are running a similar test on two different sets of hardware, then you would end up chasing down disk speeds/RPMs between the two servers. For servers whose disks are performing optimally, this behavior is quite difficult to pin down without the right set of data.

With SQL Server 2008, tracking down the waits for a single session is much, much easier with the help of XEvents. But that shall be a story for another day… errrr.. post!!

Moral of the story: If you are using a test query similar to the one shown above to test/benchmark performance and the query batch is not indicative of your actual workload, then you would probably not be able to gauge or establish an accurate performance benchmark for the queries executing on your system. And REMEMBER that SQL Server runs in auto-commit mode by default!

HAPPY TESTING!

T-SQL Tuesday #19: Disasters and Recovery

This month’s revolving blog party a.k.a. T-SQL Tuesday is being hosted by Allen Kin (blog | twitter). The topic for this month is Disasters and Recovery. The creation of fail-safe mechanisms is probably one of the most important facets of any IT administrator role in today’s world, where online transactions have become synonymous with daily life. When creating a DR strategy, you need to keep three things in mind:

1. RPO – Recovery Point Objective
2. RTO – Recovery Time Objective
3. SLA – Service Level Agreements

Mike Walsh has already documented the above three points in an earlier post on SQL University DBA Week. So what am I going to talk about in this post….. Well, since a major part of my daily job involves working on critical situations which sometimes involve disaster recovery, I will talk about some key but simple points that you should be aware of while restoring your database[s] in case your server encounters a disaster.

1. Always have backups – This point cannot be reiterated enough! You should always have backups of your databases, and you should store those backups on separate media, not on the same disk drives that store the database files. This ensures that you don’t have a single point of failure. I have seen a lot of scenarios where the backups and the database files are stored on the same drive; once the drive goes BOOM!!… you are left with zilch!! A bad scenario to be in!

2. Test your backups – Just taking regular backups doesn’t ensure that you will be safe when a disaster strikes. You need to restore your backups and ensure that they can be restored successfully. If you have an automated DR strategy in place, it is always good to perform dry-runs to ensure that your team is well versed with the recovery process when the need arises; you don’t want to be grappling with your restore scripts during a crisis. The next nugget of information is to ensure that a DBCC CHECKDB on the restored database completes without any errors. Just because the restore was successful doesn’t mean that the database is consistent! (A minimal verification sketch is shown after this list.)

3. Know your environment – An application doesn’t just depend on your database[s]. There might be customized connection settings, connection aliases, specific logins, database users, linked servers etc. which need to be kept handy in case you need to bring online a new environment which is a clone of your previous disaster-ridden system. I have seen multiple occasions where the databases were restored successfully but the logins and linked servers specific to the application were missing. You then have an environment which has the application databases, but the other specifics pertaining to the application’s functioning are missing.

4. System databases need to be backed up also – System databases do need to be backed up as well. For example, without a backup of the master database in a disaster scenario, you will be missing the logins that your application needs to log in to the user database.

5. Benchmarking is very important – As I mentioned earlier, a dry-run is very important. This is primarily because if you do not know how much time a restore is going to take, you cannot define your RTO and adhere to your agreed SLAs. A classic situation is that the application needs to be up within 4 hours, but since no one has tested the entire restore cycle, no one knows how long it will take to restore the set of full/differential/log backups that are available.

6. Avoid a single point of failure – This is mostly considered a good-to-have, but in critical environments I consider it a must-have! A simple implementation of this is redundancy: keep two copies of your database backups. If one set of backups is inconsistent, you have a redundant set to fall back on. A decision taken to save disk space by reducing the number of redundant copies can look very daft when you are not able to bring a production system online due to the unavailability of consistent backups.

7. Never rely on REPAIR_ALLOW_DATA_LOSS as your savior – The REPAIR_ALLOW_DATA_LOSS option provided with CHECKDB should always, always be your last resort! Only when all else fails should you resort to repair options. This repair option should never be your first option for recovering from a disaster because, as the name states, it can result in data loss!!

8. Know how long a CHECKDB takes to complete on the database – If you do not run CHECKDB regularly on the database for which you are creating a DR strategy, then you are inviting trouble. Always run periodic CHECKDB on your databases and note the time taken so that you have a fair estimate on how long a CHECKDB should take to complete successfully on the given database.

9. Redundant database copies – A lot of environments use Database Mirroring, Log Shipping and Replication to maintain duplicate copies of the database. If you are using any of these features to maintain copies of the existing databases, then you need to note two things: the first is the latency between the primary and secondary copies. This will define your RPO, as the average latency is the amount of data loss that you should be prepared to deal with; it will also influence your RTO to some measure, as the time taken to recover the missing data is governed by that latency. The second point to keep in mind is that if you decide to use one of the alternate copies of the database as the new production database, then you need to ensure that you avoid certain gotchas, for example orphaned users for SQL authenticated logins when you use log shipping or database mirroring.

10. Keep in mind the additional SQL Server features being used – If you are using replication, mirroring or log shipping on the primary database being recovered, then you need to account for additional steps before restoring the databases, as a simple restore of a database backup will not do for such a database. For example, special considerations need to be followed for restoring replicated databases.
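As a minimal sketch of points 2, 4 and 8 above (the database name, backup paths and logical file names below are placeholders you would replace with your own), a periodic verification job could look like this:

-- Back up a user database and the master database to a drive separate from the database files
BACKUP DATABASE [YourAppDB] TO DISK = N'E:\Backups\YourAppDB.bak' WITH CHECKSUM, INIT;
BACKUP DATABASE [master] TO DISK = N'E:\Backups\master.bak' WITH CHECKSUM, INIT;

-- Verify that the backup media is readable (this alone does not prove the database is consistent)
RESTORE VERIFYONLY FROM DISK = N'E:\Backups\YourAppDB.bak' WITH CHECKSUM;

-- Restore the backup to a test database and run DBCC CHECKDB, noting how long each step takes
RESTORE DATABASE [YourAppDB_Verify]
FROM DISK = N'E:\Backups\YourAppDB.bak'
WITH MOVE 'YourAppDB' TO N'F:\Data\YourAppDB_Verify.mdf',
     MOVE 'YourAppDB_log' TO N'F:\Logs\YourAppDB_Verify.ldf',
     REPLACE;

DBCC CHECKDB ([YourAppDB_Verify]) WITH NO_INFOMSGS, ALL_ERRORMSGS;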

For the non-technical aspects, a disaster recovery plan should also include the following (from Books Online):

  1. A list of people to be contacted if a disaster occurs
  2. Information about who owns the administration of the plan
  3. A checklist of required tasks for each recovery scenario. To help you review how disaster recovery progressed, initial each task as it is completed, and indicate the time when it finished on the checklist.

The above points might seem basic, but it is surprising how often they are not religiously followed in production environments!