This month’s revolving blog party a.k.a. T-SQL Tuesday is being hosted by Allen Kin (blog | twitter). The topic for this month is Disasters and Recovery. The creation of fail-safe mechanisms is probably one of the most important facet of any IT administrator role in today’s world where online transactions have become synonymous to daily lives. When creating a DR strategy, you need to keep three things in mind:
1. RPO – Recovery Point Objective
2. RTO – Recovery Time Objective
3. SLA – Service Level Agreements
Mike Walsh has already documented about the above three points in an earlier post on SQL University DBA Week. So what am I going to talk about in this post….. Well, since a major part of my daily job involves working on critical situations which sometimes involve disaster recovery, I will talk about some key but simple points that you should be aware of while restoring your database[s] in case your server encounters a disaster.
1. Always have backups – This point no matter how many times reiterated is still less! You should always have backups of your databases. You should store your backups on a separate media which is not the same as the disk drives which store the database files. This ensures that you don’t have a single point of failure. I have seen a lot of scenarios where the backups and the database files are stored on the same drive. Once the drive goes BOOM!!… You are left with zilch!! A bad scenario to be in!
2. Test your backups – Just taking regular backups doesn’t ensure that you will be safe when a disaster strikes. You need to restore your backups and ensure that the backups can be restored successfully. If you have an automated DR strategy in place, then it is always good to perform dry-runs to ensure that your team is well versed with the recovery process when the need arises. You don’t want to be grappling with your restore scripts during a crisis situation. The next nugget of information is to ensure that a DBCC CHECKDB on the restored database completes without any errors. Just because the restore was successful, doesn’t mean that the database is consistent!
3. Know your environment – An application doesn’t just depend on your database[s]. There might be customized connection settings, connection aliases, specific logins, database users, linked servers etc. which need to be kept handy in case you need to bring a new environment online which was a clone of your previous disaster ridden system. I have seen multiple times where the databases have been restored successfully but the logins and linked specific to the application are missing. So now you have an environment which has the application databases but other specifics pertaining to the application’s functioning are missing.
4. System databases need to be backed up also – System databases do need to be backed up as well. Eg. Without the master database backup in a disaster scenario, you will be missing the necessary logins that your application needs to login to the user database.
5. Benchmarking is very important – As I mentioned earlier, a dry-run is very important. This is primarily due to the fact that if you do not know how much time a restore is going to take, you cannot define your RTO and adhere to your agreed SLAs. A classic situation is that the application needs to be up within 4 hours but since no once tested the entire restore cycle, no one knows how long it will take to restore the set of full/differential/log backups that are available.
6. Have multiple points of failure – This is mostly considered as a good to have but in critical environments, I consider this as a must-have! A simple implementation of this would be redundancy. Keep two copies of your database backups. If one set of database backups are inconsistent, you have a redundant set of backups to fall back on. A decision taken to disk space by reducing the number of redundant copies can look very daft when you are not able to bring a production system online due to the unavailability of consistent backups.
7. Never rely on REPAIR ALLOW DATA LOSS as your savior – The REPAIR ALLOW DATA LOSS option provided with CHECKDB should always and always be your last resort! This means that when all else fails, then you resort to repair options. This repair option should never be your first option for recovering from a disaster because as the name states it always results in data loss!!
8. Know how long a CHECKDB takes to complete on the database – If you do not run CHECKDB regularly on the database for which you are creating a DR strategy, then you are inviting trouble. Always run periodic CHECKDB on your databases and note the time taken so that you have a fair estimate on how long a CHECKDB should take to complete successfully on the given database.
9. Redundant database copies – A lot of environments use Database Mirroring, Log Shipping and Replication to maintain duplicate copies of the database. If you are using any of these features to maintain copies of the existing databases, then you need to note two things: first being the latency between the primary and secondary copies. This will define your RPO as the average latency will be the amount of data loss that you should be prepared to deal with and this will also define RPO to some measure as the time taken to recover the missing data would be defined by latency. Another point to keep in mind is that if you decide to use one of the alternate database copy of the database as the new production database, then you need to ensure that you avoid certain gotchas. Example: Orphaned users for SQL Authenticated logins when you use log shipping or database mirroring.
10. Keep in mind the additional SQL Server features being used – If you are using replication or mirroring or log shipping on the primary database being recovered, then you need to account for additional steps before restoring the databases as a simple restore of a database backup for such a database will not do. Eg. Special considerations need to be followed for restoring replicated databases.
For the non-technical aspects, a disaster recovery plan should include the following:
Disaster recovery plan types include the following (from Books Online):
- A list of people to be contacted if a disaster occurs
- Information about who owns the administration of the plan
- A checklist of required tasks for each recovery scenario. To help you review how disaster recovery progressed, initial each task as it is completed, and indicate the time when it finished on the checklist.
The above points might seem like basics but it would be surprising that they don’t get religiously followed on some production environments!
Very good contribution Amit. Your last point is exactly what I was thinking, “how often does all this happen?”. The answer is probably (sadly) hardly never.
Thanks Mark. A lot of these steps if implemented would prevent real crisis situations or atleast a few cool heads during crisis situations 🙂
Pingback: T-SQL Tuesday #19 Wrapup | Allen Kinsel - SQL DBA
This was very good contribution worth putting forward to the DBA community as we tend to forget and forget any of the above points resulting in huge escalation from our clients and millions worth of data loss.