T-SQL Tuesday #19: Disasters and Recovery

This month’s revolving blog party, a.k.a. T-SQL Tuesday, is being hosted by Allen Kinsel (blog | twitter). The topic for this month is Disasters and Recovery. The creation of fail-safe mechanisms is probably one of the most important facets of any IT administrator’s role in today’s world, where online transactions have become synonymous with daily life. When creating a DR strategy, you need to keep three things in mind:

1. RPO – Recovery Point Objective
2. RTO – Recovery Time Objective
3. SLA – Service Level Agreements

Mike Walsh has already covered the above three points in an earlier post for SQL University DBA Week. So what am I going to talk about in this post? Well, since a major part of my daily job involves working on critical situations which sometimes involve disaster recovery, I will talk about some key but simple points that you should be aware of while restoring your database[s] in case your server encounters a disaster.

1. Always have backups – This point cannot be reiterated enough! You should always have backups of your databases, and you should store those backups on separate media, not on the same disk drives that store the database files. This ensures that you don’t have a single point of failure. I have seen a lot of scenarios where the backups and the database files are stored on the same drive. Once the drive goes BOOM!!… you are left with zilch!! A bad scenario to be in! A minimal backup command is sketched below.
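
As a minimal sketch (the database name and the backup path are placeholders), a full backup to separate media with checksums enabled could look like this:

```sql
-- Full backup written to media separate from the data/log drives.
-- SalesDB and the UNC path are placeholders for illustration.
BACKUP DATABASE SalesDB
TO DISK = N'\\BackupShare\SQLBackups\SalesDB_Full.bak'
WITH INIT, CHECKSUM, STATS = 10;
```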

2. Test your backups – Just taking regular backups doesn’t ensure that you will be safe when a disaster strikes. You need to restore your backups and verify that they can be restored successfully. If you have an automated DR strategy in place, it is always good to perform dry runs to ensure that your team is well versed with the recovery process when the need arises. You don’t want to be grappling with your restore scripts during a crisis. The next nugget of information is to ensure that a DBCC CHECKDB on the restored database completes without any errors. Just because the restore was successful doesn’t mean that the database is consistent! A verification sketch follows.
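
A minimal verification sketch, assuming a scratch location with enough disk space (all names and paths are placeholders):

```sql
-- Restore the latest full backup to a scratch database.
RESTORE DATABASE SalesDB_Verify
FROM DISK = N'\\BackupShare\SQLBackups\SalesDB_Full.bak'
WITH MOVE N'SalesDB'     TO N'D:\Scratch\SalesDB_Verify.mdf',
     MOVE N'SalesDB_log' TO N'D:\Scratch\SalesDB_Verify_log.ldf',
     STATS = 10;

-- A clean restore alone doesn't prove consistency; run CHECKDB as well.
DBCC CHECKDB (N'SalesDB_Verify') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```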

3. Know your environment – An application doesn’t depend on your database[s] alone. There might be customized connection settings, connection aliases, specific logins, database users, linked servers etc. which need to be kept handy in case you need to bring online a new environment that is a clone of your previous disaster-ridden system. I have seen multiple occasions where the databases were restored successfully but the logins and linked servers specific to the application were missing. So now you have an environment which has the application databases, but the other specifics pertaining to the application’s functioning are missing. A quick inventory query is sketched below.
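
As a starting point for such an inventory (a sketch; extend it for aliases, Agent jobs, credentials, etc. in your environment):

```sql
-- Instance-level logins the application may depend on.
SELECT name, type_desc, create_date
FROM sys.server_principals
WHERE type IN ('S', 'U', 'G')   -- SQL logins, Windows logins and groups
  AND name NOT LIKE '##%';      -- exclude internal certificate-based logins

-- Linked servers defined on the instance.
SELECT name, product, provider, data_source
FROM sys.servers
WHERE is_linked = 1;
```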

4. System databases need to be backed up also – System databases do need to be backed up as well. For example, without a backup of the master database, in a disaster scenario you will be missing the logins that your application needs to connect to the user database. A sketch is shown below.
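
A minimal sketch for the system databases (paths are placeholders; msdb also holds your Agent jobs and backup history):

```sql
-- master holds logins and server-level configuration.
BACKUP DATABASE master TO DISK = N'\\BackupShare\SQLBackups\master.bak' WITH INIT;
-- msdb holds SQL Agent jobs, alerts and backup history.
BACKUP DATABASE msdb   TO DISK = N'\\BackupShare\SQLBackups\msdb.bak'   WITH INIT;
-- model is the template for newly created databases.
BACKUP DATABASE model  TO DISK = N'\\BackupShare\SQLBackups\model.bak'  WITH INIT;
```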

5. Benchmarking is very important – As I mentioned earlier, a dry run is very important. This is primarily because if you do not know how much time a restore is going to take, you cannot define your RTO and adhere to your agreed SLAs. A classic situation: the application needs to be up within 4 hours, but since no one tested the entire restore cycle, no one knows how long it will take to restore the available set of full/differential/log backups. The history query below can help establish a baseline.
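
One way to establish a baseline is from the history that msdb already records. Backup durations are only a rough proxy for restore times (restores usually take at least as long), so treat this sketch purely as a starting point:

```sql
-- Recent backup durations and sizes from msdb's backup history.
SELECT TOP 20
       bs.database_name,
       bs.type AS backup_type,   -- D = full, I = differential, L = log
       bs.backup_start_date,
       DATEDIFF(MINUTE, bs.backup_start_date, bs.backup_finish_date) AS duration_min,
       bs.backup_size / 1048576.0 AS size_mb
FROM msdb.dbo.backupset AS bs
ORDER BY bs.backup_start_date DESC;
```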

6. Build in redundancy – This is mostly considered a good-to-have, but in critical environments I consider it a must-have! A simple implementation of this would be redundancy: keep two copies of your database backups. If one set of backups is inconsistent, you have a redundant set to fall back on. A decision taken to save disk space by reducing the number of redundant copies can look very daft when you are not able to bring a production system online due to the unavailability of consistent backups. A mirrored backup sketch is shown below.
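
On Enterprise Edition, mirrored backup media sets write both copies in a single operation (a sketch; on other editions, simply copy the backup file to a second location after the backup completes):

```sql
-- Write the same backup to two destinations in one operation.
-- Mirrored media sets require Enterprise Edition and WITH FORMAT.
BACKUP DATABASE SalesDB
TO DISK = N'D:\Backups\SalesDB_Full.bak'
MIRROR TO DISK = N'\\RemoteShare\SQLBackups\SalesDB_Full.bak'
WITH FORMAT, CHECKSUM, STATS = 10;
```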

7. Never rely on REPAIR_ALLOW_DATA_LOSS as your savior – The REPAIR_ALLOW_DATA_LOSS option provided with CHECKDB should always be your last resort! Only when all else fails should you turn to repair options. This repair option should never be your first option for recovering from a disaster because, as the name states, it can result in data loss!!

8. Know how long a CHECKDB takes to complete on the database – If you do not run CHECKDB regularly on the database for which you are creating a DR strategy, then you are inviting trouble. Always run periodic CHECKDB on your databases and note the time taken, so that you have a fair estimate of how long a CHECKDB should take to complete successfully on the given database. A simple timing sketch follows.
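
A simple way to capture the elapsed time so it can be recorded and trended (a sketch; SalesDB is a placeholder):

```sql
-- Time a routine CHECKDB run for trending purposes.
DECLARE @start datetime;
SET @start = GETDATE();

DBCC CHECKDB (N'SalesDB') WITH NO_INFOMSGS;

SELECT DATEDIFF(SECOND, @start, GETDATE()) AS checkdb_seconds;
```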

9. Redundant database copies – A lot of environments use Database Mirroring, Log Shipping and Replication to maintain duplicate copies of the database. If you are using any of these features to maintain copies of your existing databases, then you need to note two things. First, the latency between the primary and secondary copies: this defines your RPO, since the average latency is the amount of data loss that you should be prepared to deal with, and it also affects your RTO to some measure, since the time taken to recover the missing data depends on that latency. Second, if you decide to use one of the alternate copies of the database as the new production database, then you need to ensure that you avoid certain gotchas. Example: orphaned users for SQL-authenticated logins when you use log shipping or database mirroring; see the sketch below.
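
A sketch for catching and fixing orphaned users after bringing a secondary copy online (user and login names are placeholders):

```sql
-- List database users whose SIDs no longer map to a server login.
EXEC sp_change_users_login 'Report';

-- Re-map an orphaned database user to the matching SQL-authenticated login.
EXEC sp_change_users_login 'Update_One', 'AppUser', 'AppLogin';
```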

10. Keep in mind the additional SQL Server features being used – If you are using replication or mirroring or log shipping on the primary database being recovered, then you need to account for additional steps before restoring the databases, as a simple restore of a database backup will not do for such a database. For example, special considerations need to be followed for restoring replicated databases, as sketched below.
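
One such consideration (a sketch only; Books Online documents the complete procedure): restoring a published database back onto the same server requires KEEP_REPLICATION so that the replication settings are preserved:

```sql
-- Restore a published database without stripping its replication settings.
-- Names and paths are placeholders.
RESTORE DATABASE SalesDB
FROM DISK = N'\\BackupShare\SQLBackups\SalesDB_Full.bak'
WITH KEEP_REPLICATION, RECOVERY;
```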

For the non-technical aspects, a disaster recovery plan should also include the following (from Books Online):

  1. A list of people to be contacted if a disaster occurs
  2. Information about who owns the administration of the plan
  3. A checklist of required tasks for each recovery scenario. To help you review how disaster recovery progressed, initial each task as it is completed, and indicate the time when it finished on the checklist.

The above points might seem like basics, but it is surprising how often they are not religiously followed in production environments!

T-SQL Tuesday #13 – What the Business Says Is Not What the Business Wants

This is my second post for T-SQL Tuesday. This time around, T-SQL Tuesday is being hosted by Steve Jones (Twitter | Blog), the Voice of the DBA on SQLServerCentral.com. The topic this time is not a technical one, but it is nonetheless very important in the grander scheme of things.

I have seen multiple scenarios where the teams involved in deploying solutions/projects work in silos. There is the management team, which decides what the final outcome of the project should look like and what the end deliverables are. Then there is the set of developers who can throw together a gazillion lines of code (in multiple languages) to put together a front-end, and maybe a middle-tier, which talks to a database server, fetches data and presents it to the end-user. And finally, there is the set of DBAs who are always guilty of the post-deployment issues “till the database is proven innocent”, as most applications hit a database to fetch data at one time or another.

There are some key things to consider and contemplate before you start writing code and deploying your solutions.

What’s the business?

Knowing your business is like knowing your backyard: the bigger it is, the more lethargic you tend to become. Business acumen is not something that is missing in individuals; it has more to do with “I want to” versus “I don’t want to”! The positive side of knowing enough about your business is that it can make your life easier. It will prevent you from spending additional hours troubleshooting database performance issues because you deployed something that was not needed by the business and was unfriendly to your current database environment. So unless you have a fair idea of why you are implementing what you are implementing, you are going to invite trouble later, if not sooner!

Requirement Analysis: The Who, What and How

If you have a background in software design or have been part of a project, then you will know what RA (Requirement Analysis) is. However, with time-to-market commitments becoming tighter and tighter in a competitive environment, RA becomes a non-entity or a mere formality, with developers developing without knowing why they are doing what they are doing. A definite recipe for disaster! Unless you understand WHO is going to be using what you are developing, WHAT purpose it is going to serve and HOW they will be using it, you will not be in a position to explain why a module or feature may be overkill or completely unnecessary. Figuring out the Who, What and How requires a certain understanding of the business (my first point), and without this you are bound to face challenges!

Good-to-have and Must-have

You must have heard this a million times: what is a must-have as opposed to a good-to-have? Without knowing this, you might not be able to land a project where the end deliverables are met. It is never a good idea to over-commit. Timelines and deliverables need to be frozen before you start developing, which points back to the planning phase. Just because the business asks for it doesn’t mean that the business needs it. Find out if it is going to serve the end goal. What is being prioritized may not be a must-have but actually a good-to-have. Must-haves in a project are non-negotiable (if the right RA has been done), and the good-to-have is, well… (the phrase is self-explanatory).

The shops where there are smart people who can look beyond their egos and work efficiently in a cross-team environment usually don’t end up in a soup post-deployment, using band-aids to fix a “gone-to-hell” deployment.

Conclusion: Cohesion in communication among teams is required to formulate a concise design goal that achieves or solves a specific business need. Without this, you will be back to the drawing board in no time, after spending a substantial amount of effort and $$ in a futile exercise.

Note: The above post is not penned to point fingers at any set of individuals or any group. It is just an attempt to help understand the need for cross-collaboration between business groups. Additionally, it is equally pertinent to know your backyard before you deploy something. This is applicable to all business groups and not just DBAs!

T-SQL Tuesday: Common SQL Server myths – Series I–Setup and Services

As part of the Microsoft CSS group, we get to deal with a lot of production as well as non-production related issues. While troubleshooting such issues, we also end up dispelling one myth or another about the SQL Server architecture.

So, this time when Sankar Reddy (Blog | Twitter), a SQL Server MVP, decided to host T-SQL Tuesday, a community blogging effort started by long time SQL Server MVP, Adam Machanic (Blog | Twitter), I decided to contribute a post on this month’s topic “Misconceptions in SQL Server”.

There are multiple posts by very eminent people in the SQL community that have debunked various SQL Server myths. If you are not a SQL Server specialist and buy into these myths, which have spread through the community like wildfire, you would probably think that SQL Server and MS Access are the same. And as any good SQL Server DBA knows, this is not true! Since T-SQL Tuesday is an effort to enhance and share community knowledge, I shall attempt to refute and clarify certain common myths that I deal with day in and day out during the course of my work with SQL CSS. My first series is on:

Series #1: SQL Server Services and Patching

It is alright to change the SQL Service account and password from the Service Manager snap-in
This is one of the most common misconceptions that people across the world have in this area. The SQL Server Configuration Manager has been designed to do the following (other than a host of other configuration activities):

  1. Change SQL service account
  2. Update the password of the SQL service account
  3. Make configuration related changes like adding trace flags

If you have not used Configuration Manager to do any of the above and have not faced an issue, then you are lucky! But Lady Luck may not always be on your side! The Configuration Manager (available from SQL Server 2005 onwards) does a host of other things when you change your service account, like granting the necessary permissions on the registry for the SQL service account, which the Services Manager doesn’t do. Even in SQL Server 2000, service account changes should be done through Enterprise Manager, as it grants the permissions necessary to use the Full-Text feature. As new features get added to the product, more and more considerations w.r.t. ACLs, registry permissions, encryption keys, etc. need to be taken into account while modifying SQL service accounts. The Configuration Manager was designed with these considerations in mind. So this is not a new recommendation; it has been around for more than a decade.

The other option that you have is to use WMI and the SQL Server namespaces to programmatically change the SQL service account. I had posted an example of this on TechNet. If you want to modify the Reporting Services service account, it is a bit more complex, because you also need to manage the encryption keys.


Windows Cleanup Utility can be used to uninstall SQL Server
All I have to say to this is: NOT a good idea if you still want to be the DBA! The Windows Cleanup Utility shouldn’t be used unless Microsoft has published (via an official blog post or KB article) that it should be used for a certain scenario. The Windows Cleanup Utility is a generic tool that uses the Windows Installer GUIDs from the registry to annihilate the product’s existence from your box! Since we use Windows Installer to install and patch SQL Server (true for SQL Server 2005 to SQL Server 2008 R2), the SQL component registry entries are central to SQL’s existence on the box. If these are not cleaned up correctly by our uninstaller program, then you can have a server which behaves very goofily the next time you run a SQL installation program. The worst-case scenarios that I have seen are:

  1. No further patching possible for existing instances of SQL on the box
  2. No further new installations of SQL Server on the box

You wouldn’t want to land up in either situation. In such situations, you might even have to rebuild the box! Not pretty when you have multiple instances running on the server!


Different nodes can have different builds for the same SQL instance
SQL Server clusters are not supported in scenarios where different nodes belonging to the same clustered instance are on different builds. The reason behind this is that there can be code differences between the SQL builds that we release. So if you fail over to a node running a lower build of SQL Server, then you can expect to see known issues surfacing which do not occur on the other node/nodes. A quick way to verify the build on the active node is shown below.
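
A quick sanity check, worth running after every failover (a sketch using documented SERVERPROPERTY values):

```sql
-- Confirm the build and the node currently hosting the clustered instance.
SELECT SERVERPROPERTY('ProductVersion')              AS build,
       SERVERPROPERTY('ProductLevel')                AS service_pack_level,
       SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS active_node;
```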


SQL Server setup will automatically patch the newly added node
If you have recently added a node to the SQL instance, then you should immediately patch it to the current build of the SQL clustered instance that you added the node to, before putting that node into production. Up to SQL Server 2005, there was no option of slipstreaming such installations. You can use the steps mentioned in the following KB Article in case you run into issues while patching the newly added node.


SQL Server services don’t go offline while patching the SQL instance
Another common misconception, and one that makes applications connecting to the SQL Server instance unhappy (read: application timeouts) if they are kept online during SQL setup. At a high level, I like to look at SQL Server setup as a two-step process. The first step involves the MSI (Windows Installer) actions, which are responsible for making the necessary registry changes, laying down the binaries, and making other Windows-related configuration changes, including .NET (if necessary). The second step is when the SQL Server instance is started up and configuration scripts (T-SQL) are executed to configure the instance and make the necessary changes to the system metadata residing in the SQL system databases. For the first step, the SQL instance needs to be offline, as the system database files (like mssqlsystemresource) cannot be updated while the SQL instance has a lock on them. When the SQL service is started to configure the instance, it runs in single-user mode with certain startup parameters so that applications cannot connect to it during that time. It is always advisable to take a downtime while patching SQL instances. Hot patching is still not available in SQL Server.


Deleting files from the C:\Windows\Installer folder will free up disk space
A mistake that I have seen so many people make that I have lost count. Yes, it will free up disk space, but at a cost: it is possible that you will no longer be able to patch your existing applications which use Windows Installer for installation/patching. Surprisingly, this is a hidden folder, and still people think it’s wise to delete files from it. We made it hidden for a reason!! As mentioned earlier in this post, SQL installation uses Windows Installer. The files (.MSI and .MSP) used during installation are cached in the Windows Installer folder. If you delete these files, then you will land up with various “missing MSI/MSP file” errors. SQLServerFAQ (Blog | Twitter), a SQL CSS MSDN blog, has multiple posts with step-by-step instructions for when you run into such issues. Disclaimer: the steps mentioned in those posts are NOT for the faint-hearted! So avoid Shift+Delete of such files!

References:
Part 1 – SQL Server 2005 Patch Fails to install with an error “Unable to install Windows Installer MSP File”
SQL SERVER 2005 SP3 Install Fails with Error 1612 – Unable to install Windows Installer MSP file

In case I have missed any other common myth related to SQL services and setup, please feel free to leave a comment and I shall add it to the post. Leave your name and Twitter handle (if you have one) and I shall credit the myth to you.

CU10 is a must for all instances where mirroring is being used

If you happen to use Database Mirroring on SQL Server 2005, then Cumulative Update 10 for SQL Server 2005 Service Pack 3 is a must for your instances. There are three database mirroring related fixes in this cumulative update package. I would suggest that all SQL Server 2005 instances that have database mirroring enabled have this update installed ASAP.

982933 (http://support.microsoft.com/kb/982933/ )
Error message when you shrink data files on the principal in a database mirroring session between two SQL Server 2005 servers

983480 (http://support.microsoft.com/kb/983480/ )
FIX: Various issues occur when a database is involved in a database mirroring session and in a transactional replication if trace flag 1448 is enabled

983500 (http://support.microsoft.com/kb/983500/ )
FIX: The role switch is delayed when a mirroring automatic failover occurs in SQL Server 2005

SQL performance gotchas for Distributed Queries

I am not sure how many of us pay attention to the notes in BOL under certain topics. I must admit that I have overlooked quite a few useful notes once in a while. But while working on a linked server permissions issue, I came across this point under the “Security for Linked Servers” topic:

To create the best query plans when you are using a table on a linked server, the query processor must have data distribution statistics from the linked server. Users that have limited permissions on any columns of the table might not have sufficient permissions to obtain all the useful statistics, and might receive a less efficient query plan and experience poor performance. If the linked server is an instance of SQL Server, to obtain all available statistics, the user must own the table or be a member of the sysadmin fixed server role, the db_owner fixed database role, or the db_ddladmin fixed database role on the linked server.

The reason for this is that the query optimizer depends on the table/index statistics to decide on the most optimal plan for the query. For remote queries that perform a large number of joins or have complex sub-queries or complex filter conditions, you are bound to run into permission issues if the linked server remote login doesn’t have the necessary permissions to extract the statistics information. If you are not willing to give the linked server login the above rights on the remote table, then it would be better either to use an alternative means to fetch the remote data, or to maintain a local copy of the data using mirroring/log shipping/replication and operate on the local data, so that your business logic doesn’t get affected as the data on your server grows. The sketch below illustrates the kind of query that is affected.
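
As an illustration (a sketch; REMOTESRV, the databases, tables and columns are all placeholder names), a distributed join like the one below can silently get a poor plan when the login mapped for REMOTESRV lacks the permissions described above, because the optimizer cannot obtain statistics for the remote table:

```sql
-- Distributed query: plan quality depends on statistics from the linked server.
SELECT o.OrderID, o.OrderDate, c.CustomerName
FROM REMOTESRV.SalesDB.dbo.Orders AS o
INNER JOIN dbo.Customers AS c
        ON c.CustomerID = o.CustomerID
WHERE o.OrderDate >= '20110101';
```

If the mapped remote login does not own dbo.Orders and is not a member of sysadmin, db_owner or db_ddladmin on the remote server, the optimizer may end up pulling far more rows across the wire than necessary.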

IMHO, linked server queries should never be used to implement complex business logic! Remote queries should fetch as little data as possible. But Utopia is not always a reality!