CHECKDB with Repair Allow Data Loss: Should I shoot myself?

Over the past years, I have seen multiple DBAs execute DBCC CHECKDB with REPAIR ALLOW DATA LOSS to repair the SQL Server database corruption in their database without understanding repercussions. Even when you are working with CSS on such issues, CSS engineers actually explain what the command does and also send across a disclaimer stating why this command should NOT be executed on a database and what the repercussions are. Alas due to a lack of a sound database backup strategy, a lot of DBAs are forced to resort to losing data. In a nutshell, Repair Allow Data Loss tries to first repair the corruption in the page but if it cannot, then it checks if it can drop the page and repair the error and hence the name Allow Data Loss! I had a question for all the DBAs who advocate the use of this command:

How would you like your Payroll DBA to run the command? (Good enough food for thought!!)

The only way to compare how much data you lost would be to perform a CHECKDB before running the CHECKDB with REPAIR ALLOW DATA LOSS and after the CHECKDB has completed. The row counts returned by the two CHECKDB outputs would have to be compared. If CHECKDB is failing due to system catalog corruption, then you are plain out-of-luck.

What most people fail to understand that CHECKDB can verify integrity of the database based on the SQL database rules imposed on it like referential integrity, data length, record structures, page structures, header correctness etc. But these are database integrity rules that we are aware of. What about the business logic rules that you imposed while designing the database.

Eg: Let’s say you had a database table called Employee which depended on a Department ID valued to be present in the Department table. No foreign key relationship exists between the two tables. You experienced corruption in the database and used Repair Allow Data Loss command along with CHECKDB which dropped a few pages from the Department table. However, CHECKDB cannot know that a Department ID is missing in the Department table because you didn’t create a Foreign Key between the two tables. So when your application now tries to retrieve the Department ID of the Employee, it will BARF on you! This is the kind of scenario we specifically want to avoid which is why we advise against the use of data loss commands.

SQL Server wouldn’t know what data is supposed to present in the table rows. The only thing that SQL can verify while you are retrieving data or checking integrity of the database is that the column data and column datatype match. I have heard questions like “Why can’t SQL Server figure out if the value is correct?”. Well, would you know why a person’s name is Uncle Sam and not Uncle Bill? 🙂 So, how would SQL Server know that the FirstName column (an example) was supposed to store Sam and not Bam. As per SQL Server, both strings match the column datatype, hence from a structural and logical integrity, the data is correct.

Check Integrity task in Maint Plan fails on SQL Server 2000

When a SQL Server 2000 Database Maintenance Plan is executed, you find that the following message in printed in the Maint Plan .OUT file (Found in the SQL Server LOG folder):
 
The backup was not performed since data verification errors were found
 
OR
 
[Microsoft SQL-DMO (ODBC SQLState: 42000)] Error 5058: [Microsoft][ODBC SQL Server Driver][SQL Server]Option ‘SINGLE_USER’ cannot be set in database ‘MASTER’. [Microsoft][ODBC SQL Server Driver][SQL Server]sp_dboption command failed. [1] Database master: Check Data and Index Linkage… [Microsoft SQL-DMO (ODBC SQLState: 42000)] Error 7919: [Microsoft][ODBC SQL Server Driver][SQL Server]Repair statement not processed. Database needs to be in single user mode.
 

 
This happens because you had selected the option of "Attempt to repair any minor problems" (refer above pic). This option needs CHECKDB to be executed with the database in SINGLE USER mode. If you have users connected to the database during that time, the database cannot be put into SINGLE USER mode due to which the Integrity Check task will fail. This will prevent backups from occurring on the database and you would see the above messages in the log files.
 
DBCC CHECKDB (<database name>, repair_fast) executed by <account name> found 0 errors and repaired 0 errors.  Elapsed time: 0 hours 0 minutes 50 seconds.
 
If you look in the SQL Server ERRORLOG, you will find that the CHECKDB is executed with REPAIR_FAST flag. This requires Single User mode. Another note to keep in mind is if you check the Integrity Check option and "Attempt to repair any minor problems", then exclude MASTER, MSDB and MODEL databases from the Plan.
 
Reference Article:
290622 BUG: Database maintenance plan on system databases fails on integrity check if "Attempt to repair minor problems" is selected
http://support.microsoft.com/default.aspx?scid=kb;EN-US;290622

CHECKDB message in the SQL Errorlog

Many a time, we have seen the following messages in the SQL Server after a SQL Service startup or after a database has been restored or a database was brought ONLINE:
 
2009-10-09 01:13:42.110 spid53       CHECKDB for database ‘msdb2’ finished without errors on 2008-11-07 19:59:09.643 (local time). This is an informational message only; no user action is required.
 
What does this message signify? There has been lot of confusion regarding this message. It doesn’t mean that SQL Server performed a CHECKDB against this database and reported the outcome. This is just an informational message to let you know when was the last time a CHECKDB was executed against your database. This information (last checkdb runtime) is stored in your database header which is picked up and reported in the SQL Errorlogs.
 
If you notice the highlighted information in the message above, you will see that the time reported in the message doesn’t correspond to the Errorlog entry time.

Database Integrity Checks

Database integrity checks are recommended to ensure that the database consistency is intact and if there is a problem with consistency, then it is reported to the appropriate team(s) so that necessary action can be taken to rectify it. This can be done with the help of Database Maintenance Plans. In the event that a CHECKDB on a database fails, then it needs to be reported by Email(using xp_sendmail or SQL Agent Mail or Database Mail for SQL Server 2005) or events fired in the Operating System Event Logs (if these are monitored regularly) with the help of Operators which can be configured for SQL Server Agent. The xp_sendmail feature is not available for SQL 2005 64-bit versions.
The frequency of these checks largely depends on the following factors:
1. Importance of the database
2. How often data changes in the database (If a database integrity check fails for a database where data is not modified, then it would be advisable to restore the last known good backup rather than trying to repair the consistency database)
3. The size of the database
4. In the event of consistency checks failing, it needs to be determined which is the most feasible option:
a. Restore the last known good backups in accordance with the recovery model for that database to allow for a minimum amount of data loss
b. Or try and repair the database and falling back on Option (a) only if this fails In case a repair option is suggested in the CHECKDB output, it is important to note that a REPAIR_ALLOW_DATA_LOSS be never performed on the database without understanding its full consequences and consulting Microsoft PSS. In the event, that this route needs to be taken, it is always recommended to fall back on the last known good backups if possible. The REPAIR_FAST and REPAIR_REBUILD repair options can be performed without having any data loss. Please note that these are time consuming operations and in the event of database inconsistency it is not possible for us to predict how long these tasks would run for. Also, the time taken for CHECKDB on a database cannot be predicted. An educated guess can be made to how long it would take by referring to the last durations of the CHECKDB operations on that particular database. For the above mentioned repair options, please refer the following article: http://msdn2.microsoft.com/en-us/library/aa258278(SQL.80).aspx

REPAIR_ALLOW_DATA_LOSS
Performs all repairs done by REPAIR_REBUILD and includes allocation and deallocation of rows and pages for correcting allocation errors, structural row or page errors, and deletion of corrupted text objects. These repairs can result in some data loss. The repair may be done under a user transaction to allow the user to roll back the changes made. If repairs are rolled back, the database will still contain errors and should be restored from a backup. If a repair for an error has been skipped due to the provided repair level, any repairs that depend on the repair are also skipped. After repairs are completed, back up the database. This option shouldn’t be used without the assistance of a SQL Support Engineer.
REPAIR_FAST
Performs minor, nontime-consuming repair actions such as repairing extra keys in nonclustered indexes. These repairs can be done quickly and without risk of data loss.
REPAIR_REBUILD
Performs all repairs done by REPAIR_FAST and includes time-consuming repairs such as rebuilding indexes. These repairs can be done without risk of data loss. In case of SQL Server 2005, you have the option of checking the suspect_pages in the MSDB database to find out the affected pages.

Please refer the following articles for more detailed information:
Suspect_pages table (SQL 2005)
http://msdn2.microsoft.com/en-us/library/ms174425.aspx
Understanding and managing the suspect_pages table
http://msdn2.microsoft.com/en-us/library/ms191301.aspx
Designing a Backup and Restore Strategy
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/adminsql/ad_bkprst_63eh.asp

In the event, that only a few pages have a problem, then for SQL Server 2005 a page level restore can be performed. Online page restore operation is a feature available in SQL Server 2005 Enterprise Edition but all other Editions of SQL Server 2005 support offline page level restores. The page level restores are done using the NORECOVERY option for that database. Then a backup of the current transaction log is taken and applied to the database with the RECOVERY option. This feature is applicable to databases in FULL or BULK-LOGGED recovery models.
For performing page level restores, please refer the following article: http://msdn2.microsoft.com/en-us/library/ms175168.aspx
It is highly important that a disaster recovery plan is in place to ensure the following:

  • · A plan to acquire hardware in the event of hardware failure
  • · A communication plan.
  • · A list of people to be contacted in the event of a disaster.
  • · Instructions for contacting the people involved in the response to the disaster.
  • · Information on who owns the administration of the plan.
  • · A checklist of required tasks for each recovery scenario. To help you review how disaster recovery progressed, initial each task as it is completed, and indicate the time of completion on the checklist.

In conclusion, if the database is of a considerably large size, then an integrity check needs to be scheduled during a window when the load on the server is at a minimum. The definition of minimum here refers to a load which is lesser than the normal workload on the server. If the database sizes are quite small, then daily integrity checks on the database would be the order of the day.
It is recommended that DBCC CHECKDB be run during hours when the load is light on the server. If DBCC CHECKDB is run during heavy peak usage time, expect a performance hit on the transaction throughput as well as DBCC CHECKDB completion time.

Recommendations for Good DBCC Performance
· Run CHECKDB when the system usage is low.
· Be sure that you are not performing other disk I/O operations, such as disk backups.
· Place tempdb on a separate disk system or a fast disk subsystem.
· Allow enough room for tempdb to expand on the drive. Use DBCC with ESTIMATE ONLY to estimate how much space will be needed for tempdb.
· Avoid running CPU-intensive queries or batch jobs.
· Reduce active transactions while a DBCC command is running.
· Use the NO_INFOMSGS option to reduce processing and tempdb usage significantly.