Volume Shadow Copy barfs with a 0x80040e14 error code

I was replying to an MSDN forum thread recently and found that a VSS backup was failing due to a large number of database files residing on a particular volume.

I re-created the issue on my box by limiting the number of Max Worker Threads to 128 on my SQL Server 2008 instance. After making the change to the number of worker threads, I created 99 databases on the D: drive of my server using the following script:

set nocount on
declare @cntr int = 1, @sql varchar(8000)
while @cntr < 100
begin
set @sql = 'CREATE DATABASE [db'+cast(@cntr as varchar(5))+'] ON PRIMARY
( NAME = N''db'+cast(@cntr as varchar(5))+''', FILENAME = N''D:\DBFiles\db'+cast(@cntr as varchar(5))+'.mdf'')
LOG ON
( NAME = N''db'+cast(@cntr as varchar(5))+'_log'', FILENAME = N''D:\DBFiles\db'+cast(@cntr as varchar(5))+'.ldf'')'
exec(@sql)
set @cntr = @cntr + 1
end

I then attempted to back up the DBFiles folder on the D: drive using Windows Server Backup and it failed.


After that I looked into the SQL Server Errorlog and found the following errors:

2010-12-06 03:39:15.280 Backup       Error: 3041, Severity: 16, State: 1.
2010-12-06 03:39:15.280 Backup       BACKUP failed to complete the command BACKUP DATABASE db68. Check the backup application log for detailed messages.

2010-12-06 03:39:15.370 spid53       Error: 18210, Severity: 16, State: 1.
2010-12-06 03:39:15.370 spid53       BackupVirtualDeviceFile::PrepareToFreeze:  failure on backup device ‘{DB840D1B-DD17-4D73-83F1-7390F676DEE4}1’. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).

 

I had already documented earlier that the infamous 995 error message could be caused by a problem with sqlvdi.dll, but in this case, it is not!

The next thing I did was to pick the errors from the Application Event logs and I found the following errors repeated multiple times:

Error    12/6/2010 3:39:15 AM    SQLVDI    1    None    SQLVDI: Loc=SignalAbort. Desc=Client initiates abort. ErrorCode=(0). Process=3920. Thread=11604. Client. Instance=SQL2008. VD=Global\{DB840D1B-DD17-4D73-83F1-7390F676DEE4}1_SQLVDIMemoryName_0.

Error    12/6/2010 3:39:15 AM    SQLWRITER    24583    None    Sqllib error: OLEDB Error encountered calling ICommandText::Execute. hr = 0x80040e14. SQLSTATE: 42000, Native Error: 3013

Error state: 1, Severity: 16

Source: Microsoft SQL Server Native Client 10.0

Error message: BACKUP DATABASE is terminating abnormally.

SQLSTATE: 42000, Native Error: 3224

Error state: 1, Severity: 16

Source: Microsoft SQL Server Native Client 10.0

Error message: Cannot create worker thread.

If you look closely at the above errors, you will find that one error message stands out like a sore thumb: “Cannot create worker thread”. After this issue occurs, you will find that the SQL Server VSS Writer shows the following state when you execute vssadmin list writers from a command prompt window:

Writer name: ‘SqlServerWriter’
   Writer Id: {a65faa63-5ea8-4ebc-9dbd-a0c4db26912a}

   Writer Instance Id: {9075f235-fdee-4445-851b-a69c54bd8b33}

State: [8] Failed

Last error: Non-retryable error

Ensure that you do not have any memory-related errors in the SQL Server Errorlogs or any DBCC MEMORYSTATUS output printed in the Errorlog around the time of the backup failure. If there aren’t any, then the above issue occurs because the SQL instance is hitting its worker thread limit.

The reason for this is explained by Rob Dorr (Principal SQL Server Escalation Engineer at Microsoft) in his blog post:

The controller and database worker come from the worker pool.  So if your max worker thread is configured for 255 the answer would be a max of 255 / 2 = 127 databases.    You probably won’t get to 127 with this configuration as you have system workers (lazy writer, checkpoint, log writer, and others) that already are using workers.
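Rob’s arithmetic can be sketched as a quick back-of-the-envelope calculation (the values are illustrative; system workers make the real ceiling lower still):

```sql
-- Each database being snapped needs a controller worker plus a database
-- worker, both drawn from the same pool, so the ceiling is roughly half
-- of max worker threads (here, the 128 configured in this repro).
declare @max_worker_threads int = 128;
select @max_worker_threads / 2 as approx_max_databases; -- 64, well short of 99 databases
```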

On my SQL instance, I have 120 databases and I need about 240 worker threads to complete this operation. However, I have only 128 worker threads configured for my SQL instance. If you have the Dedicated Administrator Connection enabled for the SQL instance, then you can take this troubleshooting further and execute a SELECT against the DMV sys.dm_os_schedulers (for SQL Server 2005 or above) to get an output of the following type:

status                 counts
---------------------- --------

HIDDEN ONLINE          168

VISIBLE ONLINE         8

VISIBLE ONLINE (DAC)   1

Query used:

select status, count(*) as counts
from sys.dm_os_schedulers
group by status

As you can see, there are about 168 hidden online schedulers created for performing the snapshot backups. But the SQL instance in question doesn’t have sufficient worker threads to facilitate this operation.

What can you do in such situations?

1. Increase the number of worker threads to account for the additional worker threads required for completing the VSS backups. This can be counter-productive, as it can lead to non-Buffer Pool memory contention on 32-bit instances of SQL Server. The higher the number of max worker threads on your SQL Server instance, the more non-BPool memory is consumed for thread stacks. This results in depletion of the available Buffer Pool memory.

2. Change the physical location of the SQL Server database files so that all the database files do not reside on the same volume.

3. Perform a VSS Backup with SQL Server VSS Writer service turned OFF. However, you cannot recover any SQL Server database files from such a Volume Shadow Backup.

4. Use SQL Native backups or other VDI backup tools (non-Snapshot) to perform SQL Server database backups.

Note: This also happens when you select any file on a drive which has a high number of SQL Server database files. This is because the SQL Server VSS Writer tries to enumerate the database files for the SQL Server instance that reside on the volume that has been selected for the Snapshot backup.
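Option 4 above, a native SQL backup, can be as simple as the following sketch (the database name and target path are illustrative):

```sql
-- Native (non-snapshot) backup of one of the databases from the repro;
-- the backup target path is illustrative
BACKUP DATABASE [db1]
TO DISK = N'E:\SQLBackups\db1.bak'
WITH INIT, STATS = 10;
```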

Location of SQL Binaries can flip out Bare Metal backups

Bare Metal recovery allows system administrators to recover a system from failure within a short period of time and also manage system state backups in a more efficient manner. However, there is a slight catch here when you have SQL Server binaries installed in a non-default location.

I recently worked on an issue where Windows Backup Admin was listing a non-system drive as a critical component when trying to perform a bare-metal backup using the following command:

wbadmin start backup -allcritical -backupTarget:<any existing drive name>:

When I looked into the contents of the drive, I found that the drive only had SQL Server database files and binaries on it. This is the output that I got:

wbadmin start backup -allcritical -backupTarget:S:
wbadmin 1.0 – Backup command-line tool
(C) Copyright 2004 Microsoft Corp.

Retrieving volume information…
This will back up volume OS(C:),New Disk(D:),New Disk(E:) to S:.
Do you want to start the backup operation?
[Y] Yes [N] No N

The operation ended before completion.

I then decided to use DiskShadow to show me all the critical components marked by the VSS Writers. I got the following output:

WRITER “System Writer”:

– Volumes affected by this component:

– \\?\Volume{2407f815-f9c4-11df-aef7-806e6f6e6963}\ [C:\]
– \\?\Volume{22c3ff31-f965-11df-a20b-00155df1c42a}\ [D:\]
– \\?\Volume{0342683d-f96a-11df-8a1f-00155df1c42a}\ [E:\]

This is because the SQL Server program binaries were installed on the D: and E: drives. And these are the paths that show up as critical as per the System Writer:

– d:\program files\microsoft sql server\100\dts\binn
– d:\program files\microsoft sql server\msas10.mssqlserver\olap\bin
– d:\program files\microsoft sql server\msrs10.mssqlserver\reporting services\reportserver\bin
– d:\program files\microsoft sql server\mssql10.mssqlserver\mssql\binn
– e:\program files\microsoft sql server\msas10.inst01\olap\bin
– e:\program files\microsoft sql server\msrs10.inst01\reporting services\reportserver\bin
– e:\program files\microsoft sql server\mssql10.inst01\mssql\binn

DiskShadow command used: list writers detailed

From http://msdn.microsoft.com/en-us/library/aa384630(VS.85).aspx

A volume is a critical volume if it contains system state information. The boot and system volumes are included automatically. The requester must include all volumes that contain system-critical components reported by writers, such as the volumes that contain the Active Directory. System-critical components are marked as “not selectable for backup.” In VSS, “not selectable” means “not optional.” Thus, the requester is required to back them up as part of system state.

Based on the above description, it is clear that I cannot perform a Bare Metal backup without including D: and E: drives.

Note: The above is just an example. There might be other system-critical components which are marked as not selectable for backup by other VSS Writers. VSS works at the volume level. So, if a particular path on a volume is marked as critical, then the entire volume would need to be backed up.

As I mentioned earlier, for system critical components, the “Is selectable: FALSE” is set (VSS_CF_NOT_SYSTEM_STATE flag is set for non-system components). In the above example, the drive on which the SQL binaries exist is marked as critical by the System Writer.

Wiki: SQL Database Engine Startup Failures for Clusters

I had recently written a Wiki article on my blog covering a troubleshooting methodology for SQL Server startup failures on standalone instances. In this article, I shall cover a similar roadmap for troubleshooting database engine startup failures for clustered SQL Server instances. Some of the steps are the same as troubleshooting start-up failures for stand-alone instances, but there are some minute differences. So here goes…

The first thing that you need to do is to try and bring the SQL Server resource online from Cluster Administrator if you are using a Windows Server 2003 cluster or below. If you are using Windows Server 2008 or above, you would need to use the Failover Cluster Management snap-in. If the SQL Server resource fails to come online, then check the SQL Server Errorlogs for any errors that were reported.

Locate the SQL Server Errorlog location and check if any errors were observed in the SQL Server Errorlog. This can be done by looking up the –e startup parameter value for the SQL Server instance in question. If you are using SQL Server 2005 and above, then right-click on the Service Name in Configuration Manager –> Properties –> Advanced Tab. Alternatively, you can use a VBScript (or Powershell) to fetch this information. Here is an example for obtaining the Startup Parameters using WMI and VBScript.
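As another alternative, on builds that ship the DMV (SQL Server 2008 R2 SP1 and later), the startup parameters can also be read with T-SQL via sys.dm_server_registry:

```sql
-- Lists the SQLArg0, SQLArg1, ... startup parameters (-d, -e, -l, trace flags)
SELECT registry_key, value_name, value_data
FROM sys.dm_server_registry
WHERE value_name LIKE N'SQLArg%';
```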

Verify that the files pointed to by the –d (master data file location), –e (Errorlog location) and –l (master log file location) actually exist and have the correct file paths. Starting from SQL Server 2005, the Errorlog will have the startup parameters printed in it when the SQL instance starts up. If you find no errors in the SQL Errorlogs, look into the Windows Application Event logs and ensure that there are no errors related to the cluster resource for the SQL instance or the cluster group (applicable for Windows 2003 clusters) that the SQL resource is a part of.

If the Windows Application Event logs don’t provide any useful errors, then look for any warnings or errors pertaining to the SQL Server resource in the Windows Cluster logs.

Refer to the following post to generate the cluster log on a Windows Server 2008 Failover Cluster or higher. For a Windows Server 2003 cluster, the default location of the cluster log is: <system drive>:\<windows folder>\Cluster\Cluster.log.

Now try to start the SQL Server instance from a command prompt using sqlservr.exe, to check if the service comes online without bringing any user databases online (trace flag 3608).

For a default SQL Server instance:

sqlservr.exe -c -m -T3608

For a named instance of SQL Server:

sqlservr.exe -c -m -T3608 -s<instance name>

If the above fails, then the console window should provide you with an error message as to why the SQL instance fails to start. If the above succeeds, then try and start the service using the Configuration Manager.

Note: Make sure that you are not falling prey to a Desktop Heap exhaustion issue.

If you determine that the SQL Server registry entries have an erroneous entry, then you would need to follow the steps mentioned in the KB Article below to rectify the problem (due to Registry Synch operations that happen on a cluster):
http://support.microsoft.com/kb/953504

You might also find that the SQL Server resource entries in the Cluster Registry hive (HKLM\Cluster\Resources\GUID) are missing. You would then have to re-create the same using the steps mentioned in the KB Article below (applicable for SQL Server 2000 and SQL Server 2005 only):
http://support.microsoft.com/kb/810056

NOTE: Remember that modifying registry keys incorrectly on a Windows Server can have disastrous effects. Make sure that you take a registry backup before making any registry modifications.

Apart from the common startup failure root causes listed below, there are an additional few which are applicable to clustered SQL Server instances only (items 7–9 below):

The most common root causes for service startup failures are:

  1. Access denied on the DATA folder (which contains the SQL Server system databases) for the SQL Server service account, or some other process (like an anti-virus) holding a lock on the SQL database files.
  2. Insufficient permissions granted to the SQL Server service account.
  3. File-not-found errors due to either accidental deletion of files or disk failures.
  4. System database files having inconsistencies that prevent SQL Server startup.
  5. Password of the service account was changed but not updated on the server that has the SQL instance installed.
  6. Startup parameters have incorrect file path locations.
  7. The dependent resources for the SQL Server clustered instance are failing to come online (e.g. the shared disks on which the database files reside).
  8. Incorrect (read: messed up) registry entries in the CLUSTER registry hive for the concerned SQL instance which have already been checkpointed, due to which the instance fails to start.
  9. Operating system, resource crunch, or environment-related issues on one or more nodes due to which the instance comes online on one node but not on the other.

Additional references:
T-SQL Tuesday: Common SQL Server myths – Series I–Setup and Services
SQL Server 2000 cluster fails to come online on other node
SQL Server 2005 resource fails to come online on cluster after changing the SAN drive to add more disk space
SQL Server 2005 resource taking long time to come online on Windows cluster with Resource Manager Creation Failed reported in Errorlog
SQL Server resource fails to come online on cluster with error Error Locating Server/Instance Specified [xFFFFFFFF]
SQL Server 2008 service fails to start, after applying Service Pack 1

Versions that this roadmap applies to:
SQL Server 2000, SQL Server 2005, SQL Server 2008, SQL Server 2008 R2

Removing duplicate rows from a table

This seems to be a self-explanatory title and before you start raising your eyebrows about the need for a blog post on this, let me set the context. I am not going to teach how to do something which is already mentioned in a KB Article:

How to remove duplicate rows from a table in SQL Server
http://support.microsoft.com/kb/139444

What I am going to cover in this post is this: when you have inconsistencies in a SQL Server database table in the form of duplicate entries in a column which has a Primary Key constraint defined on it, you need to be aware of a gotcha before you start tracking down the duplicate rows.

For this I have a table with the following definition:

create table t (c1 int primary key)

When I do a SELECT on the table, I get the following output:

c1


100

100

Yes, there are duplicate rows in the Primary Key column! And no, I am not going to show how I made that happen. There are some pretty smart people out there who already know this! And for the rest who are still in the dark as to how this can be accomplished, I am not covering any undocumented stuff here!

A DBCC CHECKTABLE on the table yields the following error:

DBCC results for ‘t’.

Msg 2512, Level 16, State 2, Line 1

Table error: Object ID 85575343, index ID 1, partition ID 72057594038976512, alloc unit ID 72057594039894016 (type In-row data). Duplicate keys on page (1:192) slot 0 and page (1:192) slot 1.

There are 2 rows in 1 pages for object “t”.

CHECKTABLE found 0 allocation errors and 1 consistency errors in table ‘t’ (object ID 85575343).

repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKTABLE (dbCorrupt.dbo.t).

 

When I use the query from the above KB Article to find the duplicate rows, I get back no records:

select c1,COUNT(*) as counts
from t
group by c1
having COUNT(*) > 1

In the above query, if you remove the HAVING clause, you will find that both the rows are returned with a counts column value of 1 each. This is because one of the values was changed to 100 without the knowledge of the Storage Engine.

When I execute the query below, I get back a single row:

select c1 from t where c1 = 100

If you look at the Scan Count in the Statistics IO output of the query, you will find that the Scan Count is 0. The logic behind this is explained in a previous post that I had written on the MSDN SQLServerFAQ blog some time back. This is because SQL Server doesn’t expect to retrieve more than 1 row when searching for a value belonging to a primary key column.

Now, I put all the contents of table “t” into another table “t1”, and the above query to find duplicates returned c1 = 100 with a count of 2. Likewise, when I dropped the primary key constraint on the table, I was able to retrieve the duplicate rows. The catch now is to determine which duplicate row needs to be deleted if the non-primary key columns of the table have different values. This requires an understanding of the application business logic and tracing down the prior inserts done on the table, which may have been logged at a different layer. NEVER delete duplicate rows without understanding the repercussions of deleting a row from a production database table, unless the duplicate rows have the same data. E.g. in the above table, I can use a DELETE TOP (1) command to delete from table “t” because I only have one column in the table and both the duplicate rows have the same values.
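The copy-then-detect approach described above can be sketched as follows (using the t/t1 names from the example):

```sql
-- Copy the rows out of the corrupt table; t1 has no primary key,
-- so the duplicate values survive the copy
select c1 into t1 from t;

-- The duplicate now shows up with a count of 2
select c1, COUNT(*) as counts
from t1
group by c1
having COUNT(*) > 1;
```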

To summarize, if your Primary Key column has duplicate values, then you need to either put all the rows into another table or remove the Primary Key constraint, and then find the duplicates. Just thought I would put this out for everyone in case you weren’t able to identify why duplicate rows were not showing up using a SELECT..GROUP BY..HAVING query whereas DBCC CHECK commands were reporting errors.

Common SQL Server myths – Series II – SQL Memory

Not so long ago, I had written a blog post for SQL Server myths related to Setup & Service startup myths for T-SQL Tuesday which was hosted by Sankar Reddy (Blog | Twitter). The round-up of the T-SQL Tuesday posts had some awesome content where renowned SQL Community members debunked common SQL Misconceptions. Read the round-up here.

I found that this was quite a compelling topic for a series and decided to do a Part II of SQL myths around memory-related misconceptions. SQL memory by itself is a murky topic and there are quite a few urban legends so well imbibed in people’s memories that we have to battle against them to get users to see the light at the end of the tunnel. So, here goes another round of debunking some more misconceptions.

AWE needs to be enabled on 64-bit instances of SQL Server
I think everyone who has posted an answer to this question on the MSDN forums has probably grown tired of saying the same thing over and over again. This configuration option is required only for 32-bit instances of SQL Server, to enable the instance to benefit from Address Windowing Extensions. The AWE option on 64-bit instances has no meaning: it is not required and basically does nothing. Please refer to the following blog post for more details:
http://blogs.msdn.com/b/slavao/archive/2006/03/13/550594.aspx


System.OutOfMemoryException encountered while executing a query from SSMS is not a Database Engine Out-of-Memory condition
I have seen multiple suggestions on forums for reducing Max Server Memory. This error is a .NET exception telling the user that an OOM condition occurred in Management Studio itself. The database engine is not a managed application; it is native code and hence wouldn’t throw a .NET OOM error. I have written about this in an earlier blog post: System.OutOfMemoryException for Management Studio


Windows Task Manager will tell me SQL Server’s memory usage
Another urban legend that CSS deals with day in and day out. The “Monitoring Memory Usage” topic in SQL Server Books Online talks about how to monitor SQL Server memory usage. None of the tools you typically use to inspect application memory use (Task Manager, Perfmon/Sysmon, etc.) show the amount of AWE memory used by individual processes. There’s no indication of the amount of AWE memory used by each process, nor is this memory included in the working set size reported for a given process. Reference: http://blogs.msdn.com/b/psssql/archive/2009/09/11/fun-with-locked-pages-awe-task-manager-and-the-working-set.aspx
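If you want numbers you can trust, ask SQL Server itself. On SQL Server 2008 and above, a sketch along these lines reports the process memory, including the locked-page (AWE/LPIM) allocations that Task Manager misses:

```sql
-- Physical memory in use and locked-page allocations as tracked by SQLOS
SELECT physical_memory_in_use_kb,
       locked_page_allocations_kb,
       memory_utilization_percentage
FROM sys.dm_os_process_memory;
```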


Max Server Memory will limit the SQL Server memory
This is true for only the SQL Server Buffer Pool. Max Server Memory setting is used to control the size of the Buffer Pool only. For a 64-bit instance, the non-BPool region can grow to a maximum of Total Physical Memory – (Max Server Memory + Memory used by Windows & other applications). I had blogged about this in the past in my post: Non Bpool region on 64bit SQL instances may be larger than you think
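For reference, capping the Buffer Pool is done through sp_configure; the value below is purely illustrative and should be sized for your workload:

```sql
-- 'max server memory (MB)' is an advanced option; 4096 is an example value
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 4096;
RECONFIGURE;
```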


SQL Server eats up all the available memory on the box
SQL Server is a server application. Ideally, you should have a dedicated server for a SQL instance that handles a high number of database transactions. If that is not the case, then Max Server Memory needs to be capped after taking into account the memory requirements of Windows, SQL Server and the other applications on the box. SQL Server (especially 64-bit) is designed not to back off its memory footprint once it has grown, unless it is asked to because of a low-memory notification from the Operating System. The rule of thumb of leaving just 2GB for the OS and other applications may not be enough when you have file copies, SSRS, SSIS and the SQL database engine running on the same box. SQL Server 2005 and above is designed to react to low-memory notifications sent out by Windows and scale back its memory usage as required.
Reference: http://msdn.microsoft.com/en-us/library/ms178145.aspx


Virtual Address Space – Isn’t that an excuse?
Virtual Address Space defines the amount of memory that can be addressed by any application, and the limiting factor is always the processor architecture: 32-bit or 64-bit. 32-bit Windows can reference up to 4GB (2^32 bytes), and 64-bit Windows can reference up to 8TB, depending on the Windows edition. Addressing on a 64-bit instance is effectively limited by the available physical RAM, which is the exact opposite of the story on a 32-bit server. With Physical Address Extension enabled on Windows servers, Windows can use more than 4GB of memory, and SQL Server (or any other Windows application, if coded correctly) can take advantage of this by using the AWE APIs. If you are falling prey to VAS limitations, then you either have a large VAS consumer or your workload is not the most optimal workload for the 32-bit architecture! If you are facing a Virtual Address crunch on your SQL instance, then follow the steps mentioned in this post to work towards determining the root cause of the problem: Who is eating up the Virtual Address space on your SQL Server
Reference: http://technet.microsoft.com/en-us/library/ms187499(SQL.90).aspx


I can change –g to whatever I want!
Absolutely not! The hard-coded upper limit for –g is 1024MB. Whenever you increase MemToLeave on a 32-bit machine, you are decreasing the SQL Server Buffer Pool memory proportionately. Enabling this for a 64-bit instance is a bit of a moot point. So increasing –g is just a workaround, and sometimes it might lead you to face BPool pressure on the server. I had documented the layout of the buffer pool for a 32-bit instance in an earlier blog post: SQL Server Memory Basics
Reference: http://msdn.microsoft.com/en-us/library/ms190737(v=SQL.90).aspx


I am Smart! I will enable /3GB and then increase –g. All problems solved!
Well, I would say you are too smart for your own good. For systems with more than 16GB of RAM, this can prove to be counter-productive. Enabling /3GB will only cause a lack of available System PTEs on the Windows server and create more problems than it solves. From the “Using AWE” topic in SQL Server Books Online:

If there is more than 16 GB of physical memory available on a computer, the operating system requires 2 GB of virtual address space for system purposes and therefore can support only a 2 GB user mode virtual address space. For the operating system to use the memory range above 16 GB, be sure that the /3gb parameter is not in the Boot.ini file. If it is, the operating system cannot use any physical memory above 16 GB.


Lock Pages in Memory is the next best thing after sliced bread!
How many times have we heard this: “Enable LPIM for the SQL service account and everything will be alright”! Everything will be alright only when you have taken into consideration the memory requirements of the SQL Server instance(s) on the box, the other applications and Windows, and then capped the Max Server Memory. Without that, enabling LPIM for the SQL service account is as good as riding a bike on the freeway without a helmet. It’s all smooth sailing till disaster strikes!

 
Hope the above points help clear out some urban legends regarding SQL Server memory concepts, and that the next time during a coffee break or water cooler discussion, they help you convince a *misguided SQL Server expert*!!