Troubleshooting that darn backup error

Backup and restore operations even though they are supposed to be a no-brainer, it can leave you wanting to tear your hair out. Considering the different elements that influence the outcome of a successful backup operation like backup media, consistency of the backup, network issues (if backing up to a UNC path), consistency of the database being backed up etc.

I shall attempt to outline a set of troubleshooting steps specifically for non-native backup/restore related errors for SQL Server databases.

Any BACKUP command that completes successfully or with errors has associated messages logged in the SQL Server Errorlog.

A successful backup from the SQL Server Errorlog:

Database backed up. Database: adventureworks, creation date(time): 2010/05/28(03:00:18), pages dumped: 17291, first LSN: 67:7733:56, last LSN: 67:7761:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {‘F:\adventureworks.bak’}).

You will see that the backup completion message has the database name, the number of pages that were contained in the backup, the date the backup was created and the number of devices and their type (in this case files). So you can reconstruct the sequence of backups even if your backup information history is missing from the MSDB system tables and you just have your SQL Server Errorlogs to play with. If you want to get real fancy about this, then you could spin up a nifty little Powershell or VBScript to parse through the SQL Errorlogs and provide the backup sequence to you as well.

Now, let’s look at a failed backup message from the SQL Errorlog.

Error: 18204, Severity: 16, State: 1.
BackupDiskFile::CreateMedia: Backup device ‘R:\adventureworks.bak’ failed to create.
Operating system error 3(The system cannot find the path specified.).
Error: 3041, Severity: 16, State: 1.
BACKUP failed to complete the command BACKUP DATABASE adventureworks.

You will notice above that the Operating System error code associated with the backup failure is reported in the error message. More often than not, the error message is self explanatory. If the error code doesn’t throw back an error text along with it, you can get the windows error code associated with the error code using net helpmsg <error number> from a command prompt window.

In the above case, there wasn’t any R: drive on my server. Now that I have finished stating the most obvious of troubleshooting methodologies for SQL native backups which was done to set the context for the next part of this post. The troubleshooting methodology for non-native backups.

Non-native SQL backups of databases use one of the following methods:

1. Use of APIs exposed through SQLVDI.dll to perform a VDI based backup
2. Use of VSS framework to perform a snapshot backup of the SQL Server database

Note: In this post, I shall not be addressing any storage level backup solutions that have options or features to handles SQL database backups.

This post is aimed at discussing the tackling of VSS/VDI related errors that you face while performing a database backup.

If a VDI backup fails, then you can try the following sequence of troubleshooting steps:

1. Attempt a backup to the same location using native SQL backup from a query window or SQLCMD. This might not always be possible as some of the backup tools do not allow any other backup application to connect to the backup share apart from the backup tool’s agents.
2. The next thing that you can do is to ensure that your SQLVDI.DLL is updated to the latest build. For this you can use the SQL Server Backup Simulator available on Code Gallery using the “Validate VDI Installation” option in the tool. This check in the tool will perform basic checks like current DLL version and additional checks based on root causes of common scenarios that CSS has seen in the past for VDI backup failures. Additional information on the usage of the tool is available here.
3. Perform a backup of the database using the tool for which the backup is failing (to the same destination if possible). The parameter tweaks at this point are not available in the current version like striped backups, changes to MAXTRANSFERSIZE/BUFFER COUNT etc. but if the backup from this tool is successful, then you know for a fact that the SQLVDI APIs are working as expected. Then it is either an additional piece of logic in the backup software that is failing or there is an external factors like network/backup media or there is a resource crunch on the SQL instance on which the database resides.


Common errors that you might see during a VDI backup operation failure:

Error message 1
BackupVirtualDeviceFile::ClearError: failure on backup device ‘VDI_ DeviceID ‘. Operating system error 995(The I/O operation has
been aborted because of either a thread exit or an application request.).

Error message 2
Error: 18210, Severity: 16, State: 1.
BackupMedium::ReportIoError: write failure on backup device ‘VDI_ DeviceID ‘. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.)

Error message 3
Error: 18210, Severity: 16, State: 1. 
BackupVirtualDeviceFile::RequestDurableMedia: Flush failure on backup device ‘VDI_ DeviceID. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.)

The above errors are mostly due to outdated versions of SQLVDI.DLL on SQL Server 2000 or SQL Server 2005 instances. The “Validate VDI Installation” option in the Backup Simulator should be able to identify any known issues with DLL version mismatch for you.


If you have an application that uses Volume Shadow Service to backup your SQL database and that backup fails, then you your troubleshooting steps would be a bit different. The tell-tale sign of a VSS Snapshot backup is in the I/O Frozen message being written into the SQL Errorlogs for the databases being backed up. In such a failure scenario, take a backup of the database using Windows Server Backup Admin (Windows Server 2008 and above) or NTBackup utility to perform a backup of the database files involved. If this backup is successful, it means that a snapshot backup is successfully completing for the database. Now depending on your backup application’s implementation logic, this might be a resource crunch or a backup storage media incompatibility or even a resource crunch.

Common errors associated with VSS backups:

Issue# 1
Error: 3041, Severity: 16, State: 1.
Backup      BACKUP failed to complete the command BACKUP DATABASE <database name>. Check the backup application log for detailed messages.
SubprocessMgr::EnqueueSubprocess: Limit on ‘Max worker threads’ reached.
This error message is raised when your backup application uses VSS to backup a large number of databases simultaneously. More information around this can be found in this post Volume Shadow barfs with 0x80040e14 code

Issue #2
While performing a Bare Metal backup you might run into issues if your SQL Server binaries are located on a different drive other than the C: drive. More information on the same is mentioned here: Location of SQL Binaries can flip out Bare Metal backups

Issue #3

Event Type:    Error
Event Source:    VSS
Event Category:    None
Event ID:    6005
Description:
Sqllib error: Database %1 is stored on multiple volumes, only some of which are being shadowed.

The above error occurs when you backup only one of the volumes that contains SQL database data/log files which are spread over multiple volumes. This is explained in detailed in this post: Why is VSS complaining about SQL Data files


More information on how VSS/VDI work is present in the blog post by Sudarshan: INFORMATIONAL- SHEDDING LIGHT on VSS & VDI Backups in SQL Server

Previous posts related to Volume Shadow Backups
IO Frozen messages while taking NT Backup for SQL databases

Previous posts related to VDI backups
SQL VDI backup fails with 0×80770007

29 thoughts on “Troubleshooting that darn backup error

  1. Pingback: Microsoft TechEd 2012–Multiple rows affected « TroubleshootingSQL

  2. Hi Amit, I find your blogs very useful. I am a backup admin and facing intermittent SQL transaction log backup failures. These failure when they happen mostly happen when backup devices on the backup server are busy (duing weekend full backups). My concern is that if backup devices are busy the SQL transaction log backups should be queued. Why are they dropping in so soon? Please note that there are no inactivity timeouts applied on backup server. Below are sequence of events, in the order, during a failure

    22:12:15 Networker information: XBSA-1.0.1 rb_nmsql522.Build.23 4072 Wed Sep 19 22:12:15 2012 _nwbsa_is_retryable_error: received a retryable network error (Severity 0 Number -13): busy

    22:12;45 MSSQL error: BackupIoRequest::ReportIoError: write failure on backup device ‘Legato#f7905fac-e367-4c88-8521-3ba581b304a9’. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).

    22:12:45 SQLVDI error: Loc=TriggerAbort. Desc=invoked. ErrorCode=(0). Process=4680. Thread=4324. Server. Instance=SHDSQLTST. VD=Global\Legato#f7905fac-e367-4c88-8521-3ba581b304a9_SQLVDIMemoryName_0.

    22:12:45 MSSQL error: BackupVirtualDeviceFile::RequestDurableMedia: Flush failure on backup device ‘Legato#f7905fac-e367-4c88-8521-3ba581b304a9’. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).

    22:12:45 MSSQL error: BACKUP failed to complete the command BACKUP LOG SQLAdmin. Check the backup application log for detailed messages.

    22:12:45 Networker information: XBSA-1.0.1 rb_nmsql522.Build.23 4072 Wed Sep 19 22:12:45 2012 _nwbsa_is_retryable_error: received a retryable network error (Severity 0 Number -13): busy

    22:12:45 Networker error: Internal system error, please see nsr\applogs\xbsa.messages on the client system for reason.

    22:12:45 SQL VDI error: SQLVDI: Loc=SignalAbort. Desc=Client initiates abort. ErrorCode=(0). Process=4072. Thread=5132. Client. Instance=SHDSQLTST. VD=Global\Legato#f7905fac-e367-4c88-8521-3ba581b304a9_SQLVDIMemoryName_0.

    Any clue? we are running Windows2008R2 and SQL server2008 and Networker7.6.3

    My interpretation is that SQLVDI gets an i/o error from windows while writing to a virtual file in memory. Please comment.

    Thank you!

    Like

  3. This is quite a generic error. If the error is intermittent, then you need to look into the networker logs or contact networker support to identify which API is actually failing.

    Check if your SQLVDI dll is functioning properly using the SQL Server Backup Simulator:

    SQL Server Backup Simulator v2 available now

    There are multiple fixes/updates available for the SQLVDI dll which you need to ensure are applied to your environment.

    Once the above is done, then you would need to debug the above issue if it continues to occur by contacting Networker support. OS Error 995 is typically reported for most VDI failures. SQLVDI provides APIs to call by a backup application. How the backup buffers are handled is determined completely by the application code. So there could be multiple things that could be a culprit. The above data is not sufficient to comment on the same other than the inputs that I have already provided above.

    Like

  4. Pingback: A year that was « TroubleshootingSQL

  5. We use custom-made VDI backup-restore tool from RAR-archive (vdc 0.9.4 – restoring MS SQL database from RAR archive – test version), it works with versions of SQL Server from 2000 to 2008R2, but not with compressed backups from 2008 and 2008R2.
    SQL Server throws a series of errors on RESTORE VERIFYONLY:

    spid53 Error: 17066, Severity: 16, State: 1.
    spid53 SQL Server Assertion: File: , line=2699 Failed Assertion = ‘m_pDecodeSplitInput->GetAvailableFreeSize () >= lengthInNext’. This error may be timing-related…
    spid53 Error: 3624, Severity: 20, State: 1.
    spid53 A system assertion check has failed…
    spid53 Error: 17066, Severity: 16, State: 1.
    spid53 SQL Server Assertion: File: , line=281 Failed Assertion = ‘GetAvailableFreeSize () >= size’. This error may be timing-related…
    spid53 Error: 17066, Severity: 16, State: 1.
    spid53 SQL Server Assertion: File: , line=2699 Failed Assertion = ‘m_pDecodeSplitInput->GetAvailableFreeSize () >= lengthInNext’. This error may be timing-related…
    spid53 Error: 17066, Severity: 16, State: 1.
    spid53 SQL Server Assertion: File: , line=281 Failed Assertion = ‘GetAvailableFreeSize () >= size’. This error may be timing-related…
    spid53 Error: 17066, Severity: 16, State: 1.
    spid53 SQL Server Assertion: File: , line=2699 Failed Assertion = ‘m_pDecodeSplitInput->GetAvailableFreeSize () >= lengthInNext’. This error may be timing-related…
    spid53 Error: 17066, Severity: 16, State: 1.
    spid53 SQL Server Assertion: File: , line=281 Failed Assertion = ‘GetAvailableFreeSize () >= size’. This error may be timing-related…
    Backup Error: 18210, Severity: 16, State: 1.
    Backup BackupIoRequest::ReportIoError: read failure on backup device ‘RARDEVICE’. Operating system error 13(failed to retrieve text for this error. Reason: 15105).
    spid53 Internal I/O request 0x000000015A6ACDA0: Op: Read, pBuffer: 0x0000000000000000, Size: 65536, Position: 1251328, SOS: Internal: 0x0, InternalHigh: 0x0, Offset: 0x0, OffsetHigh: 0x0, m_buf: 0x0000000000000000, m_len: 65536, m_actualBytes: 0, m_errcode: 13, BackupFile: RARDEVICE

    The error returned is:
    Unexpected termination: -2139684860
    PerformTransfer:Done
    Closing device. SqlState:HY000 NativeError:3624
    A system assertion check has failed…
    Waiting for restore thread done.
    Location: readEncoded.cpp:2699
    Expression: m_pDecodeSplitInput->GetAvailableFreeSize () >= lengthInNext
    SPID: 53
    Process ID: 2276
    Restore Thread Execution – failed

    SQL Server @@version is:
    Microsoft SQL Server 2008 (SP3) – 10.0.5512.0 (X64)
    Aug 22 2012 19:25:47
    Copyright (c) 1988-2008 Microsoft Corporation
    Developer Edition (64-bit) on Windows NT 6.1 (Build 7601: Service Pack 1)

    Could you please shed some light on this?

    Best regards,
    Ilya

    Like

  6. Are you using a native T-SQL RESTORE VERIFYONLY command? If you are not doing it natively, then how are you issuing the RESTORE VERIFYONLY command? Is the backup unzipped completely before running the VERIFYONLY command or do you do something else?

    Like

  7. 1) The command is native: RESTORE VERIFYONLY FROM VIRTUAL_DEVICE = ‘RARDEVICE’ but prior issuing it the tool creates Virtual Device “RARDEVICE” on the server
    2) No, the backup is inside RAR-archive to save space, it is unpacked by the tool and UNRAR.DLL on the fly. It is similar to http://mssqlcompressed.sourceforge.net/ and http://www.codeproject.com/Articles/19449/Accessing-the-SQL-Server-Virtual-Device-Interface
    As I understnad it works like that: database backup inside RAR-archive -> VDC + unrar.dll + sqlvdi.dll -> SQL Server. So VDC creates virtual device, creates UNRAR unpacking thread and separate thread which connects to SQL Server, issues command and handles results. After that VDC through IClientVirtualDevice opens virtual device and performs transfer by reacting on VDI commands: VDC_READ, VDC_FLUSH, VDC_CLEARERROR and VD_E_CLOSE. It works all right with non-comressed backups inside RAR-archive, but fails with message “Unexpected termination: -2139684860” (0x80770004) if backup was compressed by SQL Server.
    Nacho Alonso Portillo kindly trying to help me sort it out on his blog post http://blogs.msdn.com/b/ialonso/archive/2012/02/08/backup-compression-and-checksum.aspx

    Like

  8. If I don’t compress backup (compressed by SQL Server or not) into RAR, I can’t use the tool, because it is designed to accept RAR-files only. I should mentioned earlier that this is not a special backup taken by some other backup solution, e.g. LiteSpeed, these are regular backups but archived into RAR. One backup – one RAR archive.
    Another problem, that I’ve ran into is access violation after CreateEx call which I’ve added to support non-default SQL instance. Error message after CreateEx is:
    Creating device: RARDEVICE on instance: sql2008 FAILED, error code 0x80070002
    The calling line is CreateEx(‘MYSERVER’,’RARDEVICE’,&config)
    There is default SQL Server 2000 instance installed along with .\SQL2008. So I might be wrong to call CreateEx with the non-NULL 1st param. Yet changing it to NULL throws 0x80770006 ( An invalid parameter was supplied ).
    This is difficult to debug on the server, so I’ve set up SQL Server 2008 instance on computer where programming IDE is located, so step-by-step debug will be possible.

    Like

  9. With CreateEx(‘SQL2008′,’RARDEVICE’,&config) I’m getting 0x80070002 which is non-VDI error code, not 0x80770007.
    Yep, I’ve read that article and in fact it stirred my curiosity to try CreateEx instead of Create.
    Also there is some ambiguity with CreateEx lpInstanceName parameter: “The “lpInstanceName” parameter identifies the SQL Server instance to which the SQL command needs to be sent to. If the CreateEx method has NULL as the first parameter, then it would always connect to the Default instance. If the server doesn’t have a default SQL instance, then the first parameter needs to be provided with the instance name”. So non-NULL instancename value is needed when there is no default instance installed, not that is ALWAYS required for named instance.
    Let’s see all the cases:
    1) when only default instance installed, then one should call EITHER
    Create(‘RARDEVICE’,&config) OR CreateEx(NULL,’RARDEVICE’,&config)
    2) when only named instance installed, then one should call ONLY
    CreateEx(‘INSTANCENAME’,’RARDEVICE’,&config)
    3) when both default and named-instance installed you should call EITHER
    Create(‘RARDEVICE’,&config) OR CreateEx(‘INSTANCENAME’,’RARDEVICE’,&config)
    and issue SQL statement to the correct SQL instance.
    VDC tool works with any instance through VDI Create call when there is default instance installed, and it fails when CreateEx is called.

    Like

  10. 80070002 translates to OS Error “The system cannot find the file specified”. The best practice is to creat the device against the instance that you are restoring/backing up. If NULL is specified, then it will connect to the default instance. If you are using a non-default instance to perform the backup/restore, then you should be using the Instance Name in the CreateEx command.

    Like

  11. Ok, I roger that, lets move on. With CreateEx(‘SQL2008′,’RARDEVICE’,&config) SQLVDI logs the message:
    SQLVDI: Loc=CVDS::OpenInSecondary. Desc=Open(control). ErrorCode=(2)The system cannot find the file specified. Process=260. Thread=856. Client. Instance=. VD=Global\SQL2008_SQLVDIMemoryName_0.
    This is rather strange to me, because Client Instance is empty, and Virtual Device (VD) should start from “Global\RARDEVICE” not “Global\SQL2008”, shouldn’t it?
    I suspect there is something wrong with IClientVirtualDeviceSet2 interface in the translated header file I have to use, because there is a direct jump from my code calling CreateEx to OpenInSecondary. Yet according to changes in vbackup.chm:
    “Multi-instance support. IClientVirtualDevice2 supercedes IClientVirtualDevice. The new interface is identical to the original except for the addition of the ‘CreateEx’ and ‘OpenInSecondaryEx’ methods. These extend the original ‘Create’ and ‘OpenInSecondary’ to allow for an instance name to be specified.” OpenInSecondaryEx inside CreateEx will be invoked.
    I think I’ve got it: in the translated header I’m using (vdi.h -> vdi.pas) the last method in IClientVirtualDeviceSet is SignalAbort, but in vdi.h there are 3 additional methods: OpenInSecondary, GetBufferHandle and MapBufferHandle. Because IClientVirtualDeviceSet2 is based on IClientVirtualDeviceSet, and IClientVirtualDeviceSet2.CreateEx method lands directly to IClientVirtualDeviceSet.OpenInSecondary. Please excuse my vague language.

    Like

  12. You might want to use the new header file and check if that helps. I have used the same files to code the SQL Server Backup Simulator and it works fine without any issues while creating a virtual device against the named instance of SQL Server.

    Like

  13. I’ve just done that, and the problem with non-default instance has been solved. Yet the main one – SQL Server Assertion on compressed backup described in the comment to this post on February 6, 2013 at 6:00 pm – hasn’t. Have you any clue what’s going wrong here?

    Like

  14. I use similar code for the SQL Server Backup Simulator (which also uses compression) and this works fine without the asserts. Without debugging the code line by line and seeing why the assert is being raised by the database engine, it would be difficult to understand the root cause.

    One thing I can suggest is to unzip the file completely and then create the Virtual Device to see if that works to begin with.

    Like

  15. Hello, I’ve found the issue in rarthread. SQL Server retrieves slightly differently non-compressed and compressed backups:
    -non-compressed
    Read pos: 0 size: 512 cc: 0 bytes: 512
    Read pos: 512 size: 512 cc: 0 bytes: 512
    Read pos: 1024 size: 512 cc: 0 bytes: 512
    Read pos: 1536 size: 512 cc: 0 bytes: 512
    Read pos: 2048 size: 512 cc: 0 bytes: 512
    Read pos: 2560 size: 512 cc: 0 bytes: 512
    Read pos: 3072 size: 512 cc: 0 bytes: 512
    Read pos: 3584 size: 512 cc: 0 bytes: 512
    Read pos: 4096 size: 512 cc: 0 bytes: 512
    Read pos: 4608 size: 1024 cc: 0 bytes: 1024
    Read pos: 5632 size: 1024 cc: 0 bytes: 1024
    Read pos: 6656 size: 512 cc: 0 bytes: 512
    …and so on to the end of file
    Read pos: 1593856 size: 512 cc: 38 bytes: 0 <–EOF
    -compressed
    Read pos: 0 size: 512 cc: 0 bytes: 512
    Read pos: 512 size: 512 cc: 0 bytes: 512
    Read pos: 1024 size: 512 cc: 0 bytes: 512
    Read pos: 1536 size: 512 cc: 0 bytes: 512
    Read pos: 2048 size: 65536 cc: 0 bytes: 65536
    Read pos: 67584 size: 65536 cc: 0 bytes: 65536
    Read pos: 133120 size: 65536 cc: 38 bytes: 31744 <–EOF
    -cc here is commandcode: 0 – ERROR_SUCCESS, 38 – ERROR_HANDLE_EOF.
    SQL Server changes the requested size to 65536 after first 4*512 bytes. Mistake was made when dealing with filesize-positioncmd.size at the end of the file.
    Of course the whole operation is handled much faster with bigger blocksize. Does SQL Server Backup Simulator operates the same way? Can I set blocksize to 65536 somehow for non-compressed backups too?

    Like

  16. Yes you can set the max transfer size using the configuration options for the virtual device. Glad that you were able to nail down your issue.

    For SQL Server Backup Simulator, I set the maxtransfersize using the VDConfig object (maxTransferSize parameter). You can set buffercount and max transfer size for the Virtual Device by setting the appropriate values for the Virtual Device configuration parameters.

    Like

  17. Thank you, I appreciate your willingness to help. I’ve set up VDConfig config:
    config.deviceCount := 1;
    config.maxTransferSize := 1024*1204; //tried 64*1024 also
    before calling CreateEx
    hr = vds.CreateEx(instancename,devname,config);
    following by
    hr := vds.GetConfiguration(1000,config);
    hr := vds.OpenDevice(devname,VD);
    but it seems this doesn’t influence cmd.size from GetCommand in any way. Where do I have to set maxtransfersize?

    Like

  18. cmd is a VDC_Command struct from vdi.h:
    struct VDC_Command
    {
    DWORD commandCode;
    DWORD size;
    DWORDLONG position;
    BYTE *buffer;
    } ;
    in simple.cpp:
    VDC_Command * cmd;
    vd->GetCommand (INFINITE, &cmd);
    if (cmd->commandCode == VDC_Read) fread (cmd->buffer, 1, cmd->size, fh);
    After GetCommand call cmd->size is almost always 512 bytes when backup stream is not compressed by SQL Server, it become as high as 64KB when stream is compressed.

    Like

  19. Are you unzipping your file completely before running the RESTORE commands or are you partially unzipping it? Why is it that you need to control this size? I don’t remember if that is exposed. I will have to check the public documentation on this.

    Like

Comments are closed.