SQL Server is hung!

In the recent past, I had to work on a SQL Server 2000 instance which became unresponsive a short while after the service was restarted. Since this was SQL Server 2000, I didn’t have the option of using a Dedicated Administrator Connection (DAC) to log into the instance to see whether a DAC connection would succeed and, if it did, to figure out what was happening inside the SQL Server engine that was preventing it from accepting new connections.

After the SQL Server service restart, the ERRORLOG happily indicated no issues, and if the lack of error messages wasn’t already enough to make you tear your hair out, the connection failure reported the most generic of error messages:

Server: Msg 11, Level 16, State 1
[Microsoft][ODBC SQL Server Driver][TCP/IP Sockets]General network error. Check your network documentation.

I did the basic due diligence of checking whether the network protocols were enabled and whether the port on which the SQL Server instance was supposed to listen was actually open. I also checked the netstat output for activity on that port and found a large number of connections to it, so I counted them to determine whether this was a TCP port exhaustion issue. But that was not the case either! The Errorlog didn’t even report a Deadlocked Schedulers condition to tell me that there was a problem.
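For reference, a count like that can be pulled straight from netstat. The commands below are a minimal sketch that assumes the instance is listening on the default port 1433; substitute whichever port your instance actually uses.

REM Count all connections touching the SQL Server port (1433 assumed here)
netstat -an | find /c ":1433"

REM Of those, count only the established ones
netstat -an | find ":1433" | find /c "ESTABLISHED"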

Now that I had run out of the usual avenues, the only option left was to capture a memory dump of the SQL Server process and investigate it.

As you can see, capturing a memory dump of the SQL Server process is a last resort, and all other available options should be exercised before going down this path.
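Capturing the dump itself can be done from the command line with SQLDumper.exe, which ships with SQL Server. The invocation below is only a minimal sketch, not a prescription: the process ID is hypothetical, and the dump-type flag should be verified against the SQLDumper documentation for your release (0x0120 is the value commonly documented for a mini-dump; other values produce filtered or full dumps).

REM 1234 stands in for the real process ID of sqlservr.exe (look it up in Task Manager)
REM 0x0120 requests a mini-dump; check the SQLDumper documentation for other dump types
Sqldumper.exe 1234 0 0x0120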

I captured the dump of sqlservr.exe using SQLDumper. Since I didn’t have a specific error to chase, I was essentially looking for a pattern in the dump that stood out. First, load the dump in WinDbg and load the symbols using the following commands. The command below sets the symbol path to the Microsoft symbol server and caches the symbols in a local folder on the D: drive.

.sympath srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols

Then I loaded the symbols using the command .reload /f. Now that I had the correct public symbols, I used the !uniqstack command to list all the unique callstacks in the dump file. The output of the command is shown below.

Total threads: 276
Duplicate callstacks: 249

So the dump contained 276 threads in total, 249 of which had duplicate callstacks. There was a good chance that a callstack with a distinctive pattern would point to the root cause of the issue. On manual inspection of the duplicate callstacks, I found that the most common callstack signature was the one shown below:

ntdll!KiFastSystemCallRet
ntdll!NtWaitForSingleObject+0xc
ntdll!RtlpWaitOnCriticalSection+0x1a3
ntdll!RtlEnterCriticalSection+0xa8
ntdll!LdrShutdownThread+0x33
kernel32!ExitThread+0x2f
kernel32!BaseExitThreadPoolThread+0xd
ntdll!EtwpEventPump+0x335
kernel32!BaseThreadStart+0x34

There were 249 threads with the callstack shown above, which very clearly signifies that they are all waiting on a critical section. I then used the !locks command to list the critical sections currently held in the process, and the output I got is shown below:

0:043> !locks

CritSec ntdll!LdrpLoaderLock+0 at 7c88a7a0
WaiterWoken        No
LockCount          249
RecursionCount     2
OwningThread       144c
EntryCount         0
ContentionCount    121
*** Locked

Scanned 3107 critical sections

Now I knew that I had 249 waiters on the above critical section. Let us find out what the thread owning the critical section was actually doing. Using ~~[144c]s, I could switch to the thread that currently owned the critical section (OS thread ID 144c, which the debugger showed as thread 25) and retrieve its call stack. I was lucky enough to get a meaningful callstack, as the thread holding the critical section was itself waiting for an operation to complete. Part of the callstack of the owning thread is shown below.

0:043> ~~[144c]s

0:025> kL100

ntdll!KiFastSystemCallRet
user32!NtUserWaitMessage+0xc
user32!InternalDialogBox+0xd0
user32!SoftModalMessageBox+0x94b
user32!MessageBoxWorker+0x2ba
user32!MessageBoxTimeoutW+0x7a
user32!MessageBoxTimeoutA+0x9c
user32!MessageBoxExA+0x1b
user32!MessageBoxA+0x45


ntdll!LdrpCallInitRoutine+0x14
ntdll!LdrpRunInitializeRoutines+0x367
ntdll!LdrpLoadDll+0x3cd
ntdll!LdrLoadDll+0x198
kernel32!LoadLibraryExW+0x1b2

The names of the functions are self-explanatory and show that the thread is trying to display a message box from within the SQL Server process. This is not allowed, as the SQL Server service runs as a non-interactive process, which means it cannot show pop-up boxes to a user on the desktop.

On additional investigation into the complete callstack, it turned out that the message box was being raised by code in a third-party DLL during its initialization routine (the LdrpRunInitializeRoutines frame), while the loader lock was still held. The corrective action was to remove the DLL from the SQL Server process space so that it would no longer try to raise a message box.
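To make the failure mode concrete, here is a stripped-down C sketch of the anti-pattern (illustrative only, not the actual third-party code): a DLL whose entry point raises a message box. DllMain runs while the loader lock is held, and inside a non-interactive service the message box can never be dismissed, so every other thread that needs the loader lock, including threads that are simply starting or exiting, piles up behind it, exactly like the 249 waiters above.

// badinit.c - minimal sketch of the anti-pattern, NOT the actual third-party DLL.
// DllMain runs while the OS loader lock is held; blocking here stalls every other
// thread that needs the loader lock, such as threads that are starting or exiting.
#include <windows.h>

BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD fdwReason, LPVOID lpvReserved)
{
    if (fdwReason == DLL_PROCESS_ATTACH)
    {
        // Blocks indefinitely inside a service that cannot interact with the desktop.
        MessageBoxA(NULL, "Initialization failed", "Third-party component", MB_OK);
    }
    return TRUE;
}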

This issue affects not only SQL Server 2000 instances but all versions of SQL Server up to SQL Server 2012. This is one scenario where the SQL Server process is in a “hung” state because the thread owning the critical section is itself waiting, while all the other threads wait for it to release the loader lock. No messages are printed to the SQL Errorlog about this condition because every new thread has to pass through the same code path the other threads are stuck on, so SQL Server ultimately becomes unresponsive and gives the end user the perception that it is hung. In this case, SQL Server is actually the victim.
