Desktop Heap Issues

Over the past year I have seen multiple servers falling prey to the desktop heap issue due to which CSS has received multiple support calls. And where do I fit into all this. Well SQL Server setup works in two phases, the second phase being service startup. When you have a large (don’t ask me what this number is because it varies based on multiple parameters) number of services running on your server, then the SQL service startup might fail due to which the setup would fail. That is when you would either curse the setup and give it another shot(s) or call CSS.

Read about what the Platforms Escalation team’s blog post about what Desktop Heap is all about:

http://blogs.msdn.com/ntdebugging/archive/2007/01/04/desktop-heap-overview.aspx 

This is when we collect DHeapMon output to analyze the current status of the Desktop Heap. Now, since you have a large number of windows services running on your server, you either have an option of shutting down the non-essential services or follow the steps mentioned in the article below (KB126962).

WORD OF CAUTION: The default values shouldn’t be changed unless Microsoft Support has asked you to do so.

If the total desktop heap becomes the system-wide buffer size, the following event is logged:

Event Type: Warning
Event Category: None
Event ID: 244
Date: Date
Time: Time
User: N/A

Description: Failed to create a desktop due to desktop heap exhaustion.

From the system event logs, you would see the following message:

Failed to create a desktop due to desktop heap exhaustion.

Timeout (30000 milliseconds) waiting for the <SQL Server name> service to connect.
The <SQL Server name> service failed to start due to the following error: The service did not respond to the start or control request in a timely fashion.
The Windows Installer service entered the stopped state.

System      07/08/2009 14:07:07 Warning            244    Win32k                   User information unavailable…    Failed to create a desktop due to desktop heap exhaustion.
System      07/08/2009 14:07:07 Error              7009   Service Control Manager  User information unavailable…    Timeout (30000 milliseconds) waiting for the <SQL Server name> service to connect.
System      07/08/2009 14:07:07 Error              7000   Service Control Manager  User information unavailable…    The SQL Server (<instance name>) service failed to start due to the following error: The service did not respond to the start or control request in a timely fashion.

The above messages would be noticed in the Windows Event Logs around the same time when the SQL setup failed if you are not able to install a SQL patch due to a desktop heap exhaustion issue.

Useful Link

"Out of Memory" error message appears when you have a large number of programs running

http://support.microsoft.com/kb/126962/

184802    User32.dll or Kernel32.dll fails to initialize
http://support.microsoft.com/default.aspx?scid=kb;EN-US;184802

How to search for malicious strings in your database

In the past, CSS had a number of support calls regarding SQL Injection. Nowadays, people are much more aware of how to secure their front-end and back-end systems so that they do not fall prey to malicious attackers and become victims of issues like SQL Injection. Now let’s take a scenario where you have identified a malicious string in your database table. You are not sure how many more table(s)/column(s) can have this kind of malicious string. Searching this would be a daunting task. You can use the script below to run a search on all tables columns that have character data and look for the malicious string pattern that you provide the @string variable. This search would be a very performance intensive search as this would take a long time to execute as it is combing through all character columns in the entire database. So, here’s the warning that you have heard multiple times 🙂 : DO NOT RUN THIS DURING PRODUCTION HOURS. However, if you have a case of SQL Injection, then production should definitely be halted till you have plugged the hole in the security. 

set nocount on
DECLARE @T varchar(255), @C varchar(255),@string varchar(255);
SET @string = '<script' -- Malicious value to be searched for
DECLARE Table_Cursor CURSOR FOR
SELECT a.name, b.name
FROM sysobjects a, syscolumns b
WHERE a.id = b.id AND a.xtype = 'u' AND
(b.xtype = 99 OR --ntext
b.xtype = 35 OR -- text
b.xtype = 231 OR -- nvarchar
b.xtype = 175 OR -- char
b.xtype = 167) -- varchar
declare @sql varchar(8000)
OPEN Table_Cursor;
FETCH NEXT FROM Table_Cursor INTO @T, @C
WHILE (@@FETCH_STATUS = 0)
BEGIN
  select 'Table Name:'+@T
  EXEC('select * from [' + @T + '] where [' + @C + '] ' + 'like  ''%'+@string+'%'' OR ' + @C + ' like ''%EXEC %'''  )
  FETCH NEXT FROM Table_Cursor INTO @T, @C;
END
CLOSE Table_Cursor;
DEALLOCATE Table_Cursor;
set nocount off

PAGE_VERIFY option in SQL Server

A new feature is available in SQL Server 2005 and higher versions to enabled Page Checksum for a database page to help detect errors in the IO path.

The SQL Server storage engine has already blogged about this:

http://blogs.msdn.com/sqlserverstorageengine/archive/2006/06/29/Enabling-CHECKSUM-in-SQL2005.aspx 

The reason I’m posting an additional article is the misconceptions related enabling Checksum for a SQL Server database. You can find out what is the PAGE VERIFY option for your database using the system catalog sys.databases with the help of the query below:

select database_id, page_verify_option, page_verify_option_desc

from sys.databases

The PAGE VERIFY option has 3 possible values:

0 = NONE

1 = TORN_PAGE_DETECTION

2 = CHECKSUM

Let’s take an example here:

You created a database without CHECKSUM enabled. So, all pages that were written to disk now do not have the Checksum property was enabled on it. After you enable Checksum on the database all the objects that were created prior to enabling Checksum on the database will not have the Checksum written into the Page header. When these pages are brought from the disk into the SQL Server Buffer Pool (RAM) and the dirty bit of the page is set (when any DML operation modifies the page), only then would the Checksum get calculated and written into the page header before it gets written back to the physical media (disk). This happens during the Checkpoint operation. So, if you brought a page into the SQL Server Buffer Pool and didn’t dirty the page (i.e. only for a SELECT operation), then the checksum will not be written onto the page header.

One of the recommendations would be to dirty all pages using operations like DBCC DBREINDEX so that all pages that didn’t have Checksums written into the header now have Checksums written onto it after enabling Checksum on a database.

Why is this by-design?

The reason for this is we cannot estimate the size of the database and the number of database pages that would need checksums calculated and written onto the disk. This would involve a huge performance degradation which is avoided by the current design.

From SQL Server Books Online:

PAGE_VERIFY

When CHECKSUM is specified, the Database Engine calculates a checksum over the contents of the whole page and stores the value in the page header when a page is written to disk. When the page is read from disk, the checksum is recomputed and compared to the checksum value stored in the page header.

When TORN_PAGE_DETECTION is specified, a specific bit for each 512-byte sector in the 8-kilobyte (KB) database page is saved and stored in the database page header when the page is written to disk. When the page is read from disk, the torn bits stored in the page header are compared to the actual page sector information.

When NONE is specified, database page writes will not generate a CHECKSUM or TORN_PAGE_DETECTION value. SQL Server will not verify a checksum or torn page during a read even if a CHECKSUM or TORN_PAGE_DETECTION value is present in the page header.

Another interesting post by Bob Dorr (Principal Escalation Engineer – SQL Server):

http://blogs.msdn.com/psssql/archive/2008/05/22/how-it-works-sql-server-2005-possible-data-recovery-when-checksum-failure-reported.aspx

MTL Consumers

A previous post of mine talked about SQL Server memory architecture in a nutshell. Let’s find out what are the common consumers of the much talked about MemToLeave region and what you should be looking for in-case you run into MemToLeave issues.

Major Consumers of MTL

    1. Use of Linked Servers – You can find out the linked servers that being used in your environment using the sysservers system catalog.
    2. Use of XML documents – You would have to find out if any queries or procedures perform any kind of XML data manipulation or use sp_xml_preparedocument.
    3. Extended Stored Procedures or sp_OAcreate calls
      1. Extended stored procedure usage can be identified by inspecting the SQL Server Errorlog and searching for the word using. The first XSP calls is logged in the Errorlog in the following manner: Using ‘<dll name>’ version ‘<version>’ to execute extended stored procedure ‘<XSP name>’.
      2. If you are using sp_OAcreate, then this information would be logged in the SQL Errorlog for the first invocation of sp_OAcreate using the same pattern mentioned above. The only difference would be that the DLL name would be odsole70.dll.
    4. Query Plans which are larger than 8 KB in size
      1. If you are using SQL Server 2000, query syscacheobjects and search for entries which have values greater than 1 for the pagesused column.
      2. If you are using SQL Server 2005, search for plans greater than 8KB using the DMV sys.dm_exec_cached_plans. Inspect the size_in_bytes column value and search for entries which have values greater than 8192 bytes.
    5. Use of SQLCLR (Applicable to SQL Server 2005 and above) – Check the memory usage for SQLCLR clerk using DBCC MEMORYSTATUS.
    6. Backups using larger MAXTRANSFERSIZE parameter – In such a case find out BACKUP DATABASE/LOG commands and inspect the MAXTRANSFERSIZE parameter value. The lowest size that can be provided is 65536 bytes.
    7. Using Network Packet Size higher than 8192 bytes. You can find all connections on SQL Server 2005 that use more than 8KB of connection using the following DMV: select * from sys.dm_exec_connections:

select * from sys.dm_exec_connections
where net_packet_size > 8192

How to find the largest free block in the MTL

Often we need to find the largest free block when troubleshooting “MemToLeave” issues with SQL Server Memory. Since, this is related to virtual address space and we didn’t expose any method to query the process address space on SQL Server 2000, you would have to rely on a tool called VMSTAT to find out the largest free block. This tool is used by CSS to troubleshoot MTL issues. However in SQL Server 2005, we added a new DMV which would inform you the status of the virtual address space for SQL. The query below can be used to find the largest contiguous block on the system.

If you are getting the following error:

WARNING:  Failed to reserve contiguous memory of Size= 2162688 (*This value is always in bytes)

If the query below is run during the problem period, then you will find one of the following conditions to be true:

1. The value returned by the query is larger than the value in the error (once you have done the necessary conversions) – This means that when the error occurred, the MemToLeave region of SQL Server was being used by some component due to which a current allocation request failed. This is due to a virtual address space crunch.

2. The value returned by the query is lower than the requested size – This means that you either have a virtual address space fragmentation issue or some component has used up all the available MemToLeave region. (A lot of my colleagues believe that MTL is a misnomer but I will continue to use these throughout the post as it makes referencing the memory region simpler.) In such a case, it would always be advisable to get a Microsoft CSS Engineer to setup data collection tools in place for the next problem occurrence and analyze the existing logs for any tell-tale signs.

Query  (for 64-bit plaform)

SELECT convert(varchar,getdate(),120) as [Timestamp], max(region_size_in_bytes)/1024 [Total max contiguous block size in KB] from sys.dm_os_virtual_address_dump where region_state = 0x0000000000010000 --MEM_FREE
 
Query  (for 32-bit plaform)
SELECT convert(varchar,getdate(),120) as [Timestamp], max (region_size_in_bytes)/1024 [Total max contiguous block size in KB] from sys.dm_os_virtual_address_dump where region_state = 0x00010000 -- MEM_FREE
 
For more information on this DMV, read about it on sys.dm_os_virtual_address_dump
NOTE: The above query is very very performance intensive and shouldn’t be run very frequently on a production system. On a 64-bit system, frequent use of this query can actually degrade performance severely due to the large virtual address space of 64-bit systems.