System Health Session: Part 1

SQL Server 2008 introduced a new feature called Extended Events which opened a new vista for troubleshooting SQL Server issues. Bob Ward in a post on the CSS SQL Server Engineers blog had written about the system_health Extended Events session which is started by default on any instance of SQL Server 2008 or above. This session uses a ring buffer target so the information stored about the events tracked by this session don’t stick around forever much like the SQL Server default trace. Jonathan (Blog | Twitter) blogged about the System Health session and way it is setup in detail. Note that if you have an event session setup for a ring_buffer target and the data you feed the target exceeds 4Mb, you may not be able to retrieve all XML nodes from the target data. Bob Ward mentioned about this in detail in his blog post. So the System_Health session also suffers from this drawback.

This is the first post of a series of post that I will be writing about the System Health session and how you can use this to your SQL Server 2008 monitoring on the next level without having to install anything new on your servers… Maybe for the first few posts Winking smile

The first thing that you would need to do is get the different events reported with the oldest event time present in the ring buffer and the last event reported time using the query below:

SET NOCOUNT ON

— Store the XML data in a temporary table
SELECT CAST(xet.target_data as xml) as XMLDATA
INTO #SystemHealthSessionData
FROM sys.dm_xe_session_targets xet
JOIN sys.dm_xe_sessions xe
ON (xe.address = xet.event_session_address)
WHERE xe.name = ‘system_health’

— Get the different events reported in a grouped manner
;WITH CTE_HealthSession AS
(
SELECT C.query(‘.’).value(‘(
/event/@name)[1]’, ‘varchar(255)’) as EventName,
C.query(‘.’).value(‘(
/event/@timestamp)[1]’, ‘datetime’) as EventTime
FROM #SystemHealthSessionData a
CROSS APPLY a.XMLDATA.nodes(‘/RingBufferTarget/event’) as T(C))
SELECT EventName,
COUNT(*) as Occurrences,
MAX(EventTime) as LastReportedEventTime,
MIN(EventTime) as OldestRecordedEventTime
FROM CTE_HealthSession
GROUP BY EventName
ORDER BY 2 DESC
— Drop the temporary table
DROP TABLE #SystemHealthSessionData

One of the set of  events tracked are errors reported by the database engine. The u_tables.sql(C:\Program Files\Microsoft SQL Server\<install folder>\MSSQL\Install) script which contains the specification of the system_health session shows that the following errors are tracked:
1. Any error which has a severity greater than or equal to 20.
2. Errors reported for out of memory conditions (Error #: 17803, 701, 802, 8645, 8651, 8657 or 8902)

The script below will give you a total number of errors reported along with the error reported counts. The second query will list out all the recorded information for a particular error number shown in the first queries output:

SET NOCOUNT ON

— Store the XML data in a temporary table
SELECT CAST(xet.target_data as xml) as XMLDATA
INTO #SystemHealthSessionData
FROM sys.dm_xe_session_targets xet
JOIN sys.dm_xe_sessions xe
ON (xe.address = xet.event_session_address)
WHERE xe.name = ‘system_health’

— Get statistical information about all the errors reported
;WITH CTE_HealthSession (EventXML) AS
(
SELECT C.query(‘.’) EventXML
FROM #SystemHealthSessionData a
CROSS APPLY a.XMLDATA.nodes(‘/RingBufferTarget/event’) as T(C)
),

CTE_ErrorReported (EventTime, ErrorNum) AS
(
SELECT EventXML.value(‘(
/event/@timestamp)[1]’, ‘datetime’) as EventTime,
EventXML.value(‘(/event/data/value)[1]’, ‘int’) as ErrorNum
FROM CTE_HealthSession
WHERE EventXML.value(‘(
/event/@name)[1]’, ‘varchar(255)’) = ‘error_reported’
)
SELECT ErrorNum,
MAX(EventTime) as LastRecordedEvent,
MIN(EventTime) as FirstRecordedEvent,
COUNT(*) as Occurrences,
b.[text] as ErrDescription
FROM CTE_ErrorReported a
INNER JOIN sys.messages b
ON a.ErrorNum = b.message_id
WHERE b.language_id = SERVERPROPERTY(‘LCID’)
GROUP BY a.ErrorNum,b.[text]

— Get information about each of the errors reported
;WITH CTE_HealthSession (EventXML) AS
(
SELECT C.query(‘.’) EventXML
FROM #SystemHealthSessionData a
CROSS APPLY a.XMLDATA.nodes(‘/RingBufferTarget/event’) as T(C)
WHERE C.query(‘.’).value(‘(
/event/@name)[1]’, ‘varchar(255)’) = ‘error_reported’

)
SELECT
EventXML.value(‘(
/event/@timestamp)[1]’, ‘datetime’) as EventTime,
EventXML.value(‘(/event/data/value)[1]’, ‘int’) as ErrNum,
EventXML.value(‘(/event/data/value)[2]’, ‘int’) as ErrSeverity,
EventXML.value(‘(/event/data/value)[3]’, ‘int’) as ErrState,
EventXML.value(‘(/event/data/value)[5]’, ‘varchar(max)’) as ErrText,
EventXML.value(‘(/event/action/value)[2]’, ‘varchar(10)’) as Session_ID
FROM CTE_HealthSession

— Drop the temporary table
DROP TABLE #SystemHealthSessionData

Now you have a set of queries which you can use to fetch information about all the errors reported for your SQL Server instance. Remember that the target of the extended events session is a ring buffer. So, if your server instance have been running for a long period of time, the data will be overwritten.

Modification: Sept 15, 2011: WordPress doesn’t seem to like the @name XML attribute in the XQuery. After multiple attempts at formatting the query above, I finally decided to upload the query as System_Health_Session_1.sql to SkyDrive. You can download it here.

Modification: Sept 21, 2011: Jonathan (Blog | Twitter) dropped me an email stating that the above script could fall prey to incorrect timestamp issue. This issue is already filed as a Connect item. Jonathan described on a viable workaround for this solution in his blog post.

Modification: March 3, 2012: The timestamp value reported is in UTC (GMT) format. So if you want to report event time as per your SQL Server instance time, then you would need to perform the calculation and account for the GMT time difference.

As the title says, Part 1… There is definitely more to come on this topic. Stay tuned!

One thought on “System Health Session: Part 1

  1. Pingback: System Health Session: Part 2 « TroubleshootingSQL

Comments are closed.