KBI 310127 Microsoft’s PDH.DLL Causing Blue Screen In Argent

Version

All

Date

24 Oct 2008

Summary

Microsoft Windows operating systems have built-in facilities to collect performance monitoring information. In addition, they provide an Application Programming Interface (API) to enable software developers to interrogate the gathered information. Argent uses this API to gather performance counters.

When a single process gathered performance counters from multiple servers and the connection to ServerA had problems, the API would often report trouble connecting to ServerB, even though the connection to ServerB was fine. Connection timeouts and retries are handled inside the Microsoft API, so coding Argent to work around these Microsoft code issues wasn’t simple. Microsoft released patches, but the issues continued into Windows 2000. Windows 2003 and above do not appear to exhibit these connection issues, but can still suffer blue screens.

Technical Background

The advantages of using the Microsoft API are two-fold. Firstly, there is no need for extra code to be deployed on the monitored server. Secondly, the impact of collecting the performance data is very low.

However, there is one drawback: Argent is reliant on Microsoft’s code and on the stability of Microsoft’s oft-patched operating systems. Whilst this code generally works well in recent versions of Windows, this hasn’t always been the case.

Central to the performance API is Microsoft Windows’ Performance Data Helper library (pdh.dll). This library had many connection-related problems in Windows NT 4.0 and below.

Examples Of Some Microsoft pdh.dll Issues:

http://support.microsoft.com/kb/263221

http://support.microsoft.com/kb/170576

http://windowsitpro.com/article/articleid/14021/i-am-getting-a-blue-screen–completely-hung-machine–server-restart-on-my-sql-serverclient.html

Resolution

As a workaround for these Microsoft issues and other pdh.dll-related problems, Argent created a facility to use a separate process for each connection to each monitored server. In Argent’s normal fashion, an option was provided to use either the traditional (shared connection) method or the new (separate connection) method, or a combination of both.

There are advantages and disadvantages to each approach.

Shared Connection Method — AMC_SharedPerf.exe

This consists of a single process for all monitored servers and all Relators. The connection to each server is kept open until Argent explicitly closes it (e.g. when Argent shuts down) or it has been idle for 15 minutes.

The key advantage of this method is that in many cases the same connection is re-used over and over. Overhead is reduced because re-connecting is often not required. This saves network traffic and has less CPU impact.
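The re-use and idle-timeout behaviour described above can be illustrated with a minimal Python sketch. This is not Argent's implementation: the class name, the `connect` callable and the injectable clock are hypothetical, and only the 15-minute idle limit comes from the text.

```python
import time

IDLE_TIMEOUT_SECONDS = 15 * 60  # idle limit stated in the text


class SharedConnectionCache:
    """Hypothetical cache holding one reusable connection per monitored server."""

    def __init__(self, connect, clock=time.monotonic):
        self._connect = connect  # assumed callable: server name -> connection object
        self._clock = clock      # injectable so the timeout can be simulated
        self._entries = {}       # server name -> (connection, time last used)

    def get(self, server):
        """Return the cached connection, reconnecting only if none is cached."""
        now = self._clock()
        self.evict_idle(now)
        entry = self._entries.get(server)
        conn = entry[0] if entry else self._connect(server)
        self._entries[server] = (conn, now)
        return conn

    def evict_idle(self, now=None):
        """Drop connections that have been unused for the idle timeout or longer."""
        now = self._clock() if now is None else now
        for server, (_, last_used) in list(self._entries.items()):
            if now - last_used >= IDLE_TIMEOUT_SECONDS:
                del self._entries[server]

    def close_all(self):
        """Explicit close of every connection, for example at shutdown."""
        self._entries.clear()
```

Repeated calls to `get("ServerA")` within 15 minutes return the same connection object, which is where the saving in network traffic and CPU comes from. The sketch also shows the weakness: every server's connection lives in one process, so a fault in the underlying library can affect all of them.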

The disadvantage of this method is the historical Microsoft problems for Windows NT and Windows 200x as described earlier.

Separate Connection Method — AMC_NonSharedPerf.exe

This consists of a separate process for each server for each Relator. That is, every time a Relator needs to collect a performance metric from a monitored server, Argent starts a process specifically to collect that information. Once the information has been received, the process terminates.
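The process-per-collection pattern can be sketched in Python as follows. The child here is a stand-in that simply prints a value; the real command line of AMC_NonSharedPerf.exe is not documented in this article, so the function name, the child code and the timeout are all assumptions made for illustration.

```python
import subprocess
import sys

# Stand-in for the real collector: a child Python process that prints one value.
# The actual AMC_NonSharedPerf.exe arguments are not known from this article.
CHILD_CODE = "import sys; print('sample-for-' + sys.argv[1])"


def collect_in_separate_process(server, timeout_seconds=30):
    """Start a child process for a single collection and return its output."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, server],
        capture_output=True,   # read the child's stdout instead of sharing the console
        text=True,
        timeout=timeout_seconds,  # a hung child is abandoned rather than waited on forever
        check=True,               # a non-zero exit code raises CalledProcessError
    )
    return result.stdout.strip()
```

Because each collection runs in its own short-lived process, a crash or hang in one child surfaces as an exception for that one call and leaves no state behind for the next collection.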

(The overhead of process creation and termination: on a 2005-vintage laptop running W2003 Server, one cycle of process creation and termination costs around 690 microseconds of CPU time and 2.3 milliseconds of elapsed time; Argent benchmarks…)
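To get a rough feel for process lifecycle cost on a particular monitoring server, a timing loop such as the following Python sketch can be used. It is not the benchmark quoted above: it launches a Python interpreter as the child, so interpreter start-up dominates and the result will be far larger than the figures for a small native process. It measures elapsed time only, and the function name is hypothetical.

```python
import subprocess
import sys
import time


def measure_process_lifecycle(runs=5):
    """Return mean elapsed seconds to start a trivial child process and wait for it to exit."""
    start = time.perf_counter()
    for _ in range(runs):
        # Each iteration is one full create-and-terminate cycle.
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return (time.perf_counter() - start) / runs
```

Calling `measure_process_lifecycle(runs=20)` and multiplying the result by 1000 gives milliseconds per cycle, which is useful for comparing machines with each other rather than with the numbers above.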

The advantages of this method are that it is more robust and avoids the historical Microsoft problems for Windows NT and Windows 200x as described earlier.

The disadvantages of this method are two-fold. Firstly, there is some extra CPU utilization on the monitoring server (see benchmark above). Secondly, there is an increase in the network traffic due to the process of establishing and terminating connections to monitored servers for each Relator collecting performance counters.

Specifying The Option To Use

The method used is controlled at the Relator level, with each Relator having the ability to use either method. This provides Customers with the maximum flexibility – some Relators can use one approach while other Relators use the other approach.

The “Execute Performance Routines In The Separate Processes” box on the “Basic” tab of Relator screens defines the method:

If this option is NOT selected, then the Shared Connection Method (AMC_SharedPerf.exe) is used.

If this option IS selected, then the Separate Connection Method (AMC_NonSharedPerf.exe) is used.

Argent Recommendation

When using the Separate Connection Method, the additional CPU overhead is extremely low (see benchmark above). However, the additional network utilization can sometimes be significant on slow or heavily-utilized links. Argent therefore recommends using the more robust Separate Connection Method in all cases except where the additional network utilization could be significant.

Common Issues Relating To Gathering Performance Monitoring Information

Gaps In Performance Data

It is still all too common for a heavily loaded W200x server to drop performance data. This is simple to reproduce – start Performance Monitor then start about 500 processes using cascading cmd files with

START sibling.cmd

Non-Existent Performance Data

When no performance data is presented at all, the cause is not the Microsoft bug described above; it is most often a security or access issue. Argent needs to be able to access each monitored server remotely and gather the performance counters.

This means the Remote Registry Service needs to be running, and Argent needs to be able to access the performance counters in the remote server’s registry.

This is a common issue found on locked-down servers (e.g. in a DMZ). The simplest way to verify Argent has the appropriate permissions is to use the “Connectivity And Accessibility Test” from the Server’s Properties screen (N14A). An alternative, non-Argent, method is to use PerfMon to access any counters on the remote server – the same results will be observed.
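As a further non-Argent check, the state of the Remote Registry Service can be queried with the standard Windows `sc` command. The Python sketch below wraps that command; the function names are hypothetical, the parsing assumes English-language `sc query` output, and the check only covers whether the service is running, not whether the account has permission to read the performance counters.

```python
import re
import subprocess


def remote_registry_query_command(server):
    """Build the `sc` command line that queries the service on a remote server."""
    return ["sc", rf"\\{server}", "query", "RemoteRegistry"]


def parse_service_state(sc_output):
    """Extract the state name (e.g. RUNNING, STOPPED) from `sc query` output."""
    match = re.search(r"STATE\s*:\s*\d+\s+(\w+)", sc_output)
    return match.group(1) if match else None


def remote_registry_state(server):
    """Run the query (Windows only); None means the state could not be read."""
    result = subprocess.run(
        remote_registry_query_command(server), capture_output=True, text=True
    )
    return parse_service_state(result.stdout)
```

A result of `None` from `remote_registry_state` usually means the query itself failed, for example because the server was unreachable or access was denied, which points to the same kind of security or access issue described above.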