KBI 310361 Handling False Alerts In Argent AT System Down Rules

Version

Argent Advanced Technology 3.1A-1301-T3 or below

Date

3 Apr 2013

Summary

Using the NetRemoteTOD API option in a System Down Rule has been a best practice for checking the up/down status of Windows machines, because it verifies more than a responsive NIC, unlike a simple ping. A common scenario goes as follows: the user receives an alert that a server is down because the NetRemoteTOD API call failed; the user pings the server and it responds; the user connects to the server through Remote Desktop and finds it still running. The user then complains of a false alarm.

Technical Background

NetRemoteTOD relies on RPC calls to retrieve the system time of the remote machine. The API call can fail for many reasons. It can be a temporary network condition that has already cleared by the time the user checks. It can also be purely a matter of user perception, or of how the user defines a down server. It is not unusual for a user to be able to connect to a server through Remote Desktop while RPC calls to that server consistently fail.

Resolution

AT 3.1A-1301-T4 and later enhance the System Down Rule to address this issue.

Connect To Remote Desktop Service

When this new option is used, the engine tries to connect to the Remote Desktop service on the target machine. This is similar to what a user does when connecting to the target machine through Remote Desktop.

Note: Due to a Windows API restriction, this option does not work for Windows XP.
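
The article does not say which Windows API the engine uses for this check. As a rough approximation only, the sketch below opens a TCP connection to the default Remote Desktop port (3389). The function name and parameters are hypothetical, and a successful TCP connect is a weaker test than a real Remote Desktop session.

```python
import socket


def rdp_port_reachable(host: str, port: int = 3389, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the Remote Desktop port can be opened.

    Assumption: 3389 is the default RDP port; the target may be configured
    to listen on a different one.
    """
    try:
        # create_connection resolves the host and applies the timeout
        # to the connect attempt.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A call such as `rdp_port_reachable("server01")` only shows that something is listening on the port. It does not prove that a user could actually log on.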

Connect To WMI Service

When this new option is used, the engine tries to connect to the WMI namespace 'root\cimv2'.

This option is a very reliable way to check whether the remote target machine is functioning, because it goes through even more layers than the NetRemoteTOD API. On the other hand, this option is VERY expensive. Checking the WMI connection once every minute across, for example, 500 nodes is not recommended. Use it with caution.

Enhanced NetRemoteTOD API Option

The NetRemoteTOD API option is the one that most commonly causes 'false' alarms. It is enhanced in the following areas:

  • API timeout and retry are implemented to deal with temporary network issues.
  • A prerequisite ping option is implemented to reduce the overhead of the NetRemoteTOD API.

    The normal assumption is that the server is definitely down if ping fails. If the engine cannot ping the server, it can skip the much more expensive NetRemoteTOD API call.

    Note: This assumption does not always hold, especially because a firewall can block ICMP traffic. So use this option with caution.

  • A double-checking mechanism is implemented to match the user’s definition of a down server.
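
The first two enhancements (retry and prerequisite ping) can be sketched as follows. This is an illustration only, not the engine's implementation. The names `check_with_retry` and `net_api_check`, the default attempt count, and the delay are assumptions. The `ping` and `net_remote_tod` callables are placeholders for the real probes and are assumed to enforce their own timeouts.

```python
import time
from typing import Callable


def check_with_retry(probe: Callable[[], bool],
                     attempts: int = 3,
                     delay_seconds: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run the probe up to `attempts` times; succeed on the first True."""
    for attempt in range(attempts):
        if probe():
            return True
        # Wait between attempts, but not after the last one.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


def net_api_check(ping: Callable[[], bool],
                  net_remote_tod: Callable[[], bool],
                  require_ping: bool = True,
                  attempts: int = 3,
                  delay_seconds: float = 2.0,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True if the server is considered up.

    With require_ping set, a failed ping is treated as "down" and the
    more expensive NetRemoteTOD probe is skipped entirely.
    """
    if require_ping and not ping():
        return False
    return check_with_retry(net_remote_tod, attempts, delay_seconds, sleep)
```

Because a failed ping short-circuits the check, a firewall that blocks ICMP would make every server behind it look down in this sketch, which is the risk the note above describes.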

The double-checking logic goes as follows:

  1. If NetRemoteTOD succeeds, the server is up.
  2. If NetRemoteTOD fails, ping the server as the first step of double-checking.
  3. If ping fails, the server is down because both the NetRemoteTOD API and ping have failed.
  4. If ping succeeds, perform the second step of double-checking to get a third opinion. The second step can be one of: WMI, Remote Desktop, file share, or system up (no further checking).
  5. If the second step succeeds, the system is up; if it fails, the system is down.
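
The steps above can be written as a short decision function. This is a minimal sketch, not the engine's code. The function name is hypothetical, and the three callables stand in for the real probes. Passing `None` as the second check models the "system up (no further checking)" choice.

```python
from typing import Callable, Optional


def is_system_up(net_remote_tod: Callable[[], bool],
                 ping: Callable[[], bool],
                 second_check: Optional[Callable[[], bool]] = None) -> bool:
    """Apply the double-checking logic and return True if the system is up."""
    # Step 1: a successful NetRemoteTOD call settles it.
    if net_remote_tod():
        return True
    # Steps 2-3: if ping also fails, both opinions agree the server is down.
    if not ping():
        return False
    # Step 4: ping succeeded, so get a third opinion.
    if second_check is None:
        # "System up (no further checking)": a successful ping is enough.
        return True
    # Step 5: the third opinion (WMI, Remote Desktop, or file share) decides.
    return second_check()
```

The probes are evaluated lazily, so the expensive second check runs only when NetRemoteTOD has failed and ping has succeeded.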

Advanced Cascaded Logic

Cascaded Logic allows the user to implement sophisticated logic for checking the up/down status of the target server. The engine goes through the ordered tests one by one. When a test succeeds, the logic can declare that the target server is up, or take one more step to confirm. When a test fails, the logic can declare that the target server is down, or take more steps to double-check. All the direct methods are available for combination.
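
One way to picture the cascade is as an ordered list of steps, each stating what to do on success and on failure. The sketch below is hypothetical: the `Step` and `Action` types and the `evaluate_cascade` function are illustrative names, not the product's configuration model. It also assumes that when the list runs out, the result of the last test stands.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence


class Action(Enum):
    DECLARE = "declare"    # accept this test's result as the final answer
    CONTINUE = "continue"  # move on to the next test


@dataclass
class Step:
    name: str
    probe: Callable[[], bool]
    on_success: Action = Action.DECLARE
    on_failure: Action = Action.DECLARE


def evaluate_cascade(steps: Sequence[Step]) -> bool:
    """Run the ordered tests; return True if the target is declared up."""
    if not steps:
        raise ValueError("at least one step is required")
    result = False
    for step in steps:
        result = step.probe()
        action = step.on_success if result else step.on_failure
        if action is Action.DECLARE:
            return result
    # Assumption: if every step said CONTINUE, the last result stands.
    return result
```

For example, a cascade that continues past a NetRemoteTOD failure, continues past a ping success, and then lets a WMI test decide behaves like the double-checking logic of the enhanced NetRemoteTOD option.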