KBI 311119 New Feature: Fire Urgent Group Alert When All Nodes In Monitoring Group Are Broken For The Same Rule

Version

Argent Advanced Technology 3.1A-1410-A And Later

Date

Monday, 17 Nov 2014

Summary

The new feature allows firing conditional alerts when all Nodes in the selected Monitoring Group are broken for the same Rule

This is extremely useful to monitor critical applications or clustered resources

For example, a web farm may span over a few IIS servers

If some of the web servers are offline, website may be less responsive, but it is still available

If all web servers are offline, the website will be completely down

It is straightforward to define such an Urgent Group Alert

There are two restrictions when defining Urgent Group Alerts:

  1. The Selected Monitoring Group cannot be associated with Monitoring Engine {dynamic}

    The reason behind is that Nodes in the Monitoring Group must be scheduled by one Supervising Engine or Motors at the same location

  2. It is not prohibited but not recommended to use dynamic scheduling

    It would be better to find out the condition for all Nodes at similar time

Technical Background

A Relator can have one or multiple Monitoring Groups to monitor

The schedule of Relator defines the interval of Relator executions

Each interval can be considered as a batch to run Rules for the monitored server/devices

The basic idea of Urgent Group Alert is to check the result of Relator in each batch; if a Rule is broken ALL Nodes in the selected Monitoring Group, fire Events for the individual Nodes, then fire the conditional Urgent Group Alert

To understand Urgent Group Alert, customer should carefully read following technical details:

  1. An Urgent Group Alert is simply an alert that tells customer that a serious condition has happened

    There are Argent Console Events for individual Nodes, but no such an Event for Urgent Group Alert

    Customer does not need to answer the correspondent Event on Argent Console (A1x)

    Instead, he should read the Events of individual Nodes for details

  2. Urgent Group Alert is normally only fired after Events for individual server/device are fired

    For example, Monitoring Group ‘&MG_X‘ includes server A, B and C

    In batch 1, A and B are found down and Events are fired

    Urgent Group Alert is not fired because C is still up

    In batch 2, C is found down and an Event is fired

    In batch 3, A, B and C are still down

    This default behavior is useful to prevent alert flooding as it can take a while to correct the condition

    For extremely critical condition, if customer does not mind to be alerted repetitively, he can override the behavior by checking following option in Relator setting:


    When this option is checked, an Urgent Group Alert is fired for each batch as long as all Nodes are still down and no matter if individual Events are fired

  3. Customer may want Urgent Group Alert to be fired repetitively but find it is too much especially when Relator runs with very short interval

    This is especially true for connectivity test using SLA or System Down Rules

    To limit the alerts, customer can use following option in Relator setting:


  4. If some Nodes in the Monitoring Group are suspended, in maintenance mode, excluded in Relator or not licensed, they are not counted when determining if the Urgent Group Alert should be fired

    Take previous example, if A and B are down, C is up, but C is in maintenance mode, an Urgent Group Alert will be fired as C is not counted

It is quite straightforward to debug or analyze Urgent Group Alerts

When an Urgent Group Alert is fired, it contains list of Events fired for the individual Nodes

It explains why the Urgent Group Alert is fired. Following is an Urgent Group Alert example:

If an Urgent Group Alert is not fired, Argent AT service log also contains detail explaining the logic. For example:

25 Aug 2014 14:02:32.890 W2008R264PANWS A\Administrator NOT FIRING Urgent Group Alert for Monitoring Group ‘&MG_X‘ in Relator ‘REL_WINDOWS_API_CONNECT

Reason: Rule ‘NDE_API_CONNECT‘ has not found broken for Node ‘DBMWIT‘ in current batch

Resolution

Upgrade to Argent Advanced Technology 3.1A-1410-A or later