KBI 311119 New Feature: Fire Urgent Group Alert When All Nodes In Monitoring Group Are Broken For The Same Rule

Version

Argent Advanced Technology 3.1A-1410-A And Later

Date

Monday, 17 Nov 2014

Summary

The new feature allows firing conditional alerts when all Nodes in the selected Monitoring Group are broken for the same Rule

This is extremely useful for monitoring critical applications or clustered resources

For example, a web farm may span several IIS servers

If some of the web servers are offline, the website may be less responsive, but it is still available

If all web servers are offline, the website will be completely down

It is straightforward to define such an Urgent Group Alert

There are two restrictions when defining Urgent Group Alerts:

The Selected Monitoring Group cannot be associated with Monitoring Engine {dynamic}

The reason is that all Nodes in the Monitoring Group must be scheduled by one Supervising Engine or by Motors at the same location

Dynamic scheduling is not prohibited, but it is not recommended

It is better to check the condition of all Nodes at about the same time

Technical Background

A Relator can monitor one or more Monitoring Groups

The schedule of a Relator defines the interval between Relator executions

Each interval can be considered a batch in which Rules are run for the monitored servers/devices

The basic idea of the Urgent Group Alert is to check the result of the Relator in each batch; if a Rule is broken for ALL Nodes in the selected Monitoring Group, fire Events for the individual Nodes, then fire the conditional Urgent Group Alert
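
The per-batch check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Argent's implementation: the function name `rules_broken_on_all_nodes`, the data layout (a mapping from Node name to the set of Rules found broken in the current batch) and the Rule name `RULE_DISK` are assumptions made for the example.

```python
def rules_broken_on_all_nodes(batch_results):
    """Return the Rules that are broken on every Node in one batch.

    batch_results maps each Node name to the set of Rule names found
    broken for that Node in the current batch (hypothetical layout).
    """
    if not batch_results:
        return set()
    broken_sets = [set(rules) for rules in batch_results.values()]
    # A Rule qualifies only if it appears in every Node's broken set.
    return set.intersection(*broken_sets)


# Hypothetical batch: only NDE_API_CONNECT is broken on all three Nodes.
batch = {
    "A": {"NDE_API_CONNECT"},
    "B": {"NDE_API_CONNECT", "RULE_DISK"},
    "C": {"NDE_API_CONNECT"},
}
print(rules_broken_on_all_nodes(batch))
```

A Rule that is broken on only some of the Nodes (here the hypothetical `RULE_DISK`) does not qualify, so it would produce individual Events but no Urgent Group Alert.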

To understand the Urgent Group Alert, customers should carefully read the following technical details:

An Urgent Group Alert is simply an alert that tells the customer that a serious condition has occurred

There are Argent Console Events for the individual Nodes, but no such Event for the Urgent Group Alert

The customer does not need to answer a corresponding Event on the Argent Console (A1x)

Instead, the customer should read the Events of the individual Nodes for details

An Urgent Group Alert is normally fired only after Events for the individual servers/devices are fired

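For example, Monitoring Group ‘&MG_X’ includes servers A, B and C

In batch 1, A and B are found down and Events are fired

The Urgent Group Alert is not fired because C is still up

In batch 2, C is found down and an Event is fired

The Urgent Group Alert is now fired because all three servers are down

In batch 3, A, B and C are still down

The Urgent Group Alert is not fired again because no new Events are fired for the individual servers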

This default behavior is useful to prevent alert flooding as it can take a while to correct the condition
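
The default behavior in the A, B and C example can be simulated with a short Python sketch. It is not Argent's code. It assumes that an individual Event fires only the first time a Node is found down, and that the Urgent Group Alert fires in a batch only when all Nodes are down and at least one new individual Event fired in that batch. The function name `simulate_default` is hypothetical.

```python
def simulate_default(nodes, batches):
    """Return a list of (batch_number, alert_fired) pairs.

    nodes   -- names of all Nodes in the Monitoring Group
    batches -- list of sets; each set holds the Nodes found down
               in that batch
    """
    all_nodes = set(nodes)
    already_alerted = set()  # Nodes whose individual Event has fired
    results = []
    for number, down in enumerate(batches, start=1):
        new_events = down - already_alerted
        # Forget recovered Nodes so that they can raise an Event again.
        already_alerted = (already_alerted | new_events) & down
        all_down = down >= all_nodes
        # Assumed default: fire only if this batch produced new Events.
        results.append((number, all_down and bool(new_events)))
    return results


print(simulate_default(
    ["A", "B", "C"],
    [{"A", "B"}, {"A", "B", "C"}, {"A", "B", "C"}],
))
# prints [(1, False), (2, True), (3, False)]
```

Under these assumptions the alert fires once, in batch 2, and stays quiet in batch 3 even though all three servers are still down.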

For extremely critical conditions, a customer who does not mind being alerted repeatedly can override this behavior by checking the following option in the Relator settings:

When this option is checked, an Urgent Group Alert is fired for each batch as long as all Nodes are still down, regardless of whether individual Events are fired

The customer may want the Urgent Group Alert to be fired repeatedly but find the volume excessive, especially when the Relator runs at a very short interval

This is especially true for connectivity tests using SLA or System Down Rules

To limit the alerts, the customer can use the following option in the Relator settings:

If some Nodes in the Monitoring Group are suspended, in maintenance mode, excluded in the Relator or not licensed, they are not counted when determining whether the Urgent Group Alert should be fired

In the previous example, if A and B are down and C is up but in maintenance mode, an Urgent Group Alert is fired because C is not counted
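
The counting rule can be illustrated with a minimal Python sketch. The `Node` class, its `counted` field and the function `all_counted_nodes_down` are hypothetical names introduced for this example. Returning no alert when no Nodes are counted at all is an assumption, as this article does not cover that case.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    down: bool
    # False if the Node is suspended, in maintenance mode,
    # excluded in the Relator or not licensed.
    counted: bool = True


def all_counted_nodes_down(nodes):
    """True if every Node that is counted is down."""
    counted = [n for n in nodes if n.counted]
    # Assumption: with no counted Nodes there is nothing to alert on.
    return bool(counted) and all(n.down for n in counted)


group = [
    Node("A", down=True),
    Node("B", down=True),
    Node("C", down=False, counted=False),  # up, but in maintenance mode
]
print(all_counted_nodes_down(group))  # prints True
```

If C were counted, the same call would return False, because C is still up.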

It is quite straightforward to debug or analyze Urgent Group Alerts

When an Urgent Group Alert is fired, it contains a list of the Events fired for the individual Nodes

This list explains why the Urgent Group Alert was fired. The following is an Urgent Group Alert example:

If an Urgent Group Alert is not fired, the Argent AT service log contains details explaining the logic. For example:

25 Aug 2014 14:02:32.890 W2008R264PANWS A\Administrator NOT FIRING Urgent Group Alert for Monitoring Group ‘&MG_X‘ in Relator ‘REL_WINDOWS_API_CONNECT‘

Reason: Rule ‘NDE_API_CONNECT‘ has not found broken for Node ‘DBMWIT‘ in current batch
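
When the service log is large, entries like the one above can be picked out with a small script. The following Python sketch is based only on the sample line shown here; the exact log layout and quote characters in a given installation are assumptions, and `find_not_fired` is a hypothetical helper name.

```python
import re

# Accept the curly or straight quote characters around names.
QUOTE = "[‘’']"
NAME = "[^‘’']+"
PATTERN = re.compile(
    "NOT FIRING Urgent Group Alert for Monitoring Group "
    + QUOTE + "(?P<group>" + NAME + ")" + QUOTE
    + " in Relator "
    + QUOTE + "(?P<relator>" + NAME + ")" + QUOTE
)


def find_not_fired(lines):
    """Return (monitoring_group, relator) pairs for matching log lines."""
    hits = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            hits.append((match.group("group"), match.group("relator")))
    return hits


sample = [
    "25 Aug 2014 14:02:32.890 W2008R264PANWS A\\Administrator NOT FIRING "
    "Urgent Group Alert for Monitoring Group ‘&MG_X‘ in Relator "
    "‘REL_WINDOWS_API_CONNECT‘",
    "Reason: Rule ‘NDE_API_CONNECT‘ has not found broken for Node "
    "‘DBMWIT‘ in current batch",
]
print(find_not_fired(sample))
```

The Reason line that follows each match names the Rule and the Node that kept the alert from firing.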

Resolution

Upgrade to Argent Advanced Technology 3.1A-1410-A or later