KBI 310893 Issue Addressed: Group Of Relators Cancel When Monitoring Engine Is Overloaded

Version

Argent Advanced Technology 3.1A-1401-E or below

Date

Tuesday, 25 Mar 2014

Summary

A heavily saturated Monitoring Engine Process Pool may cause a group of Relators to cancel

Technical Background

This issue may occur on a heavily utilized Monitoring Engine processing numerous tasks during the Monitoring Engine Recycle Process

By default, every 30 minutes the Monitoring Engine Processes are recycled

If many Monitoring Engine processes happen to recycle at almost the same time, their tasks can be reassigned to the few Monitoring Engine processes still available in the pool

As a result, the few running Monitoring Engine processes can be overloaded

After being stuck in the queue for more than 5 minutes, the pending tasks are cancelled by the Supervising Engine

The Supervising Engine log files will have multiple repeated lines such as:

About to cancel task (Relator: REL_AIX_DISK_SPACE, server: ARGENTSRV)

Reason: it has been pending for more than 5 minutes

An occasional occurrence of these cancelled tasks is likely due to temporary network congestion or unusually high load on the Monitoring Engine server and does not indicate an issue

However, 20 or more of these cascaded cancelled-task lines in the log file warrant the following resolution
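
The check described above can be sketched as a short script. This is a minimal sketch, not an Argent utility: the function names are hypothetical, the pattern is derived only from the sample log line shown in this article, and real Supervising Engine logs may carry timestamps or other prefixes. It also counts every cancelled-task line in the input rather than only consecutive (cascaded) runs, which is a simplification.

```python
import re
from collections import Counter

# Pattern inferred from the sample line in this article (an assumption about
# the real log layout). search() is used so leading timestamps do not matter.
CANCEL_RE = re.compile(
    r"About to cancel task \(Relator: (?P<relator>[^,]+), server: (?P<server>[^)]+)\)"
)

# Threshold taken from the article: 20 or more cancelled-task lines.
THRESHOLD = 20


def count_cancelled(lines):
    """Count cancelled-task lines per (relator, server) pair."""
    counts = Counter()
    for line in lines:
        match = CANCEL_RE.search(line)
        if match:
            key = (match.group("relator").strip(), match.group("server").strip())
            counts[key] += 1
    return counts


def needs_resolution(counts, threshold=THRESHOLD):
    """Return True when the total number of cancellations reaches the threshold."""
    return sum(counts.values()) >= threshold
```

To use it, open a Supervising Engine log file (its location depends on the installation), pass the file object to count_cancelled, and pass the result to needs_resolution.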

The issue is addressed in Argent AT 3.1A-1401-T4

Resolution

1: Upgrade to Argent AT 3.1A-1401-T4

OR

2: For customers unable to upgrade immediately, use the following workaround:

Disable the Monitoring Engine Recycle process:

To disable, add the following new DWORD value to the Windows Registry:

RULE_ENGINE_MAX_RUN_SECONDS in ALL affected Argent AT products

Set the Decimal value to 0 on EACH Monitoring Engine
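
For sites with many Monitoring Engines, the workaround above could be packaged as a .reg file. The following is a minimal sketch under stated assumptions: build_reg_file is a hypothetical helper, and the registry key path is not given in this article, so it must be supplied by the caller for each affected Argent AT product. Only the value name RULE_ENGINE_MAX_RUN_SECONDS and the value 0 come from the article.

```python
# Value name taken from this article.
VALUE_NAME = "RULE_ENGINE_MAX_RUN_SECONDS"


def build_reg_file(key_path, value=0):
    """Return the text of a .reg file that sets VALUE_NAME as a DWORD.

    key_path is the full registry key for one Argent AT product. This article
    does not state it, so it is a required argument rather than a default.
    """
    if not key_path.startswith("HKEY_"):
        raise ValueError("key_path must be a full path starting with a hive name")
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("a DWORD must fit in 32 bits")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key_path}]",
        # .reg files store DWORDs as 8 hexadecimal digits; decimal 0 is 00000000.
        f'"{VALUE_NAME}"=dword:{value:08x}',
        "",
    ]
    return "\r\n".join(lines)
```

Write the returned text to a file with a .reg extension and import it on EACH Monitoring Engine, once per affected product key. Review the generated file before importing it.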