KBI 311690 Issue Addressed: High Frequency SLA Rule Might Generate False Positives When Monitoring More Than 100 Servers/Devices

Version

Argent Advanced Technology 5.1A-1804-A or below

Date

Thursday, 12 July 2018

Summary

The Service Level Agreement (SLA) Rule checks the up/down status of a target server/device by using ICMP Ping

The Rule can run at high frequency (intervals shorter than 1 minute) by using the option ‘Run Lightweight SLA Checking With Interval xx Seconds’

However, it might generate false positives when monitoring more than 100 servers/devices

The issue has been addressed in Argent Advanced Technology 5.1A-1807-A

Technical Background

The issue is caused by system overload, either from heavy file I/O on the monitoring system or from an overloaded router

The SLA Rule uses ICMP Ping to check the up/down status of the target server/device

Delivery of ICMP traffic is not guaranteed

When a router is overloaded, it can arbitrarily drop ICMP packets

When this happens, the Argent Advanced Technology Engine does not receive an echo reply from the remote server/device, so the server/device is deemed offline

When the high frequency option is used, the Argent Advanced Technology Engine can generate a large number of ping packets in a short period of time, which can overload the router

The Argent Advanced Technology Engine uses work order (WO) files to run monitoring tasks

High frequency monitoring can generate a significant number of WO files every minute

This can overload the file system and make monitoring results unreliable
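
To give a rough sense of scale, the Python sketch below estimates how many checks a high frequency SLA Rule starts per minute. It assumes, purely for illustration, that every check sends at least one ping and creates one WO file; the real ratio is not stated in this article, and the function name is hypothetical.

```python
def checks_per_minute(device_count: int, interval_seconds: int) -> int:
    """Estimate how many SLA checks start per minute.

    Assumption (not confirmed by the product documentation): each check
    corresponds to one ping and one work order (WO) file.
    """
    if device_count < 0:
        raise ValueError("device_count must not be negative")
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # Each device is checked 60 / interval_seconds times per minute.
    return device_count * 60 // interval_seconds


if __name__ == "__main__":
    # 100 devices checked every 10 seconds
    print(checks_per_minute(100, 10))  # prints 600
```

Under these assumptions, shortening the interval or adding devices raises both the ping volume and the WO file count in direct proportion.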

Argent Advanced Technology 5.1A-1807-A addresses the issue by providing an option to run the high frequency SLA Rule within an internal thread pool

This removes the need for WO files, so the I/O load is greatly reduced

More importantly, controlling the size of the thread pool avoids flooding the network with ICMP packets
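
The effect of a bounded thread pool can be sketched in Python as follows. This is only an illustration of the general technique, not the Engine's actual implementation: fake_probe is a hypothetical stand-in for an ICMP echo (it sleeps instead of sending a packet), and the names check_all and InFlightCounter are invented for this example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

THREAD_LIMIT = 100  # mirrors the documented default thread pool size


class InFlightCounter:
    """Tracks how many probes are running at once and the peak seen."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        return self

    def __exit__(self, *exc_info):
        with self._lock:
            self.current -= 1
        return False


def fake_probe(target, counter, delay=0.01):
    """Hypothetical stand-in for an ICMP echo; always reports 'up'."""
    with counter:
        time.sleep(delay)  # simulates waiting for the echo reply
        return target, True


def check_all(targets, thread_limit=THREAD_LIMIT):
    """Probe every target, never running more than thread_limit at once."""
    counter = InFlightCounter()
    with ThreadPoolExecutor(max_workers=thread_limit) as pool:
        results = dict(pool.map(lambda t: fake_probe(t, counter), targets))
    return results, counter.peak


if __name__ == "__main__":
    hosts = [f"host-{i}" for i in range(500)]
    results, peak = check_all(hosts, thread_limit=100)
    print(len(results), "targets checked; peak concurrent probes:", peak)
```

However many targets are queued, the number of probes in flight never exceeds the pool size, which is what keeps the burst of ping packets bounded.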

Argent Advanced Technology 5.1A-1807-A introduces the DWORD registry entry HKLM\Software\Argent\{PRODUCT}\RUN_DOWN_RULE_THREAD_MODE

It can take the following values:

  • 0 – Turn off internal SLA thread pool

    This falls back to the behavior of versions earlier than 1807-A: WO files are used

    The SLA Rule is executed in the Monitoring Engine process

  • 1 – Run High Frequency SLA Rule in internal SLA thread pool

    This is the default value

  • 2 – Run both SLA Rule and System Down Rule in internal SLA thread pool

    Use this option if a system performance issue is identified and the System Down Rule is running against more than 100 servers/devices at an interval of less than 5 minutes

There is another DWORD registry entry, HKLM\Software\Argent\{PRODUCT}\RUN_DOWN_RULE_THREAD_LIMIT

It controls the thread pool size

The default value is 100

It usually does not need modification
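
The two registry entries can be summarized with a small Python model. This is a sketch for reasoning about the documented values only: it does not read the registry, the class and function names are hypothetical, and rejecting undocumented values is an assumption of this example rather than documented product behavior.

```python
from enum import IntEnum


class ThreadMode(IntEnum):
    """Documented values of RUN_DOWN_RULE_THREAD_MODE."""
    OFF = 0                  # old behavior: WO files, Monitoring Engine process
    HIGH_FREQUENCY_SLA = 1   # default
    SLA_AND_SYSTEM_DOWN = 2


DEFAULT_MODE = ThreadMode.HIGH_FREQUENCY_SLA
DEFAULT_THREAD_LIMIT = 100   # documented default of RUN_DOWN_RULE_THREAD_LIMIT


def rules_in_thread_pool(mode_value=DEFAULT_MODE):
    """Return the Rule types that run in the internal SLA thread pool."""
    # Assumption: values other than 0, 1, 2 are rejected here. How the
    # product itself treats such values is not stated in this article.
    mode = ThreadMode(mode_value)
    if mode is ThreadMode.OFF:
        return set()
    if mode is ThreadMode.HIGH_FREQUENCY_SLA:
        return {"High Frequency SLA Rule"}
    return {"SLA Rule", "System Down Rule"}


if __name__ == "__main__":
    for value in (0, 1, 2):
        print(value, sorted(rules_in_thread_pool(value)))
```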

Resolution

Upgrade to Argent Advanced Technology 5.1A-1807-A or above

Customers who cannot upgrade immediately can alleviate the issue by using Dynamic Scheduling in the Relator that runs the high frequency SLA Rule