KBI 311690 Issue Addressed: High Frequency SLA Rule Might Generate False Positives When Monitoring More Than 100 Servers/Devices
Version
Argent Advanced Technology 5.1A-1804-A or below
Date
Thursday, 12 July 2018
Summary
A Service Level Agreement (SLA) Rule checks the up/down status of a target server/device by using ICMP Ping
The Rule can run at high frequency (interval of less than 1 minute) by using the option ‘Run Lightweight SLA Checking With Interval xx Seconds’
However, it can generate false positives when monitoring more than 100 servers/devices
The issue has been addressed in Argent Advanced Technology 5.1A-1807-A
Technical Background
The issue is caused by overload of either the file system (heavy I/O) or the router
The SLA Rule uses ICMP Ping to check the up/down status of a target server/device
Delivery of ICMP traffic is not guaranteed
When a router is overloaded, it can arbitrarily drop ICMP packets
When this happens, the Argent Advanced Technology Engine does not receive an echo reply from the remote server/device, and the server/device is deemed offline
When the high frequency option is used, the Argent Advanced Technology Engine can generate a large number of ping packets in a short period of time, which can overload the router
The Argent Advanced Technology Engine uses work order (WO) files to run monitoring tasks
High frequency monitoring can generate a significant number of WO files every minute
This can overload the file system and make monitoring results unreliable
Argent Advanced Technology 5.1A-1807-A addresses the issue by providing an option to run the high frequency SLA Rule within an internal thread pool
This removes the need for WO files, so the I/O load is greatly reduced
More importantly, controlling the size of the thread pool prevents flooding of ICMP packets
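The effect of bounding the pool size can be illustrated with a minimal Python sketch. It is not Argent's implementation: `run_checks`, `CountingProbe`, and the simulated delay are hypothetical names introduced here, and the probe sleeps instead of sending ICMP. It only shows that with a pool of N workers, no more than N checks are in flight at once, however many targets are queued.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_checks(targets, probe, pool_size=100):
    """Run probe(target) for every target, with at most pool_size probes in flight.

    Returns a dict mapping each target to the probe's result.
    """
    targets = list(targets)
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # pool.map preserves input order, so results line up with targets.
        results = list(pool.map(probe, targets))
    return dict(zip(targets, results))


class CountingProbe:
    """Hypothetical stand-in for a ping; records the peak number of concurrent calls."""

    def __init__(self, delay=0.01):
        self.delay = delay
        self.peak = 0
        self._active = 0
        self._lock = threading.Lock()

    def __call__(self, target):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(self.delay)  # simulate waiting for an echo reply
        with self._lock:
            self._active -= 1
        return True  # every simulated target is reported as "up"


if __name__ == "__main__":
    probe = CountingProbe()
    status = run_checks([f"host{i}" for i in range(200)], probe, pool_size=10)
    print(len(status), "targets checked; peak concurrency:", probe.peak)
```

With 200 simulated targets and a pool of 10, the peak concurrency never exceeds 10, which is the property that keeps the number of outstanding pings bounded.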
Argent Advanced Technology 5.1A-1807-A introduces the DWORD registry entry HKLM\Software\Argent\{PRODUCT}\RUN_DOWN_RULE_THREAD_MODE
It can take the following values:
- 0 – Turn off internal SLA thread pool
This falls back to the behavior of versions earlier than 1807-A: WO files are used
The SLA Rule is executed in the Monitoring Engine process
- 1 – Run High Frequency SLA Rule in internal SLA thread pool
This is the default value
- 2 – Run both SLA Rule and System Down Rule in internal SLA thread pool
Use this option if a system performance issue is identified and the System Down Rule is running against more than 100 servers/devices with an interval of less than 5 minutes
There is another DWORD registry entry HKLM\Software\Argent\{PRODUCT}\RUN_DOWN_RULE_THREAD_LIMIT
It controls the thread pool size
The default value is 100
It usually does not need modification
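As an illustration, both registry values could be set from an elevated command prompt with the standard Windows `reg add` command. The Python sketch below only builds the command lines; it does not run them. `EXAMPLE_PRODUCT` is a hypothetical stand-in for the {PRODUCT} key, the helper names are assumptions introduced here, and the check that the thread limit is at least 1 is also an assumption, not a documented constraint.

```python
# Values of RUN_DOWN_RULE_THREAD_MODE as described in this article.
THREAD_MODES = {
    0: "internal SLA thread pool off; WO files are used",
    1: "High Frequency SLA Rule runs in internal SLA thread pool (default)",
    2: "SLA Rule and System Down Rule both run in internal SLA thread pool",
}


def _reg_add_dword(product, value_name, value):
    """Build (but do not run) a `reg add` command that sets one DWORD value."""
    key = "\\".join(["HKLM", "Software", "Argent", product])
    return ["reg", "add", key, "/v", value_name,
            "/t", "REG_DWORD", "/d", str(value), "/f"]


def thread_mode_command(product, mode):
    """Command line for RUN_DOWN_RULE_THREAD_MODE; mode must be 0, 1, or 2."""
    if mode not in THREAD_MODES:
        raise ValueError(f"mode must be one of {sorted(THREAD_MODES)}, got {mode!r}")
    return _reg_add_dword(product, "RUN_DOWN_RULE_THREAD_MODE", mode)


def thread_limit_command(product, limit=100):
    """Command line for RUN_DOWN_RULE_THREAD_LIMIT (default 100 per this article)."""
    if not isinstance(limit, int) or limit < 1:
        # Assumption: a pool needs at least one thread.
        raise ValueError("limit must be a positive integer")
    return _reg_add_dword(product, "RUN_DOWN_RULE_THREAD_LIMIT", limit)


if __name__ == "__main__":
    # EXAMPLE_PRODUCT is a placeholder, not a real Argent product key.
    print(" ".join(thread_mode_command("EXAMPLE_PRODUCT", 2)))
    print(" ".join(thread_limit_command("EXAMPLE_PRODUCT")))
```

Replace `EXAMPLE_PRODUCT` with the actual product key name for the installation before using the generated command.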
Resolution
Upgrade to Argent Advanced Technology 5.1A-1807-A or above
For customers who cannot upgrade immediately, the issue can be alleviated by using Dynamic Scheduling in the Relator that runs the high frequency SLA Rule