KBI 310668 Issue Addressed: Checkpoint Or Deadlock Errors When A Large Number Of Tasks Are Scheduled

Version

Argent Advanced Technology 3.1A-1308-A or earlier

Date

Friday, 13 September 2013

Summary

The Argent AT Engine uses the job watermark SQL table (ARGSOFT_XXXX_JOBWATERMARK) to hold the scheduled tasks.

When a large number of tasks is scheduled (over 10,000 rows), updating the table can take long enough to lock the shared resource for more than five minutes, which causes either a checkpoint error or a deadlock error.

Technical Background

The Argent AT Engine checks the Relator SQL table and updates runtime information once every minute.

It locks the internal Relator data structure while doing so.

When a large number of tasks is scheduled, updating the SQL table can take over five minutes, which is the threshold for checkpoint errors and deadlock detection.

As a result, the services are recycled.
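The five-minute threshold described above can be illustrated with a minimal sketch. This is not the Argent implementation: the function name, the constant, and the use of plain second counts are assumptions made only to show how a watchdog might decide that a service has stopped checkpointing.

```python
# Hypothetical constant: the five-minute threshold mentioned in this article.
CHECKPOINT_TIMEOUT_SECONDS = 5 * 60


def needs_recycle(last_checkpoint, now, timeout=CHECKPOINT_TIMEOUT_SECONDS):
    """Return True when more than `timeout` seconds have passed since the
    last checkpoint, i.e. when a watchdog would recycle the service."""
    return (now - last_checkpoint) > timeout


# A table update that holds things up for six minutes trips the threshold.
stalled = needs_recycle(last_checkpoint=0, now=6 * 60)
# An update that finishes in one minute does not.
healthy = needs_recycle(last_checkpoint=0, now=60)
```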

The situation gets worse when SQL Server is under heavy load, running slowly, or running on older hardware.

In the worst case, the service recycles continuously and no monitoring tasks can run.

The issue is most likely to occur with the Argent Guardian Ultra and Argent for SNMP because of the number of licensed nodes and Relators.

For example, assume a customer has 1,000 IP addresses and 10 Relators, and that each Relator includes all 1,000 nodes. This translates to 10,000 scheduled tasks.
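The arithmetic in this example can be written out as a short sketch. The function name is hypothetical, and the sketch assumes that every Relator/node pair produces exactly one scheduled task, as the example implies.

```python
def scheduled_task_count(nodes_per_relator):
    """Total scheduled tasks, assuming one task per Relator/node pair.

    `nodes_per_relator` holds the node count of each Relator.
    """
    return sum(nodes_per_relator)


# 10 Relators, each including all 1,000 nodes.
relators = [1000] * 10
total = scheduled_task_count(relators)
print(total)
```

Adding one more Relator that covers all 1,000 nodes adds another 1,000 rows, so the table grows with both the node count and the number of Relators.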

Version 3.1A-1310-A addresses this by moving the job watermark table update out of the protected area, so the shared resource is no longer locked during this lengthy update.

Also, the checkpoint event is set periodically during the update. As a result, both checkpoint errors and deadlocks are avoided.
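The general pattern behind this enhancement can be sketched as follows. This is an illustration only, not Argent code: the class, its methods, the in-memory list standing in for the SQL table, and the checkpoint interval of 1,000 rows are all assumptions. The idea is to hold the lock just long enough to copy the data, do the slow write outside the lock, and signal a checkpoint at regular intervals.

```python
import threading
import time


class Engine:
    """Hypothetical engine illustrating the 'update outside the lock' pattern."""

    def __init__(self, tasks):
        self._lock = threading.Lock()   # guards the in-memory task structure
        self._tasks = dict(tasks)       # task id -> runtime information
        self.written = []               # stand-in for the job watermark table
        self.checkpoint_count = 0
        self.last_checkpoint = time.monotonic()

    def checkpoint(self):
        # Record that the service is still making progress.
        self.checkpoint_count += 1
        self.last_checkpoint = time.monotonic()

    def _write_row(self, task_id, info):
        # Stand-in for a slow SQL update of one row.
        self.written.append((task_id, info))

    def update_watermarks(self, checkpoint_every=1000):
        # Hold the lock only while copying the shared data.
        with self._lock:
            snapshot = list(self._tasks.items())

        # The lengthy update runs outside the lock, so other threads
        # can keep using the shared structure.
        for i, (task_id, info) in enumerate(snapshot, start=1):
            self._write_row(task_id, info)
            if i % checkpoint_every == 0:
                self.checkpoint()

        self.checkpoint()  # final checkpoint once the update completes
        return len(snapshot)


engine = Engine({i: "idle" for i in range(10_000)})
rows = engine.update_watermarks()
```

One trade-off of this pattern is that the rows written reflect the snapshot, so changes made to the shared structure during the update are picked up on the next cycle.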

Resolution

Upgrade to Argent AT 3.1A-1310-A or later.