KBI 310184 Intermittent Or Permanent Failure To Monitor Some Or All Remote Servers

Version

All Versions

Date

4 Jun 2010

Summary

Intermittent or permanent failure to monitor some or all remote servers.

Technical Background

Each time a TCP/IP connection is made, a Windows socket is created.

A TCP/IP connection can be made to a remote machine or to the local machine; both kinds create sockets.

A socket works like this:

     Connect, Open Socket, Perform Task, Close Socket.

When the socket is closed, Windows doesn’t remove the socket immediately. It puts the socket into a state called TIME_WAIT.

While in the TIME_WAIT state, the socket still consumes resources. By default, it is removed after 240 seconds (4 minutes).

The issue seems to be that sockets are created faster than the operating system releases them.

This can happen on a heavily loaded system that makes very frequent TCP connections.

When the number of sockets reaches the maximum (a few thousand), no new sockets can be created.
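The relationship between connection rate and socket exhaustion can be sketched with simple arithmetic. The 240-second delay comes from this article; the limit of 4000 sockets is only an assumed stand-in for "a few thousand", and the function names are illustrative.

```python
# Rough estimate of how many sockets sit in TIME_WAIT at a given connection rate.
TIME_WAIT_SECONDS = 240  # default delay quoted in this article
MAX_SOCKETS = 4000       # ASSUMPTION: illustrative stand-in for "a few thousand"


def steady_state_time_wait(connections_per_second, time_wait_seconds=TIME_WAIT_SECONDS):
    """Sockets held in TIME_WAIT once connections have been opened and closed
    at a constant rate for longer than the TIME_WAIT delay."""
    return connections_per_second * time_wait_seconds


def max_sustainable_rate(max_sockets=MAX_SOCKETS, time_wait_seconds=TIME_WAIT_SECONDS):
    """Highest constant connection rate that stays under the socket limit."""
    return max_sockets / time_wait_seconds


for rate in (5, 10, 20, 50):
    waiting = steady_state_time_wait(rate)
    status = "exhausted" if waiting >= MAX_SOCKETS else "ok"
    print(f"{rate} connections/s: {waiting} sockets in TIME_WAIT ({status})")
```

Under these assumptions, a monitoring host that opens more than roughly 16 short-lived connections per second would eventually run out of sockets.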

To check whether socket exhaustion is the root cause, run:

     netstat

If there are thousands of connections in the TIME_WAIT state, this is likely the cause.
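Counting the TIME_WAIT entries by hand is tedious, so a small script can tally them. This is a minimal sketch that assumes the Windows `netstat -an` layout, where each TCP row reads protocol, local address, foreign address, state. The function names are illustrative and not part of any product.

```python
import subprocess
from collections import Counter


def count_tcp_states(netstat_output):
    """Tally TCP connection states from 'netstat -an' text.

    ASSUMPTION: Windows-style rows such as
        TCP    10.0.0.5:1044    10.0.0.9:80    TIME_WAIT
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and UDP rows (which carry no state column).
        if len(fields) >= 4 and fields[0].upper().startswith("TCP"):
            states[fields[3]] += 1
    return states


def run_netstat():
    """Run 'netstat -an' on the local machine and return its text output."""
    result = subprocess.run(["netstat", "-an"], capture_output=True, text=True, check=True)
    return result.stdout
```

On the affected machine, `count_tcp_states(run_netstat())["TIME_WAIT"]` gives the number to compare against "thousands".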

While the issue is occurring, an alternative test is to telnet from the “bad” machine to any external machine.

Example:

     telnet www.yahoo.com 80
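Where a telnet client is not installed, the same outbound check can be approximated with a short script. This is a sketch, not an official diagnostic; the function name `can_connect` and the 5-second timeout are assumptions chosen for illustration.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if an outbound TCP connection to host:port succeeds.

    Any OSError (refused, timed out, name lookup failure, or no local
    socket available) is reported as False.
    """
    try:
        # create_connection opens a new outbound socket, like telnet does.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect("www.yahoo.com", 80)` on the “bad” machine mirrors the telnet example above. A False result does not by itself prove socket exhaustion, since firewalls or DNS problems produce the same outcome, so it is best read together with the netstat output.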

Remote registry and PerfMon are NOT good tests, as they do NOT use TCP connections (they use the Windows API), and no sockets are created.

To reiterate, while the TIME_WAIT issue is occurring, any TCP connection FROM the “bad” machine to a remote machine should, in theory, always fail.

However, connecting from a “good” machine into the “bad” machine may work, as the sockets for receiving are different from the sockets for initiating.

Resolution

N/A