June 17, 2009 at 22:10 #1679
…and by “forever,” I actually mean forever. It stops sending results even though the agent process continues to run.
We have an installation of the most recent version of NSClient++ (0.3.6 stable) running on a few dozen servers, and most of them work fine, including some 64-bit and some 2008 servers. However, there is one in particular where the agent stops reporting for no apparent reason after only a day or so. The monitoring server is watching for passive checks only, and so it thinks the server is down even though it’s running just fine. This is obviously not ideal.
Here’s an excerpt from the log on the server with the agent:
2009-06-17 06:06:21: debug:modules\NSCAAgent\NSCAThread.cpp:245: Sending to server…
2009-06-17 06:06:21: debug:modules\NSCAAgent\NSCAThread.cpp:252: Looked up [HOST] to [IP]
2009-06-17 06:06:25: error:modules\DebugLogMetrics\PDHCollector.cpp:216: Failed to query performance counters: PdhCollectQueryData failed: : -2147481643: No data to return.
2009-06-17 11:28:26: debug:NSClient++.cpp:753: No shared session: ignoring change event!
(host/IP info hidden for security, but it does resolve correctly)
The second line (“Looked up …”) is the last real NSCA activity in the log. The next line, about performance counters, is repeated very often throughout the log, both before and after it stops sending results. This indicates to me that the service is still running (the Services management console shows the same thing). The last line shows up rarely, but every so often, after it stops sending results. It never executes any more checks, and never tries to send any results.
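The same reading of the log can be done mechanically. The following is a minimal sketch, assuming every log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp as in the excerpt above and that any line mentioning `NSCAThread.cpp` counts as NSCA activity. The function name `last_nsca_activity` is hypothetical and not part of NSClient++.

```python
from datetime import datetime

def last_nsca_activity(lines, marker="NSCAThread.cpp"):
    """Return the timestamp of the last line containing marker, or None."""
    last = None
    for line in lines:
        if marker not in line:
            continue
        try:
            # The first 19 characters hold "YYYY-MM-DD HH:MM:SS".
            last = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            # Skip lines that do not start with a timestamp.
            continue
    return last

sample = [
    r"2009-06-17 06:06:21: debug:modules\NSCAAgent\NSCAThread.cpp:245: Sending to server…",
    r"2009-06-17 06:06:21: debug:modules\NSCAAgent\NSCAThread.cpp:252: Looked up [HOST] to [IP]",
    r"2009-06-17 06:06:25: error:modules\DebugLogMetrics\PDHCollector.cpp:216: Failed to query performance counters",
]
print(last_nsca_activity(sample))  # 2009-06-17 06:06:21
```

Comparing the returned timestamp with the current time would show how long the agent has been silent while the service itself still logs other messages.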
I looked at the NSCAThread.cpp code for any help, and nothing jumped out at me. I’m unfamiliar with the socket code. Is there any way it could be blocking somehow, attempting a connection with no timeout value in such a way that it never stops trying? Any other possible lock/hang points?
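To illustrate the kind of hang being asked about: a blocking connect with no timeout can wait a very long time if the peer never answers. The sketch below is in Python, not the C++ of NSClient++, and does not reflect what `NSCAThread.cpp` actually does. The helper name `connect_with_timeout` and the 5-second default are assumptions chosen for the example.

```python
import socket

def connect_with_timeout(host, port, timeout_s=5.0):
    """Return a connected socket, or None if the attempt fails or times out."""
    try:
        # create_connection applies the timeout to the connect attempt itself.
        sock = socket.create_connection((host, port), timeout=timeout_s)
    except OSError:
        # Covers refused connections, name lookup failures, and timeouts.
        return None
    # Keep the same bound on later send/recv calls so they cannot block forever.
    sock.settimeout(timeout_s)
    return sock
```

With a bound like this, a sender thread that cannot reach the server gives up and can retry on the next interval instead of stalling.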
Jeff
July 14, 2009 at 14:29 #7619
I have the same problem here:
I think the error message comes when you log in to the server.
Restarting the nsclientpp service resolves the problem temporarily.
I tried nc_net with my setup and experienced the same kind of problems, so I’m not sure whether this is an NSClient++ problem.
July 14, 2009 at 19:13 #7621
What does the debug log say?
// MickeM
July 15, 2009 at 08:14 #7622
Last lines are always:
2009-07-08 21:47:27: debug:modules\NSCAAgent\NSCAThread.cpp:182: Sending to server…
2009-07-08 21:47:27: debug:modules\NSCAAgent\NSCAThread.cpp:189: Looked up xxx.xxx.xxx.xxx to xxx.xxx.xxx.xxx
And nothing after that.
Checks may work for only 2 hours or they may work for a week, but eventually all servers stop sending.
I also tried with servers on the same network as the Nagios server, with the same result.
My active servers work 100% with NSClient++.
I have 0.3.6 clients on Windows 2003 32-bit servers. It was the same with 0.3.5.
July 15, 2009 at 09:40 #7623
If you are interested, I could hook you up with a build which logs more, and we can see if you can help me track down the problem…
// MickeM
July 15, 2009 at 10:29 #7624
After browsing the code, I think the problem is in reading the input packet, which will (I think) read and read and read until done, so if it never gets done it will never finish.
But it is just a theory, so I would need to verify it (and hopefully fix it).
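The theory above, a read loop that only exits once the whole packet has arrived, can be made concrete with a sketch. This is Python, not the NSClient++ C++ code, and it is only one possible way to bound such a loop. The name `recv_exact` and the 10-second default deadline are assumptions for the example.

```python
import socket
import time

def recv_exact(sock, nbytes, deadline_s=10.0):
    """Read exactly nbytes, but never keep looping past deadline_s seconds."""
    deadline = time.monotonic() + deadline_s
    buf = bytearray()
    while len(buf) < nbytes:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("packet not complete before deadline")
        # Bound each recv by the time left, so the loop as a whole is bounded.
        sock.settimeout(remaining)
        try:
            chunk = sock.recv(nbytes - len(buf))
        except socket.timeout:
            raise TimeoutError("packet not complete before deadline") from None
        if not chunk:
            # An empty read means the peer closed the connection early.
            raise ConnectionError("peer closed before packet was complete")
        buf.extend(chunk)
    return bytes(buf)
```

Without the deadline and the empty-read check, a peer that sends only part of a packet and then goes quiet would keep the loop waiting indefinitely, which matches the symptom of a last “Looked up” line followed by nothing.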
// MickeM
July 15, 2009 at 13:52 #7625
Yes, I can do that with a few servers.
July 17, 2009 at 13:08 #7630
I have about 20 servers in my setup right now, and all stop sending passive checks at some point. All are using the 0.3.6 NSClient++ service.
Is this problem also in older versions?
Leon
July 17, 2009 at 13:50 #7632
Yes, I have tried two older versions, same problem.
July 17, 2009 at 19:11 #7635
If this is what I think it is, it will be present in all versions of NSClient++ and possibly affect other parts as well (for instance the NRPE parts).
I shall see if I can do a workaround for this in the next version (it will be out after the weekend as a nightly), but for the 0.4.x branch there will be a new socket subsystem which I hope solves this issue permanently…
// Michael Medin
July 19, 2009 at 16:21 #7641
Check now, hopefully fixed in the latest nightly build…
// Michael Medin
July 20, 2009 at 12:29 #7644
I installed it on 8 servers, and will keep you informed.
Mikko
July 20, 2009 at 19:14 #7650
I have also installed it on our problem server, and I’ll report back about whether it fails again or seems to be stable. Thanks!
July 21, 2009 at 09:46 #7652
Running the nightly build for 24 hours without problems. I have now installed it on 10 more servers at a different site.
Btw, I had problems in this other domain with the 0.3.7 MSI installer; almost all servers failed with the error message:
1: Failed to install firewall exception: get_LocalPolicy failed: -2147023143: There are no more endpoints available from the endpoint mapper
I did a 0.3.6 install and copied the nightly from the .zip over it; that works OK.
July 21, 2009 at 09:58 #7653
Yes, the copy works.
And the issue is related to a disabled (?) Windows firewall; the new installer features a Windows firewall “add exception thingy”, but it is bleeding edge, so not 100% ironed out.
On the servers “which work”, check the nsc.log file for any NSCA-related errors (I think the problem should now manifest itself as an error).
// Michael Medin