Ticket #397 (closed defect: fixed)

Opened 19 months ago

Last modified 7 months ago

NSClient++ memory leaks

Reported by: roman
Owned by: mickem
Priority: 1
Milestone: 0.3.9
Component: CheckSystem
Version: 0.3.8
Severity: Bugs
Keywords: memory leak
Cc:

Description

Recently I installed NSClient++-0.3.7 x64 on numerous Windows Server 2008 and 2008 R2 servers. After a couple of weeks I noticed that some servers showed NSClient++ using more than 1GB of memory. A restart brings it back down to about 15MB, but memory usage then slowly increases again. This is more evident on servers where a large number of services are being monitored.

I set up a test instance on a Server 2008 R2 box with NSClient++ 0.3.8 x64 installed, which seems to have the same memory leak as the 0.3.7 version. I created a loop that would execute a CheckProcState query every 5 seconds from a different server and left it running overnight. In about 16 hours, the NSClient++ memory went up from 16MB to 311MB. After I stopped the NRPE queries, the memory stayed steady at 311MB.
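Editor's sketch of such a reproduction loop, for anyone who wants to repeat the test. It is not the reporter's actual script: the host name and the `explorer.exe=started` argument are hypothetical placeholders, and it assumes the standard `check_nrpe` plugin with its `-H`, `-c` and `-a` options is on the PATH of the polling machine.

```python
import subprocess
import time


def build_check_command(host, command, args=()):
    """Build a check_nrpe command line as an argument list."""
    cmd = ["check_nrpe", "-H", host, "-c", command]
    if args:
        # -a passes the remaining values as arguments to the remote command
        cmd += ["-a", *args]
    return cmd


def poll(cmd, interval_s=5, count=3, run=subprocess.run, sleep=time.sleep):
    """Run cmd `count` times, pausing interval_s seconds between runs."""
    results = []
    for i in range(count):
        results.append(run(cmd, capture_output=True, text=True))
        if i < count - 1:
            sleep(interval_s)
    return results


if __name__ == "__main__":
    # Hypothetical target host and process argument; adjust to your setup.
    command = build_check_command(
        "win2008r2-test", "CheckProcState", ["explorer.exe=started"]
    )
    # 16 hours at one query every 5 seconds, as in the report above.
    poll(command, interval_s=5, count=16 * 3600 // 5)
```

The `run` and `sleep` parameters are only there so the loop can be exercised without a Nagios host; watching the NSClient++ process memory on the target is still a manual step.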

One thing I noticed is that the client does not leak any memory when it is idle.
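Since the memory only grows while queries are running, the figures above allow a rough per-query estimate. The following is a back-of-the-envelope calculation that assumes exactly 16 hours, one query every 5 seconds, and a leak spread evenly over all queries; the report only says "about 16 hours", so treat the result as an order of magnitude.

```python
HOURS = 16        # test duration from the report ("about 16 hours")
INTERVAL_S = 5    # one CheckProcState query every 5 seconds
START_MB = 16     # memory at the start of the test
END_MB = 311      # memory when the queries were stopped

queries = HOURS * 3600 // INTERVAL_S
leaked_mb = END_MB - START_MB
kb_per_query = leaked_mb * 1024 / queries

print(f"{queries} queries, {leaked_mb} MB leaked, ~{kb_per_query:.1f} KB per query")
# prints: 11520 queries, 295 MB leaked, ~26.2 KB per query
```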

Below I included my settings and an excerpt from the debug log file, which shows another potential issue with process enumeration when checking status of a specific process. Please let me know if I can assist in any way. I would have provided the process_info.csv that gets generated by the A_DebugLogMetrics module, but it doesn't seem to generate the file when running NSClient++ with the -test switch.

My registry settings:

[HKEY_LOCAL_MACHINE\SOFTWARE\NSClient++]

[HKEY_LOCAL_MACHINE\SOFTWARE\NSClient++\External Alias]
"alias_cpu"="checkCPU warn=80 crit=90"
"alias_disk"="CheckDriveSize MinWarn=10% MinC"
"alias_file_age"="checkFile2 filter=out \"file=$ARG1$\" filter-written="
"alias_file_size"="checkFile2 filter=out \"file=$ARG1$\" filter-size=>$A"
"alias_file_size_in_dir"="checkFile2 filter=out pattern=*.txt \"file=$ARG1$\" filter-s"
"alias_process"="checkProcState"
"alias_service"="checkServiceS"
"check_ok"="CheckOK Every"

[HKEY_LOCAL_MACHINE\SOFTWARE\NSClient++\modules]
"A_DebugLogMetrics.dll"=""
"CheckDisk.dll"=""
"CheckEventLog.dll"=""
"CheckExternalScripts.dll"=""
"CheckHelpers.dll"=""
"CheckSystem.dll"=""
"CheckTaskSched.dll"=""
"CheckWMI.dll"=""
"FileLogger.dll"=""
"NRPEListener.dll"=""
"NSClientListener.dll"=""

[HKEY_LOCAL_MACHINE\SOFTWARE\NSClient++\NRPE]
"allow_arguments"=dword:00000001
"allow_nasty_meta_chars"=dword:00000000
"port"=dword:00001622
"script_dir"="scripts\"

[HKEY_LOCAL_MACHINE\SOFTWARE\NSClient++\NSClient]
"port"=dword:000004e0

[HKEY_LOCAL_MACHINE\SOFTWARE\NSClient++\Settings]
"allowed_hosts"="10.30.40.50"
"use_file"=dword:00000000
"use_reg"=dword:00000001

[HKEY_LOCAL_MACHINE\SOFTWARE\NSClient++\External Script]
"allow_arguments"=dword:00000001
"allow_nasty_meta_chars"=dword:00000000
"script_dir"="scripts\"

[HKEY_LOCAL_MACHINE\SOFTWARE\NSClient++\External Scripts]
"check_logs"="scripts\check_logfiles.exe -f etc\$ARG1$.cfg"
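Editor's note on reading the settings above: registry exports write DWORD values in hexadecimal, so the two "port" entries are not decimal port numbers. A minimal sketch of decoding them follows; the helper name `decode_dword` is made up for this illustration and only handles the `"name"=dword:xxxxxxxx` form.

```python
import re

# Matches lines such as: "port"=dword:00001622
DWORD_RE = re.compile(r'^"(?P<name>[^"]+)"=dword:(?P<hex>[0-9a-fA-F]{8})$')


def decode_dword(line):
    """Return (name, integer value) for a DWORD line, or None if it is not one."""
    m = DWORD_RE.match(line.strip())
    if m is None:
        return None
    return m.group("name"), int(m.group("hex"), 16)


print(decode_dword('"port"=dword:00001622'))  # ('port', 5666)
print(decode_dword('"port"=dword:000004e0'))  # ('port', 1248)
```

So the NRPE section listens on port 5666 and the NSClient section on port 1248.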

Attachments

nsclient.log (169.4 KB) - added by roman 19 months ago.
Small sample of nsclient.log that is at 1.1GB after 16 hours.

Change History

Changed 19 months ago by roman

Small sample of nsclient.log that is at 1.1GB after 16 hours.

comment:1 Changed 19 months ago by PaddyX

Hello,

I see the same effect with some CheckCounter queries on a Windows 2008 x64 machine. I have tried versions 0.3.7 x64/x32 and 0.3.8 x64 with the same result.

The nsclient.log tells me only normal messages like this:


2010-07-27 15:41:14: debug:NSClient++.cpp:1142: Injected Result: OK 'OK: diskqueue: 0'
2010-07-27 15:41:14: debug:NSClient++.cpp:1143: Injected Performance Result: diskqueue'=0;0;99; '
2010-07-27 15:41:14: debug:NSClient++.cpp:1106: Injecting: CheckCounter: Counter:diskqueue=\DataCore Disk(_Total)\Current Disk Queue Length, ShowAll, MaxCrit=99


I created loops that execute the command above via check_nrpe. After 30 minutes the memory consumption of NSClient++ is about 50 MB; over one day it grows to more than 1.5 GB.

It seems that the leak depends on which counter is requested. Querying other counters, for example:

"Counter:recovery=\\DataCore Mirroring(_Total)\\% Recovery Progress" ShowAll MinCrit=99

doesn't produce a memory leak.

Any ideas?

comment:2 Changed 19 months ago by mickem

  • Status changed from new to closed
  • Resolution set to worksforme

I know that HP Insight Manager has previously caused similar problems, and I believe it is related to the "provider" (since, for instance, no Microsoft counters produce this).

Regardless of this, you might want to try one of the new PDH subsystem options to resolve this:

[Check System]
...
pdh_subsystem=thread-safe

Another option is to use the check command line (and wrap it as a script).

I would, if possible, like to have a reproducible scenario (which does not require me to buy any specific hardware) since it is very hard to debug it "blindly".

Michael Medin

comment:3 Changed 19 months ago by PaddyX

Thanks for your answer!

The option:

pdh_subsystem=thread-safe

does not improve the behaviour. But you are right, I have now seen the same effect with the Perfmon tool from Microsoft.

To reproduce this scenario, you need the DataCore software SANMelody or SANSymphony from www.datacore.com. There should be a way to get an evaluation version. The whole system can be installed in a virtual machine. No specific hardware is needed.

comment:4 Changed 19 months ago by mickem

Humm..

Well, if you get the exact same results in perfmon then I guess your best bet is to wrap it in an external script and file a bug report with DataCore.

I shall see if I have time; it would be interesting to see if I could fix this somehow... but nothing that will happen "tomorrow" so to speak :P

Running

nsclient++ -m CheckSystem.dll -c CheckCounter ...

Should give you the result you want as an external script (i.e. forking and reclaiming memory).

So doing:

[External Scripts]
my_broken_check=nsclient++ -m CheckSystem.dll -c CheckCounter ...

Should allow you to call "my_broken_check" and get rid of the memory leaks (sacrificing performance, simplicity and what not...)

Michael Medin

comment:5 Changed 19 months ago by PaddyX

Thanks for your answer!

I have tried to call nsclient++ as an external script, but I have several problems:

The -m flag as you described it is not recognized, so I tried:

nsclient++.exe -noboot CheckSystem -c CheckCounter "Counter:diskread=\\DataCore Disk(_Total)\\Avg. Disk sec/Read" ShowAll

but on the Nagios host I only get this via NRPE:

No output available from command

To see some debug output, I tried calling NSClient++ directly on the Windows machine:

C:\Program Files\NSClient>nsclient++.exe -c CheckCounter "Counter:diskread=\\DataCore Disk(_Total)\\Avg. Disk sec/Read" ShowAll
e \CheckSystem.cpp(1082) ERROR: Counter not found: '''\\DataCore''': Unable to parse the counter path. Check the format and syntax of the  specified path.   (C0000BC0)
CRIT: Counter not found: \\DataCore: Unable to parse the counter path. Check the format and syntax of the  specified path.   (C0000BC0)|

It seems that the spaces in the counter path cause the problem. I have tried quotes, without effect.

Any hints?

Patrick

comment:6 Changed 19 months ago by mickem

  • Status changed from closed to reopened
  • Resolution worksforme deleted

Which version?

the -m option is from the latest release.

Anyways, I tried this and the -m flag seems to have a set of issues especially with parsing the ":s since it does not...

You can get around this with indexes:

C:\source\nscp\branches\stable\stage\x64\binaries>nsclient++ -m CheckSystem.dll -c CheckCounter \1450(_Total)\1458 ShowAll index MaxWarn=10
perf data: 1
OK: \Utskriftskö(_Total)\Utskrifter: 0|'\Utskriftskö(_Total)\Utskrifter'=0;10;0

But then you get the info message (which I need to remove) "perf data: 1" which might cause you issues.

To find indexes:

nsclient++ -noboot CheckSystem pdhlookup Utskrifter
l \CheckSystem.cpp(273) --+--[ Lookup Result ]----------------------------------------
l \CheckSystem.cpp(274)   | Index for 'Utskrifter' is 1458
l \CheckSystem.cpp(275) --+-----------------------------------------------------------

Michael Medin

comment:7 Changed 19 months ago by PaddyX

Thanks for your answer.

I have a question about the pdhlookup Feature. How can I find out the indexes for

\DataCore Disk(_Total)\Current Disk Queue Length

I tried:

nsclient++ -noboot CheckSystem pdhlookup "Current Disk Queue Length"

It gives me

C:\temp\NSClient++-0.3.8-x64-20100728-1244>nsclient++.exe -noboot CheckSystem pdhlookup "Current Disk Queue Length"
l \CheckSystem.cpp(273) --+--[ Lookup Result ]----------------------------------------
l \CheckSystem.cpp(274)   | Index for 'Current Disk Queue Length' is 198
l \CheckSystem.cpp(275) --+-----------------------------------------------------------

But I have several values with "Current Disk Queue Length"

So the next step:

nsclient++ -noboot CheckSystem pdhlookup "DataCore Disk(_Total)"

doesn't find the index.

???

comment:8 Changed 19 months ago by mickem

Indexes are simply numbers which replace the words...

So "Current Disk Queue Length = 198" means you replace Current Disk Queue Length with 198 in the counter path:

\DataCore Disk(_Total)\198

Then you need to look up DataCore Disk as well... (and possibly _Total)

Michael Medin
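Editor's sketch of the substitution described in comment 8. This is not NSClient++ code: `to_index_path` is a hypothetical helper that assumes a local path of the form `\Object(Instance)\Counter` with single backslashes, leaves the instance name untouched, and does not handle machine-qualified paths. The index values in the example are the ones shown in comment 6.

```python
import re

# \Object(Instance)\Counter, where "(Instance)" is optional.
PATH_RE = re.compile(
    r"^\\(?P<obj>[^\\(]+)(?:\((?P<inst>[^)]*)\))?\\(?P<ctr>.+)$"
)


def to_index_path(path, indexes):
    """Replace the object and counter names in a counter path by their indexes."""
    m = PATH_RE.match(path)
    if m is None:
        raise ValueError(f"not a counter path: {path!r}")
    obj = indexes[m.group("obj")]   # KeyError if the name was never looked up
    ctr = indexes[m.group("ctr")]
    inst = m.group("inst")
    inst_part = f"({inst})" if inst is not None else ""
    return f"\\{obj}{inst_part}\\{ctr}"


# Index values as they appear in comment 6 (valid only on that machine).
indexes = {"Utskriftskö": 1450, "Utskrifter": 1458}
print(to_index_path("\\Utskriftskö(_Total)\\Utskrifter", indexes))
# prints: \1450(_Total)\1458
```

For the DataCore counter, the mapping would need the index 198 for "Current Disk Queue Length" plus whatever index a lookup of "DataCore Disk" returns on the machine in question.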

comment:9 Changed 19 months ago by PaddyX

Thanks for your answer.

That's working fine.

But the index workaround is not really usable for me, because I have now found out that the indexes are specific to each installation. Every Windows installation has different index values, and I have about 60 installations that need to be monitored with WMI-Counters.

Please let me know when you have fixed the parsing of the ' " '

Thank you very much.

comment:10 Changed 19 months ago by mickem

Well.. You only need this for the counters which cause memory leaks. For the others use the built-in commands as usual.

Michael Medin

comment:11 Changed 18 months ago by PaddyX

Thanks for your answer,

that's right. But first I have to find out the indexes on 60 different Windows installations, and that probably means 60 different versions of nsc.ini ;).

Thank you for your help.

comment:12 Changed 18 months ago by mickem

Well... I guess you could write a VB script or something along those lines as well...

Sorry for the inconvenience but this is something which will work better in the 0.4.x version...
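Editor's sketch of what such a script might do, written in Python rather than VB script. It only parses the "Index for '...' is N" lines in the format shown in comments 6 and 7; running `nsclient++ -noboot CheckSystem pdhlookup <name>` on each of the 60 machines and collecting the output is left out, and `parse_pdhlookup` is a hypothetical helper name.

```python
import re

# Matches the result line printed by pdhlookup, e.g.
#   l \CheckSystem.cpp(274)   | Index for 'Utskrifter' is 1458
LOOKUP_RE = re.compile(r"Index for '(?P<name>.+)' is (?P<index>\d+)")


def parse_pdhlookup(output):
    """Extract {counter name: index} pairs from pdhlookup output."""
    return {
        m.group("name"): int(m.group("index"))
        for m in LOOKUP_RE.finditer(output)
    }


sample = r"""
l \CheckSystem.cpp(273) --+--[ Lookup Result ]----------------------------------------
l \CheckSystem.cpp(274)   | Index for 'Current Disk Queue Length' is 198
l \CheckSystem.cpp(275) --+-----------------------------------------------------------
"""
print(parse_pdhlookup(sample))  # {'Current Disk Queue Length': 198}
```

The resulting mapping could then be used to write the index-based counter path into each machine's own configuration, since the values differ per installation.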

comment:13 Changed 18 months ago by PaddyX

Thanks for your answer,

yes, a VB script is a possibility.

I'm in contact with DataCore support. Could you please briefly describe which Windows API NSClient++ uses to get the performance data?

Do you use perfmon functions to gather the data? Otherwise perfmon patches would not even have a chance to solve this.

comment:14 Changed 18 months ago by mickem

These are the function calls I am using:

			PDH_ = ::LoadLibrary(_TEXT("PDH"));
			
			/// ...

			pPdhLookupPerfNameByIndex = (fpPdhLookupPerfNameByIndex)::GetProcAddress(PDH_, "PdhLookupPerfNameByIndexW");
			pPdhLookupPerfIndexByName = (fpPdhLookupPerfIndexByName)::GetProcAddress(PDH_, "PdhLookupPerfIndexByNameW");
			pPdhExpandCounterPath = (fpPdhExpandCounterPath)::GetProcAddress(PDH_, "PdhExpandCounterPathW");
			pPdhGetCounterInfo = (fpPdhGetCounterInfo)::GetProcAddress(PDH_, "PdhGetCounterInfoW");
			pPdhAddCounter = (fpPdhAddCounter)::GetProcAddress(PDH_, "PdhAddCounterW");
			pPdhOpenQuery = (fpPdhOpenQuery)::GetProcAddress(PDH_, "PdhOpenQueryW");
			pPdhValidatePath = (fpPdhValidatePath)::GetProcAddress(PDH_, "PdhValidatePathW");
			pPdhEnumObjects = (fpPdhEnumObjects)::GetProcAddress(PDH_, "PdhEnumObjectsW");
			pPdhEnumObjectItems = (fpPdhEnumObjectItems)::GetProcAddress(PDH_, "PdhEnumObjectItemsW");
			
			/// ...
			
			pPdhRemoveCounter = (fpPdhRemoveCounter)::GetProcAddress(PDH_, "PdhRemoveCounter");
			pPdhGetFormattedCounterValue = (fpPdhGetFormattedCounterValue)::GetProcAddress(PDH_, "PdhGetFormattedCounterValue");
			pPdhCloseQuery = (fpPdhCloseQuery)::GetProcAddress(PDH_, "PdhCloseQuery");
			pPdhCollectQueryData = (fpPdhCollectQueryData)::GetProcAddress(PDH_, "PdhCollectQueryData");

So yes, it is the stock Windows API for using the performance counters. But you mentioned you had the same problems with perfmon, right? So that might be a better angle.

Also the source code for NSClient++ is available if they want details.

I could maybe write up a simple program which does the same thing, since the NSClient++ code does a lot of other things as well...

Michael Medin

comment:15 Changed 18 months ago by mickem

  • Version changed from 0.3.7 to 0.3.8
  • Milestone set to 0.3.x

comment:16 Changed 7 months ago by mickem

  • Status changed from reopened to closed
  • Resolution set to fixed
  • Milestone changed from 0.3.x to 0.3.9

This can be resolved using a wrapper
