Ticket #397 (closed defect: fixed)
NSClient++ memory leaks
| Reported by: | roman | Owned by: | mickem |
|---|---|---|---|
| Priority: | 1 | Milestone: | 0.3.9 |
| Component: | CheckSystem | Version: | 0.3.8 |
| Severity: | Bugs | Keywords: | memory leak |
| Cc: |
Description
Recently I installed NSClient++ 0.3.7 x64 on numerous Windows Server 2008 and 2008 R2 servers. After a couple of weeks I noticed that on some servers NSClient++ was using more than 1GB of memory. A restart brings it back down to about 15MB, but memory then slowly increases again. This is more evident on servers where a large number of services are being monitored.
I set up a test instance on a Server 2008 R2 box with NSClient++ 0.3.8 x64 installed, which seems to have the same memory leak as the 0.3.7 version. I created a loop that executed a CheckProcState query every 5 seconds from a different server and left it running overnight. In about 16 hours, the NSClient++ memory went up from 16MB to 311MB. After I stopped the NRPE queries, the memory stayed steady at 311MB.
One thing I noticed is that the client does not leak any memory when it is idle.
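The figures in the description allow a rough per-query estimate of the leak. The sketch below is not part of the original report: `estimate_leak` and `LeakEstimate` are hypothetical names, and it assumes that memory grows linearly and only because of the queries (which the idle observation above suggests), and that "about 16 hours" is close enough to 16.

```cpp
#include <cassert>

// Rough per-query leak estimate from two memory readings.
// Hypothetical helper; the inputs are the approximate figures from the report.
struct LeakEstimate {
    long queries;          // number of checks executed during the window
    double kib_per_query;  // average memory growth per check, in KiB
};

LeakEstimate estimate_leak(double start_mb, double end_mb,
                           double hours, double interval_seconds) {
    assert(hours > 0 && interval_seconds > 0);
    // One check every interval_seconds for the whole window.
    const long queries = static_cast<long>(hours * 3600.0 / interval_seconds);
    // Convert the growth from MB to KiB (1 MB taken as 1024 KiB here).
    const double growth_kib = (end_mb - start_mb) * 1024.0;
    return {queries, growth_kib / queries};
}
```

Calling `estimate_leak(16, 311, 16, 5)` gives 11520 queries and a little over 26 KiB of growth per query, which is only an order-of-magnitude indication.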
Below I included my settings and an excerpt from the debug log file, which shows another potential issue with process enumeration when checking status of a specific process. Please let me know if I can assist in any way. I would have provided the process_info.csv that gets generated by the A_DebugLogMetrics module, but it doesn't seem to generate the file when running NSClient++ with the -test switch.
My registry settings:
[HKEY_LOCAL_MACHINE\SOFTWARE\NSClient++]
[HKEY_LOCAL_MACHINE\SOFTWARE\NSClient++\External Alias]
"alias_cpu"="checkCPU warn=80 crit=90"
"alias_disk"="CheckDriveSize MinWarn=10% MinC"
"alias_file_age"="checkFile2 filter=out \"file=$ARG1$\" filter-written="
"alias_file_size"="checkFile2 filter=out \"file=$ARG1$\" filter-size=>$A"
"alias_file_size_in_dir"="checkFile2 filter=out pattern=*.txt \"file=$ARG1$\" filter-s"
"alias_process"="checkProcState"
"alias_service"="checkServiceS"
"check_ok"="CheckOK Every"
[HKEY_LOCAL_MACHINE\SOFTWARE\NSClient++\modules]
"A_DebugLogMetrics.dll"=""
"CheckDisk.dll"=""
"CheckEventLog.dll"=""
"CheckExternalScripts.dll"=""
"CheckHelpers.dll"=""
"CheckSystem.dll"=""
"CheckTaskSched.dll"=""
"CheckWMI.dll"=""
"FileLogger.dll"=""
"NRPEListener.dll"=""
"NSClientListener.dll"=""
[HKEY_LOCAL_MACHINE\SOFTWARE\NSClient++\NRPE]
"allow_arguments"=dword:00000001
"allow_nasty_meta_chars"=dword:00000000
"port"=dword:00001622
"script_dir"="scripts\"
[HKEY_LOCAL_MACHINE\SOFTWARE\NSClient++\NSClient]
"port"=dword:000004e0
[HKEY_LOCAL_MACHINE\SOFTWARE\NSClient++\Settings]
"allowed_hosts"="10.30.40.50"
"use_file"=dword:00000000
"use_reg"=dword:00000001
[HKEY_LOCAL_MACHINE\SOFTWARE\NSClient++\External Script]
"allow_arguments"=dword:00000001
"allow_nasty_meta_chars"=dword:00000000
"script_dir"="scripts\"
[HKEY_LOCAL_MACHINE\SOFTWARE\NSClient++\External Scripts]
"check_logs"="scripts\check_logfiles.exe -f etc\$ARG1$.cfg"
Attachments
Change History
Changed 19 months ago by roman
- attachment nsclient.log added
comment:1 Changed 19 months ago by PaddyX
Hello,
I see the same effect with some CheckCounter queries on a Windows 2008 x64 machine. I have tried versions 0.3.7 x64/x32 and 0.3.8 x64 with the same result.
The nsclient.log tells me only normal messages like this:
2010-07-27 15:41:14: debug:NSClient++.cpp:1142: Injected Result: OK 'OK: diskqueue: 0'
2010-07-27 15:41:14: debug:NSClient++.cpp:1143: Injected Performance Result: diskqueue'=0;0;99; '
2010-07-27 15:41:14: debug:NSClient++.cpp:1106: Injecting: CheckCounter: Counter:diskqueue=\DataCore Disk(_Total)\Current Disk Queue Length, ShowAll, MaxCrit=99
I created loops that execute the above command via check_nrpe. After 30 minutes the memory consumption of NSClient++ is about 50 MB; over one day it exceeds 1.5 GB.
It seems that the leak depends on which counter is requested. Querying other counters, for example:
"Counter:recovery=\\DataCore Mirroring(_Total)\\% Recovery Progress" ShowAll MinCrit=99
doesn't produce a memory leak.
Any ideas?
comment:2 Changed 19 months ago by mickem
- Status changed from new to closed
- Resolution set to worksforme
I know that HP Insight Manager has caused similar problems previously. And I believe it is related to the "provider" (since, for instance, no Microsoft counters produce this).
Regardless of this you might wanna try a few of the new PDH subsystem thingys to resolve this:
[Check System]
...
pdh_subsystem=thread-safe
Another option is to use the check command line (and wrap it as a script).
I would, if possible, like to have a reproducible scenario (which does not require me to buy any specific hardware) since it is very hard to debug it "blindly".
Michael Medin
comment:3 Changed 19 months ago by PaddyX
Thanks for your answer!
The option:
pdh_subsystem=thread-safe
does not improve the behaviour. But you are right, I have now seen the same effect with Microsoft's Perfmon tool.
To reproduce this scenario, you need the DataCore software SANMelody or SANSymphony from www.datacore.com . There should be a way to get an evaluation version. The whole system can be installed in a virtual machine. No specific hardware is needed.
comment:4 Changed 19 months ago by mickem
Humm..
Well, if you get the exact same results in perfmon then I guess your best bet is to wrap it in an external script and file a bug report with DataCore.
I shall see if I have time; it would be interesting to see if I could fix this somehow, though... but it is nothing that will happen "tomorrow", so to speak :P
Running
nsclient++ -m CheckSystem.dll -c CheckCounter ...
Should give you the result you want as an external script (i.e. forking a process and reclaiming its memory when it exits).
So doing:
[External Scripts]
my_broken_check=nsclient++ -m CheckSystem.dll -c CheckCounter ...
Should allow you to call "my_broken_check" and get rid of the memory leaks (sacrificing performance, simplicity and what not...)
Michael Medin
comment:5 Changed 19 months ago by PaddyX
Thanks for your answer!
I have tried to call nsclient++ as an external script, but I have several problems:
The -m flag you described is not recognized, so I tried:
nsclient++.exe -noboot CheckSystem -c CheckCounter "Counter:diskread=\\DataCore Disk(_Total)\\Avg. Disk sec/Read" ShowAll
but on the Nagios host, via NRPE, I only get:
No output available from command
To see some debug output, I tried calling NSClient++ directly on the Windows machine:
C:\Program Files\NSClient>nsclient++.exe -c CheckCounter "Counter:diskread=\\DataCore Disk(_Total)\\Avg. Disk sec/Read" ShowAll
e \CheckSystem.cpp(1082) ERROR: Counter not found: '''\\DataCore''': Unable to parse the counter path. Check the format and syntax of the specified path. (C0000BC0)
CRIT: Counter not found: \\DataCore: Unable to parse the counter path. Check the format and syntax of the specified path. (C0000BC0)|
It seems that the spaces in the counter path cause the problem. I have tried quotes, without effect.
Any hints?
Patrick
comment:6 Changed 19 months ago by mickem
- Status changed from closed to reopened
- Resolution worksforme deleted
Which version?
the -m option is from the latest release.
Anyways, I tried this and the -m flag seems to have a set of issues especially with parsing the ":s since it does not...
You can get around this with indexes:
C:\source\nscp\branches\stable\stage\x64\binaries>nsclient++ -m CheckSystem.dll -c CheckCounter \1450(_Total)\1458 ShowAll index MaxWarn=10
perf data: 1
OK: \Utskriftsk÷(_Total)\Utskrifter: 0|'\Utskriftsk÷(_Total)\Utskrifter'=0;10;0
But then you get the info message (which I need to remove) "perf data: 1" which might cause you issues.
To find indexes:
nsclient++ -noboot CheckSystem pdhlookup Utskrifter
l \CheckSystem.cpp(273) --+--[ Lookup Result ]----------------------------------------
l \CheckSystem.cpp(274) | Index for 'Utskrifter' is 1458
l \CheckSystem.cpp(275) --+-----------------------------------------------------------
Michael Medin
comment:7 Changed 19 months ago by PaddyX
Thanks for your answer.
I have a question about the pdhlookup Feature. How can I find out the indexes for
\DataCore Disk(_Total)\Current Disk Queue Length
I tried:
nsclient++ -noboot CheckSystem pdhlookup "Current Disk Queue Length"
It gives me
C:\temp\NSClient++-0.3.8-x64-20100728-1244>nsclient++.exe -noboot CheckSystem pdhlookup "Current Disk Queue Length"
l \CheckSystem.cpp(273) --+--[ Lookup Result ]----------------------------------------
l \CheckSystem.cpp(274) | Index for 'Current Disk Queue Length' is 198
l \CheckSystem.cpp(275) --+-----------------------------------------------------------
But I have several values with "Current Disk Queue Length"
So the next step:
nsclient++ -noboot CheckSystem pdhlookup "DataCore Disk(_Total)"
doesn't find the index.
???
comment:8 Changed 19 months ago by mickem
Indexes are simply numbers which replace the words...
So: Current Disk Queue Length = 198 means you replace Current Disk Queue Length with 198 in the counter path:
\DataCore Disk(_Total)\198
Then you need to look up DataCore Disk as well... (and possibly _Total)
Michael Medin
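The substitution described in comment:8 can be sketched as a small string helper. This is an illustration only, not NSClient++ code: `to_index_path` is a hypothetical name, the index table is assumed to have been filled from `pdhlookup` output beforehand, and the path is assumed to have the simple form `\Object(Instance)\Counter` with no backslashes inside the instance name.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: rewrite "\Object(Instance)\Counter" so that every
// name present in `indexes` is replaced by its numeric index. Names that
// are not in the table (and the instance part) are left untouched.
std::string to_index_path(const std::string& path,
                          const std::map<std::string, unsigned>& indexes) {
    assert(!path.empty() && path[0] == '\\');  // precondition: leading backslash

    auto lookup = [&indexes](const std::string& name) -> std::string {
        const auto it = indexes.find(name);
        return it == indexes.end() ? name : std::to_string(it->second);
    };

    // The counter name is everything after the last backslash.
    const std::string::size_type last = path.rfind('\\');
    if (last == 0) return path;  // no counter component, nothing to replace

    std::string object = path.substr(1, last - 1);  // "Object(Instance)"
    const std::string counter = path.substr(last + 1);

    // Split off "(Instance)" if present; it is passed through unchanged.
    std::string instance;
    const std::string::size_type paren = object.find('(');
    if (paren != std::string::npos) {
        instance = object.substr(paren);
        object.erase(paren);
    }
    return "\\" + lookup(object) + instance + "\\" + lookup(counter);
}
```

With a table containing only `Current Disk Queue Length` = 198 (the value from comment:7), the path `\DataCore Disk(_Total)\Current Disk Queue Length` becomes `\DataCore Disk(_Total)\198`. Once the index for `DataCore Disk` has been looked up and added to the table, the object name is replaced as well.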
comment:9 Changed 19 months ago by PaddyX
Thanks for your answer.
That's working fine.
But the index workaround is not really usable for me, because I have now found out that the indexes are unique to each machine. Every Windows installation has different index values, and I have about 60 installations that need to be monitored with WMI counters.
Please let me know when you have fixed the parsing of the ' " '
Thank you very much.
comment:10 Changed 19 months ago by mickem
Well.. You only need this for the counters which cause memory leaks. For the others use the built-in commands as usual.
Michael Medin
comment:11 Changed 18 months ago by PaddyX
Thanks for your answer,
That's right. But first I have to find out the indexes on 60 different Windows installations, and that probably means 60 different versions of nsc.ini ;).
Thank you for your help.
comment:12 Changed 18 months ago by mickem
Well... I guess you could write a VB script or something along those lines as well...
Sorry for the inconvenience but this is something which will work better in the 0.4.x version...
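Automating the per-machine lookup mostly comes down to reading the index out of the `pdhlookup` output. The comment above suggests a VB script; the sketch below uses C++ instead, to match the code later in this ticket. `parse_lookup_line` is a hypothetical name, and the parser assumes the `Index for '<name>' is <number>` line format shown in comment:6 and comment:7, which may differ in other NSClient++ versions.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: extract the counter name and index from one line of
// pdhlookup output, e.g.
//   l \CheckSystem.cpp(274) | Index for 'Current Disk Queue Length' is 198
// Returns false (leaving the outputs untouched) for lines without an index.
bool parse_lookup_line(const std::string& line,
                       std::string& name, unsigned long& index) {
    const std::string prefix = "Index for '";
    const std::string middle = "' is ";

    const std::string::size_type start = line.find(prefix);
    if (start == std::string::npos) return false;
    const std::string::size_type name_begin = start + prefix.size();

    // Search from the right so a quote inside the name does not end it early.
    const std::string::size_type name_end = line.rfind(middle);
    if (name_end == std::string::npos || name_end < name_begin) return false;
    assert(name_end >= name_begin);  // guaranteed by the check above

    const std::string digits = line.substr(name_end + middle.size());
    char* end = nullptr;
    const unsigned long value = std::strtoul(digits.c_str(), &end, 10);
    if (end == digits.c_str()) return false;  // no number after "is"

    name = line.substr(name_begin, name_end - name_begin);
    index = value;
    return true;
}
```

Run over the captured output of one `pdhlookup` call per name on each host, this gives the name-to-index pairs from which a host-specific nsc.ini could be generated.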
comment:13 Changed 18 months ago by PaddyX
Thanks for your answer,
Yes, a VB script is a possibility.
I'm in contact with DataCore support. Could you please briefly describe which Windows API NSClient++ uses to get the performance data?
Do you use perfmon functions to gather the data? Otherwise perfmon-patches would not even have a chance to solve this.
comment:14 Changed 18 months ago by mickem
These are the function calls I am using:
PDH_ = ::LoadLibrary(_TEXT("PDH"));
/// ...
pPdhLookupPerfNameByIndex = (fpPdhLookupPerfNameByIndex)::GetProcAddress(PDH_, "PdhLookupPerfNameByIndexW");
pPdhLookupPerfIndexByName = (fpPdhLookupPerfIndexByName)::GetProcAddress(PDH_, "PdhLookupPerfIndexByNameW");
pPdhExpandCounterPath = (fpPdhExpandCounterPath)::GetProcAddress(PDH_, "PdhExpandCounterPathW");
pPdhGetCounterInfo = (fpPdhGetCounterInfo)::GetProcAddress(PDH_, "PdhGetCounterInfoW");
pPdhAddCounter = (fpPdhAddCounter)::GetProcAddress(PDH_, "PdhAddCounterW");
pPdhOpenQuery = (fpPdhOpenQuery)::GetProcAddress(PDH_, "PdhOpenQueryW");
pPdhValidatePath = (fpPdhValidatePath)::GetProcAddress(PDH_, "PdhValidatePathW");
pPdhEnumObjects = (fpPdhEnumObjects)::GetProcAddress(PDH_, "PdhEnumObjectsW");
pPdhEnumObjectItems = (fpPdhEnumObjectItems)::GetProcAddress(PDH_, "PdhEnumObjectItemsW");
/// ...
pPdhRemoveCounter = (fpPdhRemoveCounter)::GetProcAddress(PDH_, "PdhRemoveCounter");
pPdhGetFormattedCounterValue = (fpPdhGetFormattedCounterValue)::GetProcAddress(PDH_, "PdhGetFormattedCounterValue");
pPdhCloseQuery = (fpPdhCloseQuery)::GetProcAddress(PDH_, "PdhCloseQuery");
pPdhCollectQueryData = (fpPdhCollectQueryData)::GetProcAddress(PDH_, "PdhCollectQueryData");
So yes, it is the stock Windows API for using the performance counters. But you mentioned you had the same problems with perfmon, right? So that might be a better angle.
Also the source code for NSClient++ is available if they want details.
I could maybe write up a simple program which does the same thing, as the NSClient++ code does a lot of other things as well...
Michael Medin
comment:15 Changed 18 months ago by mickem
- Version changed from 0.3.7 to 0.3.8
- Milestone set to 0.3.x
comment:16 Changed 7 months ago by mickem
- Status changed from reopened to closed
- Resolution set to fixed
- Milestone changed from 0.3.x to 0.3.9
This can be resolved using a wrapper








Small sample of nsclient.log that is at 1.1GB after 16 hours.