Using NSClient++ from Nagios with NSCA

NSCA (Nagios Service Check Acceptor) is a server which runs on the Nagios server and accepts passive check results from various servers.

Passive in this context means that Nagios is not the initiator of the actual check commands. Instead the client (when it is configured to do so) will submit the results to Nagios (thus it will initiate the data transfer). If you compare the above image with the one used with NRPE, you will notice that the arrow points from the client to Nagios, whereas the NRPE one points from Nagios to your client.

1. Overview of NSCA

As I stated before, NSCA is "sort of the reverse" of NRPE, and the diagram above illustrates the process by which Nagios receives the check results.

  1. NSClient++ decides it is time to send the results
  2. NSClient++ gathers all results
  3. NSClient++ connects to NSCA (server) and sends all results
  4. NSClient++ goes back to sleep

So in essence NSCA is (again) merely a transport mechanism to send the result of a check command over the network. But the big change is that this time it is NSClient++ who decides when it is time to do so.
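
The cycle described above can be sketched in a few lines of Python. This is only an illustrative model and not NSClient++'s actual code: run_cycle, fake_check and the shape of the result tuple are hypothetical names invented for this sketch, and the timer (steps 1 and 4) is left out.

```python
from typing import Callable, Dict, List, Tuple

# (service name in Nagios, return code, output text); a hypothetical shape
Result = Tuple[str, int, str]


def run_cycle(
    commands: Dict[str, str],
    run_check: Callable[[str], Tuple[int, str]],
    send: Callable[[List[Result]], None],
) -> None:
    """One wake-up of the agent: gather every result, then submit once."""
    results: List[Result] = []
    for service, command in commands.items():  # step 2: gather all results
        code, output = run_check(command)
        results.append((service, code, output))
    send(results)  # step 3: a single submission to NSCA


# Stand-ins so the sketch runs without a network or a real check engine.
sent_batches: List[List[Result]] = []


def fake_check(command: str) -> Tuple[int, str]:
    return 0, f"OK from {command}"


run_cycle(
    {"CPU Load": "alias_cpu", "host_check": "check_ok"},
    fake_check,
    sent_batches.append,
)
print(sent_batches)
```

Note that nothing in the cycle waits for an answer from the server: the results are sent and the agent goes back to sleep.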

2. NSClient++ configuration

Since NSCA is a server, we shall start by configuring NSClient++, as that's where most things will happen. Also, since this is an "advanced" guide, it is assumed you have read at least the NRPE guide and are familiar with the basic workings of both Nagios and NSClient++.

2.1 Modules

The first thing to do is make sure you have all the proper modules loaded: the basic ones we need for basic checks, in addition to the NSCAAgent. One important thing to notice is that once the NSCAAgent is loaded it will start (attempting) to submit passive check results. This means that if it is not properly configured it will produce a lot of error messages.

So let's start with the following modules:

Module                    Description                                                       Commands
CheckSystem.dll           Handles many system checks                                        CheckCPU, CheckMEM etc
CheckDisk.dll             Handles Disk related checks                                       CheckDisk
CheckExternalScripts.dll  Handles aliases (which is what we will use) and external scripts. N/A
CheckHelpers.dll          Handles various "utility" checks like CheckOK                     CheckOK (amongst others)
FileLogger.dll            Logs errors to a file so you can see what is going on             N/A
NSCAAgent.dll             Submits passive checks results to NSCA (server) on Nagios         N/A

The resulting modules section in NSC.ini will look like so:

[modules]
CheckSystem.dll
CheckDisk.dll
CheckExternalScripts.dll
CheckHelpers.dll
FileLogger.dll
NSCAAgent.dll

2.2 NSCA Configuration

Then we move on to configuring NSCA, which is not that hard. Here is a quick overview of the basic settings you need to edit:

interval
Perhaps the most important option. It controls the interval at which NSClient++ runs the checks; in essence this is the amount of time between submissions to Nagios (via NSCA). Since there is only one such setting, individual checks cannot have their own intervals; all checks are submitted using this interval. It is a good idea to set this LOW when you are debugging, as you will have to wait for it to fire before anything happens.
encryption_method
The encryption algorithm to use. It is often a good idea to set this to 0 (None) when you first try this out, as it reduces the number of things which might be broken: if you have the wrong method, it is hard to tell what is wrong. For production I would recommend using 14 (AES), as it is a fairly strong algorithm.
password
The password is the "secret" you share with NSCA. It has to be the same on both ends or (again, like with encryption) nothing will work.
nsca_host
This is the IP address of the NSCA server (often the same as the Nagios server). This will not default to the allowed_hosts directive so you HAVE to specify this option.

The resulting configuration will look something like this:

;# CHECK INTERVAL (in seconds)
;   How often we should run the checks and submit the results.
interval=10
;
;# ENCRYPTION METHOD
;   This option determines the method by which the send_nsca client will encrypt the packets it sends 
;   to the nsca daemon. The encryption method you choose will be a balance between security and 
;   performance, as strong encryption methods consume more processor resources.
;   You should evaluate your security needs when choosing an encryption method.
;
; Note: The encryption method you specify here must match the decryption method the nsca daemon uses 
;       (as specified in the nsca.cfg file)!!
; Values:
;   0 = None (Do NOT use this option)
;   1 = Simple XOR (No security, just obfuscation, but very fast)
;   2 = DES
;   3 = 3DES (Triple DES)
;   4 = CAST-128
;   6 = xTEA
;   8 = BLOWFISH
;   9 = TWOFISH
;   11 = RC2
;   14 = RIJNDAEL-128 (AES)
;   20 = SERPENT
encryption_method=0
;
;# ENCRYPTION PASSWORD
;  This is the password/passphrase that should be used to encrypt the sent packets. 
password=secret-password
;
;# NAGIOS SERVER ADDRESS
;  The address to the nagios server to submit results to.
nsca_host=192.168.0.1

2.3 NSCA Commands

Now we have (hopefully) configured NSCA, which will work splendidly, but until we add some checks it won't actually do anything. Checks for NSCA are added under the NSCA Commands section. The syntax of this section is <service definition>=<check command>.

service definition
The service definition is the name of the service IN Nagios.
check command
The check command is the command to run inside NSClient++

There is also a special check called host_check, which corresponds to the "host" check command. All commands supported by NSClient++ can be used here, which (apart from the commands listed on this site) includes all external scripts you define using the CheckExternalScripts module.

The resulting section will look something like this:

[NSCA Commands]
CPU Load=alias_cpu
host_check=check_ok
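
To make the mapping concrete, here is a small Python sketch that reads such a section and splits it into service checks and the host check. It is a hedged illustration only: read_nsca_commands is a hypothetical helper, and this is not how NSClient++ itself parses NSC.ini. It can be handy for comparing the left-hand names against the service descriptions defined in Nagios.

```python
import configparser

NSC_INI = """\
[NSCA Commands]
CPU Load=alias_cpu
host_check=check_ok
"""


def read_nsca_commands(text: str) -> dict:
    """Return {service definition: check command} from the NSCA Commands section."""
    parser = configparser.ConfigParser(interpolation=None)
    # configparser lowercases keys by default; keep them as written so they
    # can be compared with the service names used in Nagios.
    parser.optionxform = str
    parser.read_string(text)
    return dict(parser["NSCA Commands"])


commands = read_nsca_commands(NSC_INI)
# host_check is the special entry; everything else is a service check.
services = {name: cmd for name, cmd in commands.items() if name != "host_check"}
host_check = commands.get("host_check")
print(services, host_check)
```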

3. NSCA Server

How to configure NSCA falls a bit outside the scope of this tutorial, but it is pretty straightforward and a quick walkthrough is provided here.

Don't forget to set "debug=1" in /etc/nsca.cfg

TODO

4. Testing and Debugging

Now let's fire this baby up and see what it can do. As always we will start by running NSClient++ in /test mode like so:

NSClient++ /stop
NSClient++ /test

The usual output when NSClient++ boots:

Launching test mode - client mode
d NSClient++.cpp(1106) Enabling debug mode...
d NSClient++.cpp(494) Attempting to start NSCLient++ - 0.3.7.7 2009-07-05
d NSClient++.cpp(897) Loading plugin: Helper function...
d NSClient++.cpp(897) Loading plugin: NSCAAgent (w/ encryption)...
d \NSCAThread.cpp(77) Time difference for NSCA server is: 0
d \NSCAThread.cpp(84) Only reporting: ok,warning,critical,unknown
d \NSCAThread.cpp(102) Autodetected hostname: DESKTOP
l NSClient++.cpp(600) NSCLient++ - 0.3.7.7 2009-07-05 Started!
d \NSCAThread.cpp(171) Drifting: 0
l NSClient++.cpp(402) Using settings from: INI-file
l NSClient++.cpp(403) Enter command to inject or exit to terminate...

Here we will have to wait for NSClient++ to fire (in my example I set the interval to 10 seconds, so I will wait for 10 seconds). Then we get something along the following lines:

d \NSCAThread.cpp(252) Looked up 192.168.0.1 to 192.168.0.1
d \NSCAThread.cpp(297) Finnished sending to server...
d \NSCAThread.cpp(189) Executing (from NSCA): CPU Load
d NSClient++.cpp(1034) Injecting: alias_cpu:
d NSClient++.cpp(1034) Injecting: checkCPU: warn=80, crit=90, time=5m, time=1m, time=30s
d NSClient++.cpp(1070) Injected Result: OK 'OK CPU Load ok.'
d NSClient++.cpp(1071) Injected Performance Result: ''5m'=1%;80;90; '1m'=3%;80;90; '30s'=2%;80;90; '
d NSClient++.cpp(1070) Injected Result: OK 'OK CPU Load ok.'
d NSClient++.cpp(1071) Injected Performance Result: ''5m'=1%;80;90; '1m'=3%;80;90; '30s'=2%;80;90; '
d \NSCAThread.cpp(189) Executing (from NSCA):
d NSClient++.cpp(1034) Injecting: check_ok:
d NSClient++.cpp(1034) Injecting: CheckOK: Everything is fine
d NSClient++.cpp(1070) Injected Result: OK 'Every thing is fine'
d NSClient++.cpp(1071) Injected Performance Result: ''
d NSClient++.cpp(1070) Injected Result: OK 'Every thing is fine'
d NSClient++.cpp(1071) Injected Performance Result: ''
d \NSCAThread.cpp(245) Sending to server...

And everything looks like it went super... BUT, and this is a big but, the NSCA protocol does not support any result checking. We submit the result and we are done; there is no "returned information", so everything could have gone terribly wrong and we would not see anything at all.

And here is where we need to start debugging on the Nagios (or NSCA) side.

sudo tail -f /var/log/syslog

will result in the following:

Jul 12 19:35:20 localhost nsca[27093]: Connection from 192.168.0.104 port 26117
Jul 12 19:35:20 localhost nsca[27093]: Handling the connection...
Jul 12 19:35:21 localhost nsca[27093]: Received invalid packet type/version from client
   - possibly due to client using wrong password or crypto algorithm?

And this is a clue that we have indeed misconfigured NSCA. Most often it is either an invalid password or the wrong encryption method, so if we make sure these are correct we will end up with the following instead:

Jul 12 19:42:54 localhost nsca[27157]: Connection from 192.168.0.104 port 60421
Jul 12 19:42:54 localhost nsca[27157]: Handling the connection...
Jul 12 19:42:55 localhost nsca[27157]: Dropping packet with stale timestamp - packet was 57 seconds old.

This is another issue you might sometimes need to resolve: it means the clocks of the two machines are not synchronized closely enough. This can be solved in three ways:

  1. Sync the clocks
  2. Use the time_delay option to change the "local time" in NSClient++
  3. Change the max_packet_age in nsca.cfg
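
The effect of the three fixes can be illustrated with a simplified Python model of the age check. Everything here is an assumption made for the sketch: is_stale is a hypothetical function, not NSCA's code; the limit of 30 seconds is just an illustrative value; treating 0 as "no limit" is a choice of this sketch; and the client-side offset is modelled as a plain number of seconds added to the timestamp, without claiming anything about the sign or unit that time_delay expects.

```python
def is_stale(server_now: int, packet_timestamp: int, max_packet_age: int) -> bool:
    """Simplified model: a packet is dropped when it is older than the limit."""
    age = server_now - packet_timestamp
    # In this sketch a limit of 0 means "never drop because of age".
    return max_packet_age > 0 and age > max_packet_age


server_now = 1_000_000        # arbitrary server clock, in seconds
client_now = server_now - 57  # client clock is 57 seconds behind, as in the log
limit = 30                    # illustrative max_packet_age

unfixed = is_stale(server_now, client_now, limit)
synced = is_stale(server_now, server_now, limit)        # 1. sync the clocks
shifted = is_stale(server_now, client_now + 57, limit)  # 2. offset the client's time
relaxed = is_stale(server_now, client_now, 90)          # 3. raise max_packet_age
print(unfixed, synced, shifted, relaxed)
```

Only the unfixed case is rejected; any one of the three fixes brings the packet back inside the accepted window.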

When we have fixed this we end up with the following:

Jul 12 19:47:01 localhost nsca[27207]: Connection from 192.168.0.104 port 8198
Jul 12 19:47:01 localhost nsca[27207]: Handling the connection...
Jul 12 19:47:02 localhost nsca[27207]: SERVICE CHECK -> Host Name: 'DESKTOP', 
  Service Description: 'CPU Load', Return Code: '0', 
  Output: 'OK CPU Load ok.|'5m'=0%;80;90; '1m'=1%;80;90; '30s'=3%;80;90; '
Jul 12 19:47:02 localhost nsca[27207]: HOST CHECK -> Host Name: 'DESKTOP', 
  Return Code: '0', Output: 'Everything is fine|'
Jul 12 19:47:02 localhost nsca[27207]: End of connection...

And this means (hopefully) that communication is all working and all you need to do now is configure the checks in Nagios.

5. Configure Nagios

5.1 Introduction

Nagios configuration is in itself a whole chapter and this is just a quick peek on how you can do things. First off there are a few concepts to understand:

  • templates are the same as the corresponding item, but they have the flag register 0, which makes them "unlistable items"
  • services are essentially checks (i.e. check CPU)
  • hosts are essentially computers
  • groups are an important concept which I ignore here for simplicity (I recommend you use it)

The configuration is at its core quite simple: you have a "check" and a "host" and you connect them with a service, like I show on the bottom line in the diagram above. What makes this a tad more complicated is that you can inherit things from a "parent" definition, which is what I show with arrows (bottom to top) above. The templates with dashed lines are the base templates which all services and hosts inherit.

5.2 Passive Checks

The main difference between passive checks and active checks is the following two flags:

active_checks_enabled
Active service checks are enabled
passive_checks_enabled
Passive service checks are enabled/accepted

So adding the following will "change" an active check to a passive check.

	active_checks_enabled	0 ; Active service checks are disabled
	passive_checks_enabled	1 ; Passive service checks are enabled/accepted

So, you ask, what shall I enter as the check command for my passive checks?

There are several options for this depending on what you want. I won't (as always) go into the details in this quick guide, but the short of it is: either you use check_dummy, or you use the actual command and set up freshness checks. With freshness checks active, if a result is not submitted in time, Nagios will actively go out and seek the information (this is what I would recommend for host checks at least).

5.3 Template

First, it's best practice to create a new template for each different type of host you'll be monitoring. Let's create a new template for Windows servers.

define host{
	name			tpl-windows-servers ; Name of this template
	use			generic-host ; Inherit default values
	check_period		24x7
	check_interval		5
	retry_interval		1
	max_check_attempts	10
	check_command		check-host-alive
	notification_period	24x7
	notification_interval	30
	notification_options	d,r
	contact_groups		admins
	register		0 ; DONT REGISTER THIS - ITS A TEMPLATE
}

Notice that the tpl-windows-servers template definition is inheriting default values from the generic-host template, which is defined in the sample localhost.cfg file that gets installed when you follow the Nagios quickstart installation guide.

5.4 Host definition

Next we need to define a new host for the remote Windows server that references the newly created tpl-windows-servers host template.

define host{
	use		tpl-windows-servers ; Inherit default values from a template
	host_name	windowshost ; The name we're giving to this server
	alias		My First Windows Server ; A longer name for the server
	address		10.0.0.2 ; IP address of the server
	active_checks_enabled	0 ; Active host checks are disabled
	passive_checks_enabled	1 ; Passive host checks are enabled/accepted
}

Next we define services for monitoring the remote Windows server. These example service definitions use the sample commands that are defined in the default NSC.ini file which ships with NSClient++ 0.3.7 or newer.

5.5 Service definitions

The following service will monitor the CPU load on the remote host. The "alias_cpu" argument which is passed to the check_nrpe command definition tells NSClient++ to run the "alias_cpu" command as defined in the alias section of the NSC.ini file.

define service{
	use			generic-service
	host_name		windowshost 
	service_description	CPU Load
	check_command		check_nrpe!alias_cpu
	active_checks_enabled	0 ; Active service checks are disabled
	passive_checks_enabled	1 ; Passive service checks are enabled/accepted
}

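The following service will monitor the free drive space on the remote host. For passive results to arrive, NSC.ini needs a matching Free Space=alias_disk entry in the NSCA Commands section, since the left-hand side must be the name of the service in Nagios.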

define service{
	use			generic-service
	host_name		windowshost 
	service_description	Free Space
	check_command		check_nrpe!alias_disk
	active_checks_enabled	0 ; Active service checks are disabled
	passive_checks_enabled	1 ; Passive service checks are enabled/accepted
}

Now, a better way here is to add a new template, derive the service checks from a "tpl-passive-service" instead, and put the passive options there, but alas I was too lazy to do so in this quick guide.

6. Where to go next

This is of course not the end. Now you need to work out which checks you want to run on your servers. There are a lot of built-in checks, but there are many more external scripts you can download from, for instance, Monitoring Exchange or the new Nagios Exchange.

Built in checks:
