TomsBlog

Check a process cpu and memory with Nagios

Sat, 18 Jan 2014 19:48:11 +0000

A check plugin for Nagios to monitor processes and their utilization of system resources.

https://github.com/thomasweaver/check_cpu_proc

This plugin takes a process name and uses the ps command to work out how much memory and CPU all the processes of that name are using, as a percentage. It outputs performance data for CPU usage percentage, memory usage percentage, VSZ, RSS and the number of processes of that name.
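
To make the aggregation described above concrete, here is a minimal Python sketch. It is not the plugin's code (the plugin is a shell script); it only assumes input shaped like the output of ps -eo pcpu,pmem,vsz,rss,args. The function name summarise and the sample rows are made up for illustration.

```python
def summarise(ps_output, name):
    """Sum CPU %, memory %, VSZ and RSS over every process whose command line contains name."""
    totals = {"proc": 0, "cpu": 0.0, "mem": 0.0, "vsz": 0, "rss": 0}
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        cpu, mem, vsz, rss, args = line.split(None, 4)
        if name in args:
            totals["proc"] += 1
            totals["cpu"] += float(cpu)
            totals["mem"] += float(mem)
            totals["vsz"] += int(vsz)
            totals["rss"] += int(rss)
    return totals


# Made-up sample in the shape of `ps -eo pcpu,pmem,vsz,rss,args` output.
SAMPLE = """\
%CPU %MEM    VSZ   RSS COMMAND
 0.0  0.5 212460 14036 /usr/sbin/apache2 -k start
 0.1  1.0 212460 28080 /usr/sbin/apache2 -k start
 0.0  0.0  12408   864 /usr/lib/nagios/plugins/check_ping -H 10.0.0.1
"""

print(summarise(SAMPLE, "apache"))
```

Matching on a substring of the command line is an assumption here; it does explain how a check for "nagios" could end up reporting a plugin such as check_ping, whose path contains that word.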

Options

Only the -p option is required; all other options are optional.

-p allows you to specify the name of the process you want to monitor (REQUIRED)

-w specify a warning for CPU percentage utilization

-c specify a critical for CPU percentage utilization

-m specify a warning for Memory percentage utilization

-n specify a critical for Memory percentage utilization

Examples

To monitor the apache processes, warning when CPU usage is over 20% and going critical when it is over 30%:

./check_cpu_proc.sh -p apache -w 20 -c 30

OK /usr/sbin/apache2 CPU: 0% MEM: 1.5% over 6 processes | proc=6 mem=1.5% cpu=0% rss=42116.0KB vsz=1274760.0KB

To monitor apache processes, warning when CPU usage is over 20% and going critical when it is over 30%, and also warning when memory is over 0.2% and going critical when it is over 0.4%:

./check_cpu_proc.sh -p apache -w 20 -c 30 -m 0.2 -n 0.4

CRITICAL /usr/sbin/apache2 CPU: 0% MEM: 1.5% over 6 processes | proc=6 mem=1.5% cpu=0% rss=42116.0KB vsz=1274760.0KB
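
The two runs above differ only in the memory thresholds, which is what tips the second one into CRITICAL. A rough Python sketch of that decision is below. The function name is hypothetical, and whether the real plugin compares with "greater than" or "greater than or equal" is an assumption on my part.

```python
def check_state(cpu, mem, cpu_warn=None, cpu_crit=None, mem_warn=None, mem_crit=None):
    """Return the worst Nagios state across the CPU and memory thresholds."""
    def exceeded(value, limit):
        # A threshold that was not supplied can never be exceeded.
        return limit is not None and value > limit

    if exceeded(cpu, cpu_crit) or exceeded(mem, mem_crit):
        return "CRITICAL"
    if exceeded(cpu, cpu_warn) or exceeded(mem, mem_warn):
        return "WARNING"
    return "OK"


# Same figures as the examples: CPU 0%, MEM 1.5%.
print(check_state(0, 1.5, cpu_warn=20, cpu_crit=30))
print(check_state(0, 1.5, cpu_warn=20, cpu_crit=30, mem_warn=0.2, mem_crit=0.4))
```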

To monitor nagios processes purely for performance data, i.e. with no warning or critical thresholds:

./check_cpu_proc.sh -p nagios

OK /usr/lib/nagios/plugins/check_ping CPU: 0% MEM: 0% over 1 processes | proc=1 mem=0% cpu=0% rss=864.0KB vsz=12408.0KB
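
Everything after the pipe in the outputs above is performance data, which is what graphing tools pick up. If you want to read it yourself, a small Python sketch like the one below will do. The function name is made up and it only handles simple label=value pairs like the ones shown here.

```python
import re


def parse_perfdata(plugin_output):
    """Split plugin output at the pipe and return {label: (value, unit)}."""
    _, _, perf = plugin_output.partition("|")
    result = {}
    for item in perf.split():
        label, _, rest = item.partition("=")
        value = rest.split(";")[0]  # drop any warn/crit/min/max fields
        match = re.fullmatch(r"(-?[0-9.]+)(.*)", value)
        if match:
            result[label] = (float(match.group(1)), match.group(2))
    return result


line = ("OK /usr/sbin/apache2 CPU: 0% MEM: 1.5% over 6 processes | "
        "proc=6 mem=1.5% cpu=0% rss=42116.0KB vsz=1274760.0KB")
print(parse_perfdata(line))
```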

NOTE: this plugin doesn’t check whether the process is running and is purely for monitoring process utilization.

Nagios Graphs

Below are some examples of graphs created from the performance data output by the plugin.

The graphs below are of an Apache process on a live system.

Nagios Varnish performance plugin

Thu, 18 Jul 2013 21:38:03 +0000

Varnish is a web application accelerator that sits in front of your web server. It speeds up your application by caching some, if not all, of the content. This reduces the load on your web server and can also reduce the load on your backend, as fewer lookups are needed by the frontend.

As Varnish is the first thing users hit, it is imperative that it is working properly and that you have statistics on how it is performing. This plugin can give you a better insight into how effectively Varnish is running and tell you if it’s having problems.

https://github.com/thomasweaver/check_varnish

Install

To use this plugin you need varnishstat, which is installed by default when you install Varnish.

Perl is also required for this plugin. If you don’t have Perl installed you can install it by running one of the commands below (apt-get on Debian/Ubuntu, yum on Red Hat/CentOS):

sudo apt-get install perl

sudo yum install perl

Now you can clone the repo above

git clone https://github.com/thomasweaver/check_varnish.git

You should now have a file called “check_varnish.pl”. Make sure that it has execute permissions:

chmod u+x check_varnish.pl

You now need to move this file to your Nagios plugins folder. Consult your Nagios config to find out where this is. Mine was ‘/usr/lib/nagios/plugins’

mv check_varnish.pl /usr/lib/nagios/plugins/.

How To

check_varnish.pl – Monitor and report on varnish usage

DESCRIPTION

This script will report on various varnish stats including:

varnish cache hit ratio

backend error count (Total or Ratio)

Any other counter in varnishstat

If no counters are required the script will ensure the varnish binary is running

OPTIONS

-a, --cache – make the script output cache hit ratio perfdata

-b, --bin – specify the location of the varnishstat binary if it is not in the default location. Default is ‘/usr/bin/varnishstat’

-d, --backend – make the script output backend data. You can output ratio, total or both

-h, --help – output this message

-w, --warning – specify the warning threshold. Required for cache and backend checks

-c, --critical – specify the critical threshold. Required for cache and backend checks

-s, --stats – specify a comma separated list of the stats you wish to check. Critical and warning thresholds can be specified, and every value will be compared against them.

-t, --technique – when specifying stats you can also specify which technique to use when comparing the values to the thresholds. Specify lt for less than and gt for greater than. Default is gt

EXAMPLES

Check varnish is running

./check_varnish.pl

Check varnish Cache Hit Ratio, warning if the ratio is below 0.8 and going critical if it is below 0.6

./check_varnish.pl -a -w 0.8 -c 0.6
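
For a hit ratio lower is worse, so the thresholds work the opposite way round to a CPU check. The Python sketch below shows the idea. It assumes the usual definition of hit ratio, cache_hit / (cache_hit + cache_miss), which may not be exactly how the plugin calculates it, and the function names and counter values are made up.

```python
def hit_ratio(cache_hit, cache_miss):
    """Fraction of lookups served from cache; 0.0 when there have been no lookups."""
    total = cache_hit + cache_miss
    return cache_hit / total if total else 0.0


def ratio_state(ratio, warning, critical):
    """Lower is worse: CRITICAL below critical, WARNING below warning."""
    if ratio < critical:
        return "CRITICAL"
    if ratio < warning:
        return "WARNING"
    return "OK"


ratio = hit_ratio(cache_hit=750, cache_miss=250)  # made-up counter values
print(ratio, ratio_state(ratio, warning=0.8, critical=0.6))
```

With these numbers the ratio is 0.75, which sits between the two thresholds and so gives a warning.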

Check varnish Backends

./check_varnish.pl -d all

Check varnish client requests and drops

./check_varnish.pl -s client_drop,client_req
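
The counter names passed to -s are the names varnishstat itself prints. As an illustration only (this is not the plugin's Perl code), the Python sketch below parses text in the one-counter-per-line layout that varnishstat -1 produced on the Varnish versions I used: name, value, rate, description. The numbers in the sample are invented, and newer Varnish releases prefix counter names (for example MAIN.client_req), so treat the exact names as an assumption.

```python
def parse_varnishstat(text):
    """Return {counter_name: value} from `varnishstat -1` style output."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def select(counters, wanted):
    """Pick the counters named in a comma separated list, like the -s option."""
    return {name: counters[name] for name in wanted.split(",") if name in counters}


# Invented sample in the name / value / rate / description layout.
SAMPLE = """\
client_conn              211980         1.03 Client connections accepted
client_drop                   0         0.00 Connection dropped, no sess/wrk
client_req              1911928         9.28 Client requests received
"""

print(select(parse_varnishstat(SAMPLE), "client_drop,client_req"))
```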

Nagios Set Up

Once you have run the command from the CLI and everything is working, you can add the command definition to Nagios:

define command {

command_name check_varnish

command_line $USER1$/check_varnish.pl $ARG1$

}

$USER1$ is the variable pointing to your Nagios plugins folder, and $ARG1$ is whatever command line arguments you specify in the service.

define service {

host_name localhost

service_description Varnish

check_command check_varnish!--cache -w 0.6 -c 0.4

}

The service above will give a warning if the hit ratio goes below 0.6 and critical if the ratio goes below 0.4
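
If the way check_command turns into an actual command line is not obvious, the Python sketch below mimics it in a very simplified form: the part before the first ! names the command, and each following !-separated piece fills $ARG1$, $ARG2$ and so on. Real Nagios has many more macros and its own escaping rules, so this is only an illustration; the function name and the plugin path are assumptions.

```python
import re


def expand(check_command, commands, user_macros):
    """Resolve a service's check_command into the command line that would be run."""
    name, *args = check_command.split("!")
    line = commands[name]
    for index, value in enumerate(args, start=1):
        line = line.replace(f"$ARG{index}$", value)
    line = re.sub(r"\$ARG\d+\$", "", line)  # arguments that were not supplied become empty
    for macro, value in user_macros.items():
        line = line.replace(macro, value)
    return line.strip()


commands = {"check_varnish": "$USER1$/check_varnish.pl $ARG1$"}
user_macros = {"$USER1$": "/usr/lib/nagios/plugins"}  # assumed plugin folder
print(expand("check_varnish!--cache -w 0.6 -c 0.4", commands, user_macros))
```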

NRPE

The below is a line that can be used in the NRPE configuration for remote monitoring:

command[check_varnish_cache_hit]=/usr/lib/nagios/plugins/check_varnish.pl --cache -w 0.6 -c 0.4

The NRPE service could look like this:

define service {

host_name varnishserver

service_description Varnish Cache Hit Ratio

check_command check_nrpe!check_varnish_cache_hit

}

Performance Data

For certain stats the plugin outputs performance data which can be turned into graphs. The graphs below were produced by Nagios and NagiosGrapher.

This graph shows the performance data when Varnish was restarted. As you can see, the cache suddenly drops off and then slowly builds back up as more users access the site.

The graph below shows the cache hit ratio slowly increasing over a week. This means most of the cache has been populated, so most of the requests are being served from the cache.

F5 LTM virtual server Nagios health performance check

Tue, 11 Dec 2012 18:42:01 +0000

A plugin for Nagios that checks the state of virtual servers on an LTM and outputs performance data on the number of connections to each virtual server. The plugin is written in PHP and uses SNMP to gather the status of each virtual server. It currently only supports SNMPv2 and has only been tested on Big-IP 10.2, but it should work with other versions of Big-IP.

You can get the plugin from here:

https://github.com/thomasweaver/check_ltm_vs

Help

This performs an SNMP lookup against an LTM, checks that all Virtual Servers are OK and outputs current client connections as Nagios PerfData

Required Values:

	-H Host address

	-C SNMP Community string
 
Optional Values:

	-t Timeout. Number of microseconds until first timeout

	-r Number of retries

	-e Exceptions. Value is a comma separated string of virtual servers not to check.

	-d Disabled Exceptions. Don't check whether the provided virtual servers are disabled. Comma separated
 
./check_ltm_vs.php -H IPADDRESS -C COMMUNITYSTRING -e vs_virtualserver1,vs_virtualserver2 -d vs_virtualserver1,vs_virtualserver2

Example

./check_ltm_vs.php -H 192.168.0.1 -C public -d virtualserver1,virtualserver2 -e virtualserver3,virtualserver4 -r 2

The example above will check virtual servers on 192.168.0.1 using the community string public, but will exclude virtualserver3 and virtualserver4 from any checks. It will not check whether virtualserver1 and virtualserver2 are enabled, but it will check that their health state is OK. It will also only try to contact the LTM twice.
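
The difference between -e and -d is easy to mix up, so here is a small Python sketch of the selection logic as described above. It is not the plugin's PHP code: the data model (an enabled flag and a healthy flag per virtual server) is a simplification I made up, whereas the real plugin reads these states over SNMP.

```python
def evaluate(servers, exceptions=(), disabled_exceptions=()):
    """servers maps name -> (enabled, healthy). Returns (problem_names, ok_count)."""
    problems, ok = [], 0
    for name, (enabled, healthy) in servers.items():
        if name in exceptions:
            continue  # -e: skip this virtual server entirely
        check_enabled = name not in disabled_exceptions  # -d: ignore the enabled state
        if not healthy or (check_enabled and not enabled):
            problems.append(name)
        else:
            ok += 1
    return problems, ok


# Hypothetical states for illustration.
servers = {
    "virtualserver1": (False, True),   # disabled but healthy, listed under -d
    "virtualserver2": (False, True),   # disabled but healthy, listed under -d
    "virtualserver3": (True, False),   # unhealthy, but excluded with -e
    "virtualserver4": (True, True),    # excluded with -e
    "virtualserver5": (True, True),
    "virtualserver6": (False, True),   # disabled and not excused
}
print(evaluate(servers,
               exceptions=("virtualserver3", "virtualserver4"),
               disabled_exceptions=("virtualserver1", "virtualserver2")))
```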

The plugin outputs performance data for the virtual servers it checks. Example output is below:

OK : 4 Virtual Servers OK | virtualserver1=98;;; virtualserver2=23;;; virtualserver3=10;;; virtualserver4;;;

Example of a check where the health state of a virtual server is unknown:

WARNING : virtualserver1, 90 Virtual Servers OK

Nagios HP MSA P2000 status and performance monitor

Wed, 26 Sep 2012 21:34:01 +0000

This script is intended to be used with Nagios to monitor an HP P2000 SAN. A description is below, and the file can be downloaded from the repository.

All versions can be found here: https://github.com/thomasweaver/check_p2000_api

Version 1.2 allows input of a username and password. Many thanks to Jon Witts, who added the code, which can be found here http://pastebin.com/T1DMYJbu# or download the zip below
Version 1.3 adds an extra check to the status check. Tjerk Nan found a problem whereby some disk failures were not picked up; one extra check on the vdisk fixes this.
Version 1.4 adds individual volume status and vdisk status checks, allowing you to create separate services for them. Thanks to Jon Witts for writing this http://www.jonwitts.co.uk/archives/296.
Version 1.5 adds a check for vdisk read and write latencies, which is available in later firmware versions. Thanks to Tjerk Nan for adding these extra checks.
Version 1.6 adds a curl code path to the plugin, so the pecl_http module is no longer needed but the curl libraries are.
Version 1.7 adds a timeout of 120 seconds to the pecl_http module, as timeouts with the default value have occurred on slow networks. Thanks to Aurélien MARTY for finding and fixing the issue.

NOTE: For this plugin to run, either the php-http module (http://php.net/manual/en/book.http.php) or, for version 1.6 and above, the curl libraries need to be installed. Only certain P2000 firmware versions support the statistics commands, so the performance checks won’t work on some SANs, but the status check should work on all P2000 firmware versions. You may need to run dos2unix check_p2000_api.php if you get an error about not finding /usr/bin/php. A bug was found in version 1.0.1 whereby it may not pick up a broken disk on later firmware versions; v1.1 fixes this by using several different API calls to determine the health.

The P2000 has an inbuilt API where any CLI command can be run and output in an XML format. This script utilises this api and can gather performance data and component status of the SAN.

The script uses the following commands:

show enclosure-status – /api/show/enclosure-status – This outputs all the enclosures, disks, fans, PSUs etc

show disk-statistics – /api/show/disk-statistics – This outputs the Disk statistics including IOPs, Read Speed etc.

show controller-statistics – /api/show/controller-statistics – This outputs the controller statistics including cpu etc

show vdisk-statistics – /api/show/vdisk-statistics – This outputs the vdisk statistics including IOPs, Read Speed etc

show volume-statistics – /api/show/volume-statistics – This outputs the volume statistics including IOPs, Read Speed etc

As of Version 1.2 the auth string is replaced with a Username and Password.

Version 1.1 and below:

The script uses an authentication string which is a combination of the username and password. To get this auth string you need to capture the traffic when you log in to the SAN Management HTTP interface and look at the first POST, which contains the auth string. The image below shows a packet capture with an authentication string highlighted.

Once you have the auth string you can run the below command:

./check_p2000_api.php -H 192.168.0.1 -L 035ce62a046eaace03bface2abf7e740

All Versions

By default the plugin checks the status of the P2000 SAN, so only the host address, username and password are needed.

./check_p2000_api.php -H 192.168.0.1 -U admin -P password

This will loop through all enclosures, fans, PSUs, temperatures, voltages and disks. It will not output any performance data. Below is an example of the output:

OK : Enclosure 1 OK, FAN 2 OK, PSU 2 OK, Temp 4 OK, Voltage 10 OK, Disk 10 OK,

To start getting performance data you first need to decide what you want to monitor. Log in to the HTTP Management interface, run the commands in the API yourself and look at the output. Below is some of the output from /api/show/disk-statistics

So say we want to get the IOPS of each disk we could run the command:

./check_p2000_api.php -H 192.168.0.1 -U admin -P password -c disk -S iops

OK : disk iops – disk_1.1 0, disk_1.2 0, disk_1.3 0, disk_1.4 0, disk_1.5 0, disk_1.6 0, disk_1.7 0, disk_1.8 0, disk_1.9 0, disk_1.10 0, | disk_1.1=0;;; disk_1.2=0;;; disk_1.3=0;;; disk_1.4=0;;; disk_1.5=0;;; disk_1.6=0;;; disk_1.7=0;;; disk_1.8=0;;; disk_1.9=0;;; disk_1.10=0;;;

-c is the command we want to run. There are currently 5 options: status, disk, vdisk, volume and controller. These map directly to the commands shown above, so specifying disk will run the command /api/show/disk-statistics. -S specifies which value to get from the output. This is the value of the “name” attribute in the XML output. In our example this is “iops”, but you could equally use “bytes-per-second-numeric” as the -S option. If no -S value is given it defaults to iops. Running the above command gives the IOPS for all disks in the controller.
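
To show how -c and -S fit together, here is a rough Python sketch (the plugin itself is PHP). The path table comes straight from the command list above. The XML fragment is invented: only the use of a “name” attribute comes from the real output, while the element names, the durable-id property and the values are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# -c value -> API path, as listed in the commands above.
API_PATHS = {
    "status": "/api/show/enclosure-status",
    "disk": "/api/show/disk-statistics",
    "controller": "/api/show/controller-statistics",
    "vdisk": "/api/show/vdisk-statistics",
    "volume": "/api/show/volume-statistics",
}


def values_for(xml_text, stat="iops"):
    """Return the text of every element whose name attribute equals stat."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter() if element.get("name") == stat]


# Invented fragment; element names and values are illustrative only.
SAMPLE = """\
<RESPONSE>
  <OBJECT>
    <PROPERTY name="durable-id">disk_1.1</PROPERTY>
    <PROPERTY name="iops">3</PROPERTY>
  </OBJECT>
  <OBJECT>
    <PROPERTY name="durable-id">disk_1.2</PROPERTY>
    <PROPERTY name="iops">7</PROPERTY>
  </OBJECT>
</RESPONSE>
"""

print(API_PATHS["disk"], values_for(SAMPLE, "iops"))
```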

What if we want to get the CPU utilization of the controller? We need to specify controller for the -c value and cpu-load for the -S value.

We can provide more options to the script, such as warning and critical values, using -w and -C. By default, if the value from the API is greater than the warning or critical value, the plugin returns the corresponding state. This is fine in most situations, but if you want to be warned when the IOPS, CPU or whatever is lower than these values, you can specify the -t option with a value of “lessthan”.

We can also specify what units the value is measured in with the -u option; this simply appends the unit to each value in the output. All of this can be done securely over HTTPS by specifying a value of 1 with the -s option (NOTE: there is a bug in at least one version of the HP P2000 where authentication over HTTPS sometimes doesn’t work). So let’s put all this into a command:

./check_p2000_api.php -H 192.168.0.1 -U admin -P password -c controller -S cpu-load -s 1 -u "%" -w 60 -C 80

This command will get the CPU load of the controllers in the enclosure over HTTPS and append the % sign to the output. It will be in a Warning state if the utilisation is over 60% and Critical if it’s over 80%.

The command would output something like below:

OK : controller cpu-load – controller_A 1%, controller_B 1%, | controller_A=1%;60;80; controller_B=1%;60;80;
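
The part after the pipe follows the Nagios performance data layout of label=value[unit];warn;crit; for each item. A short Python sketch that builds the same string is below; the function name is made up and this is not the plugin's PHP code.

```python
def perfdata(values, unit="", warning="", critical=""):
    """Build perfdata items of the form label=value[unit];warn;crit; joined by spaces."""
    return " ".join(
        f"{label}={value}{unit};{warning};{critical};" for label, value in values.items()
    )


print(perfdata({"controller_A": 1, "controller_B": 1}, unit="%", warning=60, critical=80))
print(perfdata({"disk_1.1": 0, "disk_1.2": 0}))
```

When no thresholds are given the warning and critical fields are simply left empty, which is why the disk example earlier shows values like disk_1.1=0;;;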

This command would give you a graph like below:

Once you have tested everything on the command line it’s time to add it to Nagios. First create the command:

define command {

   command_name                             check_p2000

   command_line                             $USER1$/check_p2000_api.php -H $HOSTADDRESS$ -U $USER11$ -P $USER12$ $ARG1$

}

$USER1$ is the path to your plugin directory. I put $USER11$ and $USER12$ (the username and password) in the Nagios resource.cfg so they can easily be changed if needed.

Once you have the command defined you then need to define the service:

define service {

    host_name                       SANHOST

    service_description             Controller CPU

    use                             generic-service

    check_command                   check_p2000!-c controller -S cpu-load -u "%" -w 60 -C 80!!!!!!!

    register                        1

   }

Don’t forget to always run nagios -v against your main configuration file to confirm your configuration is correct before restarting the nagios service.

Installing on Ubuntu

To get the plugin working on Ubuntu you will need to install the pecl_http extension. To do this, first install php-pear, the curl development libraries and make, then install the extension with pecl:

apt-get install  php-pear
apt-get install libcurl4-gnutls-dev
apt-get install make
pecl install pecl_http

You will be asked a few questions during this install; just accept the default answers. If you get an error saying it can’t find the curl headers, make sure you have installed the curl libraries with the command above. If the install succeeds it should say “You should add “extension=http.so” to php.ini”. So let’s do that using a text editor:

Add the line extension=http.so to /etc/php/php

If all that went well then you should now be able to use the script.

Installing on Groundwork

Groundwork comes with its own version of PHP located in /usr/local/groundwork/php.

To use this version of PHP I had to download the pecl_http extension and compile it from source. Before that you need to make sure a few things are installed first. I used CentOS but the principle should be the same on other Distros:

yum install autoconf
yum install make
yum install zlib-devel
yum install libcurl-devel

Now you need to download the pecl_http extension source, unpack it and build it:

wget http://pecl.php.net/get/pecl_http-1.7.4.tgz
tar -xvf pecl_http-1.7.4.tgz
cd pecl_http-1.7.4
phpize
./configure
make
make test
make install

Once installed add “extension=http.so” (without the quotes) to /usr/local/groundwork/php/etc/php.ini

Now change the first line of the script to:

#!/usr/local/groundwork/php/bin/php

Everything should work with the script now.