
Simple OpenStack monitoring with Ganglia and Nagios

I’ve been running an OpenStack-based cloud for a while. The modularity of OpenStack is a strength, supporting its fast pace of development, but it also means the interactions between components can be quite complex, with all the scope for obscure errors that implies. For instance, an upgrade to one component (such as a GlusterFS backend) can cause problems elsewhere. Here’s a description of some simple monitoring I’ve added to help catch these problems early.

This assumes you already have Ganglia and Nagios available. There are two parts: a regular Ganglia check, and a Nagios service that reads the resulting Ganglia value and raises an alert if it crosses your chosen thresholds. In my case, one set of metrics I’m interested in is the number of instances in each state: active, error, build and shutoff. If too many instances are sitting in the build state, for example, that may mean there’s a problem with the shared /var/lib/nova/instances directory or with the scheduler.

Here’s the script that runs on each compute node, triggered by cron every 10 minutes:

#!/bin/bash
# Script to push some OpenStack instance-state counts into Ganglia

# Load the OpenStack admin credentials
. /root/keystonerc_admin

# Count instances in each state, across all tenants
INSTANCES_ERROR=$(nova list --all-tenants | grep -c ERROR)
INSTANCES_ACTIVE=$(nova list --all-tenants | grep -c ACTIVE)
INSTANCES_BUILD=$(nova list --all-tenants | grep -c BUILD)
INSTANCES_SHUTOFF=$(nova list --all-tenants | grep -c SHUTOFF)

# Publish each count as a Ganglia metric; -d/-x 1200 give the metric a lifetime
# and maximum update interval of 20 minutes, to match the 10-minute cron schedule
/usr/bin/gmetric -d 1200 -x 1200 --name=instances_error --value=${INSTANCES_ERROR} --type=uint8
/usr/bin/gmetric -d 1200 -x 1200 --name=instances_active --value=${INSTANCES_ACTIVE} --type=uint8
/usr/bin/gmetric -d 1200 -x 1200 --name=instances_build --value=${INSTANCES_BUILD} --type=uint8
/usr/bin/gmetric -d 1200 -x 1200 --name=instances_shutoff --value=${INSTANCES_SHUTOFF} --type=uint8
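
The cron entry driving it could look something like this (in /etc/cron.d format, with a made-up path for the script); it runs as root, which is needed to read /root/keystonerc_admin:

# Collect OpenStack instance counts for Ganglia every 10 minutes
*/10 * * * * root /usr/local/bin/openstack-gmetric.sh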

The file keystonerc_admin needs to contain the OpenStack Nova API credentials. The --name value will be used in the Nagios check.
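
If you don’t already have one, keystonerc_admin is just a set of exported OpenStack environment variables along these lines (all values here are placeholders):

export OS_USERNAME=admin
export OS_TENANT_NAME=admin
export OS_PASSWORD=replace-with-admin-password
export OS_AUTH_URL=http://your-keystone-host:5000/v2.0/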

In Nagios, assuming that you’ve already defined the hosts to be checked and created a cloud servicegroup, these service definitions will raise an alert if the instance counts collected by the Ganglia script exceed the warning and critical thresholds given in check_command:

define service {
                use default-service
                service_description INSTANCES ERROR
                servicegroups cloud
                check_command check_with_gmond!instances_error!5!10
                notification_options c,w,f,s,r
                host_name host.to.be.monitored
                normal_check_interval 30
}

define service {
                use default-service
                service_description INSTANCES BUILD
                servicegroups cloud
                check_command check_with_gmond!instances_build!3!5
                notification_options c,w,f,s,r
                host_name host.to.be.monitored
                normal_check_interval 30
}

check_with_gmond is defined as a command that calls the plugin check_ganglia.
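
For completeness, the command definition looks roughly like this, assuming check_ganglia is the stock check_ganglia.py plugin from the Ganglia contrib directory, installed alongside the other Nagios plugins:

define command {
                command_name check_with_gmond
                command_line $USER1$/check_ganglia.py -h $HOSTADDRESS$ -m $ARG1$ -w $ARG2$ -c $ARG3$
}

The metric name and the warning and critical thresholds in check_command arrive here as $ARG1$, $ARG2$ and $ARG3$.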

This simple approach can easily be extended to monitor images in Glance, volumes in Cinder, and so on.
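
For example, adding something like this to the same script would track errored volumes and active images (the metric names and status strings are illustrative, so check them against the output of your own cinder and glance clients):

VOLUMES_ERROR=$(cinder list --all-tenants | grep -c error)
IMAGES_ACTIVE=$(glance image-list | grep -c active)

/usr/bin/gmetric -d 1200 -x 1200 --name=volumes_error --value=${VOLUMES_ERROR} --type=uint8
/usr/bin/gmetric -d 1200 -x 1200 --name=images_active --value=${IMAGES_ACTIVE} --type=uint8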

The two scripts are available as gists on GitHub: Ganglia script and Nagios script.

Unlikely petabyte network values in rrdtool/Ganglia

Of late, the network graphs in our Ganglia monitoring have suffered from irritating, improbable spikes (30PB…) that effectively render them meaningless. At first I tried the removespikes.pl script that I saw mentioned by others with the same problem. This didn’t work all that well, either over- or under-shooting what was required. It also felt like treating the symptoms rather than the cause: after all, the Ganglia web frontend is just plotting whatever values are stored in the RRD files.

Eventually I found a suggestion to set a maximum value in the header of the RRD files with rrdtool tune. That way, these (pretty much) impossible values can be ruled out at the source. Here’s an example command:

rrdtool tune bytes_in.rrd --maximum sum:9.0000000000e+09
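
To apply the same cap to every host in one go (and to bytes_out.rrd, if those graphs are affected too), a small loop over the RRDs does the job. This assumes the default gmetad layout of one directory per cluster and host under /var/lib/ganglia/rrds:

for rrd in /var/lib/ganglia/rrds/*/*/bytes_in.rrd; do
    rrdtool tune "$rrd" --maximum sum:9.0000000000e+09
done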

Clearly some care is needed so that legitimate values aren’t excluded, e.g. on interfaces running at 10 gigabit or higher speeds. It’s been working well for the past week, and the network graphs are meaningful again (after manually removing the outlying values).