Simple OpenStack monitoring with Ganglia and Nagios

I’ve been running an OpenStack-based cloud for a while. While the modularity of OpenStack is a strength, helping the fast pace of development, it also means that the interactions between components can be quite complex, with all the possibilities for obscure errors this implies. For instance, upgrades in one component (such as a GlusterFS backend) can cause problems elsewhere. Here’s a description of some simple monitoring I’ve added to ameliorate this.

This assumes you already have Ganglia and Nagios available. There are two parts: a script that regularly publishes metrics to Ganglia, and a Nagios service that checks each Ganglia value, raising an alert if it crosses your chosen threshold. In my case, one of the sets of metrics I’m interested in is the number of instances in each state: active, error, build and shutoff. If there are too many in the build state, that may mean there’s a problem with the shared /var/lib/nova/instances directory, or with the scheduler, for example.

Here’s the script that runs on each compute node, triggered by cron every 10 minutes:

#!/bin/bash
# Script to check some OpenStack values

# Load the OpenStack Nova API credentials
. /root/keystonerc_admin

# Query Nova once and count each state from the same listing
NOVA_LIST=$(nova list --all-tenants)
INSTANCES_ERROR=$(echo "$NOVA_LIST" | grep -c ERROR)
INSTANCES_ACTIVE=$(echo "$NOVA_LIST" | grep -c ACTIVE)
INSTANCES_BUILD=$(echo "$NOVA_LIST" | grep -c BUILD)
INSTANCES_SHUTOFF=$(echo "$NOVA_LIST" | grep -c SHUTOFF)

# Publish the counts to Ganglia (uint32 rather than uint8, which cannot hold more than 255)
/usr/bin/gmetric -d 1200 -x 1200 --name=instances_error --value="${INSTANCES_ERROR}" --type=uint32
/usr/bin/gmetric -d 1200 -x 1200 --name=instances_active --value="${INSTANCES_ACTIVE}" --type=uint32
/usr/bin/gmetric -d 1200 -x 1200 --name=instances_build --value="${INSTANCES_BUILD}" --type=uint32
/usr/bin/gmetric -d 1200 -x 1200 --name=instances_shutoff --value="${INSTANCES_SHUTOFF}" --type=uint32

The file keystonerc_admin needs to contain the OpenStack Nova API credentials. The --name value will be used in the Nagios check.
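One weakness of counting with a plain grep is that it matches the state name anywhere on the line, so an instance called ERROR-test would be counted as an error. The following is a minimal sketch of a stricter count, assuming the nova list output is the usual ASCII table with a header cell named Status. The function name count_status is my own invention for illustration, not part of any OpenStack tool.

```shell
#!/bin/bash
# count_status (hypothetical helper): count the rows of an ASCII table,
# read on stdin, whose Status column is exactly equal to $1.
# Assumes a header row containing a cell named "Status".
count_status() {
    awk -F'|' -v want="$1" '
        # Strip surrounding blanks from a table cell.
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        # Until the header row is seen, look for the Status column.
        col == 0 { for (i = 1; i <= NF; i++) if (trim($i) == "Status") col = i; next }
        # Data rows: count exact matches only.
        trim($col) == want { n++ }
        END { print n + 0 }
    '
}

# Intended use (requires the nova client and credentials):
# INSTANCES_ERROR=$(nova list --all-tenants | count_status ERROR)
```

Because the column is found by its header name rather than by position, the same function should cope with client versions that add extra columns, although I have not checked every release.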

In Nagios, assuming that you’ve already defined the hosts to be checked and created a cloud servicegroup, this service definition will raise an alert if the instance status values collected by the Ganglia script exceed the specified thresholds:

define service {
                use default-service
                service_description INSTANCES ERROR
                servicegroups cloud
                check_command check_with_gmond!instances_error!5!10
                notification_options c,w,f,s,r
                host_name host.to.be.monitored
                normal_check_interval 30
}

define service {
                use default-service
                service_description INSTANCES BUILD
                servicegroups cloud
                check_command check_with_gmond!instances_build!3!5
                notification_options c,w,f,s,r
                host_name host.to.be.monitored
                normal_check_interval 30
}

check_with_gmond is defined as a command that calls the plugin check_ganglia.
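The two numbers after the metric name in each check_command are the warning and critical thresholds. To illustrate the kind of comparison such a check performs, here is a rough sketch in shell. It is not the check_ganglia plugin itself: the function name check_threshold is hypothetical, and treating a value equal to the threshold as a breach is my assumption. The exit codes (0 OK, 1 WARNING, 2 CRITICAL) are the standard Nagios plugin convention.

```shell
#!/bin/bash
# check_threshold (hypothetical helper): compare an integer metric value
# with warning and critical thresholds and report the result using the
# standard Nagios plugin exit codes.
# Assumption: a value equal to a threshold counts as breaching it.
check_threshold() {
    local name=$1 value=$2 warn=$3 crit=$4
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - $name = $value"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - $name = $value"
        return 1
    fi
    echo "OK - $name = $value"
    return 0
}
```

With the thresholds from the first service definition, calling check_threshold instances_error 7 5 10 would report a warning, and any value of 10 or more would be critical.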

This simple approach can be extended to monitor images in Glance, volumes in Cinder etc.
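As an illustration of that extension, here is a sketch for Cinder volumes. It assumes that cinder list prints an ASCII table with a Status header cell, as the nova client does. The function name tally_status and the volumes_ metric prefix are made up for this example.

```shell
#!/bin/bash
# tally_status (hypothetical helper): tally every value found in the
# Status column of an ASCII table read on stdin and print one
# "<prefix>_<status> <count>" line per state, sorted by name.
tally_status() {
    awk -F'|' -v prefix="$1" '
        # Strip surrounding blanks from a table cell.
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        # Until the header row is seen, look for the Status column.
        col == 0 { for (i = 1; i <= NF; i++) if (trim($i) == "Status") col = i; next }
        # Data rows (border lines have a single field and are skipped).
        NF > 1 { count[tolower(trim($col))]++ }
        END { for (s in count) print prefix "_" s, count[s] }
    ' | sort
}

# Intended use (requires the cinder client, credentials and gmetric):
# cinder list --all-tenants | tally_status volumes |
# while read -r name value; do
#     /usr/bin/gmetric -d 1200 -x 1200 --name="$name" --value="$value" --type=uint32
# done
```

Note that, unlike the instance script, this only emits a metric for states that actually occur, so a state with no volumes produces no value rather than zero. If a Nagios check depends on a particular metric always being present, it would be safer to publish that one explicitly.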

The two scripts are available as gists on GitHub: Ganglia script and Nagios script.


OpenStack ephemeral disk problem

I’m administering an OpenStack cloud, running the Folsom release.  One of the users requested some ephemeral disk space for their instances, so I created a custom flavour to meet their requirements.  Unfortunately, all the instances went straight to the error state, because the scheduler couldn’t find a valid host.  This was strange, because I knew there were several hosts with sufficient space.  Here is the error message:

2013-06-12 13:20:19 DEBUG nova.scheduler.filter_scheduler [req-XXX] Attempting to build 1 instance(s) schedule_run_instance /usr/lib/pytho
2013-06-12 13:20:19 WARNING nova.scheduler.manager [req-XXX] Failed to schedule_run_instance: No valid host was found. Exceeded max scheduling attempts 3 for instance XXX
2013-06-12 13:20:19 WARNING nova.scheduler.manager [req-XXX] [instance: XXX] Setting instance to ERROR st

After some searching, I came across this question on a Rackspace forum.  The moderator suggested it was an Ubuntu packaging problem.  It also exists in the Red Hat packages I’m using, so I made the change myself and it fixed the ephemeral disk creation errors I was seeing.

I’ll mention this to the Red Hat/Fedora developers.

A direct link to the diff is here.

Interleaving bugs

It’s useful to know the component parts of a problem.  Not just their nature, but their number.  If there’s a component you don’t know about, you may spend excessive time on fruitless attempts to solve other parts that in reality rely upon the undiscovered aspect.

Last week I experienced this while working on an OpenStack setup. Instances were to be launched by a remote glideinWMS server, using HTCondor’s EC2 interface. When invoked manually through a simple job description, instances did start on the cloud controller node. When invoked by glideinWMS, they failed. Log information at the remote end showed that the problem was an HTTP 414 return code, indicating that the URI was too long, but nothing was logged at the OpenStack end specifically related to the launching of instances. (Other requests from glideinWMS were being logged as succeeding, so it wasn’t a simple connectivity problem.)

At first I thought it might be a quota problem. Increasing the defaults had no effect, so it wasn’t that. What was more puzzling was the complete lack of any local trace of the launch request. Eventually I found this OpenStack bug, which looked like it might explain that. Talking it over with two of the incredibly helpful glideinWMS developers, I found some WSGI code with an internally specified limit on the length of incoming requests. It turned out this code was included in OpenStack and, once I increased the limit and restarted the Nova API service, the launch requests coming from the remote site started working immediately. The bug was reported and a fix proposed (again, by a glideinWMS developer).

The whole process was very frustrating until it clicked that I was dealing with two problems and that one was unhelpfully obscuring evidence needed to help solve the other.