Tag Archives: monitoring

Simple OpenStack monitoring with Ganglia and Nagios

I’ve been running an OpenStack-based cloud for a while. While the modularity of OpenStack is a strength, helping the fast pace of development, it also means that the interactions between components can be quite complex, with all the possibilities for obscure errors this implies. For instance, upgrades in one component (such as a GlusterFS backend) can cause problems elsewhere. Here’s a description of some simple monitoring I’ve added to ameliorate this.

This assumes you already have Ganglia and Nagios available. There are two parts: a regular Ganglia check and a Nagios service that checks the Ganglia value, raising an alert if it crosses your chosen threshold. In my case, one of the sets of metrics I’m interested in is the number of instances in different states – active, error, build and shutoff. If there are too many in the build state, that may mean there’s a problem with the shared /var/lib/nova/instances directory, or with the scheduler, for example.

Here’s the script that runs on each compute node, triggered by cron every 10 minutes:

#!/bin/bash
# Script to check some OpenStack values and push them into Ganglia

# Source the OpenStack admin credentials
. /root/keystonerc_admin

INSTANCES_ERROR=$(nova list --all-tenants | grep -c ERROR)
INSTANCES_ACTIVE=$(nova list --all-tenants | grep -c ACTIVE)
INSTANCES_BUILD=$(nova list --all-tenants | grep -c BUILD)
INSTANCES_SHUTOFF=$(nova list --all-tenants | grep -c SHUTOFF)

# Publish each count; -d (dmax) and -x (tmax) are set to 1200s, twice the
# cron interval, so the metrics expire if the script stops running
/usr/bin/gmetric -d 1200 -x 1200 --name=instances_error --value=${INSTANCES_ERROR} --type=uint8
/usr/bin/gmetric -d 1200 -x 1200 --name=instances_active --value=${INSTANCES_ACTIVE} --type=uint8
/usr/bin/gmetric -d 1200 -x 1200 --name=instances_build --value=${INSTANCES_BUILD} --type=uint8
/usr/bin/gmetric -d 1200 -x 1200 --name=instances_shutoff --value=${INSTANCES_SHUTOFF} --type=uint8

The keystonerc_admin file needs to contain the OpenStack admin credentials used by the nova client. The --name value is what the Nagios check will reference. Note that --type=uint8 caps each reported value at 255; on larger clouds, use uint16 or uint32.
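For reference, a minimal keystonerc_admin looks something like this (the username, tenant, password and Keystone URL here are all placeholders — substitute your own):

```shell
# Hypothetical keystonerc_admin - every value below is a placeholder
export OS_USERNAME=admin
export OS_TENANT_NAME=admin
export OS_PASSWORD=secret
export OS_AUTH_URL=http://keystone.example.com:5000/v2.0/
```

Sourcing this file gives the nova client everything it needs to authenticate.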

In Nagios, assuming that you’ve already defined the hosts to be checked and created a cloud servicegroup, this service definition will raise an alert if the instance status values collected by the Ganglia script exceed the specified thresholds:

define service {
                use                     default-service
                service_description     INSTANCES ERROR
                servicegroups           cloud
                check_command           check_with_gmond!instances_error!5!10
                notification_options    c,w,f,s,r
                host_name               host.to.be.monitored
                normal_check_interval   30
}

define service {
                use                     default-service
                service_description     INSTANCES BUILD
                servicegroups           cloud
                check_command           check_with_gmond!instances_build!3!5
                notification_options    c,w,f,s,r
                host_name               host.to.be.monitored
                normal_check_interval   30
}

check_with_gmond is defined as a command that calls the plugin check_ganglia.
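If you don't already have such a command, the definition might look something like this (a sketch: the plugin location and the exact check_ganglia option letters are assumptions — check your copy of the plugin, which typically takes a host, a metric name, and warning/critical thresholds):

```
define command {
                command_name    check_with_gmond
                command_line    $USER1$/check_ganglia -h $HOSTADDRESS$ -m $ARG1$ -w $ARG2$ -c $ARG3$
}
```

The three `!`-separated arguments in the service's check_command map to $ARG1$ (metric name), $ARG2$ (warning threshold) and $ARG3$ (critical threshold).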

This simple approach can be extended to monitor images in Glance, volumes in Cinder etc.
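As a sketch of such an extension for Cinder (the `cinder list --all-tenants` call and the `error` status string are assumptions; the counting pipeline is demonstrated here on captured sample output so it can run stand-alone):

```shell
# Stand-in for "cinder list --all-tenants" output; in the real script
# this variable would instead be populated by the cinder CLI call
SAMPLE='| id-1 | available | vol1 | 10 |
| id-2 | error     | vol2 | 20 |
| id-3 | error     | vol3 | 10 |'

# Same grep -c counting pattern as the nova script
VOLUMES_ERROR=$(printf '%s\n' "$SAMPLE" | grep -c error)
echo "$VOLUMES_ERROR"   # 2

# Then publish with gmetric as before (commented out here, since
# gmetric needs a running gmond to talk to):
# /usr/bin/gmetric -d 1200 -x 1200 --name=volumes_error --value=${VOLUMES_ERROR} --type=uint8
```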

The two scripts are available as gists on GitHub: Ganglia script and Nagios script.


Unlikely petabyte network values in rrdtool/ganglia

Of late, the networking graphs in our Ganglia monitoring have suffered from irritating, improbable spikes (30PB…) that render them effectively meaningless. At first I tried the removespikes.pl script that others with the same problem had mentioned. This didn’t work all that well, either over- or under-shooting what was required. It also felt like treating the symptoms rather than the cause: after all, Ganglia just graphs whatever values end up stored in the RRD files.

Eventually I found a suggestion of applying a maximum value in the header of RRD files with rrdtool.  This way, I could rule out these (pretty much) impossible values.  Here’s an example command:

rrdtool tune bytes_in.rrd --maximum sum:9.0000000000e+09

Clearly, care is needed so that legitimate values aren’t excluded, e.g. on interfaces running at 10 gigabit or higher speeds. It’s been working well for the past week, and the network graphs are meaningful again (after I manually removed the outlying values).
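The one-off command above can be applied across every host's RRDs with a short loop (a sketch: the gmetad storage path is an assumption — /var/lib/ganglia/rrds is a common default — and this version only echoes the commands; drop the echo, and point RRD_DIR at the real path, to apply):

```shell
# Demo on a throw-away directory tree standing in for the gmetad
# storage directory; replace RRD_DIR with /var/lib/ganglia/rrds
# (or wherever your RRDs live) for real use
RRD_DIR=$(mktemp -d)
mkdir -p "$RRD_DIR/host1"
touch "$RRD_DIR/host1/bytes_in.rrd" "$RRD_DIR/host1/bytes_out.rrd"

find "$RRD_DIR" \( -name bytes_in.rrd -o -name bytes_out.rrd \) -print |
while read -r rrd; do
    # echo the command rather than running it, as a dry run
    echo rrdtool tune "$rrd" --maximum sum:9.0000000000e+09
done
```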

DevOpsDays London 2013

On Friday and Saturday last week I attended DevOpsDays London 2013. Other people have blogged about the event, though I haven’t seen much coverage of the Open Spaces sessions, so here are my thoughts, trying to fill in the gaps I haven’t seen covered elsewhere.

Sam Eaton’s talk was very popular. It emerged during the talk that he had left the job he was describing only the day before, so perhaps he was more frank than he would otherwise have been. He said he was wary of “owning” tools, because they end up “owning you”, creating “silos within yourself”. ActiveMQ was crucial to the approach he discussed: it had been installed for a specific reason, and they then made much wider use of it. He claimed it made the creation of bespoke tools easier, because you no longer had to worry about the communication layer. The common theme that asking for forgiveness is easier than asking for permission also featured, and he extended it to state that people should

deliver first, then evangelize

His slides are available here.

Gene Kim‘s talk was aimed at helping people sell the DevOps approach.  His idea of the

downward spiral of negative feedback

was quite powerful.

I liked the fact that Simon McCartney‘s Ignite talk included a round-up at the end of tools he wished he’d known about before he started his “Stackkicker” project, because he would have adapted one of them instead of starting from scratch.  A humble and practical admission.  Here are his slides.

Daniel Pope gave an Ignite talk on Saturday following up on something he’d mentioned in a Friday Open Space session I’d attended on storage. For testing he uses honeyd and the Fake ARP Daemon to, in essence, create a fake internet for testing. He also referred to

Test-driven development of infrastructure

and unit testing of infrastructure.

Open Spaces

The Open Spaces format was new to me, and it worked rather well. The description in fact makes it sound more complicated than it really is. In the storage session, people discussed various approaches they had taken to distributed storage, some of them using a product I’d never heard of before called MogileFS. People were rather wary of Gluster and even warier of Ceph – “I’ve heard the block storage is done, and the filesystem…isn’t” being one of the responses. There were lots of references to logstash and sensu in the session on monitoring; I’d been aware of both, and they now seem to have reached an inflection point in popularity.

The two Open Space sessions I attended on Saturday were on Clouds and Deployment.  Here are my bullet point notes from them:

  • Cloud experiences
    • Orchestration and automatic scaling is a problem
    • Need to get used mentally to killing your servers
    • Problem of cloud instance naming
      • Hard to know which machine is which
      • Solve this with tags
      • Or change the machine host name and put the instance id into the role eg with cloud-init
      • Also discovery-type pattern, eg using mcollective
      • Can use other inventories eg Chef
      • test driven infrastructure
        • Automatically test new instances that are part of a service, and kill them if they don’t respond correctly
    • How do you detect and deal with poorly performing nodes?
    • Interview question – what do you do if things are failing?
      • Rollback and rebuild
    • Kill a problem node first time
      • If it happens repeatedly, investigate
    • Riemann as a dashboard
    • Monitoring of cloud instances?
      • Much more dynamic than physical machine monitoring
      • Combination of mcollective and sensu
        • How to handle when instance ends?
        • Maybe a cron job on Sensu server?
        • Need to keep information about past machines in order to enable historical performance comparisons
        • Same host name may be reused with different sized instances
        • Custom tools needed for this at present
        • Need to tie machine’s details with monitoring output for this
        • Maybe keep all logs and process them afterwards
      • We don’t even know what the questions are re. Cloud, never mind how to solve them, compared with physical data centres
        • Difference between things staying mostly the same and things mostly changing
    • Handling of multiple regions?
      • Security groups don’t transfer automatically between aws regions
      • VPC should help with this
  • Deployment
    • Prefer to have everything in packages, to be able to track dependencies and check integrity
      • Use mcollective to trigger updates
      • Pulp to manage repositories
      • Build with Jenkins
    • Build with Jenkins and deploy with it too
    • Use versioning in the package, to cope with different application versions
    • Advice to use multiple Jenkins machines for different purposes, rather than try to do everything on one machine
    • How do you know what’s been deployed where, when using Jenkins? (which isn’t primarily a deployment tool)
      • Use post install scripts in rpms to register in graphite
      • Use work flow management plugin in Jenkins
    • Push application artifacts into Nexus, as an alternative approach
    • Liquibase
    • Need to have configuration and binaries integrated, i.e. in the same Puppet module, to ensure they’re in sync
    • Want to have a local repository and local mirror of everything you deploy, because you can’t rely on Internet resources being there
    • Be careful using something like Maven, because it will use snapshots from the Internet by default, which hurts reliability and reproducibility
      • Therefore block Internet for such cases
      • Use a proxy?
    • How many developers know how to write spec files?
    • Keep environment configuration separate from application configuration
      • Same tags in version control
    • Restrict access to eg Puppet modules to certain developers
    • Use git tags to keep track of things
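One concrete way to do the “register deploys in Graphite” idea from the notes above (a sketch: the metric path, Graphite host and port are all assumptions; Graphite’s plaintext protocol just accepts “path value timestamp” lines on port 2003):

```shell
# This would typically live in an RPM %post scriptlet. The line is
# built and shown here; the actual send is commented out so the sketch
# runs without a Graphite server.
METRIC="deploys.$(hostname -s).myapp"
LINE="$METRIC 1 $(date +%s)"
echo "$LINE"
# printf '%s\n' "$LINE" | nc -w1 graphite.example.com 2003
```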