Finding love in monitoring

By Luke Tymowski, Systems Administrator, Calgary

Monitoring, long an operations backwater, began to get some attention in early 2011, but for all the wrong reasons. None of the tools had made any progress since the 1990s. "Monitoring sucks!" was a common complaint on Twitter (the #monitoringsucks hashtag began to trend at one point).

When all your servers are physical machines, and you're watching them from the same network, monitoring is not too painful. You may realize that there are better ways to do it, but that would require a lot of work, and in operations, your job is to maintain services and the servers necessary to provide them. Building ambitious new tools is usually not part of your mandate.

The cloud changed all of that. Now you're maintaining dozens to thousands of servers that are being created and destroyed by software, and their lifespans are not measured in years. They're in data centres far from you, on networks you don't control. Suddenly, monitoring servers and the services they host becomes very painful, especially considering that the tools you use were designed for long-lived, static, physical machines on your own network. In these conditions, the #monitoringsucks hashtag makes sense (see Jason Dixon's more detailed explanation of what's wrong with the older monitoring tools).

The problem with virtual monitoring

Where there is a gap between what is needed and what is available, there are business opportunities. A number of startups have promised solutions to these monitoring problems. Some certainly appeared to deliver on their promises, but are expensive. One that stood out was Cloudkick, which was quickly acquired by Rackspace, and is now part of Rackspace's Cloud monitoring team. Another interesting one is Boundary, which, when it was first announced, appeared to focus on the network layer, but now offers monitoring solutions for the full stack, from the network to the application layer. The 800lb gorilla in this space is New Relic. And there are a number of other vendors in this field.

Static checks and metrics

Monitoring is usually viewed in two parts. The first is static checks and the alerts they trigger (is my server alive, is my CPU load too high, is my network saturated). The second is metrics (graphs showing load, network bandwidth, and memory usage over time).

Both are necessary (checks/alerts and metrics), but they're usually separate systems. After a while you find yourself thinking they should be one and the same. In the last year, a lot of developers have said the same thing.
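To make the distinction concrete, here is a minimal sketch in Python. It is not taken from any of the tools mentioned in this article; the function names, the load threshold of 4.0, and the five-sample window are all assumptions chosen for illustration. A static check looks only at the latest reading, while an alert built on metrics can look at the recent history of the same value.

```python
from statistics import mean


def static_check(current_load, threshold=4.0):
    """Static check: judges only the single most recent reading."""
    return "CRITICAL" if current_load > threshold else "OK"


def metric_alert(samples, threshold=4.0, window=5):
    """Metrics-based alert: fires only when the mean of the last
    `window` samples exceeds the threshold (hypothetical rule)."""
    recent = samples[-window:]
    if len(recent) < window:
        return "OK"  # not enough history to decide yet
    return "CRITICAL" if mean(recent) > threshold else "OK"


# Hypothetical load-average samples, oldest first.
brief_spike = [1.2, 1.1, 1.3, 1.0, 6.5]
sustained_load = [5.0, 5.5, 6.0, 5.2, 6.5]

print(static_check(brief_spike[-1]))     # CRITICAL
print(metric_alert(brief_spike))         # OK
print(static_check(sustained_load[-1]))  # CRITICAL
print(metric_alert(sustained_load))      # CRITICAL
```

The static check pages you for a one-off spike, while the metrics-based rule stays quiet until the load is high across the whole window. Both decisions come from the same data, which is one reason the two systems start to look like they should be one.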

Librato, a new startup, focuses on collecting metrics, then gives you the means to create alerts based on metrics data. It is also very affordable compared to many of the service-based monitoring and metrics companies.

So why the current trend towards building alerts on top of metrics, rather than static checks? Baron Schwartz explained why this shift in monitoring is important in a recent episode of the Food Fight Show, a podcast about DevOps tools and processes. (Baron is well known for his work in the MySQL community. He recently left Percona, a MySQL consulting shop, to create a new metrics-based startup called VividCortex.)

What if you want, or need, to build your own monitoring and metrics stack?

Sensu is a new monitoring tool that appeared at the end of 2011. Sonian, a Boston-based company, used Nagios to monitor its AWS-based stack and found it very painful. Nagios was built in the 1990s, and its approach hasn't changed much over the years. It is reliable, and many companies, including Cybera, base their monitoring stack on Nagios. But its shortcomings become readily apparent when working in the cloud.

Sean Porter, a Vancouver-based contractor for Sonian, had some ideas on how he might reinvent monitoring for the cloud. He developed Sensu (explained here), which has become very popular in the past year. It's Ruby-based, as are the two big configuration management tools (Chef and Puppet), and was designed from the beginning to be installed and managed by tools like Chef and Puppet. (Sean has since moved to Heavy Water, an operations automation-focused consulting company.)

Another tool that is gaining a lot of traction is Riemann. It's written in Clojure, a JVM-based Lisp implementation. Kyle Kingsbury, Riemann's developer, spent a few years working for Boundary, so he's very familiar with the problems encountered by developers and sysadmins trying to monitor and gauge modern (i.e. cloud-based) application stacks. He left Boundary and spent six months focusing on Riemann (this video explains how it works).

Good graphing

Graphite is the current metrics tool of choice. (Graphite forms the basis of Librato's stack.) Etsy is a vocal and enthusiastic Graphite user, and one of the tools they've developed, StatsD, allows you to easily add data to Graphite.
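As a rough illustration of why StatsD makes this easy, the sketch below builds metrics in StatsD's plain-text line format (name:value|type) and defines a helper that sends one over UDP. The metric names, the helper function names, and the assumption that a StatsD daemon is listening on 127.0.0.1 port 8125 (StatsD's default port) are illustrative, not taken from Etsy's code.

```python
import socket


def format_statsd(name, value, metric_type):
    """Build one StatsD line: "c" = counter, "ms" = timer, "g" = gauge."""
    return f"{name}:{value}|{metric_type}"


def send_statsd(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; host and port are assumed defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names for a web application.
print(format_statsd("web.requests", 1, "c"))         # web.requests:1|c
print(format_statsd("web.response_time", 320, "ms"))  # web.response_time:320|ms
print(format_statsd("web.active_users", 42, "g"))     # web.active_users:42|g
```

Calling send_statsd(format_statsd("web.requests", 1, "c")) from application code would increment a counter. Because the transport is UDP, the application does not wait for a reply, and a missing or slow StatsD daemon does not hold up the request being measured.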

But Graphite too has a steep learning curve. Jason Dixon has built two tools to help develop Graphite-based dashboards: Tasseo lets you build real-time dashboards easily; and Descartes helps you correlate metrics in a single chart in a way that Tasseo charts cannot.

Monitoring love

So there are now many new, interesting, and immediately useful monitoring and metrics tools available. Is the #monitoringsucks hashtag still relevant or appropriate? No. Over the last year, a new hashtag has been trending on Twitter: #monitoringlove.

Monitorama

To capitalize on the current abundance of monitoring and metrics energy, Jason Dixon has organized a conference on just that: Monitorama. Where most conferences have attendees who listen to speakers but who don't otherwise participate in the event, Monitorama expects everyone, not just the speakers, to contribute.

Will this be the summer of monitoring love?