By Matt Delaney, DevOps, Cybera
A group of us at Cybera have been using the ELK stack (ElasticSearch, Logstash, Kibana) for about a year and a half, and for the most part it's been great. We've been able to diagnose issues, gather usage statistics, and visualize system performance much faster and more easily than in the past. However, one issue cropped up more and more frequently as the dataset grew: ElasticSearch began crashing... a lot. Every time our ElasticSearch cluster crashed, it took longer and longer to get it back into a healthy state. Up until this point, we'd only invested small amounts of time in ElasticSearch, but it became increasingly clear that if we wanted to keep enjoying the benefits of the ELK stack, we'd need to learn how to set up a more robust ElasticSearch cluster.
Conveniently, ElasticON (the first official ElasticSearch conference) took place just as we were reaching the height of our problems. I attended with the hope of finding some best practices and learning what was working for other people running clusters with much larger data sets than ours. I was not disappointed. I came away from the conference with a bunch of ideas for making our cluster run a little smoother. In this post I'll go over our old architecture, why it didn't work well for us, what we changed in the new architecture, and why we made those changes.
First, a bit about our use case. We're using ElasticSearch to do log file analysis for the Learning Management Cloud. We process anywhere between 5 and 10 million log events every day (including things like syslog, Apache access logs, and more). We use Kibana to work with the data and we usually just look at the most recent couple of days; however, from time to time we look at longer term trends spanning a few months at a time.
The above diagram shows our old architecture. We had a number of machines producing logs, which we sent to RabbitMQ, where they'd sit until our Logstash instance could process the event. Once an event was processed, it would be inserted into our ElasticSearch cluster (using the REST API). We'd then use Kibana to visualize and analyze the events stored in ElasticSearch. We were load balancing any requests to ElasticSearch using HAProxy.
In ElasticSearch itself we had about 250 daily indices (i.e. 250 days of Logstash data), where each index had 5 shards and 2 replicas of each shard. The ElasticSearch VMs were running on OpenStack with 8GB of memory (4GB allocated to the JVM) and 4 CPUs. This worked fine when we were only doing small queries in Kibana; however, when we started looking at more than a week at a time, the cluster would often come crashing down. When we restarted the cluster, it took quite some time just to get into a yellow health state (primary shards allocated, but missing replicas), and potentially many more hours to get back to a fully healthy green state (all primary and replica shards allocated).
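To put those numbers in perspective, here's a rough back-of-the-envelope calculation of how many shards that setup asked each of our five nodes to carry (the figures come straight from the numbers above; this is only an illustration of scale, not anything we ran in production):

```python
# Rough shard arithmetic for the old cluster.
indices = 250         # ~250 daily Logstash indices
shards_per_index = 5  # primary shards per index
replicas = 2          # replica copies of each primary shard
nodes = 5             # ElasticSearch VMs in the old cluster

total_shards = indices * shards_per_index * (1 + replicas)
shards_per_node = total_shards // nodes

print(total_shards)     # 3750 shards in the cluster
print(shards_per_node)  # 750 shards per node
```

Hundreds of shards per node, each with its own overhead, on machines with only 4GB of JVM heap goes a long way toward explaining why the cluster fell over under large queries.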
What was wrong with our old architecture?
It turns out there were a number of issues with our architecture setup. That being said, the biggest problem causing the crashes was memory, or more specifically, the lack thereof. Queries that make use of sorting or faceting use field data, and if you don't have the free memory for the field data cache when you make such a query, it can crash your ElasticSearch node. There are ways to mitigate this (setting the field data cache size or setting expiry times for field data), but the bottom line is that if you don't have enough memory available on your machine, you'll run into problems. It's worth noting that 'doc values' are an alternative to field data that can prevent out-of-memory issues by utilizing disk space at a cost to performance; check out this blog post for more information on doc values. At ElasticON, they mentioned that doc values will become the default in a future version of ElasticSearch.
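The mitigations mentioned above are set in elasticsearch.yml. A sketch of what that might look like (the values here are illustrative, not a recommendation, and the settings apply to the 1.x versions of ElasticSearch we were running):

```yaml
# elasticsearch.yml -- bound the field data cache so a single large
# query evicts old entries instead of exhausting the heap
indices.fielddata.cache.size: 40%

# optionally expire cached field data after a period of inactivity
# (trades cache hits for reclaimable memory)
indices.fielddata.cache.expire: 10m
```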
There was a great talk at the conference on scaling ElasticSearch clusters, with a ton of useful information, including the recommendation to use data nodes with 64GB RAM and 4 CPUs, giving 50% of your memory to the JVM and leaving the rest for the operating system. Anything past this 64GB of RAM should be scaled out horizontally by adding additional data nodes (the reason for this limit is that the JVM does not effectively handle heaps larger than 32GB). In our case, we did not have the available resources to dedicate 64GB of RAM to each data node, so we decided to use 16GB for now.
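On the 1.x packages we were using, the heap split is controlled by the ES_HEAP_SIZE environment variable. A sketch for a 16GB data node (the file path is the Debian/Ubuntu package default; adjust for your distribution):

```shell
# /etc/default/elasticsearch -- give the JVM half of the machine's
# 16GB of RAM and leave the rest for the OS filesystem cache
ES_HEAP_SIZE=8g
```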
Aside from the memory issue, we could make the ElasticSearch cluster more resilient by separating out the node roles. In our initial setup we had five ElasticSearch nodes, each taking on the roles of master, data, and client node all at once. Having dedicated master nodes is highly recommended for improving cluster stability. A dedicated client node handles any operations on the cluster (inserting, deleting, searching, etc.) and load balances requests appropriately among the data nodes.
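The roles are separated with two boolean settings in elasticsearch.yml. A minimal sketch of the three configurations (for the ElasticSearch 1.x settings names):

```yaml
# elasticsearch.yml -- dedicated master node
node.master: true
node.data: false

# dedicated data node:
#   node.master: false
#   node.data: true

# client node (neither master-eligible nor holding data):
#   node.master: false
#   node.data: false
```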
Our Logstash node was another area we were able to improve. Up until now, we'd been sending events to ElasticSearch via the REST API; however, Logstash can actually join the ElasticSearch cluster as a client node and send events directly. This avoids the overhead of making an HTTP request for each action.
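In the Logstash versions contemporary with this setup, that's done by setting the elasticsearch output's protocol to "node". A sketch (the cluster name here is a placeholder, not our actual cluster name):

```
# logstash output config -- join the cluster as a client node
# instead of going through the REST API
output {
  elasticsearch {
    protocol => "node"
    cluster  => "my-es-cluster"
  }
}
```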
Our new architecture
In our new cluster we have three master nodes, each with 8GB RAM and 4 CPUs; three data nodes with 16GB RAM and 4 CPUs; and one client node with 16GB RAM and 4 CPUs (in a perfect world, if we had the resources available, we'd have 64GB RAM for our data nodes). We've configured ElasticSearch to require at least two master nodes to form a cluster (this is needed to avoid the possibility of a split-brain cluster). In this setup, we can lose any one master node and up to two data nodes (with a replication level of two).
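The two-master requirement is one line in elasticsearch.yml on each master-eligible node; the value is the usual quorum formula for three master-eligible nodes:

```yaml
# elasticsearch.yml -- require a quorum of master-eligible nodes
# before electing a master: floor(3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
```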
With this new setup, the data nodes are using almost 100% of their JVM heap, the elected master node is using just under 50% of its JVM heap, and the client node only uses a very small amount of its JVM heap. Eventually, we'll want to increase the memory allocated to the data nodes. We've also set the field data cache size to 50%, to ensure that we won't crash if we get a query that would normally exhaust all heap space. In the meantime, we're able to deal with queries covering a couple of months without any problem (queries beyond that are still somewhat problematic, but at least they don't crash the ElasticSearch nodes anymore).
We've been running with this new configuration for about a month now and it has proven to be sufficiently resilient. There have been a number of unplanned system restarts (unrelated to ElasticSearch) and the cluster was able to continue operating as expected. The restarted system was also able to automatically rejoin the cluster without intervention.
It's also worth noting that we've started to limit the number of days' worth of data we're keeping in ElasticSearch. Prior to this, we were storing as much as we could fit on the disks allocated to the data nodes; however, as we rarely need to look at data older than a few months, we've decided to remove data that is older than 120 days using ElasticSearch Curator. This has the added advantage of shorter recovery times if one of the data nodes is lost.
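For reference, a Curator invocation along these lines handles the 120-day cutoff (the exact syntax varies between Curator versions; this is the older command-line style, and the host is a placeholder):

```
# delete daily Logstash indices older than 120 days
curator --host localhost delete indices \
  --older-than 120 --time-unit days --timestring '%Y.%m.%d'
```

We run this from cron once a day, which keeps disk usage on the data nodes predictable.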
Thanks for reading, and I hope this post saves you a bit of time by helping you avoid some of the pitfalls we ran into! If you have any questions or tips of your own, please leave a comment. In the future, we'll write more posts as we continue learning and improving our ElasticSearch cluster.