2016-09-21

Everyone has a strong opinion about DevOps: either you love it, or you hate it. We believe in DevOps; in the Erlang world, we were doing DevOps even before the term was invented. The power of the Erlang Virtual Machine and the visibility you get when troubleshooting are second to none, but you need to know what you are doing, as you can affect the system and even cause outages. Speaking with developers who dislike DevOps, we are often told that the fun ends with delivering the product. They believe support is boring, and amounts to little more than mastering the stack's configuration options. We believe they are so wrong! Give us a few minutes to tell you a story, and then you can make up your own mind.

This is a battle story from the trenches: a story about supporting and operating a production system where outages result in extensive financial losses.

Background

We were providing 24/7 support to a customer running a Riak[1] cluster with 6 nodes. Riak is an open source NoSQL database from Basho Technologies. The operations team were masters at general pre-emptive support: they monitored CPU, IO, memory and network utilisation, and checked and analysed the graphs regularly. However, they weren't Erlang or Riak experts, so we often helped them recover from the failure of individual nodes and understand and adjust their cluster configuration. What they lacked was an Erlang-centric view of the system.

The nodes were using a relatively old version of Riak, but the DevOps team was happy with its performance. Their policy was to upgrade only when they came across issues which were fixed in later versions. We came across one of these issues, and it required an upgrade to the latest stable Riak release, which at the time was 2.0.6. As is customary with large jumps between versions, upgrades that require no downtime are done in steps, using intermediate releases where the upgrades between them have been tested thoroughly. This mitigates risk and reduces the need to customise upgrade scripts.

We started preparing for this particular upgrade, planning to first upgrade to version 1.4.12, followed by an upgrade to version 2.0.6. We provided a proposal and discussed the upgrade process in detail. Our customer upgraded their test cluster to gain the necessary experience. The trial run went smoothly and resolved all of the issues they were experiencing, giving everyone, us included, a false sense of optimism. Our customer couldn't wait to see the upgraded production cluster in action.

The first upgrade to version 1.4.12 went as planned, although much slower than expected. Immediately after the upgrade, we ran our routine health checks together with the operations team. The results were positive and the cluster was able to serve all the requests. Everyone was now looking forward to upgrading to version 2.0.6. Unfortunately, it did not happen as quickly as we hoped.

Early warning signs

Historically, the cluster had always been close to hitting its performance limits, but it always seemed to cope. The most serious problems were the large IO latency and, at times, the unreliable network. The physical machines were equipped with slow disks, and the network cards occasionally dropped packets. Each Riak node was also responsible for too many partitions, and the partitions themselves were too large. The cluster would have performed much better if one or two new Riak nodes had been added and the slow disks replaced with SSDs. Nonetheless, the cluster was able to serve requests, but it had no spare capacity to tolerate extra load.

A week after the first upgrade (to version 1.4.12), the operations team noticed that the trends in the IO graphs had changed. Disk usage on the Riak nodes had increased dramatically (by around 600 GB per node per week) and was not in line with the application's usage pattern. The memory usage of the Riak processes was also increasing. Operations decided to reboot the cluster in a rolling manner, as this would trigger Riak to free up disk space it no longer needed.

Things getting worse

One Monday evening, whilst we were reading a great book (which will be covered in another blog post, another day), a PagerDuty alert broke the silence.



What happened?!?

The operations team had started to reboot the Riak nodes in a rolling manner to free up disk space, whilst also applying the configuration changes we had recommended. This worked until the third host was restarted, when more and more incoming requests started timing out. This got the operations team worried, prompting them to escalate the issue to us. Large spikes appeared in the IO, memory and network graphs. The cluster was struggling to handle its load, and the logs and graphs we inspected confirmed that the nodes were overloaded.

To understand the symptom and the cause, we need to step back and look at partition transfers[2]. In Riak, data is organised into partitions, and each Riak node is responsible for a set of partitions. When a Riak node is down (for example, because it is being restarted), its partitions are automatically taken over by the other nodes: for each of its partitions, a secondary partition is created on another node. Existing data is not transferred, but whenever new data reaches the cluster, it is stored in the secondary partition. When the Riak node that was down recovers, the data in the secondary partitions is transferred back to the primary partitions on the original node.
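As a concrete illustration, here is a minimal sketch of how partition ownership can be inspected from a node's Erlang shell (for example after attaching with riak attach); the calls below are standard riak_core ring functions, although the exact shape of the output differs between Riak versions:

    %% Fetch this node's view of the ring.
    {ok, Ring} = riak_core_ring_manager:get_my_ring(),
    %% All {PartitionIndex, OwnerNode} pairs in the cluster.
    Owners = riak_core_ring:all_owners(Ring),
    %% The partitions whose primary copy lives on this node.
    Mine = [Idx || {Idx, Owner} <- Owners, Owner =:= node()],
    length(Mine).

On the command line, riak-admin transfers offers a similar view of the handoffs currently in flight.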



Back to our incident. We noticed that many transfers had stalled; the data from the secondary partitions was not moving to the primary partitions on the newly restarted nodes. To make matters worse, the transfer coordinator process had a large message queue. We kept monitoring the system and gathering data whilst investigating the cause. While we were doing so, the overload issue resolved itself and the cluster returned to normal, serving requests as expected.
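Checking a queue like this takes a single call from the node's shell; the sketch below assumes that the coordinator in question is riak_core's handoff manager, registered locally under its module name (which is how it was registered in the Riak versions we were running):

    %% Message queue length and memory footprint of the handoff manager.
    Coordinator = whereis(riak_core_handoff_manager),
    erlang:process_info(Coordinator, [message_queue_len, memory, current_function]).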

What caused the overload? In our investigations, we noticed a change in the configuration files. It turned out that before rebooting the Riak nodes, the operations team had changed the number of allowed concurrent partition transfers per node from the default setting of 2 to 50. They were hoping for a quicker transfer rate, but instead they overloaded the nodes, making them unresponsive. To put the icing on the cake, the extra traffic contributed to network saturation, causing further problems.
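In the Riak 1.4 series this limit is, to the best of our knowledge, riak_core's handoff_concurrency application variable, which is what riak-admin transfer-limit manipulates. A minimal sketch for verifying the setting from the Erlang shell (whether a change made with set_env is picked up by a running node depends on the version, so the riak-admin command remains the supported route):

    %% Check the per-node limit on concurrent partition transfers...
    application:get_env(riak_core, handoff_concurrency),
    %% ...and put the default back (2) if it has been tampered with.
    application:set_env(riak_core, handoff_concurrency, 2).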

Wouldn't it have been great if the unexpected config change had been immediately visible to us? We would have known the root cause straight away... Being able to check the messages of the stuck coordinator process, or to inspect the processes responsible for the stalled transfers, would also have made us much more effective! Not to mention being able to see the complete, detailed history of what happened to the partitions, instead of only a limited number of system snapshots.

Crisis

After the concurrent transfer limit was fixed, we reviewed the configuration of every node in the Riak cluster and found nothing else suspicious. Even though the overload issue had resolved itself, we were still experiencing problems. Not a single bitcask merge (Riak's garbage collection mechanism[4]) had run to completion, resulting in increasing IO and memory utilisation. The transfers were still stalled, so on top of their usual load, the nodes also had to handle the load generated by the secondary partitions. The extra overhead resulted in constantly high load on all nodes; what should have been the exception had become the norm, and it had to be addressed.

We decided to focus on the stalled transfers in the hope that resolving them would also resolve the load issues. When the Riak documentation and the user mailing list don't help, you start reading the source code. We spent days trying to identify the parts which could be broken and cause the issue we were facing. Whenever we found a potential culprit, we asked the operations team to execute Erlang commands and give us feedback. Our progress was slow, as we were not in the same location as the client and did not have access to the production servers. All commands and troubleshooting had to be run by the customer.

How fast would our progress have been if the process info, the process messages and the process state had been immediately available to us? Delays matter here, as capturing important information is often possible only within a very limited window of time; many times, we simply missed the opportunity.

Incident report

While we were investigating the root cause, three of the six nodes crashed, causing a major outage. The reason for the crash was shown by dmesg, a command on most Unix-like operating systems that prints the message buffer of the kernel.

We were able to restart the nodes and fully restore service in less than an hour, but the transfers were still stalled, and more and more partitions were affected as further transfers stalled.

What if we had had more detailed records of memory metrics, such as atom memory, process memory, binary memory, ETS memory and system memory? These metrics would have given us insight into the root cause, because we would have seen binary memory and process memory usage shoot up.
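This is the kind of breakdown we mean; it is available on any Erlang node via erlang:memory/1, and a monitoring system only needs to sample and store it over time:

    %% Per-category memory usage of the VM, in bytes.
    [{Category, erlang:memory(Category)}
     || Category <- [total, processes, binary, ets, atom, system]].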

First aid

Analysing the logs and the metrics, it was clear that the issues were ongoing and that the nodes would soon crash again. Most noteworthy was an increase in memory consumption. We sent an Erlang command that forced a garbage collection on all Erlang processes. The first attempt failed because the command was pasted incorrectly. The second attempt freed up half of the used memory! It was a quick win that allowed the operations team to bring traffic back to normal levels, giving us the breathing space we needed to resume our investigation into the stalled transfers.
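The command was along these lines (a reconstruction rather than the exact one-liner we sent); forcing a full sweep on every process can pause a busy node noticeably, so it is very much a measure of last resort:

    %% Force a garbage collection on every process on the node.
    [erlang:garbage_collect(P) || P <- erlang:processes()].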

How much faster would it have been if the logs had already been merged and we had been able to filter them effectively? We wouldn't have spent valuable time collecting and collating them. And how much safer would it have been if the correct Erlang commands had been readily available to the operators?

Having recovered from the outage, we returned to our main question.

Why are the transfers stalled?

We were studying a collection of output generated by riak-admin top -sort msg_q on the nodes. The output is a table where each row represents a process in the Riak node. The processes with the longest message queues are shown, sorted by the message queue length.
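A rough equivalent can be obtained directly from the Erlang shell; this sketch lists the five processes with the longest message queues together with their registered names, which is essentially the information the riak-admin output gave us:

    %% Collect {QueueLength, Pid, RegisteredName} for every live process.
    Queues = [{Len, P, erlang:process_info(P, registered_name)}
              || P <- erlang:processes(),
                 {message_queue_len, Len} <- [erlang:process_info(P, message_queue_len)]],
    %% The five worst offenders.
    lists:sublist(lists:reverse(lists:sort(Queues)), 5).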

A breakthrough in our investigations occurred when we noticed that the message queue sizes of the vnode proxy processes didn't change. The role of these vnode proxy processes is to protect their vnode processes from overload. Each vnode process[3] manages a data partition, so every message addressed to a vnode process goes through the corresponding proxy process. In the normal flow of events, the proxy simply forwards the messages; in our case, however, the messages were stuck in the mailboxes of the proxy processes.

Examining the source code, we found that the names of these processes contain the id of the partition managed by their vnode. And guess which partitions they referred to? The ones that were stalled. Yay, so nice! But wait a minute. Why were these processes overloaded? The nodes handling the primary partitions were up, so the secondary partitions had no work to do; only a very limited number of requests should have reached them. We sent a few Erlang commands to the operations team, asking them to retrieve the process info, the process messages and the process state of the proxies and the corresponding vnode processes. The result was a real surprise!
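The commands were essentially the following (the pid is purely illustrative; in practice it came from the message-queue listing above; sys:get_state/2 assumes the target handles OTP system messages, and process_info is the fallback when it doesn't):

    %% Pid of a suspect proxy (illustrative value).
    Pid = list_to_pid("<0.1234.0>"),
    %% General process info.
    erlang:process_info(Pid, [registered_name, message_queue_len,
                              current_function, memory]),
    %% The full mailbox; beware, this copies every queued message into the shell.
    erlang:process_info(Pid, messages),
    %% The internal state, with a 5-second timeout.
    sys:get_state(Pid, 5000).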

What we found was that the proxies had a false picture of their vnodes. The vnodes had no messages in their message queues, but the proxies thought they were overloaded, so the overload protection kicked in, dropping any incoming requests, including the requests for completing the pending transfers. We had discovered a new riak_core issue[5] affecting all Riak versions between 1.4.12 and 2.1!

The details, along with the most important debugging results, are in the GitHub ticket[5].

We were pretty excited to have found the root cause of the problem, but the production cluster was still struggling, as the message queues were not getting any shorter. We couldn't expect the proxies to repair themselves quickly, because the primary vnodes handling these partitions were up and serving all incoming requests. As a result, no requests other than those needed to complete the transfers reached the faulty proxies, slowing down their recovery. By studying the proxy implementation, we concluded that terminating these proxies would temporarily solve the problem, as their supervisor would restart them in a consistent state! So we wrote our killer script, which the operations team dutifully executed, and as a result the transfers completed.
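The script was along these lines: a reconstruction that relies on the fact, mentioned above, that the proxies' registered names embed the partition index (the index values here are illustrative):

    %% Partition indexes of the stalled transfers (illustrative values).
    StuckIndexes = ["1096126227998177188652763624537212264741949407232",
                    "1164634117248063262943561351070788031288321245184"],
    %% Registered processes whose names embed one of those indexes.
    IsStuck = fun(Name) ->
                  L = atom_to_list(Name),
                  lists:any(fun(Idx) -> string:str(L, Idx) > 0 end, StuckIndexes)
              end,
    Proxies = [whereis(Name) || Name <- registered(), IsStuck(Name)],
    %% Kill the proxies; their supervisor restarts them in a consistent state.
    [exit(P, kill) || P <- Proxies, is_pid(P)].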

How much easier would it have been to use a tool that provides ETOP-like process listings with an interactive menu allowing users to kill processes directly, and to query the process info and the process state? Then we wouldn't have had to write complex Erlang commands to identify, inspect and manage the misbehaving processes.

What happened next? Getting the official patch installed solved the vnode proxy issue. It didn't solve the resource usage issue, but that is a story for another blog post.

Conclusion

Summing up, operating production-grade Erlang systems is neither easy nor boring. A well-experienced support team equipped with application-specific knowledge and Erlang expertise is simply not enough. You need an Erlang-centric view of the system, full visibility of the historical metrics and logs, and the drive to use this information to troubleshoot issues during and after outages. This Erlang-centric view allows you to quickly figure out what went wrong, fix it, and put in place measures ensuring the problem never causes an outage again. Imagine how many person-hours we spent troubleshooting this particular issue! Not to mention the financial loss and reputational damage that outages and service degradation can cause.

This incident, together with many others, inspired us to create a tool that gives us full visibility and early warnings when things are about to go wrong. It is called WombatOAM, and it has just reached version 2.2.1.

As is customary in the Enterprise world, we were not granted access to the production servers; all communication went through the operations team, who typed commands into the Erlang shell and sent us the logs we requested. Had we had WombatOAM in this scenario, we could have accessed its GUI, allowing us to quickly troubleshoot the system (in read-only mode) and reduce the communication overhead with the operations team. We would have quickly spotted the increase in message queue sizes and, using WombatOAM's ETOP, could have inspected the process info, state and message queue. There is no doubt that, within a couple of hours, we would have found out that the vnode proxy processes were blocked. WombatOAM would have further allowed the operations team, with a simple click of a mouse, to trigger the garbage collector on all the Erlang processes, without the risk of typing errors in the shell. The Riak-specific plugins would have provided us with fine-grained metrics and a detailed, complete history of partition-related events, enabling quick, periodic health checks.

During the outage, having a central view of all the nodes' logs with filtering capabilities would have been a lifesaver, as the important logs could have been studied easily. Detailed memory metrics would have given us insight into why the Erlang nodes ran out of memory. And thanks to WombatOAM's configuration management feature, an alarm would have been raised automatically, so we would have detected the misconfiguration immediately. The configuration issue could then have been fixed using the GUI, without having to access the remote shell or the config files. Customers who give WombatOAM a chance to show its capabilities usually come back to us with success stories about how it addressed their problems. Try it out and tell us yours!

References

[1] https://github.com/basho/riak

[2] http://docs.basho.com/riak/kv/2.1.4/using/reference/handoff/

[3] http://docs.basho.com/riak/kv/2.1.4/learn/glossary/#vnode

[4] http://docs.basho.com/riak/kv/2.1.4/setup/planning/backend/bitcask/#disk-usage-and-merging-settings

[5] https://github.com/basho/riak_core/issues/760
