2013-09-24

Kevin Lawver, President @ Rails Machine, is our guest author for this post.

Few things feel worse than rolling out a High Availability (HA) system, then regularly seeing that system collapse. For our team at Rails Machine, that failing HA system was MySQL Multi-Master Replication Manager (MMM).



We've been searching for an MMM replacement for a while, and a few months ago, we made the switch to MariaDB + Galera Cluster for High Availability MySQL.

What's wrong with MySQL MMM? What's special about Galera Cluster? Read on!

MySQL Multi-Master Replication Manager (MMM) is fundamentally broken

How MySQL MMM Works: a server with MySQL MMM installed polls the MySQL nodes every ten seconds (default interval), checking their health. Only one of the servers receives the writer role - the rest can have reader roles. MMM maintains a virtual IP that points to the node with the writer role.
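To make that concrete, the writer role and its virtual IP live in MMM's configuration. A sketch of the relevant pieces of an mmm_common.conf looks roughly like this - the hostnames and addresses are made up for illustration, not taken from any real setup:

```
# mmm_common.conf sketch - hostnames and IPs are illustrative
<host db1>
    ip      192.168.0.11
    mode    master
    peer    db2
</host>

<host db2>
    ip      192.168.0.12
    mode    master
    peer    db1
</host>

# "exclusive" means only one host holds the writer role (and its virtual IP) at a time
<role writer>
    hosts   db1, db2
    ips     192.168.0.100
    mode    exclusive
</role>
```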

The problem is the polling: if MySQL MMM is polling every ten seconds, what happens when the writer node fails between the checks? You're probably handling many transactions if you have an HA setup - you could have hundreds of thousands of failed transactions before MMM detects the writer node isn't healthy. Worse, what if there's an internal issue where replication fails first, then transactions fail, then you switch the writer role to the other node? The other node won't be consistent with the original writer node.

Decreasing the polling interval to one second won't fix it - large databases are likely doing many transactions per second.

So, polling is the fundamental issue, but beyond fundamental issues, MySQL MMM in the wild often results in hard-to-recover problems. Here's Baron Schwartz, MySQL God at Percona, on MMM's flaws:

In short, MMM causes more downtime than it prevents. It’s a Low-Availability tool, not a High-Availability tool. It only takes one really good serious system-wide mess to take you down for a couple of days, working 24×7 trying to scrape your data off the walls and put it back into the server. MMM brings new meaning to the term “cluster-f__k”.

Despite MMM's flaws, it was the least broken way to do HA for MySQL for a time. But, times have changed. Even the creator of MySQL MMM says it's time for a change. Here's Alexey's comment on Baron's blog post on MMM:

I’m the original author of MMM and I completely agree with you. Every time I try to add HA to my clusters I remember MMM and want to stab myself because I simply could not trust my data to the tool and there is nothing else available on the market to do the job reliably.

So, why is Galera the best MySQL HA solution?

Our Galera Cluster setup still uses polling to do health checks - how is that any better than MMM?

The answer lies in how replication works. With standard MySQL, writes to the master are recorded in a binary log. Slaves then reproduce the queries in the binary log. There is always a delay between when a query is run on the writer and when other nodes run it. This is asynchronous.

MySQL asynchronous replication has the following issues:

It's common for a slave's dataset to lag behind that of the master (see the example after this list).

MySQL replication is slow - it replays transactions from the binary log.
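You can see that lag on any standard MySQL slave by checking the replication status from the mysql client. The field shown is real; the value below is just an illustration:

```
-- On a standard MySQL slave:
SHOW SLAVE STATUS\G
-- The interesting field in the output is Seconds_Behind_Master, e.g.:
--   Seconds_Behind_Master: 37
-- Anything above 0 means readers on this node are seeing stale data.
```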

With Galera, transactions are confirmed across all nodes before they are committed. If a transaction fails on a node, that node is immediately removed from the cluster. In other words, Galera replication is synchronous. You never lose a transaction - there is no lag (and Galera's row-based replication is around 5x faster).
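For reference, the Galera-specific pieces of each node's my.cnf look roughly like the sketch below. The option names are standard Galera/wsrep settings, but the paths, addresses, and cluster name are illustrative rather than taken from our setup (the rsync state transfer method comes up again in the repair section below):

```
# my.cnf fragment for a Galera node - values are illustrative
[mysqld]
# Galera requires row-based replication and InnoDB
binlog_format            = ROW
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2

# wsrep/Galera settings
wsrep_provider        = /usr/lib/galera/libgalera_smm.so
wsrep_cluster_name    = "example_cluster"
wsrep_cluster_address = "gcomm://10.0.0.1,10.0.0.2,10.0.0.3"
wsrep_node_address    = 10.0.0.1
wsrep_sst_method      = rsync
```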

Galera Cluster is an insider

MySQL MMM is an outsider - it's "dumb" about what's actually going on inside the server. It does one kind of check, and that's all it knows how to react to.

Galera Cluster is an "insider" and can therefore be smarter about the internal state of each node and do the right thing without manual intervention (a node being synced or unsynced, for example, becoming a donor, etc - it's all automatic).

What happens when the writer node fails?

While it's possible to write to any node in a Galera Cluster, we've chosen to minimize potential deadlocks and only write to a single node. To do this, we use HAProxy: we have a frontend for the writer node and a frontend for reads that balances queries across all of the nodes. The writer frontend sends queries to a single node; the other nodes are backups.
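Our exact HAProxy configuration isn't reproduced here, but the shape of it looks something like this sketch - the listener ports, node addresses, and health-check port are all illustrative:

```
# haproxy.cfg sketch - ports and addresses are illustrative
listen mysql-writer
    bind 0.0.0.0:3306
    mode tcp
    option httpchk
    # only db1 takes writes; db2/db3 are used only if db1's check fails
    server db1 10.0.0.1:3306 check port 9200
    server db2 10.0.0.2:3306 check port 9200 backup
    server db3 10.0.0.3:3306 check port 9200 backup

listen mysql-reader
    bind 0.0.0.0:3307
    mode tcp
    balance roundrobin
    option httpchk
    # reads are balanced across every healthy node
    server db1 10.0.0.1:3306 check port 9200
    server db2 10.0.0.2:3306 check port 9200
    server db3 10.0.0.3:3306 check port 9200
```

Routing all writes through a single listener like this is what avoids the multi-writer deadlocks mentioned above, while still letting every node serve reads.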

If HAProxy detects that the writer node isn't healthy, it immediately promotes one of the backup nodes to the writer. Frequently, MySQL MMM would end up taking all of the nodes offline in this case - HAProxy doesn't do this. We may lose a couple of queries when the writer backend is updated, but that won't result in an inconsistent data set across servers, which would be far more crippling.

We don't automatically repair failed nodes, but that's OK. My primary concern is just making sure a healthy node is taking writes, and HAProxy does that.

Repairing failed nodes (and bringing new ones online)

When a node fails with standard MySQL replication, you put an intense load on one server while getting replication set up again (that single server takes not only reads and writes but also the load from innobackupex).

With Galera, you take one of the healthy nodes offline (so you need a minimum of three nodes). That node becomes the donor - write operations to it are blocked. The donor transfers its data to the failed node (or new node) via rsync. Then, both the donor node and the failed node catch up by running queries from the slave queue.

Once both nodes are back to a Synced state, HAProxy will automatically mark them as Up and add them back to the front-end.

Galera Cluster supports regular MySQL too, so why did we switch to MariaDB?

Our reasons for switching to MariaDB are a mix of technical and political:

Easy Migration: First, MariaDB is a drop-in replacement for MySQL. Migrating from MySQL 5.1 to MariaDB Galera Server 5.6 just worked.

Performance: Several queries that were covered by indexes but still gave us performance issues before the migration "fixed" themselves afterwards, which was a nice surprise. MariaDB seems to be much better at the complex queries and joins that Ruby on Rails is famous for. On the whole it does a better job of picking indexes and, like I said before, a lot of the queries that plagued us are now speedy. We haven't seen much, if any, difference in memory usage either, which was kind of surprising - I expected more overhead from Galera, but if there is any, it's not noticeable.

Community: MariaDB has a lot of momentum and is adding more features than MySQL. There are a lot of concerns about the future of MySQL with Oracle and MariaDB looks like it will be around for a long while - even Google is switching to MariaDB.

Would we do it again?

Absolutely. We don't see any reason not to switch to both MariaDB and Galera Cluster.

Monitoring Galera Cluster



The great thing about monitoring Galera is that it's layered - we can easily monitor each part of the stack. We use Scout for monitoring, which means we just need the following plugins:

MariaDB Galera Cluster - install on each Galera Cluster node to fetch key metrics (Local State, Connected, etc).

HAProxy - Install for each proxy (once for the writer, once for the readers).

URL Monitoring - check the same status URL that HAProxy checks to determine node health.
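The status URL itself isn't part of Scout - a common way to provide one is a tiny clustercheck-style script served over HTTP (for example via xinetd). Here's a minimal sketch; the credentials are placeholders and the port is the one assumed in the HAProxy sketch above:

```
#!/bin/bash
# Minimal clustercheck-style health endpoint (credentials are placeholders).
# Returns HTTP 200 when the local node reports wsrep_local_state = 4 (Synced),
# and 503 otherwise, so HAProxy's httpchk and a URL monitor can key off the status code.
STATE=$(mysql -u clustercheck -p'secret' -N -B \
  -e "SHOW STATUS LIKE 'wsrep_local_state';" | awk '{print $2}')

if [ "$STATE" = "4" ]; then
  printf "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nGalera node is synced.\r\n"
else
  printf "HTTP/1.1 503 Service Unavailable\r\nContent-Type: text/plain\r\n\r\nGalera node is not synced.\r\n"
fi
```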

TL;DR

Until recently, MySQL MMM was the best (but broken) way to add High Availability to MySQL. Galera Cluster finally adds true High Availability to MySQL primarily thanks to synchronous replication.
