Planetmysql.org

PoC: Using a Group Communication System (Isis2) to improve MySQL Replication HA

2013-08-08

Modern NoSQL solutions make good old MySQL Replication appear weak on High Availability (HA). Basically, MySQL users have three choices for MySQL Replication HA: give up on HA, believe that doubling single points of failure means HA, or go for a proper but complex solution. Yet, as the NoSQL world and the competition prove, solid HA can be dead simple: embed a Group Communication System (GCS) into MySQL! No single point of failure and near-zero client deployment are doable. In parts, the proposal surpasses Pacemaker/Corosync. Read on: story, slides, experimental code.

PoC: Using a Group Communication System to improve MySQL Replication HA from Ulf Wendel


Free tip of the day: A Single Point of Failure cannot cure a SPOF

If you were on a sailing boat on the wide, wide ocean, and the captain was the only one who knew how to sail, would you feel safe? No? Me neither. The captain is a Single Point of Failure (SPOF). So is the master/primary in a MySQL Replication cluster.

Out-of-the-box MySQL Replication setup

Writes: Master (SPOF)

Reads: Slave (one of many), Slave, Slave

If the boat was monitored by a helicopter that will eventually fail or lose sight of the ship, would you feel significantly safer? No? Me neither. The helicopter is a single point of failure. So is MHA or mysqlfailover (now GA!) when used for monitoring the master. Using a single monitor to trigger MySQL Replication master failover means doubling the number of SPOFs!

MySQL Replication setup aiming at 99.9% HA

Monitor, e.g. mysqlfailover/MHA (SPOF)

Writes: Master (SPOF)

Reads: Slave (one of many), Slave, Slave

Of course, I’m overdoing it. If the probability of one system failing is 1:1000, then the probability of two systems failing at the same time is 1000 * 1000 = 1000000, that is 1:1000000, provided the failures are independent (Thanks Ben for the hint! Readers: see Baron’s comment below). That’s an improvement. Plus, the big achievement of both mysqlfailover and MHA is the automation of the actual master failover procedure. For this reason alone you should use either one.
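The arithmetic can be checked with a few lines of Python. This is only a sketch: it assumes the two failures are statistically independent, which real monitors and masters sharing a rack or a network may not be. The function name is made up for illustration.

```python
from fractions import Fraction

def joint_failure_probability(*probabilities):
    """Probability that all systems fail at once, assuming independent failures."""
    result = Fraction(1)
    for p in probabilities:
        result *= p
    return result

p_single = Fraction(1, 1000)  # the 1:1000 figure used in the text
p_both = joint_failure_probability(p_single, p_single)
print(f"1:{p_both.denominator}")
```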

A proper but complex no SPOF setup

A proper cluster solution such as Windows Failover Clustering or Pacemaker/Corosync/DRBD is far safer (BTW, the whitepapers are worth checking). Pacemaker/Corosync eliminates all single points of failure. Additionally, it prevents transaction loss and speeds up the failover by using a Distributed Replicated Block Device (DRBD) to update a standby MySQL Replication master server.

MySQL Replication 3rd party setup aiming at 99.99% HA

Writes: Master (Standby) and Master (Active), each with Pacemaker (CRM), Corosync (CCM) and DRBD

Reads: Slave (one of many), Slave, Slave (one of many), each with Pacemaker (CRM) and Corosync (CCM)

This is a true Unix style design. There are many independent, small programs working together. At the core you find a Cluster Resource Manager (CRM) responsible for taking all kinds of actions, for example, database failover. Also at the core you find a separate Cluster Communication Manager (CCM) that adds communication channels between all the nodes. A CCM usually uses redundant connections between the nodes. Given three nodes A, B, C connected to each other, with A and B losing their connection A->B, the CCM still allows A and B to communicate through C: A->C->B. The Pacemaker CRM and the Corosync CCM run on all the nodes. No single point of failure.
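The A->C->B detour can be illustrated with a toy route search. This is not how Corosync is implemented; it is a minimal sketch of the idea of redundant paths, and the function and variable names are invented for the example.

```python
from collections import deque

def find_route(links, source, target):
    """Breadth-first search over undirected links; returns a node path or None."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

full_mesh = {("A", "B"), ("A", "C"), ("B", "C")}
print(find_route(full_mesh, "A", "B"))                 # direct link
print(find_route(full_mesh - {("A", "B")}, "A", "B"))  # detour via C
```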

That is exactly what you want, but it’s way too complicated.

A typical failover procedure ignores the client badly!

MySQL HA solutions are all too often server-centric. Much emphasis is on the actual master failover procedure. There is little, if any, automation of client deployment. To avoid client deployment, the master is often assigned a virtual IP that is transferred as part of a failover.

Virtual IP: Failed master -> New master

But master failover is not the only event clients need to be made aware of. Slave additions and failures are further examples. Telling a DBA to deploy clients manually is 1990s. This is not what keeps your product relevant in 2013. Sometimes a blown-up single monitor is proposed that clients can query to learn about changes to the cluster topology. Seriously: another SPOF, another communication channel other than SQL?

Clients should continuously adapt themselves to cluster changes

The PHP mysqlnd driver’s replication and load balancing plugin (PECL/mysqlnd_ms) proves the potential of a driver-integrated proxy:

making complex clusters transparent to use: read-write splitting, load balancing, service levels (e.g. abstracting on consistency), semi-automatic server selection for sharding and partitioning, …

scale by client – no bottlenecks due to centralized proxies

fail by client – failure does not affect many unlike with a central proxy such as MySQL Proxy

MySQL? Load? Clustering! Balancing! PECL/mysqlnd_ms 1.4 from Ulf Wendel

If the HA solution would only tell the driver about the cluster’s state, the driver could hide even more of a cluster’s complexity and improve ease of use. Imagine drivers got real-time server load information and adapted their load balancing dynamically. Drivers could stop bombarding a badly lagging slave with requests to give it time to breathe, or lower their write rates.
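What such driver-side adaptation might look like is sketched below. This is an illustration only, not PECL/mysqlnd_ms code: the dictionary keys (host, role, lag), the threshold and the weighting formula are all assumptions made for the example.

```python
import random

def pick_read_server(servers, max_lag_seconds=10, rng=random):
    """Choose a slave for a read, skipping slaves that lag too far behind.

    `servers` is a list of dicts with hypothetical keys: host, role, lag.
    Falls back to the master if no slave is fresh enough.
    """
    fresh = [s for s in servers
             if s["role"] == "slave" and s["lag"] <= max_lag_seconds]
    if fresh:
        # Weight fresher slaves higher so a lagging one gets room to catch up.
        weights = [1.0 / (1 + s["lag"]) for s in fresh]
        return rng.choices(fresh, weights=weights, k=1)[0]
    masters = [s for s in servers if s["role"] == "master"]
    return masters[0] if masters else None
```

A driver would call pick_read_server() with the latest cluster state before each read, so a slave that falls behind simply stops receiving reads until it has caught up.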

The solution: Isis2, CRM/CCM moves into a MySQL plugin

Here comes the code for a system that could offer proper HA without much added complexity! The idea is simple: use a Group Communication System (GCS) in a MySQL Server Plugin to connect all servers of a cluster with each other. Use the communication primitives of the GCS to exchange state information synchronously. State information is made of role (master, slave), status (on, off, standby), possibly current system load, or whatever else may be of interest.

At any time a GCS can give a list of its members. Thus, at any time, a client can ask for a list of all servers in the cluster. The GCS jointly decides on membership changes. It recognizes the addition of servers but also their failure. If a server fails, appropriate action can be taken. For example, a failover script can be run to elect a new master and reconfigure the MySQL Replication cluster.
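The reaction to a membership change can be sketched as a comparison of two views. The snippet is a toy under stated assumptions: a view is modelled as a plain dict mapping a server id to its role, and failover_hook is a hypothetical callback standing in for whatever failover script one would run. The PoC itself does not run such a script.

```python
def diff_views(old_view, new_view):
    """Compare two membership views (dicts: server id -> role)."""
    joined = sorted(set(new_view) - set(old_view))
    left = sorted(set(old_view) - set(new_view))
    return joined, left

def on_view_change(old_view, new_view, failover_hook):
    """Call failover_hook(lost_master, candidates) when the master disappears."""
    joined, left = diff_views(old_view, new_view)
    for server in left:
        if old_view[server] == "master":
            candidates = sorted(s for s, role in new_view.items() if role == "slave")
            failover_hook(server, candidates)
    return joined, left
```

How the new master is chosen among the candidates is left to the hook on purpose.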

Best HA today: No SPOF but complex

Pacemaker (CRM) -> Corosync (CCM) -> MySQL, with DRBD

Proposed PoC: No SPOF, easy, best possible client support

MySQL -> MySQL Isis2 daemon plugin (CRM/CCM) -> MySQL I_S cluster state information plugin, with (GTID logic)

Clients learn about changes in the cluster topology by querying INFORMATION_SCHEMA tables. If a client fails to connect to one server, it can pick any of the remaining and ask it for an update on the list of servers. Then, if need be, the client can reconfigure itself. No DBA action needed, zero administration within reach. As state information is exposed through I_S tables, clients do not have to learn from special, possibly centralized servers likely requiring extra communication protocols. There is no risk of management/monitoring servers getting slashdotted. And, updating client programs is easy as can be due to the use of plain SQL.
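A client-side refresh loop could look roughly like this. It is a sketch, not driver code: run_query is a caller-supplied stand-in for a real driver call, and the row layout returned by the I_S table (here host and role) is an assumption, as is the fake_query helper used for the demo.

```python
def refresh_server_list(known_servers, run_query):
    """Ask the known servers, one after another, for the current topology.

    run_query(host, sql) is a caller-supplied function that returns rows or
    raises ConnectionError; it stands in for a real driver call.
    """
    sql = "SELECT * FROM INFORMATION_SCHEMA.ISIS2IS"
    for host in known_servers:
        try:
            return run_query(host, sql)
        except ConnectionError:
            continue  # this server is unreachable, try the next one
    raise ConnectionError("no known server is reachable")

def fake_query(host, sql):
    """Demo stand-in: db1 is down, every other host answers."""
    if host == "db1":
        raise ConnectionError("db1 is down")
    return [("db2", "master"), ("db3", "slave")]

print(refresh_server_list(["db1", "db2"], fake_query))
```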

Just a few server plugins…

By embedding a GCS into a MySQL plugin, the HA stack is greatly simplified. The job of Pacemaker and Corosync is done by the MySQL plugin. All you need for a MySQL HA setup could be delivered in one download package. All you would need to do to add HA is load some MySQL plugins… well, in theory. You get the point.

On the hunt for a C/C++ GCS

The biggest challenge in implementing such a system is the hunt for a free and Open Source Group Communication System that can be embedded in a C/C++ MySQL server plugin. Corosync has a client/daemon-server design that is not very tempting. Its brother, the Spread Toolkit, is somewhat limited to ~40 nodes. Neither solution has an appealing API. The rest is either old, implemented in the wrong language (Java) or out of the question for me (Galera). I had almost given up searching… if only MySQL had a GCS.

Then came Isis2. Wrong language (C#) but what an API! The Isis2 Cloud Computing Library is the successor of a true classic, the Isis library. Experts will immediately associate the Virtual Synchrony Model (see also slide 34 and following) with Isis and recall the success stories. Isis2 combines Virtual Synchrony and Paxos ideas (see video). Paxos is commonly used in NoSQL solutions.

Most important: how easy Isis2 makes distributed cloud computing. A simple job like exchanging state information between MySQL servers becomes absolutely trivial!

One downside: C# to C/C++ language barrier

For my purposes, Isis2 has one big downside. It is written in C#/.NET. A MySQL C/C++ server plugin cannot use it directly, so MySQL cannot become a direct member of an Isis2 group. Instead, one has to write an Isis2 client in C# which communicates with MySQL through a network socket. As we now have two independent processes, there is an additional heartbeat between MySQL and the Isis2 client in the PoC. This heartbeat would not be required if everything ran in one process, which would be the case with a pure C/C++ GCS.

Programming language: Component

C/C++: MySQL

C/C++: MySQL daemon Isis2 heartbeat plugin

| (network socket)

C#/.NET (Mono): Isis2 client socket server

I’m happy to pay this price for the wonderful API that Isis2 offers. Expressed in extra lines of code, the price for the PoC is probably less than 20%, or about 200 lines. Let the hacking begin…

The Isis2 client socket server

Disclaimer: all the code I show is entirely untested and partially known to be buggy or unstable. I stopped my fun tour through C#/.NET and MySQL plugin development after three days. This was the time it took me to realize “it could be done”, which is all I wanted for a PoC. That said, here is the code for the Isis2 client.

At the core: a distributed Isis2 group

Let’s build the core of the proposed MySQL HA cluster solution. The core is made of a distributed Isis2 group. Assume we have three hosts A, B and C in our MySQL cluster. On each host we start an Isis2 client. All clients try to join a distributed group of a certain name. Once they have joined the group, they can communicate with each other. If that all sounds a bit too magic, get yourself the book on the subject of Isis2 (Ken Birman, Guide to Reliable Distributed Systems: Building High-Assurance Applications and Cloud-Hosted Services).

Members of a distributed Isis2 group

Host A: mono ./isis2_daemon.exe

Host B: mono ./isis2_daemon.exe

Host C: mono ./isis2_daemon.exe

Isis2 offers a rich variety of Send() commands to communicate within the group. Options range from low level unreliable, asynchronously delivered messages for gossip protocols to slow but reliable, totally ordered, virtually synchronous messages. The latter either reach all members of a group or none. As speed or group size is no major concern when exchanging nothing but a list of MySQL servers in a PoC, the PoC is using SafeSend() only. To an outside observer messages appear “synchronously” on all group members.

The Isis2 client connects to a virtual group

If you compile and run it (e.g. dmcs isis_deamon.cs Isis.cs ; mono ./isis_deamon.exe ), the Isis2 client tries to join a virtual, distributed group. For this it contacts the ORACLE and uses the Isis2 ORACLE Rendezvous Service. No kidding, that’s what Isis2 calls it.

If there are already other clients from other computers registered in the group, a checkpoint is done to do a state transfer. The joining client learns the state from the other members. The state consists of a list of all MySQL servers that have registered themselves at their local Isis2 clients to become members.
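The idea of a state transfer on join can be modelled in a few lines. This toy is not the Isis2 API and not the PoC’s C# code; the class, its methods and the shape of the state dict are invented to show the principle that a newcomer starts from a copy of an existing member’s server list.

```python
import copy

class GroupMember:
    """Toy group member holding the list of registered MySQL servers."""

    def __init__(self, name):
        self.name = name
        self.servers = {}  # MySQL server id -> state dict (hypothetical layout)

    def make_checkpoint(self):
        # Hand out a copy so the receiver cannot mutate our state.
        return copy.deepcopy(self.servers)

    def load_checkpoint(self, checkpoint):
        self.servers = copy.deepcopy(checkpoint)

def join_group(group, newcomer):
    """Add newcomer to the group; transfer state from an existing member if any."""
    if group:
        newcomer.load_checkpoint(group[0].make_checkpoint())
    group.append(newcomer)
```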

Isis2 abstracts away all the gory details from you: the networking, the group membership, the messaging associated with the state transfer and so forth. Those few lines make the core of the procedure.

Did I already say, the Isis2 API is nice? Note this detail from the checkpointing/state transfer instructions. The Isis2 library sends a list of MySQLServer objects over the wire. As an Isis2 developer you do not give up on your standard OOP style…

The big downside of this beauty is the question marks behind a once discussed pure C++ port of Isis2. Developing a library that offers the same for C++ objects may be hard if not impossible.

Making an Isis2 client talk to MySQL

As soon as an Isis2 client has joined a distributed group, it starts a socket server to receive commands from the MySQL server associated with it. MySQL can then register itself in the group, announce its state and leave the group when it shuts down.

Members of a distributed Isis2 group

Host A: MySQL, MySQL Isis2d daemon plugin, mono ./isis2_daemon.exe

Host B: MySQL, MySQL Isis2d daemon plugin, mono ./isis2_daemon.exe

Host C: MySQL, MySQL Isis2d daemon plugin, mono ./isis2_daemon.exe

As you can guess, it does not require much code in the .NET framework to start a socket server and listen to network requests. Nothing fancy: add code for parsing commands and you are done.
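To give an idea of the command parsing, here is a sketch of a handler for the four commands named in this post (join, leave, heartbeat, serverlist). The PoC’s handler is written in C#; this Python version, the argument layout and the reply strings are assumptions made for illustration.

```python
def handle_command(line, servers):
    """Apply one text command to `servers` (dict: server id -> role); return a reply."""
    parts = line.strip().split()
    if not parts:
        return "ERR empty command"
    command, args = parts[0].lower(), parts[1:]
    if command == "join" and len(args) == 2:
        servers[args[0]] = args[1]
        return "OK"
    if command == "leave" and len(args) == 1:
        servers.pop(args[0], None)
        return "OK"
    if command == "heartbeat" and len(args) == 1:
        return "OK" if args[0] in servers else "ERR unknown server"
    if command == "serverlist" and not args:
        listing = "\n".join(f"{sid} {role}" for sid, role in sorted(servers.items()))
        return listing or "(empty)"
    return "ERR bad command"
```

A socket server would read one line per request, pass it to handle_command() and write the reply back to the connection.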

You can now connect to the Isis2 client using telnet. Below is an example session with the join command that shall be used by MySQL to register itself in the distributed Isis2 group.

The Isis2 client that receives the join command forwards it to everybody in the group using SafeSend(). The Isis2 manual discusses the messaging API in great detail.

All group members, including the sender, receive the message and add the MySQL server to their server list. The MySQL server is now known to everybody in the group. On whichever host you issue a serverlist command, the MySQL server is listed.
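The effect of forwarding a join to the whole group can be modelled with a toy broadcast. This is not SafeSend() and says nothing about how Isis2 achieves its guarantees; the Member class and the broadcast() function are made up to show the outcome, namely that every member ends up with the same server list.

```python
class Member:
    """Toy group member that keeps a list of registered MySQL servers."""

    def __init__(self, name):
        self.name = name
        self.server_list = []

    def deliver(self, message):
        kind, server = message
        if kind == "join" and server not in self.server_list:
            self.server_list.append(server)
        elif kind == "leave" and server in self.server_list:
            self.server_list.remove(server)

def broadcast(members, message):
    """Toy stand-in for a reliable, ordered send: every member, the sender
    included, receives each message in the same order."""
    for member in members:
        member.deliver(message)
```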

The MySQL server gets dropped from the list if it fails to send a heartbeat to its Isis2 client or sends a leave message. If it fails to send a heartbeat, MySQL might have crashed and appropriate action could be triggered. If the local Isis2 client dies, well, that’s a weak spot (self-set three days coding limit…). The PoC simplifies the case and assumes that the host crashed. All servers associated with the Isis2 client are dropped from the group’s server list.

Please note that all Isis2 members of an Isis2 group jointly decide whether a member has crashed or not. And it takes more than a single missed heartbeat.
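For the local heartbeat between MySQL and its Isis2 client, tolerating a few missed beats could be sketched as follows. The class name, the interval and the number of allowed misses are assumptions for illustration; the group-level failure detection inside Isis2 is a separate mechanism that this toy does not model.

```python
class HeartbeatMonitor:
    """Track the last heartbeat per server and report overdue servers."""

    def __init__(self, interval=1.0, allowed_misses=3):
        self.interval = interval              # expected seconds between heartbeats
        self.allowed_misses = allowed_misses  # beats that may be missed
        self.last_seen = {}

    def heartbeat(self, server, now):
        self.last_seen[server] = now

    def suspects(self, now):
        """Servers silent for longer than interval * allowed_misses."""
        limit = self.interval * self.allowed_misses
        return sorted(s for s, seen in self.last_seen.items() if now - seen > limit)
```

Passing the current time in as an argument keeps the class easy to test; a real daemon would feed it time.monotonic().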

The MySQL plugins

It’s nice to see the telnet sessions, but it would be even better to have MySQL server plugins that send join, leave and heartbeat messages and parse serverlist into something more driver friendly. Nothing as easy as that: one MySQL server plugin for the membership and another one to export an INFORMATION_SCHEMA.ISIS2IS table.

A MySQL cluster that can report its state and take HA actions

Host A: MySQL, MySQL Isis2d daemon plugin, MySQL Isis2is I_S plugin, mono ./isis2_daemon.exe

Host B: MySQL, MySQL Isis2d daemon plugin, MySQL Isis2is I_S plugin, mono ./isis2_daemon.exe

Host C: MySQL, MySQL Isis2d daemon plugin, MySQL Isis2is I_S plugin, mono ./isis2_daemon.exe

Here you go with the plugin code. Again, it’s incomplete and even crashing code (UNINSTALL PLUGIN will crash). It’s just good enough to make the point. There is not much to say about the plugin code that you cannot find discussed in greater depth in the MySQL manual. It’s no more than a wrapper for the network commands shown before.

Loading the Plugins into MySQL

Once you load the plugins into MySQL, the MySQL server will try to register MySQL at its local Isis2 client using join. Every now and then, it sends a heartbeat. Upon server shutdown or plugin uninstallation, it sends leave.
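The plugin’s life cycle (join on load, periodic heartbeat, leave on shutdown) can be mimicked in Python. The real plugin is a C/C++ MySQL daemon plugin; this sketch only mirrors the sequence of messages. The class name, the message format and the send callback are hypothetical.

```python
import threading

class MembershipAgent:
    """Send join once, heartbeat periodically, and leave on stop."""

    def __init__(self, send, server_id, role, interval=1.0):
        self._send = send          # callable taking one command string
        self._server_id = server_id
        self._role = role
        self._interval = interval
        self._stop = threading.Event()
        self._thread = None

    def start(self):
        self._send(f"join {self._server_id} {self._role}")
        self._thread = threading.Thread(target=self._beat, daemon=True)
        self._thread.start()

    def _beat(self):
        # Event.wait() returns False on timeout and True once stop() sets the event.
        while not self._stop.wait(self._interval):
            self._send(f"heartbeat {self._server_id}")

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
        self._send(f"leave {self._server_id}")
```

In a real setup, send would write the command to the socket of the local Isis2 client.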

At any time, on any MySQL server that has the plugins installed, you can query the I_S for a list of all MySQL servers in the cluster.

A fictional user manual for the DBA

Here’s the user manual for the HA solution proposed. For every MySQL server do:

Start Isis2 client: mono ./isis_deamon.exe

Configure Connector Plugins, e.g. Isis2 client address

Heartbeat to Isis2: INSTALL PLUGIN isis2d SONAME 'libisis2.so'

I_S Plugin: INSTALL PLUGIN isis2is SONAME 'libisis2.so'

Teach your clients to monitor INFORMATION_SCHEMA

Summary

Needless to say, I’m only scratching the surface. Nowhere in my code is a failover script run or a MySQL server reconfigured. I named some possible hooks for such actions. After years and years of extensive work by the MySQL community on how exactly to do the server reconfiguration, this was of no interest to me. The server side is mostly solved. Just think of the introduction of GTIDs in MySQL 5.6.

Whether you go for an HA setup with or without a single point of failure is a question of demand. For some, a SPOF may be perfectly tolerable. For those requiring five nines – check out MySQL Cluster NDB 7.3. Still, if a SPOF can be avoided, avoid it. A GCS embedded into a plugin cannot cure the fact that a MyS