Planet.debian.net

Vincent Bernat: High availability with ExaBGP

2013-09-06

When it comes to providing redundant services, several options are available:

The service can be hosted behind a set of load-balancers. They
will detect any faulty node. However, you need to ensure that this
new layer is also fault-tolerant.

The nodes providing the service can rely on IP failover to
share a set of IP addresses using protocols like VRRP[1] or CARP. The IP
address of a faulty node will be assigned to another node. This
requires all nodes to be part of the same IP subnet.

The clients of the service can ask a third-party for available
nodes. Usually, this is achieved through round-robin DNS where
only working nodes are advertised in a DNS record. The failover
can be quite long because of caches.

A common setup is a combination of those solutions: web servers are
behind a couple of load-balancers achieving both redundancy and
load-balancing. The load-balancers use VRRP to ensure redundancy. The
whole setup is replicated to another datacenter and round-robin DNS
is used to ensure both redundancy and load-balancing of the datacenters.

There is a fourth option which is similar to VRRP but relies on
dynamic routing and therefore is not limited to nodes in the same subnet:

The nodes advertise their availability with BGP to announce the
set of service IP addresses they are able to serve. Each address is
weighted such that IP addresses are balanced among the available nodes.

We will explore how to implement this fourth option using ExaBGP,
the BGP Swiss Army knife of networking, in a small lab
based on KVM. You can grab the
complete lab from GitHub. ExaBGP 3.2.5 is needed to run this lab.

Environment

BGP configuration

OSPF configuration

Web nodes

Redundancy with ExaBGP

The big picture

ExaBGP configuration

Route servers

Bird configuration

Quagga configuration

Routers

Testing

Demo

Environment

We will be working in the following (IPv6-only) environment:

BGP configuration

BGP is enabled on ER2 and ER3 to exchange routes with peers and
transits (only R1 for this lab). BIRD is used as a BGP
daemon. Its configuration is pretty basic. Here is a fragment:

First, in ➊, we declare the routes that we want to export to our
peers. We don’t try to summarize them from the IGP. We just
unconditionally export the networks that we own. Then, in ➋, R1 is
defined as a neighbor and we export the static route we declared
previously. We import any route that R1 is willing to send us. In ➌,
we share everything we know with our pal, ER3, using internal
BGP.

OSPF configuration

OSPF will distribute routes inside the AS. It is enabled on ER2,
ER3, DR6, DR7 and DR8. For example, here is the relevant part
of the configuration of DR6:

ER2 and ER3 inject a default route into OSPF:

Web nodes

The web nodes are pretty basic. They have a default static route to
the nearest router and that’s all. The interesting thing here is that
they are each on a separate IP subnet: we cannot share an IP address
using VRRP[2].

Why are those web servers on different subnets? Maybe they are not in
the same datacenter or maybe your network architecture is using a
routed access layer.

Let’s see how to use BGP to enable redundancy of those web nodes.

Redundancy with ExaBGP

ExaBGP is a convenient tool to plug scripts into BGP. They can
then receive and advertise routes. ExaBGP does the hard work of
speaking BGP with your routers. The scripts just have to read routes
from standard input or advertise them on standard output.
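To make that contract concrete, here is a minimal sketch of a process that ExaBGP could run. It assumes that the ExaBGP text API accepts lines of the form `announce route <prefix> next-hop self med <metric>` on the script's standard output. The function names are made up for the example and the metrics are the ones W1 announces when it is healthy.

```python
import sys


def announce_lines(metrics):
    """Build one text-API command per service address.

    `metrics` maps a prefix to the MED to advertise.
    """
    return [
        f"announce route {prefix} next-hop self med {med}"
        for prefix, med in sorted(metrics.items())
    ]


def main(out=sys.stdout):
    # Metrics announced by W1 when the service is healthy.
    metrics = {
        "2001:db8:30::1/128": 102,
        "2001:db8:30::2/128": 101,
        "2001:db8:30::3/128": 100,
    }
    for line in announce_lines(metrics):
        out.write(line + "\n")
    # ExaBGP reads from a pipe, so flush once the batch is written.
    out.flush()


if __name__ == "__main__":
    main()
```

A real script would keep running and emit new lines whenever the state of the service changes.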

The big picture

Here is the big picture:

Let’s explain it step by step:

Three IP addresses will be allocated for our web service:
2001:db8:30::1, 2001:db8:30::2 and 2001:db8:30::3. Those are
distinct from the real IP addresses of W1, W2 and W3.

Each web node will advertise all of them to the route servers
we added in the network. I will talk more about those route
servers later.

Each route comes with a metric to help the route server
choose where it should be routed. We choose the metrics such that
each IP address will be routed to a distinct web node (unless
there is a problem).

The route servers (which are not routers) will then advertise the
best routes they learned to all the routers in our network. This
is still done using BGP.

Now, for a given IP address, each router knows to which web node
the traffic should be directed.

Here are the metrics of the routes announced by W1, W2 and W3 when
everything works as expected (the lowest metric wins):

Route            W1    W2    W3    Best   Backup
2001:db8:30::1   102   101   100   W3     W2
2001:db8:30::2   101   100   102   W2     W1
2001:db8:30::3   100   102   101   W1     W3
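The "Best" and "Backup" columns can be derived from the metrics alone. The sketch below assumes that the route server simply prefers the lowest metric when everything else is equal. The function names are hypothetical and this only models the decision, it does not speak BGP.

```python
# Metrics announced by each web node, copied from the table above.
METRICS = {
    "2001:db8:30::1": {"W1": 102, "W2": 101, "W3": 100},
    "2001:db8:30::2": {"W1": 101, "W2": 100, "W3": 102},
    "2001:db8:30::3": {"W1": 100, "W2": 102, "W3": 101},
}


def rank_nodes(metrics_by_node):
    """Order node names from most to least preferred (lowest metric first)."""
    return sorted(metrics_by_node, key=lambda node: metrics_by_node[node])


def best_and_backup(table):
    """Map each service address to its (best, backup) pair of nodes."""
    result = {}
    for address, metrics_by_node in table.items():
        ranked = rank_nodes(metrics_by_node)
        result[address] = (ranked[0], ranked[1])
    return result


if __name__ == "__main__":
    for address, (best, backup) in best_and_backup(METRICS).items():
        print(f"{address}: best={best} backup={backup}")
```

Rotating the same three metrics across the nodes is what gives each address a distinct owner and a distinct backup.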

ExaBGP configuration

The configuration of ExaBGP is quite simple:

A script is responsible for checking that the service (an nginx web server)
is up and running and for advertising the appropriate IP addresses to the
two declared route servers. If we run the script manually, we can see
the advertised routes:

When the service becomes unresponsive, the healthcheck script detects
the situation and retries several times before acknowledging that the
service is dead. Then, the IP addresses are advertised with higher
metrics and the service will be routed to another node (the one
advertising 2001:db8:30::3/128 with metric 101).
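The retry logic can be sketched as a small state machine. This is not the healthcheck script shipped with ExaBGP: the class name, the threshold of three consecutive failures and the penalty of 1000 added to the metrics are all assumptions chosen for illustration.

```python
class HealthTracker:
    """Declare the service DOWN only after `fall` consecutive failed checks."""

    def __init__(self, fall=3):
        self.fall = fall          # assumed threshold
        self.failures = 0
        self.state = "UP"

    def record(self, check_ok):
        """Feed the result of one check and return the resulting state."""
        if check_ok:
            self.failures = 0
            self.state = "UP"
        else:
            self.failures += 1
            if self.failures >= self.fall:
                self.state = "DOWN"
        return self.state


def metrics_for(state, base_metrics, penalty=1000):
    """Return the metrics to advertise: unchanged when UP, penalised when DOWN."""
    if state == "UP":
        return dict(base_metrics)
    return {prefix: med + penalty for prefix, med in base_metrics.items()}


if __name__ == "__main__":
    tracker = HealthTracker()
    base = {"2001:db8:30::3/128": 100}
    for ok in [True, False, False, False]:
        state = tracker.record(ok)
        print(state, metrics_for(state, base))
```

Waiting for several failures avoids moving an address because of one slow request, at the cost of a slightly longer failover.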

This healthcheck script is now part of ExaBGP.

Route servers

We could have connected our ExaBGP servers directly to each
router. However, if you have 20 routers and 10 web servers, you now
have to manage a mesh of 200 sessions. The route servers are here
for three purposes:

Reduce the number of BGP sessions (from 200 to 60) between
devices (less configuration, fewer errors).

Avoid modifying the configuration on routers each time a new
service is added.

Separate the routing decision (the route servers) from the
routing process (the routers).
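The session counts quoted above follow from simple arithmetic. The sketch assumes that, with route servers, every web server and every router peers with each of the two route servers and with nothing else. The function names are made up for the example.

```python
def direct_sessions(routers, servers):
    """Sessions needed when every server peers with every router."""
    return routers * servers


def route_server_sessions(routers, servers, route_servers=2):
    """Sessions needed when routers and servers only peer with the route servers."""
    return route_servers * (routers + servers)


if __name__ == "__main__":
    print(direct_sessions(20, 10))        # 20 * 10
    print(route_server_sessions(20, 10))  # 2 * (20 + 10)
```

The first count grows with the product of the two populations, the second only with their sum.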

You may also ask yourself: “why not use OSPF?”. Good question!

OSPF could be enabled on each web node and the IP addresses
advertised using this protocol. However, OSPF has several
drawbacks: it does not scale, there are restrictions on the allowed
topologies, it is difficult to filter routes inside OSPF and a
misconfiguration will likely impact the whole network. Therefore,
it is considered a good practice to limit OSPF to network equipment.

The routes learned by the route servers could be injected into
OSPF. On paper, OSPF has a “next-hop” field to provide an explicit
next-hop. This would be handy as we wouldn’t have to configure
adjacencies with each router. However, I have absolutely no idea
how to inject BGP next-hop into OSPF next-hop. What happens is that
BGP next-hop is resolved locally using OSPF routes. For example, if
we inject BGP routes into OSPF from RS4, RS4 will know the
appropriate next-hop but other routers will route the traffic to
RS4.

Let’s look at how to configure our route servers. RS4 will use
BIRD while RS5 will use Quagga. Using two different
implementations improves resiliency: a single bug is unlikely to hit
both route servers at the same time.

Bird configuration

There are two sides for BGP configuration: the BGP sessions with the
ExaBGP nodes and the ones with the routers. Here is the
configuration for the latter:

The AS number used by our route server is 65002 while the AS number
used for routers is 65003 (the AS numbers for web nodes will be
65001). They are reserved for private use by RFC 6996. All routes
known by the route server are exported to the routers but no routes
are accepted from them.

Let’s have a look at the other side:

To ensure separation of concerns, we are being a bit more picky. With
➊ and ➋, we only accept loopback addresses and only if they are
contained in the subnet that we reserved for this use. No server
should be able to inject arbitrary addresses into our network. With ➌,
we also limit the number of routes that a server can advertise.
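The intent of this filter can be modelled outside of BIRD. The sketch below assumes that the reserved subnet is 2001:db8:30::/64 and that a server may announce at most 10 routes. Both values are guesses made for the example, as are the function names, and raising an exception merely stands in for whatever the daemon does when the limit is hit.

```python
import ipaddress

SERVICE_SUBNET = ipaddress.ip_network("2001:db8:30::/64")  # assumed reserved subnet
MAX_ROUTES = 10                                            # assumed per-server limit


def accept_route(prefix, subnet=SERVICE_SUBNET):
    """Accept only host routes (/128 for IPv6) contained in the reserved subnet."""
    net = ipaddress.ip_network(prefix, strict=False)
    return (
        net.version == subnet.version
        and net.prefixlen == net.max_prefixlen
        and net.subnet_of(subnet)
    )


def filter_routes(prefixes, limit=MAX_ROUTES):
    """Drop unacceptable routes and complain if too many remain."""
    accepted = [p for p in prefixes if accept_route(p)]
    if len(accepted) > limit:
        raise RuntimeError(f"route limit exceeded: {len(accepted)} > {limit}")
    return accepted


if __name__ == "__main__":
    print(filter_routes(["2001:db8:30::1/128", "2001:db8:31::1/128", "::/0"]))
```

With such a filter, a misconfigured or compromised web node cannot attract traffic for the default route or for someone else's prefix.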

With ➍, we reduce the hold time from 240 to 6. It means that after 6
seconds, the peer is considered dead. This is quite important to be
able to recover quickly from a dead machine. The minimal value
is 3. We could have used a similar setting for the sessions with the routers.

Quagga configuration

Quagga’s configuration is a bit more verbose but should be strictly equivalent:

The view is there to avoid installing routes in the kernel[3].

Routers

Configuring BIRD to receive routes from route servers is straightforward:

It is important to set gateway recursive because most of the time,
the next-hop is not reachable directly. In this case, by default,
BIRD will use the IP address of the advertising router (the route servers).
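Recursive resolution boils down to looking up the BGP next-hop in the routes already known from the IGP and borrowing the gateway of the most specific match. Here is a minimal sketch of that lookup. The contents of the IGP table (link-local gateways and interface names) are hypothetical and the function is only a model of the idea, not of BIRD's implementation.

```python
import ipaddress


def resolve_next_hop(bgp_next_hop, igp_routes):
    """Longest-prefix match of `bgp_next_hop` in `igp_routes`.

    `igp_routes` maps a prefix to a (gateway, interface) pair.
    Returns the pair of the most specific matching route, or None.
    """
    address = ipaddress.ip_address(bgp_next_hop)
    best = None
    for prefix, via in igp_routes.items():
        network = ipaddress.ip_network(prefix)
        if network.version != address.version or address not in network:
            continue
        if best is None or network.prefixlen > best[0].prefixlen:
            best = (network, via)
    return None if best is None else best[1]


if __name__ == "__main__":
    # Hypothetical routes learned through OSPF.
    igp = {
        "::/0": ("fe80::2", "eth0"),
        "2001:db8:7::/64": ("fe80::7", "eth1"),
    }
    print(resolve_next_hop("2001:db8:7::12", igp))
```

Without this step, the router would send the traffic to the route server that advertised the route, which is exactly what we want to avoid.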

Testing

Let’s check that everything works as expected. Here is the view from RS5:

For example, traffic to 2001:db8:30::2 should be routed through
2001:db8:7::12 (which is W2). The other IP addresses are assigned to
W1 and W3.

RS4 should see the same thing[4]:

Let’s have a look at DR6:

So, 2001:db8:30::3 is owned by W1 which is behind DR6 while the
two others are in another part of the network and will be reached
through the appropriate link-local addresses learned by OSPF.

Let’s kill W1 by stopping nginx. A few seconds later, DR6 learns
the new routes:

Demo

For a demo, have a look at the following video (it is also available
as an Ogg Theora video).

[1] The primary target of VRRP is to make a default gateway for
an IP network highly available by exposing a virtual router
which is an abstract representation of multiple routers. The
virtual IP address is held by the master router and any
backup router can be promoted to master in case of a
failure. In practice, VRRP can be used to achieve high
availability of services too.

[2] However, you could deploy an L2 overlay network using, for
example, VXLAN and use VRRP on it.

[3] I have been told on the Quagga mailing-list that such a
setup is quite uncommon and that it would be better to not
use views but use the --no_kernel flag for bgpd instead. You
may want to look at the whole thread for more details.

[4] The output of birdc6 shows both the next-hop as
advertised in BGP and the resolved next-hop if the route
has to be exported into another protocol. I have
simplified the output to avoid confusion.