2012-12-31

The following is a technical post written by Ian Applegate
(@AppealingTea), a member of our
Systems Engineering team, on how to optimize the Linux TCP stack for
mobile connections. The article was originally
published
as part of the 2012 Web Performance
Calendar. At CloudFlare, we spend
a significant amount of time ensuring our network stack is tuned to
whatever kind of network or device is connecting to us. We wanted to
share some of the technical details to help other organizations that are
looking to optimize for mobile network performance, even if they're not
using CloudFlare. And, if you are using
CloudFlare, you get all these benefits
and the fastest possible TCP performance when a mobile network accesses
your site.



We spend a lot of time at CloudFlare thinking about how to make the
Internet fast on mobile devices. Currently there are over 1.2 billion
active mobile users and that number is growing rapidly. Earlier this
year mobile Internet access passed fixed Internet access in India and
that's likely to be repeated the world over. So, mobile network
performance will only become more and more important.

Most of the focus today on improving mobile performance is on Layer 7
with front end optimizations (FEO). At CloudFlare, we've done
significant work in this area with front end optimization technologies
like Rocket Loader, Mirage, and
Polish that dynamically
modify web content to make it load quickly on whatever device is being
used. However, while FEO is important to make mobile fast, the unique
characteristics of mobile networks also means we have to pay attention
to the underlying performance of the technologies down at Layer 4 of the
network stack.

This article is about the challenges mobile devices present, how the
default TCP configuration is ill-suited for optimal mobile performance,
and what you can do to improve performance for visitors connecting via
mobile networks. Before diving into the details, a quick technical note.
At CloudFlare, we've built most of our systems on top of a custom
version of Linux so, while the underlying technologies can apply to
other operating systems, the examples I'll use are from Linux.

TCP Congestion Control

To understand the challenges of mobile network performance at Layer 4 of
the networking stack you need to understand TCP Congestion Control. TCP
Congestion Control is the gatekeeper that determines how to control the
flow of packets from your server to your clients. Its goal is to prevent
Internet congestion by detecting when congestion occurs and slowing down
the rate data is transmitted. This helps ensure that the Internet is
available to everyone, but can cause problems on mobile networks when TCP
mistakes mobile network problems for congestion.

TCP Congestion Control holds back the floodgates if it detects
congestion (i.e. packet loss) on the remote end. A network is,
inherently, a shared resource. The purpose of TCP Congestion Control is
to ensure that every device on the network cooperates and doesn't
overwhelm that shared resource. On a wired network, if packet loss is
detected, it is a
fairly reliable indicator that a port along the connection is
overburdened. What is typically going on in these cases is that a memory
buffer in a switch somewhere has filled beyond its capacity because
packets are coming in faster than they can be sent out and data is being
discarded. TCP Congestion Control on clients and servers is set up to
"back off" in these cases in order to ensure that the network remains
available for all its users.

But figuring out what packet loss means on a mobile network is a
different matter. Radio networks are inherently susceptible to
interference, which results in packet loss. If packets are being dropped
does that mean a switch is overburdened, like we can infer on a wired
network? Or did someone travel from an undersubscribed wireless cell to
an oversubscribed one? Or did someone just turn on a microwave? Or maybe
it was just a random solar flare? Since it's not as clear what packet
loss means on a mobile network, it's not clear what action a TCP
Congestion Control algorithm should take.

A Series of Leaky Tubes

To optimize for lossy connections like those on mobile networks,
it's important to understand exactly how TCP Congestion Control
algorithms are designed. While the high level concept makes sense, the
details of TCP Congestion Control are not widely understood, even among
people working in the web performance industry. That said, it is an
important core part of what makes the Internet reliable and the subject
of very active research and development.



To understand how TCP Congestion Control algorithms work, imagine the
following analogy. Think of your web server as your local water utility
plant. You've built out a large network of pipes in your hometown and you
need to guarantee that each pipe is as pressurized as possible for
delivery, but you don't want to burst the pipes. (Note: I recognize the
late Senator Ted Stevens got a lot of flak for describing the Internet
as a "series of tubes," but the metaphor is surprisingly accurate.)

Your client, Crazy Arty, runs a local water bottling plant that connects
to your pipe network. Crazy Arty's infrastructure is built on old pipes
that are leaky and brittle. For you to get water to him without
bursting his pipes, you need to infer the capacity of Crazy Arty's
system. If you don't know it in advance then you run a test — you send a
known amount of water to the line and then measure the pressure. If the
pressure is suddenly lost then you can infer that you broke a pipe. If
not, then that level is likely safe and you can add more water pressure
and repeat the test. You can iterate this test until you burst a pipe,
see the drop in pressure, write down the maximum safe water volume, and
ensure going forward that you never exceed it.

Imagine, however, that there's some exogenous factor that could decrease
the pressure in the pipe without actually indicating a pipe had burst.
What if, for example, Crazy Arty ran a pump that he turned on randomly
from time to time without telling you? If the only signal
you have is observing a loss in pressure, you'd have no way of knowing
whether you'd burst a pipe or if Crazy Arty had just plugged in the
pump. The effect would be that you'd likely record a pressure level much
less than the amount the pipes could actually withstand — leading to all
your customers on the network potentially having lower water pressure
than they should.

Optimizing for Congestion or Loss

If you've been following up to this point then you already know more
about TCP Congestion Control than you would guess. The initial amount of
water we talked about is known in TCP as the Initial Congestion Window
(initcwnd): the initial number of packets in flight across the
network. The congestion window (cwnd) either shrinks, grows, or stays
the same depending on how many packets make it back and how fast (in ACK
trains) they return after the initial burst. In essence, TCP Congestion
Control is just like the water utility — measuring the pressure a
network can withstand and then adjusting the volume in an attempt to
maximize flow without bursting any pipes.
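
If you want to see the cwnd in action, you can inspect it on a live
Linux server. A minimal diagnostic sketch, assuming the ss utility from
iproute2 is installed (the exact fields shown vary by version):

ss -ti

Each established socket prints a line of TCP internals, including the
congestion control algorithm in use, the current cwnd, and the measured
rtt.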

When a TCP connection is first established it attempts to ramp up the
cwnd quickly. This phase of the connection, where TCP grows the cwnd
rapidly, is called Slow Start. That's a bit of a misnomer since it is
generally an exponential growth function which is quite fast and
aggressive. Just like when the water utility in the example above
detects a drop in pressure it turns down the volume of water, when TCP
detects packets are lost it reduces the size of the cwnd and delays the
time before another burst of packets is delivered. The time between
packet bursts is known as the Retransmission Timeout (RTO). The
algorithm within TCP that controls these processes is called the
Congestion Control Algorithm. There are many congestion control
algorithms, and clients and servers can use different strategies based on
the characteristics of their networks. Most Congestion Control
Algorithms focus on optimizing for one type of network loss or another:
congestive loss (like you see on wired networks) or random loss (like
you see on mobile networks).
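
To see which Congestion Control Algorithm your own kernel is using, and
which others it has available, you can query sysctl. These keys exist on
any modern Linux:

sysctl net.ipv4.tcp_congestion_control
sysctl net.ipv4.tcp_available_congestion_control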



In the example above, a pipe bursting would be an indication of
congestive loss. There was a physical limit to the pipes, it was
exceeded, and the appropriate response was to back off. On the other
hand, Crazy Arty's pump is analogous to random loss. The capacity is
still available on the network and only a temporary disturbance causes
the water utility to see the pipes as overfull. The Internet started as
a network of wired devices, and, as its name suggests, congestion
control was largely designed to optimize for congestive loss. As a
result, the default Congestion Control Algorithm in many operating
systems is good for communicating with wired networks but not as good for
communicating with mobile networks.

A few Congestion Control algorithms try to bridge the gap by using
delay measurements to compare the "pressure increase" against the
"expected capacity" and figure out the likely cause of a loss. These are
known as bandwidth
estimation algorithms, and examples include
Vegas,
Veno
and Westwood+.
Unfortunately, all of these methods are reactive and reuse no
information across two similar streams.

At companies that see a significant amount of network traffic, like
CloudFlare or Google, it is possible to map the characteristics of the
Internet's networks and choose a specific congestion control algorithm
in order to maximize performance for that network. Unfortunately, unless
you see large amounts of traffic as we do and can record data
on network performance, instrumenting your congestion
control or building a "weather forecast" is usually impossible.
Fortunately, there are still several things you can do to make your
server more responsive to visitors even when they're coming from lossy,
mobile devices.

Compelling Reasons to Upgrade Your Kernel

The Linux network stack has been under extensive development to bring
about some sensible defaults and mechanisms for dealing with the network
topology of 2012. A mixed network of high-bandwidth, low-latency
connections and high-bandwidth, high-latency, lossy connections was never
fully anticipated by the kernel developers of 2009, and if you check your
server's kernel version, chances are it's running a 2.6.32.x kernel from
that era.

uname -a

There are a number of reasons that if you're running an old kernel on
your web server and want to increase web performance, especially for
mobile devices, you should investigate upgrading. To begin, Linux 2.6.38
bumps the default initcwnd and initrwnd (initial receive window) from 3
to 10. This is an easy, big win.
It allows for 14.2KB (vs 5.7KB) of data to be sent or received in the
initial round trip before slow start grows the cwnd further. This is
important for HTTP and SSL because it gives you more room to fit the
header in the initial set of packets. If you are running an older kernel
you may be able to run the following command on a bash shell (use
caution) to set all of your routes' initcwnd and initrwnd to 10. On
average, this small change can be one of the biggest boosts when you're
trying to maximize web performance.

ip route | while read p; do ip route change $p initcwnd 10 initrwnd 10; done
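
To sanity-check that the change took, list the routes again and look for
the new values (output format varies by iproute2 version):

ip route show

Keep in mind these settings are per-route and won't survive a reboot, so
you may want to run the loop from a startup script.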

Linux kernel 3.2 implements Proportional Rate Reduction
(PRR).
PRR decreases the time it takes for a lossy connection to recover its
full speed, potentially improving HTTP response times by 3-10%. The
benefits of PRR are significant for mobile networks. To understand why,
it's worth diving back into the details of how previous congestion
control strategies interacted with loss.

Many congestion control algorithms halve the cwnd when a loss is
detected. When multiple losses occur this can result in a case where the
cwnd is lower than the slow start threshold. Unfortunately, the
connection never goes through slow start again. The result is that a few
network interruptions can result in TCP slowing to a crawl for all the
connections in the session.

This is even more deadly when combined with the tcp_no_metrics_save=0
sysctl setting on unpatched kernels before 3.2. This setting will save
data on connections and attempt to use it to optimize the network.
Unfortunately, this can actually make performance worse because TCP will
apply the exception case to every new connection from a client within a
window of a few minutes. In other words, in some cases, one person
surfing your site from a mobile phone who has some random packet loss
can reduce your server's performance to this visitor even when their
temporary loss has cleared.

If you expect your visitors to be coming from mobile, lossy connections
and you cannot upgrade or patch your kernel, I recommend setting
tcp_no_metrics_save=1. If you're comfortable doing some hacking, you
can patch older
kernels.
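
For reference, here is the sysctl form of that recommendation; the key
lives under net.ipv4, and the second line is one common way to make it
persist across reboots:

sysctl -w net.ipv4.tcp_no_metrics_save=1
echo 'net.ipv4.tcp_no_metrics_save = 1' >> /etc/sysctl.conf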

The good news is that Linux 3.2 implements PRR, which decreases the
amount of time that a lossy connection will impact TCP performance. If
you can upgrade, it may be one of the most significant things you can do
in order to increase your web performance.

More Improvements Ahead

Linux 3.2 also has another important improvement with RFC 2988bis (now
RFC 6298). The initial Retransmission Timeout (initRTO) has been changed
from 3s to 1s. If loss happens after sending the initcwnd, two seconds of
waiting time are saved when retransmitting the data. With TCP streams
being so short,
this can have a very noticeable improvement if a connection experiences
loss at the beginning of the stream. Like the PRR patch, this can also be
applied (with modification) to older kernels if for some reason you
cannot upgrade (here's the
patch).

Looking forward, Linux 3.3 has Byte Queue Limits which, when teamed with
CoDel (controlled delay) in the 3.5 kernel, help fight the long-standing
issue of Bufferbloat by intelligently managing packet queues. Bufferbloat
is when oversized network buffers fill with stale data, adding latency
without adding throughput. Linux 3.3 has features to automatically
prioritize important packets (SYN/DNS/ARP/etc.), keeping buffer queues
short, thereby reducing bufferbloat and improving latency on loaded
servers.
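
If you're on a 3.5+ kernel and want to experiment, CoDel is exposed as a
queueing discipline through tc. A sketch, assuming your interface is
eth0 and your iproute2 build knows about fq_codel:

tc qdisc replace dev eth0 root fq_codel
tc -s qdisc show dev eth0

The second command prints queue statistics so you can watch drops and
delays under load.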

Linux 3.5 implements TCP Early
Retransmit with some safeguards for
connections that have a small amount of packet reordering. This allows
connections, under certain conditions, to trigger fast retransmit and
bypass the costly Retransmission Timeout (RTO) mentioned earlier. By
default it is enabled in the failsafe mode tcp_early_retrans=2. If for
some reason you are sure your clients have loss but no reordering then
you could set tcp_early_retrans=1 to save a quarter of an RTT on
recovery.
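
The knob lives under net.ipv4, so if you've verified your traffic sees
loss without reordering, the change looks like this:

sysctl -w net.ipv4.tcp_early_retrans=1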

One of the most extensive changes in 3.6 that hasn't gotten much press is
the removal of the IPv4 routing cache. In a nutshell it was an
extraneous caching layer in the kernel that mapped interfaces to routes
to IPs and saved a lookup to the Forwarding Information Base (FIB). The FIB
is a routing table within the network stack. The IPv4 routing cache was
intended to eliminate a FIB lookup and increase performance. While a
good idea in principle, it unfortunately provided only a very small
performance boost, and in less than 10% of connections. In the 3.2.x-3.5.x
kernels it was extremely vulnerable to certain DDoS techniques so it has
been removed.

Finally, one important setting you should check, regardless of the Linux
kernel you are running: tcp_slow_start_after_idle. If you're
concerned about web performance, it has been proclaimed the sysctl
setting of the year. It is available in almost any kernel. By default this is
set to 1 which will aggressively reduce cwnd on idle connections and
negatively impact any long lived connections such as SSL. The following
command will set it to 0 and can significantly improve performance:

sysctl -w net.ipv4.tcp_slow_start_after_idle=0
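
Like the other sysctl changes above, this won't survive a reboot on its
own. One way to make it permanent is to append it to /etc/sysctl.conf
and reload:

echo 'net.ipv4.tcp_slow_start_after_idle = 0' >> /etc/sysctl.conf
sysctl -p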

The Missing Congestion Control Algorithm

You may be curious as to why I haven't recommended a
quick and easy change of congestion control algorithm. Since Linux
2.6.19, the default congestion control algorithm in the Linux kernel is
CUBIC, which is time based and optimized for high speed and high latency
networks. Its killer feature, called Hybrid Slow Start
(HyStart), allows it to safely exit slow start by measuring ACK
trains and not overshooting the cwnd. It can improve startup throughput by
up to 200-300%.
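
You can verify that CUBIC and HyStart are active on your server. This
assumes CUBIC was built as the tcp_cubic module, which is the common
case; a hystart value of 1 means it's enabled:

sysctl net.ipv4.tcp_congestion_control
cat /sys/module/tcp_cubic/parameters/hystart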

While other Congestion Control Algorithms may seem like performance wins
on connections experiencing high amounts of loss (>0.1%) (e.g., TCP
Westwood+ or Hybla), unfortunately these algorithms don't include
HyStart. The net effect is that, in our tests, they underperform CUBIC
for general network performance. Unless a majority of your clients are
on lossy connections, I recommend staying with CUBIC.
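
If you do serve a mostly lossy client base and want to test one of these
algorithms, switching is straightforward. A sketch, assuming the
tcp_westwood module ships with your kernel (it does in mainline builds):

modprobe tcp_westwood
sysctl -w net.ipv4.tcp_congestion_control=westwood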

Of course the real answer here is to dynamically swap out congestion
control algorithms based on historical data to better serve these edge
cases. Unfortunately, that is difficult for the average web server
unless you're seeing a very high volume of traffic and are able to
record and analyze network characteristics across multiple connections.
The good news is that loss predictors and hybrid congestion control
algorithms are continuing to mature, so maybe we will have an answer in
an upcoming kernel.
