2015-02-16

Hello,

Our installation of HAProxy v1.5.3 on Debian GNU/Linux Wheezy (v7.8,
fully patched to date, running on bare metal with no virtualization)
has been stable. I have an active/passive server pair using keepalived,
and both servers have been running this version without issue since
7/31/14. HAProxy fronts a backend Windows Server 2008 R2/IIS v7.5 web
farm.

The physical servers are Dell PowerEdge R310s, each with one Intel Xeon
X3430 (4 cores) @2.4GHz and 32GB of RAM (@800 MHz). Each server has
bond0 configured, which is composed of eth0 and eth1, and each physical
interface connects to a switch stack (Cisco Catalyst 3750) using
802.3ad. The on-board network cards are Broadcom Corporation NetXtreme
II BCM5716 Gigabit Ethernet (rev 20). The Cisco 3750 switch interface
configuration and statistics (i.e., input/output errors, CRCs, etc.)
are clean. The backend servers are physically connected to the same
Cisco 3750 switch stack. Active/passive high availability for HAProxy
using keepalived works as expected.

HAProxy Statistics under normal weekly workloads reflect the following:

Queue/Cur - 0, Max - some #, Limit --
Session rate/Cur - 1 to 200 per server
Session rate/Max - 300 to 500
Session rate/Limit - blank
Sessions/Cur - 1 to 30 per server; could spike to 50
Sessions/Max - 50
Sessions/Limit - 50
Denied Req/Resp - 0
Errors/Req -
Errors/Conn - 0
Errors/Resp - usually 1+, but not incrementing fast (i.e., in six hours'
time today there are 41 total)
Warnings/Ret/Redis - 0

In January 2015, I tried to catch up on HAProxy maintenance releases by
upgrading only our active server from v1.5.3 to v1.5.10 (before 1.5.11
was announced) late on a Tuesday night. Immediately after the upgrade,
the active server behaved as expected in testing. Unfortunately, v1.5.10
surfaced a new problem around 9:00 a.m. the next morning, which forced
me to fail over to our passive server (still running v1.5.3) to restore
service to our customers. I then downgraded the active server to v1.5.3
to stabilize the system and restore the high availability pair.

*The problem exhibited the following behaviors on the active server:*

* HAProxy Statistics (HPS) showed many, but not all, web farm servers
  with Queue/Cur in the low thousands. They stayed there, with the
  queue count fluctuating up and down by < 100 on every stats page
  refresh. For these same servers, Sessions/Cur was stuck at 50, which
  is the configured Max and Limit; that explains the queuing and why
  some customers weren't able to use our service.

* HPS would intermittently flash yellow horizontal lines and report a
  very high 2000ms L7 response time, typically on the servers with the
  high queue count.

* Stopping and starting the HAProxy service would shuffle which servers
  in HPS had the high queues, but never all of them (only two or three
  at a time). Waiting five or ten minutes didn't let the queues drain
  through normal session processing.

* HPS would rarely flash a red horizontal line, and that server's
  Queue/Cur would then seem to zero out.

* CPU utilization (30%) and memory consumption (< 5GB) on the active
  node during the event were within standard trends.
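
For anyone who wants to capture this state from the stats socket instead of watching the stats page, here is a minimal Python sketch. It assumes the CSV comes from something like `echo "show stat" | socat unix-connect:/tmp/haproxy stdio`, and it flags servers that sit at their session limit with a non-empty queue. The function name and the sample rows are made up for illustration; they are not real output from my servers.

```python
import csv
import io


def saturated_servers(show_stat_csv):
    """Return (proxy, server, qcur, scur, slim) for every server that is
    at its session limit while requests are waiting in its queue."""
    text = show_stat_csv.lstrip()
    # HAProxy prefixes the CSV header line with "# "; drop it so the
    # first column is named "pxname".
    if text.startswith("# "):
        text = text[2:]
    flagged = []
    for row in csv.DictReader(io.StringIO(text)):
        # Skip the aggregate FRONTEND/BACKEND rows; only servers matter here.
        if row["svname"] in ("FRONTEND", "BACKEND"):
            continue
        if not row.get("slim"):
            continue  # no per-server maxconn configured
        qcur = int(row["qcur"] or 0)
        scur = int(row["scur"] or 0)
        slim = int(row["slim"])
        if qcur > 0 and scur >= slim:
            flagged.append((row["pxname"], row["svname"], qcur, scur, slim))
    return flagged


# Hypothetical sample (first eight columns only), not captured output.
SAMPLE = """# pxname,svname,qcur,qmax,scur,smax,slim,stot,
web-farm,web01,1840,2100,50,50,50,120000,
web-farm,web02,0,12,17,50,50,118000,
web-farm,BACKEND,1840,2100,67,100,3200,238000,
"""

for px, sv, qcur, scur, slim in saturated_servers(SAMPLE):
    print(f"{px}/{sv}: queue={qcur} sessions={scur}/{slim}")
```

Running this every few seconds during an event would give a timestamped record of which servers are pinned at 50 sessions, which is easier to share than screenshots of the stats page.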

None of the backend web farm servers, per our cacti graphs, displayed
any CPU, memory, or disk anomalies during this time. I decided to table
any further upgrade attempts until I could research the issue further.

On the night of 2/13/15, I tried again with v1.5.11, even though I had
struggled to find anything in the /HAProxy ChangeLog/ relevant to my
earlier experience, or any problems with my configuration. All weekend
and early this morning, v1.5.11 behaved, up until more customers came
online and started using our services. Looking at our cacti graph, from
8:50 a.m. EST to 9:00 a.m. EST, our combined ingress and egress traffic
jumped from 80Mbps to 170Mbps. It was during this time that the problem
described above surfaced again, causing a service failure for a large
number of our customers.

* @ 9:05 a.m. Stopping and starting HAProxy v1.5.11 didn't resolve the
  problem. I waited six minutes, but processing didn't catch up.

* @ 9:12 a.m. I downgraded HAProxy from v1.5.11 to v1.5.3 and
  everything normalized in less than a minute.

* @ 9:16 a.m. I upgraded HAProxy from v1.5.3 to v1.5.5 and the problem
  surfaced again and didn't heal within five minutes.

* @ 9:22 a.m. I downgraded HAProxy from v1.5.5 to v1.5.4 and
  everything normalized in less than a minute. It has been stable all
  day so far.

Each time I built HAProxy, I ran the following:

* wget http://haproxy.1wt.eu/download/1.5/src/haproxy-1.x.x.tar.gz
* tar -xf haproxy-1.x.x.tar.gz
* cd haproxy-1.x.x
* service haproxy stop
* make TARGET=linux2628 CPU=generic USE_PCRE=1 USE_OPENSSL=1 USE_ZLIB=1
* make install
* service haproxy start

I've reviewed the ChangeLog found here:
http://www.haproxy.org/download/1.5/src/CHANGELOG, but I haven't been
able to pinpoint any specific change in v1.5.5 which might be affecting
my deployment based on my configuration.

*root@server:/#uname -a*

Linux p01 3.2.0-4-amd64 #1 SMP Debian 3.2.65-1+deb7u1 x86_64 GNU/Linux

*root@server:/#cat /etc/sysctl.conf*

net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.secure_redirects = 1
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.all.log_martians = 1
net.core.somaxconn=10000
net.ipv4.ip_local_port_range = 5700 65000

*root@server:/#cat /etc/haproxy/haproxy.conf*

global
    log 127.0.0.1 local0
    maxconn 32000
    user (some user)
    group (some group)
    daemon
    maxsslconn 32000
    maxconnrate 32000
    chroot /(some path)/chroot/haproxy
    node (some name)
    stats socket /(some path)/haproxy
    tune.ssl.default-dh-param 1024

defaults
    log global
    mode http
    option httplog
    option dontlognull
    retries 3
    option redispatch
    maxconn 32000
    timeout connect 35s
    timeout client 35s
    timeout server 35s

frontend web
    mode http
    timeout client 1200s
    option forwardfor except 127.0.0.1
    bind *:80
    bind 0.0.0.0:443 ssl crt (path to cer file) ca-file (path to crt)
    redirect scheme https if !{ ssl_fc }
    acl url_imaging path_beg /(custom path 1)
    acl url_report path_beg /(custom path 2)
    acl url_wlog path_beg /(custom path 3)
    use_backend sweb-farm if url_imaging or url_report or url_wlog
    capture request header Host len 32
    capture request header User-Agent len 200
    capture request header Content-length len 200
    capture request header X-Forwarded-For len 32
    default_backend web-farm

backend web-farm
    mode http
    # This ridiculous timeout is required due to bad application design
    # for reporting purposes.
    timeout server 1200s
    option httpchk HEAD /index.html
    option http-server-close
    balance hdr(host)
    hash-type consistent
    stick-table type ip size 10m expire 30m
    stick on src
    stats enable
    stats hide-version
    stats scope .
    stats uri (my stats URI)
    stats realm Haproxy\ Statistics
    stats auth (username:pass)
    stats refresh 2s
    stats show-legends
    stats show-node (city)
    server web01 x.x.x.x:80 maxconn 50 weight 30 check inter 2000 rise 2 fall 2 ca-file (path to crt)
    server web02 x.x.x.x:80 maxconn 50 weight 30 check inter 2000 rise 2 fall 2 ca-file (path to crt)
    server web03 x.x.x.x:80 maxconn 50 weight 15 check inter 2000 rise 2 fall 2 ca-file (path to crt)
    server web04 x.x.x.x:80 maxconn 50 weight 30 check inter 2000 rise 2 fall 2 ca-file (path to crt)
    server web05 x.x.x.x:80 maxconn 50 weight 15 check inter 2000 rise 2 fall 2 ca-file (path to crt)
    server web06 x.x.x.x:80 maxconn 50 weight 30 check inter 2000 rise 2 fall 2 ca-file (path to crt)
    server web07 x.x.x.x:80 maxconn 50 weight 30 check inter 2000 rise 2 fall 2 ca-file (path to crt)
    server web08 x.x.x.x:80 maxconn 50 weight 30 check inter 2000 rise 2 fall 2 ca-file (path to crt)
    server web09 x.x.x.x:80 maxconn 50 weight 30 check inter 2000 rise 2 fall 2 ca-file (path to crt)

backend sweb-farm
    mode http
    # This ridiculous timeout is required due to bad application
    # design for reporting purposes.
    timeout server 1200s
    option httpchk HEAD /index.html
    option http-server-close
    stick match src table web-farm
    server sweb01 x.x.x.x:443 maxconn 50 weight 30 check ssl inter 2000 rise 2 fall 2 ca-file (path to crt)
    server sweb02 x.x.x.x:443 maxconn 50 weight 30 check ssl inter 2000 rise 2 fall 2 ca-file (path to crt)
    server sweb03 x.x.x.x:443 maxconn 50 weight 15 check ssl inter 2000 rise 2 fall 2 ca-file (path to crt)
    server sweb04 x.x.x.x:443 maxconn 50 weight 30 check ssl inter 2000 rise 2 fall 2 ca-file (path to crt)
    server sweb05 x.x.x.x:443 maxconn 50 weight 15 check ssl inter 2000 rise 2 fall 2 ca-file (path to crt)
    server sweb06 x.x.x.x:443 maxconn 50 weight 30 check ssl inter 2000 rise 2 fall 2 ca-file (path to crt)
    server sweb07 x.x.x.x:443 maxconn 50 weight 30 check ssl inter 2000 rise 2 fall 2 ca-file (path to crt)
    server sweb08 x.x.x.x:443 maxconn 50 weight 30 check ssl inter 2000 rise 2 fall 2 ca-file (path to crt)
    server sweb09 x.x.x.x:443 maxconn 50 weight 30 check ssl inter 2000 rise 2 fall 2 ca-file (path to crt)

frontend print-proxy
    mode tcp
    # This timeout is required due to bad application design for
    # reporting purposes.
    timeout client 2m
    option tcplog
    bind *:808
    default_backend print-farm

backend print-farm
    mode tcp
    balance roundrobin
    # This timeout is required due to bad application design for
    # reporting purposes.
    timeout server 2m
    stick match src table web-farm
    server web01 x.x.x.x:808
    (truncated for brevity)
    server web09 x.x.x.x:808
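
To put the per-server limits in the web-farm backend into numbers, here is a small Python sketch of the arithmetic. It only uses the maxconn and weight values from the server lines above. Treating weight as a proportional share of traffic is an approximation I am assuming for illustration: with balance hdr(host) and stick on src, the real distribution depends on which Host headers and source addresses are active.

```python
# Values copied from the "server" lines of the web-farm backend.
MAXCONN_PER_SERVER = 50
WEIGHTS = {
    "web01": 30, "web02": 30, "web03": 15,
    "web04": 30, "web05": 15, "web06": 30,
    "web07": 30, "web08": 30, "web09": 30,
}

# Total concurrent server-side sessions the backend allows before queuing.
total_slots = MAXCONN_PER_SERVER * len(WEIGHTS)

# Nominal share of traffic per server if load followed weight exactly
# (an assumption, see the note above).
total_weight = sum(WEIGHTS.values())
shares = {name: weight / total_weight for name, weight in WEIGHTS.items()}

print(f"total server slots: {total_slots}")
for name, share in sorted(shares.items()):
    print(f"{name}: weight share {share:.2%}")
```

The point is only that the whole farm has a fixed number of concurrent server slots, so any server that stops releasing its 50 sessions will start queuing no matter how idle the other servers are.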

*root@server:/#haproxy -vv*

HA-Proxy version 1.5.4 2014/09/02
Copyright 2000-2014 Willy Tarreau <w@1wt.eu>

Build options :
  TARGET = linux2628
  CPU = generic
  CC = gcc
  CFLAGS = -O2 -g -fno-strict-aliasing
  OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.7
Compression algorithms supported : identity, deflate, gzip
Built with OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
Running on OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.30 2012-02-04
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND

Available polling systems :
  epoll : pref=300, test result OK
  poll : pref=200, test result OK
  select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.

*root@server:/#echo "show info" | socat unix-connect:/tmp/haproxy stdio*

Name: HAProxy
Version: 1.5.4
Release_date: 2014/09/02
Nbproc: 1
Process_num: 1
Pid: 13579
Uptime: 0d 3h03m29s
Uptime_sec: 11009
Memmax_MB: 0
Ulimit-n: 64051
Maxsock: 64051
Maxconn: 32000
Hard_maxconn: 32000
CurrConns: 7251
CumConns: 210523
CumReq: 8374386
MaxSslConns: 32000
CurrSslConns: 7094
CumSslConns: 292816
Maxpipes: 0
PipesUsed: 0
PipesFree: 0
ConnRate: 20
ConnRateLimit: 32000
MaxConnRate: 577
SessRate: 20
SessRateLimit: 0
MaxSessRate: 577
SslRate: 19
SslRateLimit: 0
MaxSslRate: 576
SslFrontendKeyRate: 11
SslFrontendMaxKeyRate: 323
SslFrontendSessionReuse_pct: 42
SslBackendKeyRate: 0
SslBackendMaxKeyRate: 8
SslCacheLookups: 168401
SslCacheMisses: 3426
CompressBpsIn: 0
CompressBpsOut: 0
CompressBpsRateLim: 0
ZlibMemUsage: 0
MaxZlibMemUsage: 0
Tasks: 7278
Run_queue: 1
Idle_pct: 74
node: (server)
description:
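
As a quick sanity check on the "show info" numbers above, here is a minimal Python sketch that parses the "Key: value" lines and derives two ratios: how much of the global connection limit is in use, and the SSL session cache miss rate. The helper names are hypothetical, and the sample contains only the handful of fields it needs, copied from the output above.

```python
def parse_show_info(text):
    """Parse "Key: value" lines from "show info" into a dict of strings."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as dates survive.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def ratio(info, numerator, denominator):
    """Return numerator/denominator as a float, or None if undefined."""
    den = int(info[denominator])
    return int(info[numerator]) / den if den else None


# Subset of the fields shown above.
SAMPLE_INFO = """Maxconn: 32000
CurrConns: 7251
SslCacheLookups: 168401
SslCacheMisses: 3426
Idle_pct: 74
"""

info = parse_show_info(SAMPLE_INFO)
conn_usage = ratio(info, "CurrConns", "Maxconn")
ssl_miss = ratio(info, "SslCacheMisses", "SslCacheLookups")
print(f"connection usage: {conn_usage:.1%}")
print(f"SSL cache miss rate: {ssl_miss:.1%}")
```

Both ratios are low in this v1.5.4 snapshot, which was taken while the system was healthy; the same calculation on a snapshot taken during an event would show whether the global limits come into play at all.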

*root@server:/etc# dpkg -s openssl*

Package: openssl
Status: install ok installed
Priority: optional
Section: utils
Installed-Size: 1082
Maintainer: Debian OpenSSL Team <pkg-openssl-devel@lists.alioth.debian.org>
Architecture: amd64
Version: 1.0.1e-2+deb7u14
Depends: libc6 (>= 2.7), libssl1.0.0 (>= 1.0.1e-2+deb7u5), zlib1g (>= 1:1.1.4)
Suggests: ca-certificates
