2013-11-13

Hello,

We have two D6 boxes (HP DL360 G7) configured in ClusterXL (new HA mode). OS - SPLAT R75.30.

For the first sync network we are using a direct crossover cable between the two nodes (on the s0p0 interface).

We have a bonding interface (LACP, layer3+4 hash) made up of the s0p1, s0p2, s1p0 and s1p1 physical interfaces. All of them are 1 Gbps ports spread across two modules, s0 and s1; a sketch of the equivalent generic Linux bonding options follows the /proc output below. The NICs are:

lspci | grep Ether

03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

03:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

04:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

04:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

08:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)

08:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)

09:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)

09:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)

Bonding seems to be working fine, and ethtool -S doesn't show any errors on any of the physical ports.
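For reference, the per-port counter check looks roughly like this (interface names as above; the grep pattern is only an illustration of what we look for):

# loop over the bond slaves and look for error/drop counters
for i in s0p1 s0p2 s1p0 s1p1; do
  echo "== $i =="
  ethtool -S $i | grep -iE 'err|drop|crc'
done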

--------------------- cat /proc/net/bonding/bond0 ---------------------

# cat /proc/net/bonding/bond0

Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation

Transmit Hash Policy: layer3+4 (1)

MII Status: up

MII Polling Interval (ms): 100

Up Delay (ms): 200

Down Delay (ms): 200

802.3ad info

LACP rate: slow

Active Aggregator Info:

Aggregator ID: 4

Number of ports: 4

Actor Key: 17

Partner Key: 20

Partner Mac Address: d0:d0:fd:a5:e3:80

Slave Interface: s1p0

MII Status: up

Link Failure Count: 1

Permanent HW addr: a0:36:9f:15:11:71

Aggregator ID: 4

Slave Interface: s1p1

MII Status: up

Link Failure Count: 1

Permanent HW addr: a0:36:9f:15:11:70

Aggregator ID: 4

Slave Interface: s0p1

MII Status: up

Link Failure Count: 1

Permanent HW addr: e4:11:5b:d4:30:a4

Aggregator ID: 4

Slave Interface: s0p2

MII Status: up

Link Failure Count: 1

Permanent HW addr: e4:11:5b:d4:30:ae

Aggregator ID: 4

-----------------------------------------------
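For reference, the parameters shown in /proc above correspond to the standard Linux bonding driver options below. On SPLAT these are generated by its own sysconfig, so this is illustrative only, not our actual config file:

# generic RHEL-style ifcfg-bond0 options matching the /proc output above
# (illustrative - SPLAT maintains its own network configuration files)
BONDING_OPTS="mode=802.3ad xmit_hash_policy=layer3+4 miimon=100 updelay=200 downdelay=200 lacp_rate=slow"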

On this bond interface we have five 802.1Q VLANs.
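Conceptually the VLAN sub-interfaces on the bond look like this (SPLAT creates them through its own sysconfig; the VLAN ID here is made up purely for illustration):

# illustrative only - VLAN ID 100 is not one of our real VLANs
vconfig add bond0 100      # creates bond0.100
ifconfig bond0.100 up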

The rulebase is pretty simple - 180 rules and 4 NAT rules. We don't use anything else (no IPS, no AV, etc.).

This device has high CPU usage - between 40% and 55%. Total traffic going through the CP firewall is 1.1 Gbps (total in+out on all interfaces). Peak concurrent connections are 170,000.

According to the specifications this device can handle up to 25 Gbps and 5M concurrent connections, and we are far below that (the maximum connections limit was changed from 25,000 to 800,000).

These D6 boxes actually replaced a Cisco ASA5550 pair that ran at 60-70% CPU utilisation.

Our expectation was no more than 25-30% CPU load on the D6 boxes, so we think there is something wrong with our setup.

The CPU usage comes from the fw_worker_0, fw_worker_1 and fw_worker_2 processes (the CPU is an E5620 - 4 cores with HT disabled, cpuinfo attached). CoreXL is at its default settings - all NIC interrupts are handled by CPU0, while fw_worker_2 runs on CPU1, fw_worker_1 on CPU2 and fw_worker_0 on CPU3:
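One way to confirm where the NIC interrupts land is the following (assuming the IRQs are registered under the interface names; the grep pattern is only an illustration):

grep -E 's[01]p' /proc/interrupts   # per-CPU interrupt counts for the NIC IRQs
fw ctl affinity -l -r               # CoreXL worker / interface affinity (output further below)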

-------------- part of top ---------------

top - 08:42:19 up 5 days, 18:30, 1 user, load average: 1.83, 1.69, 1.58

Tasks: 97 total, 2 running, 95 sleeping, 0 stopped, 0 zombie

Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 61.8%id, 0.0%wa, 6.0%hi, 32.2%si, 0.0%st

Cpu1 : 0.0%us, 0.3%sy, 0.0%ni, 39.3%id, 0.0%wa, 0.0%hi, 60.3%si, 0.0%st

Cpu2 : 0.0%us, 0.7%sy, 0.0%ni, 46.8%id, 0.0%wa, 0.0%hi, 52.5%si, 0.0%st

Cpu3 : 0.0%us, 0.7%sy, 0.0%ni, 59.1%id, 0.0%wa, 0.0%hi, 40.2%si, 0.0%st

Mem: 6221296k total, 1917612k used, 4303684k free, 226084k buffers

Swap: 13631144k total, 0k used, 13631144k free, 218220k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

2797 root 15 0 0 0 0 R 61 0.0 1075:00 fw_worker_2

2716 root 16 0 0 0 0 S 53 0.0 1130:56 fw_worker_1

2653 root 15 0 0 0 0 S 40 0.0 1033:12 fw_worker_0

15084 root 15 0 401m 47m 18m S 1 0.8 106:19.91 fw

25670 root 15 0 2068 1016 780 R 0 0.0 0:00.05 top

------------------ fw ctl affinity -l -r --------------

fw ctl affinity -l -r

CPU 0: s0p2 s0p1 s0p0 s1p0 s1p1

CPU 1: fw_2

CPU 2: fw_1

CPU 3: fw_0

All: mpdaemon dtlsd fwd in.geod in.asessiond in.aufpd vpnd cprid cpd

[Expert@ukdxsdfshll013]#

------------- fw ctl multik stat -----------------

fw ctl multik stat

ID | Active | CPU | Connections | Peak

-------------------------------------------

0 | Yes | 3 | 54258 | 57484

1 | Yes | 2 | 55262 | 58884

2 | Yes | 1 | 54806 | 57974

----------- cpstat -f multi_cpu os --------------

cpstat -f multi_cpu os

Processors load

---------------------------------------------------------------------------------

|CPU#|User Time(%)|System Time(%)|Idle Time(%)|Usage(%)|Run queue|Interrupts/sec|

---------------------------------------------------------------------------------

| 1| 0| 39| 60| 40| ?| 0|

| 2| 0| 42| 58| 42| ?| 0|

| 3| 0| 56| 44| 56| ?| 0|

| 4| 0| 45| 55| 45| ?| 0|

---------------------------------------------------------------------------------

I believe that the problem is in SecureXL:

---------------------------- fwaccel stat----------------

fwaccel stat

Accelerator Status : on

Accept Templates : enabled

Drop Templates : disabled

Accelerator Features : Accounting, NAT, Cryptography, Routing,

HasClock, Templates, Synchronous, IdleDetection,

Sequencing, TcpStateDetect, AutoExpire,

DelayedNotif, TcpStateDetectV2, CPLS, WireMode,

DropTemplates, Streaming, MultiFW, AntiSpoofing,

DoS Defender, Nac

Cryptography Features : Tunnel, UDPEncapsulation, MD5, SHA1, NULL,

3DES, DES, CAST, CAST-40, AES-128, AES-256,

ESP, LinkSelection, DynamicVPN, NatTraversal,

EncRouting, AES-XCBC, SHA256

------------------ fwaccel stats ------------------

fwaccel stats

Name Value Name Value

-------------------- --------------- -------------------- ---------------

conns created 47848743 conns deleted 46594114

temporary conns 152420 templates 5202

nat conns 104 accel packets 1009641643

accel bytes 718151584061 F2F packets 2376587399

ESP enc pkts 0 ESP enc err 0

ESP dec pkts 0 ESP dec err 0

ESP other err 0 espudp enc pkts 0

espudp enc err 0 espudp dec pkts 0

espudp dec err 0 espudp other err 0

AH enc pkts 0 AH enc err 0

AH dec pkts 0 AH dec err 0

AH other err 0 memory used 0

free memory 0 acct update interval 3600

current total conns 153686 TCP violations 15215

conns from templates 3469212 TCP conns 152013

delayed TCP conns 0 non TCP conns 1673

delayed nonTCP conns 0 F2F conns 81233

F2F bytes 1303498849580 crypt conns 0

enc bytes 0 dec bytes 0

partial conns 0 anticipated conns 0

dropped packets 161 dropped bytes 33038

nat templates 0 port alloc templates 0

conns from nat tmpl 0 port alloc conns 0

port alloc f2f 0 PXL templates 148

PXL conns 216 PXL packets 516566685

PXL bytes 285833923732 PXL async packets 516589579

------------------ fwaccel stats -s ----------------

Accelerated conns/Total conns : 71941/154182 (46%)

Accelerated pkts/Total pkts : 1013954607/3913942827 (25%)

F2Fed pkts/Total pkts : 2382649836/3913942827 (60%)

PXL pkts/Total pkts : 517338384/3913942827 (13%)

It seems that most of the connections are not accelerated and I don't know why.
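Just as a sanity check, the ratios can be recomputed from the raw counters shown in fwaccel stats -s above (bc only does the division; the counter values are the ones reported):

echo "scale=1; 100*2382649836/3913942827" | bc   # F2F packets  -> ~61% of all packets
echo "scale=1; 100*1013954607/3913942827" | bc   # accelerated  -> ~26%
echo "scale=1; 100*517338384/3913942827" | bc    # PXL          -> ~13%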

This is part of the SecureXL debug output:

Nov 12 22:08:00 firewall kernel: [fw_1];cphwd_offload_conn: dir=1, cdir=1, vm_conn=<1X8.58.164.136,53420,1X4.142.120.18,53,17 >

Nov 12 22:08:00 firewall kernel: [fw_1];get_conn_flags: no handler for this conn (no sticky F2F)

Nov 12 22:08:00 firewall kernel: [fw_1];get_conn_flags: sticky_f2f=0 for <1X8.58.164.136,53420,1X4.142.120.18,53,17>

Nov 12 22:08:00 firewall kernel: [fw_1];cphwd_offload_conn: calling cphwd_api_add_connection_, flags 0x0, flags_ex 0x0

Nov 12 22:08:00 firewall kernel: [fw_1];cphwd_add_conn_stat_cb: received add status for <1X8.58.164.136,53420,1X4.142.120.18,53,17>(flags= 0x0, cb_flags=0x0): success

Nov 12 22:08:00 firewall kernel: [fw_1];cphwd_add_conn_stat_cb: received add status for <1X8.58.164.136,0,14.142.120.18,53,17>(flags=0x800 , cb_flags=0x0): success

Nov 12 22:08:00 firewall kernel: [fw_1];cphwd_add_conn_stat_cb: CPHWD_F_TEMPLATE

Nov 12 22:08:00 firewall kernel: [fw_2];cphwd_offload_conn: dir=1, cdir=1, vm_conn=<1X4.142.120.157,38122,1X4.142.121.92,135, 6>

Nov 12 22:08:00 firewall kernel: [fw_2];get_conn_flags: MORE_INSPECT is on -> F2F

Nov 12 22:08:00 firewall kernel: [fw_2];get_conn_flags: sticky_f2f=1 for <1X4.142.120.157,38122,1X4.142.121.92,135,6>

Nov 12 22:08:00 firewall kernel: [fw_2];cphwd_pslglue_provide_conn_opaque: conn is streamed (both sides) -> F2F both dirs

Nov 12 22:08:00 firewall kernel: [fw_2];cphwd_offload_conn: pxl - turning on sticky f2f on conn <1X4.142.120.157:38122 -> 1X4.142.121.92:135 IPP 6>

Nov 12 22:08:00 firewall kernel: [fw_2];cphwd_offload_conn: conn <1X4.142.120.157,38122,1X4.142.121.92,135,6> has sticky f2f (2)

Nov 12 22:08:00 firewall kernel: [fw_2];cphwd_offload_conn: calling cphwd_api_add_connection_, flags 0x20001, flags_ex 0x8

Nov 12 22:08:00 firewall kernel: [fw_2];cphwd_add_conn_stat_cb: received add status for <1X4.142.120.157,38122,1X4.142.121.92,135,6>(flags =0x20001, cb_flags=0x8): success

Nov 12 22:08:00 firewall kernel: [fw_0];cphwd_offload_conn: dir=1, cdir=1, vm_conn=<1X4.142.121.165,1757,1X4.142.120.87,4288, 6>

Nov 12 22:08:00 firewall kernel: [fw_0];get_conn_flags: MORE_INSPECT is on -> F2F

Nov 12 22:08:00 firewall kernel: [fw_0];get_conn_flags: sticky_f2f=1 for <1X4.142.121.165,1757,1X4.142.120.87,4288,6>

Nov 12 22:08:00 firewall kernel: [fw_0];cphwd_pslglue_provide_conn_opaque: conn is streamed (both sides) -> F2F both dirs

Nov 12 22:08:00 firewall kernel: [fw_0];cphwd_offload_conn: pxl - turning on sticky f2f on conn <1X4.142.121.165:1757 -> 1X4.142.120.87:4288 IPP 6>

Nov 12 22:08:00 firewall kernel: [fw_0];cphwd_offload_conn: conn <1X4.142.121.165,1757,1X4.142.120.87,4288,6> has sticky f2f (2)

Nov 12 22:08:00 firewall kernel: [fw_0];cphwd_offload_conn: calling cphwd_api_add_connection_, flags 0x20001, flags_ex 0x8

Nov 12 22:08:00 firewall kernel: [fw_0];cphwd_add_conn_stat_cb: received add status for <1X4.142.121.165,1757,1X4.142.120.87,4288,6>(flags =0x20001, cb_flags=0x8): success

Nov 12 22:08:00 firewall kernel: [fw_0];cphwd_offload_conn: dir=1, cdir=1, vm_conn=<1X4.142.120.149,6632,193.189.13.39,22180, 6>

Nov 12 22:08:00 firewall kernel: [fw_0];get_conn_flags: MORE_INSPECT is on -> F2F

Nov 12 22:08:00 firewall kernel: [fw_0];get_conn_flags: sticky_f2f=1 for <1X4.142.120.149,6632,193.189.13.39,22180,6>

Nov 12 22:08:00 firewall kernel: [fw_0];cphwd_pslglue_provide_conn_opaque: conn is streamed (both sides) -> F2F both dirs

Nov 12 22:08:00 firewall kernel: [fw_0];cphwd_offload_conn: pxl - turning on sticky f2f on conn <1X4.142.120.149:6632 -> 193.189.13.39:22180 IPP 6>

Nov 12 22:08:00 firewall kernel: [fw_0];cphwd_offload_conn: conn <1X4.142.120.149,6632,193.189.13.39,22180,6> has sticky f2f (2)

-------------------------

Any suggestions are welcome!
