2013-03-13 // LPM Performance with 802.3ad Etherchannel on VIOS
With new network gear (Cisco Nexus 5596) available in the datacenters, I recently switched most of the 1Gbps links on our IBM Power systems over to 10Gbps links. This freed up a lot of interfaces, patch panel ports and switch ports. So I thought it would be a nice idea to give some of the now free 1Gbps network links to the VIOS management interfaces, configure an 802.3ad etherchannel and thus aggregate the multiple links. The plan was to have some additional bandwidth available for the resource intensive data transfer during live partition mobility (LPM) operations. Another 10Gbps link for each VIOS would probably have been the better solution, but it's hard to justify the additional investment “just” for the use case of LPM traffic.
While I was aware that a single LPM operation would not benefit from the etherchannel (one sender, one receiver, one used link), I figured that with multiple parallel LPM operations the additional bandwidth would speed up the data transfer almost linearly with the number of links in the etherchannel.
In a test setup I configured four VIOS on two IBM Power systems to each have an etherchannel consisting of two 1Gbps links:
root@vios1:/$ ifconfig -a
en27: flags=1e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>
        inet 192.168.1.1 netmask 0xffffff00 broadcast 192.168.1.255
         tcp_sendspace 262144 tcp_recvspace 262144 tcp_nodelay 1 rfc1323 1

root@vios1:/$ lsattr -EHl ent27
attribute       value          description                                       user_settable
adapter_names   ent0,ent2      EtherChannel Adapters                             True
alt_addr        0x000000000000 Alternate EtherChannel Address                    True
auto_recovery   yes            Enable automatic recovery after failover          True
backup_adapter  NONE           Adapter used when whole channel fails             True
hash_mode       src_port       Determines how outgoing adapter is chosen         True
interval        long           Determines interval value for IEEE 802.3ad mode   True
mode            8023ad         EtherChannel mode of operation                    True
netaddr         0              Address to ping                                   True
noloss_failover yes            Enable lossless failover after ping failure       True
num_retries     3              Times to retry ping before failing                True
retry_time      1              Wait time (in seconds) between pings              True
use_alt_addr    no             Enable Alternate EtherChannel Address             True
use_jumbo_frame no             Enable Gigabit Ethernet Jumbo Frames              True
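For reference, creating such an etherchannel is basically a one-liner from the padmin restricted shell — a minimal sketch, assuming the free adapters are ent0 and ent2 as in the lsattr output above (the resulting device name, ent27/en27 in this case, is assigned by the system, and the exact invocation may differ slightly between VIOS levels):

$ mkvdev -lnagg ent0,ent2 -attr mode=8023ad hash_mode=src_port
$ mktcpip -hostname vios1 -inetaddr 192.168.1.1 -netmask 255.255.255.0 -interface en27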
All three values of the etherchannel's hash_mode attribute (dst_port, src_port and src_dst_port) were tested in a scenario with four parallel LPM operations. The four LPARs used for the LPM tests had 32GB of memory assigned and were only moderately active at the time.
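Switching the hash_mode between test runs can be done roughly along these lines (a sketch; depending on the AIX/VIOS level the attribute can only be changed while the interface is detached):

root@vios1:/$ ifconfig en27 detach
root@vios1:/$ chdev -l ent27 -a hash_mode=src_dst_port
root@vios1:/$ ifconfig en27 192.168.1.1 netmask 255.255.255.0 up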
Even with a rough timing measurement taken by hand, it pretty soon became clear that the transfer rate was still hovering slightly below 1Gbps. A look at the performance counters of the interfaces involved in the etherchannel before the LPM operation:
root@vios1:/$ entstat -d ent27 | grep "Bytes: "
Bytes: 16892          Bytes: 12594             # Etherchannel: ent27
Bytes: 11958          Bytes: 8402              # Physical interface: ent0
Bytes: 5100           Bytes: 4262              # Physical interface: ent2
and after the LPM operation:
root@vios1:/$ entstat -d ent27 | grep "Bytes: "
Bytes: 12018850788    Bytes: 210768159909      # Etherchannel: ent27
Bytes: 137243         Bytes: 210767988779      # Physical interface: ent0
Bytes: 12018713727    Bytes: 171200            # Physical interface: ent2
confirmed the suspicion that only one physical link was actually used for the data transfer, no matter what hash_mode was chosen.
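Watching how the traffic spreads across the physical links while the migrations are actually running only takes a crude sampling loop on the same counters (a sketch; the 10 second interval is an arbitrary choice):

root@vios1:/$ while true; do
> date; entstat -d ent27 | grep "Bytes: "; sleep 10
> done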
A closer look at the processes and network connections involved in the LPM operation at the VIOS level showed that four migmover processes were started on the source and destination VIOS, one process for each LPM operation. Each migmover process opened two network connections from the source VIOS (192.168.1.1) to the destination VIOS (192.168.1.2), which in the netstat output looked something like this:
root@vios1:/$ netstat -na | grep EST
tcp4       0      0  192.168.1.1.32788   192.168.1.2.32791   ESTABLISHED   # Socket pair 1
tcp4       0  13888  192.168.1.1.32789   192.168.1.2.32792   ESTABLISHED   # Socket pair 2
tcp4       0      0  192.168.1.1.32790   192.168.1.2.32793   ESTABLISHED   # Socket pair 3
tcp4       0  12768  192.168.1.1.32791   192.168.1.2.32794   ESTABLISHED   # Socket pair 4
tcp4       0      0  192.168.1.1.32792   192.168.1.2.32795   ESTABLISHED   # Socket pair 5
tcp4   18824   9744  192.168.1.1.32793   192.168.1.2.32796   ESTABLISHED   # Socket pair 6
tcp4       0      0  192.168.1.1.32794   192.168.1.2.32797   ESTABLISHED   # Socket pair 7
tcp4       0   5264  192.168.1.1.32795   192.168.1.2.32798   ESTABLISHED   # Socket pair 8
The first socket pair of each migmover process (numbers 1, 3, 5, 7) did not see much data transfer and thus seemed to be some sort of control channel. The second socket pair (numbers 2, 4, 6, 8) on the other hand saw a lot of data transfer and thus seemed to be responsible for the actual LPM data transfer. The TCP source and destination ports of the very first socket pair seemed to start at a random value somewhere in the ephemeral port range 1). All subsequently created socket pairs seemed to have TCP source and destination ports incremented by one from the ones used in the previously allocated socket pair.
The combination of the three factors:
sequential port allocation from a random starting point in the ephemeral range
alternating port allocation for control and data channels
same port allocation strategy on the source and the destination VIOS
leads to the observed behaviour of only half of the physical links being fully used in an etherchannel configuration with an even number of interfaces. Due to this TCP port allocation strategy, all control channels consistently get placed on one physical link, while all data channels consistently get placed on the next physical link, regardless of the hash_mode being used.
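Looking at the parity of the port numbers from the netstat output above makes this easy to verify by hand. Assuming the hash_mode=src_port selection effectively boils down to the source port modulo the number of links (an assumption on my part, but one that matches the observed counters), a quick loop shows where each socket lands:

root@vios1:/$ for sock in 32788:control 32789:data 32790:control 32791:data \
>             32792:control 32793:data 32794:control 32795:data; do
>   port=${sock%%:*}; role=${sock##*:}
>   echo "$role socket, source port $port -> link $(( port % 2 ))"
> done
control socket, source port 32788 -> link 0
data socket, source port 32789 -> link 1
control socket, source port 32790 -> link 0
data socket, source port 32791 -> link 1
...

All control sockets get even source ports and thus one link, while all data sockets get odd source ports and thus the other link, so the four LPM data streams pile up on a single 1Gbps link. The same kind of argument applies to dst_port and src_dst_port, since the destination ports follow the same sequential allocation on the other VIOS — either way, the parallel data streams never spread across both links.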
After figuring this out, I opened a PMR with IBM support, which was soon transferred to development. After some rather fruitless back and forth, the developer finally agreed to accept a DCR (MR030413328) to implement an alternative TCP port allocation strategy for the LPM data transfer:
User Marketing Field Requirement Number: MR030413328
*Title: LPM performance enhancement with etherchannel 802.3ad
*Description: When the Mover Service Partition managing LPM is using an Etherchannel Adapter built on 2 physical adapters and this Etherchannel is configured in 8023ad mode, the port selection of LPM process makes that all the data connection to the same physical adapter and the control connection to the other adapter. This cause an highly unbalanced usage of those adapters. (The data connection will manage nearly all data transfered, where control connection only handle a little)
While LPM operation, mover is first creating the control socket and then the data socket. Hence this is more than possible that src_data = src_ctrl + 1 and dst_data = dst_ctrl + 1, which will cause in an etherchannel build on 2 adapters, configured in 8023ad mode, that all control socket use the same adapter and all data socket will use the same other adapter. (The parity of “src_data + dst_data” will be the same as “(src_ctrl + 1) + (dst_ctrl + 1)” )
I would suggest to enforce the parity of the port number depending on the fact we are the source or destination side mover service partition:
- data and control socket with same port parity if we are the source mover :
→ i.e. : data port 63122 and control port 63120
or : data port 63113 and control port 63115
- data and control socket with a different port parity if we are the destination mover :
→ i.e. : data port 63122 and control port 63121
or : data port 63113 and control port 63114
This should cause a more balanced load on etherchannel adapter.
*Priority: Medium
*Requested Completion Date: 01.01.2014
(Original Requested Date) 01.01.2014
*Resolving Pipe: AIX, LoP, System p Hardware
*Resolving Release: AIX
*Resolving Entity: Virtualization
*IBM Business Justification (Include SIEBEL number for a priority): This is a problem that has been seen many time, and this will happen for any customer with VIOS configured with etherchannel.
*Requirement Type Performance
Suggested Solution:
Source: Customer
Problem Management Record (PMR):
PMR Number: xxxxx, Branch Code xxx, Country Code xxx
Other Related Information:
Environment: VIOS Level is 2.2.2.1
By far the most interesting part about this DCR is the item “IBM Business Justification”! Remind me again why this wasn't already fixed, if it has been seen as an issue many times? Well, fingers crossed the code change will make it into a VIOS release this time …
1) the ephemeral port range used for LPM is defined by the attributes tcp_port_low and tcp_port_high of the vioslpm0 device
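For completeness, those attributes can be inspected on the VIOS like this (a sketch; the values on a given system will of course differ):

root@vios1:/$ lsattr -El vioslpm0 | grep tcp_port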