During the redesign of our datacenters and the associated infrastructure mentioned earlier (see LPM Performance with 802.3ad Etherchannel on VIOS) we – among other things – also switched the 1 Gbps links on our IBM Power systems over to 10 Gbps links. I did some preliminary research on what guidelines and best practices should be followed with regard to performance and availability when using 10 gigabit ethernet (10GE). Building on those recommendations I also did systematic performance tests. The results of both will be presented and discussed in this post.
Previously our IBM Power environment was – with respect to network – set up with:
Dual VIOS on each IBM Power system.
Seven physical 1 Gbps links on each VIOS attached to different Cisco switching gear (3750 and 6509).
Heavily fragmented/firewalled environment with about 30 VLANs distributed unevenly over the 7+7 physical links.
Seven “failover SEA” (shared ethernet adapter) instances with primary and backup alternating between the two VIOS to ensure at least some load distribution.
All in all not very pretty. Not by intention though – like most environments it had simply grown organically.
New setup with 10GE
After the datacenter redesign, the new IBM Power environment would – with respect to network – be set up with:
Dual VIOS on each IBM Power system.
Two or four physical 10 Gbps links on each VIOS attached to four fully meshed Cisco Nexus 5596 switches. In case of four physical links from the IBM Power systems, they would be aggregated into pairs of two links in one 802.3ad etherchannel.
On the P770 (9117-MMB) systems the already available 10GE IVE or HEA ports would be used. On the P550 (8204-E8A) systems newly purchased Chelsio 10GE adapters (FC 5769) would be used.
Still heavily fragmented/firewalled environment with about 30 VLANs distributed evenly over the 1+1 physical links or 802.3ad etherchannels.
One “failover SEA with load sharing” instance to concurrently utilize both of the physical links attached to the SEA.
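For illustration, creating the 802.3ad etherchannel and the “failover SEA with load sharing” on a VIOS boils down to something like the following minimal sketch. The adapter names (ent0/ent1 for the physical 10GE ports, ent4/ent5 for the trunked virtual adapters, ent6 for the resulting etherchannel and ent7 for the control channel) are placeholders, not our actual device numbering:
padmin@vios$ mkvdev -lnagg ent0,ent1 -attr mode=8023ad    # aggregate the two physical links into one 802.3ad etherchannel (creates a new device, e.g. ent6)
padmin@vios$ mkvdev -sea ent6 -vadapter ent4,ent5 -default ent4 -defaultid 1 -attr ha_mode=sharing ctl_chan=ent7
The same is done on the second VIOS with mirrored trunk priorities; with ha_mode=sharing the two SEAs split the VLANs of the trunk adapters between themselves instead of leaving one side completely idle.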
The design goals were:
Better resource utilization of the individual physical links. Connections needing less than 1 Gbps being able to “donate” the unused bandwidth to connections needing more than 1 Gbps.
Utilization of all physical links with the “failover SEA with load sharing”. No more inactive backup links.
A reduced amount of copper cabling taking up network interfaces, patch panel ports and switch ports. Less obstruction of the cabinet airflow due to less and smaller fiber links. Reduced fire load due to less cabling.
Configuration best practices
Network options that should already be in place from the 1 Gbps network configuration:
TCP window scaling (aka RFC1323)
$ no -o rfc1323
rfc1323 = 1
$ no -o rfc1323=1 -p
Or on individual interfaces:
$ lsattr -EH -a rfc1323 -l enX
attribute value description                                user_settable
rfc1323   1     Enable/Disable TCP RFC 1323 Window Scaling True
$ chdev -a rfc1323=1 -l enX
TCP receive and transmit socket buffer size (at least 262144 bytes)
$ no -o tcp_recvspace -o tcp_sendspace
tcp_recvspace = 262144
tcp_sendspace = 262144
$ no -o tcp_recvspace=262144 -o tcp_sendspace=262144 -p
Or on individual interfaces:
$ lsattr -EH -a tcp_sendspace -a tcp_recvspace -l enX
attribute     value  description                           user_settable
tcp_sendspace 262144 Set Socket Buffer Space for Sending   True
tcp_recvspace 262144 Set Socket Buffer Space for Receiving True
$ chdev -a tcp_sendspace=262144 -a tcp_recvspace=262144 -l enX
Disable delayed TCP acknowledgments and the Nagle algorithm
$ no -o tcp_nodelayack
tcp_nodelayack = 1
$ no -o tcp_nodelayack=1 -p
Or on individual interfaces:
$ lsattr -EH -a tcp_nodelay -l enX
attribute   value description                       user_settable
tcp_nodelay 1     Enable/Disable TCP_NODELAY Option True
$ chdev -a tcp_nodelay=1 -l enX
TCP selective acknowledgment (aka “SACK” or RFC2018):
$ no -a | grep -i sack
sack = 1
$ no -o sack=1 -p
Turn on flow-control everywhere – on the interfaces and on the switches! The performance section below shows an example of what performance looks like if flow-control isn't implemented end-to-end. For your network equipment, see your vendor's documentation. For IVE/HEA interfaces, enable flow-control in the HMC like this:
Navigate to Servers and select your managed system.
From the drop-down menu select the Host Ethernet Adapter task.
In the new Host Ethernet Adapters window select your 10GE adapter and click the button to configure it.
In the new HEA Physical Port Configuration window check the Flow control enabled option and click the button to apply the change.
For other 10GE interfaces like the Chelsio FC 5769 adapters, turn on flow-control on the AIX or VIOS device:
$ lsattr -EH -a flow_ctrl -l entX
attribute value description                              user_settable
flow_ctrl yes   Enable transmit and receive flow control True
$ chdev -a flow_ctrl=yes -l entX
Changing this attribute on IVE/HEA devices has no effect though.
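Whether flow-control was actually negotiated on a given port can usually be seen in the device-specific part of the entstat output; the exact wording is driver-specific, so treat this only as a starting point:
$ entstat -d entX | grep -i flow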
TCP checksum offload (enabled by default):
$ lsattr -EH -a chksum_offload -l entX
attribute      value description                          user_settable
chksum_offload yes   Enable transmit and receive checksum True
$ chdev -a chksum_offload=yes -l entX
TCP segmentation offload on AIX and VIOS hardware devices (enabled by default):
$ lsattr -EH -a large_receive -a large_send -l entX
attribute     value description                              user_settable
large_receive yes   Enable receive TCP segment aggregation   True
large_send    yes   Enable transmit TCP segmentation offload True
$ chdev -a large_receive=yes -a large_send=yes -l entX
TCP segmentation offload on VIOS SEA devices (could cause compatibility issues with IBM i LPARs)
$ lsattr -EH -a large_receive -a largesend -l entX
attribute     value description                                 user_settable
large_receive yes   Enable receive TCP segment aggregation      True
largesend     1     Enable Hardware Transmit TCP Resegmentation True
$ chdev -a large_receive=yes -a largesend=1 -l entX
padmin@vios$ chdev -dev entX -attr largesend=1
padmin@vios$ chdev -dev entX -attr large_receive=yes
TCP segmentation offload on AIX virtual ethernet devices:
$ lsattr -EH -a mtu_bypass -l enX
attribute  value description                                   user_settable
mtu_bypass off   Enable/Disable largesend for virtual Ethernet True
$ chdev -a mtu_bypass=on -l enX
To test and evaluate the network throughput performance, Michael Perzl's RPM package of netperf was used. The netperf client and server programs were started with the default options, which means a simplex, single thread throughput measurement with a duration of 10 seconds for each run. The AIX and VIOS LPARs involved in the tests had – unless specified otherwise – the following CPU and memory resource allocation configuration:
The AIX LPARs were at oslevel 6100-08-01-1245, the VIO servers at the corresponding VIOS 2.2 ioslevel.
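The individual measurements were essentially of the following form (a sketch; the IP address is just an example from the test networks and the netperf/netserver binaries came from the RPM package mentioned above):
server$ netserver                      # start the netperf server daemon on the receiving LPAR
client$ netperf -H 192.168.243.12      # default test: single TCP_STREAM run of 10 seconds, throughput reported in 10^6 bits/sec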
Hardware Tests on VIO Servers
For later reference – or “calibration” if you will – the initial performance tests were done directly on the hardware devices assigned to the VIO servers. During the DLPAR assignment of the IVE/HEA ports I ran into some issues described earlier in AIX and VIOS DLPAR Operation fails on LHEA Adapters. After those issues were tackled, the test environment looked like this:
Two datacenters “DC 1” and “DC 2” with less than 500m fiber distance between them.
An overall number of four P770 (9117-MMB): two P770 with one CEC and thus one IVE/HEA port per VIOS, and two P770 with two CECs and thus two IVE/HEA ports per VIOS. Of each kind of P770 configuration, one is located in DC 1 and the other in DC 2.
Four Cisco Nexus 5596 switches, two in each datacenter.
Test IP addresses in two different VLANs, but no tests crossing VLANs over the firewall infrastructure.
The following table shows the results of the test runs in MBit/sec. Excluded – and denoted as grey cells with an “X” – were cases where source and destination IP would be the same and cases where VLANs would be crossed and thus the firewall infrastructure would be involved.
In general the numbers weren't that bad for a first try. Still, there seemed to be a systematic performance issue in those test cases where the numbers in the table are marked red. Those are the cases where the IVE/HEA ports on the second CEC of a two-CEC system were receiving traffic. See the above image of the test environment, where the affected IVE/HEA ports are also marked red. After double checking cabling, system resources and my own configuration, and also checking back with our network department with regard to the flow-control configuration, I did a tcpdump trace of the netperf test case involving 192.168.243.12 (upper left corner of the above table; value: 596 MBit/sec). This is what the Wireshark bandwidth analysis graph of the tcpdump trace data looked like:
Noticeable are the large gaps of about one second between very short bursts of traffic. This was due to the connection falling back to flow-control on the TCP protocol level, which has relatively high backoff timing values. So the low transfer rate in the test cases marked red was due to improper flow-control configuration, which resulted in long periods of network inactivity, dragging down the overall average bandwidth value. After getting our network department to set up flow-control properly and running the same netperf test case again, things improved dramatically:
Now there is a continuous flow of packets which is regulated in a more fine-grained manner via flow-control in case one of the parties involved gets flooded with packets. Why this behaviour is exhibited only in conjunction with the IVE/HEA ports on the second CEC remains an open question. A PMR raised with IBM led to no viable or conclusive answer.
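For completeness: such a trace can be captured on AIX with tcpdump while the netperf test is running and then loaded into Wireshark for the bandwidth analysis (a sketch; interface name, file name and IP address are placeholders):
$ tcpdump -i en0 -s 0 -w /tmp/netperf.pcap host 192.168.243.12    # capture full packets for the test connection into a pcap file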
Doing the systematic tests above again confirmed the improved results in the previously problematic cases:
Although those results were much better than the ones in the first test run, there were still two issues:
individual throughput values still had a rather large span of over 1 Gbps (3858 - 5037 MBit/sec).
on average the throughput was well below the theoretical maximum of 10 Gbps. In the worst case only 38.5%, in the best case 50.3% of the theoretical maximum.
In order to further improve performance, jumbo frames were enabled on the switches, in the HMC on the IVE/HEA interfaces and on the interfaces within the VIO servers. For the HMC, the jumbo frames option on the IVE/HEA can be enabled in the same panel as the flow-control option mentioned above. After enabling jumbo frames, another test set like the above was done:
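As an aside, on the AIX and VIOS interfaces themselves the change to jumbo frames boils down to something like the following (a sketch; entX/enX are placeholders and the jumbo_frames attribute is not available on every adapter type):
$ chdev -a jumbo_frames=yes -l entX    # physical adapter, where the driver supports it
$ chdev -a mtu=9000 -l enX             # raise the MTU on the network interface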
As those results show, the throughput can be significantly improved by using jumbo frames instead of the standard frame size. On average the throughput is now at about 71% (7116 MBit/sec) of the theoretical maximum of 10 Gbps. In the worst case that is 64.2%, in the best case 81.8% of the theoretical maximum. This also means that the span of the throughput values, at 1.7 Gbps (6423 - 8183 MBit/sec), is even larger than before. It has to be kept in mind though that this is still single thread performance, so the larger variance is probably due to CPU restrictions of a single thread. A quick one-off test between one source and one destination using two parallel netperf threads showed a combined throughput of about 9400 MBit/sec and confirmed this suspicion.
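Such a parallel measurement can be done with something as simple as two netperf instances started at the same time against the same destination, with the two reported throughputs added up (a sketch; IP address and duration are examples):
client$ netperf -H 192.168.243.12 -l 30 &    # first stream, 30 second run
client$ netperf -H 192.168.243.12 -l 30 &    # second stream, started in parallel
client$ wait                                 # wait for both results, then add up the reported throughputs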
LPAR and SEA Tests
After the initial hardware performance tests above, some more tests closer to the final real-world setup were done. The same setup would later be used for the production environment. The test environment looked like this:
On a physical level of network and systems, the test setup was the same as the one above, except that only the two P770 with one CEC and thus one IVE/HEA port per VIOS were used in the further tests.
The directly assigned network interfaces and IPs on the VIO servers were replaced by a “failover SEA with load sharing” configuration.
A total of three AIX LPARs were used for the tests. LPARs #2 and #3 resided on the same P770 system; LPAR #1 resided on the other P770 system in the other datacenter.
All AIX LPARs had test IP addresses in the same VLAN (PVID 251, 192.168.244.0/24). Thus, no traffic over the firewall infrastructure occurred and only one physical link (blue text in the above image) of the failover SEA was used.
The following table shows the results of the test runs in MBit/sec. The result table got a bit more complex this time around, since there were three LPARs, each of which could be source or destination. In addition, each LPAR also had six different combinations of virtual interface options (see the options legend). Excluded – and denoted as grey cells – were cases where source and destination IP and thus LPAR would be the same. To add a visual aid for interpretation of the results, the table cells are color coded depending on the measurement value. See the color legend for the mapping of color to throughput values.
Options legend:
none: tcp_sendspace 262144, tcp_recvspace 262144, tcp_nodelay 1, rfc1323 1
LS,9000: largesend, MTU 9000
LS,65390: largesend, MTU 65390
Color legend:
≤ 1 Gbps
> 1 Gbps and ≤ 3 Gbps
> 3 Gbps and ≤ 5 Gbps
> 5 Gbps
|Source Managed System|Source IP|Options|P770 DC 1-2|P770 DC 2-2|
|P770 DC 1-2|.243|none|791.27, 727.34, 759.35, 743.37, 766.81, 708.33|888.13, 885.68, 877.76, 788.91, 778.50, 875.20|
|P770 DC 2-2|.137|none|925.17, 753.81, 865.15, 839.55, 912.80, 930.17|819.35, 938.09, 864.33, 847.11, 898.57, 827.62|
At first glance those results show that the throughput can vary significantly, depending on the combination of interface configuration options as well as intra- vs. inter-managed system traffic. In detail, the above results indicate that:
single thread performance is going to be nowhere near the theoretical maximum of 10 Gbps. In most cases one has to consider oneself lucky to get 50% of the theoretical maximum.
the default configuration options (test cases: none) are not suited at all for a 10 Gbps environment.
intra-managed system network traffic, which does not leave the hardware and should be handled only within the PowerVM hypervisor, performs better than inter-managed system traffic in most of the cases. To a certain degree this was expected, since in the latter case the traffic additionally has to go over the SEA, the physical interfaces, the network devices, and again over the other system's physical interfaces and SEA.
there is a quite noticeable delta between the intra-managed system results and the theoretical maximum of 10 Gbps. Since this traffic does not leave the hardware and should be handled only within the PowerVM hypervisor, better results somewhere between 9 – 9.5 Gbps were expected.
there is a quite noticeable performance drop-off when comparing the inter-managed system results with the previous “non-jumbo frames, but flow-control” hardware tests. The delta gets even more noticeable when comparing the results with the previous jumbo frames hardware tests. It seems the now involved SEA doesn't handle larger frame sizes that well.
strangely, the inter-managed system tests show better results than the intra-managed system tests in some cases.
The somewhat dissatisfying results were discussed with IBM support within the scope of the PMR mentioned earlier. The outcome of this discussion was – even with a large amount of good will – also dissatisfying. We got the usual feedback of “the results are actually not that bad”, “works as designed” and the infamous “not enough CPU resources on the VIO servers, switch to dedicated CPU assignment”. Just to get the latter point out of the way, I did another, reduced test set similar to the above. This time only the test cases for inter-managed system traffic were performed. The resource allocation on the two involved VIO servers was changed to two dedicated CPU cores. The following table shows the results of the test runs in MBit/sec.
|Source Managed System|Source IP|Options|Throughput to the other managed system (MBit/sec)|
|P770 DC 1-2|.243|none|853.06, 862.40, 800.55, 869.85, 873.51, 846.27|
|P770 DC 2-2|.137|none|826.02, 710.41, 800.56, 425.30, 814.65, 824.55|
The results do improve a bit over the previous test set in some of the cases, but not by the expected margin of ~50%, which would be needed to get anywhere near the results of the previous hardware tests. A request to open a DCR in order to further investigate possible optimizations with regard to network performance in general and the SEA in particular is currently still in the works at IBM.
HMC and Jumbo Frames
In order to still get a working RMC connection from the HMC to the LPARs – and thus be able to do DLPAR and LPM operations – after switching to jumbo frames, the HMC has to be configured accordingly. See Configure Jumbo Frames on the Hardware Management Console (HMC) for the details. If – as in our case – you have to deal with a heavily fragmented/firewalled environment and the RMC traffic from the HMC to the LPARs has to pass through several network devices, all the network equipment along the way has to be able to properly deal with jumbo frames. In our case the network department was reluctant and eventually did not reconfigure their devices, since “jumbo frames are a non-standard extension to the network protocol”. As usual this is a tradeoff between optimal performance and the possibility of strange behaviour or even error situations that are hard to debug.
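Whether the RMC connection between HMC and LPAR is still working after such MTU changes can be checked from both ends, for example like this (a sketch; the exact output format differs between HMC and RSCT releases):
hscroot@hmc$ lspartition -dlpar    # on the HMC: lists the LPARs and their DLPAR/RMC state
$ lsrsrc IBM.MCP                   # on the AIX LPAR: lists the management consoles with an active RMC session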
Thoughts and Conclusions
It appears as if, with the advent of 10GE or even higher bandwidths, the network itself is no longer the bottleneck. Rather, the components doing the necessary legwork of pre- and postprocessing (i.e. CPU and memory), as well as some parts of the TCP/IP stack, are becoming the new bottleneck. When it comes to network performance and optimal utilization of existing resources, one can no longer rely on the plug'n'play attitude that worked with gigabit ethernet.
Although I'm certainly no network protocol designer, AIX kernel programmer or PowerVM expert, it still seems to me as if IBM has left considerable room for improvement in all the components involved. As 10 gigabit ethernet is deployed more and more in datacenters, becoming the new standard by replacing gigabit ethernet, there will be an increased demand for high throughput and low latency network communication. It'll be hard to explain to upper management why they're getting only 35% - 75% of the network performance for 100% of the costs. I hope IBM will be able to show that the IBM Power and AIX platform can at least keep up with current technology and trends as well as with the competition from other platforms.
Links & Resources
Failover SEA with load sharing:
AIXpert - Shared Ethernet Adapter (SEA) Failover with Load Balancing
How to setup SEA failover with Load Sharing configuration
AIX and VIOS 10GE configuration:
AIXpert - 10Gbit Ethernet, bad assumption and Best Practice - Part 1
Power IT Pro - Tuning for 10G
Controlling the Flow
Redpaper - Chelsio 10 GbE Adapters for IBM Power Systems
Configure Jumbo Frames on the Hardware Management Console (HMC)