====== AIX and VIOS Performance with 10 Gigabit Ethernet (Update) ======

In October last year (10/2013) a colleague and I were given the opportunity to speak at the "IBM AIX User Group South-West" at IBM's German headquarters in Ehningen. My part of the talk was about our experiences with the move from 1GE to 10GE on our IBM Power systems. It was largely based on my previous post [[2013:06:08:aix_vios_10ge_performance]].

During the preparation for the talk, while reviewing and compiling the previously collected material on the subject, I suddenly realized a disastrous mistake in my methodology. Specifically, the ''netperf'' program used for the performance tests has a rather inadequate heuristic for determining the TCP buffer sizes it will use during a test run. For example:

<code>
lpar1:/$ netperf -H 192.168.244.137
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to ...
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 16384  16384  16384    10.00    3713.83
</code>

with the values "''16384''" from the last line being the relevant part here. It turned out that on AIX the ''netperf'' utility will only look at the global, system-wide values of the ''tcp_recvspace'' and ''tcp_sendspace'' tunables set with the ''no'' command. In my example this was:

<code>
lpar1:/$ no -o tcp_recvspace -o tcp_sendspace
tcp_recvspace = 16384
tcp_sendspace = 16384
</code>

The interface-specific values "''262144''" or "''524288''", e.g.:

<code>
lpar1:/$ ifconfig en0
en0: flags=1e080863,4c0
        inet 192.168.244.239 netmask 0xffffff00 broadcast 192.168.244.255
         tcp_sendspace 262144 tcp_recvspace 262144 tcp_nodelay 1 rfc1323 1
</code>

which override the system-wide default values for a specific network interface, were never properly picked up by ''netperf''. The configuration of low, global default values and higher, interface-specific values for the tunables was deliberately chosen in order to allow the coexistence of low and high bandwidth adapters in the same system without the risk of interference. Anyway, I guess this is a very good example of the famous saying:
> A fool with a tool is still a fool. -- Grady Booch
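To avoid being fooled by this heuristic again, there are basically two options: temporarily raise the global defaults for the duration of the test, or pass the desired socket buffer and message sizes explicitly to ''netperf''. A minimal sketch of both, using the ''262144'' value and the target IP from the example above (the test-specific options ''-s'', ''-S'' and ''-m'' are standard ''netperf'' TCP_STREAM options; the requested buffer sizes will of course only take effect if ''sb_max'' is large enough to accommodate them):

<code>
# Option 1: temporarily raise the global defaults with the no command (dynamic
# tunables, not persistent across a reboot unless -p is added) and rerun the
# unmodified test
no -o tcp_sendspace=262144 -o tcp_recvspace=262144
netperf -H 192.168.244.137

# Option 2: leave the global defaults alone and request the socket buffer sizes
# (-s local, -S remote) and the send message size (-m) explicitly for this run
netperf -H 192.168.244.137 -t TCP_STREAM -l 10 -- -s 262144 -S 262144 -m 262144
</code>

Either way, the "Socket Size" columns of the ''netperf'' output should then show the intended buffer sizes instead of the ''16384'' byte global defaults.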
For the purpose of another round of performance tests I temporarily set the global default values for the tunables ''tcp_recvspace'' and ''tcp_sendspace'' first to "''262144''" and then to "''524288''". The set of tests was reduced to "largesend" and "largesend with jumbo frames (MTU 9000) enabled" since those -- as elaborated before -- seemed to be the most feasible options. The following table shows the results of the test runs in MBit/sec. The upper value in each cell is from the test with the tunables ''tcp_recvspace'' and ''tcp_sendspace'' set to "''262144''", the lower value in each cell is from the test with the tunables set to "''524288''".

^ Options Legend ^^
| upper value | tcp_sendspace 262144, tcp_recvspace 262144, tcp_nodelay 1, rfc1323 1 |
| lower value | tcp_sendspace 524288, tcp_recvspace 524288, tcp_nodelay 1, rfc1323 1 |
| LS | largesend |
| LS,9000 | largesend, MTU 9000 |

^ Color Legend ^^
| @#FF0000: | <= 1 Gbps |
| @#FF8000: | > 1 Gbps and <= 3 Gbps |
| @#FFFF00: | > 3 Gbps and <= 5 Gbps |
| @#04B404: | > 5 Gbps |

^ Managed System ^^^ P770 DC 1-2 ^^^^ P770 DC 2-2 ^^
^ ^ IP ^ Destination ^ .243 ^^ .239 ^^ .137 ^^
^ ^ Source ^ Option ^ LS ^ LS,9000 ^ LS ^ LS,9000 ^ LS ^ LS,9000 ^
^ P770 DC 1-2 ^ .243 ^ LS | @#D8D8D8: || @#04B404:9624.47 \\ 10494.85 | @#04B404:9475.76 \\ 10896.52 | @#FFFF00:4114.34 \\ 4810.47 | @#FFFF00:3966.08 \\ 4537.50 |
^ ::: ^ ::: ^ LS,9000 | ::: || @#04B404:9313.00 \\ 10133.70 | @#04B404:7549.75 \\ 7221.16 | @#FFFF00:4069.94 \\ 4538.29 | @#FFFF00:3699.03 \\ 4302.74 |
^ ::: ^ .239 ^ LS | @#04B404:8484.16 \\ 10324.04 | @#04B404:8534.12 \\ 10237.57 | @#D8D8D8: || @#FFFF00:4000.52 \\ 4640.57 | @#FFFF00:3834.55 \\ 4356.51 |
^ ::: ^ ::: ^ LS,9000 | @#04B404:8181.51 \\ 11539.93 | @#04B404:7267.75 \\ 6505.37 | ::: || @#FFFF00:4010.89 \\ 4512.89 | @#FFFF00:4056.08 \\ 4410.07 |
^ P770 DC 2-2 ^ .137 ^ LS | @#FFFF00:3937.75 \\ 4073.65 | @#FFFF00:3945.29 \\ 4386.60 | @#FFFF00:3892.89 \\ 4194.19 | @#FFFF00:3906.86 \\ 4303.14 | @#D8D8D8: ||
^ ::: ^ ::: ^ LS,9000 | @#FFFF00:3849.58 \\ 4742.04 | @#FFFF00:3693.16 \\ 4360.35 | @#FFFF00:3423.19 \\ 4561.73 | @#FFFF00:3502.64 \\ 4317.46 | ::: ||

The results show a significant increase in throughput for intra-managed system network traffic, which is now up to almost wire speed. For inter-managed system traffic there is also a noticeable, but far less significant increase in throughput.

To check whether the two selected TCP buffer sizes were still too low, I did a quick check of the network latency in our environment and compared it to the reference values from IBM (see the redbook link below):

^ Network ^ IBM RTT Reference ^ RTT Measured ^
| 1GE Phys | 0.144 ms | 0.188 ms |
| 10GE Phys | 0.062 ms | 0.074 ms |
| 10GE Hyp | 0.038 ms | 0.058 ms |
| 10GE SEA | 0.274 ms | 0.243 ms |
| 10GE VMXNET3 | -- | 0.149 ms |

Every value besides the measurement within a managed system ("10GE Hyp") represents the worst case RTT between two separate managed systems, one in each of the two datacenters. For reference purposes I added a test ("10GE VMXNET3") between two Linux VMs running on two different VMware ESX hosts, one in each of the two datacenters, using the same network hardware as the IBM Power systems. The measured values themselves are well within the range of the reference values given by IBM, so the network setup in general should be fine.
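The RTT values in the table above are based on simple ICMP echo requests between the systems involved. A minimal sketch of such a measurement (the hostname is a placeholder; the exact ''ping'' flags differ slightly between AIX and Linux, but ''-c'' for the packet count and ''-q'' for summary-only output are available on both):

<code>
# 1000 ICMP echo requests, summary output only; the "round-trip min/avg/max"
# line at the end contains the average RTT used for the comparison above
ping -q -c 1000 lpar2.example.com
</code>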
A quick calculation of the [[wp>Bandwidth-delay_product|Bandwidth-delay product]] for the value of the test case "10GE SEA": B x D = 10^10 bit/s x 0.243 ms = 2.43 Mbit ~ 304 kB, confirmed that the value of "''524288''" for the tunables ''tcp_recvspace'' and ''tcp_sendspace'' should be sufficient.

Very alarming on the other hand is the fact that the processing of simple ICMP packets takes almost twice as long going through the SEA on the IBM Power systems as it does going through the network virtualization layer of VMware ESX. Part of the rather sub-par throughput measured on the IBM Power systems is therefore most likely caused by inefficient processing of network traffic within the SEA and/or the VIOS. Rumor has it that with the advent of 40GE -- and things getting even worse -- the IBM AIX lab finally acknowledged this as an issue and is working on a streamlined, more efficient network stack. Other sources say [[wp>Single_Root_I/O_Virtualization#SR-IOV|SR-IOV]], and with it passing direct access to shared hardware resources to the LPAR systems on a larger scale, is considered to be the way forward. Unfortunately this currently conflicts with LPM, so it'll probably be mutually exclusive for most environments. In any case IBM still seems to have some homework to do on the network front. It'll be interesting to see what the developers come up with in the future.

===== Links & Resources =====

  * AIX and VIOS 10GE configuration:\\ [[http://www.redbooks.ibm.com/abstracts/sg248080.html|IBM Power Systems Performance Guide - Implementing and Optimizing]]\\ [[http://www-03.ibm.com/systems/power/software/aix/whitepapers/perf_faq.html|AIX on Power - Performance FAQ]]