bityard Blog

// AIX and VIOS Performance with 10 Gigabit Ethernet (Update)

In October last year (10/2013) a colleague and I were given the opportunity to speak at the “IBM AIX User Group South-West” at IBM's German headquarters in Ehningen. My part of the talk was about our experiences with the move from 1GE to 10GE on our IBM Power systems. It was largely based on my previous post AIX and VIOS Performance with 10 Gigabit Ethernet. During the preparation for the talk, while I reviewed and compiled the previously collected material on the subject, I suddenly realized a disastrous mistake in my methodology. Specifically, the netperf program used for the performance tests has a rather inadequate heuristic for determining the TCP buffer sizes it will use during a test run. For example:

lpar1:/$ netperf -H
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec
16384  16384   16384    10.00    3713.83

with the values “16384” in the last line being the relevant part here.

It turned out that on AIX the netperf utility will only look at the global, system-wide values of the tcp_recvspace and tcp_sendspace tunables set with the no command. In my example this was:

lpar1:/$ no -o tcp_recvspace -o tcp_sendspace
tcp_recvspace = 16384
tcp_sendspace = 16384

The interface-specific values of “262144” or “524288”, e.g.:

lpar1:/$ ifconfig en0
        inet netmask 0xffffff00 broadcast
         tcp_sendspace 262144 tcp_recvspace 262144 tcp_nodelay 1 rfc1323 1

which would override the system-wide default values for a specific network interface were never properly picked up by netperf. The configuration of low, global default values and higher, interface-specific values for the tunables was deliberately chosen in order to allow the coexistence of low and high bandwidth adapters in the same system without the risk of interference. Anyway, I guess this is a very good example of the famous saying:

A fool with a tool is still a fool. Grady Booch
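With hindsight, one way to avoid this pitfall is to pin the socket buffer sizes explicitly through netperf's test-specific -s (local) and -S (remote) options, so the measurement no longer depends on what netperf derives from the global tcp_sendspace/tcp_recvspace defaults. A sketch that only prints the invocation; the hostname lpar2 is a placeholder for a host running netserver:

```shell
# Print a netperf invocation that pins both socket buffer sizes
# explicitly; "lpar2" is a placeholder netserver host.
buf=262144
printf 'netperf -H %s -t TCP_STREAM -l 10 -- -s %d -S %d\n' lpar2 "$buf" "$buf"
```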

For the purpose of another round of performance tests I temporarily set the global default values for the tunables tcp_recvspace and tcp_sendspace first to “262144” and then to “524288”. The number of tests was reduced to “largesend” and “largesend with jumbo frames enabled” since those – as elaborated before – seemed to be the most feasible options. The following table shows the results of the test runs in MBit/sec. The upper value in each cell is from the test run with the tunables tcp_recvspace and tcp_sendspace set to “262144”, the lower value from the run with the tunables set to “524288”.

Options Legend
upper value tcp_sendspace 262144, tcp_recvspace 262144, tcp_nodelay 1, rfc1323 1
lower value tcp_sendspace 524288, tcp_recvspace 524288, tcp_nodelay 1, rfc1323 1
LS largesend
LS,9000 largesend, MTU 9000
Color Legend
⇐ 1 Gbps
> 1 Gbps and ⇐ 3 Gbps
> 3 Gbps and ⇐ 5 Gbps
> 5 Gbps
Managed System P770 DC 1-2 P770 DC 2-2
IP Destination .243 .239 .137
Source Option LS LS,9000 LS LS,9000 LS LS,9000
P770 DC 1-2 .243 LS 9624.47
LS,9000 9313.00
.239 LS 8484.16
LS,9000 8181.51
P770 DC 2-2 .137 LS 3937.75
LS,9000 3849.58

The results show a significant increase in throughput in the cases of intra-managed system network traffic, which is now up to almost wire speed. For the cases of inter-managed system traffic there is also a noticeable, but far less significant increase in throughput. To check whether the two TCP buffer sizes selected were still too low, I did a quick check of the network latency in our environment and compared it to the reference values from IBM (see the redbook link below):

Network IBM RTT Reference RTT Measured
1GE Phys 0.144 ms 0.188 ms
10GE Phys 0.062 ms 0.074 ms
10GE Hyp 0.038 ms 0.058 ms
10GE SEA 0.274 ms 0.243 ms
10GE VMXNET3 0.149 ms

Every value besides the measurement within a managed system (“10GE Hyp”) represents the worst case RTT between two separate managed systems, one in each of the two datacenters. For reference purposes I added a test (“10GE VMXNET3”) between two Linux VMs running on two different VMware ESX hosts, one in each of the two datacenters, using the same network hardware as the IBM Power systems. The measured values themselves are well within the range of the reference values given by IBM, so the network setup in general should be fine. A quick calculation of the bandwidth-delay product for the value of the test case “10GE SEA”:

B x D = 10^10 bit/s x 0.243 ms = 2.43 x 10^6 bit ≈ 304 kB

confirmed that the value of “524288” for the tunables tcp_recvspace and tcp_sendspace should be sufficient. Very alarming, on the other hand, is the fact that processing of simple ICMP packets takes almost twice as long going through the SEA on the IBM Power systems as through the network virtualization layer of VMware ESX. Part of the rather sub-par throughput performance measured on the IBM Power systems is most likely caused by inefficient processing of network traffic within the SEA and/or VIOS. Rumor has it that, with the advent of 40GE making things even worse, the IBM AIX lab has finally acknowledged this as an issue and is working on a streamlined, more efficient network stack. Other sources say SR-IOV – and with it, passing direct access to shared hardware resources through to the LPARs on a larger scale – is considered to be the way forward. Unfortunately this currently conflicts with LPM, so the two will probably be mutually exclusive for most environments.
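The bandwidth-delay product arithmetic above is easy to double-check; a one-liner, with the link rate and the measured “10GE SEA” RTT plugged in:

```shell
# Bandwidth-delay product: link rate (bit/s) * RTT (s) / 8 = bytes.
# 10 Gbit/s at the measured 0.243 ms RTT of the "10GE SEA" case:
awk 'BEGIN { printf "%.0f\n", 10e9 * 0.243e-3 / 8 }'
# prints 303750 -- comfortably below the 524288-byte socket buffers
```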

In any case IBM still seems to have some homework to do on the network front. It'll be interesting to see what the developers come up with in the future.

// AIX and VIOS Performance with 10 Gigabit Ethernet

Please be sure to also read the update AIX and VIOS Performance with 10 Gigabit Ethernet (Update) to this blog post.

During the earlier mentioned (LPM Performance with 802.3ad Etherchannel on VIOS) redesign of our datacenters and the associated infrastructure we – among other things – also switched the 1 Gbps links on our IBM Power systems over to 10 Gbps links. I did some preliminary research on what guidelines and best practices should be followed with regard to performance and availability when using 10 gigabit ethernet (10GE). Building on those recommendations I did systematic performance tests as well. The results of both will be presented and discussed in this post.

Initial setup

Previously our IBM Power environment was – with respect to network – set up with:

  • Dual VIOS on each IBM Power system.

  • Seven physical 1 Gbps links on each VIOS attached to different Cisco switching gear (3750 and 6509).

  • Heavily fragmented/firewalled environment with about 30 VLANs distributed unevenly over the 7+7 physical links.

  • Seven “failover SEA” (shared ethernet adapter) instances with primary and backup alternating between the two VIOS to ensure at least some load distribution.

All in all not very pretty. Not by intention though; like most environments, it had simply grown organically.

New setup with 10GE

After the datacenter redesign, the new IBM Power environment would – with respect to network – be set up with:

  • Dual VIOS on each IBM Power system.

  • Two or four physical 10 Gbps links on each VIOS attached to four fully meshed Cisco Nexus 5596 switches. In case of four physical links from the IBM Power systems, they would be aggregated into pairs of two links in one 802.3ad etherchannel.

  • On the P770 (9117-MMB) systems the already available 10GE IVE or HEA ports would be used. On the P550 (8204-E8A) systems newly purchased Chelsio 10GE adapters (FC 5769) would be used.

  • Still heavily fragmented/firewalled environment with about 30 VLANs distributed evenly over the 1+1 physical links or 802.3ad etherchannels.

  • One “failover SEA with load sharing” instance to concurrently utilize both of the physical links attached to the SEA.

The design goals were:

  • Better resource utilization of the individual physical links. Connections needing less than 1 Gbps being able to “donate” the unused bandwidth to connections needing more than 1 Gbps.

  • Utilization of all physical links with the “failover SEA with load sharing”. No more inactive backup links.

  • A reduced amount of copper cabling taking up network interfaces, patch panel ports and switch ports. Less obstruction of the cabinet airflow due to fewer and smaller fiber links. Reduced fire load due to less cabling.

Configuration best practices

  • Network options that should already be in place from the 1 Gbps network configuration:

    • TCP window scaling (aka RFC1323)


      $ no -o rfc1323
      rfc1323 = 1
      $ no -o rfc1323=1 -p

      Or on individual interfaces:

      $ lsattr -EH -a rfc1323 -l enX
      attribute value description                                user_settable
      rfc1323   1     Enable/Disable TCP RFC 1323 Window Scaling True
      $ chdev -a rfc1323=1 -l enX
    • TCP receive and transmit socket buffer size (at least 262144 bytes)


      $ no -o tcp_recvspace -o tcp_sendspace
      tcp_recvspace = 262144
      tcp_sendspace = 262144
      $ no -o tcp_recvspace=262144 -o tcp_sendspace=262144 -p

      Or on individual interfaces:

      $ lsattr -EH -a tcp_sendspace -a tcp_recvspace -l enX
      attribute     value  description                           user_settable
      tcp_sendspace 262144 Set Socket Buffer Space for Sending   True
      tcp_recvspace 262144 Set Socket Buffer Space for Receiving True
      $ chdev -a tcp_sendspace=262144 -a tcp_recvspace=262144 -l enX
    • Disable delayed TCP acknowledgments and the TCP Nagle algorithm


      $ no -o tcp_nodelayack
      tcp_nodelayack = 1
      $ no -o tcp_nodelayack=1 -p

      Or on individual interfaces:

      $ lsattr -EH -a tcp_nodelay -l enX
      attribute   value description                       user_settable
      tcp_nodelay 1     Enable/Disable TCP_NODELAY Option True
      $ chdev -a tcp_nodelay=1 -l enX
    • TCP selective acknowledgment (aka “SACK” or RFC2018):

      $ no -a | grep -i sack
      sack = 1
      $ no -o sack=1 -p
  • Turn on flow-control everywhere – interfaces and switches! The performance section below shows an example of what performance looks like when flow-control isn't implemented end-to-end. For your network equipment, see your vendor's documentation. For IVE/HEA interfaces enable flow-control in the HMC like this:

    1. Go to Systems Management → Servers and select your managed system.

    2. From the drop-down menu select Hardware Information → Adapters → Host Ethernet.

    3. In the new Host Ethernet Adapters window select your 10GE adapter and click the Configure button.

    4. In the new HEA Physical Port Configuration window check the Flow control enabled option and click the OK button.

    For other 10GE interfaces like the Chelsio FC 5769 adapters, turn on flow control on the AIX or VIOS device:

    $ lsattr -EH -a flow_ctrl -l entX
    attribute value description                              user_settable
    flow_ctrl yes   Enable transmit and receive flow control True
    $ chdev -a flow_ctrl=yes -l entX

    Changing this attribute on IVE/HEA devices has no effect though.

  • TCP checksum offload (enabled by default):

    $ lsattr -EH -a chksum_offload -l entX
    attribute      value description                          user_settable
    chksum_offload yes   Enable transmit and receive checksum True
    $ chdev -a chksum_offload=yes -l entX
  • TCP segmentation offload on AIX and VIOS hardware devices (enabled by default):

    $ lsattr -EH -a large_receive -a large_send -l entX
    attribute     value description                              user_settable
    large_receive yes   Enable receive TCP segment aggregation   True
    large_send    yes   Enable transmit TCP segmentation offload True
    $ chdev -a large_receive=yes -a large_send=yes -l entX

    TCP segmentation offload on VIOS SEA devices (could cause compatibility issues with IBM i LPARs)

    As root user:

    $ lsattr -EH -a large_receive -a largesend -l entX
    attribute     value description                                 user_settable
    large_receive yes   Enable receive TCP segment aggregation      True
    largesend     1     Enable Hardware Transmit TCP Resegmentation True
    $ chdev -a large_receive=yes -a largesend=1 -l entX

    As padmin user:

    padmin@vios$ chdev -dev entX -attr largesend=1
    padmin@vios$ chdev -dev entX -attr large_receive=yes

    TCP segmentation offload on AIX virtual ethernet devices:

    $ lsattr -EH -a mtu_bypass -l enX
    attribute  value description                                   user_settable
    mtu_bypass off   Enable/Disable largesend for virtual Ethernet True
    $ chdev -a mtu_bypass=on -l enX
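Taken together, the per-interface settings above boil down to a handful of chdev calls. A minimal helper sketch that only prints the commands for review instead of running them; the device name en0 is a placeholder, and the 262144-byte buffer sizes are the values used above:

```shell
# Print (for review, not execute) the chdev commands that apply the
# per-interface 10GE tunables discussed above; en0 is a placeholder.
apply_10ge_tunables() {
    dev="$1"
    echo "chdev -a rfc1323=1 -l $dev"
    echo "chdev -a tcp_sendspace=262144 -a tcp_recvspace=262144 -l $dev"
    echo "chdev -a tcp_nodelay=1 -l $dev"
    echo "chdev -a mtu_bypass=on -l $dev"
}
apply_10ge_tunables en0
```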

Performance Tests

To test and evaluate the network throughput performance, Michael Perzl's RPM package of netperf was used. The netperf client and server programs were started with the default options, which means a simplex, single-thread throughput measurement with a duration of 10 seconds for each run. The AIX and VIOS LPARs involved in the tests had – unless specified otherwise – the following CPU and memory resource allocation configuration:

LPAR Memory (GB) vCPU EC Mode Weight
AIX 16 2 0.2 uncapped 128
VIOS 4 2 0.4 uncapped 254

AIX was at oslevel 6100-08-01-1245, VIOS was at ioslevel

Hardware Tests on VIO Servers

For later reference or “calibration” if you will, the initial performance tests were done directly on the hardware devices assigned to the VIO servers. During the DLPAR assignment of the IVE/HEA ports I ran into some issues described earlier in AIX and VIOS DLPAR Operation fails on LHEA Adapters. After those issues were tackled, the test environment looked like this:

Hardware Test Setup with VIOS on IBM Power and Cisco Nexus Switches

  • Two datacenters “DC 1” and “DC 2” with less than 500m fiber distance between them.

  • An overall number of four P770 (9117-MMB). Two P770 with one CEC and thus one IVE/HEA port per VIOS. Two P770 with two CEC and thus two IVE/HEA ports per VIOS. Of each kind of P770 configuration, one is located in DC 1 and the other in DC 2.

  • Four Cisco Nexus 5596 switches, two in each datacenter.

  • Test IP addresses in two different VLANs, but tests not crossing VLANs over the firewall infrastructure.

The following table shows the results of the test runs in MBit/sec. Excluded – and denoted as grey cells with a “X” – were cases where source and destination IP would be the same and cases where VLANs would be crossed and thus the firewall infrastructure would be involved.

IP address 192.168.243 192.168.241 192.168.243 192.168.241
.11 .12 .21 .22 .31 .32 .41 .42 .51 .52 .61 .62
192.168.243 .11 X 596 X X 4551 4669 4148 4547 3319 364 X X
.12 4588 X X X 4526 4714 4192 4389 2770 560 X X
192.168.241 .21 X X X 4843 X X X X X X 555 3588
.22 X X 400 X X X X X X X 491 3400
192.168.243 .31 4753 294 X X X 4915 4250 4566 2316 602 X X
.32 4629 639 X X 4661 X 4189 4496 1153 449 X X
.41 4754 449 X X 4607 4858 X 4516 1693 603 X X
.42 4746 290 X X 4809 4898 4314 X 1615 352 X X
.51 4792 419 X X 4714 4834 4259 4491 X 370 X X
.52 4886 1046 X X 4663 4843 4300 4576 2151 X X X
192.168.241 .61 X X 454 4667 X X X X X X X 3820
.62 X X 541 4637 X X X X X X 288 X

In general the numbers weren't that bad for a first try. Still, there seemed to be a systematic performance issue in those test cases where the numbers in the table are marked red. Those are the cases where the IVE/HEA ports on the second CEC of a two-CEC system were receiving traffic. See the above image of the test environment, where the affected IVE/HEA ports are also marked red. After double-checking cabling, system resources and my own configuration, and also double-checking with our network department with regard to the flow-control configuration, I did a tcpdump trace of the netperf test case (upper left corner of the above table; value: 596 MBit/sec). This is what the wireshark bandwidth analysis graph of the tcpdump trace data looked like:

Wireshark bandwidth analysis for a test case without proper flow-control configuration

Noticeable are the large gaps of about one second in between very short bursts of traffic. This was due to the connection falling back to flow control on the TCP protocol level, which has relatively high backoff timing values. So the low transfer rate in the test cases marked red was due to improper flow-control configuration, which resulted in long periods of network inactivity, dragging down the overall average bandwidth value. After getting our network department to set up flow control properly and running the same netperf test case again, things improved dramatically:

Wireshark bandwidth analysis for a test case with flow-control

Now there is a continuous flow of packets, which is regulated in a more fine-grained manner via flow control in case one of the parties involved gets flooded with packets. Why this behaviour only shows up in conjunction with the IVE/HEA ports on the second CEC remains an open question. A PMR raised with IBM led to no viable or conclusive answer.

Doing the systematic tests above again confirmed the improved results in the previously problematic cases:

IP address 192.168.243 192.168.241 192.168.243 192.168.241
.11 .12 .21 .22 .31 .32 .41 .42 .51 .52 .61 .62
192.168.243 .11 X 4624 X X 4638 4550 4347 4618 4643 4178 X X
.12 4562 X X X 4642 4861 4374 4267 4527 4210 X X
192.168.241 .21 X X X 4578 X X X X X X 4347 3953
.22 X X 3858 X X X X X X X 4283 3918
192.168.243 .31 4501 4107 X X X 4719 4331 4492 4478 4118 X X
.32 4697 4089 X X 4481 X 4189 4313 4458 4091 X X
.41 4510 4327 X X 4735 4567 X 4485 4482 4420 X X
.42 4814 4122 X X 4696 4841 4453 X 4491 4221 X X
.51 4574 4198 X X 4656 5037 4501 4526 X 4167 X X
.52 4565 4283 X X 4543 4594 4421 4487 4416 X X X
192.168.241 .61 X X 4394 4497 X X X X X X X 4038
.62 X X 4343 4445 X X X X X X 4237 X

Although those results were much better than the ones in the first test run, there were still two issues:

  • individual throughput values still had a rather large span of over 1 Gbps (3858 - 5037 MBit/sec).

  • on average the throughput was well below the theoretical maximum of 10 Gbps. In the worst case only 38.5%, in the best case 50.3% of the theoretical maximum.

In order to further improve performance, jumbo frames were enabled on the switches, in the HMC on the IVE/HEA interfaces and on the interfaces within the VIO servers. For the HMC, the jumbo frames option on the IVE/HEA can be enabled in the same panel as the flow-control option mentioned above. After enabling jumbo frames, another test set like the above was done:

IP address 192.168.243 192.168.241 192.168.243 192.168.241
.11 .12 .21 .22 .31 .32 .41 .42 .51 .52 .61 .62
192.168.243 .11 X 7265 X X 7687 7218 7785 6795 7514 7459 X X
.12 7098 X X X 7492 6794 7644 6551 6979 7059 X X
192.168.241 .21 X X X 7187 X X X X X X 7228 7312
.22 X X 7190 X X X X X X X 6937 7203
192.168.243 .31 7823 7445 X X X 7086 8183 7051 7274 7548 X X
.32 7107 6730 X X 7427 X 7635 6616 6612 6677 X X
.41 7138 6930 X X 7585 6871 X 6701 6749 7114 X X
.42 6748 6593 X X 6944 6423 7641 X 6617 6865 X X
.51 7202 6716 X X 7055 6611 7779 6827 X 6929 X X
.52 7198 6711 X X 7816 6904 7907 6907 6990 X X X
192.168.241 .61 X X 6800 6710 X X X X X X X 7080
.62 X X 7197 6971 X X X X X X 7099 X

As those results show, the throughput can be significantly improved by using jumbo frames instead of the standard frame size. On average the throughput is now at about 71% (7116 MBit/sec) of the theoretical maximum of 10 Gbps. In the worst case that is 64.2%, in the best case 81.8% of the theoretical maximum. This also means the span of the throughput values, at 1.7 Gbps (6423 - 8183 MBit/sec), is even larger than before. It has to be kept in mind though that this is still single-thread performance, so the larger variance is probably due to CPU restrictions of a single thread. A quick singular test between one source and one destination using two parallel netperf threads showed a combined throughput of about 9400 MBit/sec and confirmed this suspicion.
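The percentages quoted above can be reproduced directly from the table values; a quick check, assuming a 10000 MBit/sec line rate and using the average, worst and best values from the jumbo frames test set:

```shell
# Jumbo-frame throughput as a share of the 10 Gbps line rate:
# average, worst and best case values from the table above.
for mbit in 7116 6423 8183; do
    awk -v v="$mbit" 'BEGIN { printf "%d MBit/sec = %.1f%%\n", v, v / 10000 * 100 }'
done
```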

LPAR and SEA Tests

After the initial hardware performance tests above, some more tests closer to the final real-world setup were done. The same setup would later be used for the production environment. The test environment looked like this:

LPAR and SEA Test Setup with AIX and VIOS on IBM Power and Cisco Nexus Switches

  • On a physical level of network and systems, the test setup was the same as the one above, except that only the two P770 with one CEC and thus one IVE/HEA port per VIOS were used in the further tests.

  • The directly assigned network interfaces and IPs on the VIO servers were replaced by a “failover SEA with load sharing” configuration.

  • An overall number of three AIX LPARs was used for the tests. LPARs #2 and #3 resided on the same P770 system. LPAR #1 resided on the other P770 system in the other datacenter.

  • All AIX LPARs had test IP addresses in the same VLAN (PVID 251). Thus, no traffic over the firewall infrastructure occurred and only one physical link (blue text in the above image) of the failover SEA was used.

The following table shows the results of the test runs in MBit/sec. The result table got a bit more complex this time around, since there were three LPARs of which each could be source or destination. In addition each LPAR also had six different combinations of virtual interface options (see the options legend). Excluded – and denoted as grey cells – were cases where source and destination IP and thus LPAR would be the same. To add a visual aid for interpretation of the results, the table cells are color coded depending on the measurement value. See the color legend for the mapping of color to throughput values.

Options Legend
none tcp_sendspace 262144, tcp_recvspace 262144, tcp_nodelay 1, rfc1323 1
LS largesend
9000 MTU 9000
65390 MTU 65390
LS,9000 largesend, MTU 9000
LS,65390 largesend, MTU 65390
Color Legend
⇐ 1 Gbps
> 1 Gbps and ⇐ 3 Gbps
> 3 Gbps and ⇐ 5 Gbps
> 5 Gbps
Managed System P770 DC 1-2 P770 DC 2-2
IP Destination .243 .239 .137
Source Option none LS 9000 65390 LS,9000 LS,65390 none LS 9000 65390 LS,9000 LS,65390 none LS 9000 65390 LS,9000 LS,65390
P770 DC 1-2 .243 none 791.27 727.34 759.35 743.37 766.81 708.33 888.13 885.68 877.76 788.91 778.50 875.20
LS 707.83 5239.54 709.68 721.63 5838.78 6714.79 3214.58 3225.06 3951.26 3744.28 4009.69 3961.77
9000 774.41 798.24 3422.03 4167.58 3518.23 3326.78 904.98 899.89 3676.68 3566.73 3507.67 3669.01
65390 755.57 772.49 3399.87 5233.71 3510.81 5821.94 867.31 878.49 3575.03 3697.73 3599.02 3766.30
LS,9000 785.19 6204.71 3658.68 3561.76 6384.92 7034.94 4073.84 4007.71 3964.47 4207.72 4024.49 3910.21
LS,65390 725.01 6058.76 3364.60 5909.11 7639.95 5529.59 3826.15 3918.82 3698.29 4139.26 3909.20 3970.33
.239 none 841.27 997.21 842.15 829.07 786.50 825.14 846.17 832.81 842.15 856.19 858.69 905.92
LS 827.75 5728.69 821.22 739.40 6028.86 6323.06 3674.06 3972.93 3750.00 3545.48 3238.95 3711.50
9000 817.95 847.33 3456.38 3659.09 3522.94 4436.41 855.46 897.06 3652.22 3410.08 3546.00 3439.29
65390 774.20 809.28 4552.87 6348.45 3598.86 5707.20 884.41 844.58 3173.88 3767.90 3217.50 3273.08
LS,9000 788.39 6569.07 4311.45 4004.73 7637.28 7529.95 3261.91 2854.86 4050.66 3458.85 3811.94 3972.71
LS,65390 801.97 5801.10 4018.57 5640.19 7815.83 5752.02 2952.91 3352.97 3961.08 3384.68 3421.31 4040.02
P770 DC 2-2 .137 none 925.17 753.81 865.15 839.55 912.80 930.17 819.35 938.09 864.33 847.11 898.57 827.62
LS 3351.54 2955.23 3323.41 3209.28 3158.83 3212.72 3143.14 2796.92 3118.80 2868.22 3244.99 3166.33
9000 930.69 953.76 3610.26 3460.80 3622.97 3585.96 785.10 869.91 3421.05 3245.40 3391.89 3369.89
65390 900.80 934.22 3564.63 3622.87 3769.84 3682.52 904.77 761.53 3231.62 3630.97 3079.13 3541.30
LS,9000 3066.75 3061.99 3604.73 4024.04 4059.77 4006.04 3094.73 2841.52 3820.87 3427.57 3792.07 3525.97
LS,65390 3333.52 3277.25 4047.81 4203.61 3957.74 3913.66 3025.75 3160.08 3829.30 3804.31 3283.60 3957.68

At first glance those results show that the throughput can vary significantly, depending on the combination of interface configuration options as well as on intra- vs. inter-managed system traffic. In detail the above results indicate that:

  • single-thread performance is going to be nowhere near the theoretical maximum of 10 Gbps. In most cases one has to be considered lucky to get 50% of the theoretical maximum.

  • the default configuration options (test cases: none) are not suited at all for a 10 Gbps environment.

  • intra-managed system network traffic, which does not leave the hardware and should be handled only within the PowerVM hypervisor, performs better than inter-managed system traffic in most of the cases. To a certain degree this was expected, since in the latter case the traffic additionally has to go over the SEA, the physical interfaces, the network devices, and again over the other system's physical interfaces and SEA.

  • there is a quite noticeable delta between the intra-managed system results and the theoretical maximum of 10 Gbps. Since this traffic does not leave the hardware and should be handled only within the PowerVM hypervisor, better results somewhere between 9 – 9.5 Gbps were expected.

  • there is a quite noticeable performance drop-off when comparing the inter-managed system results with the previous “non-jumbo frames, but flow-control” hardware tests. The delta gets even more noticeable when comparing the results with the previous jumbo frames hardware tests. It seems the now involved SEA doesn't handle larger frame sizes that well.

  • strangely, the inter-managed system tests show better results than the intra-managed system tests in some cases (LS → none, LS,9000 → none, LS,65390 → none).

The somewhat dissatisfying results were discussed with IBM support within the scope of the PMR mentioned earlier. The result of this discussion was – even with a large amount of good will – also dissatisfying. We got the usual feedback of “results are actually not that bad”, “works as designed” and the infamous “not enough CPU resources on the VIO servers, switch to dedicated CPU assignment”. Just to get the latter point out of the way, I did another reduced test set similar to the above. This time only the test cases for inter-managed system traffic were performed. The resource allocation on the two involved VIO servers was changed to two dedicated CPU cores. The following table shows the results of the test runs in MBit/sec.

Managed System P770 DC 1-2 P770 DC 2-2
IP Destination .243 .137
Source Option none LS 9000 65390 LS,9000 LS,65390 none LS 9000 65390 LS,9000 LS,65390
P770 DC 1-2 .243 none 853.06 862.40 800.55 869.85 873.51 846.27
LS 3661.52 3630.96 3530.22 3016.51 2986.46 3144.03
9000 808.54 654.28 3435.51 3570.83 3475.94 3426.25
65390 643.44 880.24 3301.09 3617.22 3592.70 3511.19
LS,9000 3645.29 2487.22 3783.74 3983.06 3774.08 4072.15
LS,65390 3591.95 3506.01 3646.02 4127.02 3987.71 3716.59
P770 DC 2-2 .137 none 826.02 710.41 800.56 425.30 814.65 824.55
LS 3509.81 3327.02 3419.89 3300.43 2387.57 3654.18
9000 827.58 835.04 3354.82 3301.09 3482.56 3550.13
65390 827.28 834.30 3451.69 3131.41 3506.15 3470.74
LS,9000 3480.38 3624.86 4191.55 4197.86 4194.89 4018.06
LS,65390 3580.14 2614.46 4097.04 4234.08 4181.17 3863.07

The results do improve a bit over the previous test set in some of the cases, but not by the expected margin of ~50%, which would be needed to get anywhere near the results of the previous hardware tests. A request to open a DCR in order to further investigate possible optimizations with regard to network performance in general, and the SEA in particular, is currently still in the works at IBM.

HMC and Jumbo Frames

In order to still get a working RMC connection from the HMC to the LPARs and thus be able to do DLPAR and LPM operations after switching to jumbo frames, the HMC has to be configured accordingly. See: Configure Jumbo Frames on the Hardware Management Console (HMC) for the details. If – like in our case – you have to deal with a heavily fragmented/firewalled environment and the RMC traffic from the HMC to the LPARs has to pass through several network devices, all the network equipment along the way has to be able to properly deal with jumbo frames. In our case the network department was reluctant and eventually did not reconfigure their devices, since “jumbo frames are a non-standard extension to the network protocol”. As usual this is a tradeoff between optimal performance and the possibility of strange behaviour or even error situations that are hard to debug.

Thoughts and Conclusions

It appears as if with the advent of 10GE or even higher bandwidths, the network itself is no longer the bottleneck. Rather, the components doing the necessary legwork of pre- and postprocessing (i.e. CPU and memory), as well as some parts of the TCP/IP stack are or are becoming the new bottleneck. One can no longer rely on a plug'n'play attitude like with gigabit ethernet, when it comes to network performance and optimal utilization of existing resources.

Although I'm certainly no network protocol designer, AIX kernel programmer or PowerVM expert, it still seems to me as if IBM has left considerable room for improvement in all components involved. As 10 gigabit ethernet is deployed more and more in the datacenters, becoming the new standard by replacing gigabit ethernet, there will be an increased demand for high throughput and low latency network communication. It'll be hard to explain to upper management why they're getting only 35% - 75% of the network performance for 100% of the costs. I hope IBM will be able to show that the IBM Power and AIX platform can at least keep up with the current technology and trends as well as with the competition from other platforms.

// Nagios Performance Tuning

I'm running Nagios in a bit of an unusual setup, namely on a Debian/PPC system, which runs in a LPAR on an IBM Power System, using a dual VIOS setup for access to I/O resources (SAN disks and networks). The Nagios server runs the stock Debian “Squeeze” packages:

nagios-nrpe-plugin        2.12-4             Nagios Remote Plugin Executor Plugin
nagios-nrpe-server        2.12-4             Nagios Remote Plugin Executor Server
nagios-plugins            1.4.15-3squeeze1   Plugins for the nagios network monitoring and management system
nagios-plugins-basic      1.4.15-3squeeze1   Plugins for the nagios network monitoring and management system
nagios-plugins-standard   1.4.15-3squeeze1   Plugins for the nagios network monitoring and management system
nagios3                   3.2.1-2            A host/service/network monitoring and management system
nagios3-cgi               3.2.1-2            cgi files for nagios3
nagios3-common            3.2.1-2            support files for nagios3
nagios3-core              3.2.1-2            A host/service/network monitoring and management system core files
nagios3-doc               3.2.1-2            documentation for nagios3
ndoutils-common           1.4b9-1.1          NDOUtils common files
ndoutils-doc              1.4b9-1.1          Documentation for ndoutils
ndoutils-nagios3-mysql    1.4b9-1.1          This provides the NDOUtils for Nagios with MySQL support
pnp4nagios                0.6.12-1~bpo60+1   Nagios addon to create graphs from performance data
pnp4nagios-bin            0.6.12-1~bpo60+1   Nagios addon to create graphs from performance data (binaries)
pnp4nagios-web            0.6.12-1~bpo60+1   Nagios addon to create graphs from performance data (web interface)

It monitors about 220 hosts, which are either Unix systems (AIX or Linux), storage systems (EMC Clariion, EMC Centera, Fujitsu DX, HDS AMS, IBM DS, IBM TS), SAN devices (Brocade 48000, Brocade DCX, IBM SVC) or other hardware devices (IBM Power, IBM HMC, Rittal CMC). About 5000 services are checked in a 5 minute interval, mostly with NRPE, SNMP, TCP/UDP and custom plugins utilizing vendor specific tools. The server also runs Cacti, SNMPTT, DokuWiki and several other smaller tools, so it has always been quite busy.
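Just to put that scale in numbers, a quick back-of-the-envelope check of the sustained check rate the server has to keep up:

```shell
# ~5000 service checks every 5 minutes means this average rate:
awk 'BEGIN { printf "%.1f checks/sec\n", 5000 / (5 * 60) }'
# prints 16.7 checks/sec
```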

In the past i had already implemented some performance optimization measures in order to mitigate the overall load on the system and to get all checks done within the 5-minute timeframe. Those were, in no specific order:

  • Use C/C++ based plugins. Try to avoid plugins that depend on additional, rather large runtime environments (e.g. Perl, Java, etc.).

  • If Perl based plugins are still necessary, use the Nagios embedded Perl interpreter to run them. See Using The Embedded Perl Interpreter for more information and Developing Plugins For Use With Embedded Perl on how to develop and debug plugins for the embedded Perl interpreter.

  • When using plugins based on shell scripts, try to minimize the number of calls to additional command line tools by using the shell's built-in facilities. For example, bash and ksh93 have built-in syntax for manipulating strings, which can be used instead of calling sed or awk.
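
    As an illustrative sketch (the variable names and the load-average line are made up, not taken from an actual plugin), the same field extraction can be done with parameter expansion instead of spawning awk or sed:

    ```shell
    # Hypothetical example: pull the 1-minute load average out of a line
    # of uptime output without forking any external tools.
    line="load average: 0.12, 0.34, 0.56"

    # Strip everything up to and including "load average: "
    # (bash/ksh93 parameter expansion)
    loads="${line#*load average: }"

    # Take the first comma-separated field, instead of:
    #   echo "$line" | awk -F', ' '{print $3}'
    one_min="${loads%%,*}"

    echo "$one_min"
    ```

    Each avoided fork saves a few milliseconds, which adds up quickly at 5000 checks per interval.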

  • Create a ramdisk with a filesystem on it to hold the I/O heavy files and directories. In my case /etc/fstab contains the following line:

    # RAM disk for volatile Nagios files
    none    /var/ram/nagios3    tmpfs   defaults,size=256m,mode=750,uid=nagios,gid=nagios   0   0

    preparing an empty ramdisk-based filesystem at boot time. The init script /etc/init.d/nagios-prepare is run before the Nagios init script. It creates a directory tree on the ramdisk and copies the necessary files from the disk-based, non-volatile filesystems:

    Source                          Destination
    /var/log/nagios3/retention.dat  /var/ram/nagios3/log/retention.dat
    /var/log/nagios3/nagios.log     /var/ram/nagios3/log/nagios.log
    /var/cache/nagios3/             /var/ram/nagios3/cache/

    over to the ramdisk via rsync. The following Nagios configuration stanzas have been altered to use the new ramdisk. With the Debian paths from the table above, these amount to stanzas like (the exact directive set is an assumption; the directive names are from the stock Nagios 3 configuration):

        log_file=/var/ram/nagios3/log/nagios.log
        state_retention_file=/var/ram/nagios3/log/retention.dat
        object_cache_file=/var/ram/nagios3/cache/objects.cache
        status_file=/var/ram/nagios3/cache/status.dat

    To prevent loss of data in the event of a system crash, a cronjob is run every 5 minutes to rsync some files from the ramdisk back to the disk-based, non-volatile filesystems:

    Source                   Destination
    /var/ram/nagios3/log/    /var/log/nagios3/
    /var/ram/nagios3/cache/  /var/cache/nagios3/

    This job should also be run before the system is shut down or rebooted.
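
    The sync-back job can be sketched as follows (the sync_back helper and the mount-point guard are my own; the source/destination pairs are those from the table above):

    ```shell
    #!/bin/sh
    # Sketch of the 5-minute sync-back cronjob. At most 5 minutes of
    # Nagios state are lost if the system crashes.

    sync_back() {
        # -a preserves permissions and timestamps, --delete keeps the
        # destination an exact mirror of the ramdisk copy
        rsync -a --delete "$1" "$2"
    }

    # Only touch the real paths when the ramdisk is actually mounted
    if [ -d /var/ram/nagios3 ]; then
        sync_back /var/ram/nagios3/log/   /var/log/nagios3/
        sync_back /var/ram/nagios3/cache/ /var/cache/nagios3/
    fi
    ```

    Hooking the same script into the shutdown runlevels covers the reboot case mentioned above.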

  • Disable free()ing of child process memory (see: Nagios Main Configuration File Options - free_child_process_memory) with:

        free_child_process_memory=0

    in the main Nagios configuration file. This can safely be done since the Linux OS will take care of that once the fork()ed child exits.

  • Disable fork()ing twice when creating a child process (see: Nagios Main Configuration File Options - child_processes_fork_twice) with:

        child_processes_fork_twice=0

    in the main Nagios configuration file.

Recently, after “just” adding another process to be monitored, i ran into a strange problem, where suddenly all checks would fail with error messages similar to this example:

[1348838643] Warning: Return code of 127 for check of service 'Check_process_gmond' on host
    'host1' was out of bounds. Make sure the plugin you're trying to run actually exists
[1348838643] SERVICE ALERT: host1;Check_process_gmond;CRITICAL;SOFT;1;(Return code of 127 is
    out of bounds - plugin may be missing)

Even checks totally unrelated to the newly monitored process. After removing the new check from some hostgroups, everything was fine again. So the problem wasn't with the check per se, but rather with the number of checks; it felt like i was hitting some internal limit of the Nagios process. It turned out to be exactly that: the Nagios macros which are exported to the fork()ed child process had reached the OS limit for a process's environment. After checking all plugins for any usage of the Nagios macros in the process environment i decided to turn this option off (see Nagios Main Configuration File Options - enable_environment_macros) with:

enable_environment_macros=0

in the main Nagios configuration file. Along with the two above options i had now basically reached what is also done by the single use_large_installation_tweaks configuration option (see: Nagios Main Configuration File Options - use_large_installation_tweaks).
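
The limit in question is the kernel's cap on the combined size of argument list and environment passed to execve(). A rough way to gauge how much headroom a process has (illustrative; the exported Nagios macros came on top of this baseline):

```shell
# Compare the current environment size against the execve() limit.
arg_max=$(getconf ARG_MAX)   # combined argv + environment limit, in bytes
env_size=$(env | wc -c)      # approximate size of the current environment

echo "environment uses $env_size of $arg_max bytes"
```

With thousands of services, the per-check macro set multiplied across every fork() can plausibly exhaust this budget.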

After reloading the Nagios process, not only did the new process checks work for all hostgroups, but there was also a very noticeable drop in the system's CPU load:

Nagios CPU usage with environment macros turned off

// Webserver - Windows vs. Unix

Recently at work, i was given the task of evaluating alternatives for the current OS platform running the company homepage. Sounds trivial enough, doesn't it? But every subject in a moderately complex corporate environment has some history, lots of pitfalls and a considerable amount of politics attached to it, so why should this particular one be an exception?

The current environment was running a WAMP (Windows, Apache, MySQL, PHP) stack with a PHP-based CMS and was not performing well at all. The systems would cave under even minimal connection load, not to mention user rushes during campaign launches. The situation dragged on for over a year and a half, while expert consultants were brought in, measurements were made, fingers were pointed and even new hardware was purchased. Nothing helped; the new hardware brought the system down even faster, because it could serve more initial user requests, thus effectively overrunning the system. IT management drew a lot of fire for the situation, but nonetheless stuck with the “Microsoft, our strategic platform” mantra. I guess at some point the pressure got too high even for those guys.

This is where i, the Unix guy with almost no M$ knowledge, got the task of evaluating whether or not an “alternative OS platform” could do the job. Hot potato, anyone?

So i went on and set up four different environments that were at least somewhere within the scope of our IT department's supported systems (so no *BSD, no Solaris, etc.):

  1. Linux on the newly purchased x86 hardware mentioned above

  2. Linux on our VMware ESX cluster

  3. Linux as a LPAR on our IBM Power systems

  4. AIX as a LPAR on our IBM Power systems

Apache, MySQL and PHP were all the same version as in the Windows environment. The CMS and content were direct copies from the Windows production systems. Without any special further tweaking i ran some load tests with siege:

Webserver performance comparison - Transactions per second
Webserver performance comparison - Response time

Compared to the Windows environment (gray line), scenario 1 (dark blue line) delivered about 5 times the performance on the exact same hardware. The virtualized scenarios 2, 3 and 4 did not perform as well in absolute values. But since their CPU resources were only about half of those available in scenario 1, their relative performance isn't too bad after all. Also notable is the fact that all scenarios served requests up to the test limit of a thousand parallel clients, whereas Windows started dropping requests after about 300 parallel clients.
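
A load test run along these lines can be sketched with siege's standard options (the URL, concurrency and duration here are assumptions, not the original test plan; -c sets the number of concurrent simulated users, -t the run time):

```shell
# Stand-in for the real homepage URL used in the tests
URL="http://www.example.com/"

# Run siege only if it is installed: 100 concurrent clients for 5 minutes
if command -v siege >/dev/null 2>&1; then
    siege -c 100 -t 5M "$URL"
fi
```

Stepping the -c value up across runs is what produces the transactions-per-second and response-time curves shown above.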

Presented with those numbers, management decided the company webserver environment should be migrated to an “alternative OS platform”. AIX on Power systems was chosen for operational reasons, even though it didn't deliver the highest performance of the tested scenarios. The go-live of the new webserver environment was Wednesday last week at noon, with the switchover of the load-balancing groups. Take a look at what happened to the response time measurements around that time:

Webserver performance - Daily after migration

Also very interesting is the weekly graph a few days after the migration:

Webserver performance - Weekly after migration

Note the largely reduced jitter in the response time!
