bityard Blog

// AIX and VIOS Performance with 10 Gigabit Ethernet (Update)

In October last year (10/2013) a colleague and I were given the opportunity to speak at the “IBM AIX User Group South-West” at IBM's German headquarters in Ehningen. My part of the talk was about our experiences with the move from 1GE to 10GE on our IBM Power systems. It was largely based on my previous post AIX and VIOS Performance with 10 Gigabit Ethernet. During the preparation for the talk, while I reviewed and compiled the previously collected material on the subject, I suddenly realized a disastrous mistake in my methodology. Specifically, the netperf program used for the performance tests relies on a rather inadequate heuristic for determining the TCP buffer sizes it will use during a test run. For example:

lpar1:/$ netperf -H
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec
16384  16384   16384    10.00    3713.83

with the values “16384” from the last line being the relevant part here.

It turned out that on AIX the netperf utility only looks at the global, system-wide values of the tcp_recvspace and tcp_sendspace tunables set with the no command. In my example these were:

lpar1:/$ no -o tcp_recvspace -o tcp_sendspace
tcp_recvspace = 16384
tcp_sendspace = 16384

The interface specific values “262144” or “524288”, e.g.:

lpar1:/$ ifconfig en0
        inet netmask 0xffffff00 broadcast
         tcp_sendspace 262144 tcp_recvspace 262144 tcp_nodelay 1 rfc1323 1

which would override the system-wide default values for a specific network interface, were never properly picked up by netperf. The configuration of low global default values and higher interface-specific values for the tunables was deliberately chosen in order to allow the coexistence of low and high bandwidth adapters in the same system without the risk of interference. Anyway, I guess this is a very good example of the famous saying:

A fool with a tool is still a fool. Grady Booch
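The mismatch described above can be made visible with a small sketch that compares the system-wide values with the interface-specific ones. It only parses text in the two formats shown in the transcripts above; the function names are made up for this illustration, and the assumption that an interface-specific value simply replaces the global one follows the description in this post, nothing more.

```python
import re

# Sample output copied from the `no` and `ifconfig` transcripts above.
NO_OUTPUT = """\
tcp_recvspace = 16384
tcp_sendspace = 16384
"""

IFCONFIG_OUTPUT = """\
        inet netmask 0xffffff00 broadcast
         tcp_sendspace 262144 tcp_recvspace 262144 tcp_nodelay 1 rfc1323 1
"""

TUNABLES = ("tcp_sendspace", "tcp_recvspace")


def parse_global(no_output):
    """Parse 'name = value' lines as printed by `no -o`."""
    values = {}
    for line in no_output.splitlines():
        match = re.match(r"\s*(\w+)\s*=\s*(\d+)\s*$", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def parse_interface(ifconfig_output):
    """Pick up 'name value' pairs for the two tunables from ifconfig output."""
    values = {}
    for name in TUNABLES:
        match = re.search(rf"\b{re.escape(name)}\s+(\d+)", ifconfig_output)
        if match:
            values[name] = int(match.group(1))
    return values


def effective(global_values, interface_values):
    """Assumption: interface-specific values win over the global defaults."""
    merged = dict(global_values)
    merged.update(interface_values)
    return merged


if __name__ == "__main__":
    global_values = parse_global(NO_OUTPUT)
    interface_values = parse_interface(IFCONFIG_OUTPUT)
    for name in TUNABLES:
        print(name,
              "global:", global_values.get(name),
              "effective on interface:",
              effective(global_values, interface_values).get(name))
```

Any difference between the two columns is a hint that a tool which only reads the global values will test with different buffer sizes than regular traffic on that interface would use.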

For another round of performance tests I temporarily set the global default values for the tunables tcp_recvspace and tcp_sendspace first to “262144” and then to “524288”. The number of tests was reduced to “largesend” and “largesend with jumbo frames (JF) enabled”, since those, as elaborated before, seemed to be the most feasible options. The following table shows the results of the test runs in MBit/sec. The upper value in each cell is from the test with the tunables tcp_recvspace and tcp_sendspace set to “262144”, the lower value from the test with both tunables set to “524288”.

Options Legend
upper value tcp_sendspace 262144, tcp_recvspace 262144, tcp_nodelay 1, rfc1323 1
lower value tcp_sendspace 524288, tcp_recvspace 524288, tcp_nodelay 1, rfc1323 1
LS largesend
LS,9000 largesend, MTU 9000
Color Legend
<= 1 Gbps
> 1 Gbps and <= 3 Gbps
> 3 Gbps and <= 5 Gbps
> 5 Gbps
Managed System P770 DC 1-2 P770 DC 2-2
IP Destination .243 .239 .137
Source Option LS LS,9000 LS LS,9000 LS LS,9000
P770 DC 1-2 .243 LS 9624.47
LS,9000 9313.00
.239 LS 8484.16
LS,9000 8181.51
P770 DC 2-2 .137 LS 3937.75
LS,9000 3849.58

The results show a significant increase in throughput in the cases of intra-managed system network traffic, which is now up to almost wire speed. For the cases of inter-managed system traffic there is also a noticeable, but far less significant increase in throughput. To check if the two TCP buffer sizes selected were still too low, I did a quick check of the network latency in our environment and compared it to the reference values from IBM (see the redbook link below):

Network        IBM RTT Reference   RTT Measured
1GE Phys       0.144 ms            0.188 ms
10GE Phys      0.062 ms            0.074 ms
10GE Hyp       0.038 ms            0.058 ms
10GE SEA       0.274 ms            0.243 ms
10GE VMXNET3   -                   0.149 ms

Every value besides the measurement within a managed system (“10GE Hyp”) represents the worst case RTT between two separate managed systems, one in each of the two datacenters. For reference purposes I added a test (“10GE VMXNET3”) between two Linux VMs running on two different VMware ESX hosts, one in each of the two datacenters, using the same network hardware as the IBM Power systems. The measured values themselves are well within the range of the reference values given by IBM, so the network setup in general should be fine. A quick calculation of the bandwidth-delay product for the value of the test case “10GE SEA”:

B x D = 10^10 bps x 0.243 ms = 2.43 x 10^6 bit ~ 304 kB

confirmed that the value of “524288” for the tunables tcp_recvspace and tcp_sendspace should be sufficient. Very alarming, on the other hand, is the fact that processing of simple ICMP packets takes almost twice as long going through the SEA on the IBM Power systems as it does going through the network virtualization layer of VMware ESX. Part of the rather sub-par throughput performance measured on IBM Power systems is most likely caused by inefficient processing of network traffic within the SEA and/or VIOS. Rumor has it that with the advent of 40GE and things getting even worse, the IBM AIX lab finally acknowledged this as an issue and is working on a streamlined, more efficient network stack. Other sources say that SR-IOV, and with it passing direct access to shared hardware resources to the LPAR systems on a larger scale, is considered to be the way forward. Unfortunately this currently conflicts with LPM, so the two will probably be mutually exclusive in most environments.
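The bandwidth-delay estimate for the “10GE SEA” case can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function name is made up for this example. The result depends on whether 10 Gbit/s is read as 10^10 bit/s or as 10 x 2^30 bit/s, so both readings are computed and compared against the two buffer sizes used in the tests.

```python
def bdp_bytes(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product in bytes: bits in flight during one RTT, divided by 8."""
    return bandwidth_bps * rtt_seconds / 8


RTT_SEA = 0.243e-3  # measured worst-case RTT for "10GE SEA", in seconds

ESTIMATES = {
    "10 x 10^9 bit/s": bdp_bytes(10 * 10**9, RTT_SEA),
    "10 x 2^30 bit/s": bdp_bytes(10 * 2**30, RTT_SEA),
}

BUFFER_SIZES = (262144, 524288)  # the two tcp_sendspace/tcp_recvspace settings tested

if __name__ == "__main__":
    for label, bdp in ESTIMATES.items():
        print(f"{label}: BDP = {bdp:.0f} bytes")
        for size in BUFFER_SIZES:
            verdict = "covers" if size >= bdp else "is below"
            print(f"  buffer of {size} bytes {verdict} the BDP")
```

With either reading the estimate stays below 524288 bytes, which is the point of the calculation.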

In any case IBM still seems to have some homework to do on the network front. It'll be interesting to see what the developers come up with in the future.

// Display the AIX devices in a tree format

In AIX there is the proctree (or ps -fT 0) command to display the currently running processes in a tree format. This is very helpful when one is primarily interested in the parent and child relationship between the individual processes. Unfortunately a similar command for the parent and child relationship of devices is still missing from stock AIX. There are already several script implementations out on the net to fill that particular gap. As a scripting exercise I wanted to do my own version of a devtree command. The result is included in the aaa_base RPM package. The output for e.g. an LPAR with virtual ethernet and virtual SCSI devices looks like this:

|-- inet0
|   |-- en1
|   |-- et1
|   |-- lo0
|-- iocp0
|-- lvdd
|-- pty0
|-- rootvg
|   |-- hd1
|   |-- hd2
|   |-- hd3
|   |-- hd4
|   |-- hd5
|   |-- hd6
|   |-- hd8
|   |-- hd10opt
|   |-- hd9var
|   |-- lv_srv
|-- sfw0
|-- sys0
|   |-- sysplanar0
|   |   |-- L2cache0
|   |   |-- mem0
|   |   |-- pci0
|   |   |   |-- pci4
|   |   |   |-- pci5
|   |   |   |-- pci6
|   |   |-- pci1
|   |   |   |-- pci7
|   |   |   |-- pci8
|   |   |-- pci2
|   |   |   |-- pci9
|   |   |   |-- pci10
|   |   |   |-- pci11
|   |   |-- pci3
|   |   |   |-- pci12
|   |   |   |-- pci13
|   |   |-- pci14
|   |   |-- proc0
|   |   |-- proc4
|   |   |-- vio0
|   |   |   |-- ent1
|   |   |   |-- vsa0
|   |   |   |   |-- vty0
|   |   |   |-- vscsi0
|   |   |   |   |-- hdisk0
|   |   |   |-- vscsi1
`-- End of the device tree

in the regular mode, or like this:

|-- inet0                                         Available                   Internet Network Extension
|   |-- en1                                       Available                   Standard Ethernet Network Interface
|   |-- et1                                         Defined                   IEEE 802.3 Ethernet Network Interface
|   |-- lo0                                       Available                   Loopback Network Interface
|-- iocp0                                           Defined                   I/O Completion Ports
|-- lvdd                                          Available                   LVM Device Driver
|-- pty0                                          Available                   Asynchronous Pseudo-Terminal
|-- rootvg                                          Defined                   Volume group
|   |-- hd1                                         Defined                   Logical volume
|   |-- hd2                                         Defined                   Logical volume
|   |-- hd3                                         Defined                   Logical volume
|   |-- hd4                                         Defined                   Logical volume
|   |-- hd5                                         Defined                   Logical volume
|   |-- hd6                                         Defined                   Logical volume
|   |-- hd8                                         Defined                   Logical volume
|   |-- hd10opt                                     Defined                   Logical volume
|   |-- hd9var                                      Defined                   Logical volume
|   |-- lv_srv                                      Defined                   Logical volume
|-- sfw0                                          Available                   Storage Framework Module
|-- sys0                                          Available                   System Object
|   |-- sysplanar0                                Available                   System Planar
|   |   |-- L2cache0                              Available                   L2 Cache
|   |   |-- mem0                                  Available                   Memory
|   |   |-- pci0                                    Defined                   PCI Bus
|   |   |   |-- pci4                                Defined            00-10  PCI Bus
|   |   |   |-- pci5                                Defined            00-12  PCI Bus
|   |   |   |-- pci6                                Defined            00-16  PCI Bus
|   |   |-- pci1                                    Defined                   PCI Bus
|   |   |   |-- pci7                                Defined            02-10  PCI Bus
|   |   |   |-- pci8                                Defined            02-12  PCI Bus
|   |   |-- pci2                                    Defined                   PCI Bus
|   |   |   |-- pci9                                Defined            03-10  PCI Bus
|   |   |   |-- pci10                               Defined            03-12  PCI Bus
|   |   |   |-- pci11                               Defined            03-16  PCI Bus
|   |   |-- pci3                                    Defined                   PCI Bus
|   |   |   |-- pci12                               Defined            01-10  PCI Bus
|   |   |   |-- pci13                               Defined            01-12  PCI Bus
|   |   |-- pci14                                   Defined                   PCI Bus
|   |   |-- proc0                                 Available            00-00  Processor
|   |   |-- proc4                                 Available            00-04  Processor
|   |   |-- vio0                                  Available                   Virtual I/O Bus
|   |   |   |-- ent1                              Available                   Virtual I/O Ethernet Adapter (l-lan)
|   |   |   |-- vsa0                              Available                   LPAR Virtual Serial Adapter
|   |   |   |   |-- vty0                          Available                   Asynchronous Terminal
|   |   |   |-- vscsi0                            Available                   Virtual SCSI Client Adapter
|   |   |   |   |-- hdisk0                        Available                   Virtual SCSI Disk Drive
|   |   |   |-- vscsi1                            Available                   Virtual SCSI Client Adapter
`-- End of the device tree

in the detailed output mode.
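The general idea behind such a tree display can be sketched in a few lines of Python. This is not the devtree script from the aaa_base package, just an illustration of the technique: it takes (device, parent) pairs, groups the devices by parent and prints them recursively in the same layout as the regular mode shown above. Where the pairs come from is left open; on AIX they could for example be collected with lsdev and a format string listing the name and parent columns, but that invocation is an assumption here. Sorting the children by name is also a choice made for this sketch.

```python
from collections import defaultdict

# Hypothetical sample input: (device, parent) pairs, "" meaning "no parent".
SAMPLE = [
    ("sys0", ""),
    ("sysplanar0", "sys0"),
    ("vio0", "sysplanar0"),
    ("ent1", "vio0"),
    ("vscsi0", "vio0"),
    ("hdisk0", "vscsi0"),
    ("inet0", ""),
    ("en1", "inet0"),
]


def build_children(pairs):
    """Group device names by the name of their parent device."""
    children = defaultdict(list)
    for name, parent in pairs:
        children[parent].append(name)
    return children


def render(children, parent="", depth=0):
    """Return the tree below `parent` as a list of lines, depth first."""
    lines = []
    for name in sorted(children.get(parent, [])):
        lines.append("|   " * depth + "|-- " + name)
        lines.extend(render(children, name, depth + 1))
    return lines


def devtree(pairs):
    """Render all devices, starting with those that have no parent."""
    lines = render(build_children(pairs))
    lines.append("`-- End of the device tree")
    return "\n".join(lines)


if __name__ == "__main__":
    print(devtree(SAMPLE))
```

A detailed mode would only need to carry status, location and description along with each name and append them in fixed-width columns.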

// Ganglia Fibre Channel Power/Attenuation Monitoring on AIX and VIO Servers

Although usually only available upon request via IBM support, efc_power is quite a handy tool when it comes to debugging or narrowing down fibre channel link issues. It provides the transmit and receive power and attenuation values for a given FC port on an AIX or VIO server. Fortunately the output of efc_power:

$ /opt/freeware/bin/efc_power /dev/fscsi2
TX: 1232 -> 0.4658 mW, -3.32 dBm
RX: 10a9 -> 0.4265 mW, -3.70 dBm

is very parser-friendly, so it can easily be read by a script for further processing. In this case further processing means continuous Ganglia monitoring of the fibre channel transmit and receive power and attenuation values for each FC port on an AIX or VIO server. This is accomplished by the two RPM packages ganglia-addons-aix and ganglia-addons-aix-scripts:

RPM packages

Source RPM packages

Filename                                   Filesize   Last modified
ganglia-addons-aix-0.1-1.src.rpm           6.6 KiB    2013/07/30 09:41
ganglia-addons-aix-0.1-1.src.rpm.sha1sum   75.0 B     2013/07/30 09:41

The package ganglia-addons-aix-scripts is to be installed on the AIX or VIO server which has the FC adapter installed. It depends on the aaa_base package for the efc_power binary and on the ganglia-addons-base package, specifically on the cronjob (/opt/freeware/etc/run_parts/conf.d/) defined by this package. In the context of this cronjob all available scripts in the directory /opt/freeware/libexec/ganglia-addons/ are executed. This specific Ganglia addon iterates over all fscsi devices in the system and calls efc_power for each of them. Devices can be excluded by assigning a regex pattern to the BLACKLIST variable in the configuration file /opt/freeware/etc/ganglia-addons/ganglia-addons-efc_power.cfg. The output of each efc_power call is parsed and fed via the gmetric command into a Ganglia monitoring system, which has to be set up already.
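The parsing step described above could look roughly like the following sketch. It is not the script shipped in ganglia-addons-aix-scripts; it merely parses the two-line efc_power output format shown earlier, applies a blacklist regex to a list of device names and builds gmetric argument lists without executing them. The metric naming scheme (fc_<device>_<direction>_<unit>) and all function names are made up for this illustration.

```python
import re

# Sample output copied from the efc_power call shown above.
SAMPLE = """\
TX: 1232 -> 0.4658 mW, -3.32 dBm
RX: 10a9 -> 0.4265 mW, -3.70 dBm
"""

LINE_RE = re.compile(
    r"^(TX|RX):\s+([0-9a-fA-F]+)\s+->\s+([0-9.]+)\s+mW,\s+(-?[0-9.]+)\s+dBm$"
)


def parse_efc_power(output):
    """Return {'tx': {...}, 'rx': {...}} with raw, mW and dBm values."""
    readings = {}
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            direction, raw, milliwatt, dbm = match.groups()
            readings[direction.lower()] = {
                "raw": int(raw, 16),   # raw register value, printed in hex
                "mw": float(milliwatt),
                "dbm": float(dbm),
            }
    return readings


def wanted(devices, blacklist=None):
    """Drop devices whose name matches the blacklist regex, if one is set."""
    if not blacklist:
        return list(devices)
    return [dev for dev in devices if not re.search(blacklist, dev)]


def gmetric_commands(device, readings):
    """Build one gmetric argument list per value (metric names are hypothetical)."""
    commands = []
    for direction, values in sorted(readings.items()):
        for key, unit in (("mw", "mW"), ("dbm", "dBm")):
            commands.append([
                "gmetric",
                "--name", f"fc_{device}_{direction}_{key}",
                "--value", str(values[key]),
                "--type", "float",
                "--units", unit,
            ])
    return commands


if __name__ == "__main__":
    for dev in wanted(["fscsi0", "fscsi2"], blacklist=r"^fscsi0$"):
        for command in gmetric_commands(dev, parse_efc_power(SAMPLE)):
            print(" ".join(command))
```

Building the argument lists separately from running them keeps the parsing testable on a machine without FC adapters or a Ganglia installation.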

The package ganglia-addons-aix is to be installed on the host running the Ganglia webinterface. It contains templates for the customization of the FC power and attenuation metrics within the Ganglia Web 2 interface. See the README.templates file for further installation instructions. Here are samples of the two graphs created with those Ganglia monitoring templates:

Example of FC power and attenuation with a bad cable

In section “1” of the graphs, the receive attenuation on FC port fscsi2 was about -7.7 dBm, which means that of the 476.6 uW sent from the Brocade switchport:

$ sfpshow 1/10

Identifier:  3    SFP
Connector:   7    LC
Transceiver: 540c404000000000 200,400,800_MB/s M5,M6 sw Short_dist
Encoding:    1    8B10B
Baud Rate:   85   (units 100 megabaud)
Length 9u:   0    (units km)
Length 9u:   0    (units 100 meters)
Length 50u:  5    (units 10 meters)
Length 62.5u:2    (units 10 meters)
Length Cu:   0    (units 1 meter)
Vendor Name: BROCADE
Vendor OUI:  00:05:1e
Vendor PN:   57-1000012-01
Vendor Rev:  A
Wavelength:  850  (units nm)
Options:     003a Loss_of_Sig,Tx_Fault,Tx_Disable
BR Max:      0
BR Min:      0
Serial No:   UAF1112600001JW
Date Code:   110619
DD Type:     0x68
Enh Options: 0xfa
Status/Ctrl: 0x82
Alarm flags[0,1] = 0x5, 0x40
Warn Flags[0,1] = 0x5, 0x40
                                          Alarm                  Warn
                                      low        high       low         high
Temperature: 41      Centigrade     -10         90         -5          85
Current:     7.392   mAmps          1.000       17.000     2.000       14.000
Voltage:     3264.9  mVolts         2900.0      3700.0     3000.0      3600.0
RX Power:    -4.0    dBm (400.1 uW) 10.0   uW   1258.9 uW  15.8   uW   1000.0 uW
TX Power:    -3.2    dBm (476.6 uW) 125.9  uW   631.0  uW  158.5  uW   562.3  uW

only about 200 uW actually made it to the FC port fscsi2 on the VIO server. Section “2” shows even worse values during the time the FC connections and cables were checked, which basically means that the FC link was down during that time period. Section “3” shows the values after the bad cable was found and replaced. Receive attenuation on FC port fscsi2 went down to about -3.7 dBm, which means that of the now 473.6 uW sent from the Brocade switchport, 427.3 uW actually make it to the FC port fscsi2 on the VIO server.
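For comparing the dBm and uW figures from the switch and from efc_power, the conversion between the two units is handy. The following sketch uses the standard definition of dBm (decibels relative to 1 mW); the function names are made up for this example, and the values in the usage part are the approximate readings quoted above.

```python
import math


def dbm_to_microwatt(dbm):
    """Convert a power level in dBm to microwatts (0 dBm equals 1000 uW)."""
    return 1000.0 * 10 ** (dbm / 10.0)


def microwatt_to_dbm(microwatt):
    """Convert a power in microwatts to dBm."""
    return 10.0 * math.log10(microwatt / 1000.0)


def link_loss_db(tx_dbm, rx_dbm):
    """Loss on the link in dB: transmitted level minus received level."""
    return tx_dbm - rx_dbm


if __name__ == "__main__":
    switch_tx_dbm = -3.2  # TX Power reported by sfpshow on the switch port
    for label, rx_dbm in (("bad cable", -7.7), ("replaced cable", -3.7)):
        print(f"{label}: received {dbm_to_microwatt(rx_dbm):.1f} uW, "
              f"loss {link_loss_db(switch_tx_dbm, rx_dbm):.1f} dB")
```

Since the readings are taken off a graph, the computed figures will only roughly match the numbers read from the monitoring.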

The goal of continuously monitoring the fibre channel transmit and receive power and attenuation values is to catch slowly deteriorating situations early on, before they become a real issue or even a service interruption. As shown above, this can be accomplished with Ganglia and the two RPM packages ganglia-addons-aix and ganglia-addons-aix-scripts. For ad hoc checks, e.g. while debugging the components of a suspicious FC link, efc_power is still best called directly from the AIX or VIO server command line.

// AIX RPMs: GestioIP, net-snmp and netdisco-mibs

Following up on the previous posts on the topic of AIX RPM packages, here are three new AIX RPM packages related to network and IP address management (IPAM):

// AIX RPMs: Perl and CPAN Perl Modules

Following up on the previous posts on the topic of AIX RPM packages, here are several new AIX RPM packages related to Perl:
