bityard Blog

// Live Partition Mobility (LPM) with Debian on an IBM Power LPAR - Part 1

Live partition mobility (LPM) in the IBM Power environment is roughly the same as vMotion in the VMware ESX world. It allows an LPAR to be moved from one IBM Power hardware system to another without (major) interruption to the system running within the LPAR. We use this feature on a regular basis for our AIX LPARs and it has simplified our work to a great extent, since we no longer need downtimes for a lot of our regular administrative work. LPM also works for LPARs running Linux as an OS, but since IBM only supports the SuSE and Red Hat enterprise distributions, the service and productivity tools necessary to successfully perform LPM – and also DLPAR – operations are not readily available to users of the Debian distribution.

I still wanted to be able to do DLPAR and LPM operations on our Debian LPARs as well. To that end, I converted the necessary RPM packages from the service and productivity tools mentioned before to the DEB package format. Most of the conversion work was done by the alien tool provided by the Debian distribution. Beyond that, some manual patches were still necessary to adjust the components to the specifics of the Debian environment. A current version of the Linux kernel and a rebuild of the kernel package with some PPC specific options enabled were also necessary. Here are the individual steps:

  1. Install the prerequisite Debian packages:

    libstdc++5
    ksh
    uuid
    libsgutils1
    libsqlite3-0
    $ apt-get install libstdc++5 ksh uuid libsgutils1 libsqlite3-0
    

    The libsqlite3-0 package is currently only necessary for the libservicelog-1-1-1-32bit package. The packages libservicelog and servicelog, which also depend on libsqlite3-0, contain binaries and libraries that are built for a 64bit userland (ppc64), which is currently not available for Debian. Using the binaries or libraries from libservicelog or servicelog will therefore result in an error about unresolved symbols.

  2. Convert or download the IBM service and productivity tools:

    Option 1: Download the already converted IBM service and productivity tools:

    Filename  Filesize  Last modified
    devices.chrp.base.servicerm_2.3.0.0-11232_powerpc.deb  94.6 KiB  2013/10/06 18:43
    devices.chrp.base.servicerm_2.3.0.0-11232_powerpc.deb.sha1sum  96.0 B  2013/10/06 18:43
    dynamicrm_1.3.9-8_powerpc.deb  35.0 KiB  2013/10/06 18:43
    dynamicrm_1.3.9-8_powerpc.deb.sha1sum  72.0 B  2013/10/06 18:43
    librtas-32bit_1.3.6-5_powerpc.deb  88.6 KiB  2013/10/06 18:43
    librtas-32bit_1.3.6-5_powerpc.deb.sha1sum  76.0 B  2013/10/06 18:43
    librtas_1.3.6-4_powerpc.deb  120.7 KiB  2013/10/06 18:43
    librtas_1.3.6-4_powerpc.deb.sha1sum  70.0 B  2013/10/06 18:43
    libservicelog-1-1-1-32bit_1.1.11-10_powerpc.deb  111.1 KiB  2013/10/06 18:43
    libservicelog-1-1-1-32bit_1.1.11-10_powerpc.deb.sha1sum  90.0 B  2013/10/06 18:43
    libservicelog-1-1-1_1.1.11-10_powerpc.deb  113.6 KiB  2013/10/06 18:43
    libservicelog-1-1-1_1.1.11-10_powerpc.deb.sha1sum  84.0 B  2013/10/06 18:43
    libservicelog_1.1.11-10_powerpc.deb  14.5 KiB  2013/10/06 18:43
    libservicelog_1.1.11-10_powerpc.deb.sha1sum  78.0 B  2013/10/06 18:43
    libvpd2_2.1.3-4_powerpc.deb  330.6 KiB  2013/10/06 18:43
    libvpd2_2.1.3-4_powerpc.deb.sha1sum  70.0 B  2013/10/06 18:43
    lsvpd_1.6.11-5_powerpc.deb  860.8 KiB  2013/10/06 18:43
    lsvpd_1.6.11-5_powerpc.deb.sha1sum  69.0 B  2013/10/06 18:43
    powerpc-ibm-utils_1.2.12-1_powerpc.deb  216.3 KiB  2013/10/06 18:43
    powerpc-ibm-utils_1.2.12-1_powerpc.deb.sha1sum  81.0 B  2013/10/06 18:43
    rsct.core.utils_3.1.0.7-11278_powerpc.deb  858.3 KiB  2013/10/06 18:43
    rsct.core.utils_3.1.0.7-11278_powerpc.deb.sha1sum  84.0 B  2013/10/06 18:43
    rsct.core_3.1.0.7-11278_powerpc.deb  11.7 MiB  2013/10/06 18:43
    rsct.core_3.1.0.7-11278_powerpc.deb.sha1sum  78.0 B  2013/10/06 18:43
    servicelog_1.1.9-9_powerpc.deb  73.7 KiB  2013/10/06 18:43
    servicelog_1.1.9-9_powerpc.deb.sha1sum  73.0 B  2013/10/06 18:43
    src_1.3.1.1-11278_powerpc.deb  267.3 KiB  2013/10/06 18:43
    src_1.3.1.1-11278_powerpc.deb.sha1sum  72.0 B  2013/10/06 18:43
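Each package in Option 1 comes with a .sha1sum file, so the downloads can be checked before they are installed. The following is a minimal sketch, assuming the checksum files use the plain output format of the sha1sum utility (hash, two spaces, file name) and are stored in the same directory as the packages. The function name verify_debs is made up for this example:

```shell
# verify_debs DIR: check every *.deb.sha1sum file in DIR with sha1sum.
# Assumes the checksum files use the "HASH  FILENAME" format of sha1sum.
# The body runs in a subshell, so the caller's working directory is kept.
verify_debs() (
    cd "$1" || exit 1
    rc=0
    for sum in *.deb.sha1sum; do
        [ -e "$sum" ] || continue      # no checksum files found at all
        sha1sum -c "$sum" || rc=1      # prints "FILE: OK" or "FILE: FAILED"
    done
    exit $rc
)
```

Called as verify_debs with the download directory as its argument, it returns a non-zero exit status as soon as one package does not match its checksum.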

    Option 2: Convert the IBM service and productivity tools from RPM to DEB:

    1. Install the prerequisite Debian packages:

      alien
      $ apt-get install alien
      
    2. Download the necessary patch files:

      Filename  Filesize  Last modified
      DynamicRM.patch  3.7 KiB  2013/10/06 19:17
      devices.chrp.base.ServiceRM.patch  2.9 KiB  2013/10/06 19:17
      librtas-32bit.patch  453.0 B  2013/10/06 19:17
      librtas.patch  434.0 B  2013/10/06 19:17
      libservicelog-1_1-1-32bit.patch  582.0 B  2013/10/06 19:17
      libservicelog-1_1-1.patch  553.0 B  2013/10/06 19:17
      libservicelog.patch  522.0 B  2013/10/06 19:17
      libvpd2.patch  494.0 B  2013/10/06 19:17
      lsvpd.patch  789.0 B  2013/10/06 19:17
      rsct.core.patch  10.6 KiB  2013/10/06 19:17
      rsct.core.utils.patch  7.5 KiB  2013/10/06 19:17
      servicelog.patch  487.0 B  2013/10/06 19:17
      src.patch  7.2 KiB  2013/10/06 19:17
    3. Convert librtas:

      $ alien -gc librtas-32bit-1.3.6-4.ppc64.rpm
      $ patch -p 0 < librtas-32bit.patch
      $ cd librtas-32bit-1.3.6
      $ ./debian/rules binary
      
      $ alien -gc librtas-1.3.6-3.ppc64.rpm
      $ patch -p 0 < librtas.patch
      $ cd librtas-1.3.6
      $ ./debian/rules binary
      
    4. Convert src:

      $ alien -gc src-1.3.1.1-11277.ppc.rpm
      $ patch -p 0 < src.patch
      $ rm src-1.3.1.1/debian/postrm
      $ cd src-1.3.1.1
      $ perl -i -p -e 's/ppc64/powerpc/' ./debian/control
      $ ./debian/rules binary
      
    5. Convert RSCT core and utils:

      $ alien -gc rsct.core.utils-3.1.0.7-11277.ppc.rpm
      $ patch -p 0 < rsct.core.utils.patch
      $ cd rsct.core.utils-3.1.0.7
      $ ./debian/rules binary
      
      $ alien -gc rsct.core-3.1.0.7-11277.ppc.rpm
      $ patch -p 0 < rsct.core.patch
      $ cd rsct.core-3.1.0.7
      $ ./debian/rules binary
      
    6. Convert ServiceRM:

      $ alien -gc devices.chrp.base.ServiceRM-2.3.0.0-11231.ppc.rpm
      $ patch -p 0 < devices.chrp.base.ServiceRM.patch
      $ cd devices.chrp.base.ServiceRM-2.3.0.0
      $ ./debian/rules binary
      
    7. Convert DynamicRM:

      $ alien -gc DynamicRM-1.3.9-7.ppc64.rpm
      $ patch -p 0 < DynamicRM.patch
      $ cd DynamicRM-1.3.9
      $ ./debian/rules binary
      
    8. Convert lsvpd and libvpd:

      $ alien -gc libvpd2-2.1.3-3.ppc64.rpm
      $ patch -p 0 < libvpd2.patch
      $ cd libvpd2-2.1.3
      $ ./debian/rules binary
      
      $ alien -gc lsvpd-1.6.11-4.ppc64.rpm
      $ patch -p 0 < lsvpd.patch
      $ cd lsvpd-1.6.11
      $ ./debian/rules binary
      
    9. Build, package and/or install PowerPC Utils:

      Install from the sources at sourceforge.net or use the custom-built DEB package:

      Filename  Filesize  Last modified
      powerpc-ibm-utils_1.2.12-1_powerpc.deb  216.3 KiB  2013/10/06 18:43
      powerpc-ibm-utils_1.2.12-1_powerpc.deb.sha1sum  81.0 B  2013/10/06 18:43
    10. Convert libservicelog and servicelog:

      $ alien -gc libservicelog-1_1-1-1.1.11-9.ppc64.rpm
      $ patch -p 0 < libservicelog-1_1-1.patch
      $ cd libservicelog-1_1-1-1.1.11
      $ ./debian/rules binary
      
      $ alien -gc libservicelog-1.1.11-9.ppc64.rpm
      $ patch -p 0 < libservicelog.patch
      $ cd libservicelog-1.1.11
      $ ./debian/rules binary
      
      $ alien -gc libservicelog-1_1-1-32bit-1.1.11-9.ppc.rpm
      $ patch -p 0 < libservicelog-1_1-1-32bit.patch
      $ cd libservicelog-1_1-1-32bit-1.1.11
      $ ./debian/rules binary
      
      $ alien -gc servicelog-1.1.9-8.ppc64.rpm
      $ patch -p 0 < servicelog.patch
      $ cd servicelog-1.1.9
      $ ./debian/rules binary
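Most of the conversions in Option 2 follow the same four commands. As a sketch, they can be wrapped in a small helper that also returns to the starting directory after each build, since the commands above leave the shell inside the unpacked source directory. The function names rpm_srcdir and convert_rpm are made up for this example. The helper relies on the directory naming seen in the steps above (package name and version, without release and architecture), and it does not cover the extra steps needed for the src package:

```shell
# rpm_srcdir FILE.rpm: print the directory "alien -gc" unpacks into, e.g.
#   librtas-32bit-1.3.6-4.ppc64.rpm -> librtas-32bit-1.3.6
rpm_srcdir() (
    base=${1##*/}                  # strip a leading path
    base=${base%.rpm}              # strip the .rpm suffix
    base=${base%.*}                # strip the architecture (ppc, ppc64)
    printf '%s\n' "${base%-*}"     # strip the release number
)

# convert_rpm FILE.rpm FILE.patch: unpack, patch and build one package.
# Runs in a subshell, so the caller stays in the current directory.
convert_rpm() (
    dir=$(rpm_srcdir "$1")
    alien -gc "$1" &&
    patch -p 0 < "$2" &&
    cd "$dir" &&
    ./debian/rules binary
)
```

For example, convert_rpm lsvpd-1.6.11-4.ppc64.rpm lsvpd.patch would run the second half of step 8 and leave the resulting DEB package in the current directory.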
      
  3. Install the IBM service and productivity tools DEB packages:

    $ dpkg -i librtas_1.3.6-4_powerpc.deb librtas-32bit_1.3.6-5_powerpc.deb \
        src_1.3.1.1-11278_powerpc.deb rsct.core.utils_3.1.0.7-11278_powerpc.deb \
        rsct.core_3.1.0.7-11278_powerpc.deb devices.chrp.base.servicerm_2.3.0.0-11232_powerpc.deb \
        dynamicrm_1.3.9-8_powerpc.deb libvpd2_2.1.3-4_powerpc.deb lsvpd_1.6.11-5_powerpc.deb \
        powerpc-ibm-utils_1.2.12-1_powerpc.deb libservicelog-1-1-1-32bit_1.1.11-10_powerpc.deb \
        libservicelog_1.1.11-10_powerpc.deb servicelog_1.1.9-9_powerpc.deb
    
  4. Rebuild the stock Debian kernel package as described in HowTo Rebuild An Official Debian Kernel Package. I've confirmed that DLPAR and LPM work with at least the Debian kernel package versions 2.6.39-3~bpo60+1 and 3.2.46-1~bpo60+1. In the make menuconfig step, make sure the following kernel configuration options are selected:

    CONFIG_MIGRATION=y
    CONFIG_PPC_PSERIES=y
    CONFIG_PPC_SPLPAR=y
    CONFIG_LPARCFG=y
    CONFIG_PPC_SMLPAR=y
    CONFIG_PPC_RTAS=y
    CONFIG_RTAS_PROC=y
    CONFIG_NUMA=y
    # CONFIG_SPARSEMEM_VMEMMAP is not set
    CONFIG_MEMORY_HOTPLUG=y
    CONFIG_MEMORY_HOTPLUG_SPARSE=y
    CONFIG_MEMORY_HOTREMOVE=y
    CONFIG_ARCH_MEMORY_PROBE=y
    CONFIG_HOTPLUG_PCI=y
    CONFIG_HOTPLUG_PCI_RPA=y
    CONFIG_HOTPLUG_PCI_RPA_DLPAR=y

    Or use one of the following trimmed down kernel configuration files:

    Filename  Filesize  Last modified
    config-2.6.39-bpo.2-powerpc.config  54.6 KiB  2013/10/06 22:48
    config-3.2.0-0.bpo.4.ssb.1-powerpc64.config  57.3 KiB  2013/10/06 22:48

    Install the newly built kernel package, reboot and select the new kernel to be loaded.
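Before rebooting, the kernel configuration can be compared against the option list from step 4. This is a minimal sketch; check_lpm_kconfig is a made-up name, and the function only accepts options built in with "=y", exactly as they are listed above:

```shell
# check_lpm_kconfig FILE: report options from step 4 that are not set to "y".
# Returns 0 if the configuration matches the list, 1 otherwise.
check_lpm_kconfig() {
    cfg=$1
    rc=0
    for opt in MIGRATION PPC_PSERIES PPC_SPLPAR LPARCFG PPC_SMLPAR PPC_RTAS \
               RTAS_PROC NUMA MEMORY_HOTPLUG MEMORY_HOTPLUG_SPARSE \
               MEMORY_HOTREMOVE ARCH_MEMORY_PROBE HOTPLUG_PCI \
               HOTPLUG_PCI_RPA HOTPLUG_PCI_RPA_DLPAR; do
        if ! grep -q "^CONFIG_${opt}=y\$" "$cfg"; then
            echo "not enabled: CONFIG_${opt}"
            rc=1
        fi
    done
    # This option is listed as "not set" in step 4.
    if grep -q '^CONFIG_SPARSEMEM_VMEMMAP=y$' "$cfg"; then
        echo "should not be set: CONFIG_SPARSEMEM_VMEMMAP"
        rc=1
    fi
    return $rc
}
```

On a Debian system the configuration of an installed kernel is usually found in /boot, so a check of the running kernel would be check_lpm_kconfig "/boot/config-$(uname -r)".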

  5. A few minutes after the system has been started, DLPAR and LPM operations on the LPAR should be possible from the HMC. A good indication in the HMC GUI is a properly filled field in the “OS Version” column. From the HMC CLI you can check with:

    $ lspartition -dlpar
    ...
    <#108> Partition:<45*8231-E2D*06AB35T, ststnagios02.lan.ssbag, 10.8.32.46>
           Active:<1>, OS:<Linux/Debian, 3.2.0-0.bpo.4.ssb.1-powerUnknown, Unknown>, DCaps:<0x2c7f>, CmdCaps:<0x19, 0x19>, PinnedMem:<0>
    ...
    

    The LPAR should show up in the output and the value of DCaps should be different from 0x0.

    After a successful LPM operation the output of dmesg from within the OS should look like this:

    ...
    [539043.613297] calling ibm,suspend-me on cpu 4
    [539043.960651] EPOW <0x6240040000000b8 0x0 0x0>
    [539043.960665] ibmvscsi 30000003: Re-enabling adapter!
    [539043.961606] RTAS: event: 21, Type: EPOW, Severity: 1
    [539043.962920] ibmvscsi 30000002: Re-enabling adapter!
    [539044.175848] property parse failed in parse_next_property at line 230
    [539044.485745] ibmvscsi 30000002: partner initialization complete
    [539044.485815] ibmvscsi 30000002: host srp version: 16.a, host partition vios1-p730-222 (1), OS 3, max io 262144
    [539044.485892] ibmvscsi 30000002: Client reserve enabled
    [539044.485907] ibmvscsi 30000002: sent SRP login
    [539044.485964] ibmvscsi 30000002: SRP_LOGIN succeeded
    [539044.525723] ibmvscsi 30000003: partner initialization complete
    [539044.525779] ibmvscsi 30000003: host srp version: 16.a, host partition vios2-p730-222 (2), OS 3, max io 262144
    [539044.525884] ibmvscsi 30000003: Client reserve enabled
    [539044.525897] ibmvscsi 30000003: sent SRP login
    [539044.525943] ibmvscsi 30000003: SRP_LOGIN succeeded
    [539044.884514] property parse failed in parse_next_property at line 230
    ...
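The DCaps check from step 5 can be automated when many LPARs have to be verified. The sketch below is only based on the two-line record layout visible in the sample output above. The function name dcaps_report is made up, and since the restricted shell on the HMC may not offer awk, the filter is meant to run on an administration host that receives the lspartition output, for example over ssh:

```shell
# dcaps_report: read "lspartition -dlpar" output on stdin and print
# "HOSTNAME DCAPS STATUS" for every partition record.
dcaps_report() {
    awk '
        /Partition:</ {
            # the host name is the second comma separated field
            split($0, f, ", ")
            name = f[2]
        }
        /DCaps:</ {
            match($0, /DCaps:<[^>]*>/)
            caps = substr($0, RSTART + 7, RLENGTH - 8)
            state = (caps == "0x0") ? "NOT READY" : "ok"
            printf "%s %s %s\n", name, caps, state
        }
    '
}
```

Partitions reported as NOT READY have a DCaps value of 0x0 and are therefore not yet usable for DLPAR or LPM operations.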

Although this was done some time ago and several new versions of the packages from the service and productivity tools have been released since, DLPAR and LPM still work with this setup. There will be another installment of this post in the future with updated package versions. Another item on my ToDo list is to provide those components of the service and productivity tools which are available in source code as native Debian packages.

// Ganglia Performance Monitoring on IBM Power with AIX and LPM

Ganglia is a great open source performance monitoring tool. Like many others (e.g. Cacti, Munin, etc.) it uses RRDtool as a consolidated data storage tool. Unlike others, the main focus of Ganglia is on efficiently monitoring large scale distributed environments. Ganglia can very easily be used to do performance monitoring in IBM Power environments running AIX, Linux/PPC and VIO servers. This is thanks to the great effort of Michael Perzl, who maintains pre-built Ganglia RPM packages for AIX which also cover some of the AIX specific metrics. Ganglia can also very easily be used for application specific performance monitoring. Since this is a very extensive subject, it will be discussed in upcoming articles.

Since Ganglia is usually used in environments where there is a “grid” of one or more “clusters” containing a larger number of individual “hosts” to be monitored, it applies the same semantics to the way it builds its hierarchical views. Unfortunately there is no exact equivalent to those classification terms in the IBM Power world. I found the following mapping of terms to be the most useful:

Ganglia Terminology  IBM Power Terminology  Comment
Grid                 Grid                   Container entity of 1+ clusters or managed systems
Cluster              Managed System         Container entity of 1+ hosts or LPARs
Host                 Host, LPAR             Individual host running an OS

On the one hand this mapping makes sense, since you usually are interested either in the performance metrics of an individual host or in the relation between individual host metrics within the context of the managed system that is running those hosts. E.g. how much CPU time is an individual host using versus how much CPU time are all hosts on a managed system using, and in what distribution.

On the other hand this mapping turns out to be a bit problematic, since Ganglia expects a rather static assignment of hosts to clusters. In a traditional HPC environment a host is seldom moved from one cluster to another, which keeps the necessary configuration work at a manageable amount of administrative overhead. In an IBM Power environment with the above mapping of terms applied, an LPAR can easily be moved between different clusters, even more so since the introduction of LPM. To reduce the administrative overhead, the necessary configuration changes should be done automatically. Historical performance data of a host should be preserved, even when it moves between different clusters.

Also, by default Ganglia uses IP multicast for the necessary communication between the hosts and the Ganglia server. While this may be a viable method for large cluster setups in flat, unsegmented networks, it does not do so well in heavily segmented or firewalled network environments. Ganglia can be configured to instead use IP unicast for the communication between the hosts and the Ganglia server, but this also has some effect on the cluster to host mapping described above.

The Ganglia setup described below will meet the following design criteria:

  • Use of IP unicast communication with predefined UDP ports.

  • A “firewall-friendly” behaviour with regard to the Ganglia network communication.

  • Automatic reconfiguration of Ganglia in case an LPAR is moved from one managed system to another.

  • Preservation of the historical performance data in case an LPAR is moved from one managed system to another.

Prerequisites

For a Ganglia setup to fulfill the above design goals, some prerequisites have to be met:

  1. A working basic Ganglia server, with the following Ganglia RPM packages installed:

    ganglia-gmetad-3.4.0-1
    ganglia-gmond-3.4.0-1
    ganglia-gweb-3.5.2-1
    ganglia-lib-3.4.0-1

    If the OS of the Ganglia server is also to be monitored as a Ganglia client, install the Ganglia modules you see fit. In my case e.g.:

    ganglia-mod_ibmame-3.4.0-1
    ganglia-p6-mod_ibmpower-3.4.0-1
  2. A working Ganglia client setup on the hosts to be monitored, with the following minimal Ganglia RPM packages installed:

    ganglia-gmond-3.4.0-1
    ganglia-lib-3.4.0-1

    In addition to the minimal Ganglia RPM packages, you can install any number of additional Ganglia modules you might need. In my case e.g. for all the fully virtualized LPARs:

    ganglia-mod_ibmame-3.4.0-1
    ganglia-p6-mod_ibmpower-3.4.0-1

    and for all the VIO servers and LPARs that have network and/or fibre channel hardware resources assigned:

    ganglia-gmond-3.4.0-1
    ganglia-lib-3.4.0-1
    ganglia-mod_ibmfc-3.4.0-1
    ganglia-mod_ibmnet-3.4.0-1
    ganglia-p6-mod_ibmpower-3.4.0-1
  3. A convention of which UDP ports will be used for the communication between the Ganglia clients and the Ganglia server. Each managed system in your IBM Power environment will get its own gmond process on the Ganglia server. For the Ganglia clients on a single managed system to be able to communicate with the Ganglia server, an individual and unique UDP port will be used.

  4. Model name, serial number and name of all managed systems in your IBM Power environment.

  5. If necessary, a set of firewall rules that allow communication between the Ganglia clients and the Ganglia server along the lines of the previously defined UDP port convention.

With the information from the above items 3 and 4, create a three-way mapping table like this:

Managed System Name  Model Name_Serial Number  Ganglia Unicast UDP Port
P550-DC1-R01         8204-E8A_0000001          8301
P770-DC1-R05         9117-MMB_0000002          8302
P770-DC2-R35         9117-MMB_0004711          8303

The information in this mapping table will be used in the following examples, so here's a slightly more detailed explanation:

  • Three columns, “Managed System Name”, “Model Name_Serial Number” and “Ganglia Unicast UDP Port”.

  • Each row contains the information for one IBM Power managed system.

  • The field “Managed System Name” contains the name of the managed system, which is later on displayed in Ganglia. For ease of administration it should ideally be consistent with the name used in the HMC. In this example the naming convention is “Systemtype-Datacenter-Racknumber”.

  • The field “Model Name_Serial Number” contains the model and the S/N of the managed system, which are concatenated with “_” to a single string containing no whitespace. Model and S/N are substrings of the values that the command:

    $ lsattr  -El sys0 -a systemid -a modelname
    
    systemid  IBM,020000001 Hardware system identifier False
    modelname IBM,8204-E8A  Machine name               False
    

    reports.

  • The field “Ganglia Unicast UDP Port” contains the UDP port which was assigned to a specific IBM Power managed system by the convention mentioned above. In this example the UDP ports were simply allocated in an incremental, sequential order, starting at port 8301. It can be any UDP port that is available in your environment. You just have to be consistent about its use and be careful to have no duplicate assignments.
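The “Model Name_Serial Number” key can be derived from the lsattr output shown above. The following sketch assumes, based only on that one example, that the serial number is the last seven characters of the systemid value and that the model is the modelname value without the leading “IBM,”. The function name msys_key is made up for this example:

```shell
# msys_key: read "lsattr -El sys0 -a systemid -a modelname" output on stdin
# and print the "Model Name_Serial Number" key, e.g. 8204-E8A_0000001.
msys_key() {
    awk '
        # assumption: the serial number is the last 7 characters of systemid
        $1 == "systemid"  { sn = substr($2, length($2) - 6) }
        # the model is the modelname value without the "IBM," prefix
        $1 == "modelname" { sub(/^IBM,/, "", $2); model = $2 }
        END { if (model != "" && sn != "") print model "_" sn }
    '
}
```

On an LPAR this would be called as lsattr -El sys0 -a systemid -a modelname | msys_key, and the result can be compared with the second column of the mapping table.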

Configuration

  1. Create a filesystem and a directory hierarchy for the Ganglia RRD database files to be stored in. In this example the already existing /ganglia filesystem is used to create the following directories:

    /ganglia
    /ganglia/rrds
    /ganglia/rrds/LPARS
    /ganglia/rrds/P550-DC1-R01
    /ganglia/rrds/P770-DC1-R05
    /ganglia/rrds/P770-DC2-R35
    ...
    /ganglia/rrds/<Managed System Name>

    Make sure all the directories are owned by the user under which the gmond processes will be running. In my case this is the user nobody.

    Later on, the directory /ganglia/rrds/LPARS will contain subdirectories for each LPAR, which in turn contain the RRD files storing the performance data for each LPAR. The directories /ganglia/rrds/<Managed System Name> will only contain a symbolic link for each LPAR, pointing to the actual LPAR directory within /ganglia/rrds/LPARS, e.g.:

    $ ls -ald /ganglia/rrds/*/saixtest.mydomain
    
    drwxr-xr-x 2 nobody [...] /ganglia/rrds/LPARS/saixtest.mydomain
    lrwxrwxrwx 1 nobody [...] /ganglia/rrds/P550-DC1-R01/saixtest.mydomain -> ../LPARS/saixtest.mydomain
    lrwxrwxrwx 1 nobody [...] /ganglia/rrds/P770-DC1-R05/saixtest.mydomain -> ../LPARS/saixtest.mydomain
    lrwxrwxrwx 1 nobody [...] /ganglia/rrds/P770-DC2-R35/saixtest.mydomain -> ../LPARS/saixtest.mydomain
    

    In addition – and this is the actual reason for the whole symbolic link orgy – Ganglia will automatically create a subdirectory /ganglia/rrds/<Managed System Name>/__SummaryInfo__, which contains the RRD files used to generate the aggregated summary performance metrics of each managed system. If the above setup were simplified by placing the symbolic links one level higher in the directory hierarchy, each managed system would use the same __SummaryInfo__ subdirectory, rendering the RRD files in it useless for summary purposes.

    This directory and symbolic link setup might seem strange or even unnecessary at first, but it plays a major role in getting Ganglia to play along with LPM. So it's important to set this up accordingly.
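The directory hierarchy from step 1 can be created in one go. This is a small sketch with a made-up function name, make_rrd_tree. It only creates the directories and leaves the ownership change to a separate command:

```shell
# make_rrd_tree ROOT NAME...: create ROOT/LPARS and one directory per
# managed system name below ROOT.
make_rrd_tree() {
    root=$1
    shift
    mkdir -p "$root/LPARS" || return 1
    for msys in "$@"; do
        mkdir -p "$root/$msys" || return 1
    done
}
```

With the example data this would be make_rrd_tree /ganglia/rrds P550-DC1-R01 P770-DC1-R05 P770-DC2-R35, followed by chown -R nobody /ganglia/rrds if the processes run as the user nobody.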

  2. Set up a gmond server process for each managed system in your IBM Power environment. In this example we need three processes for the three managed systems mentioned above. On the Ganglia server, create a gmond configuration file for each managed system that can send performance data to the Ganglia server. Note that only the essential parts of the gmond configuration are shown in the following code snippets:

    /etc/ganglia/gmond-p550-dc1-r01.conf
    globals {
      ...
      mute = yes
      ...
    }
     
    cluster {
      name = "P550-DC1-R01"
      ...
    }
     
    udp_recv_channel {
      port = 8301
    }
     
    tcp_accept_channel {
      port = 8301
    }
    /etc/ganglia/gmond-p770-dc1-r05.conf
    globals {
      ...
      mute = yes
      ...
    }
     
    cluster {
      name = "P770-DC1-R05"
      ...
    }
     
    udp_recv_channel {
      port = 8302
    }
     
    tcp_accept_channel {
      port = 8302
    }
    /etc/ganglia/gmond-p770-dc2-r35.conf
    globals {
      ...
      mute = yes
      ...
    }
     
    cluster {
      name = "P770-DC2-R35"
      ...
    }
     
    udp_recv_channel {
      port = 8303
    }
     
    tcp_accept_channel {
      port = 8303
    }

    Copy the stock gmond init script for each gmond server process:

    $ for MSYS in p550-dc1-r01 p770-dc1-r05 p770-dc2-r35; do
      cp -pi /etc/rc.d/init.d/gmond /etc/rc.d/init.d/gmond-${MSYS}
    done
    

    and edit the following lines:

    /etc/rc.d/init.d/gmond-p550-dc1-r01
    ...
    PIDFILE=/var/run/gmond-p550-dc1-r01.pid
    ...
    GMOND_CONFIG=/etc/ganglia/gmond-p550-dc1-r01.conf
    ...
    /etc/rc.d/init.d/gmond-p770-dc1-r05
    ...
    PIDFILE=/var/run/gmond-p770-dc1-r05.pid
    ...
    GMOND_CONFIG=/etc/ganglia/gmond-p770-dc1-r05.conf
    ...
    /etc/rc.d/init.d/gmond-p770-dc2-r35
    ...
    PIDFILE=/var/run/gmond-p770-dc2-r35.pid
    ...
    GMOND_CONFIG=/etc/ganglia/gmond-p770-dc2-r35.conf
    ...

    and start the gmond server processes. Check for running processes and UDP ports being listened on:

    $ for MSYS in p550-dc1-r01 p770-dc1-r05 p770-dc2-r35; do
      /etc/rc.d/init.d/gmond-${MSYS} start
    done
    
    $ ps -ef | grep gmond-
    nobody  4260072 [...] /opt/freeware/sbin/gmond -c /etc/ganglia/gmond-p770-dc2-r35.conf -p /var/run/gmond-p770-dc2-r35.pid
    nobody  6094872 [...] /opt/freeware/sbin/gmond -c /etc/ganglia/gmond-p770-dc1-r05.conf -p /var/run/gmond-p770-dc1-r05.pid
    nobody  6291514 [...] /opt/freeware/sbin/gmond -c /etc/ganglia/gmond-p550-dc1-r01.conf -p /var/run/gmond-p550-dc1-r01.pid
    
    $ netstat -na | grep 830
    tcp4       0      0  *.8301   *.*   LISTEN
    tcp4       0      0  *.8302   *.*   LISTEN
    tcp4       0      0  *.8303   *.*   LISTEN
    udp4       0      0  *.8301   *.*
    udp4       0      0  *.8302   *.*
    udp4       0      0  *.8303   *.*
    

    If you want the gmond server processes to be started automatically on startup of the Ganglia server system, create the appropriate “S” and “K” symbolic links in /etc/rc.d/rc2.d/. E.g.:

    $ for MSYS in p550-dc1-r01 p770-dc1-r05 p770-dc2-r35; do
      ln -s /etc/rc.d/init.d/gmond-${MSYS} /etc/rc.d/rc2.d/Sgmond-${MSYS}
      ln -s /etc/rc.d/init.d/gmond-${MSYS} /etc/rc.d/rc2.d/Kgmond-${MSYS}
    done
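Copying and editing the init scripts in step 2 can also be done in a single pass. The sketch below assumes that the stock init script defines PIDFILE and GMOND_CONFIG at the beginning of a line, as the edited lines shown above suggest. It writes to a new file instead of editing in place, since the sed shipped with AIX may not support in-place editing. The function name make_gmond_init and the file mode 755 are choices made for this example:

```shell
# make_gmond_init STOCK DEST MSYS: write a copy of the stock init script to
# DEST with PIDFILE and GMOND_CONFIG pointing at the files of system MSYS.
make_gmond_init() {
    stock=$1 dest=$2 msys=$3
    sed -e "s|^PIDFILE=.*|PIDFILE=/var/run/gmond-${msys}.pid|" \
        -e "s|^GMOND_CONFIG=.*|GMOND_CONFIG=/etc/ganglia/gmond-${msys}.conf|" \
        "$stock" > "$dest" &&
    chmod 755 "$dest"
}
```

For the first managed system the call would be make_gmond_init /etc/rc.d/init.d/gmond /etc/rc.d/init.d/gmond-p550-dc1-r01 p550-dc1-r01.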
    
  3. Set up the gmetad server process to query each gmond server process. On the Ganglia server, edit the gmetad configuration file:

    /etc/ganglia/gmetad.conf
    ...
    data_source "P550-DC1-R01" 60 localhost:8301
    data_source "P770-DC1-R05" 60 localhost:8302
    data_source "P770-DC2-R35" 60 localhost:8303
    ...
    rrd_rootdir "/ganglia/rrds"
    ...

    and restart the gmetad process.
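With more than a handful of managed systems, the data_source lines from step 3 can be generated from the mapping table instead of being typed by hand. This is a sketch with a made-up function name, gmetad_sources. It expects one “name port” pair per line and uses the same 60 second polling interval and localhost address as the example above:

```shell
# gmetad_sources: read "NAME PORT" pairs on stdin and print one
# data_source line per managed system, polling every 60 seconds.
gmetad_sources() {
    while read -r name port; do
        [ -n "$name" ] || continue      # skip empty lines
        printf 'data_source "%s" 60 localhost:%s\n' "$name" "$port"
    done
}
```

The output is meant to be pasted into /etc/ganglia/gmetad.conf, it does not modify the file itself.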

  4. Create a DNS or /etc/hosts entry named ganglia which points to the IP address of your Ganglia server. Make sure the name can be resolved on all the Ganglia client LPARs in your environment.

  5. Install the aaa_base and ganglia-addons-base RPM packages on all the Ganglia client LPARs in your environment.

    RPM packages

    Filename  Filesize  Last modified
    ganglia-addons-base-0.1-5.ppc-aix5.1.rpm  10.4 KiB  2013/06/16 16:08
    ganglia-addons-base-0.1-5.ppc-aix5.1.rpm.sha1sum  83.0 B  2013/06/16 16:08

    Source RPM packages

    Filename  Filesize  Last modified
    ganglia-addons-base-0.1-5.src.rpm  10.2 KiB  2013/06/16 16:08
    ganglia-addons-base-0.1-5.src.rpm.sha1sum  76.0 B  2013/06/16 16:08

    The ganglia-addons-base package installs several extensions to the stock ganglia-gmond package which are:

    • a modified gmond client init script /opt/freeware/libexec/ganglia-addons-base/gmond.

    • a configuration file /etc/ganglia/gmond.init.conf for the modified gmond client init script.

    • a configuration template file /etc/ganglia/gmond.tpl.

    • a run_parts script 800_ganglia_check_links.sh for the Ganglia server to perform a sanity check on the symbolic links in the /ganglia/rrds/<managed system>/ directories.

    Edit or create /etc/ganglia/gmond.init.conf to reflect your environment and roll it out to all the Ganglia client LPARs. E.g. with the example data from above:

    /etc/ganglia/gmond.init.conf
    # MSYS_ARRAY:
    MSYS_ARRAY["8204-E8A_0000001"]="P550-DC1-R01"
    MSYS_ARRAY["9117-MMB_0000002"]="P770-DC1-R05"
    MSYS_ARRAY["9117-MMB_0004711"]="P770-DC2-R35"
    #
    # GPORT_ARRAY:
    GPORT_ARRAY["P550-DC1-R01"]="8301"
    GPORT_ARRAY["P770-DC1-R05"]="8302"
    GPORT_ARRAY["P770-DC2-R35"]="8303"
    #
    # Exception for the ganglia host:
    G_HOST="ganglia"
    G_PORT_R="8399"

    Stop the currently running gmond client process, switch the gmond client init script to the modified version from ganglia-addons-base and start a new gmond client process:

    $ /etc/rc.d/init.d/gmond stop
    $ ln -sf /opt/freeware/libexec/ganglia-addons-base/gmond /etc/rc.d/init.d/gmond
    $ /etc/rc.d/init.d/gmond start
    

    The modified gmond client init script will now determine the model and S/N of the managed system the LPAR is running on. It will also evaluate the configuration from /etc/ganglia/gmond.init.conf and, in conjunction with the model and S/N of the managed system, determine values for the Ganglia configuration entries “cluster → name”, “host → location” and “udp_send_channel → port”. The new configuration values will be merged with the configuration template file /etc/ganglia/gmond.tpl and be written to a temporary gmond configuration file. If the temporary and current gmond configuration files differ, the current gmond configuration file will be overwritten by the temporary one and the gmond client process will be restarted. If the temporary and current gmond configuration files match, the temporary one will be deleted and no further action is taken.

    The aaa_base package will add an errnotify ODM entry, which will call the shell script /opt/freeware/libexec/aaa_base/post_lpm.sh. This shell script will trigger actions upon the generation of errpt messages with the label CLIENT_PMIG_DONE, which signify the end of a successful LPM operation. Currently there are two actions performed in the shell script, one of which is to restart the gmond client process on the LPAR. Since this restart is done by calling the gmond client init script with the restart parameter, a new gmond configuration file will be created along the lines of the sequence explained in the previous paragraph. With the new configuration file, the freshly started gmond will now send its performance data to the gmond server process that is responsible for the managed system the LPAR is now running on.
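The “render, compare, replace” sequence of the modified init script described in step 5 can be illustrated as follows. This is not the code from the ganglia-addons-base package, only a sketch of the described behaviour. The placeholders @CLUSTER@ and @PORT@ and the function name update_gmond_conf are assumptions made for this example, the real gmond.tpl may use a different substitution scheme:

```shell
# update_gmond_conf TEMPLATE CONF CLUSTER PORT
# Render TEMPLATE into a temporary file and install it as CONF only if the
# content changed. Returns 0 if CONF was replaced (gmond needs a restart)
# and 1 if nothing changed.
# @CLUSTER@ and @PORT@ are hypothetical placeholders for this sketch.
update_gmond_conf() {
    tpl=$1 conf=$2 cluster=$3 port=$4
    tmp="${conf}.tmp.$$"
    sed -e "s|@CLUSTER@|${cluster}|g" -e "s|@PORT@|${port}|g" \
        "$tpl" > "$tmp" || return 2
    if [ -f "$conf" ] && cmp -s "$tmp" "$conf"; then
        rm -f "$tmp"        # identical: discard the temporary file
        return 1
    fi
    mv "$tmp" "$conf"       # different or missing: install the new file
}
```

An init script built around this would restart gmond only when the function returns 0, which is what keeps a restart after LPM cheap when the LPAR ended up on the same managed system.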

  6. After some time – usually a few minutes – new directories should start to appear under the /ganglia/rrds/<managed system>/ directories on the Ganglia server. To complete the setup those directories have to be moved to /ganglia/rrds/LPARS/ and be replaced by appropriate symbolic links. E.g.:

    $ cd /ganglia/rrds/
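    $ HST="saixtest.mydomain"; for DR in P550-DC1-R01 P770-DC1-R05 P770-DC2-R35; do
      pushd ${DR}
      # Move only real directories, never an already existing symbolic link
      [[ -d ${HST} && ! -L ${HST} ]] && mv -i ${HST} ../LPARS/
      # Create the symbolic link unless something with that name already exists
      [[ -e ${HST} || -L ${HST} ]] || ln -s ../LPARS/${HST} ${HST}
      popd
    done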
    

    There is a helper script 800_ganglia_check_links.sh in the ganglia-addons-base package which should be run daily to report possible inconsistencies, but it can also be run interactively from a shell on the Ganglia server.
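To give an idea of what such a sanity check can look for, here is a minimal sketch. It is not the 800_ganglia_check_links.sh script from the package, and the function name check_rrd_links is made up. It reports every entry below a managed system directory that is a real directory instead of a symbolic link, skipping the LPARS directory and the __SummaryInfo__ subdirectories:

```shell
# check_rrd_links ROOT: list entries below the managed system directories
# in ROOT that are real directories instead of symbolic links.
check_rrd_links() {
    root=$1
    rc=0
    for entry in "$root"/*/*; do
        case "$entry" in
            "$root"/LPARS/*|*/__SummaryInfo__) continue ;;
        esac
        if [ -d "$entry" ] && [ ! -L "$entry" ]; then
            echo "not a symbolic link: $entry"
            rc=1
        fi
    done
    return $rc
}
```

A call like check_rrd_links /ganglia/rrds prints nothing and returns 0 when every host directory has already been moved to LPARS and replaced by a link.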

  7. If everything was set up correctly, the LPARs should now report to the Ganglia server based on which managed system they currently run on. This should be reflected in the Ganglia web interface by every LPAR being inserted under the appropriate managed system (aka cluster). After a successful LPM operation, the LPAR should – after a short delay – automatically be sorted in under the new managed system (aka cluster) in the Ganglia web interface.

Debugging

If some problem arises or things generally do not go according to plan, here are some hints where to look for issues:

  • Check if a gmetad and all gmond server processes are running on the Ganglia server. Also check with netstat or lsof if the gmond server processes are listening on the correct UDP ports.

  • Check if the gmond configuration file on the Ganglia client LPARs is successfully written. Check if the content of /etc/ganglia/gmond.conf is accurate with regard to your environment. Especially check if the TCP/UDP ports on the Ganglia client LPAR match the ones on the Ganglia server.

  • Check if the Ganglia client LPARs can reach the Ganglia server (e.g. with ping). On the Ganglia server check if UDP packets are coming in from the LPAR in question (e.g. with tcpdump). On the Ganglia client LPAR, check if UDP packets are sent out to the correct Ganglia server IP address (e.g. with tcpdump).

  • Check if the directory or symbolic link /ganglia/rrds/<managed system>/<hostname>/ exists. Check if there are RRD files in the directory. Check if they've recently been written.

  • Enable logging and increase the debug level on the Ganglia client and server processes, restart them and check for error messages.
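The check for recently written RRD files from the list above can be scripted. This sketch uses the -mmin test of find, which is not part of POSIX but is available in GNU find. Whether the find on your Ganglia server supports it is something to verify first. The function name stale_rrds is made up for this example:

```shell
# stale_rrds DIR MINUTES: list RRD files below DIR that were not modified
# within the last MINUTES minutes. -H makes find follow DIR itself if it
# is a symbolic link, as the per-system host entries are.
stale_rrds() {
    find -H "$1" -type f -name '*.rrd' ! -mmin "-$2"
}
```

For example, stale_rrds /ganglia/rrds/P550-DC1-R01/saixtest.mydomain 5 lists the metrics of that LPAR which have not been updated during the last five minutes. No output means all RRD files are current.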

Conclusions

With the described configuration changes and procedures, the already existing value of Ganglia as a performance monitoring tool for the IBM Power environment can be increased even further. A more “firewall-friendly” Ganglia configuration is possible and has been used for several years in a heavily segmented and firewalled network environment. As new features like LPM are added to the IBM Power environment, increasing its overall flexibility and productivity, measures also have to be taken to ensure that surrounding tools and infrastructure like Ganglia are up to the task.

// LPM Performance with 802.3ad Etherchannel on VIOS

Due to new network gear (Cisco Nexus 5596) being available in the datacenters, I recently switched most of the 1Gbps links on our IBM Power systems over to 10Gbps links. This freed up a lot of interfaces, patch panel ports and switch ports. So I thought it would be a nice idea to give some of the now free 1Gbps network links to the VIOS management interfaces, configure an 802.3ad etherchannel and thus aggregate the multiple links. The plan was to have some additional bandwidth available for the resource intensive data transfer during live partition mobility (LPM) operations. Another 10Gbps link for each VIOS would probably have been the better solution, but it's hard to justify the additional investment “just” for the use case of LPM traffic.

While I was aware that a single LPM operation would not benefit from the etherchannel (one sender, one receiver, one used link), I figured that in case of multiple parallel LPM operations the additional bandwidth would speed up the data transfer almost linearly with the number of links in the etherchannel.

In a test setup I configured four VIOS on two IBM Power systems to each have an etherchannel consisting of two 1Gbps links:

root@vios1:/$ ifconfig -a
en27: flags=1e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>
    inet 192.168.1.1 netmask 0xffffff00 broadcast 192.168.1.255
      tcp_sendspace 262144 tcp_recvspace 262144 tcp_nodelay 1 rfc1323 1

root@vios1:/$ lsattr -EHl ent27
attribute       value          description                                     user_settable

adapter_names   ent0,ent2      EtherChannel Adapters                           True
alt_addr        0x000000000000 Alternate EtherChannel Address                  True
auto_recovery   yes            Enable automatic recovery after failover        True
backup_adapter  NONE           Adapter used when whole channel fails           True
hash_mode       src_port       Determines how outgoing adapter is chosen       True
interval        long           Determines interval value for IEEE 802.3ad mode True
mode            8023ad         EtherChannel mode of operation                  True
netaddr         0              Address to ping                                 True
noloss_failover yes            Enable lossless failover after ping failure     True
num_retries     3              Times to retry ping before failing              True
retry_time      1              Wait time (in seconds) between pings            True
use_alt_addr    no             Enable Alternate EtherChannel Address           True
use_jumbo_frame no             Enable Gigabit Ethernet Jumbo Frames            True

All three values of the etherchannel's hash_mode attribute (dst_port, src_port and src_dst_port) were tested in a scenario with four parallel LPM processes. The four LPARs used for the LPM tests had 32GB memory assigned and were only moderately active at the time.

Even with a rough timing measurement taken by hand, it pretty soon became clear that the transfer rate was still hovering slightly below 1Gbps. A look at the performance counters of the interfaces involved in the etherchannel before the LPM operation:

root@vios1:/$ entstat -d ent27 | grep "Bytes: "
Bytes: 16892                    Bytes: 12594          # Etherchannel: ent27
Bytes: 11958                    Bytes: 8402           # Physical interface: ent0
Bytes: 5100                     Bytes: 4262           # Physical interface: ent2

and after the LPM operation:

root@vios1:/$ entstat -d ent27 | grep "Bytes: "
Bytes: 12018850788              Bytes: 210768159909   # Etherchannel: ent27
Bytes: 137243                   Bytes: 210767988779   # Physical interface: ent0
Bytes: 12018713727              Bytes: 171200         # Physical interface: ent2

confirmed the suspicion that only one physical link was actually used for the data transfer, no matter what hash_mode was chosen.
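To put a number on this, the counter differences can be worked out from the two entstat outputs above. The following is only a small sketch: it assumes that the left column of the entstat output holds the transmit counters and the right column the receive counters, and the names used (`BEFORE`, `AFTER`, `link_shares`) are made up for this example.

```python
# Byte counters copied from the two entstat outputs above, as
# (transmit, receive) per physical interface. The column order is an
# assumption based on the usual entstat layout.
BEFORE = {"ent0": (11958, 8402), "ent2": (5100, 4262)}
AFTER = {"ent0": (137243, 210767988779), "ent2": (12018713727, 171200)}


def link_shares(before, after, column):
    """Return each interface's share of the bytes moved between two samples.

    column 0 selects the transmit counter, column 1 the receive counter.
    """
    deltas = {
        name: after[name][column] - before[name][column] for name in before
    }
    total = sum(deltas.values())
    return {name: delta / total for name, delta in deltas.items()}


if __name__ == "__main__":
    for label, column in (("transmit", 0), ("receive", 1)):
        shares = link_shares(BEFORE, AFTER, column)
        for name, share in sorted(shares.items()):
            print(f"{label:8s} {name}: {share:.4%}")
```

In each direction practically all of the bytes end up on a single physical interface, which matches the observed transfer rate of just under 1Gbps.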

Looking further into the processes and network connections involved in the LPM operation at the VIOS level showed that four migmover processes were started on the source and destination VIOS, one process for each LPM operation. Each migmover process opened two network connections from the source VIOS (192.168.1.1) to the destination VIOS (192.168.1.2), which in the netstat output looked something like this:

root@vios1:/$ netstat -na | grep EST
tcp4       0      0  192.168.1.1.32788     192.168.1.2.32791    ESTABLISHED     # Socket pair 1
tcp4       0  13888  192.168.1.1.32789     192.168.1.2.32792    ESTABLISHED     # Socket pair 2
tcp4       0      0  192.168.1.1.32790     192.168.1.2.32793    ESTABLISHED     # Socket pair 3
tcp4       0  12768  192.168.1.1.32791     192.168.1.2.32794    ESTABLISHED     # Socket pair 4
tcp4       0      0  192.168.1.1.32792     192.168.1.2.32795    ESTABLISHED     # Socket pair 5
tcp4   18824   9744  192.168.1.1.32793     192.168.1.2.32796    ESTABLISHED     # Socket pair 6
tcp4       0      0  192.168.1.1.32794     192.168.1.2.32797    ESTABLISHED     # Socket pair 7
tcp4       0   5264  192.168.1.1.32795     192.168.1.2.32798    ESTABLISHED     # Socket pair 8

The first socket pair (numbers 1, 3, 5, 7) of each migmover process did not see much data transfer and thus seemed to be some sort of control channel. The second socket pair (numbers 2, 4, 6, 8), on the other hand, saw a lot of data transfer and thus seemed to be responsible for the actual LPM data transfer. The TCP source and destination ports for the very first socket pair seemed to start at a random value somewhere in the ephemeral port range (see footnote 1). All the subsequently created socket pairs seemed to have TCP source and destination ports that were incremented by one from the ones used in the previously allocated socket pair.

The combination of the three factors:

  • sequential port allocation from a random starting point in the ephemeral range

  • alternating port allocation for control and data channels

  • same port allocation strategy on the source and the destination VIOS

leads to the observed behaviour of only half the number of physical links being fully used in an etherchannel configuration with an even number of interfaces. Due to this TCP port allocation strategy, all control channels consistently get placed on one physical link, while all data channels consistently get placed on the next physical link, regardless of the hash_mode being used.
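The effect can be reproduced with a few lines of Python. This is a simplified model and not the actual AIX implementation: it assumes that the outgoing adapter is picked as a port-derived value modulo the number of links, and that for src_dst_port the two port numbers are added and halved first. The function names and the starting ports (taken from the netstat output above) are only there for illustration.

```python
def pick_link(src_port, dst_port, hash_mode, n_links):
    """Simplified adapter selection: port-derived key modulo link count."""
    if hash_mode == "src_port":
        key = src_port
    elif hash_mode == "dst_port":
        key = dst_port
    elif hash_mode == "src_dst_port":
        # Assumption: ports are added, then divided by two (integer part).
        key = (src_port + dst_port) // 2
    else:
        raise ValueError(f"unknown hash_mode: {hash_mode}")
    return key % n_links


def migmover_sockets(first_src, first_dst, n_migrations):
    """Model the observed allocation: control and data sockets alternate,
    and every new socket uses the previous ports incremented by one."""
    sockets = []
    for i in range(2 * n_migrations):
        role = "control" if i % 2 == 0 else "data"
        sockets.append((role, first_src + i, first_dst + i))
    return sockets


def links_by_role(hash_mode, n_links=2, first_src=32788, first_dst=32791,
                  n_migrations=4):
    """Return the set of links used by control and by data sockets."""
    used = {"control": set(), "data": set()}
    for role, src, dst in migmover_sockets(first_src, first_dst, n_migrations):
        used[role].add(pick_link(src, dst, hash_mode, n_links))
    return used


if __name__ == "__main__":
    for mode in ("src_port", "dst_port", "src_dst_port"):
        print(mode, links_by_role(mode))
```

Under these assumptions, with two links all four data sockets land on the same link for every hash_mode, while the control sockets share the other one.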

After figuring this out, i opened a PMR with IBM support, which pretty soon was transferred to development. After some rather fruitless back and forth, the developer finally agreed to accept a DCR (MR030413328) to implement an alternative TCP port allocation strategy for the LPM data transfer:

User Marketing Field Requirement Number: MR030413328
*Title: LPM performance enhancement with etherchannel 802.3ad
*Description: When the Mover Service Partition managing LPM is using an Etherchannel Adapter built on 2 physical adapters and this Etherchannel is configured in 8023ad mode, the port selection of LPM process makes that all the data connection to the same physical adapter and the control connection to the other adapter. This cause an highly unbalanced usage of those adapters. (The data connection will manage nearly all data transfered, where control connection only handle a little)

While LPM operation, mover is first creating the control socket and then the data socket. Hence this is more than possible that src_data = src_ctrl + 1 and dst_data = dst_ctrl + 1, which will cause in an etherchannel build on 2 adapters, configured in 8023ad mode, that all control socket use the same adapter and all data socket will use the same other adapter. (The parity of “src_data + dst_data” will be the same as “(src_ctrl + 1) + (dst_ctrl + 1)” )

I would suggest to enforce the parity of the port number depending on the fact we are the source or destination side mover service partition:
- data and control socket with same port parity if we are the source mover :
→ i.e. : data port 63122 and control port 63120
or : data port 63113 and control port 63115
- data and control socket with a different port parity if we are the destination mover :
→ i.e. : data port 63122 and control port 63121
or : data port 63113 and control port 63114

This should cause a more balanced load on etherchannel adapter.

*Priority: Medium
*Requested Completion Date: 01.01.2014
(Original Requested Date) 01.01.2014
*Resolving Pipe: AIX, LoP, System p Hardware
*Resolving Release: AIX
*Resolving Entity: Virtualization

*IBM Business Justification (Include SIEBEL number for a priority): This is a problem that has been seen many time, and this will happen for any customer with VIOS configured with etherchannel.
*Requirement Type Performance
Suggested Solution:
Source: Customer
Problem Management Record (PMR):
PMR Number: xxxxx, Branch Code xxx, Country Code xxx
Other Related Information:
Environment: VIOS Level is 2.2.2.1

By far the most interesting part of this DCR is the item “IBM Business Justification”! Remind me again, why wasn't this already fixed if it has been seen as an issue many times? Well, fingers crossed the code change will make it into a VIOS release this time …

1)
or - if configured - between the values of tcp_port_low and tcp_port_high of the vioslpm0 device