bityard Blog

// Check_MK Monitoring - SAP Cloud Connector

The SAP Cloud Connector provides a service which is used to connect on-premise systems – non-SAP, SAP ECC and SAP HANA – with applications running on the SAP Cloud Platform. This article introduces a new Check_MK agent and several service checks to monitor the status of the SAP Cloud Connector and its connections to the SAP Cloud Platform.

For the impatient and those who prefer a TL;DR, here is the Check_MK package of the SAP Cloud Connector monitoring checks:

SAP Cloud Connector monitoring checks (Compatible with Check_MK versions 1.4.0p19 and later)

The sources can be found in my Check_MK repository on GitHub.


Monitoring the SAP Cloud Connector can be done in two different, not mutually exclusive, ways. The first approach uses the traditional application monitoring features, already built into Check_MK, like:

  • the presence and count of the application processes

  • the reachability of the application's TCP ports

  • the validity of the SSL Certificates

  • queries to the application's health-check URL

The first approach is covered by the section Built-in Check_MK Monitoring below.

The second approach uses a new Check_MK agent and several service checks, dedicated to monitoring the internal status of the SAP Cloud Connector and its connections to the SAP Cloud Platform. The new Check_MK agent uses the monitoring API provided by the SAP Cloud Connector in order to monitor the application-specific states and metrics. The monitoring endpoints on the SAP Cloud Connector currently used by this Check_MK agent are:

  • “List of Subaccounts” (URL: https://<scchost>:<sccport>/api/monitoring/subaccounts)

  • “List of Open Connections” (URL: https://<scchost>:<sccport>/api/monitoring/connections/backends)

  • “Performance Monitor Data” (URL: https://<scchost>:<sccport>/api/monitoring/performance/backends)

  • “Top Time Consumers” (URL: https://<scchost>:<sccport>/api/monitoring/performance/toptimeconsumers)
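
As a quick illustration of these endpoints, here is a minimal sketch that queries the “List of Subaccounts” endpoint with the Python requests library, using HTTP basic authentication with an SAP Cloud Connector application user as described in the prerequisites below. Hostname, port and the credentials are placeholders, and certificate verification is disabled because the SAP Cloud Connector usually ships with a self-signed certificate:

import json
import requests

# Placeholder values - replace with your SCC host, port and monitoring user
response = requests.get(
    "https://scchost:8443/api/monitoring/subaccounts",
    auth=("monitoringuser", "secret"),
    verify=False,   # the SCC usually runs with a self-signed certificate
    timeout=30,
)
response.raise_for_status()
print(json.dumps(response.json(), indent=2))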

At the time of writing, there unfortunately is no monitoring endpoint on the SAP Cloud Connector for the Most Recent Requests metric. This metric is currently only available via the SAP Cloud Connector's WebUI. The Most Recent Requests metric would be a much more interesting and useful metric than the currently available Top Time Consumers or Performance Monitor Data, both of which have limitations. The application requests covered by the Top Time Consumers metric need a manual acknowledgement inside the SAP Cloud Connector in order to reset events with the longest request runtime, which limits the metric's usability for external monitoring tools. The Performance Monitor Data metric aggregates the application requests into buckets based on their overall runtime. By itself this can be useful for external monitoring tools and is in fact used by the Check_MK agent covered in this article. In the process of runtime bucket aggregation though, the Performance Monitor Data metric hides the much more useful breakdown of each request into runtime subsections (“External (Back-end)”, “Open Connection”, “Internal (SCC)”, “SSO Handling” and “Latency Effects”). Hopefully the Most Recent Requests metric will in the future also be exposed via the monitoring API provided by the SAP Cloud Connector. The new Check_MK agent can then be extended to use the newly exposed metric in order to gain a more fine-grained insight into the runtime of application requests through the SAP Cloud Connector.

The second approach is covered by the section SAP Cloud Connector Agent below.

Built-in Check_MK Monitoring

Application Processes

To monitor the SAP Cloud Connector process, use the standard Check_MK check “State and count of processes”. This can be found in the WATO WebUI under:

-> Manual Checks
   -> Applications, Processes & Services
      -> State and count of processes
         -> Create rule in folder ...
            -> Rule Options
               Description: Process monitoring of the SAP Cloud Connector
               Checktype: [ps - State and Count of Processes]
               Process Name: SAP Cloud Connector

            -> Parameters
               [x] Process Matching
               [Exact name of the process without arguments]
               [/opt/sapjvm_8/bin/java]

               [x] Name of the operating system user
               [Exact name of the operating system user]
               [sccadmin]

               [x] Levels for process count
               Critical below [1] processes
               Warning below [1] processes
               Warning above [1] processes
               Critical above [2] processes

            -> Conditions
               Folder [The folder containing the SAP Cloud Connector systems]
               and/or
               Explicit hosts [x]
               Specify explicit host names [SAP Cloud Connector systems]
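
For illustration only, the following small Python sketch (using the psutil library, it is not part of the Check_MK check itself) applies the same matching criteria and process count levels as the rule above:

import psutil

SCC_BINARY = "/opt/sapjvm_8/bin/java"   # exact name of the process without arguments
SCC_USER = "sccadmin"                   # operating system user of the SAP Cloud Connector

count = sum(
    1
    for p in psutil.process_iter(attrs=["exe", "username"])
    if p.info.get("exe") == SCC_BINARY and p.info.get("username") == SCC_USER
)

# Levels from the rule above: CRIT below 1, WARN below 1, WARN above 1, CRIT above 2
if count < 1 or count > 2:
    print("CRIT - %d SAP Cloud Connector processes running" % count)
elif count > 1:
    print("WARN - %d SAP Cloud Connector processes running" % count)
else:
    print("OK - %d SAP Cloud Connector process running" % count)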

Application Health Check

To implement a rudimentary monitoring of the SAP Cloud Connector application health, use the standard Check_MK check “Check HTTP service” to query the Health Check endpoint of the monitoring API provided by the SAP Cloud Connector. The “Check HTTP service” can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Active checks (HTTP, TCP, etc.)
      -> Check HTTP service
         -> Create rule in folder ...
            -> Rule Options
               Description: SAP Cloud Connector (SCC)

            -> Check HTTP service
               Name: SAP SCC
               [x] Check the URL
               [x] URI to fetch (default is /)
               [/exposed?action=ping]

               [x] TCP Port
               [8443]

               [x] Use SSL/HTTPS for the connection:
               [Use SSL with auto negotiation]

            -> Conditions
               Folder [The folder containing the SAP Cloud Connector systems]
               and/or
               Explicit hosts [x]
               Specify explicit host names [SAP Cloud Connector systems]
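
Outside of Check_MK, the health check endpoint can also be queried manually for troubleshooting. The following minimal sketch does the same thing as the active check configured above, assuming the placeholder hostname "scchost", the default WebUI port 8443 and a self-signed certificate:

import requests

response = requests.get(
    "https://scchost:8443/exposed?action=ping",
    verify=False,   # self-signed certificate on the SCC
    timeout=30,
)
# The health check endpoint should answer with HTTP 200 while the SCC is up
print("SAP CC health check: HTTP %d" % response.status_code)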

SSL Certificates

To monitor the validity of the SSL certificate of the SAP Cloud Connector WebUI, use the standard Check_MK check “Check HTTP service”. The “Check HTTP service” can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Active checks (HTTP, TCP, etc.)
      -> Check HTTP service
         -> Create rule in folder ...
            -> Rule Options
               Description: SAP Cloud Connector (SCC)

            -> Check HTTP service
               Name: SAP SCC Certificate
               [x] Check SSL Certificate Age
               Warning at or below [60] days
               Critical at or below [30] days

               [x] TCP Port
               [8443]

            -> Conditions
               Folder [The folder containing the SAP Cloud Connector systems]
               and/or
               Explicit hosts [x]
               Specify explicit host names [SAP Cloud Connector systems]
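
For troubleshooting, the certificate age can also be checked outside of Check_MK. The following sketch fetches the WebUI certificate from the placeholder host "scchost" on port 8443 and applies the thresholds from the rule above; it assumes the Python cryptography library is available:

import ssl
from datetime import datetime

from cryptography import x509
from cryptography.hazmat.backends import default_backend

# Fetch the certificate presented by the SCC WebUI (no verification is performed)
pem = ssl.get_server_certificate(("scchost", 8443))
cert = x509.load_pem_x509_certificate(pem.encode("ascii"), default_backend())
remaining = cert.not_valid_after - datetime.utcnow()

# Thresholds from the rule above: WARN at or below 60 days, CRIT at or below 30 days
if remaining.days <= 30:
    print("CRIT - certificate expires in %d days" % remaining.days)
elif remaining.days <= 60:
    print("WARN - certificate expires in %d days" % remaining.days)
else:
    print("OK - certificate expires in %d days" % remaining.days)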

SAP Cloud Connector Agent

The new Check_MK package to monitor the status of the SAP Cloud Connector and its connections to the SAP Cloud Platform consists of three major parts – an agent plugin, two check plugins and several auxiliary files and plugins (WATO plugins, Perf-o-meter plugins, metrics plugins and man pages).

Prerequisites

The following prerequisites are necessary in order for the SAP Cloud Connector agent to work properly:

  • A SAP Cloud Connector application user must be created for the Check_MK agent to be able to authenticate against the SAP Cloud Connector and gain access to the protected monitoring API endpoints. See the article SAP Cloud Connector - Configuring Multiple Local Administrative Users on how to create a new application user.

  • A DNS alias or an additional IP address for the SAP Cloud Connector service.

  • An additional host in Check_MK for the SAP Cloud Connector service with the previously created DNS alias or IP address.

  • Installation of the Python requests library on the Check_MK server. This library is used in the Check_MK agent plugin agent_sapcc to perform the authentication and the HTTP requests against the monitoring API of the SAP Cloud Connector. On e.g. RHEL based systems it can be installed with:

    root@host:# yum install python-requests
    
  • Installation of the new Check_MK package for the SAP Cloud Connector monitoring checks on the Check_MK server.

SAP Cloud Connector Agent Plugin

The Check_MK agent plugin agent_sapcc is responsible for querying the endpoints of the monitoring API on the SAP Cloud Connector, which are described above. It transforms the data returned from the monitoring endpoints into a format digestible by Check_MK. The following example shows the – anonymized and abbreviated – agent plugin output for a SAP Cloud Connector system:

<<<check_mk>>>
Version: 0.1

<<<sapcc_connections_backends:sep(59)>>>
subaccounts,abcdefghi,locationID;Test Location
subaccounts,abcdefghi,regionHost;hana.ondemand.com
subaccounts,abcdefghi,subaccount;abcdefghi

<<<sapcc_performance_backends:sep(59)>>>
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,1,minimumCallDurationMs;10
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,1,numberOfCalls;1
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,2,minimumCallDurationMs;20
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,2,numberOfCalls;36
[...]
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,20,minimumCallDurationMs;3000
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,21,minimumCallDurationMs;4000
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,22,minimumCallDurationMs;5000
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,name;PROTOCOL/sapecc.example.com:44300
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,protocol;PROTOCOL
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,virtualHost;sapecc.example.com
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,virtualPort;44300
subaccounts,abcdefghi,locationID;Test Location
subaccounts,abcdefghi,regionHost;hana.ondemand.com
subaccounts,abcdefghi,sinceTime;2019-02-13T08:05:36.084 +0100
subaccounts,abcdefghi,subaccount;abcdefghi

<<<sapcc_performance_toptimeconsumers:sep(59)>>>
subaccounts,abcdefghi,locationID;Test Location
subaccounts,abcdefghi,regionHost;hana.ondemand.com
subaccounts,abcdefghi,requests,0,externalTime;373
subaccounts,abcdefghi,requests,0,id;932284302
subaccounts,abcdefghi,requests,0,internalBackend;sapecc.example.com:PORT
subaccounts,abcdefghi,requests,0,openRemoteTime;121
subaccounts,abcdefghi,requests,0,protocol;PROTOCOL
subaccounts,abcdefghi,requests,0,receivedBytes;264
subaccounts,abcdefghi,requests,0,resource;/sap-webservice-url/
subaccounts,abcdefghi,requests,0,sentBytes;4650
subaccounts,abcdefghi,requests,0,startTime;2019-02-13T11:31:59.113 +0100
subaccounts,abcdefghi,requests,0,totalTime;536
subaccounts,abcdefghi,requests,0,user;RFC_USER
subaccounts,abcdefghi,requests,0,virtualBackend;sapecc.example.com:PORT
subaccounts,abcdefghi,requests,1,externalTime;290
subaccounts,abcdefghi,requests,1,id;1882731830
subaccounts,abcdefghi,requests,1,internalBackend;sapecc.example.com:PORT
subaccounts,abcdefghi,requests,1,latencyTime;77
subaccounts,abcdefghi,requests,1,openRemoteTime;129
subaccounts,abcdefghi,requests,1,protocol;PROTOCOL
subaccounts,abcdefghi,requests,1,receivedBytes;264
subaccounts,abcdefghi,requests,1,resource;/sap-webservice-url/
subaccounts,abcdefghi,requests,1,sentBytes;4639
subaccounts,abcdefghi,requests,1,startTime;2019-02-13T11:31:59.114 +0100
subaccounts,abcdefghi,requests,1,totalTime;532
subaccounts,abcdefghi,requests,1,user;RFC_USER
subaccounts,abcdefghi,requests,1,virtualBackend;sapecc.example.com:PORT
[...]
subaccounts,abcdefghi,requests,49,externalTime;128
subaccounts,abcdefghi,requests,49,id;1774317106
subaccounts,abcdefghi,requests,49,internalBackend;sapecc.example.com:PORT
subaccounts,abcdefghi,requests,49,protocol;PROTOCOL
subaccounts,abcdefghi,requests,49,receivedBytes;263
subaccounts,abcdefghi,requests,49,resource;/sap-webservice-url/
subaccounts,abcdefghi,requests,49,sentBytes;4660
subaccounts,abcdefghi,requests,49,startTime;2019-02-16T11:32:09.352 +0100
subaccounts,abcdefghi,requests,49,totalTime;130
subaccounts,abcdefghi,requests,49,user;RFC_USER
subaccounts,abcdefghi,requests,49,virtualBackend;sapecc.example.com:PORT
subaccounts,abcdefghi,sinceTime;2019-02-13T08:05:36.085 +0100
subaccounts,abcdefghi,subaccount;abcdefghi

<<<sapcc_subaccounts:sep(59)>>>
subaccounts,abcdefghi,displayName;Test Application
subaccounts,abcdefghi,locationID;Test Location
subaccounts,abcdefghi,regionHost;hana.ondemand.com
subaccounts,abcdefghi,subaccount;abcdefghi
subaccounts,abcdefghi,tunnel,applicationConnections,abcdefg:hijklmnopqr,connectionCount;8
subaccounts,abcdefghi,tunnel,applicationConnections,abcdefg:hijklmnopqr,name;abcdefg:hijklmnopqr
subaccounts,abcdefghi,tunnel,applicationConnections,abcdefg:hijklmnopqr,type;JAVA
subaccounts,abcdefghi,tunnel,connectedSince;2019-02-14T10:11:00.630 +0100
subaccounts,abcdefghi,tunnel,connections;8
subaccounts,abcdefghi,tunnel,state;Connected
subaccounts,abcdefghi,tunnel,user;P123456
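
The semicolon-separated key paths shown above are simply a flattened representation of the nested JSON documents returned by the monitoring API. The agent plugin performs this transformation internally; the following rough sketch merely illustrates the concept and is not the actual plugin code:

def flatten(prefix, node):
    """Flatten nested dicts/lists into 'key,path;value' lines."""
    if isinstance(node, dict):
        for key in sorted(node):
            flatten(prefix + [str(key)], node[key])
    elif isinstance(node, list):
        for index, item in enumerate(node):
            flatten(prefix + [str(index)], item)
    else:
        print("%s;%s" % (",".join(prefix), node))

# Tiny, hypothetical excerpt of a subaccount document returned by the API
data = {"subaccounts": {"abcdefghi": {"tunnel": {"connections": 8, "state": "Connected"}}}}
flatten([], data)
# subaccounts,abcdefghi,tunnel,connections;8
# subaccounts,abcdefghi,tunnel,state;Connected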

The agent plugin comes with a Check_MK check plugin of the same name, which is solely responsible for constructing the command line arguments from the WATO configuration and passing them to the Check_MK agent plugin.

With the additional WATO plugin sapcc_agent.py it is possible to configure the username and password for the SAP Cloud Connector application user which is used to connect to the monitoring API. It is also possible to configure the TCP port and the connection timeout for the connection to the monitoring API through the WATO WebUI and thus override the default values. The default value for the TCP port is 8443, the default value for the connection timeout is 30 seconds. The configuration options for the Check_MK agent plugin agent_sapcc can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Datasource Programs
      -> SAP Cloud Connector systems
         -> Create rule in folder ...
            -> Rule Options
               Description: SAP Cloud Connector (SCC)

            -> SAP Cloud Connector systems
               SAP Cloud Connector user name: [username]
               SAP Cloud Connector password: [password]
               SAP Cloud Connector TCP port: [8443]

            -> Conditions
               Folder [The folder containing the SAP Cloud Connector systems]
               and/or
               Explicit hosts [x]
               Specify explicit host names [SAP Cloud Connector systems]

After saving the new rule, restarting Check_MK and doing an inventory on the additional host for the SAP Cloud Connector service in Check_MK, several new services starting with the name prefix SAP CC should appear.

The following image shows a status output example from the WATO WebUI with the service checks HTTP SAP SCC TLS and HTTP SAP SCC TLS Certificate from the Built-in Check_MK Monitoring described above. In addition to those, the example also shows the service checks based on the data from the SAP Cloud Connector Agent. The service checks SAP CC Application Connection, SAP CC Subaccount and SAP CC Tunnel are provided by the check plugin sapcc_subaccounts, the service check SAP CC Perf Backend is provided by the plugin sapcc_performance_backends:

Status output example for the complete monitoring of the SAP Cloud Connector

SAP Cloud Connector Subaccount

The check plugin sapcc_subaccounts implements the three sub-checks sapcc_subaccounts.app_conn, sapcc_subaccounts.info and sapcc_subaccounts.tunnel.

Info

The sub-check sapcc_subaccounts.info just gathers information on several configuration options for each subaccount on the SAP Cloud Connector and displays them in the status details of the check. These configuration options are:

  • subaccount name on the SAP Cloud Platform to which the connection is made.

  • display name of the subaccount.

  • location ID of the subaccount.

  • region host of the SAP Cloud Platform to which the SAP Cloud Connector establishes a connection.

The sub-check sapcc_subaccounts.info always returns an OK status. No performance data is currently reported by this check.

Tunnel

The sub-check sapcc_subaccounts.tunnel is responsible for the monitoring of each tunnel connection for each subaccount on the SAP Cloud Connector. Upon inventory this sub-check creates a service check for each tunnel connection found on the SAP Cloud Connector. During normal check execution, the status of the tunnel connection is determined for each inventorized item. If the tunnel connection is not in the Connected state, an alarm is raised accordingly. Additionally, the number of currently active connections over a tunnel as well as the elapsed time in seconds since the tunnel connection was established are determined for each inventorized item. If either the value of the currently active connections or the number of seconds since the connection was established is above or below the configured warning and critical threshold values, an alarm is raised accordingly. For both values, performance data is reported by the check.

With the additional WATO plugin sapcc_subaccounts.py it is possible to configure the warning and critical levels for the sub-check sapcc_subaccounts.tunnel through the WATO WebUI and thus override the following default values:

Metric                   Warning Low Threshold   Critical Low Threshold   Warning High Threshold   Critical High Threshold
Number of connections    0                       0                        30                       40
Connection duration      0 sec                   0 sec                    284012568 sec            315569520 sec
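
Conceptually, the threshold logic of the sapcc_subaccounts.tunnel sub-check with the default levels from the table above looks roughly like the following sketch. This is a simplified illustration, not the actual plugin code; the real check also reports performance data and takes its levels from the WATO rule described below:

def check_levels(value, crit_low, warn_low, warn_high, crit_high):
    """Return a Check_MK-style state: 0 = OK, 1 = WARN, 2 = CRIT."""
    if value <= crit_low or value >= crit_high:
        return 2
    if value <= warn_low or value >= warn_high:
        return 1
    return 0

def check_tunnel(state, connections, age_seconds):
    if state != "Connected":
        return 2, "tunnel state is %s" % state
    status = max(
        check_levels(connections, 0, 0, 30, 40),                 # number of connections
        check_levels(age_seconds, 0, 0, 284012568, 315569520),   # connection duration
    )
    return status, "%d connections, connected for %d seconds" % (connections, age_seconds)

print(check_tunnel("Connected", 8, 86400))   # -> (0, '8 connections, connected for 86400 seconds')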

The configuration options for the tunnel connection levels can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Applications, Processes & Services
         -> SAP Cloud Connector Subaccounts
            -> Create Rule in Folder ...
               -> Rule Options
                  Description: SAP Cloud Connector Subaccounts

               -> Parameters
                  [x] Number of tunnel connections
                      Warning if equal or below [0] connections
                      Critical if equal or below [0] connections
                      Warning if equal or above [30] connections
                      Critical if equal or above [40] connections
                  [x] Connection time of tunnel connections 
                      Warning if equal or below [0] seconds
                      Critical if equal or below [0] seconds
                      Warning if equal or above [284012568] seconds
                      Critical if equal or above [315569520] seconds

               -> Conditions
                  Folder [The folder containing the SAP Cloud Connector systems]
                  and/or
                  Explicit hosts [x]
                  Specify explicit host names [SAP Cloud Connector systems]
                  and/or
                  Application Or Tunnel Name [x]
                  Specify explicit values [Tunnel name]

The above image with a status output example from the WATO WebUI shows one sapcc_subaccounts.tunnel service check as the last of the displayed items. The service name is prefixed by the string SAP CC Tunnel and followed by the subaccount name, which in this example is anonymized. For each tunnel connection the connection state, the overall number of application connections currently active over the tunnel, the time when the tunnel connection was established and the number of seconds elapsed since establishing the connection are shown. The overall number of currently active application connections is also visualized in the perf-o-meter, with a logarithmic scale growing from the left to the right.

The following image shows an example of the two metric graphs for a single sapcc_subaccounts.tunnel service check:

Example metric graph for a single sapcc_subaccounts.tunnel service check

The upper graph shows the time elapsed since the tunnel connection was established. The lower graph shows the overall number of application connections currently active over the tunnel connection. Both graphs would show the warning and critical threshold values, which in this example are currently outside the displayed range of values for the y-axis.

Application Connection

The sub-check sapcc_subaccounts.app_conn is responsible for the monitoring of each application's connections through each tunnel connection for each subaccount on the SAP Cloud Connector. Upon inventory this sub-check creates a service check for each application connection found on the SAP Cloud Connector. During normal check execution, the number of currently active connections for each application is determined for each inventorized item. If the value of the currently active connections is above or below the configured warning and critical threshold values, an alarm is raised accordingly. For the number of currently active connections, performance data is reported by the check.

With the additional WATO plugin sapcc_subaccounts.py it is possible to configure the warning and critical levels for the sub-check sapcc_subaccounts.app_conn through the WATO WebUI and thus override the following default values:

Metric                   Warning Low Threshold   Critical Low Threshold   Warning High Threshold   Critical High Threshold
Number of connections    0                       0                        30                       40

The configuration options for the application connection levels can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Applications, Processes & Services
         -> SAP Cloud Connector Subaccounts
            -> Create Rule in Folder ...
               -> Rule Options
                  Description: SAP Cloud Connector Subaccounts

               -> Parameters
                  [x] Number of application connections
                      Warning if equal or below [0] connections
                      Critical if equal or below [0] connections
                      Warning if equal or above [30] connections
                      Critical if equal or above [40] connections

               -> Conditions
                  Folder [The folder containing the SAP Cloud Connector systems]
                  and/or
                  Explicit hosts [x]
                  Specify explicit host names [SAP Cloud Connector systems]
                  and/or
                  Application Or Tunnel Name [x]
                  Specify explicit values [Application name]

The above image with a status output example from the WATO WebUI shows one sapcc_subaccounts.app_conn service check as the fifth item from the top of the displayed items. The service name is prefixed by the string SAP CC Application Connection and followed by the application name, which in this example is anonymized. For each application connection the number of currently active connections and the connection type are shown. The number of currently active application connections is also visualized in the perf-o-meter, with a logarithmic scale growing from the left to the right.

The following image shows an example of the metric graph for a single sapcc_subaccounts.app_conn service check:

Example metric graph for a single sapcc_subaccounts.app_conn service check

The graph shows the number of currently active application connections. The graph would show the warning and critical threshold values, which in this example are currently outside the displayed range of values for the y-axis.

SAP Cloud Connector Performance Backends

The check sapcc_performance_backends is responsible for the monitoring of the performance of each (on-premise) backend system connected to the SAP Cloud Connector. Upon inventory this check creates a service check for each backend connection found on the SAP Cloud Connector. During normal check execution, the number of requests to the backend system, categorized into one of the 22 runtime buckets, is determined for each inventorized item. From these raw values, the request rate in requests per second is derived for each of the 22 runtime buckets. Also from the raw values, the following four additional metrics are derived:

  • calls_total: the total request rate over all of the 22 runtime buckets.

  • calls_pct_ok: the relative number of requests in percent with a runtime below a given runtime warning threshold.

  • calls_pct_warn: the relative number of requests in percent with a runtime above a given runtime warning threshold.

  • calls_pct_crit: the relative number of requests in percent with a runtime above a given runtime critical threshold.

If the relative number of requests is above the configured warning or critical threshold value, an alarm is raised accordingly. Performance data is reported by the check for each of the 22 runtime buckets, for the total number of requests and for the relative number of requests (calls_pct_ok, calls_pct_warn, calls_pct_crit).
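
The following sketch illustrates how the derived percentage metrics could be computed from the per-bucket call counters found in the agent output above. It is a simplified illustration, not the actual plugin code; in particular, the derivation of rates from the raw counters over the check interval is omitted, and the share of calls between the two runtime thresholds is counted as the "warning" share here, so that the three percentages add up to 100%:

def backend_percentages(buckets, warn_ms=500, crit_ms=1000):
    """buckets: list of (minimumCallDurationMs, numberOfCalls) tuples."""
    total = sum(calls for _, calls in buckets)
    if total == 0:
        return 0, 100.0, 0.0, 0.0
    ok = sum(calls for min_ms, calls in buckets if min_ms < warn_ms)
    warn = sum(calls for min_ms, calls in buckets if warn_ms <= min_ms < crit_ms)
    crit = sum(calls for min_ms, calls in buckets if min_ms >= crit_ms)
    return total, 100.0 * ok / total, 100.0 * warn / total, 100.0 * crit / total

# Hypothetical example: 1 call in the >=10ms bucket, 36 calls in the >=20ms
# bucket and 2 calls in the >=1000ms bucket
print(backend_percentages([(10, 1), (20, 36), (1000, 2)]))
# -> (39, 94.87..., 0.0, 5.12...)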

With the additional WATO plugin sapcc_performance_backends.py it is possible to configure the warning and critical levels for the check sapcc_performance_backends through the WATO WebUI and thus override the following default values:

Metric                                                    Warning Threshold   Critical Threshold
Request runtime                                           500 msec            1000 msec
Percentage of requests over request runtime thresholds    10%                 5%

The configuration options for the backend performance levels can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Applications, Processes & Services
         -> SAP Cloud Connector Backend Performance
            -> Create Rule in Folder ...
               -> Rule Options
                  Description: SAP Cloud Connector Backend Performance

               -> Parameters
                  [x] Runtime bucket definition and calls per bucket in percent
                      Warning if percentage of calls in warning bucket equal or above [10.00] %
                      Assign calls to warning bucket if runtime equal or above [500] milliseconds
                      Critical if percentage of calls in critical bucket equal or above [5.00] %
                      Assign calls to critical bucket if runtime equal or above [1000] milliseconds

               -> Conditions
                  Folder [The folder containing the SAP Cloud Connector systems]
                  and/or
                  Explicit hosts [x]
                  Specify explicit host names [SAP Cloud Connector systems]
                  and/or
                  Backend Name [x]
                  Specify explicit values [Backend name]

The above image with a status output example from the WATO WebUI shows one sapcc_performance_backends service check as the sixth item from the top of the displayed items. The service name is prefixed by the string SAP CC Perf Backend and followed by a string concatenated from the protocol, FQDN and TCP port of the backend system, which in this example is anonymized. For each backend connection the total number of requests, the total request rate, the percentage of requests below the runtime warning threshold, the percentage of requests above the runtime warning threshold and the percentage of requests above the runtime critical threshold are shown. The relative number of requests in percent is also visualized in the perf-o-meter.

The following image shows an example of the metric graph for the total request rate from the sapcc_performance_backends service check:

Example metric graph for the total request rate from the sapcc_performance_backends service check

The following image shows an example of the metric graph for the relative number of requests from the sapcc_performance_backends service check:

Example metric graph for the relative number of requests from the sapcc_performance_backends service check

The graph shows the percentage of requests below the runtime warning threshold in the color green at the bottom, the percentage of requests above the runtime warning threshold in the color yellow stacked above and the percentage of requests above the runtime critical threshold in the color red stacked at the top.

The following image shows an example of the combined metric graphs for the request rates to a single backend system in each of the 22 runtime buckets from the sapcc_performance_backends service check:

Example combined metric graphs for the request rates to a single backend system in each of the 22 runtime buckets from the sapcc_performance_backends service check

To provide a better overview, the individual metrics are grouped together into three graphs. The first graph shows the request rate in the runtime buckets >=10ms, >=20ms, >=30ms, >=40ms, >=50ms, >=75ms and >=100ms. The second graph shows the request rate in the runtime buckets >=125ms, >=150ms, >=200ms, >=300ms, >=400ms, >=500ms, >=750ms and >=1000ms. The third and last graph shows the request rate in the runtime buckets >=1250ms, >=1500ms, >=2000ms, >=2500ms, >=3000ms, >=4000ms and >=5000ms.

The following image shows an example of the individual metric graphs for the request rates to a single backend system in each of the 22 runtime buckets from the sapcc_performance_backends service check:

Example individual metric graphs for the request rates to a single backend system in each of the 22 runtime buckets from the sapcc_performance_backends service check

Each of the metric graphs shows exactly the same data as the previously shown combined graphs. The combined metric graphs are actually based on the individual metric graphs for the request rates to a single backend system.

Conclusion

The newly introduced checks for the SAP Cloud Connector enable you to monitor several application-specific aspects of the SAP Cloud Connector with your Check_MK server. The combination of built-in Check_MK monitoring facilities and a new agent plugin for the SAP Cloud Connector complement each other in this regard. While the new SAP Cloud Connector agent plugin for Check_MK utilizes most of the data provided by the monitoring endpoints on the SAP Cloud Connector, a more in-depth monitoring could be achieved if the data from the Most Recent Requests metric were also exposed over the monitoring API of the SAP Cloud Connector. I hope this will be the case in a future release of the SAP Cloud Connector.

I hope you find the provided new checks useful and enjoyed reading this blog post. Please don't hesitate to drop me a note if you have any suggestions or run into any issues with the provided checks.

// Experiences with Dell PowerConnect Switches

This blog post is going to be about my recent experiences with Broadcom FASTPATH based Dell PowerConnect M-Series M8024-k and M6348 switches. Especially with their various limitations and – in my opinion – sometimes buggy behaviour.


Recently I was given the opportunity to build a new and central storage and virtualization environment from the ground up. This involved a set of hardware systems which – unfortunately – were chosen and purchased previously, before I came on board with the project.

System environment

Specifically those hardware components were:

  • Multiple Dell PowerEdge M1000e blade chassis

  • Multiple Dell PowerEdge M-Series blade servers, all equipped with Intel X520 network interfaces for LAN connectivity through fabric A of the blade chassis. Servers with additional central storage requirements were also equipped with QLogic QME/QMD8262 or QLogic/Broadcom BCM57810S iSCSI HBAs for SAN connectivity through fabric B of the blade chassis.

  • Multiple Dell PowerConnect M8024-k switches in fabric A of the blade chassis forming the LAN network. Those were configured and interconnected as a stack of switches. Each stack of switches had two uplinks, one to each of two carrier grade Cisco border routers. Since the network edge was between those two border routers on the one side and the stack of M8024-k switches on the other side, the switch stack was also used as a layer 3 device and was thus running the default gateways of the local network segments provided to the blade servers.

  • Multiple Dell PowerConnect M6348 switches, which were connected through aggregated links to the stack of M8024-k switches described above. These switches were exclusively used to provide a LAN connection for external, standalone servers and devices through their external 1 GBit ethernet interfaces. The M6348 switches were located in the slots belonging to fabric C of the blade chassis.

  • Multiple Dell PowerConnect M8024-k switches in fabric B of the blade chassis forming the SAN network. In contrast to the M8024-k LAN switches, the M8024-k SAN switches were configured and interconnected as individual switches. Since there was no need for outside SAN connectivity, the M8024-k switches in fabric B ran a flat layer 2 network without any layer 3 configuration.

  • Initially all PowerConnect switches – M8024-k both LAN and SAN and M6348 – ran the firmware version 5.1.8.2.

  • Multiple Dell EqualLogic PS Series storage systems, providing central block storage capacity for the PowerEdge M-Series blade servers via iSCSI over the SAN mentioned above. Some blade chassis based PS Series models (PS-M4110) were internally connected to the SAN formed by the M8024-k switches in fabric B. Other standalone PS Series models were connected to the same SAN utilizing the external ports of the M8024-k switches.

  • Multiple Dell EqualLogic FS Series file server appliances, providing central NFS and CIFS storage capacity over the LAN mentioned above. In the back-end those FS Series file server appliances also used the block storage capacity provided by the PS Series storage systems via iSCSI over the SAN mentioned above. Both LAN and SAN connections of the EqualLogic FS Series were made through the external ports of the M8024-k switches.

There were multiple locations with roughly the same setup composed of the hardware components described above. Each location had two daisy-chained Dell PowerEdge M1000e blade chassis systems. The layer 2 LAN and SAN networks stretched over the two blade chassis. The setup at each location is shown in the following schematic:

Schematic of the Dell PowerConnect LAN and SAN setup

All in all not an ideal setup. Instead, I would have preferred a pair of capable – both functionality- and performance-wise – central top-of-rack switches to which the individual M1000e blade chassis would have been connected. Preferably a separate pair for LAN and SAN connectivity. But again, the mentioned components were already preselected and pre-purchased.

During the implementation and later the operational phase several limitations and issues surfaced with regard to the Dell PowerConnect switches and the networks built with them. The following – probably not exhaustive – list of limitations and issues I've encountered is in no particular order with regard to their occurrence or severity.

Limitations

  • While the Dell PowerConnect switches support VRRP as a redundancy protocol for layer 3 instances, there is only support for VRRP version 2, described in RFC 3768. This limits the use of VRRP to IPv4 only. VRRP version 3 described in RFC 5798, which is needed for the implementation of redundant layer 3 instances for both IPv4 and IPv6, is not supported by Dell PowerConnect switches. Due to this limitation and the need for full IPv6 support in the whole environment, the design decision was made to run the Dell PowerConnect M8024-k switches for the LAN as a stack of switches.

  • Limited support of routing protocols. There is only support for the routing protocols OSPF and RIP v2 in Dell PowerConnect switches. In this specific setup and triggered by the design decision to run the LAN switches as layer 3 devices, BGP would have been a more suitable routing protocol. Unfortunately there were no plans to implement BGP on the Dell PowerConnect devices.

  • Limitation in the number of secondary interface addresses. Only one IPv4 secondary address is supported per interface on a layer 3 instance running on the Dell PowerConnect switches. As opposed to e.g. Cisco-based layer 3 capable switches, this limitation caused, in this particular setup, the need for a lot more (VLAN) interfaces than would otherwise have been necessary.

  • No IPv6 secondary interface addresses. For IPv6 based layer 3 instances there is no support at all for secondary interface addresses. Although this might be a fundamental rather than product specific limitation.

  • For layer 3 instances in general there is no support for very small IPv4 subnets (e.g. /31 with 2 IPv4 addresses) which are usually used for transfer networks. In setups using private IPv4 address ranges this is no big issue. In this case though, official IPv4 addresses were used and in conjunction with the excessive need for VLAN interfaces this limitation caused a lot of wasted official IPv4 addresses.

  • The access control list (ACL) feature is very limited and rather rudimentary in Dell PowerConnect switches. There is no support for port ranges, no statefulness and each access list has a hard limit of 256 access list entries. All three – and possibly even more – limitations in combination make the ACL feature of Dell PowerConnect switches almost useless. Especially if there are separate layer 3 networks on the system which are in need of fine-grained traffic control.

  • From the performance aspect of ACLs I have gotten the impression that especially IPv6 ACLs are handled by the switch's CPU. If IPv6 is used in conjunction with extensive ACLs, this would dramatically impact the network performance of IPv6-based traffic. Admittedly I have no hard proof to support this suspicion.

  • The out-of-band (OOB) management interface of the Dell PowerConnect switches does not provide true out-of-band management. Instead it is integrated into the switch as just another IP interface – although one with a special purpose. Due to this interaction of the OOB with the IP stack of the Dell PowerConnect switch there are side-effects when the switch is running at least one layer 3 instance. In this case, the standard IP routing table of the switch is not only used for routing decisions of the payload traffic, but it is also used to determine the destination of packets originating from the OOB interface. This behaviour can cause an asymmetric traffic flow when the systems connecting to the OOB are covered by an entry in the switch's IP routing table. Far from ideal when it comes to true OOB management, not to mention the issues arising when there are also stateful firewall rules involved.

    I addressed this limitation with a support case at Dell and got the following statement back:

    FASTPATH can learn a default gateway for the service port, the network port,
    or a routing interface. The IP stack can only have a single default gateway.
    (The stack may accept multiple default routes, but if we let that happen we may
    end up with load balancing across the network and service port or some other
    combination we don't want.) RTO may report an ECMP default route. We only give
    the IP stack a single next hop in this case, since it's not likely we need to
    additional capacity provided by load sharing for packets originating on the
    box.

    The precedence of default gateways is as follows:
    - via routing interface
    - via service port
    - via network port

    As per the above precedence, ip stack is having the default gateway which is
    configured through RTO. When the customer is trying to ping the OOB from
    different subnet , route table donesn't have the exact route so,it prefers the
    default route and it is having the RTO default gateway as next hop ip. Due to
    this, it egresses from the data port.

    If we don't have the default route which is configured through RTO then IP
    stack is having the OOB default gateway as next hop ip. So, it egresses from
    the OOB IP only.

    In my opinion this just confirms how the OOB management of the Dell PowerConnect switches is severely broken by design.

  • Another issue with the out-of-band (OOB) management interface of the Dell PowerConnect switches is that they support only a very limited access control list (ACL) in order to protect the access to the switch. The management ACL only supports one IPv4 ACL entry. IPv6 support within the management ACL protecting the OOB interface is missing altogether.

  • The Dell PowerConnect switches have no support for Shortest Path Bridging (SPB) as defined in the IEEE 802.1aq standard. On layer 2 the traditional spanning-tree protocols STP (IEEE 802.1D), RSTP (IEEE 802.1w) or MSTP (IEEE 802.1s) have to be used. This is particularly a drawback in the SAN network shown in the schematic above, due to the protocol-determined inactivity of one inter-switch link. With the use of SPB, all inter-switch links could be equally utilized and a traffic interruption upon link failure and spanning-tree (re)convergence could be avoided.

  • Another SAN-specific limitation is the incomplete implementation of Data Center Bridging (DCB) in the Dell PowerConnect switches. Although the protocols Priority-based Flow Control (PFC) according to IEEE 802.1Qbb and Congestion Notification (CN) according to IEEE 802.1Qau are supported, the third needed protocol Enhanced Transmission Selection (ETS) according to IEEE 802.1Qaz is missing in Dell PowerConnect switches. The Dell EqualLogic PS Series storage systems used in the setup shown above explicitly need ETS if DCB is to be used on layer 2. Since ETS is not implemented in Dell PowerConnect switches, the traditional layer 2 protocols had to be used in the SAN.

Issues

  • Not per se an issue, but the baseline CPU utilization on Dell PowerConnect M8024-k switches running layer 3 instances is significantly higher compared to those running only as layer 2 devices. The following CPU utilization graphs show a direct comparison of a layer 3 (upper graph) and a layer 2 (lower graph) device:

    CPU utilization on a Dell PowerConnect M8024-k switch as a Layer 3 device
    CPU utilization on a Dell PowerConnect M8024-k switch as a Layer 2 device

    The CPU utilization is between 10 and 15% higher once the tasks of processing layer 3 traffic are involved. What kind of switch function or what type of traffic is causing this additional CPU utilization is completely opaque. Documentation on such in-depth subjects or details on how the processing within the Dell PowerConnect switches works is very scarce. It would be very interesting to know what kind of traffic is sent to the switch's CPU for processing instead of being handled by the hardware.

  • The very high CPU utilization plateau on the right hand side of the upper graph (approximately between 10:50 - 11:05) was due to a bug in the processing of IPv6 traffic on Dell PowerConnect switches. This issue caused IPv6 packets to be sent to the switch's CPU for processing instead of the forwarding decision being made in hardware. I narrowed down the issue by transferring a large file between two hosts via the SCP protocol. In the first case, determined by preferred name resolution via DNS, an IPv6 connection was used:

    user@host1:~$ scp testfile.dmp user@host2:/var/tmp/
    testfile.dmp                                   8%  301MB 746.0KB/s 1:16:05 ETA
    

    The CPU utilization on the switch stack during the transfer was monitored on the switch's CLI:

    stack1(config)# show process cpu
    
    Memory Utilization Report
    
    status      bytes
    ------ ----------
      free  170642152
     alloc  298144904
    
    CPU Utilization:
    
      PID      Name                    5 Secs     60 Secs    300 Secs
    -----------------------------------------------------------------
     41be030 tNet0                     27.05%      30.44%      21.13%
     41cbae0 tXbdService                2.60%       0.40%       0.09%
     43d38d0 ipnetd                     0.40%       0.11%       0.11%
     43ee580 tIomEvtMon                 0.40%       0.09%       0.22%
     43f7d98 osapiTimer                 2.00%       3.56%       3.13%
     4608b68 bcmL2X.0                   0.00%       0.08%       1.16%
     462f3a8 bcmCNTR.0                  1.00%       0.87%       1.04%
     4682d40 bcmTX                      4.20%       5.12%       3.83%
     4d403a0 bcmRX                      9.21%      12.64%      10.35%
     4d60558 bcmNHOP                    0.80%       0.21%       0.11%
     4d72e10 bcmATP-TX                  0.80%       0.24%       0.32%
     4d7c310 bcmATP-RX                  0.20%       0.12%       0.14%
     53321e0 MAC Send Task              0.20%       0.19%       0.40%
     533b6e0 MAC Age Task               0.00%       0.05%       0.09%
     5d59520 bcmLINK.0                  5.41%       2.75%       2.15%
     84add18 tL7Timer0                  0.00%       0.22%       0.23%
     84ca140 osapiWdTask                0.00%       0.05%       0.05%
     84d3640 osapiMonTask               0.00%       0.00%       0.01%
     84d8b40 serialInput                0.00%       0.00%       0.01%
     95e8a70 servPortMonTask            0.40%       0.09%       0.12%
     975a370 portMonTask                0.00%       0.06%       0.09%
     9783040 simPts_task                0.80%       0.73%       1.40%
     9b70100 dtlTask                    5.81%       7.52%       5.62%
     9dc3da8 emWeb                      0.40%       0.12%       0.09%
     a1c9400 hapiRxTask                 4.00%       8.84%       6.46%
     a65ba38 hapiL3AsyncTask            1.60%       0.45%       0.37%
     abcd0c0 DHCP snoop                 0.00%       0.00%       0.20%
     ac689d0 Dynamic ARP Inspect        0.40%       0.10%       0.05%
     ac7a6c0 SNMPTask                   0.40%       0.19%       0.95%
     b8fa268 dot1s_timer_task           1.00%       0.78%       2.74%
     b9134c8 dot1s_task                 0.20%       0.07%       0.04%
     bdb63e8 dot1xTimerTask             0.00%       0.03%       0.02%
     c520db8 radius_task                0.00%       0.02%       0.05%
     c52a0b0 radius_rx_task             0.00%       0.03%       0.03%
     c58a2e0 tacacs_rx_task             0.20%       0.06%       0.15%
     c59ce70 unitMgrTask                0.40%       0.10%       0.20%
     c5c7410 umWorkerTask               1.80%       0.27%       0.13%
     c77ef60 snoopTask                  0.60%       0.25%       0.16%
     c8025a0 dot3ad_timer_task          1.00%       0.24%       0.61%
     ca2ab58 dot3ad_core_lac_tas        0.00%       0.02%       0.00%
     d1860b0 dhcpsPingTask              0.20%       0.13%       0.39%
     d18faa0 SNTP                       0.00%       0.02%       0.01%
     d4dc3b0 sFlowTask                  0.00%       0.00%       0.03%
     d6a4448 spmTask                    0.00%       0.13%       0.14%
     d6b79c8 fftpTask                   0.40%       0.06%       0.01%
     d6dcdf0 tCkptSvc                   0.00%       0.00%       0.01%
     d7babe8 ipMapForwardingTask        0.40%       0.18%       0.29%
     dba91b8 tArpCallback               0.00%       0.04%       0.04%
     defb340 ARP Timer                  2.60%       0.92%       1.29%
     e1332f0 tRtrDiscProcessingT        0.00%       0.00%       0.11%
    12cabe30 ip6MapLocalDataTask        0.00%       0.03%       0.01%
    12cb5290 ip6MapExceptionData       11.42%      12.95%       9.41%
    12e1a0d8 lldpTask                   0.60%       0.17%       0.30%
    12f8cd10 dnsTask                    0.00%       0.00%       0.01%
    140b4e18 dnsRxTask                  0.00%       0.03%       0.03%
    14176898 DHCPv4 Client Task         0.00%       0.01%       0.02%
    1418a3f8 isdpTask                   0.00%       0.00%       0.10%
    14416738 RMONTask                   0.00%       0.20%       0.42%
    144287f8 boxs Req                   0.20%       0.09%       0.21%
    15c90a18 sshd                       0.40%       0.07%       0.07%
    15cde0e0 sshd[0]                    0.20%       0.05%       0.02%
    -----------------------------------------------------------------
     Total CPU Utilization             89.77%      92.50%      77.29%
    

    In the second case an IPv4 connection was deliberately chosen:

    user@host1:~$ scp testfile.dmp user@10.0.0.1:/var/tmp/
    testfile.dmp                                 100% 3627MB  31.8MB/s   01:54
    

    Not only was the transfer rate of the SCP copy process significantly higher – and the transfer time correspondingly much lower – in the second case using an IPv4 connection, but the CPU utilization on the switch stack during the transfer over IPv4 was also much lower:

    stack1(config)# show process cpu
    
    Memory Utilization Report
    
    status      bytes
    ------ ----------
      free  170642384
     alloc  298144672
    
    CPU Utilization:
    
      PID      Name                    5 Secs     60 Secs    300 Secs
    -----------------------------------------------------------------
     41be030 tNet0                      0.80%      23.49%      21.10%
     41cbae0 tXbdService                0.00%       0.17%       0.08%
     43d38d0 ipnetd                     0.20%       0.14%       0.12%
     43ee580 tIomEvtMon                 0.60%       0.26%       0.24%
     43f7d98 osapiTimer                 2.20%       3.10%       3.08%
     4608b68 bcmL2X.0                   4.20%       1.10%       1.22%
     462f3a8 bcmCNTR.0                  0.80%       0.80%       0.99%
     4682d40 bcmTX                      0.20%       3.35%       3.59%
     4d403a0 bcmRX                      4.80%       9.90%      10.06%
     4d60558 bcmNHOP                    0.00%       0.11%       0.10%
     4d72e10 bcmATP-TX                  1.00%       0.30%       0.32%
     4d7c310 bcmATP-RX                  0.00%       0.14%       0.15%
     53321e0 MAC Send Task              0.80%       0.39%       0.42%
     533b6e0 MAC Age Task               0.00%       0.12%       0.10%
     5d59520 bcmLINK.0                  1.80%       2.38%       2.14%
     84add18 tL7Timer0                  0.00%       0.11%       0.20%
     84ca140 osapiWdTask                0.00%       0.05%       0.05%
     84d3640 osapiMonTask               0.00%       0.00%       0.01%
     84d8b40 serialInput                0.00%       0.00%       0.01%
     95e8a70 servPortMonTask            0.20%       0.09%       0.11%
     975a370 portMonTask                0.00%       0.06%       0.09%
     9783040 simPts_task                3.20%       1.54%       1.49%
     9b70100 dtlTask                    0.20%       5.47%       5.45%
     9dc3da8 emWeb                      0.40%       0.13%       0.09%
     a1c9400 hapiRxTask                 0.20%       6.46%       6.30%
     a65ba38 hapiL3AsyncTask            0.40%       0.37%       0.35%
     abcd0c0 DHCP snoop                 0.00%       0.02%       0.18%
     ac689d0 Dynamic ARP Inspect        0.40%       0.15%       0.07%
     ac7a6c0 SNMPTask                   0.00%       1.32%       1.12%
     b8fa268 dot1s_timer_task           7.21%       2.99%       2.97%
     b9134c8 dot1s_task                 0.00%       0.03%       0.03%
     bdb63e8 dot1xTimerTask             0.00%       0.01%       0.02%
     c520db8 radius_task                0.00%       0.01%       0.04%
     c52a0b0 radius_rx_task             0.00%       0.03%       0.03%
     c58a2e0 tacacs_rx_task             0.20%       0.21%       0.17%
     c59ce70 unitMgrTask                0.60%       0.20%       0.21%
     c5c7410 umWorkerTask               0.20%       0.17%       0.12%
     c77ef60 snoopTask                  0.20%       0.18%       0.15%
     c8025a0 dot3ad_timer_task          2.20%       0.80%       0.68%
     d1860b0 dhcpsPingTask              1.80%       0.58%       0.45%
     d18faa0 SNTP                       0.00%       0.00%       0.01%
     d4dc3b0 sFlowTask                  0.20%       0.03%       0.03%
     d6a4448 spmTask                    0.20%       0.15%       0.14%
     d6b79c8 fftpTask                   0.00%       0.02%       0.01%
     d6dcdf0 tCkptSvc                   0.00%       0.00%       0.01%
     d7babe8 ipMapForwardingTask        0.20%       0.19%       0.28%
     dba91b8 tArpCallback               0.00%       0.06%       0.05%
     defb340 ARP Timer                  4.60%       1.54%       1.36%
     e1332f0 tRtrDiscProcessingT        0.40%       0.14%       0.12%
    12cabe30 ip6MapLocalDataTask        0.00%       0.01%       0.01%
    12cb5290 ip6MapExceptionData        0.00%       8.60%       8.91%
    12cbe790 ip6MapNbrDiscTask          0.00%       0.02%       0.00%
    12e1a0d8 lldpTask                   0.80%       0.24%       0.29%
    12f8cd10 dnsTask                    0.00%       0.00%       0.01%
    140b4e18 dnsRxTask                  0.40%       0.07%       0.04%
    14176898 DHCPv4 Client Task         0.00%       0.00%       0.02%
    1418a3f8 isdpTask                   0.00%       0.00%       0.09%
    14416738 RMONTask                   1.00%       0.44%       0.44%
    144287f8 boxs Req                   0.40%       0.16%       0.21%
    15c90a18 sshd                       0.20%       0.06%       0.06%
    15cde0e0 sshd[0]                    0.00%       0.03%       0.02%
    -----------------------------------------------------------------
     Total CPU Utilization             43.28%      78.79%      76.50%
    

    Comparing the two output samples above by per-process CPU utilization showed that the major share of the higher CPU utilization in the case of an IPv6 connection is allotted to the processes tNet0, bcmTX, bcmRX, bcmLINK.0, dtlTask, hapiRxTask and ip6MapExceptionData. In a process-by-process comparison, those seven processes used 60.3% more CPU time in the case of an IPv6 connection compared to the case using an IPv4 connection. Unfortunately the documentation on what the individual processes are exactly doing is very sparse or not available at all. In order to further analyze this issue a support case with the collected information was opened with Dell. A fix for the described issue was made available with firmware version 5.1.9.3.

  • The LAN stack of several Dell PowerConnect M8024-k switches sometimes showed erratic behaviour. There were several occasions where the switch stack would suddenly show a hugely increased latency in packet processing or where it would just stop passing certain types of traffic altogether. Usually a reload of the stack would restore its operation, and the increased latency or the packet drops would disappear with the reload as suddenly as they had appeared. The root cause of this was unfortunately never really found. Maybe it was the combination of functions (layer 3, dual stack IPv4 and IPv6, extensive ACLs, etc.) that were running simultaneously on the stack in this setup.

  • During both planned and unplanned failovers of the master switch in the stack, there is a time period of up to 120 seconds where no packets are processed by the switch stack. This occurs even with continuous forwarding enabled. I've had a strong suspicion that this issue was related to the layer 3 instances running on the switch stack. A comparison between a pure layer 2 stack and a layer 3 enabled stack in a controlled test environment confirmed this. As soon as at least one layer 3 instance was added, the described delay occurred on switch failovers. The fact that migrating layer 3 instances from the former master switch to the new one takes some time makes sense to me. What's unclear to me is why this seems to also affect the layer 2 traffic going over the stack.

  • There were several occasions where the hardware and software MAC tables of the Dell PowerConnect switches got out of sync. While the root cause (hardware defect, bit flip, power surge, cosmic radiation, etc.) of this issue is unknown, the effect was a sudden reboot of the affected switch. Luckily we had console servers in place, which were storing a console output history from the time the issue occurred. After raising a support case with Dell with the information from the console output, we got a firmware update (v5.1.9.4) in which the issue would not trigger a sudden reboot anymore, but instead log an appropriate message to the switch's log. With this fix the out-of-sync MAC tables still require a reboot of the affected switch, but this can now be done in a controlled fashion. Still, a solution requiring no reboot at all would have been far preferable.

  • While querying the Dell PowerConnect switches with the SNMP protocol for monitoring purposes, obscure and confusing messages containing the string MGMT_ACAL would reproducibly be logged into the switch's log. See the article Check_MK Monitoring - Dell PowerConnect Switches - Global Status in this blog for the gory details.

  • With a stack of Dell PowerConnect M8024-k switches the information provided via the SNMP protocol would occasionally get out of sync with the information available from the CLI. E.g. the temperature values from the stack stack1 of LAN switches compared to the standalone SAN switches standalone{1,2,3,4,5,6}:

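    # walk the two temperature-related OID columns (sub-OIDs 4 and 5) on each switch or stack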
    user@host:# for HST in stack1 standalone1 standalone2 standalone3 stack2 standalone4 standalone5 standalone6; do 
      echo "$HST: ";
      for OID in 4 5; do
        echo -n "  ";
        snmpbulkwalk -v2c -c [...] -m '' -M '' -Cc -OQ -OU -On -Ot $HST .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.${OID};
      done;
    done
    
    stack1: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4 = No Such Object available on this agent at this OID
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5 = No Such Object available on this agent at this OID
    standalone1: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 40
    standalone2: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 37
    standalone3: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 32
    stack2: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.2.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 42
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.2.0 = 41
    standalone4: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 39
    standalone5: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 39
    standalone6: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 35
    

    At the same time the CLI management interface of the switch stack showed the correct temperature values:

    stack1# show system       
    
    System Description: Dell Ethernet Switch
    System Up Time: 89 days, 01h:50m:11s
    System Name: stack1
    Burned In MAC Address: F8B1.566E.4AFB
    System Object ID: 1.3.6.1.4.1.674.10895.3041
    System Model ID: PCM8024-k
    Machine Type: PowerConnect M8024-k
    Temperature Sensors:
    
    Unit     Description       Temperature    Status
                                (Celsius)
    ----     -----------       -----------    ------
    1        System            39             Good
    2        System            39             Good
    [...]
    

    Only after a reboot of the switch stack, the information provided via the SNMP protocol:

    user@host:# for HST in stack1 standalone1 standalone2 standalone3 stack2 standalone4 standalone5 standalone6; do 
      echo "$HST: ";
      for OID in 4 5; do
        echo -n "  ";
        snmpbulkwalk -v2c -c [...] -m '' -M '' -Cc -OQ -OU -On -Ot $HST .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.${OID};
      done;
    done
    
    stack1: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.2.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 37
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.2.0 = 37
    standalone1: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 39
    standalone2: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 37
    standalone3: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 32
    stack2: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.2.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 41
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.2.0 = 41
    standalone4: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 38
    standalone5: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 38
    standalone6: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 34
    

    would again be in sync with the information available from the CLI:

    stack1# show system   
    
    System Description: Dell Ethernet Switch
    System Up Time: 0 days, 00h:05m:32s
    System Name: stack1
    Burned In MAC Address: F8B1.566E.4AFB
    System Object ID: 1.3.6.1.4.1.674.10895.3041
    System Model ID: PCM8024-k
    Machine Type: PowerConnect M8024-k
    Temperature Sensors:
    
    Unit     Description       Temperature    Status
                                (Celsius)
    ----     -----------       -----------    ------
    1        System            37             Good
    2        System            37             Good
    [...]
    

Conclusion

Although the setup built with the Dell PowerConnect switches and the other hardware components was working and providing its basic, intended functionality, there were some pretty big and annoying limitations associated with it. A lot of these limitations would not have been that significant to the entire setup if certain design decisions had been made more carefully. For example if the layer 3 part of the LAN had been implemented in external network components, or if a proper fully meshed, fabric-based SAN had been favored over what can only be described as a legacy technology. From the reliability, availability and serviceability (RAS) points of view, the setup is also far from ideal. By daisy-chaining the Dell PowerEdge M1000e blade chassis, stacking the LAN switches, stretching the LAN and SAN over both chassis and by connecting external devices through the external ports of the Dell PowerConnect switches, there are a lot of parts in the setup that depend on each other. This makes normal operations difficult at best and can have disastrous effects in case of a failure.

In retrospect, either using pure pass-through network modules in the Dell PowerEdge M1000e blade chassis in conjunction with capable 10GE top-of-rack switches, or using the much more capable Dell Force10 MXL switches in the Dell PowerEdge M1000e blade chassis, seems to be the better solution. The premium for the Dell Force10 MXL switches of about €2000 list price per device compared to the Dell PowerConnect switches seems negligible compared to the costs that arose through debugging, bugfixing and finding workarounds for the various limitations of the Dell PowerConnect switches. In either case a pair of capable, central layer 3 devices for gateway redundancy, routing and possibly fine-grained traffic control would be advisable.

For simpler setups, without some of the more special requirements of this particular setup, the Dell PowerConnect switches still offer a nice price-performance ratio, especially with regard to their 10GE port density.

// AIX and VIOS Performance with 10 Gigabit Ethernet (Update)

In October last year (10/2013) a colleague and I were given the opportunity to speak at the “IBM AIX User Group South-West” at IBM's German headquarters in Ehningen. My part of the talk was about our experiences with the move from 1GE to 10GE on our IBM Power systems. It was largely based on my previous post AIX and VIOS Performance with 10 Gigabit Ethernet. During the preparation for the talk, while I reviewed and compiled the previously collected material on the subject, I suddenly realized a disastrous mistake in my methodology. Specifically, the netperf program used for the performance tests has a rather inadequate heuristic for determining the TCP buffer sizes it will use during a test run. For example:

lpar1:/$ netperf -H 192.168.244.137
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to ...
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec
16384  16384   16384    10.00    3713.83

with the values “16384” from the last line being the relevant part here.

It turned out that on AIX the netperf utility only looks at the global, system-wide values of the tcp_recvspace and tcp_sendspace tunables set with the no command. In my example this was:

lpar1:/$ no -o tcp_recvspace -o tcp_sendspace
tcp_recvspace = 16384
tcp_sendspace = 16384

The interface-specific values of “262144” or “524288”, e.g.:

lpar1:/$ ifconfig en0
en0: flags=1e080863,4c0<UP,...,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>
        inet 192.168.244.239 netmask 0xffffff00 broadcast 192.168.244.255
         tcp_sendspace 262144 tcp_recvspace 262144 tcp_nodelay 1 rfc1323 1

which would override the system-wide default values for a specific network interface, were never properly picked up by netperf. The configuration of low, global default values and higher, interface-specific values for the tunables was deliberately chosen in order to allow the coexistence of low and high bandwidth adapters in the same system without the risk of interference. Anyway, I guess this is a very good example of the famous saying:

A fool with a tool is still a fool. Grady Booch
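
In hindsight, the heuristic could also have been sidestepped by requesting the socket buffer sizes explicitly on the netperf command line. A minimal sketch, assuming the test-specific -s (local) and -S (remote) socket buffer size options of netperf and the host from the example above:

lpar1:/$ netperf -H 192.168.244.137 -- -s 262144 -S 262144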

For the purpose of another round of performance tests I temporarily set the global default values for the tunables tcp_recvspace and tcp_sendspace first to “262144” and then to “524288”. The number of tests was reduced to “largesend” and “largesend with JF enabled”, since those – as elaborated before – seemed to be the most feasible options. The following table shows the results of the test runs in MBit/sec. The upper value in each cell is from the test with the tunables tcp_recvspace and tcp_sendspace set to “262144”, the lower value in each cell from the test with the tunables set to “524288”.

Options Legend
upper value  tcp_sendspace 262144, tcp_recvspace 262144, tcp_nodelay 1, rfc1323 1
lower value  tcp_sendspace 524288, tcp_recvspace 524288, tcp_nodelay 1, rfc1323 1
LS           largesend
LS,9000      largesend, MTU 9000
X            source and destination IP are identical (no measurement)

Color Legend
≤ 1 Gbps
> 1 Gbps and ≤ 3 Gbps
> 3 Gbps and ≤ 5 Gbps
> 5 Gbps

Destination IPs .243 and .239 are on managed system P770 DC 1-2, .137 is on P770 DC 2-2.
Each cell shows: upper value / lower value (MBit/sec).

Source       IP    Option    .243 LS             .243 LS,9000        .239 LS             .239 LS,9000        .137 LS             .137 LS,9000
P770 DC 1-2  .243  LS        X                   X                   9624.47 / 10494.85  9475.76 / 10896.52  4114.34 / 4810.47   3966.08 / 4537.50
P770 DC 1-2  .243  LS,9000   X                   X                   9313.00 / 10133.70  7549.75 / 7221.16   4069.94 / 4538.29   3699.03 / 4302.74
P770 DC 1-2  .239  LS        8484.16 / 10324.04  8534.12 / 10237.57  X                   X                   4000.52 / 4640.57   3834.55 / 4356.51
P770 DC 1-2  .239  LS,9000   8181.51 / 11539.93  7267.75 / 6505.37   X                   X                   4010.89 / 4512.89   4056.08 / 4410.07
P770 DC 2-2  .137  LS        3937.75 / 4073.65   3945.29 / 4386.60   3892.89 / 4194.19   3906.86 / 4303.14   X                   X
P770 DC 2-2  .137  LS,9000   3849.58 / 4742.04   3693.16 / 4360.35   3423.19 / 4561.73   3502.64 / 4317.46   X                   X

The results show a significant increase in throughput for intra-managed system network traffic, which is now at almost wire speed. For inter-managed system traffic there is also a noticeable, but far less significant, increase in throughput. To check whether the two selected TCP buffer sizes were still too small, I did a quick check of the network latency in our environment and compared it to the reference values from IBM (see the redbook link below):

Network        IBM RTT Reference   RTT Measured
1GE Phys       0.144 ms            0.188 ms
10GE Phys      0.062 ms            0.074 ms
10GE Hyp       0.038 ms            0.058 ms
10GE SEA       0.274 ms            0.243 ms
10GE VMXNET3   –                   0.149 ms

Every value besides the measurement within a managed system (“10GE Hyp”) represents the worst case RTT between two separate managed systems, one in each of the two datacenters. For reference purposes I added a test (“10GE VMXNET3”) between two Linux VMs running on two different VMware ESX hosts, one in each of the two datacenters, using the same network hardware as the IBM Power systems. The measured values themselves are well within the range of the reference values given by IBM, so the network setup in general should be fine. A quick calculation of the bandwidth-delay product for the value of the test case “10GE SEA”:

B × D = 10^10 bit/s × 0.243 ms ≈ 2.43 Mbit ≈ 304 kB

confirmed that the value of “524288” for the tunables tcp_recvspace and tcp_sendspace should be sufficient. Very alarming on the other hand is the fact that processing of simple ICMP packets takes almost twice as long going through the SEA on the IBM Power systems as through the network virtualization layer of VMware ESX. Part of the rather sub-par throughput performance measured on the IBM Power systems is most likely caused by inefficient processing of network traffic within the SEA and/or VIOS. Rumor has it that with the advent of 40GE and things getting even worse, the IBM AIX lab finally acknowledged this as an issue and is working on a streamlined, more efficient network stack. Other sources say SR-IOV, and with it passing direct access to shared hardware resources to the LPAR systems on a larger scale, is considered to be the way forward. Unfortunately this currently conflicts with LPM, so it'll probably be mutually exclusive for most environments.
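
Coming back to the bandwidth-delay product: as a quick sanity check, the number of bytes can be reproduced with a bc one-liner – just a sketch of the arithmetic above, 10 Gbit/s times the measured 0.243 ms RTT, divided by 8:

lpar1:/$ echo 'scale=0; (10 * 10^9 * 0.000243) / 8' | bc
303750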

In any case IBM still seems to have some homework to do on the network front. It'll be interesting to see what the developers come up with in the future.

// AIX and VIOS Performance with 10 Gigabit Ethernet

Please be sure to also read the update AIX and VIOS Performance with 10 Gigabit Ethernet (Update) to this blog post.

During the earlier mentioned (LPM Performance with 802.3ad Etherchannel on VIOS) redesign of our datacenters and the associated infrastructure we – among other things – also switched the 1 Gbps links on our IBM Power systems over to 10 Gbps links. I did some preliminary research on which guidelines and best practices should be followed with regard to performance and availability when using 10 gigabit ethernet (10GE). Building on those recommendations, I did systematic performance tests as well. The results of both will be presented and discussed in this post.

Initial setup

Previously our IBM Power environment was – with respect to network – set up with:

  • Dual VIOS on each IBM Power system.

  • Seven physical 1 Gbps links on each VIOS attached to different Cisco switching gear (3750 and 6509).

  • Heavily fragmented/firewalled environment with about 30 VLANs distributed unevenly over the 7+7 physical links.

  • Seven “failover SEA” (shared ethernet adapter) instances with primary and backup alternating between the two VIOS to ensure at least some load distribution.

All in all not very pretty – not by intention though, but rather organically grown like most environments.

New setup with 10GE

After the datacenter redesign, the new IBM Power environment would – with respect to network – be set up with:

  • Dual VIOS on each IBM Power system.

  • Two or four physical 10 Gbps links on each VIOS attached to four fully meshed Cisco Nexus 5596 switches. In the case of four physical links from the IBM Power systems, they would be aggregated into pairs of two links, each pair forming one 802.3ad etherchannel.

  • On the P770 (9117-MMB) systems the already available 10GE IVE or HEA ports would be used. On the P550 (8204-E8A) systems newly purchased Chelsio 10GE adapters (FC 5769) would be used.

  • Still heavily fragmented/firewalled environment with about 30 VLANs distributed evenly over the 1+1 physical links or 802.3ad etherchannels.

  • One “failover SEA with load sharing” instance to concurrently utilize both of the physical links attached to the SEA.
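
    Such a SEA is created on each of the two VIOS with the usual mkvdev command; the load sharing part boils down to the ha_mode=sharing attribute. A minimal sketch – the adapter names, PVID and control channel adapter are placeholders, not the actual values of this setup:

    padmin@vios$ mkvdev -sea ent0 -vadapter ent4 -default ent4 -defaultid 1 -attr ha_mode=sharing ctl_chan=ent5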

The design goals were:

  • Better resource utilization of the individual physical links. Connections needing less than 1 Gbps being able to “donate” the unused bandwidth to connections needing more than 1 Gbps.

  • Utilization of all physical links with the “failover SEA with load sharing”. No more inactive backup links.

  • A reduced amount of copper cabling taking up network interfaces, patch panel ports and switch ports. Less obstruction of the cabinet airflow due to fewer and smaller fiber links. Reduced fire load due to less cabling.

Configuration best practices

  • Network options that should already be in place from the 1 Gbps network configuration:

    • TCP window scaling (aka RFC1323)

      Globally:

      $ no -o rfc1323
      rfc1323 = 1
      
      $ no -o rfc1323=1 -p
      

      Or on individual interfaces:

      $ lsattr -EH -a rfc1323 -l enX
      attribute value description                                user_settable
      rfc1323   1     Enable/Disable TCP RFC 1323 Window Scaling True
      
      $ chdev -a rfc1323=1 -l enX
      
    • TCP receive and transmit socket buffer size (at least 262144 bytes)

      Globally:

      $ no -o tcp_recvspace -o tcp_sendspace
      tcp_recvspace = 262144
      tcp_sendspace = 262144
      
      $ no -o tcp_recvspace=262144 -o tcp_sendspace=262144 -p
      

      Or on individual interfaces:

      $ lsattr -EH -a tcp_sendspace -a tcp_recvspace -l enX
      attribute     value  description                           user_settable
      tcp_sendspace 262144 Set Socket Buffer Space for Sending   True
      tcp_recvspace 262144 Set Socket Buffer Space for Receiving True
      
      $ chdev -a tcp_sendspace=262144 -a tcp_recvspace=262144 -l enX
      
    • Disable delayed TCP acknowledgements (globally) and the TCP Nagle algorithm (per interface)

      Globally:

      $ no -o tcp_nodelayack
      tcp_nodelayack = 1
      
      $ no -o tcp_nodelayack=1 -p
      

      Or on individual interfaces:

      $ lsattr -EH -a tcp_nodelay -l enX
      attribute   value description                       user_settable
      tcp_nodelay 1     Enable/Disable TCP_NODELAY Option True
      
      $ chdev -a tcp_nodelay=1 -l enX
      
    • TCP selective acknowledgment (aka “SACK” or RFC2018):

      $ no -a | grep -i sack
      sack = 1
      
      $ no -o sack=1 -p
      
  • Turn on flow-control everywhere – interfaces and switches! Below in the performance section there will be an example of what performance looks like if flow-control isn't implemented end-to-end. For your network equipment, see your vendor's documentation. For IVE/HEA interfaces enable flow-control in the HMC like this:

    1. Go to Systems ManagementServers and select your managed system.

    2. From the drop-down menu select Hardware InformationAdaptersHost Ethernet.

    3. In the new Host Ethernet Adapters window select your 10GE adapter and click the Configure button.

    4. In the new HEA Physical Port Configuration window check the Flow control enabled option and click the OK button.

    For other 10GE interfaces, like the Chelsio FC 5769 adapters, turn on flow control on the AIX or VIOS device:

    $ lsattr -EH -a flow_ctrl -l entX
    attribute value description                              user_settable
    flow_ctrl yes   Enable transmit and receive flow control True
    
    $ chdev -a flow_ctrl=yes -l entX
    

    Changing this attribute on IVE/HEA devices has no effect though.

  • TCP checksum offload (enabled by default):

    $ lsattr -EH -a chksum_offload -l entX
    attribute      value description                          user_settable
    chksum_offload yes   Enable transmit and receive checksum True
    
    $ chdev -a chksum_offload=yes -l entX
    
  • TCP segmentation offload on AIX and VIOS hardware devices (enabled by default):

    $ lsattr -EH -a large_receive -a large_send -l entX
    attribute     value description                              user_settable
    large_receive yes   Enable receive TCP segment aggregation   True
    large_send    yes   Enable transmit TCP segmentation offload True
    
    $ chdev -a large_receive=yes -a large_send=yes -l entX
    

    TCP segmentation offload on VIOS SEA devices (could cause compatibility issues with IBM i LPARs)

    As root user:

    $ lsattr -EH -a large_receive -a largesend -l entX
    attribute     value description                                 user_settable
    large_receive yes   Enable receive TCP segment aggregation      True
    largesend     1     Enable Hardware Transmit TCP Resegmentation True
    
    $ chdev -a large_receive=yes -a largesend=1 -l entX
    

    As padmin user:

    padmin@vios$ chdev -dev entX -attr largesend=1
    padmin@vios$ chdev -dev entX -attr large_receive=yes
    

    TCP segmentation offload on AIX virtual ethernet devices:

    $ lsattr -EH -a mtu_bypass -l enX
    attribute  value description                                   user_settable
    mtu_bypass off   Enable/Disable largesend for virtual Ethernet True
    
    $ chdev -a mtu_bypass=on -l enX
    

Performance Tests

To test and evaluate the network throughput performance, Michael Perzl's RPM package of netperf was used. The netperf client and server programs were started with the default options, which means a simplex, single-thread throughput measurement with a duration of 10 seconds for each run. The AIX and VIOS LPARs involved in the tests had – unless specified otherwise – the following CPU and memory resource allocation configuration:

LPAR   Memory (GB)   vCPU   EC    Mode       Weight
AIX    16            2      0.2   uncapped   128
VIOS   4             2      0.4   uncapped   254

AIX was at version 6.1.8.1, oslevel 6100-08-01-1245, VIOS was at ioslevel 2.2.2.1.
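
Each individual measurement thus boiled down to something like the following – a sketch, assuming the netserver program from the same netperf package is already running on the destination system and using one of the test IP addresses from the tables below:

# on the destination system, started once
$ netserver

# on the source system: one simplex, single thread, 10 second TCP_STREAM run
$ netperf -H 192.168.243.12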

Hardware Tests on VIO Servers

For later reference or “calibration” if you will, the initial performance tests were done directly on the hardware devices assigned to the VIO servers. During the DLPAR assignment of the IVE/HEA ports I ran into some issues described earlier in AIX and VIOS DLPAR Operation fails on LHEA Adapters. After those issues were tackled, the test environment looked like this:

Hardware Test Setup with VIOS on IBM Power and Cisco Nexus Switches

  • Two datacenters “DC 1” and “DC 2” with less than 500m fiber distance between them.

  • A total of four P770 (9117-MMB) systems: two P770 with one CEC and thus one IVE/HEA port per VIOS, and two P770 with two CECs and thus two IVE/HEA ports per VIOS. Of each kind of P770 configuration, one is located in DC 1 and the other in DC 2.

  • Four Cisco Nexus 5596 switches, two in each datacenter.

  • Test IP addresses in two different VLANs, but tests not crossing VLANs over the firewall infrastructure.

The following table shows the results of the test runs in MBit/sec. Excluded – and denoted as grey cells with an “X” – were cases where the source and destination IP would be the same and cases where VLANs would be crossed and thus the firewall infrastructure would be involved.

IP address 192.168.243 192.168.241 192.168.243 192.168.241
.11 .12 .21 .22 .31 .32 .41 .42 .51 .52 .61 .62
192.168.243 .11 X 596 X X 4551 4669 4148 4547 3319 364 X X
.12 4588 X X X 4526 4714 4192 4389 2770 560 X X
192.168.241 .21 X X X 4843 X X X X X X 555 3588
.22 X X 400 X X X X X X X 491 3400
192.168.243 .31 4753 294 X X X 4915 4250 4566 2316 602 X X
.32 4629 639 X X 4661 X 4189 4496 1153 449 X X
.41 4754 449 X X 4607 4858 X 4516 1693 603 X X
.42 4746 290 X X 4809 4898 4314 X 1615 352 X X
.51 4792 419 X X 4714 4834 4259 4491 X 370 X X
.52 4886 1046 X X 4663 4843 4300 4576 2151 X X X
192.168.241 .61 X X 454 4667 X X X X X X X 3820
.62 X X 541 4637 X X X X X X 288 X

In general the numbers weren't that bad for a first try. Still, there seemed to be a systematic performance issue in those test cases where the numbers in the table are marked red. Those are the cases where the IVE/HEA ports on the second CEC of a two-CEC system were receiving traffic. See the above image of the test environment, where the affected IVE/HEA ports are also marked red. After double checking cabling, system resources and my own configuration, and also double checking with our network department with regard to the flow-control configuration, I did a tcpdump trace of the netperf test case 192.168.243.11 → 192.168.243.12 (upper left corner of the above table; value: 596 MBit/sec). This is what the wireshark bandwidth analysis graph of the tcpdump trace data looked like:

Wireshark bandwidth analysis for a test case without proper flow-control configuration

Noticeable are the large gaps of about one second in between very short bursts of traffic. This was due to the connection falling back to flow control on the TCP protocol level, which has relatively high backoff timing values. So the low transfer rate in the test cases marked red was due to improper flow-control configuration, which resulted in large periods of network inactivity, dragging down the overall average bandwidth value. After getting our network department to set up flow-control properly and doing the same netperf test case again, things improved dramatically:

Wireshark bandwidth analysis for a test case with flow-control

Now there is a continuous flow of packets, which is regulated in a more fine-grained manner via flow-control in case one of the parties involved gets flooded with packets. Why this behaviour is only exhibited in conjunction with the IVE/HEA ports on the second CEC remains an open question. A PMR raised with IBM led to no viable or conclusive answer.
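
For reference, a trace like the ones behind the two graphs above can be captured with plain tcpdump on one of the involved systems and analyzed offline in wireshark – a sketch, with the interface name and capture file as placeholders:

$ tcpdump -i en0 -s 0 -w netperf-flowctrl.pcap host 192.168.243.11 and host 192.168.243.12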

Doing the systematic tests above again confirmed the improved results in the previously problematic cases:

IP address 192.168.243 192.168.241 192.168.243 192.168.241
.11 .12 .21 .22 .31 .32 .41 .42 .51 .52 .61 .62
192.168.243 .11 X 4624 X X 4638 4550 4347 4618 4643 4178 X X
.12 4562 X X X 4642 4861 4374 4267 4527 4210 X X
192.168.241 .21 X X X 4578 X X X X X X 4347 3953
.22 X X 3858 X X X X X X X 4283 3918
192.168.243 .31 4501 4107 X X X 4719 4331 4492 4478 4118 X X
.32 4697 4089 X X 4481 X 4189 4313 4458 4091 X X
.41 4510 4327 X X 4735 4567 X 4485 4482 4420 X X
.42 4814 4122 X X 4696 4841 4453 X 4491 4221 X X
.51 4574 4198 X X 4656 5037 4501 4526 X 4167 X X
.52 4565 4283 X X 4543 4594 4421 4487 4416 X X X
192.168.241 .61 X X 4394 4497 X X X X X X X 4038
.62 X X 4343 4445 X X X X X X 4237 X

Although those results were much better than the ones in the first test run, there were still two issues:

  • individual throughput values still had a rather large span of over 1 Gbps (3858 - 5037 MBit/sec).

  • on average the throughput was well below the theoretical maximum of 10 Gbps. In the worst case only 38.5%, in the best case 50.3% of the theoretical maximum.

In order to further improve performance, jumbo frames were enabled on the switches, in the HMC on the IVE/HEA interfaces and on the interfaces within the VIO servers (for the VIOS side, see the sketch after the following table). For the HMC, the jumbo frames option on the IVE/HEA can be enabled in the same panel as the flow-control option mentioned above. After enabling jumbo frames, another test set like the above was done:

IP address 192.168.243 192.168.241 192.168.243 192.168.241
.11 .12 .21 .22 .31 .32 .41 .42 .51 .52 .61 .62
192.168.243 .11 X 7265 X X 7687 7218 7785 6795 7514 7459 X X
.12 7098 X X X 7492 6794 7644 6551 6979 7059 X X
192.168.241 .21 X X X 7187 X X X X X X 7228 7312
.22 X X 7190 X X X X X X X 6937 7203
192.168.243 .31 7823 7445 X X X 7086 8183 7051 7274 7548 X X
.32 7107 6730 X X 7427 X 7635 6616 6612 6677 X X
.41 7138 6930 X X 7585 6871 X 6701 6749 7114 X X
.42 6748 6593 X X 6944 6423 7641 X 6617 6865 X X
.51 7202 6716 X X 7055 6611 7779 6827 X 6929 X X
.52 7198 6711 X X 7816 6904 7907 6907 6990 X X X
192.168.241 .61 X X 6800 6710 X X X X X X X 7080
.62 X X 7197 6971 X X X X X X 7099 X
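
The VIOS side of the jumbo frame change referenced above is essentially two chdev calls – a sketch; entX/enX are placeholders and the attribute names can differ by adapter type (on IVE/HEA ports the jumbo frames option is instead set in the HMC, as described above):

# allow jumbo frames on the physical adapter
$ chdev -a jumbo_frames=yes -l entX

# raise the MTU on the corresponding network interface
$ chdev -a mtu=9000 -l enX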

As those results show, the throughput can be significantly improved by using jumbo frames instead of the standard frame size. On average the throughput is now at about 71% (7116 MBit/sec) of the theoretical maximum of 10 Gbps. In the worst case that is 64.2%, in the best case 81.8% of the theoretical maximum. This also means that the span of the throughput values is, at 1.7 Gbps (6423 - 8183 MBit/sec), even larger than before. It has to be kept in mind though that this is still single thread performance, so the larger variance is probably due to CPU restrictions of a single thread. A quick one-off test between one source and one destination using two parallel netperf threads showed a combined throughput of about 9400 MBit/sec and confirmed this suspicion.

LPAR and SEA Tests

After the initial hardware performance tests above, some more tests closer to the final real-world setup were done. The same setup would later be used for the production environment. The test environment looked like this:

LPAR and SEA Test Setup with AIX and VIOS on IBM Power and Cisco Nexus Switches

  • On a physical level of network and systems, the test setup was the same as the one above, except that only the two P770 with one CEC and thus one IVE/HEA port per VIOS were used in the further tests.

  • The directly assigned network interfaces and IPs on the VIO servers were replaced by a “failover SEA with load sharing” configuration.

  • An overall number of three AIX LPARs was used for the tests. LPARs #2 and #3 resided on the same P770 system. LPAR #1 resided on the other P770 system in the other datacenter.

  • All AIX LPARs had test IP addresses in the same VLAN (PVID: 251, 192.168.244.0/24). Thus, no traffic over the firewall infrastructure occurred and only one physical link (blue text in the above image) of the failover SEA was used.

The following table shows the results of the test runs in MBit/sec. The result table got a bit more complex this time around, since there were three LPARs, each of which could be source or destination. In addition each LPAR also had six different combinations of virtual interface options (see the options legend). Excluded – and denoted as grey cells – were cases where the source and destination IP, and thus LPAR, would be the same. To add a visual aid for the interpretation of the results, the table cells are color coded depending on the measurement value. See the color legend for the mapping of color to throughput values.

Options Legend
none tcp_sendspace 262144, tcp_recvspace 262144, tcp_nodelay 1, rfc1323 1
LS largesend
9000 MTU 9000
65390 MTU 65390
LS,9000 largesend, MTU 9000
LS,65390 largesend, MTU 65390
Color Legend
≤ 1 Gbps
> 1 Gbps and ≤ 3 Gbps
> 3 Gbps and ≤ 5 Gbps
> 5 Gbps
Managed System P770 DC 1-2 P770 DC 2-2
IP Destination .243 .239 .137
Source Option none LS 9000 65390 LS,9000 LS,65390 none LS 9000 65390 LS,9000 LS,65390 none LS 9000 65390 LS,9000 LS,65390
P770 DC 1-2 .243 none 791.27 727.34 759.35 743.37 766.81 708.33 888.13 885.68 877.76 788.91 778.50 875.20
LS 707.83 5239.54 709.68 721.63 5838.78 6714.79 3214.58 3225.06 3951.26 3744.28 4009.69 3961.77
9000 774.41 798.24 3422.03 4167.58 3518.23 3326.78 904.98 899.89 3676.68 3566.73 3507.67 3669.01
65390 755.57 772.49 3399.87 5233.71 3510.81 5821.94 867.31 878.49 3575.03 3697.73 3599.02 3766.30
LS,9000 785.19 6204.71 3658.68 3561.76 6384.92 7034.94 4073.84 4007.71 3964.47 4207.72 4024.49 3910.21
LS,65390 725.01 6058.76 3364.60 5909.11 7639.95 5529.59 3826.15 3918.82 3698.29 4139.26 3909.20 3970.33
.239 none 841.27 997.21 842.15 829.07 786.50 825.14 846.17 832.81 842.15 856.19 858.69 905.92
LS 827.75 5728.69 821.22 739.40 6028.86 6323.06 3674.06 3972.93 3750.00 3545.48 3238.95 3711.50
9000 817.95 847.33 3456.38 3659.09 3522.94 4436.41 855.46 897.06 3652.22 3410.08 3546.00 3439.29
65390 774.20 809.28 4552.87 6348.45 3598.86 5707.20 884.41 844.58 3173.88 3767.90 3217.50 3273.08
LS,9000 788.39 6569.07 4311.45 4004.73 7637.28 7529.95 3261.91 2854.86 4050.66 3458.85 3811.94 3972.71
LS,65390 801.97 5801.10 4018.57 5640.19 7815.83 5752.02 2952.91 3352.97 3961.08 3384.68 3421.31 4040.02
P770 DC 2-2 .137 none 925.17 753.81 865.15 839.55 912.80 930.17 819.35 938.09 864.33 847.11 898.57 827.62
LS 3351.54 2955.23 3323.41 3209.28 3158.83 3212.72 3143.14 2796.92 3118.80 2868.22 3244.99 3166.33
9000 930.69 953.76 3610.26 3460.80 3622.97 3585.96 785.10 869.91 3421.05 3245.40 3391.89 3369.89
65390 900.80 934.22 3564.63 3622.87 3769.84 3682.52 904.77 761.53 3231.62 3630.97 3079.13 3541.30
LS,9000 3066.75 3061.99 3604.73 4024.04 4059.77 4006.04 3094.73 2841.52 3820.87 3427.57 3792.07 3525.97
LS,65390 3333.52 3277.25 4047.81 4203.61 3957.74 3913.66 3025.75 3160.08 3829.30 3804.31 3283.60 3957.68

At first glance those results show that the throughput can vary significantly, depending on the combination of interface configuration options as well as on intra- vs. inter-managed system traffic. In detail the above results indicate that:

  • single thread performance is going to be nowhere near the theoretical maximum of 10 Gbps. In most cases one has to consider oneself lucky to get 50% of the theoretical maximum.

  • the default configuration options (test cases: none) are not suited at all for a 10 Gbps environment.

  • intra-managed system network traffic, which does not leave the hardware and should be handled only within the PowerVM hypervisor, performs better than inter-managed system traffic in most of the cases. To a certain degree this was expected, since in the latter case the traffic has to additionally go over the SEA, the physical interfaces, the network devices, and again over the other system's physical interfaces and SEA.

  • there is a quite noticeable delta between the intra-managed system results and the theoretical maximum of 10 Gbps. Since this traffic does not leave the hardware and should be handled only within the PowerVM hypervisor, better results somewhere between 9 – 9.5 Gbps were expected.

  • there is a quite noticeable performance drop-off when comparing the inter-managed system results with the previous “non-jumbo frames, but flow-control” hardware tests. The delta gets even more noticeable when comparing the results with the previous jumbo frames hardware tests. It seems the now involved SEA doesn't handle larger frame sizes that well.

  • strangely, the inter-managed system tests show better results than the intra-managed system tests in some cases (LS → none, LS,9000 → none, LS,65390 → none).

The somewhat dissatisfying results were discussed with IBM support within the scope of the PMR mentioned earlier. The result of this discussion was – even with a large amount of good will – also dissatisfying. We got the usual feedback of “results are actually not that bad”, “works as designed” and the infamous “not enough CPU resources on the VIO servers, switch to dedicated CPU assignment”. Just to get the latter point out of the way, I did another, reduced test set similar to the above. This time only the test cases for inter-managed system traffic were performed. The resource allocation on the two involved VIO servers was changed to two dedicated CPU cores. The following table shows the results of the test runs in MBit/sec.

Managed System P770 DC 1-2 P770 DC 2-2
IP Destination .243 .137
Source Option none LS 9000 65390 LS,9000 LS,65390 none LS 9000 65390 LS,9000 LS,65390
P770 DC 1-2 .243 none 853.06 862.40 800.55 869.85 873.51 846.27
LS 3661.52 3630.96 3530.22 3016.51 2986.46 3144.03
9000 808.54 654.28 3435.51 3570.83 3475.94 3426.25
65390 643.44 880.24 3301.09 3617.22 3592.70 3511.19
LS,9000 3645.29 2487.22 3783.74 3983.06 3774.08 4072.15
LS,65390 3591.95 3506.01 3646.02 4127.02 3987.71 3716.59
P770 DC 2-2 .137 none 826.02 710.41 800.56 425.30 814.65 824.55
LS 3509.81 3327.02 3419.89 3300.43 2387.57 3654.18
9000 827.58 835.04 3354.82 3301.09 3482.56 3550.13
65390 827.28 834.30 3451.69 3131.41 3506.15 3470.74
LS,9000 3480.38 3624.86 4191.55 4197.86 4194.89 4018.06
LS,65390 3580.14 2614.46 4097.04 4234.08 4181.17 3863.07

The results do improve a bit over the previous test set in some of the cases, but not by the expected margin of ~50%, which would be needed to get anywhere near the results of the previous hardware tests. A request to open a DCR in order to further investigate possible optimizations with regard to network performance in general and the SEA in particular is currently still in the works at IBM.

HMC and Jumbo Frames

In order to still get a working RMC connection from the HMC to the LPARs and thus be able to do DLPAR and LPM operations after switching to jumbo frames, the HMC has to be configured accordingly. See: Configure Jumbo Frames on the Hardware Management Console (HMC) for the details. If – like in our case – you have to deal with a heavily fragmented/firewalled environment and the RMC traffic from the HMC to the LPARs has to pass through several network devices, all the network equipment along the way has to be able to properly deal with jumbo frames. In our case the network department was reluctant and eventually did not reconfigure their devices, since “jumbo frames are a non-standard extension to the network protocol”. As usual this is a tradeoff between optimal performance and the possibility of strange behaviour or even error situations that are hard to debug.

Thoughts and Conclusions

It appears as if with the advent of 10GE or even higher bandwidths, the network itself is no longer the bottleneck. Rather, the components doing the necessary legwork of pre- and postprocessing (i.e. CPU and memory), as well as some parts of the TCP/IP stack are or are becoming the new bottleneck. One can no longer rely on a plug'n'play attitude like with gigabit ethernet, when it comes to network performance and optimal utilization of existing resources.

Although I'm certainly no network protocol designer, AIX kernel programmer or PowerVM expert, it still seems to me as if IBM has left considerable room for improvement in all the components involved. As 10 gigabit ethernet is deployed more and more in the datacenters, becoming the new standard by replacing gigabit ethernet, there will be an increased demand for high-throughput, low-latency network communication. It'll be hard to explain to upper management why they're getting only 35% - 75% of the network performance for 100% of the costs. I hope IBM will be able to show that the IBM Power and AIX platform can at least keep up with the current technology and trends as well as with the competition from other platforms.

// Nagios Performance Tuning

I'm running Nagios in a bit of an unusual setup, namely on a Debian/PPC system, which runs in an LPAR on an IBM Power system, using a dual VIOS setup for access to I/O resources (SAN disks and networks). The Nagios server runs the stock Debian “Squeeze” packages:

nagios-nrpe-plugin        2.12-4             Nagios Remote Plugin Executor Plugin
nagios-nrpe-server        2.12-4             Nagios Remote Plugin Executor Server
nagios-plugins            1.4.15-3squeeze1   Plugins for the nagios network monitoring and management system
nagios-plugins-basic      1.4.15-3squeeze1   Plugins for the nagios network monitoring and management system
nagios-plugins-standard   1.4.15-3squeeze1   Plugins for the nagios network monitoring and management system
nagios3                   3.2.1-2            A host/service/network monitoring and management system
nagios3-cgi               3.2.1-2            cgi files for nagios3
nagios3-common            3.2.1-2            support files for nagios3
nagios3-core              3.2.1-2            A host/service/network monitoring and management system core files
nagios3-doc               3.2.1-2            documentation for nagios3
ndoutils-common           1.4b9-1.1          NDOUtils common files
ndoutils-doc              1.4b9-1.1          Documentation for ndoutils
ndoutils-nagios3-mysql    1.4b9-1.1          This provides the NDOUtils for Nagios with MySQL support
pnp4nagios                0.6.12-1~bpo60+1   Nagios addon to create graphs from performance data
pnp4nagios-bin            0.6.12-1~bpo60+1   Nagios addon to create graphs from performance data (binaries)
pnp4nagios-web            0.6.12-1~bpo60+1   Nagios addon to create graphs from performance data (web interface)

It monitors about 220 hosts, which are either Unix systems (AIX or Linux), storage systems (EMC Clariion, EMC Centera, Fujitsu DX, HDS AMS, IBM DS, IBM TS), SAN devices (Brocade 48000, Brocade DCX, IBM SVC) or other hardware devices (IBM Power, IBM HMC, Rittal CMC). About 5000 services are checked in a 5 minute interval, mostly with NRPE, SNMP, TCP/UDP and custom plugins utilizing vendor specific tools. The server also runs Cacti, SNMPTT, DokuWiki and several other smaller tools, so it has always been quite busy.

In the past I had already implemented some performance optimization measures in order to mitigate the overall load on the system and to get all checks done in the 5 minute timeframe. Those were – in no specific order:

  • Use C/C++ based plugins. Try to avoid plugins that depend on additional, rather large runtime environments (e.g. Perl, Java, etc.).

  • If Perl based plugins are still necessary, use the Nagios embedded Perl interpreter to run them. See Using The Embedded Perl Interpreter for more information and Developing Plugins For Use With Embedded Perl on how to develop and debug plugins for the embedded Perl interpreter.

  • When using plugins based on shell scripts, try to minimize the number of calls to additional command line tools by using the shells built-in facilities. For example bash and ksh93 have a built-in syntax for manipulating strings, which can be used instead of calling sed or awk.
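
    For instance, extracting the first field of a colon-separated status line – a sketch with a hypothetical variable:

    # shell built-in parameter expansion, no external process needed
    state="${line%%:*}"

    # instead of e.g.
    state="$(echo "${line}" | awk -F: '{print $1}')"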

  • Create a ramdisk with a filesystem on it to hold the I/O heavy files and directories. In my case /etc/fstab contains the following line:

    # RAM disk for volatile Nagios files
    none    /var/ram/nagios3    tmpfs   defaults,size=256m,mode=750,uid=nagios,gid=nagios   0   0

    preparing an empty ramdisk-based filesystem at boot time. The init script /etc/init.d/nagios-prepare is run before the Nagios init script. It creates a directory tree on the ramdisk and copies the necessary files from the disk-based, non-volatile filesystems:

    Source                          Destination
    /var/log/nagios3/retention.dat  /var/ram/nagios3/log/retention.dat
    /var/log/nagios3/nagios.log     /var/ram/nagios3/log/nagios.log
    /var/cache/nagios3/             /var/ram/nagios3/cache/

    over to the ramdisk via rsync. The following Nagios configuration stanzas have been altered to use the new ramdisk:

    check_result_path=/var/ram/nagios3/spool/checkresults
    log_file=/var/ram/nagios3/log/nagios.log
    object_cache_file=/var/ram/nagios3/cache/objects.cache
    state_retention_file=/var/ram/nagios3/log/retention.dat
    temp_file=/var/ram/nagios3/cache/nagios.tmp
    temp_path=/var/ram/nagios3/tmp

    To prevent loss of data in the event of a system crash, a cronjob is run every 5 minutes to rsync some files back from the ramdisk to the disk based, non-volatile filesystems:

    Source                   Destination
    /var/ram/nagios3/log/    /var/log/nagios3/
    /var/ram/nagios3/cache/  /var/cache/nagios3/

    This job should also be run before the system is shut down or rebooted.
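
    A minimal sketch of such a cronjob in /etc/cron.d style – the rsync options, the user and the exact schedule are of course just an example:

    # sync volatile Nagios files back to non-volatile storage every 5 minutes
    */5 * * * * nagios rsync -a /var/ram/nagios3/log/ /var/log/nagios3/ && rsync -a /var/ram/nagios3/cache/ /var/cache/nagios3/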

  • Disable free()ing of child process memory (see: Nagios Main Configuration File Options - free_child_process_memory) with:

    free_child_process_memory=0

    in the main Nagios configuration file. This can safely be done since the Linux OS will take care of that once the fork()ed child exits.

  • Disable fork()ing twice when creating a child process (see: Nagios Main Configuration File Options - child_processes_fork_twice) with:

    child_processes_fork_twice=0

    in the main Nagios configuration file.

Recently, after “just” adding another process to be monitored, I ran into a strange problem where suddenly all checks would fail with error messages similar to this example:

[1348838643] Warning: Return code of 127 for check of service 'Check_process_gmond' on host
    'host1' was out of bounds. Make sure the plugin you're trying to run actually exists
[1348838643] SERVICE ALERT: host1;Check_process_gmond;CRITICAL;SOFT;1;(Return code of 127 is
    out of bounds - plugin may be missing)

Even checks totally unrelated to the new process to be monitored. After removing the new check from some hostgroups, everything was fine again. So the problem wasn't with the check per se, but rather with the number of checks, which felt like I was hitting some internal limit of the Nagios process. It turned out to be exactly that. The Nagios macros which are exported to the fork()ed child process had reached the OS limit for a process's environment. After checking all plugins for any usage of the Nagios macros in the process environment, I decided to turn this option off (see Nagios Main Configuration File Options - enable_environment_macros) with:

enable_environment_macros=0

in the main Nagios configuration file. Along with the two options above, I had now basically reached what is also done by the single use_large_installation_tweaks configuration option (see: Nagios Main Configuration File Options - use_large_installation_tweaks).
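
Put together, the relevant part of the main Nagios configuration file now looks roughly like this (a sketch; per the documentation the single tweaks option would have a comparable combined effect):

# individual large-installation options now in place
free_child_process_memory=0
child_processes_fork_twice=0
enable_environment_macros=0

# roughly what use_large_installation_tweaks=1 would enable in one go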

After reloading the Nagios process, not only did the new process checks now work for all hostgroups, but there was also a very noticeable drop in the system's CPU load:

Nagios CPU usage with environment macros turned off
