bityard Blog

// Check_MK Monitoring - SAP Cloud Connector

The SAP Cloud Connector provides a service which is used to connect on-premise systems – non-SAP, SAP ECC and SAP HANA – with applications running on the SAP Cloud Platform. This article introduces a new Check_MK agent and several service checks to monitor the status of the SAP Cloud Connector and its connections to the SAP Cloud Platform.

For the impatient (TL;DR), here is the Check_MK package with the SAP Cloud Connector monitoring checks:

SAP Cloud Connector monitoring checks (Compatible with Check_MK versions 1.4.0p19 and later)

The sources can be found in my Check_MK repository on GitHub.


Monitoring the SAP Cloud Connector can be done in two different, not mutually exclusive, ways. The first approach uses the traditional application monitoring features, already built into Check_MK, like:

  • the presence and count of the application processes

  • the reachability of the application's TCP ports

  • the validity of the SSL Certificates

  • queries to the application's health-check URL

The first approach is covered by the section Built-in Check_MK Monitoring below.

The second approach uses a new Check_MK agent and several service checks, dedicated to monitoring the internal status of the SAP Cloud Connector and its connections to the SAP Cloud Platform. The new Check_MK agent uses the monitoring API provided by the SAP Cloud Connector in order to monitor the application-specific states and metrics. The monitoring endpoints on the SAP Cloud Connector currently used by this Check_MK agent are the:

  • “List of Subaccounts” (URL: https://<scchost>:<sccport>/api/monitoring/subaccounts)

  • “List of Open Connections” (URL: https://<scchost>:<sccport>/api/monitoring/connections/backends)

  • “Performance Monitor Data” (URL: https://<scchost>:<sccport>/api/monitoring/performance/backends)

  • “Top Time Consumers” (URL: https://<scchost>:<sccport>/api/monitoring/performance/toptimeconsumers)
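As a rough illustration of how these four endpoints could be queried, the following Python 3 sketch builds the URLs from the paths listed above and fetches one of them. The names MONITORING_PATHS, endpoint_urls and fetch_endpoint are made up for this example and are not taken from agent_sapcc. The sketch also assumes HTTP basic authentication and a JSON response, and it uses the standard library rather than the requests library to stay free of external dependencies.

```python
import base64
import json
import ssl
import urllib.request

# Endpoint paths as listed above; the dictionary keys are illustrative names.
MONITORING_PATHS = {
    "subaccounts": "/api/monitoring/subaccounts",
    "connections_backends": "/api/monitoring/connections/backends",
    "performance_backends": "/api/monitoring/performance/backends",
    "performance_toptimeconsumers": "/api/monitoring/performance/toptimeconsumers",
}


def endpoint_urls(host, port=8443):
    """Return a dict mapping each endpoint name to its full HTTPS URL."""
    return {
        name: "https://{}:{}{}".format(host, port, path)
        for name, path in MONITORING_PATHS.items()
    }


def fetch_endpoint(url, username, password, timeout=30, context=None):
    """Fetch one endpoint and decode the body as JSON.

    Assumption: the API accepts HTTP basic authentication and returns JSON.
    """
    credentials = "{}:{}".format(username, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url, headers={"Authorization": "Basic " + token})
    context = context or ssl.create_default_context()
    with urllib.request.urlopen(request, timeout=timeout, context=context) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling endpoint_urls("scc.example.com") yields the four URLs with the default port 8443 filled in; each of them can then be passed to fetch_endpoint together with the credentials of the monitoring user.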

At the time of writing, there is unfortunately no monitoring endpoint on the SAP Cloud Connector for the Most Recent Requests metric. This metric is currently only available via the SAP Cloud Connector's WebUI. The Most Recent Requests metric would be much more interesting and useful than the currently available Top Time Consumers or Performance Monitor Data metrics, both of which have limitations. The application requests covered by the Top Time Consumers metric need a manual acknowledgement inside the SAP Cloud Connector in order to reset the events with the longest request runtime, which limits the metric's usability for external monitoring tools. The Performance Monitor Data metric aggregates the application requests into buckets based on their overall runtime. By itself this can be useful for external monitoring tools and is in fact used by the Check_MK agent covered in this article. In the process of runtime bucket aggregation though, the Performance Monitor Data metric hides the much more useful breakdown of each request into runtime subsections (“External (Back-end)”, “Open Connection”, “Internal (SCC)”, “SSO Handling” and “Latency Effects”).

Hopefully the Most Recent Requests metric will also be exposed via the monitoring API of the SAP Cloud Connector in the future. The new Check_MK agent can then be extended to use the newly exposed metric in order to gain a more fine-grained insight into the runtime of application requests through the SAP Cloud Connector.

The second approach is covered by the section SAP Cloud Connector Agent below.

Built-in Check_MK Monitoring

Application Processes

To monitor the SAP Cloud Connector process, use the standard Check_MK check “State and count of processes”. This can be found in the WATO WebUI under:

-> Manual Checks
   -> Applications, Processes & Services
      -> State and count of processes
         -> Create rule in folder ...
            -> Rule Options
               Description: Process monitoring of the SAP Cloud Connector
               Checktype: [ps - State and Count of Processes]
               Process Name: SAP Cloud Connector

            -> Parameters
               [x] Process Matching
               [Exact name of the process without arguments]
               [/opt/sapjvm_8/bin/java]

               [x] Name of the operating system user
               [Exact name of the operating system user]
               [sccadmin]

               [x] Levels for process count
               Critical below [1] processes
               Warning below [1] processes
               Warning above [1] processes
               Critical above [2] processes

            -> Conditions
               Folder [The folder containing the SAP Cloud Connector systems]
               and/or
               Explicit hosts [x]
               Specify explicit host names [SAP Cloud Connector systems]

Application Health Check

To implement a rudimentary monitoring of the SAP Cloud Connector application health, use the standard Check_MK check “Check HTTP service” to query the Health Check endpoint of the monitoring API provided by the SAP Cloud Connector. The “Check HTTP service” can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Active checks (HTTP, TCP, etc.)
      -> Check HTTP service
         -> Create rule in folder ...
            -> Rule Options
               Description: SAP Cloud Connector (SCC)

            -> Check HTTP service
               Name: SAP SCC
               [x] Check the URL
               [x] URI to fetch (default is /)
               [/exposed?action=ping]

               [x] TCP Port
               [8443]

               [x] Use SSL/HTTPS for the connection:
               [Use SSL with auto negotiation]

            -> Conditions
               Folder [The folder containing the SAP Cloud Connector systems]
               and/or
               Explicit hosts [x]
               Specify explicit host names [SAP Cloud Connector systems]
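To verify the health-check endpoint by hand before creating the rule, a small Python 3 sketch like the following can be used. It only relies on the URI and port from the rule above. The function names ping_url and is_healthy are invented for this example, and treating HTTP status 200 as "healthy" is an assumption.

```python
import ssl
import urllib.error
import urllib.request


def ping_url(host, port=8443):
    """Build the health-check URL from the URI and TCP port used in the rule."""
    return "https://{}:{}/exposed?action=ping".format(host, port)


def is_healthy(host, port=8443, timeout=10, context=None):
    """Return True if the health-check URL answers with HTTP status 200."""
    context = context or ssl.create_default_context()
    try:
        with urllib.request.urlopen(
            ping_url(host, port), timeout=timeout, context=context
        ) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, TLS failure or HTTP error status.
        return False
```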

SSL Certificates

To monitor the validity of the SSL certificate of the SAP Cloud Connector WebUI, use the standard Check_MK check “Check HTTP service”. The “Check HTTP service” can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Active checks (HTTP, TCP, etc.)
      -> Check HTTP service
         -> Create rule in folder ...
            -> Rule Options
               Description: SAP Cloud Connector (SCC)

            -> Check HTTP service
               Name: SAP SCC Certificate
               [x] Check SSL Certificate Age
               Warning at or below [60] days
               Critical at or below [30] days

               [x] TCP Port
               [8443]

            -> Conditions
               Folder [The folder containing the SAP Cloud Connector systems]
               and/or
               Explicit hosts [x]
               Specify explicit host names [SAP Cloud Connector systems]
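The logic behind the certificate age check can be sketched in a few lines of Python 3. This is not how the “Check HTTP service” check is implemented. It is only an illustration with invented function names, using ssl.cert_time_to_seconds from the standard library to interpret the certificate's notAfter field. The thresholds are passed in as parameters, and the critical value is expected to be the smaller of the two.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Return the whole days between now and a certificate's notAfter date.

    not_after uses the format returned by ssl.SSLSocket.getpeercert(),
    for example "Jan  1 00:00:00 2030 GMT".
    """
    now = time.time() if now is None else now
    return int((ssl.cert_time_to_seconds(not_after) - now) // 86400)


def certificate_state(days_left, warn_days, crit_days):
    """Map the remaining days to 0 (OK), 1 (WARN) or 2 (CRIT)."""
    if days_left <= crit_days:
        return 2
    if days_left <= warn_days:
        return 1
    return 0


def peer_not_after(host, port=8443, timeout=10):
    """Fetch the notAfter field of the certificate presented by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```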

SAP Cloud Connector Agent

The new Check_MK package to monitor the status of the SAP Cloud Connector and its connections to the SAP Cloud Platform consists of three major parts – an agent plugin, two check plugins and several auxiliary files and plugins (WATO plugins, Perf-o-meter plugins, metrics plugins and man pages).

Prerequisites

The following prerequisites are necessary in order for the SAP Cloud Connector agent to work properly:

  • A SAP Cloud Connector application user must be created for the Check_MK agent to be able to authenticate against the SAP Cloud Connector and gain access to the protected monitoring API endpoints. See the article SAP Cloud Connector - Configuring Multiple Local Administrative Users on how to create a new application user.

  • A DNS alias or an additional IP address for the SAP Cloud Connector service.

  • An additional host in Check_MK for the SAP Cloud Connector service with the previously created DNS alias or IP address.

  • Installation of the Python requests library on the Check_MK server. This library is used in the Check_MK agent plugin agent_sapcc to perform the authentication and the HTTP requests against the monitoring API of the SAP Cloud Connector. On e.g. RHEL based systems it can be installed with:

    root@host:# yum install python-requests

  • Installation of the new Check_MK package for the SAP Cloud Connector monitoring checks on the Check_MK server.

SAP Cloud Connector Agent Plugin

The Check_MK agent plugin agent_sapcc is responsible for querying the endpoints of the monitoring API on the SAP Cloud Connector, which are described above. It transforms the data returned from the monitoring endpoints into a format digestible by Check_MK. The following example shows the – anonymized and abbreviated – agent plugin output for a SAP Cloud Connector system:

<<<check_mk>>>
Version: 0.1

<<<sapcc_connections_backends:sep(59)>>>
subaccounts,abcdefghi,locationID;Test Location
subaccounts,abcdefghi,regionHost;hana.ondemand.com
subaccounts,abcdefghi,subaccount;abcdefghi

<<<sapcc_performance_backends:sep(59)>>>
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,1,minimumCallDurationMs;10
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,1,numberOfCalls;1
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,2,minimumCallDurationMs;20
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,2,numberOfCalls;36
[...]
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,20,minimumCallDurationMs;3000
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,21,minimumCallDurationMs;4000
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,22,minimumCallDurationMs;5000
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,name;PROTOCOL/sapecc.example.com:44300
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,protocol;PROTOCOL
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,virtualHost;sapecc.example.com
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,virtualPort;44300
subaccounts,abcdefghi,locationID;Test Location
subaccounts,abcdefghi,regionHost;hana.ondemand.com
subaccounts,abcdefghi,sinceTime;2019-02-13T08:05:36.084 +0100
subaccounts,abcdefghi,subaccount;abcdefghi

<<<sapcc_performance_toptimeconsumers:sep(59)>>>
subaccounts,abcdefghi,locationID;Test Location
subaccounts,abcdefghi,regionHost;hana.ondemand.com
subaccounts,abcdefghi,requests,0,externalTime;373
subaccounts,abcdefghi,requests,0,id;932284302
subaccounts,abcdefghi,requests,0,internalBackend;sapecc.example.com:PORT
subaccounts,abcdefghi,requests,0,openRemoteTime;121
subaccounts,abcdefghi,requests,0,protocol;PROTOCOL
subaccounts,abcdefghi,requests,0,receivedBytes;264
subaccounts,abcdefghi,requests,0,resource;/sap-webservice-url/
subaccounts,abcdefghi,requests,0,sentBytes;4650
subaccounts,abcdefghi,requests,0,startTime;2019-02-13T11:31:59.113 +0100
subaccounts,abcdefghi,requests,0,totalTime;536
subaccounts,abcdefghi,requests,0,user;RFC_USER
subaccounts,abcdefghi,requests,0,virtualBackend;sapecc.example.com:PORT
subaccounts,abcdefghi,requests,1,externalTime;290
subaccounts,abcdefghi,requests,1,id;1882731830
subaccounts,abcdefghi,requests,1,internalBackend;sapecc.example.com:PORT
subaccounts,abcdefghi,requests,1,latencyTime;77
subaccounts,abcdefghi,requests,1,openRemoteTime;129
subaccounts,abcdefghi,requests,1,protocol;PROTOCOL
subaccounts,abcdefghi,requests,1,receivedBytes;264
subaccounts,abcdefghi,requests,1,resource;/sap-webservice-url/
subaccounts,abcdefghi,requests,1,sentBytes;4639
subaccounts,abcdefghi,requests,1,startTime;2019-02-13T11:31:59.114 +0100
subaccounts,abcdefghi,requests,1,totalTime;532
subaccounts,abcdefghi,requests,1,user;RFC_USER
subaccounts,abcdefghi,requests,1,virtualBackend;sapecc.example.com:PORT
[...]
subaccounts,abcdefghi,requests,49,externalTime;128
subaccounts,abcdefghi,requests,49,id;1774317106
subaccounts,abcdefghi,requests,49,internalBackend;sapecc.example.com:PORT
subaccounts,abcdefghi,requests,49,protocol;PROTOCOL
subaccounts,abcdefghi,requests,49,receivedBytes;263
subaccounts,abcdefghi,requests,49,resource;/sap-webservice-url/
subaccounts,abcdefghi,requests,49,sentBytes;4660
subaccounts,abcdefghi,requests,49,startTime;2019-02-16T11:32:09.352 +0100
subaccounts,abcdefghi,requests,49,totalTime;130
subaccounts,abcdefghi,requests,49,user;RFC_USER
subaccounts,abcdefghi,requests,49,virtualBackend;sapecc.example.com:PORT
subaccounts,abcdefghi,sinceTime;2019-02-13T08:05:36.085 +0100
subaccounts,abcdefghi,subaccount;abcdefghi

<<<sapcc_subaccounts:sep(59)>>>
subaccounts,abcdefghi,displayName;Test Application
subaccounts,abcdefghi,locationID;Test Location
subaccounts,abcdefghi,regionHost;hana.ondemand.com
subaccounts,abcdefghi,subaccount;abcdefghi
subaccounts,abcdefghi,tunnel,applicationConnections,abcdefg:hijklmnopqr,connectionCount;8
subaccounts,abcdefghi,tunnel,applicationConnections,abcdefg:hijklmnopqr,name;abcdefg:hijklmnopqr
subaccounts,abcdefghi,tunnel,applicationConnections,abcdefg:hijklmnopqr,type;JAVA
subaccounts,abcdefghi,tunnel,connectedSince;2019-02-14T10:11:00.630 +0100
subaccounts,abcdefghi,tunnel,connections;8
subaccounts,abcdefghi,tunnel,state;Connected
subaccounts,abcdefghi,tunnel,user;P123456
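The output above reads like a flattened JSON document: the path to each value is joined with commas and separated from the value by a semicolon, which is the character with ASCII code 59 named in the sep(59) section option. The following Python 3 sketch shows one way such lines could be produced. It is an illustration only: the function names are invented, and it keys list elements by their position, whereas the real agent output keys some lists by name (for example the subaccount name).

```python
def flatten(data, path=()):
    """Yield (path, value) pairs for every leaf in nested dicts and lists."""
    if isinstance(data, dict):
        for key in sorted(data):
            yield from flatten(data[key], path + (str(key),))
    elif isinstance(data, list):
        for index, item in enumerate(data):
            yield from flatten(item, path + (str(index),))
    else:
        yield path, data


def section_lines(name, data):
    """Render one agent section with ';' (ASCII 59) as the field separator."""
    lines = ["<<<{}:sep(59)>>>".format(name)]
    for path, value in flatten(data):
        lines.append("{};{}".format(",".join(path), value))
    return lines
```

Joining the path with commas keeps every line down to exactly two fields, so the check plugins only have to split the first field again to recover the structure.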

The agent plugin comes with a Check_MK check plugin of the same name, which is solely responsible for constructing the command line arguments from the WATO configuration and passing them to the Check_MK agent plugin.

With the additional WATO plugin sapcc_agent.py it is possible to configure the username and password for the SAP Cloud Connector application user which is used to connect to the monitoring API. It is also possible to configure the TCP port and the connection timeout for the connection to the monitoring API through the WATO WebUI and thus override the default values. The default value for the TCP port is 8443, the default value for the connection timeout is 30 seconds. The configuration options for the Check_MK agent plugin agent_sapcc can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Datasource Programs
      -> SAP Cloud Connector systems
         -> Create rule in folder ...
            -> Rule Options
               Description: SAP Cloud Connector (SCC)

            -> SAP Cloud Connector systems
               SAP Cloud Connector user name: [username]
               SAP Cloud Connector password: [password]
               SAP Cloud Connector TCP port: [8443]

            -> Conditions
               Folder [The folder containing the SAP Cloud Connector systems]
               and/or
               Explicit hosts [x]
               Specify explicit host names [SAP Cloud Connector systems]
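How the values from such a rule might end up on the command line of the agent plugin can be sketched as follows. The option names (--username, --password, --port, --timeout) and both function names are hypothetical and chosen for this example only, so the real agent_sapcc may use different ones. Only the default values of 8443 for the TCP port and 30 seconds for the timeout are taken from the description above.

```python
import argparse


def parse_arguments(argv):
    """Parse the command line of a hypothetical agent_sapcc-like program."""
    parser = argparse.ArgumentParser(
        description="Query the monitoring API of a SAP Cloud Connector"
    )
    parser.add_argument("--username", required=True, help="application user")
    parser.add_argument("--password", required=True, help="password of the user")
    parser.add_argument("--port", type=int, default=8443, help="TCP port")
    parser.add_argument("--timeout", type=int, default=30, help="timeout in seconds")
    parser.add_argument("hostname", help="DNS alias or IP address of the service")
    return parser.parse_args(argv)


def build_arguments(params, hostname):
    """Turn a WATO-style parameter dict into an argument list."""
    args = ["--username", params["username"], "--password", params["password"]]
    # Optional settings are only passed when the rule overrides the defaults.
    if "port" in params:
        args += ["--port", str(params["port"])]
    if "timeout" in params:
        args += ["--timeout", str(params["timeout"])]
    return args + [hostname]
```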

After saving the new rule, restarting Check_MK and doing an inventory on the additional host for the SAP Cloud Connector service in Check_MK, several new services starting with the name prefix SAP CC should appear.

The following image shows a status output example from the WATO WebUI with the service checks HTTP SAP SCC TLS and HTTP SAP SCC TLS Certificate from the Built-in Check_MK Monitoring described above. In addition to those, the example also shows the service checks based on the data from the SAP Cloud Connector Agent. The service checks SAP CC Application Connection, SAP CC Subaccount and SAP CC Tunnel are provided by the check plugin sapcc_subaccounts, the service check SAP CC Perf Backend is provided by the plugin sapcc_performance_backends:

Status output example for the complete monitoring of the SAP Cloud Connector

SAP Cloud Connector Subaccount

The check plugin sapcc_subaccounts implements the three sub-checks sapcc_subaccounts.app_conn, sapcc_subaccounts.info and sapcc_subaccounts.tunnel.

Info

The sub-check sapcc_subaccounts.info just gathers information on several configuration options for each subaccount on the SAP Cloud Connector and displays them in the status details of the check. These configuration options are the:

  • subaccount name on the SAP Cloud Platform to which the connection is made.

  • display name of the subaccount.

  • location ID of the subaccount.

  • region host of the SAP Cloud Platform to which the SAP Cloud Connector establishes a connection.

The sub-check sapcc_subaccounts.info always returns an OK status. No performance data is currently reported by this check.
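A check along these lines first has to turn the agent section back into a nested structure. The Python 3 sketch below assumes that each line arrives already split at the semicolon, as a list of the form [path, value], and rebuilds a nested dictionary from it. The function names and the wording of the status text are invented for this example.

```python
def parse_section(info):
    """Rebuild a nested dict from agent lines already split at ';'."""
    parsed = {}
    for line in info:
        if len(line) < 2:
            continue  # skip lines that carry no value
        keys = line[0].split(",")
        value = ";".join(line[1:])  # re-join values that contained a ';'
        node = parsed
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = value
    return parsed


def subaccount_info(parsed, subaccount):
    """Return (state, text) for one subaccount; the state is always 0 (OK)."""
    data = parsed["subaccounts"][subaccount]
    fields = ("subaccount", "displayName", "locationID", "regionHost")
    text = ", ".join("{}: {}".format(field, data.get(field, "n/a")) for field in fields)
    return 0, text
```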

Tunnel

The sub-check sapcc_subaccounts.tunnel is responsible for the monitoring of each tunnel connection for each subaccount on the SAP Cloud Connector. Upon inventory this sub-check creates a service check for each tunnel connection found on the SAP Cloud Connector. During normal check execution, the status of the tunnel connection is determined for each inventorized item. If the tunnel connection is not in the Connected state, an alarm is raised accordingly. Additionally, the number of currently active connections over a tunnel as well as the elapsed time in seconds since the tunnel connection was established are determined for each inventorized item. If either the number of currently active connections or the number of seconds since the connection was established is above or below the configured warning and critical threshold values, an alarm is raised accordingly. For both values, performance data is reported by the check.

With the additional WATO plugin sapcc_subaccounts.py it is possible to configure the warning and critical levels for the sub-check sapcc_subaccounts.tunnel through the WATO WebUI and thus override the following default values:

Metric                | Warning Low Threshold | Critical Low Threshold | Warning High Threshold | Critical High Threshold
Number of connections | 0                     | 0                      | 30                     | 40
Connection duration   | 0 sec                 | 0 sec                  | 284012568 sec          | 315569520 sec
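The evaluation described for the tunnel sub-check can be approximated with the following Python 3 sketch. It is not the code of the check plugin: the function names are invented, the comparisons are assumed to be inclusive ("equal or below", "equal or above"), and a tunnel that is not in the Connected state is assumed to result in a critical state. The default levels are the ones from the table above. As a side note, the two duration defaults correspond to 9 and 10 years of 365.2425 days each.

```python
from datetime import datetime, timezone

# Default levels: (warn_low, crit_low, warn_high, crit_high).
DEFAULT_CONNECTION_LEVELS = (0, 0, 30, 40)
DEFAULT_DURATION_LEVELS = (0, 0, 284012568, 315569520)


def level_state(value, levels):
    """Return 0 (OK), 1 (WARN) or 2 (CRIT) for a value and two-sided levels."""
    warn_low, crit_low, warn_high, crit_high = levels
    if value <= crit_low or value >= crit_high:
        return 2
    if value <= warn_low or value >= warn_high:
        return 1
    return 0


def seconds_since(timestamp, now=None):
    """Seconds elapsed since a timestamp like '2019-02-14T10:11:00.630 +0100'."""
    started = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%f %z")
    now = now or datetime.now(timezone.utc)
    return int((now - started).total_seconds())


def tunnel_state(tunnel, now=None):
    """Combine tunnel state, connection count and duration into one state."""
    if tunnel["state"] != "Connected":
        return 2  # assumption: a disconnected tunnel is treated as critical
    return max(
        level_state(int(tunnel["connections"]), DEFAULT_CONNECTION_LEVELS),
        level_state(seconds_since(tunnel["connectedSince"], now), DEFAULT_DURATION_LEVELS),
    )
```

With these defaults a tunnel without any active connection is reported as critical, because 0 is "equal or below" the lower critical level of 0.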

The configuration options for the tunnel connection levels can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Applications, Processes & Services
         -> SAP Cloud Connector Subaccounts
            -> Create Rule in Folder ...
               -> Rule Options
                  Description: SAP Cloud Connector Subaccounts

               -> Parameters
                  [x] Number of tunnel connections
                      Warning if equal or below [0] connections
                      Critical if equal or below [0] connections
                      Warning if equal or above [30] connections
                      Critical if equal or above [40] connections
                  [x] Connection time of tunnel connections 
                      Warning if equal or below [0] seconds
                      Critical if equal or below [0] seconds
                      Warning if equal or above [284012568] seconds
                      Critical if equal or above [315569520] seconds

               -> Conditions
                  Folder [The folder containing the SAP Cloud Connector systems]
                  and/or
                  Explicit hosts [x]
                  Specify explicit host names [SAP Cloud Connector systems]
                  and/or
                  Application Or Tunnel Name [x]
                  Specify explicit values [Tunnel name]

The above image with a status output example from the WATO WebUI shows one sapcc_subaccounts.tunnel service check as the last of the displayed items. The service name is prefixed by the string SAP CC Tunnel and followed by the subaccount name, which in this example is anonymized. For each tunnel connection the connection state, the overall number of application connections currently active over the tunnel, the time when the tunnel connection was established and the number of seconds elapsed since establishing the connection are shown. The overall number of currently active application connections is also visualized in the perf-o-meter, with a logarithmic scale growing from the left to the right.

The following image shows an example of the two metric graphs for a single sapcc_subaccounts.tunnel service check:

Example metric graph for a single sapcc_subaccounts.tunnel service check

The upper graph shows the time elapsed since the tunnel connection was established. The lower graph shows the overall number of application connections currently active over the tunnel connection. Both graphs would show warning and critical threshold values, which in this example are currently outside the displayed range of values for the y-axis.

Application Connection

The sub-check sapcc_subaccounts.app_conn is responsible for the monitoring of each application connection through each tunnel connection for each subaccount on the SAP Cloud Connector. Upon inventory this sub-check creates a service check for each application connection found on the SAP Cloud Connector. During normal check execution, the number of currently active connections for each application is determined for each inventorized item. If the number of currently active connections is above or below the configured warning and critical threshold values, an alarm is raised accordingly. For the number of currently active connections, performance data is reported by the check.

With the additional WATO plugin sapcc_subaccounts.py it is possible to configure the warning and critical levels for the sub-check sapcc_subaccounts.app_conn through the WATO WebUI and thus override the following default values:

Metric                | Warning Low Threshold | Critical Low Threshold | Warning High Threshold | Critical High Threshold
Number of connections | 0                     | 0                      | 30                     | 40

The configuration options for the application connection levels can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Applications, Processes & Services
         -> SAP Cloud Connector Subaccounts
            -> Create Rule in Folder ...
               -> Rule Options
                  Description: SAP Cloud Connector Subaccounts

               -> Parameters
                  [x] Number of application connections
                      Warning if equal or below [0] connections
                      Critical if equal or below [0] connections
                      Warning if equal or above [30] connections
                      Critical if equal or above [40] connections

               -> Conditions
                  Folder [The folder containing the SAP Cloud Connector systems]
                  and/or
                  Explicit hosts [x]
                  Specify explicit host names [SAP Cloud Connector systems]
                  and/or
                  Application Or Tunnel Name [x]
                  Specify explicit values [Application name]

The above image with a status output example from the WATO WebUI shows one sapcc_subaccounts.app_conn service check as the fifth of the displayed items, counted from the top. The service name is prefixed by the string SAP CC Application Connection and followed by the application name, which in this example is anonymized. For each application connection the number of currently active connections and the connection type are shown. The number of currently active application connections is also visualized in the perf-o-meter, with a logarithmic scale growing from the left to the right.

The following image shows an example of the metric graph for a single sapcc_subaccounts.app_conn service check:

Example metric graph for a single sapcc_subaccounts.app_conn service check

The graph shows the number of currently active application connections. The graph would show warning and critical threshold values, which in this example are currently outside the displayed range of values for the y-axis.

SAP Cloud Connector Performance Backends

The check sapcc_performance_backends is responsible for the monitoring of the performance of each (on-premise) backend system connected to the SAP Cloud Connector. Upon inventory this check creates a service check for each backend connection found on the SAP Cloud Connector. During normal check execution, the number of requests to the backend system, categorized into one of the 22 runtime buckets, is determined for each inventorized item. From these raw values, the request rate in requests per second is derived for each of the 22 runtime buckets. Also from the raw values, the following four additional metrics are derived:

  • calls_total: the total request rate over all of the 22 runtime buckets.

  • calls_pct_ok: the relative number of requests in percent with a runtime below a given runtime warning threshold.

  • calls_pct_warn: the relative number of requests in percent with a runtime above a given runtime warning threshold.

  • calls_pct_crit: the relative number of requests in percent with a runtime above a given runtime critical threshold.

If the relative number of requests is above the configured warning or critical threshold values, an alarm is raised accordingly. Performance data is reported by the check for each of the 22 runtime buckets, for the total number of requests and for the relative number of requests (calls_pct_ok, calls_pct_warn, calls_pct_crit).
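The derivation of these metrics can be illustrated with a short Python 3 sketch. It is not the code of the check plugin and rests on several assumptions: the function names are invented, calls_pct_warn is taken to cover only the requests between the warning and the critical runtime threshold so that the three percentages add up to 100, a backend without any requests is reported as 100% OK, and the default values are the ones configurable through the WATO rule for this check.

```python
def bucket_percentages(buckets, warn_ms=500, crit_ms=1000):
    """Split the calls into OK, WARN and CRIT shares in percent.

    buckets maps the minimumCallDurationMs of a bucket to its number of calls.
    """
    total = sum(buckets.values())
    if total == 0:
        return {"calls_pct_ok": 100.0, "calls_pct_warn": 0.0, "calls_pct_crit": 0.0}
    crit = sum(calls for ms, calls in buckets.items() if ms >= crit_ms)
    warn = sum(calls for ms, calls in buckets.items() if warn_ms <= ms < crit_ms)
    ok = total - warn - crit
    return {
        "calls_pct_ok": 100.0 * ok / total,
        "calls_pct_warn": 100.0 * warn / total,
        "calls_pct_crit": 100.0 * crit / total,
    }


def backend_state(percentages, warn_pct=10.0, crit_pct=5.0):
    """Return 0 (OK), 1 (WARN) or 2 (CRIT) from the relative number of calls."""
    if percentages["calls_pct_crit"] >= crit_pct:
        return 2
    if percentages["calls_pct_warn"] >= warn_pct:
        return 1
    return 0


def call_rate(previous_count, current_count, elapsed_seconds):
    """Requests per second between two samples of a bucket counter.

    Returns None if no time has passed or the counter was reset in between.
    """
    if elapsed_seconds <= 0 or current_count < previous_count:
        return None
    return (current_count - previous_count) / elapsed_seconds
```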

With the additional WATO plugin sapcc_performance_backends.py it is possible to configure the warning and critical levels for the check sapcc_performance_backends through the WATO WebUI and thus override the following default values:

Metric                                                 | Warning Threshold | Critical Threshold
Request runtime                                        | 500 msec          | 1000 msec
Percentage of requests over request runtime thresholds | 10%               | 5%

The configuration options for the backend performance levels can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Applications, Processes & Services
         -> SAP Cloud Connector Backend Performance
            -> Create Rule in Folder ...
               -> Rule Options
                  Description: SAP Cloud Connector Backend Performance

               -> Parameters
                  [x] Runtime bucket definition and calls per bucket in percent
                      Warning if percentage of calls in warning bucket equal or above [10.00] %
                      Assign calls to warning bucket if runtime equal or above [500] milliseconds
                      Critical if percentage of calls in critical bucket equal or above [5.00] %
                      Assign calls to critical bucket if runtime equal or above [1000] milliseconds

               -> Conditions
                  Folder [The folder containing the SAP Cloud Connector systems]
                  and/or
                  Explicit hosts [x]
                  Specify explicit host names [SAP Cloud Connector systems]
                  and/or
                  Backend Name [x]
                  Specify explicit values [Backend name]

The above image with a status output example from the WATO WebUI shows one sapcc_performance_backends service check as the sixth of the displayed items, counted from the top. The service name is prefixed by the string SAP CC Perf Backend and followed by a string concatenated from the protocol, FQDN and TCP port of the backend system, which in this example is anonymized. For each backend connection the total number of requests, the total request rate, the percentage of requests below the runtime warning threshold, the percentage of requests above the runtime warning threshold and the percentage of requests above the runtime critical threshold are shown. The relative number of requests in percent is also visualized in the perf-o-meter.

The following image shows an example of the metric graph for the total request rate from the sapcc_performance_backends service check:

Example metric graph for the total request rate from the sapcc_performance_backends service check

The following image shows an example of the metric graph for the relative number of requests from the sapcc_performance_backends service check:

Example metric graph for the relative number of requests from the sapcc_performance_backends service check

The graph shows the percentage of requests below the runtime warning threshold in the color green at the bottom, the percentage of requests above the runtime warning threshold in the color yellow stacked above and the percentage of requests above the runtime critical threshold in the color red stacked at the top.

The following image shows an example of the combined metric graphs for the request rates to a single backend system in each of the 22 runtime buckets from the sapcc_performance_backends service check:

Example combined metric graphs for the request rates to a single backend system in each of the 22 runtime buckets from the sapcc_performance_backends service check

To provide a better overview, the individual metrics are grouped together into three graphs. The first graph shows the request rate in the runtime buckets >=10ms, >=20ms, >=30ms, >=40ms, >=50ms, >=75ms and >=100ms. The second graph shows the request rate in the runtime buckets >=125ms, >=150ms, >=200ms, >=300ms, >=400ms, >=500ms, >=750ms and >=1000ms. The third and last graph shows the request rate in the runtime buckets >=1250ms, >=1500ms, >=2000ms, >=2500ms, >=3000ms, >=4000ms and >=5000ms.
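The bucket boundaries and their grouping into the three graphs can be written down compactly. The Python 3 sketch below lists the 22 lower bounds named above and looks up the bucket for a given request runtime. It assumes that each bucket spans from its own lower bound up to the lower bound of the next bucket, and that runtimes below 10 ms fall into no bucket. Both assumptions, as well as the names used, are made up for this example.

```python
import bisect

# Lower bounds in milliseconds of the 22 runtime buckets.
BUCKET_BOUNDS_MS = [
    10, 20, 30, 40, 50, 75, 100,
    125, 150, 200, 300, 400, 500, 750, 1000,
    1250, 1500, 2000, 2500, 3000, 4000, 5000,
]

# Grouping of the buckets into the three combined graphs.
GRAPH_GROUPS = [BUCKET_BOUNDS_MS[:7], BUCKET_BOUNDS_MS[7:15], BUCKET_BOUNDS_MS[15:]]


def bucket_for(runtime_ms):
    """Return the lower bound of the bucket for a runtime, or None below 10 ms."""
    index = bisect.bisect_right(BUCKET_BOUNDS_MS, runtime_ms) - 1
    return BUCKET_BOUNDS_MS[index] if index >= 0 else None
```

Under these assumptions a request with a total runtime of 536 ms would be counted in the >=500ms bucket, which is part of the second graph.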

The following image shows an example of the individual metric graphs for the request rates to a single backend system in each of the 22 runtime buckets from the sapcc_performance_backends service check:

Example individual metric graphs for the request rates to a single backend system in each of the 22 runtime buckets from the sapcc_performance_backends service check

Each of the metric graphs shows exactly the same data as the previously shown combined graphs. The combined metric graphs are actually based on the individual metric graphs for the request rates to a single backend system.

Conclusion

The newly introduced checks for the SAP Cloud Connector enable you to monitor several application-specific aspects of the SAP Cloud Connector with your Check_MK server. The built-in Check_MK monitoring facilities and the new agent plugin for the SAP Cloud Connector complement each other in this regard. While the new SAP Cloud Connector agent plugin for Check_MK utilizes most of the data provided by the monitoring endpoints on the SAP Cloud Connector, a more in-depth monitoring could be achieved if the data from the Most Recent Requests metric were also exposed over the monitoring API of the SAP Cloud Connector. I hope this will be the case in a future release of the SAP Cloud Connector.

I hope you find the new checks useful and enjoyed reading this blog post. Please don't hesitate to drop me a note if you have any suggestions or run into any issues with the provided checks.

// Experiences with Dell PowerConnect Switches

This blog post is going to be about my recent experiences with Broadcom FASTPATH based Dell PowerConnect M-Series M8024-k and M6348 switches, especially with their various limitations and – in my opinion – sometimes buggy behaviour.


Recently I was given the opportunity to build a new and central storage and virtualization environment from the ground up. This involved a set of hardware systems which – unfortunately – were chosen and purchased previously, before I came on board with the project.

System environment

Specifically those hardware components were:

  • Multiple Dell PowerEdge M1000e blade chassis

  • Multiple Dell PowerEdge M-Series blade servers, all equipped with Intel X520 network interfaces for LAN connectivity through fabric A of the blade chassis. Servers with additional central storage requirements were also equipped with QLogic QME/QMD8262 or QLogic/Broadcom BCM57810S iSCSI HBAs for SAN connectivity through fabric B of the blade chassis.

  • Multiple Dell PowerConnect M8024-k switches in fabric A of the blade chassis forming the LAN network. Those were configured and interconnected as a stack of switches. Each stack of switches had two uplinks, one to each of two carrier grade Cisco border routers. Since the network edge was between those two border routers on the one side and the stack of M8024-k switches on the other side, the switch stack was also used as a layer 3 device and was thus running the default gateways of the local network segments provided to the blade servers.

  • Multiple Dell PowerConnect M6348 switches, which were connected through aggregated links to the stack of M8024-k switches described above. These switches were exclusively used to provide a LAN connection for external, standalone servers and devices through their external 1 GBit ethernet interfaces. The M6348 switches were located in the slots belonging to fabric C of the blade chassis.

  • Multiple Dell PowerConnect M8024-k switches in fabric B of the blade chassis forming the SAN network. In contrast to the M8024-k LAN switches, the M8024-k SAN switches were configured and interconnected as individual switches. Since there was no need for outside SAN connectivity, the M8024-k switches in fabric B ran a flat layer 2 network without any layer 3 configuration.

  • Initially all PowerConnect switches – M8024-k both LAN and SAN and M6348 – ran the firmware version 5.1.8.2.

  • Multiple Dell EqualLogic PS Series storage systems, providing central block storage capacity for the PowerEdge M-Series blade servers via iSCSI over the SAN mentioned above. Some blade chassis based PS Series models (PS-M4110) were internally connected to the SAN formed by the M8024-k switches in fabric B. Other standalone PS Series models were connected to the same SAN utilizing the external ports of the M8024-k switches.

  • Multiple Dell EqualLogic FS Series file server appliances, providing central NFS and CIFS storage capacity over the LAN mentioned above. In the back-end those FS Series file server appliances also used the block storage capacity provided by the PS Series storage systems via iSCSI over the SAN mentioned above. Both LAN and SAN connections of the EqualLogic FS Series were made through the external ports of the M8024-k switches.

There were multiple locations with roughly the same setup composed of the hardware components described above. Each location had two daisy-chained Dell PowerEdge M1000e blade chassis systems. The layer 2 LAN and SAN networks stretched over the two blade chassis. The setup at each location is shown in the following schematic:

Schematic of the Dell PowerConnect LAN and SAN setup

All in all not an ideal setup. Instead, I would have preferred a pair of capable – both functionality and performance-wise – central top-of-rack switches to which the individual M1000e blade chassis would have been connected, preferably with a separate pair each for LAN and SAN connectivity. But again, the mentioned components were already preselected and pre-purchased.

During the implementation and later the operational phase, several limitations and issues surfaced with regard to the Dell PowerConnect switches and the networks built with them. The following – probably not exhaustive – list of limitations and issues I've encountered is in no particular order with regard to their occurrence or severity.

Limitations

  • While the Dell PowerConnect switches support VRRP as a redundancy protocol for layer 3 instances, there is only support for VRRP version 2, described in RFC 3768. This limits the use of VRRP to IPv4 only. VRRP version 3 described in RFC 5798, which is needed for the implementation of redundant layer 3 instances for both IPv4 and IPv6, is not supported by Dell PowerConnect switches. Due to this limitation and the need for full IPv6 support in the whole environment, the design decision was made to run the Dell PowerConnect M8024-k switches for the LAN as a stack of switches.

  • Limited support of routing protocols. There is only support for the routing protocols OSPF and RIP v2 in Dell PowerConnect switches. In this specific setup and triggered by the design decision to run the LAN switches as layer 3 devices, BGP would have been a more suitable routing protocol. Unfortunately there were no plans to implement BGP on the Dell PowerConnect devices.

  • Limitation in the number of secondary interface addresses. Only one IPv4 secondary address is supported per interface on a layer 3 instance running on the Dell PowerConnect switches. In contrast to, for example, Cisco based layer 3 capable switches, this limitation caused the need for a lot more (VLAN) interfaces in this particular setup than would otherwise have been necessary.

  • No IPv6 secondary interface addresses. For IPv6 based layer 3 instances there is no support at all for secondary interface addresses, although this might be a fundamental rather than a product specific limitation.

  • For layer 3 instances in general there is no support for very small IPv4 subnets (e.g. /31 with 2 IPv4 addresses) which are usually used for transfer networks. In setups using private IPv4 address ranges this is no big issue. In this case though, official IPv4 addresses were used and in conjunction with the excessive need for VLAN interfaces this limitation caused a lot of wasted official IPv4 addresses.

  • The access control list (ACL) feature is very limited and rather rudimentary in Dell PowerConnect switches. There is no support for port ranges, no statefulness and each access list has a hard limit of 256 access list entries. All three – and possibly even more – limitations in combination make the ACL feature of Dell PowerConnect switches almost useless, especially if there are separate layer 3 networks on the system which are in need of fine-grained traffic control.

  • Regarding the performance of ACLs, I have gotten the impression that especially IPv6 ACLs are handled by the switch's CPU. If IPv6 is used in conjunction with extensive ACLs, this would dramatically impact the network performance of IPv6-based traffic. Admittedly I have no hard proof to support this suspicion.

  • The out-of-band (OOB) management interface of the Dell PowerConnect switches does not provide true out-of-band management. Instead it is integrated into the switch as just another IP interface, although one with a special purpose. Due to this interaction of the OOB interface with the IP stack of the Dell PowerConnect switch, there are side-effects when the switch is running at least one layer 3 instance. In this case, the standard IP routing table of the switch is not only used for routing decisions of the payload traffic, but also to determine the destination of packets originating from the OOB interface. This behaviour can cause an asymmetric traffic flow when the systems connecting to the OOB interface are covered by an entry in the switch's IP routing table. This is far from ideal when it comes to true OOB management, not to mention the issues arising when stateful firewall rules are also involved.

    I addressed this limitation with a support case at Dell and got the following statement back:

    FASTPATH can learn a default gateway for the service port, the network port,
    or a routing interface. The IP stack can only have a single default gateway.
    (The stack may accept multiple default routes, but if we let that happen we may
    end up with load balancing across the network and service port or some other
    combination we don't want.) RTO may report an ECMP default route. We only give
    the IP stack a single next hop in this case, since it's not likely we need to
    additional capacity provided by load sharing for packets originating on the
    box.

    The precedence of default gateways is as follows:
    - via routing interface
    - via service port
    - via network port

    As per the above precedence, ip stack is having the default gateway which is
    configured through RTO. When the customer is trying to ping the OOB from
    different subnet , route table donesn't have the exact route so,it prefers the
    default route and it is having the RTO default gateway as next hop ip. Due to
    this, it egresses from the data port.

    If we don't have the default route which is configured through RTO then IP
    stack is having the OOB default gateway as next hop ip. So, it egresses from
    the OOB IP only.

    In my opinion this just confirms how the OOB management of the Dell PowerConnect switches is severely broken by design.

  • Another issue with the out-of-band (OOB) management interface of the Dell PowerConnect switches is that it supports only a very limited access control list (ACL) to protect access to the switch. The management ACL only supports one IPv4 ACL entry. IPv6 support within the management ACL protecting the OOB interface is missing altogether.

  • The Dell PowerConnect switches have no support for Shortest Path Bridging (SPB) as defined in the IEEE 802.1aq standard. On layer 2 the traditional spanning-tree protocols STP (IEEE 802.1D), RSTP (IEEE 802.1w) or MSTP (IEEE 802.1s) have to be used. This is a particular drawback in the SAN network shown in the schematic above, because the protocol forces one inter-switch link to stay inactive. With the use of SPB, all inter-switch links could be equally utilized and a traffic interruption upon link failure and spanning-tree (re)convergence could be avoided.

  • Another SAN-specific limitation is the incomplete implementation of Data Center Bridging (DCB) in the Dell PowerConnect switches. Although the protocols Priority-based Flow Control (PFC) according to IEEE 802.1Qbb and Congestion Notification (CN) according to IEEE 802.1Qau are supported, the third required protocol, Enhanced Transmission Selection (ETS) according to IEEE 802.1Qaz, is missing in Dell PowerConnect switches. The Dell EqualLogic PS Series storage systems used in the setup shown above explicitly need ETS if DCB is to be used on layer 2. Since ETS is not implemented in Dell PowerConnect switches, the traditional layer 2 protocols had to be used in the SAN.
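
Returning to the out-of-band management limitation: the default gateway precedence quoted in the Dell support statement above can be sketched in a few lines of Python. This is only an illustration of the described behaviour and not FASTPATH code. The function name, the dictionary layout and the example addresses (taken from the documentation address ranges) are assumptions made for this sketch.

```python
# Precedence of default gateway sources as quoted in the support statement:
# routing interface first, then service port (OOB), then network port.
PRECEDENCE = ("routing interface", "service port", "network port")


def select_default_gateway(candidates):
    """Return (source, next_hop) of the single default gateway in use.

    `candidates` maps a source name to its configured next hop. A source
    without a default gateway is simply absent or mapped to None.
    """
    for source in PRECEDENCE:
        next_hop = candidates.get(source)
        if next_hop is not None:
            return source, next_hop
    return None


# Hypothetical next hops from the documentation ranges (RFC 5737).
with_l3 = {"routing interface": "192.0.2.1", "service port": "198.51.100.1"}
oob_only = {"service port": "198.51.100.1"}

# With a layer 3 default route, replies to a remote OOB client follow the
# routing interface and therefore leave through a data port.
print(select_default_gateway(with_l3))
# Without it, the OOB default gateway is used and replies leave via the OOB.
print(select_default_gateway(oob_only))
```

The first case corresponds to the asymmetric traffic flow described above: the request arrives on the OOB interface, while the reply follows the default route learned via the routing interface.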

Issues

  • Not per se an issue, but the baseline CPU utilization on Dell PowerConnect M8024-k switches running layer 3 instances is significantly higher compared to those running only as layer 2 devices. The following CPU utilization graphs show a direct comparison of a layer 3 (upper graph) and a layer 2 (lower graph) device:

    CPU utilization on a Dell PowerConnect M8024-k switch as a Layer 3 device
    CPU utilization on a Dell PowerConnect M8024-k switch as a Layer 2 device

    The CPU utilization is between 10 and 15% higher once the tasks of processing layer 3 traffic are involved. What kind of switch function or what type of traffic is causing this additional CPU utilization is completely opaque. Documentation on such in-depth subjects or details on how the processing within the Dell PowerConnect switches works is very scarce. It would be very interesting to know what kind of traffic is sent to the switch's CPU for processing instead of being handled by the hardware.

  • The very high CPU utilization plateau on the right hand side of the upper graph (approximately between 10:50 and 11:05) was due to a bug in the processing of IPv6 traffic on Dell PowerConnect switches. This issue caused IPv6 packets to be sent to the switch's CPU for processing instead of the forwarding decision being made in hardware. I narrowed down the issue by transferring a large file between two hosts via the SCP protocol. In the first case, determined by the preferred name resolution via DNS, an IPv6 connection was used:

    user@host1:~$ scp testfile.dmp user@host2:/var/tmp/
    testfile.dmp                                   8%  301MB 746.0KB/s 1:16:05 ETA

    The CPU utilization on the switch stack during the transfer was monitored on the switch's CLI:

    stack1(config)# show process cpu
    
    Memory Utilization Report
    
    status      bytes
    ------ ----------
      free  170642152
     alloc  298144904
    
    CPU Utilization:
    
      PID      Name                    5 Secs     60 Secs    300 Secs
    -----------------------------------------------------------------
     41be030 tNet0                     27.05%      30.44%      21.13%
     41cbae0 tXbdService                2.60%       0.40%       0.09%
     43d38d0 ipnetd                     0.40%       0.11%       0.11%
     43ee580 tIomEvtMon                 0.40%       0.09%       0.22%
     43f7d98 osapiTimer                 2.00%       3.56%       3.13%
     4608b68 bcmL2X.0                   0.00%       0.08%       1.16%
     462f3a8 bcmCNTR.0                  1.00%       0.87%       1.04%
     4682d40 bcmTX                      4.20%       5.12%       3.83%
     4d403a0 bcmRX                      9.21%      12.64%      10.35%
     4d60558 bcmNHOP                    0.80%       0.21%       0.11%
     4d72e10 bcmATP-TX                  0.80%       0.24%       0.32%
     4d7c310 bcmATP-RX                  0.20%       0.12%       0.14%
     53321e0 MAC Send Task              0.20%       0.19%       0.40%
     533b6e0 MAC Age Task               0.00%       0.05%       0.09%
     5d59520 bcmLINK.0                  5.41%       2.75%       2.15%
     84add18 tL7Timer0                  0.00%       0.22%       0.23%
     84ca140 osapiWdTask                0.00%       0.05%       0.05%
     84d3640 osapiMonTask               0.00%       0.00%       0.01%
     84d8b40 serialInput                0.00%       0.00%       0.01%
     95e8a70 servPortMonTask            0.40%       0.09%       0.12%
     975a370 portMonTask                0.00%       0.06%       0.09%
     9783040 simPts_task                0.80%       0.73%       1.40%
     9b70100 dtlTask                    5.81%       7.52%       5.62%
     9dc3da8 emWeb                      0.40%       0.12%       0.09%
     a1c9400 hapiRxTask                 4.00%       8.84%       6.46%
     a65ba38 hapiL3AsyncTask            1.60%       0.45%       0.37%
     abcd0c0 DHCP snoop                 0.00%       0.00%       0.20%
     ac689d0 Dynamic ARP Inspect        0.40%       0.10%       0.05%
     ac7a6c0 SNMPTask                   0.40%       0.19%       0.95%
     b8fa268 dot1s_timer_task           1.00%       0.78%       2.74%
     b9134c8 dot1s_task                 0.20%       0.07%       0.04%
     bdb63e8 dot1xTimerTask             0.00%       0.03%       0.02%
     c520db8 radius_task                0.00%       0.02%       0.05%
     c52a0b0 radius_rx_task             0.00%       0.03%       0.03%
     c58a2e0 tacacs_rx_task             0.20%       0.06%       0.15%
     c59ce70 unitMgrTask                0.40%       0.10%       0.20%
     c5c7410 umWorkerTask               1.80%       0.27%       0.13%
     c77ef60 snoopTask                  0.60%       0.25%       0.16%
     c8025a0 dot3ad_timer_task          1.00%       0.24%       0.61%
     ca2ab58 dot3ad_core_lac_tas        0.00%       0.02%       0.00%
     d1860b0 dhcpsPingTask              0.20%       0.13%       0.39%
     d18faa0 SNTP                       0.00%       0.02%       0.01%
     d4dc3b0 sFlowTask                  0.00%       0.00%       0.03%
     d6a4448 spmTask                    0.00%       0.13%       0.14%
     d6b79c8 fftpTask                   0.40%       0.06%       0.01%
     d6dcdf0 tCkptSvc                   0.00%       0.00%       0.01%
     d7babe8 ipMapForwardingTask        0.40%       0.18%       0.29%
     dba91b8 tArpCallback               0.00%       0.04%       0.04%
     defb340 ARP Timer                  2.60%       0.92%       1.29%
     e1332f0 tRtrDiscProcessingT        0.00%       0.00%       0.11%
    12cabe30 ip6MapLocalDataTask        0.00%       0.03%       0.01%
    12cb5290 ip6MapExceptionData       11.42%      12.95%       9.41%
    12e1a0d8 lldpTask                   0.60%       0.17%       0.30%
    12f8cd10 dnsTask                    0.00%       0.00%       0.01%
    140b4e18 dnsRxTask                  0.00%       0.03%       0.03%
    14176898 DHCPv4 Client Task         0.00%       0.01%       0.02%
    1418a3f8 isdpTask                   0.00%       0.00%       0.10%
    14416738 RMONTask                   0.00%       0.20%       0.42%
    144287f8 boxs Req                   0.20%       0.09%       0.21%
    15c90a18 sshd                       0.40%       0.07%       0.07%
    15cde0e0 sshd[0]                    0.20%       0.05%       0.02%
    -----------------------------------------------------------------
     Total CPU Utilization             89.77%      92.50%      77.29%

    In the second case an IPv4 connection was deliberately chosen:

    user@host1:~$ scp testfile.dmp user@10.0.0.1:/var/tmp/
    testfile.dmp                                 100% 3627MB  31.8MB/s   01:54

    In the second case, using an IPv4 connection, the transfer rate of the SCP copy process was significantly higher and the transfer time consequently much lower. The CPU utilization on the switch stack during the transfer was also much lower:

    stack1(config)# show process cpu
    
    Memory Utilization Report
    
    status      bytes
    ------ ----------
      free  170642384
     alloc  298144672
    
    CPU Utilization:
    
      PID      Name                    5 Secs     60 Secs    300 Secs
    -----------------------------------------------------------------
     41be030 tNet0                      0.80%      23.49%      21.10%
     41cbae0 tXbdService                0.00%       0.17%       0.08%
     43d38d0 ipnetd                     0.20%       0.14%       0.12%
     43ee580 tIomEvtMon                 0.60%       0.26%       0.24%
     43f7d98 osapiTimer                 2.20%       3.10%       3.08%
     4608b68 bcmL2X.0                   4.20%       1.10%       1.22%
     462f3a8 bcmCNTR.0                  0.80%       0.80%       0.99%
     4682d40 bcmTX                      0.20%       3.35%       3.59%
     4d403a0 bcmRX                      4.80%       9.90%      10.06%
     4d60558 bcmNHOP                    0.00%       0.11%       0.10%
     4d72e10 bcmATP-TX                  1.00%       0.30%       0.32%
     4d7c310 bcmATP-RX                  0.00%       0.14%       0.15%
     53321e0 MAC Send Task              0.80%       0.39%       0.42%
     533b6e0 MAC Age Task               0.00%       0.12%       0.10%
     5d59520 bcmLINK.0                  1.80%       2.38%       2.14%
     84add18 tL7Timer0                  0.00%       0.11%       0.20%
     84ca140 osapiWdTask                0.00%       0.05%       0.05%
     84d3640 osapiMonTask               0.00%       0.00%       0.01%
     84d8b40 serialInput                0.00%       0.00%       0.01%
     95e8a70 servPortMonTask            0.20%       0.09%       0.11%
     975a370 portMonTask                0.00%       0.06%       0.09%
     9783040 simPts_task                3.20%       1.54%       1.49%
     9b70100 dtlTask                    0.20%       5.47%       5.45%
     9dc3da8 emWeb                      0.40%       0.13%       0.09%
     a1c9400 hapiRxTask                 0.20%       6.46%       6.30%
     a65ba38 hapiL3AsyncTask            0.40%       0.37%       0.35%
     abcd0c0 DHCP snoop                 0.00%       0.02%       0.18%
     ac689d0 Dynamic ARP Inspect        0.40%       0.15%       0.07%
     ac7a6c0 SNMPTask                   0.00%       1.32%       1.12%
     b8fa268 dot1s_timer_task           7.21%       2.99%       2.97%
     b9134c8 dot1s_task                 0.00%       0.03%       0.03%
     bdb63e8 dot1xTimerTask             0.00%       0.01%       0.02%
     c520db8 radius_task                0.00%       0.01%       0.04%
     c52a0b0 radius_rx_task             0.00%       0.03%       0.03%
     c58a2e0 tacacs_rx_task             0.20%       0.21%       0.17%
     c59ce70 unitMgrTask                0.60%       0.20%       0.21%
     c5c7410 umWorkerTask               0.20%       0.17%       0.12%
     c77ef60 snoopTask                  0.20%       0.18%       0.15%
     c8025a0 dot3ad_timer_task          2.20%       0.80%       0.68%
     d1860b0 dhcpsPingTask              1.80%       0.58%       0.45%
     d18faa0 SNTP                       0.00%       0.00%       0.01%
     d4dc3b0 sFlowTask                  0.20%       0.03%       0.03%
     d6a4448 spmTask                    0.20%       0.15%       0.14%
     d6b79c8 fftpTask                   0.00%       0.02%       0.01%
     d6dcdf0 tCkptSvc                   0.00%       0.00%       0.01%
     d7babe8 ipMapForwardingTask        0.20%       0.19%       0.28%
     dba91b8 tArpCallback               0.00%       0.06%       0.05%
     defb340 ARP Timer                  4.60%       1.54%       1.36%
     e1332f0 tRtrDiscProcessingT        0.40%       0.14%       0.12%
    12cabe30 ip6MapLocalDataTask        0.00%       0.01%       0.01%
    12cb5290 ip6MapExceptionData        0.00%       8.60%       8.91%
    12cbe790 ip6MapNbrDiscTask          0.00%       0.02%       0.00%
    12e1a0d8 lldpTask                   0.80%       0.24%       0.29%
    12f8cd10 dnsTask                    0.00%       0.00%       0.01%
    140b4e18 dnsRxTask                  0.40%       0.07%       0.04%
    14176898 DHCPv4 Client Task         0.00%       0.00%       0.02%
    1418a3f8 isdpTask                   0.00%       0.00%       0.09%
    14416738 RMONTask                   1.00%       0.44%       0.44%
    144287f8 boxs Req                   0.40%       0.16%       0.21%
    15c90a18 sshd                       0.20%       0.06%       0.06%
    15cde0e0 sshd[0]                    0.00%       0.03%       0.02%
    -----------------------------------------------------------------
     Total CPU Utilization             43.28%      78.79%      76.50%

    Comparing the two output samples above by per process CPU utilization showed that the major share of the higher CPU utilization in the case of an IPv6 connection is attributable to the processes tNet0, bcmTX, bcmRX, bcmLINK.0, dtlTask, hapiRxTask and ip6MapExceptionData. In a process by process comparison, those seven processes used 60.3% more CPU time in the case of an IPv6 connection compared to the case using an IPv4 connection. Unfortunately the documentation on what exactly the individual processes are doing is very sparse or not available at all. In order to further analyze this issue, a support case with the collected information was opened with Dell. A fix for the described issue was made available with firmware version 5.1.9.3.

  • The LAN stack of several Dell PowerConnect M8024-k switches sometimes showed erratic behaviour. There were several occasions where the switch stack would suddenly show a hugely increased latency in packet processing or where it would just stop passing certain types of traffic altogether. Usually a reload of the stack would restore its operation, and the increased latency or the packet drops would disappear with the reload as suddenly as they had appeared. The root cause of this was unfortunately never really found. Maybe it was the combination of functions (layer 3, dual stack IPv4 and IPv6, extensive ACLs, etc.) that were running simultaneously on the stack in this setup.

  • During both planned and unplanned failovers of the master switch in the stack, there is a time period of up to 120 seconds where no packets are processed by the switch stack. This occurs even with continuous forwarding enabled. I had a strong suspicion that this issue was related to the layer 3 instances running on the switch stack. A comparison between a pure layer 2 stack and a layer 3 enabled stack in a controlled test environment confirmed this. As soon as at least one layer 3 instance was added, the described delay occurred on switch failovers. The fact that migrating layer 3 instances from the former master switch to the new one takes some time makes sense to me. What's unclear to me is why this seems to also affect the layer 2 traffic going over the stack.

  • There were several occasions where the hardware and software MAC tables of the Dell PowerConnect switches got out of sync. While the root cause (hardware defect, bit flip, power surge, cosmic radiation, etc.) of this issue is unknown, the effect was a sudden reboot of the affected switch. Luckily we had console servers in place, which were storing a console output history from the time the issue occurred. After raising a support case with Dell with the information from the console output, we got a firmware update (v5.1.9.4) in which the issue would not trigger a sudden reboot anymore, but instead log an appropriate message to the switch's log. With this fix the out of sync MAC tables will still require a reboot of the affected switch, but this can now be done in a controlled fashion. Still, a solution requiring no reboot at all would have been much preferable.

  • While querying the Dell PowerConnect switches with the SNMP protocol for monitoring purposes, obscure and confusing messages containing the string MGMT_ACAL would reproducibly be logged into the switch's log. See the article Check_MK Monitoring - Dell PowerConnect Switches - Global Status in this blog for the gory details.

  • With a stack of Dell PowerConnect M8024-k switches the information provided via the SNMP protocol would occasionally get out of sync with the information available from the CLI. E.g. the temperature values from the stack stack1 of LAN switches compared to the standalone SAN switches standalone{1,2,3,4,5,6}:

    user@host:# for HST in stack1 standalone1 standalone2 standalone3 stack2 standalone4 standalone5 standalone6; do 
      echo "$HST: ";
      for OID in 4 5; do
        echo -n "  ";
        snmpbulkwalk -v2c -c [...] -m '' -M '' -Cc -OQ -OU -On -Ot $HST .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.${OID};
      done;
    done
    
    stack1: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4 = No Such Object available on this agent at this OID
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5 = No Such Object available on this agent at this OID
    standalone1: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 40
    standalone2: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 37
    standalone3: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 32
    stack2: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.2.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 42
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.2.0 = 41
    standalone4: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 39
    standalone5: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 39
    standalone6: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 35

    At the same time the CLI management interface of the switch stack showed the correct temperature values:

    stack1# show system       
    
    System Description: Dell Ethernet Switch
    System Up Time: 89 days, 01h:50m:11s
    System Name: stack1
    Burned In MAC Address: F8B1.566E.4AFB
    System Object ID: 1.3.6.1.4.1.674.10895.3041
    System Model ID: PCM8024-k
    Machine Type: PowerConnect M8024-k
    Temperature Sensors:
    
    Unit     Description       Temperature    Status
                                (Celsius)
    ----     -----------       -----------    ------
    1        System            39             Good
    2        System            39             Good
    [...]

    Only after a reboot of the switch stack, the information provided via the SNMP protocol:

    user@host:# for HST in stack1 standalone1 standalone2 standalone3 stack2 standalone4 standalone5 standalone6; do 
      echo "$HST: ";
      for OID in 4 5; do
        echo -n "  ";
        snmpbulkwalk -v2c -c [...] -m '' -M '' -Cc -OQ -OU -On -Ot $HST .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.${OID};
      done;
    done
    
    stack1: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.2.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 37
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.2.0 = 37
    standalone1: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 39
    standalone2: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 37
    standalone3: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 32
    stack2: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 0
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.2.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 41
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.2.0 = 41
    standalone4: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 38
    standalone5: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 38
    standalone6: 
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.4.1.0 = 1
      .1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1.5.1.0 = 34

    would again be in sync with the information available from the CLI:

    stack1# show system   
    
    System Description: Dell Ethernet Switch
    System Up Time: 0 days, 00h:05m:32s
    System Name: stack1
    Burned In MAC Address: F8B1.566E.4AFB
    System Object ID: 1.3.6.1.4.1.674.10895.3041
    System Model ID: PCM8024-k
    Machine Type: PowerConnect M8024-k
    Temperature Sensors:
    
    Unit     Description       Temperature    Status
                                (Celsius)
    ----     -----------       -----------    ------
    1        System            37             Good
    2        System            37             Good
    [...]
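
A monitoring check could detect the out-of-sync condition shown above by looking for the "No Such Object" answer in the snmpbulkwalk output. The following is a minimal sketch, assuming output lines in the `OID = value` form shown in the walks above. The function name and the status texts are made up for illustration and are not part of any existing Check_MK check.

```python
def parse_walk(output):
    """Split snmpbulkwalk output lines of the form 'OID = value'.

    Returns (values, missing): a dict of OID -> integer value and a list of
    OIDs for which the agent answered 'No Such Object'.
    """
    values, missing = {}, []
    for line in output.splitlines():
        if " = " not in line:
            continue  # skip blank lines and host name headings
        oid, value = (part.strip() for part in line.split(" = ", 1))
        if value.startswith("No Such Object"):
            missing.append(oid)
        else:
            values[oid] = int(value)
    return values, missing


BASE = ".1.3.6.1.4.1.674.10895.5000.2.6132.1.1.43.1.8.1"

# Sample lines copied from the walks above: the stack before the reboot ...
broken = f"""
  {BASE}.4 = No Such Object available on this agent at this OID
  {BASE}.5 = No Such Object available on this agent at this OID
"""
# ... and a standalone switch answering normally.
healthy = f"""
  {BASE}.4.1.0 = 0
  {BASE}.5.1.0 = 40
"""

for name, text in (("stack1", broken), ("standalone1", healthy)):
    values, missing = parse_walk(text)
    state = "UNKNOWN, no sensor data via SNMP" if missing else "OK"
    print(f"{name}: {state} {values}")
```

Reporting the missing OIDs as an explicit state instead of silently returning no data makes the condition visible in the monitoring, even though the CLI still shows correct temperature values.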

Conclusion

Although the setup built with the Dell PowerConnect switches and the other hardware components was working and providing its basic, intended functionality, there were some pretty big and annoying limitations associated with it. A lot of these limitations would not have been that significant to the entire setup if certain design decisions had been made more carefully, for example if the layer 3 part of the LAN had been implemented in external network components or if a proper fully meshed, fabric-based SAN had been favored over what can only be described as a legacy technology. From the reliability, availability and serviceability (RAS) points of view, the setup is also far from ideal. By daisy-chaining the Dell PowerEdge M1000e blade chassis, stacking the LAN switches, stretching the LAN and SAN over both chassis and by connecting external devices through the external ports of the Dell PowerConnect switches, there are a lot of parts in the setup that depend on each other. This makes normal operations difficult at best and can have disastrous effects in case of a failure.

In retrospect, either using pure pass-through network modules in the Dell PowerEdge M1000e blade chassis in conjunction with capable 10GE top-of-rack switches or using the much more capable Dell Force10 MXL switches in the Dell PowerEdge M1000e blade chassis seems to be the better solution. The price premium for Dell Force10 MXL switches of about €2000 list price per device compared to the Dell PowerConnect switches seems negligible compared to the costs that arose through debugging, bugfixing and finding workarounds for the various limitations of the Dell PowerConnect switches. In either case a pair of capable, central layer 3 devices for gateway redundancy, routing and possibly fine-grained traffic control would be advisable.

For simpler setups, without some of the more special requirements of this particular setup, the Dell PowerConnect switches still offer a nice price-performance ratio, especially with regard to their 10GE port density.

// AIX and VIOS Performance with 10 Gigabit Ethernet (Update)
