bityard Blog

// Check_MK Monitoring - SAP Cloud Connector

The SAP Cloud Connector provides a service which is used to connect on-premise systems – non-SAP, SAP ECC and SAP HANA – with applications running on the SAP Cloud Platform. This article introduces a new Check_MK agent and several service checks to monitor the status of the SAP Cloud Connector and its connections to the SAP Cloud Platform.

For the impatient and TL;DR, here is the Check_MK package of the SAP Cloud Connector monitoring checks:

SAP Cloud Connector monitoring checks (Compatible with Check_MK versions 1.4.0p19 and later)

The sources can be found in my Check_MK repository on GitHub


Monitoring the SAP Cloud Connector can be done in two different, not mutually exclusive, ways. The first approach uses the traditional application monitoring features, already built into Check_MK, like:

  • the presence and count of the application processes

  • the reachability of the application's TCP ports

  • the validity of the SSL Certificates

  • queries to the application's health-check URL

The first approach is covered by the section Built-in Check_MK Monitoring below.

The second approach uses a new Check_MK agent and several service checks dedicated to monitoring the internal status of the SAP Cloud Connector and its connections to the SAP Cloud Platform. The new Check_MK agent uses the monitoring API provided by the SAP Cloud Connector in order to monitor the application-specific states and metrics. The monitoring endpoints on the SAP Cloud Connector currently used by this Check_MK agent are the following (a minimal query example is shown right after the list):

  • “List of Subaccounts” (URL: https://<scchost>:<sccport>/api/monitoring/subaccounts)

  • “List of Open Connections” (URL: https://<scchost>:<sccport>/api/monitoring/connections/backends)

  • “Performance Monitor Data” (URL: https://<scchost>:<sccport>/api/monitoring/performance/backends)

  • “Top Time Consumers” (URL: https://<scchost>:<sccport>/api/monitoring/performance/toptimeconsumers)
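
To give an impression of what a raw query against one of those endpoints looks like, here is a minimal sketch using the Python requests library, which the agent plugin described later in this article also relies on. Host, port and credentials are placeholders, and the use of HTTP basic authentication as well as the disabled certificate verification are assumptions for a typical test setup with a self-signed certificate:

import requests

# Placeholders for illustration; basic authentication and disabled
# certificate verification are assumptions, not requirements.
SCC_HOST = "scchost.example.com"
SCC_PORT = 8443
SCC_USER = "monitoring"   # hypothetical SAP Cloud Connector application user
SCC_PASS = "secret"

url = "https://%s:%s/api/monitoring/subaccounts" % (SCC_HOST, SCC_PORT)
response = requests.get(url, auth=(SCC_USER, SCC_PASS), timeout=30, verify=False)
response.raise_for_status()
print(response.json())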

At the time of writing, there unfortunately is no monitoring endpoint on the SAP Cloud Connector for the Most Recent Requests metric. This metric is currently only available via the SAP Cloud Connector's WebUI. The Most Recent Requests metric would be a much more interesting and useful metric than the currently available Top Time Consumers or Performance Monitor Data, both of which have limitations. The application requests covered by the Top Time Consumers metric need a manual acknowledgement inside the SAP Cloud Connector in order to reset the events with the longest request runtime, which limits the metric's usability for external monitoring tools. The Performance Monitor Data metric aggregates the application requests into buckets based on their overall runtime. By itself this can be useful for external monitoring tools and is in fact used by the Check_MK agent covered in this article. In the process of runtime bucket aggregation though, the Performance Monitor Data metric hides the much more useful breakdown of each request into runtime subsections (“External (Back-end)”, “Open Connection”, “Internal (SCC)”, “SSO Handling” and “Latency Effects”). Hopefully the Most Recent Requests metric will in the future also be exposed via the monitoring API provided by the SAP Cloud Connector. The new Check_MK agent can then be extended to use the newly exposed metric in order to gain a more fine-grained insight into the runtime of application requests through the SAP Cloud Connector.

The second approach is covered by the section SAP Cloud Connector Agent below.

Built-in Check_MK Monitoring

Application Processes

To monitor the SAP Cloud Connector process, use the standard Check_MK check “State and count of processes”. This can be found in the WATO WebUI under:

-> Manual Checks
   -> Applications, Processes & Services
      -> State and count of processes
         -> Create rule in folder ...
            -> Rule Options
               Description: Process monitoring of the SAP Cloud Connector
               Checktype: [ps - State and Count of Processes]
               Process Name: SAP Cloud Connector

            -> Parameters
               [x] Process Matching
               [Exact name of the process without arguments]
               [/opt/sapjvm_8/bin/java]

               [x] Name of the operating system user
               [Exact name of the operating system user]
               [sccadmin]

               [x] Levels for process count
               Critical below [1] processes
               Warning below [1] processes
               Warning above [1] processes
               Critical above [2] processes

            -> Conditions
               Folder [The folder containing the SAP Cloud Connector systems]
               and/or
               Explicit hosts [x]
               Specify explicit host names [SAP Cloud Connector systems]

Application Health Check

To implement a rudimentary monitoring of the SAP Cloud Connector application health, use the standard Check_MK check “Check HTTP service” to query the Health Check endpoint of the monitoring API provided by the SAP Cloud Connector. The “Check HTTP service” can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Active checks (HTTP, TCP, etc.)
      -> Check HTTP service
         -> Create rule in folder ...
            -> Rule Options
               Description: SAP Cloud Connector (SCC)

            -> Check HTTP service
               Name: SAP SCC
               [x] Check the URL
               [x] URI to fetch (default is /)
               [/exposed?action=ping]

               [x] TCP Port
               [8443]

               [x] Use SSL/HTTPS for the connection:
               [Use SSL with auto negotiation]

            -> Conditions
               Folder [The folder containing the SAP Cloud Connector systems]
               and/or
               Explicit hosts [x]
               Specify explicit host names [SAP Cloud Connector systems]

SSL Certificates

To monitor the validity of the SSL certificate of the SAP Cloud Connector WebUI, use the standard Check_MK check “Check HTTP service”. The “Check HTTP service” can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Active checks (HTTP, TCP, etc.)
      -> Check HTTP service
         -> Create rule in folder ...
            -> Rule Options
               Description: SAP Cloud Connector (SCC)

            -> Check HTTP service
               Name: SAP SCC Certificate
               [x] Check SSL Certificate Age
               Warning at or below [30] days
               Critical at or below [60] days

               [x] TCP Port
               [8443]

            -> Conditions
               Folder [The folder containing the SAP Cloud Connector systems]
               and/or
               Explicit hosts [x]
               Specify explicit host names [SAP Cloud Connector systems]

SAP Cloud Connector Agent

The new Check_MK package to monitor the status of the SAP Cloud Connector and its connections to the SAP Cloud Platform consists of three major parts – an agent plugin, two check plugins and several auxiliary files and plugins (WATO plugins, Perf-o-meter plugins, metrics plugins and man pages).

Prerequisites

The following prerequisites are necessary in order for the SAP Cloud Connector agent to work properly:

  • A SAP Cloud Connector application user must be created for the Check_MK agent to be able to authenticate against the SAP Cloud Connector and gain access to the protected monitoring API endpoints. See the article SAP Cloud Connector - Configuring Multiple Local Administrative Users on how to create a new application user.

  • A DNS alias or an additional IP address for the SAP Cloud Connector service.

  • An additional host in Check_MK for the SAP Cloud Connector service with the previously created DNS alias or IP address.

  • Installation of the Python requests library on the Check_MK server. This library is used by the Check_MK agent plugin agent_sapcc to perform the authentication and the HTTP requests against the monitoring API of the SAP Cloud Connector. On RHEL-based systems, for example, it can be installed with:

    root@host:# yum install python-requests
  • Installation of the new Check_MK package for the SAP Cloud Connector monitoring checks on the Check_MK server.

SAP Cloud Connector Agent Plugin

The Check_MK agent plugin agent_sapcc is responsible for querying the endpoints of the monitoring API on the SAP Cloud Connector, which are described above. It transforms the data returned from the monitoring endpoints into a format digestible by Check_MK. The following example shows the – anonymized and abbreviated – agent plugin output for a SAP Cloud Connector system:

<<<check_mk>>>
Version: 0.1

<<<sapcc_connections_backends:sep(59)>>>
subaccounts,abcdefghi,locationID;Test Location
subaccounts,abcdefghi,regionHost;hana.ondemand.com
subaccounts,abcdefghi,subaccount;abcdefghi

<<<sapcc_performance_backends:sep(59)>>>
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,1,minimumCallDurationMs;10
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,1,numberOfCalls;1
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,2,minimumCallDurationMs;20
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,2,numberOfCalls;36
[...]
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,20,minimumCallDurationMs;3000
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,21,minimumCallDurationMs;4000
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,buckets,22,minimumCallDurationMs;5000
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,name;PROTOCOL/sapecc.example.com:44300
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,protocol;PROTOCOL
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,virtualHost;sapecc.example.com
subaccounts,abcdefghi,backendPerformance,PROTOCOL/sapecc.example.com:PORT,virtualPort;44300
subaccounts,abcdefghi,locationID;Test Location
subaccounts,abcdefghi,regionHost;hana.ondemand.com
subaccounts,abcdefghi,sinceTime;2019-02-13T08:05:36.084 +0100
subaccounts,abcdefghi,subaccount;abcdefghi

<<<sapcc_performance_toptimeconsumers:sep(59)>>>
subaccounts,abcdefghi,locationID;Test Location
subaccounts,abcdefghi,regionHost;hana.ondemand.com
subaccounts,abcdefghi,requests,0,externalTime;373
subaccounts,abcdefghi,requests,0,id;932284302
subaccounts,abcdefghi,requests,0,internalBackend;sapecc.example.com:PORT
subaccounts,abcdefghi,requests,0,openRemoteTime;121
subaccounts,abcdefghi,requests,0,protocol;PROTOCOL
subaccounts,abcdefghi,requests,0,receivedBytes;264
subaccounts,abcdefghi,requests,0,resource;/sap-webservice-url/
subaccounts,abcdefghi,requests,0,sentBytes;4650
subaccounts,abcdefghi,requests,0,startTime;2019-02-13T11:31:59.113 +0100
subaccounts,abcdefghi,requests,0,totalTime;536
subaccounts,abcdefghi,requests,0,user;RFC_USER
subaccounts,abcdefghi,requests,0,virtualBackend;sapecc.example.com:PORT
subaccounts,abcdefghi,requests,1,externalTime;290
subaccounts,abcdefghi,requests,1,id;1882731830
subaccounts,abcdefghi,requests,1,internalBackend;sapecc.example.com:PORT
subaccounts,abcdefghi,requests,1,latencyTime;77
subaccounts,abcdefghi,requests,1,openRemoteTime;129
subaccounts,abcdefghi,requests,1,protocol;PROTOCOL
subaccounts,abcdefghi,requests,1,receivedBytes;264
subaccounts,abcdefghi,requests,1,resource;/sap-webservice-url/
subaccounts,abcdefghi,requests,1,sentBytes;4639
subaccounts,abcdefghi,requests,1,startTime;2019-02-13T11:31:59.114 +0100
subaccounts,abcdefghi,requests,1,totalTime;532
subaccounts,abcdefghi,requests,1,user;RFC_USER
subaccounts,abcdefghi,requests,1,virtualBackend;sapecc.example.com:PORT
[...]
subaccounts,abcdefghi,requests,49,externalTime;128
subaccounts,abcdefghi,requests,49,id;1774317106
subaccounts,abcdefghi,requests,49,internalBackend;sapecc.example.com:PORT
subaccounts,abcdefghi,requests,49,protocol;PROTOCOL
subaccounts,abcdefghi,requests,49,receivedBytes;263
subaccounts,abcdefghi,requests,49,resource;/sap-webservice-url/
subaccounts,abcdefghi,requests,49,sentBytes;4660
subaccounts,abcdefghi,requests,49,startTime;2019-02-16T11:32:09.352 +0100
subaccounts,abcdefghi,requests,49,totalTime;130
subaccounts,abcdefghi,requests,49,user;RFC_USER
subaccounts,abcdefghi,requests,49,virtualBackend;sapecc.example.com:PORT
subaccounts,abcdefghi,sinceTime;2019-02-13T08:05:36.085 +0100
subaccounts,abcdefghi,subaccount;abcdefghi

<<<sapcc_subaccounts:sep(59)>>>
subaccounts,abcdefghi,displayName;Test Application
subaccounts,abcdefghi,locationID;Test Location
subaccounts,abcdefghi,regionHost;hana.ondemand.com
subaccounts,abcdefghi,subaccount;abcdefghi
subaccounts,abcdefghi,tunnel,applicationConnections,abcdefg:hijklmnopqr,connectionCount;8
subaccounts,abcdefghi,tunnel,applicationConnections,abcdefg:hijklmnopqr,name;abcdefg:hijklmnopqr
subaccounts,abcdefghi,tunnel,applicationConnections,abcdefg:hijklmnopqr,type;JAVA
subaccounts,abcdefghi,tunnel,connectedSince;2019-02-14T10:11:00.630 +0100
subaccounts,abcdefghi,tunnel,connections;8
subaccounts,abcdefghi,tunnel,state;Connected
subaccounts,abcdefghi,tunnel,user;P123456
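
The transformation into the agent sections shown above is essentially a flattening of the nested JSON structures returned by the monitoring API into comma-joined key paths, with a semicolon (ASCII 59, hence the sep(59) section option) separating the key path from the value. The following snippet is only a sketch of this idea with invented example data, not the actual agent_sapcc implementation:

# Sketch of flattening nested JSON into "path,to,key;value" lines; the
# example data is invented and merely mimics the structure shown above.
def flatten(prefix, node):
    if isinstance(node, dict):
        for key in sorted(node):
            flatten(prefix + [key], node[key])
    elif isinstance(node, list):
        for idx, elem in enumerate(node):
            flatten(prefix + [str(idx)], elem)
    else:
        print("%s;%s" % (",".join(prefix), node))

data = {"subaccounts": {"abcdefghi": {"regionHost": "hana.ondemand.com",
                                      "locationID": "Test Location"}}}
print("<<<sapcc_subaccounts:sep(59)>>>")
flatten([], data)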

The agent plugin comes with a Check_MK check plugin of the same name, which is solely responsible for constructing the command line arguments from the WATO configuration and passing them to the Check_MK agent plugin.
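
In Check_MK 1.x, such glue code for a datasource program usually boils down to a small argument builder function registered in special_agent_info. The following sketch illustrates the general pattern; the parameter keys and command line options are assumptions for illustration, not the actual agent_sapcc interface:

# General pattern of a Check_MK 1.x datasource program argument builder.
# quote_shell_string is provided by the Check_MK check API; the parameter
# keys and command line options below are assumptions for illustration.
def agent_sapcc_arguments(params, hostname, ipaddress):
    args = ""
    args += " -u " + quote_shell_string(params["username"])
    args += " -p " + quote_shell_string(params["password"])
    args += " --port %d" % params.get("port", 8443)
    args += " --timeout %d" % params.get("timeout", 30)
    args += " " + quote_shell_string(ipaddress or hostname)
    return args

special_agent_info["sapcc"] = agent_sapcc_arguments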

With the additional WATO plugin sapcc_agent.py it is possible to configure the username and password for the SAP Cloud Connector application user which is used to connect to the monitoring API. It is also possible to configure the TCP port and the connection timeout for the connection to the monitoring API through the WATO WebUI and thus override the default values. The default value for the TCP port is 8443, the default value for the connection timeout is 30 seconds. The configuration options for the Check_MK agent plugin agent_sapcc can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Datasource Programs
      -> SAP Cloud Connector systems
         -> Create rule in folder ...
            -> Rule Options
               Description: SAP Cloud Connector (SCC)

            -> SAP Cloud Connector systems
               SAP Cloud Connector user name: [username]
               SAP Cloud Connector password: [password]
               SAP Cloud Connector TCP port: [8443]

            -> Conditions
               Folder [The folder containing the SAP Cloud Connector systems]
               and/or
               Explicit hosts [x]
               Specify explicit host names [SAP Cloud Connector systems]

After saving the new rule, restarting Check_MK and running an inventory on the additional host for the SAP Cloud Connector service in Check_MK, several new services starting with the name prefix SAP CC should appear.

The following image shows a status output example from the WATO WebUI with the service checks HTTP SAP SCC TLS and HTTP SAP SCC TLS Certificate from the Built-in Check_MK Monitoring described above. In addition to those, the example also shows the service checks based on the data from the SAP Cloud Connector Agent. The service checks SAP CC Application Connection, SAP CC Subaccount and SAP CC Tunnel are provided by the check plugin sapcc_subaccounts, the service check SAP CC Perf Backend is provided by the plugin sapcc_performance_backends:

Status output example for the complete monitoring of the SAP Cloud Connector

SAP Cloud Connector Subaccount

The check plugin sapcc_subaccounts implements the three sub-checks sapcc_subaccounts.app_conn, sapcc_subaccounts.info and sapcc_subaccounts.tunnel.

Info

The sub-check sapcc_subaccounts.info just gathers information on several configuration options for each subaccount on the SAP Cloud Connector and displays them in the status details of the check. These configuration options are the:

  • subaccount name on the SAP Cloud Platform to which the connection is made.

  • display name of the subaccount.

  • location ID of the subaccount.

  • region host of the SAP Cloud Platform to which the SAP Cloud Connector establishes a connection.

The sub-check sapcc_subaccounts.info always returns an OK status. No performance data is currently reported by this check.

Tunnel

The sub-check sapcc_subaccounts.tunnel is responsible for the monitoring of each tunnel connection for each subaccount on the SAP Cloud Connector. Upon inventory, this sub-check creates a service check for each tunnel connection found on the SAP Cloud Connector. During normal check execution, the status of the tunnel connection is determined for each inventorized item. If the tunnel connection is not in the Connected state, an alarm is raised accordingly. Additionally, the number of currently active connections over a tunnel as well as the elapsed time in seconds since the tunnel connection was established are determined for each inventorized item. If either the number of currently active connections or the number of seconds since the connection was established is above or below the configured warning and critical threshold values, an alarm is raised accordingly. For both values, performance data is reported by the check.
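
The following snippet sketches the tunnel logic described above in the classic Check_MK 1.x check function style. The layout of the parsed agent data and the parameter keys are assumptions for illustration, not the actual sapcc_subaccounts implementation:

# Simplified sketch of the tunnel sub-check logic; the data layout and
# parameter keys are assumptions for illustration.
import time

def check_sapcc_subaccounts_tunnel(item, params, parsed):
    tunnel = parsed.get(item)
    if tunnel is None:
        return 3, "Tunnel %s not found in agent output" % item
    state = 0
    if tunnel["state"] != "Connected":
        state = 2                        # tunnel is not connected
    conns = int(tunnel["connections"])
    warn_low, crit_low, warn_high, crit_high = params["num_conns"]
    if conns <= crit_low or conns >= crit_high:
        state = max(state, 2)
    elif conns <= warn_low or conns >= warn_high:
        state = max(state, 1)
    # seconds elapsed since the tunnel connection was established;
    # the analogous level comparison for the age is omitted for brevity
    age = int(time.time() - tunnel["connected_since_epoch"])
    perfdata = [("connections", conns, warn_high, crit_high),
                ("connection_time", age)]
    infotext = "State: %s, %d connections, up %d s" % (tunnel["state"], conns, age)
    return state, infotext, perfdata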

With the additional WATO plugin sapcc_subaccounts.py it is possible to configure the warning and critical levels for the sub-check sapcc_subaccounts.tunnel through the WATO WebUI and thus override the following default values:

Metric                  Warning Low Threshold   Critical Low Threshold   Warning High Threshold   Critical High Threshold
Number of connections   0                       0                        30                       40
Connection duration     0 sec                   0 sec                    284012568 sec            315569520 sec

The configuration options for the tunnel connection levels can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Applications, Processes & Services
         -> SAP Cloud Connector Subaccounts
            -> Create Rule in Folder ...
               -> Rule Options
                  Description: SAP Cloud Connector Subaccounts

               -> Parameters
                  [x] Number of tunnel connections
                      Warning if equal or below [0] connections
                      Critical if equal or below [0] connections
                      Warning if equal or above [30] connections
                      Critical if equal or above [40] connections
                  [x] Connection time of tunnel connections 
                      Warning if equal or below [0] seconds
                      Critical if equal or below [0] seconds
                      Warning if equal or above [284012568] seconds
                      Critical if equal or above [315569520] seconds

               -> Conditions
                  Folder [The folder containing the SAP Cloud Connector systems]
                  and/or
                  Explicit hosts [x]
                  Specify explicit host names [SAP Cloud Connector systems]
                  and/or
                  Application Or Tunnel Name [x]
                  Specify explicit values [Tunnel name]

The above image with a status output example from the WATO WebUI shows one sapcc_subaccounts.tunnel service check as the last of the displayed items. The service name is prefixed by the string SAP CC Tunnel and followed by the subaccount name, which in this example is anonymized. For each tunnel connection the connection state, the overall number of application connections currently active over the tunnel, the time when the tunnel connection was established and the number of seconds elapsed since establishing the connection are shown. The overall number of currently active application connections is also visualized in the perf-o-meter, with a logarithmic scale growing from the left to the right.

The following image shows an example of the two metric graphs for a single sapcc_subaccounts.tunnel service check:

Example metric graph for a single sapcc_subaccounts.tunnel service check

The upper graph shows the time elapsed since the tunnel connection was established. The lower graph shows the overall number of application connections currently active over the tunnel connection. Both graphs would also show the warning and critical threshold values, which in this example lie outside the displayed range of values on the y-axis.

Application Connection

The sub-check sapcc_subaccounts.app_conn is responsible for the monitoring of each application connection through each tunnel connection for each subaccount on the SAP Cloud Connector. Upon inventory, this sub-check creates a service check for each application connection found on the SAP Cloud Connector. During normal check execution, the number of currently active connections for each application is determined for each inventorized item. If the value of the currently active connections is above or below the configured warning and critical threshold values, an alarm is raised accordingly. For the number of currently active connections, performance data is reported by the check.

With the additional WATO plugin sapcc_subaccounts.py it is possible to configure the warning and critical levels for the sub-check sapcc_subaccounts.app_conn through the WATO WebUI and thus override the following default values:

Metric                  Warning Low Threshold   Critical Low Threshold   Warning High Threshold   Critical High Threshold
Number of connections   0                       0                        30                       40

The configuration options for the application connection levels can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Applications, Processes & Services
         -> SAP Cloud Connector Subaccounts
            -> Create Rule in Folder ...
               -> Rule Options
                  Description: SAP Cloud Connector Subaccounts

               -> Parameters
                  [x] Number of application connections
                      Warning if equal or below [0] connections
                      Critical if equal or below [0] connections
                      Warning if equal or above [30] connections
                      Critical if equal or above [40] connections

               -> Conditions
                  Folder [The folder containing the SAP Cloud Connector systems]
                  and/or
                  Explicit hosts [x]
                  Specify explicit host names [SAP Cloud Connector systems]
                  and/or
                  Application Or Tunnel Name [x]
                  Specify explicit values [Application name]

The above image with a status output example from the WATO WebUI shows one sapcc_subaccounts.app_conn service check as the fifth item from the top of the displayed items. The service name is prefixed by the string SAP CC Application Connection and followed by the application name, which in this example is anonymized. For each application connection the number of currently active connections and the connection type are shown. The number of currently active application connections is also visualized in the perf-o-meter, with a logarithmic scale growing from left to right.

The following image shows an example of the metric graph for a single sapcc_subaccounts.app_conn service check:

Example metric graph for a single sapcc_subaccounts.app_conn service check

The graph shows the number of currently active application connections. The graph would also show the warning and critical threshold values, which in this example lie outside the displayed range of values on the y-axis.

SAP Cloud Connector Performance Backends

The check sapcc_performance_backends is responsible for the monitoring of the performance of each (on-premise) backend system connected to the SAP Cloud Connector. Upon inventory, this check creates a service check for each backend connection found on the SAP Cloud Connector. During normal check execution, the number of requests to the backend system, categorized into one of the 22 runtime buckets, is determined for each inventorized item. From these raw values, the request rate in requests per second is derived for each of the 22 runtime buckets. Also from the raw values, the following four additional metrics are derived:

  • calls_total: the total request rate over all of the 22 runtime buckets.

  • calls_pct_ok: the relative number of requests in percent with a runtime below a given runtime warning threshold.

  • calls_pct_warn: the relative number of requests in percent with a runtime above a given runtime warning threshold.

  • calls_pct_crit: the relative number of requests in percent with a runtime above a given runtime critical threshold.

If the relative number of requests is above the configured warning and critical threshold values, an alarm is raised accordingly. Performance data is reported by the check for each of the 22 runtime buckets as well as for the total request rate and the relative number of requests (calls_pct_ok, calls_pct_warn, calls_pct_crit).
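
To make the derivation of the relative request numbers concrete, the following sketch shows one way to compute them from the runtime buckets. The data layout is invented, and the assumption that the warning figure only covers requests between the warning and critical thresholds is mine, based on the stacked graph described further below:

# Illustrative computation of calls_pct_ok/_warn/_crit from runtime buckets
# given as (minimumCallDurationMs, numberOfCalls) pairs; layout is invented.
def bucket_percentages(buckets, warn_ms=500, crit_ms=1000):
    total = sum(calls for _, calls in buckets)
    if total == 0:
        return 100.0, 0.0, 0.0
    warn = sum(calls for dur, calls in buckets if warn_ms <= dur < crit_ms)
    crit = sum(calls for dur, calls in buckets if dur >= crit_ms)
    ok = total - warn - crit
    return 100.0 * ok / total, 100.0 * warn / total, 100.0 * crit / total

# Example: 90 fast calls, 8 calls between 500 and 1000 ms, 2 calls above 1 s
print(bucket_percentages([(10, 90), (500, 8), (1000, 2)]))
# -> (90.0, 8.0, 2.0) for calls_pct_ok, calls_pct_warn, calls_pct_crit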

With the additional WATO plugin sapcc_performance_backends.py it is possible to configure the warning and critical levels for the check sapcc_performance_backends through the WATO WebUI and thus override the following default values:

Metric                                                    Warning Threshold   Critical Threshold
Request runtime                                           500 msec            1000 msec
Percentage of requests over request runtime thresholds    10%                 5%

The configuration options for the backend performance levels can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Applications, Processes & Services
         -> SAP Cloud Connector Backend Performance
            -> Create Rule in Folder ...
               -> Rule Options
                  Description: SAP Cloud Connector Backend Performance

               -> Parameters
                  [x] Runtime bucket definition and calls per bucket in percent
                      Warning if percentage of calls in warning bucket equal or above [10.00] %
                      Assign calls to warning bucket if runtime equal or above [500] milliseconds
                      Critical if percentage of calls in critical bucket equal or above [5.00] %
                      Assign calls to critical bucket if runtime equal or above [1000] milliseconds

               -> Conditions
                  Folder [The folder containing the SAP Cloud Connector systems]
                  and/or
                  Explicit hosts [x]
                  Specify explicit host names [SAP Cloud Connector systems]
                  and/or
                  Backend Name [x]
                  Specify explicit values [Backend name]

The above image with a status output example from the WATO WebUI shows one sapcc_performance_backends service check as the sixth item from the top of the displayed items. The service name is prefixed by the string SAP CC Perf Backend and followed by a string concatenated from the protocol, FQDN and TCP port of the backend system, which in this example is anonymized. For each backend connection the total number of requests, the total request rate, the percentage of requests below the runtime warning threshold, the percentage of requests above the runtime warning threshold and the percentage of requests above the runtime critical threshold are shown. The relative number of requests in percent is also visualized in the perf-o-meter.

The following image shows an example of the metric graph for the total request rate from the sapcc_performance_backends service check:

Example metric graph for the total request rate from the sapcc_performance_backends service check

The following image shows an example of the metric graph for the relative number of requests from the sapcc_performance_backends service check:

Example metric graph for the relative number of requests from the sapcc_performance_backends service check

The graph shows the percentage of requests below the runtime warning threshold in the color green at the bottom, the percentage of requests above the runtime warning threshold in the color yellow stacked above and the percentage of requests above the runtime critical threshold in the color red stacked at the top.

The following image shows an example of the combined metric graphs for the request rates to a single backend system in each of the 22 runtime buckets from the sapcc_performance_backends service check:

Example combined metric graphs for the request rates to a single backend system in each of the 22 runtime buckets from the sapcc_performance_backends service check

To provide a better overview, the individual metrics are grouped together into three graphs. The first graph shows the request rate in the runtime buckets >=10ms, >=20ms, >=30ms, >=40ms, >=50ms, >=75ms and >=100ms. The second graph shows the request rate in the runtime buckets >=125ms, >=150ms, >=200ms, >=300ms, >=400ms, >=500ms, >=750ms and >=1000ms. The third and last graph shows the request rate in the runtime buckets >=1250ms, >=1500ms, >=2000ms, >=2500ms, >=3000ms, >=4000ms and >=5000ms.

The following image shows an example of the individual metric graphs for the request rates to a single backend system in each of the 22 runtime buckets from the sapcc_performance_backends service check:

Example individual metric graphs for the request rates to a single backend system in each of the 22 runtime buckets from the sapcc_performance_backends service check

Each of the metric graphs shows exactly the same data as the previously shown combined graphs. The combined metric graphs are in fact based on the individual metric graphs for the request rates to a single backend system.

Conclusion

The newly introduced checks for the SAP Cloud Connector enable you to monitor several application-specific aspects of the SAP Cloud Connector with your Check_MK server. The combination of the built-in Check_MK monitoring facilities and the new agent plugin for the SAP Cloud Connector complement each other in this regard. While the new SAP Cloud Connector agent plugin for Check_MK utilizes most of the data provided by the monitoring endpoints on the SAP Cloud Connector, a more in-depth monitoring could be achieved if the data from the Most Recent Requests metric were also exposed over the monitoring API of the SAP Cloud Connector. I hope this will be the case in a future release of the SAP Cloud Connector.

I hope you find the provided new check useful and enjoyed reading this blog post. Please don't hesitate to drop me a note if you have any suggestions or run into any issues with the provided checks.

// Check_MK Monitoring - Brocade / Broadcom Fibre Channel Switches

This article provides patches for the standard Check_MK distribution in order to fix and enhance the support for the monitoring of Brocade / Broadcom Fibre Channel Switches.


Out of the box, there is currently already monitoring support available for Brocade / Broadcom Fibre Channel Switches in the standard Check_MK distribution. Unfortunately there are several issues in the check brocade_fcport, which is used to monitor the status and several metrics of the fibre channel switch ports. Those issues prevent the check from working as intended and are fixed in the version provided in this article. Up until recently, there also was no support in the standard Check_MK distribution for monitoring CPU and memory metrics on a switch level and no support for monitoring SFP metrics on a port level. This article introduces the new checks brocade_cpu, brocade_mem and brocade_sfp to cover those metrics. With the most recent upstream version 1.5.x of the Check_MK distribution, there are now the standard checks brocade_sys (covering CPU and memory metrics) and brocade_sfp (covering the SFP port metrics) available, providing basically the same functionality.

For the impatient and TL;DR, here is the enhanced version of the brocade_fcport check:

Enhanced version of the brocade_fcport check

And the new brocade_cpu, brocade_mem and brocade_sfp checks:

The new brocade_cpu check
The new brocade_mem check
The new brocade_sfp check

The sources to the new and enhanced versions of all the checks can be found in my Check_MK Plugins repository on GitHub.

Additional Checks

CPU Usage

The Check_MK service check brocade_cpu monitors the current CPU utilization of Brocade / Broadcom fibre channel switches. It uses the SNMP OID swCpuUsage from the fibre channel switch MIB (SW-MIB) in order to create one service check for each CPU found in the system. The current CPU utilization in percent is compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the standard WATO plugin for CPU utilization it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 80%; critical: 90% of CPU utilization).
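
For readers unfamiliar with how such a check looks under the hood, the following is a rough sketch of a CPU utilization check in the classic Check_MK 1.x plugin style. It is an illustration only and not the actual brocade_cpu source; the snmp_info and snmp_scan_function definitions are omitted and the single-row data layout is an assumption:

# Rough sketch of a CPU utilization check in Check_MK 1.x style; the SNMP
# definitions are omitted and a single data row from swCpuUsage is assumed.
brocade_cpu_default_levels = (80.0, 90.0)

def inventory_brocade_cpu(info):
    return [(None, "brocade_cpu_default_levels") for _row in info]

def check_brocade_cpu(_no_item, params, info):
    util = float(info[0][0])
    warn, crit = params
    state = 2 if util >= crit else 1 if util >= warn else 0
    return state, "CPU utilization: %.1f%%" % util, [("util", util, warn, crit)]

check_info["brocade_cpu"] = {
    "check_function":      check_brocade_cpu,
    "inventory_function":  inventory_brocade_cpu,
    "service_description": "CPU utilization",
    "has_perfdata":        True,
}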

The following image shows a status output example for the brocade_cpu service check from the WATO WebUI:

Status output example for the new brocade_cpu service check

This example shows the current CPU utilization of the system in percent.

The following image shows an example of the service metrics graph for the brocade_cpu service check:

Example service metrics graph for the new brocade_cpu service check

The selected example graph shows the current CPU utilization of the system in percent as well as the default warning and critical threshold values.

Memory Usage

The Check_MK service check brocade_mem monitors the current memory (RAM) usage on Brocade / Broadcom fibre channel switches. It uses the SNMP OID swMemUsage from the fibre channel switch MIB (SW-MIB) in order to create a service check for the overall memory usage on the system. The amount of currently used memory is compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 80%; critical: 90% of used memory). The configuration options for the used memory levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Operating System Resources
         -> Brocade Fibre Channel Memory Usage
            -> Create rule in folder ...
               [x] Levels for memory usage

The following image shows a status output example for the brocade_mem service check from the WATO WebUI:

Status output example for the new brocade_mem service check

This example shows the current memory utilization of the system in percent.

The following image shows an example of the service metrics graph for the brocade_mem service check:

Example service metrics graph for the new brocade_mem service check

The selected example graph shows the current memory utilization of the system in percent as well as the default warning and critical threshold values.

SFP Health

The Check_MK service check brocade_sfp monitors several metrics of the SFPs in all enabled and active ports of Brocade / Broadcom fibre channel switches. It uses several SNMP OIDs from the fibre channel switch MIB (SW-MIB) and from the Fabric Alliance Extension MIB (FA-EXT-MIB) in order to create one service check for each SFP found in an enabled and active port on the system. The OIDs from the fibre channel switch MIB (swFCPortSpecifier, swFCPortName, swFCPortPhyState, swFCPortOpStatus, swFCPortAdmStatus) are used to determine the number, name and status of the switch port. The OIDs from the Fabric Alliance Extension MIB are used to determine the actual SFP metrics. Those metrics are the SFP temperature (swSfpTemperature), voltage (swSfpVoltage) and current (swSfpCurrent), as well as the optical receive and transmit power (swSfpRxPower and swSfpTxPower) of the SFP. Unfortunately the SFP metrics are not available on all Brocade / Broadcom switch hardware platforms and FabricOS versions. This check was verified to work with Gen6 hardware and FabricOS v8. Another limitation is that the SFP metrics are not available in real time, but are only gathered by the switch from all the SFPs in a 5 minute interval. This is most likely a precaution so as not to overload the processing capacity of the SFPs with too many status requests.

The current temperature level as well as the optical receive and transmit power levels are compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values:

Metric                   Warning Threshold   Critical Threshold
System temperature       55°C                65°C
Optical receive power    -7.0 dBm            -9.0 dBm
Optical transmit power   -2.0 dBm            -3.0 dBm

The configuration options for the used temperature and optical receive and transmit power levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Networking
         -> Brocade Fibre Channel SFP
            -> Create rule in folder ...
               [x] Temperature levels in degrees Celsius
               [x] Receive power levels in dBm
               [x] Transmit power levels in dBm

Currently, the electrical current and voltage metrics of the SFPs are only used for long-term trends via the respective service metric template and are thus not used to raise any alarms. This is due to the fact that there is little to no information available on which precise electrical current and voltage levels would constitute an indicator for an immediate or impending failure state.

The following image shows several status output examples for the brocade_sfp service check from the WATO WebUI:

Status output examples for the new brocade_sfp service check

This example shows SFP Port service check items for eight consecutive ports on a switch. For each item the current temperature, voltage and current, as well as the optical receive and transmit power of the SFP are shown. The optical receive and transmit power levels are also visualized in the Perf-O-Meter, optical receive power growing from the middle to the left, optical transmit power growing from the middle to the right.

The following image shows an example of the service metrics graph for the brocade_sfp service check:

Example service metrics graph for the new brocade_sfp service check

The first graph shows the optical receive and transmit power of the SFP. The second shows the electrical current drawn by the SFP. The third graph shows the temperature of the SFP. The fourth and last graph shows the electrical voltage provided to the SFP.

Modified Check

Fibre Channel Port

The Check_MK service check brocade_fcport has several issues which are addressed by the following set of patches. The patch is actually a monolithic one; it is only broken up into individual patches here for ease of discussion.

The first patch is a simple, but ugly, workaround to prevent OID_END, which carries the port index information, from being treated as a BINARY SNMP value:

brocade_fcport.patch
--- a/checks/brocade_fcport   2018-11-25 08:43:58.715721271 +0100
+++ b/checks/brocade_fcport   2018-11-25 18:06:04.674930057 +0100
@@ -135,8 +154,8 @@
         bbcredits = None
         if len(if64_info) > 0:
             fcmgmt_portstats = []
-            for oidend, tx_elements, rx_elements, bbcredits_64 in if64_info:
-                if index == oidend.split(".")[-1]:
+            for oidend, dummy, tx_elements, rx_elements, bbcredits_64 in if64_info:
+                if int(index) == int(oidend.split(".")[-1]):
                     fcmgmt_portstats = [
                         binstring_to_int(''.join(map(chr, tx_elements))) / 4,
                         binstring_to_int(''.join(map(chr, rx_elements))) / 4,
@@ -426,6 +477,7 @@
         # Not every device supports that
         (".1.3.6.1.3.94.4.5.1", [
             OID_END,
+            "1",            # Dummy value, otherwise OID_END is also treated as a BINARY value
             BINARY("6"),    # FCMGMT-MIB::connUnitPortStatCountTxElements
             BINARY("7"),    # FCMGMT-MIB::connUnitPortStatCountRxElements
             BINARY("8"),    # FCMGMT-MIB::connUnitPortStatCountBBCreditZero

This is achieved by simply inserting a dummy value into the list of SNMP OIDs in the snmp_info variable. I haven't had time to dig out the root cause of this behaviour, but I guess it must be somewhere in the core SNMP components of Check_MK. The patch also implicitly addresses and fixes an issue where two variables with differently typed contents are compared. This is achieved by simply changing the line:

               if index == oidend.split(".")[-1]:

to:

               if int(index) == int(oidend.split(".")[-1]):

and thus forcing a conversion to integer values.

The next patch adds support for the additional encoding schemes in the FC-1 layer, which are used for fibre channel speeds beyond 8 GBit. Up to and including a speed of 8 GBit, fibre channel uses an 8/10b encoding. This means that for every 8 bits of data, 10 bits are actually sent over the fibre channel link. The encoding scheme changes for speeds higher than 8 GBit: 16 GBit fibre channel – like 10 GBit Ethernet – uses a 64/66b encoding scheme, while 32 GBit fibre channel and higher use a 256/257b encoding scheme. Thus the wirespeed calculation of the brocade_fcport service check is incorrect for speeds above 8 GBit. The following patch addresses and fixes this issue:

brocade_fcport.patch
--- a/checks/brocade_fcport   2018-11-25 08:43:58.715721271 +0100
+++ b/checks/brocade_fcport   2018-11-25 18:06:04.674930057 +0100
@@ -267,8 +283,15 @@
 
     output.append(speedmsg)
 
-    # convert gbit netto link-rate to Byte/s (8/10 enc)
-    wirespeed = gbit * 1000000000.0 * 0.8 / 8
+    if gbit > 16:
+        # convert gbit netto link-rate to Byte/s (256/257 enc)
+        wirespeed = gbit * 1000000000.0 * ( 256 / 257.0 ) / 8
+    elif gbit > 8:
+        # convert gbit netto link-rate to Byte/s (64/66 enc)
+        wirespeed = gbit * 1000000000.0 * ( 64 / 66.0 ) / 8
+    else:
+        # convert gbit netto link-rate to Byte/s (8/10 enc)
+        wirespeed = gbit * 1000000000.0 * ( 8 / 10.0 ) / 8
     in_bytes = 4 * get_rate("brocade_fcport.rxwords.%s" % index, this_time, rxwords)
     out_bytes = 4 * get_rate("brocade_fcport.txwords.%s" % index, this_time, txwords)
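
A quick standalone sanity check of the corrected calculation prints the resulting netto wirespeed in bytes per second for an 8, 16 and 32 GBit link:

# Sanity check of the wirespeed formula for the three FC-1 encoding schemes.
for gbit in (8.0, 16.0, 32.0):
    if gbit > 16:
        wirespeed = gbit * 1000000000.0 * ( 256 / 257.0 ) / 8   # 256/257b
    elif gbit > 8:
        wirespeed = gbit * 1000000000.0 * ( 64 / 66.0 ) / 8     # 64/66b
    else:
        wirespeed = gbit * 1000000000.0 * ( 8 / 10.0 ) / 8      # 8/10b
    print("%2.0f GBit: %.0f Byte/s" % (gbit, wirespeed))
# -> 8 GBit: 800000000, 16 GBit: 1939393939, 32 GBit: 3984435798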

The third patch simply adds the notxcredits counter and its value to the output of the service check. This information is currently missing from the status output of the check and is just added as a convenience:

brocade_fcport.patch
--- a/checks/brocade_fcport   2018-11-25 08:43:58.715721271 +0100
+++ b/checks/brocade_fcport   2018-11-25 18:06:04.674930057 +0100
@@ -361,6 +408,9 @@
             summarystate = max(1, summarystate)
             text += "(!)"
             output.append(text)
+        else:
+            if counter == "notxcredits":
+                output.append(text)
 
     # P O R T S T A T E
     for dev_state, state_key, state_info, warn_states, state_map in [

The last patch addresses and fixes a more serious issue. The details are explained in the comment of the following code snippet. The gist is that the metric notxcredits (“No TX buffer credits”), gathered from the SNMP OID swFCPortNoTxCredits, is calculated incorrectly. This is due to the fact that the brocade_fcport service check treats this metric like all the other error metrics of a switch port (e.g. “CRC errors”, “ENC-Out”, “ENC-In” and “C3 discards”) and puts it in relation to the number of frames transmitted over a link. In reality though, the metric behind the SNMP OID swFCPortNoTxCredits is by definition relative to time. This issue leads to false positives in certain edge cases, for example when a little-utilized switch port sees an otherwise uncritical number of swFCPortNoTxCredits due to the normal activity of the fibre channel flow control. In such a case, a relatively high number of swFCPortNoTxCredits is put into relation to the sum of a relatively low number of frames transmitted over the link and again the relatively high number of swFCPortNoTxCredits. See the last lines of the code snippet below for this calculation. The result is a high value for the metric notxcredits (“No TX buffer credits”), although the fibre channel flow control was working perfectly fine, the configured switch.edgeHoldTime was most likely never reached and frames were never dropped.

In order to address this issue, the following patch adds a few lines of code to the brocade_fcport service check for a special treatment of the metric notxcredits. This is done in the last section of the patch. The second section of the patch is just for the purpose of clarification, as it sets the metric that notxcredits is related to, to None. The first section of the patch adjusts the default warning and critical thresholds to lower levels. The previous levels were rather high due to the miscalculation explained above. With the new calculation added by the third section of the patch, the default warning and critical threshold levels need to be much more sensitive.

brocade_fcport.patch
--- a/checks/brocade_fcport   2018-11-25 08:43:58.715721271 +0100
+++ b/checks/brocade_fcport   2018-11-25 18:06:04.674930057 +0100
@@ -100,11 +100,11 @@
 factory_settings["brocade_fcport_default_levels"] = {
     "rxcrcs":           (3.0, 20.0),   # allowed percentage of CRC errors
     "rxencoutframes":   (3.0, 20.0),   # allowed percentage of Enc-OUT Frames
     "rxencinframes":    (3.0, 20.0),   # allowed percentage of Enc-In Frames
-    "notxcredits":      (3.0, 20.0),   # allowed percentage of No Tx Credits
+    "notxcredits":      (1.0, 3.0),    # allowed percentage of No Tx Credits
     "c3discards":       (3.0, 20.0),   # allowed percentage of C3 discards
     "assumed_speed":    2.0,           # used if speed not available in SNMP data
 }
 
 
@@ -327,7 +349,7 @@
            ("ENC-Out",              "rxencoutframes",      rxencoutframes,  rxframes_rate),
            ("ENC-In",               "rxencinframes",       rxencinframes,   rxframes_rate),
            ("C3 discards",          "c3discards",          c3discards,      txframes_rate),
-           ("No TX buffer credits", "notxcredits",         notxcredits,     txframes_rate),
+           ("No TX buffer credits", "notxcredits",         notxcredits,     None),
     ]:
         per_sec = get_rate("brocade_fcport.%s.%s" % (counter, index), this_time, value)
         perfdata.append((counter, per_sec))
@@ -338,6 +360,31 @@
                     (counter, item), this_time, per_sec, average)
             perfdata.append( ("%s_avg" % counter, per_sec_avg ) )
 
-        # compute error rate (errors in relation to number of frames) (from 0.0 to 1.0)
-        if ref > 0 or per_sec > 0:
-            rate = per_sec / (ref + per_sec)
+        # Calculate error rates
+        if counter == "notxcredits":
+            # Calculate the error rate for "notxcredits" (buffer credit zero). Since this value
+            # is relative to time instead of the number of transmitted frames it needs special
+            # treatment.
+            # Semantics of the buffer credit zero value on Brocade / Broadcom devices:
+            # The switch ASIC checks the buffer credit value of a switch port every 2.5us and
+            # increments the buffer credit zero counter if the buffer credit value is zero.
+            # This means if in a one second interval the buffer credit zero counter increases
+            # by 400000 the link on this switch port is not allowed to send any frames.
+            # By default the edge hold time on a Brocade / Broadcom device is about 200ms:
+            #   switch.edgeHoldTime:220
+            # If a C3 frame remains in the switches queue for more than 220ms without being
+            # given any credits to be transmitted, it is subsequently dropped. Thus the buffer
+            # credit zero counter would optimally be correlated to the C3 discards counter.
+            # Unfortunately the Brocade / Broadcom devices have no egress buffering and do
+            # ingress buffering instead. Thus the C3 discards counters are increased on the
+            # ingress port, while the buffer credit zero counters are increased on the egress
+            # port. The trade-off is to correlate the buffer credit zero counters relative to
+            # the measured time interval.
+            if per_sec > 0:
+                rate = per_sec / 400000.00
+            else:
+                rate = 0
+        else:
+            # compute error rate (errors in relation to number of frames) (from 0.0 to 1.0)
+            if ref > 0 or per_sec > 0:
+                rate = per_sec / (ref + per_sec)
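
As a worked example of the new calculation: the switch ASIC samples the buffer credit value every 2.5 microseconds, so a counter increase of 400000 per second means the port could not send at all during the measured interval. A counter increase of e.g. 40000 per second therefore translates to roughly 10% of the interval spent at zero TX credits:

# Worked example for the corrected notxcredits rate calculation.
per_sec = 40000.0
rate = per_sec / 400000.00 if per_sec > 0 else 0
print("%.1f%% of the interval at zero TX credits" % (rate * 100.0))   # 10.0%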

Conclusion

Adding the three new checks for Brocade / Broadcom fibre channel switches to your Check_MK server enables you to monitor additional CPU and memory aspects as well as the SFPs of your Brocade / Broadcom fibre channel devices. More recent versions of Check_MK should already include the same functionality through the now included standard checks brocade_sys and brocade_sfp.

New Brocade / Broadcom fibre channel devices should pick up the additional service checks immediately. Existing Brocade / Broadcom fibre channel devices might need a Check_MK inventory to be run explicitly on them in order to pick up the additional service checks.

The enhanced version of the brocade_fcport service check addresses and fixes several issues present in the standard Check_MK service check. It should provide you with a more complete and future-proof monitoring of your fibre channel network infrastructure.

I hope you find the provided new and enhanced checks useful and enjoyed reading this blog post. Please don't hesitate to drop me a note if you have any suggestions or run into any issues with the provided checks.

// Check_MK Monitoring - HPE Virtual Connect Fibre Channel Modules

This article provides patches for the standard Check_MK distribution in order to add support for the monitoring of HPE Virtual Connect Fibre Channel Modules.


Out of the box, there is currently no monitoring support for HPE Virtual Connect Fibre Channel Modules in the standard Check_MK distribution. Those modules, like e.g. the HPE Virtual Connect 8Gb 20-port Fibre Channel Module, are used in HPE c-Class BladeSystems to provide Fibre Channel connectivity for the individual server blades. Fortunately, the modules provide status and performance data via the standard SNMP FIBRE-CHANNEL-FE-MIB defined in RFC 2837 as well as its successor, the SNMP FCMGMT-MIB defined in RFC 4044. Those two SNMP MIBs are already covered by the checks qlogic_fcport, qlogic_sanbox and qlogic_sanbox_fabric_element, which are part of the standard Check_MK distribution. This simplifies the task of adding support for the HPE Virtual Connect Fibre Channel modules and reduces it to a matter of extending the already existing checks with three rather simple patches.

For the impatient and TL;DR, here are the enhanced versions of the qlogic_fcport, qlogic_sanbox and qlogic_sanbox_fabric_element checks:

Enhanced version of the qlogic_fcport check
Enhanced version of the qlogic_sanbox check
Enhanced version of the qlogic_sanbox_fabric_element check

The sources to the enhanced versions of all three checks can be found in my Check_MK Plugins repository on GitHub.

The necessary changes to qlogic_fcport and qlogic_sanbox_fabric_element are limited to the snmp_scan_function used by the Check_MK inventory. Here, the vendor specific OIDs for the HPE Virtual Connect Fibre Channel modules are added. The following patches show the respective lines for qlogic_fcport:

qlogic_fcport.patch
--- a/checks/qlogic_fcport   2017-03-06 21:00:07.397607946 +0100
+++ b/checks/qlogic_fcport   2017-10-01 14:34:48.153710776 +0200
@@ -218,12 +218,14 @@
     # .1.3.6.1.4.1.3873.1.12 QLogic 8 Gb and 4/8 Gb Intelligent Pass-thru Module
     # .1.3.6.1.4.1.3873.1.9  QLogic SANBox 5802 FC Switch
     # .1.3.6.1.4.1.3873.1.11 HP StorageWorks 8/20q Fibre Channel Switch
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function'    : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
         or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
         or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.11") \
         or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.12") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.9"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.9") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
     'group':                   'qlogic_fcport',
     'default_levels_variable': 'qlogic_fcport_default_levels',
 }

and for qlogic_sanbox_fabric_element:

qlogic_sanbox_fabric_element.patch
--- a/checks/qlogic_sanbox_fabric_element    2017-03-06 21:00:07.397607946 +0100
+++ b/checks/qlogic_sanbox_fabric_element    2017-10-01 14:47:35.000003198 +0200
@@ -54,7 +54,9 @@
                                                            OID_END]),
     # .1.3.6.1.4.1.3873.1.14 Qlogic-Switch
     # .1.3.6.1.4.1.3873.1.8  Qlogic-4Gb SAN Switch Module for IBM BladeCenter
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function'    : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
 }

In both cases, the relevant lines being:

        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),

After those two simple changes, the checks will now be able to successfully inventorize the overall fabric status as well as the status of individual ports of HPE Virtual Connect Fibre Channel modules.

The necessary changes to qlogic_sanbox also require the extension of the snmp_scan_function used by the Check_MK inventory as shown by the patches above. In addition to that, the string operations on the sensor_id need to be adjusted in order to get a more user-friendly name for the temperature and power supply sensors which are also present in the HPE Virtual Connect Fibre Channel modules. Since the sensor IDs are encoded in the SNMP OIDs and the SNMP tree for those OIDs can vary from module to module, the simple string replacement in the original qlogic_sanbox check was exchanged for a more general, regular expression based substitution. The following patch shows the respective lines for the combined changes to qlogic_sanbox:

qlogic_sanbox.patch
--- a/checks/qlogic_sanbox   2017-03-06 21:00:07.397607946 +0100
+++ b/checks/qlogic_sanbox   2017-10-01 14:47:51.348002546 +0200
@@ -44,7 +44,7 @@
     inventory = []
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_type == "8" and sensor_characteristic == "3" and \
             sensor_name != "Temperature Status":
             inventory.append( (sensor_id, None) )
@@ -53,7 +53,7 @@
 def check_qlogic_sanbox_temp(item, _no_params, info):
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_id == item:
             sensor_status = int(sensor_status)
             if sensor_status < 0 or sensor_status >= len(qlogic_sanbox_status_map):
@@ -93,9 +93,11 @@
                                                        OID_END]),
     # .1.3.6.1.4.1.3873.1.14 Qlogic-Switch
     # .1.3.6.1.4.1.3873.1.8  Qlogic-4Gb SAN Switch Module for IBM BladeCenter
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function'    : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
 }
 
 #.
@@ -113,7 +115,7 @@
     inventory = []
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_type == "5":
             inventory.append( (sensor_id, None) )
     return inventory
@@ -121,7 +123,7 @@
 def check_qlogic_sanbox_psu(item, _no_params, info):
     for sensor_name, sensor_status, sensor_message, sensor_type, \
         sensor_characteristic, sensor_id in info:
-        sensor_id = sensor_id.replace("16.0.0.192.221.48.", "").replace(".0.0.0.0.0.0.0.0", "")
+        sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
         if sensor_id == item:
             sensor_status = int(sensor_status)
             if sensor_status < 0 or sensor_status >= len(qlogic_sanbox_status_map):
@@ -153,7 +155,9 @@
                                                        OID_END]),
     # .1.3.6.1.4.1.3873.1.14 Qlogic-Switch
     # .1.3.6.1.4.1.3873.1.8  Qlogic-4Gb SAN Switch Module for IBM BladeCenter
+    # .1.3.6.1.4.1.3873.1.16 HPE Virtual Connect FlexFabric
     'snmp_scan_function'    : lambda oid: \
            oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.14") \
-        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8"),
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.8") \
+        or oid(".1.3.6.1.2.1.1.2.0").startswith(".1.3.6.1.4.1.3873.1.16"),
 }
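
To illustrate the effect of the new regular expression based substitution, here is a minimal, standalone Python sketch. The sensor ID value in it is a made-up example for illustration only and not taken from a real module:

import re

# Strip the module specific OID prefix as well as the run of zero
# components, leaving only the trailing, user-friendly sensor number.
sensor_id = "16.0.116.70.160.113.1.2.3.0.0.0.0.0.0.0.0.4"
sensor_id = re.sub('^(16\.0\.0\.192\.221\.48|16\.0\.116\.70\.160\.113)\..*0\.0\.0\.0\.0\.0\.0\.0\.', '', sensor_id)
print(sensor_id)  # prints: 4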

After those additional, but still simple, changes, the check will now be able to successfully inventorize the temperature and power supply sensors of HPE Virtual Connect Fibre Channel modules.

// Check_MK - Race Condition in Processing of Piggyback Data

In current versions of Check_MK – namely 1.2.8p24 and 1.4.0p8 – which are included in the Open Monitoring Distribution, there are several race conditions in the code parts responsible for the processing of piggyback data sent by the agents. These race conditions are usually triggered when a host is monitored more than once, e.g. through the use of cluster services or when, by chance, a manual check on the command line coincides with a scheduled check from the monitoring core. While not critical to the whole monitoring process, the effects of these races are intermittent and annoying UNKNOWN - [Errno 2] No such file or directory errors for the monitored host. This article provides patches for both Check_MK versions in order to deal with the spurious error messages.

The normal order of processing of piggybacked data received from agents seems to be as follows:

  1. The Check_MK server calls the agent on the monitored host.

  2. The agent on the monitored host responds with the output of its own host as well as the piggybacked output from other entities.

  3. The Check_MK server processes the agents response and extracts the piggybacked data.

  4. If there is already data stored from the piggybacked host, it's considered stale and is thus removed.

  5. The piggybacked data is stored in a file named <AGENT_HOSTNAME> in the directory named <PIGGYBACKED_HOSTNAME> under the path ${OMD_ROOT}/var/check_mk/piggyback/ (e.g.: ${OMD_ROOT}/var/check_mk/piggyback/<PIGGYBACKED_HOSTNAME>/<AGENT_HOSTNAME>).

  6. The Check_MK server lists the contents of the directory ${OMD_ROOT}/var/check_mk/piggyback/<PIGGYBACKED_HOSTNAME>/. For each file in the list – one per sending agent host, e.g. <AGENT_HOSTNAME> – it processes the data contained in the file.

  7. The Check_MK server removes the files in and subsequently the directory ${OMD_ROOT}/var/check_mk/piggyback/<PIGGYBACKED_HOSTNAME>/ itself.

In case a host is monitored more than once, e.g. through the use of cluster services, the steps 4 through 7 above can overlap for two or more concurrent monitoring processes. Such an overlap can cause one process at step 4 or step 7 to delete the file ${OMD_ROOT}/var/check_mk/piggyback/<PIGGYBACKED_HOSTNAME>/<AGENT_HOSTNAME> or its parent directory while another process is still working on them at step 5 or step 6.

The effects of this issue are mainly intermittent and annoying UNKNOWN - [Errno 2] No such file or directory errors in the Check_MK WebUI, as well as errors like the following examples in the log files:

2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> An exception occured while processing host "hostA"
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> Traceback (most recent call last):
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>   File "/omd/sites/SITE/share/check_mk/modules/keepalive.py", line 125, in do_keepalive
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>     status = command_function(command_tuple)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>   File "/omd/sites/SITE/share/check_mk/modules/keepalive.py", line 437, in execute_keepalive_command
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>     return mode_function(hostname, ipaddress)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 1290, in do_check
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>     do_all_checks_on_host(hostname, ipaddress, only_check_types)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 1506, in do_all_checks_on_host
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>     info = get_info_for_check(hostname, ipaddress, infotype)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 319, in get_info_for_check
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>     info = apply_parse_function(get_host_info(hostname, ipaddress, section_name, max_cachefile_age, ignore_check_interval), section_name)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 371, in get_host_info
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>     ignore_check_interval=True)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 541, in get_realhost_info
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>     store_persisted_info(hostname, persisted)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 567, in store_persisted_info
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >>     os.rename("%s.#new" % file_path, file_path)
2017-07-18 06:32:07 [4] Check_MK helper [16533]: : >> OSError: [Errno 2] No such file or directory

2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> An exception occured while processing host "hostA"
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> Traceback (most recent call last):
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>   File "/omd/sites/SITE/share/check_mk/modules/keepalive.py", line 125, in do_keepalive
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>     status = command_function(command_tuple)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>   File "/omd/sites/SITE/share/check_mk/modules/keepalive.py", line 437, in execute_keepalive_command
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>     return mode_function(hostname, ipaddress)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 1241, in do_check
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>     do_all_checks_on_host(hostname, ipaddress, only_check_types)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 1457, in do_all_checks_on_host
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>     info = get_info_for_check(hostname, ipaddress, infotype)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 319, in get_info_for_check
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>     info = apply_parse_function(get_host_info(hostname, ipaddress, section_name, max_cachefile_age, ignore_check_interval), section_name)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 371, in get_host_info
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>     ignore_check_interval=True)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 540, in get_realhost_info
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>     store_piggyback_info(hostname, piggybacked)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>   File "/omd/sites/SITE/share/check_mk/modules/check_mk_base.py", line 651, in store_piggyback_info
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >>     os.rename(dir + "/.new." + sourcehost, dir + "/" + sourcehost)
2017-03-27 00:02:21 [4] Check_MK helper [27005]: : >> OSError: [Errno 2] No such file or directory

In version 1.4.0p8 of Check_MK there is already a partial fix for the described issue in the form of Werk 4755 (Git commit 77b3bfc3). For both Check_MK versions, two trivial patches are provided in order to deal with the spurious error messages; a sketch of the general approach to this kind of race condition is shown below.

In the current development version of Check_MK – probably to be named 1.5 – a rather large amount of code has been rewritten and moved around. At first glance, I couldn't determine whether the issue still exists there.
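
The general approach to this class of race condition is to tolerate a file or directory that has concurrently vanished instead of failing the whole check run. The following minimal Python sketch illustrates the idea with a hypothetical wrapper function; it is not the actual patch code:

import errno
import os

def rename_ignore_missing(src, dst):
    # Tolerate the source having been removed by a concurrent
    # monitoring process between the steps 4 and 7 described above.
    try:
        os.rename(src, dst)
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise

Applied to calls like the os.rename() lines visible in the tracebacks above, this turns the hard [Errno 2] error into a silently skipped operation.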

// Check_MK Monitoring - Open-iSCSI

The Open-iSCSI project provides a high-performance, transport-independent implementation of RFC 3720 iSCSI for Linux. It allows remote access to SCSI targets via TCP/IP over several different transport technologies. This article introduces a new Check_MK service check to monitor the status of Open-iSCSI sessions as well as several statistical metrics on Open-iSCSI sessions and iSCSI hardware initiator hosts.

For the impatient and TL;DR here is the Check_MK package of the Open-iSCSI monitoring checks:

Open-iSCSI monitoring checks (Compatible with Check_MK versions 1.2.8 and later)

The sources are to be found in my Check_MK repository on GitHub


The Check_MK service check to monitor Open-iSCSI consists of two major parts: an agent plugin and three check plugins.

The first part, a Check_MK agent plugin named open-iscsi, is a simple Bash shell script. It calls the Open-iSCSI administration tool iscsiadm in order to retrieve a list of currently active iSCSI sessions. The exact call to iscsiadm to retrieve the session list is:

/usr/bin/iscsiadm -m session -P 1

If there are any active iSCSI sessions, the open-iscsi agent plugin also tries to collect several statistics for each iSCSI session. This is done by another call to iscsiadm for each iSCSI session ${SID}, as shown in the following example:

/usr/bin/iscsiadm -m session -r ${SID} -s

Unfortunately, the iSCSI session statistics are currently only supported for Open-iSCSI software initiators or dependent hardware iSCSI initiators like the Broadcom BCM577xx or BCM578xx adapters which are covered by the bnx2i kernel module. See Debugging Segfaults in Open-iSCSIs iscsiuio on Intel Broadwell and Backporting Open-iSCSI to Debian 8 "Jessie" for additional information on those dependent hardware iSCSI initiators.

For hardware iSCSI initiators, like the QLogic 4000 and QLogic 8200 Series network adapters and iSCSI HBAs, which provide a full iSCSI offload engine (iSOE) implementation in the adapter's firmware, there is currently no support for iSCSI session statistics. Instead, the open-iscsi agent plugin collects several global statistics on each iSOE host ${HST} covered by the qla4xxx kernel module, with the command shown in the following example:

/usr/bin/iscsiadm -m host -H ${HST} -C stats

The output of the above commands is parsed and reformatted by the agent plugin for easier processing in the check plugins. The following example shows the agent plugin output for a system with two BCM578xx dependent hardware iSCSI initiators:

<<<open-iscsi_sessions>>>
bnx2i 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS> bnx2i.f8:ca:b8:7d:bf:2d eth2 10.0.3.52 LOGGED_IN LOGGED_IN NO_CHANGE
bnx2i 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS> bnx2i.f8:ca:b8:7d:c2:34 eth3 10.0.3.53 LOGGED_IN LOGGED_IN NO_CHANGE

<<<open-iscsi_session_stats>>>
[session stats f8:ca:b8:7d:bf:2d iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS>]
txdata_octets: 40960
rxdata_octets: 461171313
noptx_pdus: 0
scsicmd_pdus: 153967
tmfcmd_pdus: 0
login_pdus: 0
text_pdus: 0
dataout_pdus: 0
logout_pdus: 0
snack_pdus: 0
noprx_pdus: 0
scsirsp_pdus: 153967
tmfrsp_pdus: 0
textrsp_pdus: 0
datain_pdus: 112420
logoutrsp_pdus: 0
r2t_pdus: 0
async_pdus: 0
rjt_pdus: 0
digest_err: 0
timeout_err: 0

[session stats f8:ca:b8:7d:c2:34 iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS>]
txdata_octets: 16384
rxdata_octets: 255666052
noptx_pdus: 0
scsicmd_pdus: 84312
tmfcmd_pdus: 0
login_pdus: 0
text_pdus: 0
dataout_pdus: 0
logout_pdus: 0
snack_pdus: 0
noprx_pdus: 0
scsirsp_pdus: 84312
tmfrsp_pdus: 0
textrsp_pdus: 0
datain_pdus: 62418
logoutrsp_pdus: 0
r2t_pdus: 0
async_pdus: 0
rjt_pdus: 0
digest_err: 0
timeout_err: 0

The next example shows the agent plugin output for a system with two QLogic 8200 Series hardware iSCSI initiators:

<<<open-iscsi_sessions>>>
qla4xxx 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-57e572d50-80e0000001458a32-v-sto2-tst-000001 qla4xxx.f8:ca:b8:7d:c1:7d.ipv4.0 none 10.0.3.50 LOGGED_IN Unknown Unknown
qla4xxx 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-57e572d50-80e0000001458a32-v-sto2-tst-000001 qla4xxx.f8:ca:b8:7d:c1:7e.ipv4.0 none 10.0.3.51 LOGGED_IN Unknown Unknown

<<<open-iscsi_host_stats>>>
[host stats f8:ca:b8:7d:c1:7d iqn.2000-04.com.qlogic:isp8214.000e1e3574ac.4]
mactx_frames: 563454
mactx_bytes: 52389948
mactx_multicast_frames: 877513
mactx_broadcast_frames: 0
mactx_pause_frames: 0
mactx_control_frames: 0
mactx_deferral: 0
mactx_excess_deferral: 0
mactx_late_collision: 0
mactx_abort: 0
mactx_single_collision: 0
mactx_multiple_collision: 0
mactx_collision: 0
mactx_frames_dropped: 0
mactx_jumbo_frames: 0
macrx_frames: 1573455
macrx_bytes: 440845678
macrx_unknown_control_frames: 0
macrx_pause_frames: 0
macrx_control_frames: 0
macrx_dribble: 0
macrx_frame_length_error: 0
macrx_jabber: 0
macrx_carrier_sense_error: 0
macrx_frame_discarded: 0
macrx_frames_dropped: 1755017
mac_crc_error: 0
mac_encoding_error: 0
macrx_length_error_large: 0
macrx_length_error_small: 0
macrx_multicast_frames: 0
macrx_broadcast_frames: 0
iptx_packets: 508160
iptx_bytes: 29474232
iptx_fragments: 0
iprx_packets: 401785
iprx_bytes: 354673156
iprx_fragments: 0
ip_datagram_reassembly: 0
ip_invalid_address_error: 0
ip_error_packets: 0
ip_fragrx_overlap: 0
ip_fragrx_outoforder: 0
ip_datagram_reassembly_timeout: 0
ipv6tx_packets: 0
ipv6tx_bytes: 0
ipv6tx_fragments: 0
ipv6rx_packets: 0
ipv6rx_bytes: 0
ipv6rx_fragments: 0
ipv6_datagram_reassembly: 0
ipv6_invalid_address_error: 0
ipv6_error_packets: 0
ipv6_fragrx_overlap: 0
ipv6_fragrx_outoforder: 0
ipv6_datagram_reassembly_timeout: 0
tcptx_segments: 508160
tcptx_bytes: 19310736
tcprx_segments: 401785
tcprx_byte: 346637456
tcp_duplicate_ack_retx: 1
tcp_retx_timer_expired: 1
tcprx_duplicate_ack: 0
tcprx_pure_ackr: 0
tcptx_delayed_ack: 106449
tcptx_pure_ack: 106489
tcprx_segment_error: 0
tcprx_segment_outoforder: 0
tcprx_window_probe: 0
tcprx_window_update: 695915
tcptx_window_probe_persist: 0
ecc_error_correction: 0
iscsi_pdu_tx: 401697
iscsi_data_bytes_tx: 29225
iscsi_pdu_rx: 401697
iscsi_data_bytes_rx: 327355963
iscsi_io_completed: 101
iscsi_unexpected_io_rx: 0
iscsi_format_error: 0
iscsi_hdr_digest_error: 0
iscsi_data_digest_error: 0
iscsi_sequence_error: 0

[host stats f8:ca:b8:7d:c1:7e iqn.2000-04.com.qlogic:isp8214.000e1e3574ad.5]
mactx_frames: 563608
mactx_bytes: 52411412
mactx_multicast_frames: 877517
mactx_broadcast_frames: 0
mactx_pause_frames: 0
mactx_control_frames: 0
mactx_deferral: 0
mactx_excess_deferral: 0
mactx_late_collision: 0
mactx_abort: 0
mactx_single_collision: 0
mactx_multiple_collision: 0
mactx_collision: 0
mactx_frames_dropped: 0
mactx_jumbo_frames: 0
macrx_frames: 1573572
macrx_bytes: 441630442
macrx_unknown_control_frames: 0
macrx_pause_frames: 0
macrx_control_frames: 0
macrx_dribble: 0
macrx_frame_length_error: 0
macrx_jabber: 0
macrx_carrier_sense_error: 0
macrx_frame_discarded: 0
macrx_frames_dropped: 1755017
mac_crc_error: 0
mac_encoding_error: 0
macrx_length_error_large: 0
macrx_length_error_small: 0
macrx_multicast_frames: 0
macrx_broadcast_frames: 0
iptx_packets: 508310
iptx_bytes: 29490504
iptx_fragments: 0
iprx_packets: 401925
iprx_bytes: 355436636
iprx_fragments: 0
ip_datagram_reassembly: 0
ip_invalid_address_error: 0
ip_error_packets: 0
ip_fragrx_overlap: 0
ip_fragrx_outoforder: 0
ip_datagram_reassembly_timeout: 0
ipv6tx_packets: 0
ipv6tx_bytes: 0
ipv6tx_fragments: 0
ipv6rx_packets: 0
ipv6rx_bytes: 0
ipv6rx_fragments: 0
ipv6_datagram_reassembly: 0
ipv6_invalid_address_error: 0
ipv6_error_packets: 0
ipv6_fragrx_overlap: 0
ipv6_fragrx_outoforder: 0
ipv6_datagram_reassembly_timeout: 0
tcptx_segments: 508310
tcptx_bytes: 19323952
tcprx_segments: 401925
tcprx_byte: 347398136
tcp_duplicate_ack_retx: 2
tcp_retx_timer_expired: 4
tcprx_duplicate_ack: 0
tcprx_pure_ackr: 0
tcptx_delayed_ack: 106466
tcptx_pure_ack: 106543
tcprx_segment_error: 0
tcprx_segment_outoforder: 0
tcprx_window_probe: 0
tcprx_window_update: 696035
tcptx_window_probe_persist: 0
ecc_error_correction: 0
iscsi_pdu_tx: 401787
iscsi_data_bytes_tx: 37970
iscsi_pdu_rx: 401791
iscsi_data_bytes_rx: 328112050
iscsi_io_completed: 127
iscsi_unexpected_io_rx: 0
iscsi_format_error: 0
iscsi_hdr_digest_error: 0
iscsi_data_digest_error: 0
iscsi_sequence_error: 0

Although it is a simple Bash shell script, the agent plugin open-iscsi has several dependencies which need to be installed for it to work properly. Namely, these are the commands iscsiadm, sed, tr and egrep. On Debian based systems, the necessary packages can be installed with the following command:

root@host:~# apt-get install coreutils grep open-iscsi sed

The second part of the Check_MK service check for Open-iSCSI provides the necessary check logic through individual inventory and check functions. This is implemented in the three Check_MK check plugins open-iscsi_sessions, open-iscsi_host_stats and open-iscsi_session_stats, which will be discussed separately in the following sections.

Open-iSCSI Session Status

The check plugin open-iscsi_sessions is responsible for the monitoring of individual iSCSI sessions and their internal session states. Upon inventory, this check plugin creates a service check for each pair of iSCSI network interface name and IQN of the iSCSI target volume. Unlike the iSCSI session ID, which changes over time (e.g. after an iSCSI logout and login), this pair uniquely identifies an iSCSI session on a host. During normal check execution, the list of currently active iSCSI sessions on a host is compared to the list of active iSCSI sessions gathered during inventory on that host. If a session is missing or if the session has an erroneous internal state, an alarm is raised accordingly.
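
A simplified sketch of this inventory logic, assuming the field order of the <<<open-iscsi_sessions>>> agent section shown above (transport, portal, target IQN, interface name, network device, IP address and the three states) and not the actual plugin code, might look like this:

def inventory_open_iscsi_sessions(info):
    # One service per (interface name, target IQN) pair, since this
    # pair, unlike the session ID, is stable across logout and login.
    inventory = []
    for line in info:
        target_iqn, iface_name = line[2], line[3]
        inventory.append(("%s %s" % (iface_name, target_iqn), None))
    return inventory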

For all types of initiators – software, dependent hardware and hardware – there is the state session_state which can take on the following values:

ISCSI_STATE_FREE
ISCSI_STATE_LOGGED_IN
ISCSI_STATE_FAILED
ISCSI_STATE_TERMINATE
ISCSI_STATE_IN_RECOVERY
ISCSI_STATE_RECOVERY_FAILED
ISCSI_STATE_LOGGING_OUT

An alarm is raised if the session is in any state other than ISCSI_STATE_LOGGED_IN. For software and dependent hardware initiators there are two additional states – connection_state and internal_state. The state connection_state can take on the values:

FREE
TRANSPORT WAIT
IN LOGIN
LOGGED IN
IN LOGOUT
LOGOUT REQUESTED
CLEANUP WAIT

and internal_state can take on the values:

NO CHANGE
CLEANUP       
REOPEN
REDIRECT

In addition to the above session_state, an alarm is raised if the connection_state is in any state other than LOGGED IN or if the internal_state is in any state other than NO CHANGE.
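
A condensed sketch of this state evaluation, based on the agent section fields shown above and again not the actual plugin code, could look like this:

def check_open_iscsi_sessions(item, _no_params, info):
    for line in info:
        if "%s %s" % (line[3], line[2]) != item:
            continue
        session_state, connection_state, internal_state = line[6:9]
        if session_state != "LOGGED_IN":
            return (2, "session state is %s" % session_state)
        # hardware iSOE initiators report "Unknown" for the two
        # additional states, which does not constitute an error
        if connection_state not in ("LOGGED_IN", "Unknown") or \
           internal_state not in ("NO_CHANGE", "Unknown"):
            return (2, "connection state is %s, internal state is %s" %
                       (connection_state, internal_state))
        return (0, "session is logged in")
    return (2, "session %s not found" % item)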

No performance data is currently reported by this check.

Open-iSCSI Host Statistics

The check plugin open-iscsi_host_stats is responsible for the monitoring of the global statistics on an iSOE host. Upon inventory, this check plugin creates a service check for each pair of MAC address and iSCSI network interface name. During normal check execution, an extensive list of statistics – see the above example output of the Check_MK agent plugin – is determined for each inventorized item. If the rate of one of the statistics values exceeds the configured warning or critical threshold, an alarm is raised accordingly. For all statistics, performance data is reported by the check.
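
Conceptually, the per-second rate of each counter is derived from the difference to the counter value stored during the previous check run; in the actual check plugins this bookkeeping is left to Check_MK's built-in counter handling. The following simplified, framework independent Python sketch with hypothetical names shows the principle of the level evaluation:

import time

counter_state = {}  # previous (timestamp, value) per counter

def check_counter_rate(name, value, warn, crit):
    now = time.time()
    last = counter_state.get(name)
    counter_state[name] = (now, value)
    if last is None:
        # first run: no previous value, so no rate can be computed yet
        return (0, "%s: no rate available yet" % name)
    rate = (value - last[1]) / (now - last[0])
    if rate > crit:
        return (2, "%s: %.1f/s (critical at %.1f/s)" % (name, rate, crit))
    elif rate > warn:
        return (1, "%s: %.1f/s (warning at %.1f/s)" % (name, rate, warn))
    return (0, "%s: %.1f/s" % (name, rate))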

With the additional WATO plugin open-iscsi_host_stats.py it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values. The default values for all statistics are a rate of zero (0) units per second for both warning and critical thresholds. The configuration options for the iSOE host statistics levels can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Storage, Filesystems and Files
         -> Open-iSCSI Host Statistics
            -> Create Rule in Folder ...
               -> The levels for the Open-iSCSI host statistics values
                  [x] The levels for the number of transmitted MAC/Layer2 frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 bytes on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 bytes on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 multicast frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 multicast frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 broadcast frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 broadcast frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 pause frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 pause frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 control frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 control frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 dropped frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 dropped frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 deferral frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 excess deferral frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 abort frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 jumbo frames on an iSOE host.
                  [x] The levels for the number of MAC/Layer2 late transmit collisions on an iSOE host.
                  [x] The levels for the number of MAC/Layer2 single transmit collisions on an iSOE host.
                  [x] The levels for the number of MAC/Layer2 multiple transmit collisions on an iSOE host.
                  [x] The levels for the number of MAC/Layer2 collisions on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 unknown control frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 dribble on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 frame length errors on an iSOE host.
                  [x] The levels for the number of discarded received MAC/Layer2 frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 jabber on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 carrier sense errors on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 CRC errors on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 encoding errors on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 length too large errors on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 length too small errors on an iSOE host.
                  [x] The levels for the number of transmitted IP packets on an iSOE host.
                  [x] The levels for the number of received IP packets on an iSOE host.
                  [x] The levels for the number of transmitted IP bytes on an iSOE host.
                  [x] The levels for the number of received IP bytes on an iSOE host.
                  [x] The levels for the number of transmitted IP fragments on an iSOE host.
                  [x] The levels for the number of received IP fragments on an iSOE host.
                  [x] The levels for the number of IP datagram reassemblies on an iSOE host.
                  [x] The levels for the number of IP invalid address errors on an iSOE host.
                  [x] The levels for the number of IP packet errors on an iSOE host.
                  [x] The levels for the number of IP fragmentation overlaps on an iSOE host.
                  [x] The levels for the number of IP fragmentation out-of-order on an iSOE host.
                  [x] The levels for the number of IP datagram reassembly timeouts on an iSOE host.
                  [x] The levels for the number of transmitted IPv6 packets on an iSOE host.
                  [x] The levels for the number of received IPv6 packets on an iSOE host.
                  [x] The levels for the number of transmitted IPv6 bytes on an iSOE host.
                  [x] The levels for the number of received IPv6 bytes on an iSOE host.
                  [x] The levels for the number of transmitted IPv6 fragments on an iSOE host.
                  [x] The levels for the number of received IPv6 fragments on an iSOE host.
                  [x] The levels for the number of IPv6 datagram reassemblies on an iSOE host.
                  [x] The levels for the number of IPv6 invalid address errors on an iSOE host.
                  [x] The levels for the number of IPv6 packet errors on an iSOE host.
                  [x] The levels for the number of IPv6 fragmentation overlaps on an iSOE host.
                  [x] The levels for the number of IPv6 fragmentation out-of-order on an iSOE host.
                  [x] The levels for the number of IPv6 datagram reassembly timeouts on an iSOE host.
                  [x] The levels for the number of transmitted TCP segments on an iSOE host.
                  [x] The levels for the number of received TCP segments on an iSOE host.
                  [x] The levels for the number of transmitted TCP bytes on an iSOE host.
                  [x] The levels for the number of received TCP bytes on an iSOE host.
                  [x] The levels for the number of duplicate TCP ACK retransmits on an iSOE host.
                  [x] The levels for the number of received TCP retransmit timer expiries on an iSOE host.
                  [x] The levels for the number of received TCP duplicate ACKs on an iSOE host.
                  [x] The levels for the number of received TCP pure ACKs on an iSOE host.
                  [x] The levels for the number of transmitted TCP delayed ACKs on an iSOE host.
                  [x] The levels for the number of transmitted TCP pure ACKs on an iSOE host.
                  [x] The levels for the number of received TCP segment errors on an iSOE host.
                  [x] The levels for the number of received TCP segment out-of-order on an iSOE host.
                  [x] The levels for the number of received TCP window probe on an iSOE host.
                  [x] The levels for the number of received TCP window update on an iSOE host.
                  [x] The levels for the number of transmitted TCP window probe persist on an iSOE host.
                  [x] The levels for the number of transmitted iSCSI PDUs on an iSOE host.
                  [x] The levels for the number of received iSCSI PDUs on an iSOE host.
                  [x] The levels for the number of transmitted iSCSI Bytes on an iSOE host.
                  [x] The levels for the number of received iSCSI Bytes on an iSOE host.
                  [x] The levels for the number of iSCSI I/Os completed on an iSOE host.
                  [x] The levels for the number of iSCSI unexpected I/Os on an iSOE host.
                  [x] The levels for the number of iSCSI format errors on an iSOE host.
                  [x] The levels for the number of iSCSI header digest (CRC) errors on an iSOE host.
                  [x] The levels for the number of iSCSI data digest (CRC) errors on an iSOE host.
                  [x] The levels for the number of iSCSI sequence errors on an iSOE host.
                  [x] The levels for the number of ECC error corrections on an iSOE host.

The following image shows a status output example from the WATO WebUI with several open-iscsi_sessions (iSCSI Session Status) and open-iscsi_host_stats (iSCSI Host Stats) service checks over two QLogic 8200 Series hardware iSCSI initiators:

Status output example for open-iscsi_sessions and open-iscsi_host_stats service checks over QLogic 8200 Series hardware iSCSI initiators

This example shows six iSCSI Session Status service check items, which are pairs of iSCSI network interface names and – here anonymized – IQNs of the iSCSI target volumes. For each item the current session_state – in this example LOGGED_IN – is shown. There are also two iSCSI Host Stats service check items in the example, which are pairs of MAC addresses and iSCSI network interface names. For each of those items the current throughput rate on the MAC, IP/IPv6, TCP and iSCSI protocol layer is shown. The throughput rate on the MAC protocol layer is also visualized in the Perf-O-Meter, with received traffic growing from the middle to the left and transmitted traffic growing from the middle to the right.

The following three images show examples of the PNP4Nagios graphs for the open-iscsi_host_stats (iSCSI Host Stats) service check.

Example PNP4Nagios graphs for an open-iscsi_host_stats service check (MAC Frames, Traffic, MAC Errors)

The middle graph shows a combined view of the throughput rate for received and transmitted traffic on the different MAC, IP/IPv6, TCP and iSCSI protocol layers. The upper graph shows the throughput rate for various frame types on the MAC protocol layer. The lower graph shows the rate for various error frame types on the MAC protocol layer.

Example PNP4Nagios graphs for an open-iscsi_host_stats service check (IP Packets and Fragments, IP Errors, TCP Segments)

The upper graph shows the throughput rate for received and transmitted traffic on the IP/IPv6 protocol layer. The middle graph shows the rate for various error packet types on the IP/IPv6 protocol layer. The lower graph shows the throughput rate for received and transmitted traffic on the TCP protocol layer.

Example PNP4Nagios graphs for an open-iscsi_host_stats service check (TCP Errors, ECC Error Correction, iSCSI PDUs, iSCSI Errors)

The first graph shows the rate for various protocol control and error segment types on the TCP protocol layer. The second graph shows the rate of ECC error corrections that occurred on the QLogic 8200 Series hardware iSCSI initiator. The third graph shows the throughput rate for received and transmitted traffic on the iSCSI protocol layer. The fourth and last graph shows the rate for various control and error PDUs on the iSCSI protocol layer.

Open-iSCSI Session Statistics

The check plugin open-iscsi_session_stats is responsible for the monitoring of the statistics on individual iSCSI sessions. Upon inventory, this check plugin creates a service check for each pair of MAC address of the network interface and IQN of the iSCSI target volume. During normal check execution, an extensive list of statistics – see the above example output of the Check_MK agent plugin – is collected for each inventorized item. If the rate of one of the statistics values exceeds the configured warning or critical threshold, an alarm is raised accordingly. For all statistics, performance data is reported by the check.
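
In the <<<open-iscsi_session_stats>>> agent section shown above, the counters of each session are grouped under a [session stats <MAC> <IQN>] header line. A simplified parsing sketch, once more not the actual plugin code, could turn this into a per-item dictionary as follows:

def parse_open_iscsi_session_stats(info):
    parsed, item = {}, None
    for line in info:
        if line[0].startswith("["):
            # header line, split by Check_MK into the fields
            # "[session", "stats", "<MAC>", "<IQN>]"
            item = "%s %s" % (line[2], line[3].rstrip("]"))
            parsed[item] = {}
        elif item is not None and len(line) == 2:
            # counter line like: txdata_octets: 40960
            parsed[item][line[0].rstrip(":")] = int(line[1])
    return parsed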

With the additional WATO plugin open-iscsi_session_stats.py it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values. The default values for all statistics are a rate of zero (0) units per second for both warning and critical thresholds. The configuration options for the iSCSI session statistics levels can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Storage, Filesystems and Files
         -> Open-iSCSI Session Statistics
            -> Create Rule in Folder ...
               -> The levels for the Open-iSCSI session statistics values
                  [x] The levels for the number of transmitted bytes in an Open-iSCSI session
                  [x] The levels for the number of received bytes in an Open-iSCSI session
                  [x] The levels for the number of digest (CRC) errors in an Open-iSCSI session
                  [x] The levels for the number of timeout errors in an Open-iSCSI session
                  [x] The levels for the number of transmitted NOP commands in an Open-iSCSI session
                  [x] The levels for the number of received NOP commands in an Open-iSCSI session
                  [x] The levels for the number of transmitted SCSI command requests in an Open-iSCSI session
                  [x] The levels for the number of received SCSI command responses in an Open-iSCSI session
                  [x] The levels for the number of transmitted task management function commands in an Open-iSCSI session
                  [x] The levels for the number of received task management function responses in an Open-iSCSI session
                  [x] The levels for the number of transmitted login requests in an Open-iSCSI session
                  [x] The levels for the number of transmitted logout requests in an Open-iSCSI session
                  [x] The levels for the number of received logout responses in an Open-iSCSI session
                  [x] The levels for the number of transmitted text PDUs in an Open-iSCSI session
                  [x] The levels for the number of received text PDUs in an Open-iSCSI session
                  [x] The levels for the number of transmitted data PDUs in an Open-iSCSI session
                  [x] The levels for the number of received data PDUs in an Open-iSCSI session
                  [x] The levels for the number of transmitted single negative ACKs in an Open-iSCSI session
                  [x] The levels for the number of received ready to transfer PDUs in an Open-iSCSI session
                  [x] The levels for the number of received reject PDUs in an Open-iSCSI session
                  [x] The levels for the number of received asynchronous messages in an Open-iSCSI session

The following image shows a status output example from the WATO WebUI with several open-iscsi_sessions (iSCSI Session Status) and open-iscsi_session_stats (iSCSI Session Stats) service checks over two BCM578xx dependent hardware iSCSI initiators:

Status output example for open-iscsi_sessions and open-iscsi_session_stats service checks over BCM578xx dependent hardware iSCSI initiators

This example shows six iSCSI Session Status service check items, which are pairs of iSCSI network interface names and – here anonymized – IQNs of the iSCSI target volumes. For each item the current session_state, connection_state and internal_state – in this example with the respective values LOGGED_IN, LOGGED_IN and NO_CHANGE – are shown. There is also an equivalent number of iSCSI Session Stats service check items in the example, which are pairs of MAC addresses of the network interfaces and IQNs of the iSCSI target volumes. For each of those items the current throughput rate of the individual iSCSI session is shown. As long as the rate of the digest (CRC) and timeout error counters is zero, the string no protocol errors is displayed. Otherwise, the name and throughput rate of any non-zero error counter is shown. The throughput rate of the iSCSI session is also visualized in the Perf-O-Meter, with received traffic growing from the middle to the left and transmitted traffic growing from the middle to the right.

The following image shows an example of the three PNP4Nagios graphs for a single open-iscsi_session_stats (iSCSI Session Stats) service check.

Example PNP4Nagios graphs for a single open-iscsi_session_stats service check

The upper graph shows the throughput rate for received and transmitted traffic of the iSCSI session. The middle graph shows the rate for received and transmitted iSCSI PDUs, broken down by the different types of PDUs on the iSCSI protocol layer. The lower graph shows the rate for the digest (CRC) and timeout errors on the iSCSI protocol layer.

The described Check_MK service check to monitor the status of Open-iSCSI sessions, Open-iSCSI session metrics and iSCSI hardware initiator host metrics has been verified to work with version 2.0.874-2~bpo8+1 of the open-iscsi package from the backports repository of Debian stable (Jessie) on the client side and the Check_MK versions 1.2.6 and 1.2.8 on the server side.

I hope you find the provided new check useful and enjoyed reading this blog post. Please don't hesitate to drop me a note if you have any suggestions or run into any issues with the provided checks.
