bityard Blog

// Check_MK Monitoring - Open-iSCSI

The Open-iSCSI project provides a high-performance, transport independent, implementation of RFC 3720 iSCSI for Linux. It allows remote access to SCSI targets via TCP/IP over several different transport technologies. This article introduces a new Check_MK service check to monitor the status of Open-iSCSI sessions as well as the monitoring of several statistical metrics on Open-iSCSI sessions and iSCSI hardware initiator hosts.

For the impatient and TL;DR here is the Check_MK package of the Open-iSCSI monitoring checks:

Open-iSCSI monitoring checks (Compatible with Check_MK versions 1.2.8 and later)

The sources are to be found in my Check_MK repository on GitHub


The Check_MK service check to monitor Open-iSCSI consists of two major parts, an agent plugin and three check plugins.

The first part, a Check_MK agent plugin named open-iscsi, is a simple Bash shell script. It calls the Open-iSCSI administration tool iscsiadm in order to retrieve a list of currently active iSCSI sessions. The exact call to iscsiadm to retrieve the session list is:

/usr/bin/iscsiadm -m session -P 1

If there are any active iSCSI sessions, the open-iscsi agent plugin also tries to collect several statistics for each iSCSI session. This is done by another call to iscsiadm for each iSCSI Session ${SID}, which is shown in the following example:

/usr/bin/iscsiadm -m session -r ${SID} -s

Unfortunately, the iSCSI session statistics are currently only supported for Open-iSCSI software initiators or dependent hardware iSCSI initiators like the Broadcom BCM577xx or BCM578xx adapters which are covered by the bnx2i kernel module. See Debugging Segfaults in Open-iSCSIs iscsiuio on Intel Broadwell and Backporting Open-iSCSI to Debian 8 "Jessie" for additional information on those dependent hardware iSCSI initiators.

For hardware iSCSI initiators, like the QLogic 4000 and QLogic 8200 Series network adapters and iSCSI HBAs, which provide a full iSCSI offload engine (iSOE) implementation in the adapters firmware, there is currently no support for iSCSI session statistics. Instead, the open-iscsi agent plugin collects several global statistics on each iSOE host ${HST} which is covered by the qla4xxx kernel module with the command shown in the following example:

/usr/bin/iscsiadm -m host -H ${HST} -C stats

The output of the above commands is parsed and reformated by the agent plugin for easier processing in the check plugins. The following example shows the agent plugin output for a system with two BCM578xx dependent hardware iSCSI initiators:

<<<open-iscsi_sessions>>>
bnx2i 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS> bnx2i.f8:ca:b8:7d:bf:2d eth2 10.0.3.52 LOGGED_IN LOGGED_IN NO_CHANGE
bnx2i 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS> bnx2i.f8:ca:b8:7d:c2:34 eth3 10.0.3.53 LOGGED_IN LOGGED_IN NO_CHANGE

<<<open-iscsi_session_stats>>>
[session stats f8:ca:b8:7d:bf:2d iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS>]
txdata_octets: 40960
rxdata_octets: 461171313
noptx_pdus: 0
scsicmd_pdus: 153967
tmfcmd_pdus: 0
login_pdus: 0
text_pdus: 0
dataout_pdus: 0
logout_pdus: 0
snack_pdus: 0
noprx_pdus: 0
scsirsp_pdus: 153967
tmfrsp_pdus: 0
textrsp_pdus: 0
datain_pdus: 112420
logoutrsp_pdus: 0
r2t_pdus: 0
async_pdus: 0
rjt_pdus: 0
digest_err: 0
timeout_err: 0

[session stats f8:ca:b8:7d:c2:34 iqn.2001-05.com.equallogic:8-da6616-807572d50-5080000001758a32-<ISCSI-ALIAS>]
txdata_octets: 16384
rxdata_octets: 255666052
noptx_pdus: 0
scsicmd_pdus: 84312
tmfcmd_pdus: 0
login_pdus: 0
text_pdus: 0
dataout_pdus: 0
logout_pdus: 0
snack_pdus: 0
noprx_pdus: 0
scsirsp_pdus: 84312
tmfrsp_pdus: 0
textrsp_pdus: 0
datain_pdus: 62418
logoutrsp_pdus: 0
r2t_pdus: 0
async_pdus: 0
rjt_pdus: 0
digest_err: 0
timeout_err: 0

The next example shows the agent plugin output for a system with two QLogic 8200 Series hardware iSCSI initiators:

<<<open-iscsi_sessions>>>
qla4xxx 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-57e572d50-80e0000001458a32-v-sto2-tst-000001 qla4xxx.f8:ca:b8:7d:c1:7d.ipv4.0 none 10.0.3.50 LOGGED_IN Unknown Unknown
qla4xxx 10.0.3.4:3260,1 iqn.2001-05.com.equallogic:8-da6616-57e572d50-80e0000001458a32-v-sto2-tst-000001 qla4xxx.f8:ca:b8:7d:c1:7e.ipv4.0 none 10.0.3.51 LOGGED_IN Unknown Unknown

<<<open-iscsi_host_stats>>>
[host stats f8:ca:b8:7d:c1:7d iqn.2000-04.com.qlogic:isp8214.000e1e3574ac.4]
mactx_frames: 563454
mactx_bytes: 52389948
mactx_multicast_frames: 877513
mactx_broadcast_frames: 0
mactx_pause_frames: 0
mactx_control_frames: 0
mactx_deferral: 0
mactx_excess_deferral: 0
mactx_late_collision: 0
mactx_abort: 0
mactx_single_collision: 0
mactx_multiple_collision: 0
mactx_collision: 0
mactx_frames_dropped: 0
mactx_jumbo_frames: 0
macrx_frames: 1573455
macrx_bytes: 440845678
macrx_unknown_control_frames: 0
macrx_pause_frames: 0
macrx_control_frames: 0
macrx_dribble: 0
macrx_frame_length_error: 0
macrx_jabber: 0
macrx_carrier_sense_error: 0
macrx_frame_discarded: 0
macrx_frames_dropped: 1755017
mac_crc_error: 0
mac_encoding_error: 0
macrx_length_error_large: 0
macrx_length_error_small: 0
macrx_multicast_frames: 0
macrx_broadcast_frames: 0
iptx_packets: 508160
iptx_bytes: 29474232
iptx_fragments: 0
iprx_packets: 401785
iprx_bytes: 354673156
iprx_fragments: 0
ip_datagram_reassembly: 0
ip_invalid_address_error: 0
ip_error_packets: 0
ip_fragrx_overlap: 0
ip_fragrx_outoforder: 0
ip_datagram_reassembly_timeout: 0
ipv6tx_packets: 0
ipv6tx_bytes: 0
ipv6tx_fragments: 0
ipv6rx_packets: 0
ipv6rx_bytes: 0
ipv6rx_fragments: 0
ipv6_datagram_reassembly: 0
ipv6_invalid_address_error: 0
ipv6_error_packets: 0
ipv6_fragrx_overlap: 0
ipv6_fragrx_outoforder: 0
ipv6_datagram_reassembly_timeout: 0
tcptx_segments: 508160
tcptx_bytes: 19310736
tcprx_segments: 401785
tcprx_byte: 346637456
tcp_duplicate_ack_retx: 1
tcp_retx_timer_expired: 1
tcprx_duplicate_ack: 0
tcprx_pure_ackr: 0
tcptx_delayed_ack: 106449
tcptx_pure_ack: 106489
tcprx_segment_error: 0
tcprx_segment_outoforder: 0
tcprx_window_probe: 0
tcprx_window_update: 695915
tcptx_window_probe_persist: 0
ecc_error_correction: 0
iscsi_pdu_tx: 401697
iscsi_data_bytes_tx: 29225
iscsi_pdu_rx: 401697
iscsi_data_bytes_rx: 327355963
iscsi_io_completed: 101
iscsi_unexpected_io_rx: 0
iscsi_format_error: 0
iscsi_hdr_digest_error: 0
iscsi_data_digest_error: 0
iscsi_sequence_error: 0

[host stats f8:ca:b8:7d:c1:7e iqn.2000-04.com.qlogic:isp8214.000e1e3574ad.5]
mactx_frames: 563608
mactx_bytes: 52411412
mactx_multicast_frames: 877517
mactx_broadcast_frames: 0
mactx_pause_frames: 0
mactx_control_frames: 0
mactx_deferral: 0
mactx_excess_deferral: 0
mactx_late_collision: 0
mactx_abort: 0
mactx_single_collision: 0
mactx_multiple_collision: 0
mactx_collision: 0
mactx_frames_dropped: 0
mactx_jumbo_frames: 0
macrx_frames: 1573572
macrx_bytes: 441630442
macrx_unknown_control_frames: 0
macrx_pause_frames: 0
macrx_control_frames: 0
macrx_dribble: 0
macrx_frame_length_error: 0
macrx_jabber: 0
macrx_carrier_sense_error: 0
macrx_frame_discarded: 0
macrx_frames_dropped: 1755017
mac_crc_error: 0
mac_encoding_error: 0
macrx_length_error_large: 0
macrx_length_error_small: 0
macrx_multicast_frames: 0
macrx_broadcast_frames: 0
iptx_packets: 508310
iptx_bytes: 29490504
iptx_fragments: 0
iprx_packets: 401925
iprx_bytes: 355436636
iprx_fragments: 0
ip_datagram_reassembly: 0
ip_invalid_address_error: 0
ip_error_packets: 0
ip_fragrx_overlap: 0
ip_fragrx_outoforder: 0
ip_datagram_reassembly_timeout: 0
ipv6tx_packets: 0
ipv6tx_bytes: 0
ipv6tx_fragments: 0
ipv6rx_packets: 0
ipv6rx_bytes: 0
ipv6rx_fragments: 0
ipv6_datagram_reassembly: 0
ipv6_invalid_address_error: 0
ipv6_error_packets: 0
ipv6_fragrx_overlap: 0
ipv6_fragrx_outoforder: 0
ipv6_datagram_reassembly_timeout: 0
tcptx_segments: 508310
tcptx_bytes: 19323952
tcprx_segments: 401925
tcprx_byte: 347398136
tcp_duplicate_ack_retx: 2
tcp_retx_timer_expired: 4
tcprx_duplicate_ack: 0
tcprx_pure_ackr: 0
tcptx_delayed_ack: 106466
tcptx_pure_ack: 106543
tcprx_segment_error: 0
tcprx_segment_outoforder: 0
tcprx_window_probe: 0
tcprx_window_update: 696035
tcptx_window_probe_persist: 0
ecc_error_correction: 0
iscsi_pdu_tx: 401787
iscsi_data_bytes_tx: 37970
iscsi_pdu_rx: 401791
iscsi_data_bytes_rx: 328112050
iscsi_io_completed: 127
iscsi_unexpected_io_rx: 0
iscsi_format_error: 0
iscsi_hdr_digest_error: 0
iscsi_data_digest_error: 0
iscsi_sequence_error: 0

Although a simple Bash shell script, the agent plugin open-iscsi has several dependencies which need to be installed in order for the agent plugin to work properly. Namely those are the commands iscsiadm, sed, tr and egrep. On Debian based systems, the necessary packages can be installed with the following command:

root@host:~# apt-get install coreutils grep open-iscsi sed

The second part of the Check_MK service check for Open-iSCSI provides the necessary check logic through individual inventory and check functions. This is implemented in the three Check_MK check plugins open-iscsi_sessions, open-iscsi_host_stats and open-iscsi_session_stats, which will be discussed separately in the following sections.

Open-iSCSI Session Status

The check plugin open-iscsi_sessions is responsible for the monitoring of individual iSCSI sessions and their internal session states. Upon inventory this check plugin creates a service check for each pair of iSCSI network interface name and IQN of the iSCSI target volume. Unlike the iSCSI session ID, which changes over time (e.g. after iSCSI logout and login), this pair uniquely identifies a iSCSI session on a host. During normal check execution, the list of currently active iSCSI sessions on a host is compared to the list of active iSCSI sessions gathered during inventory on that host. If a session is missing or if the session has an erroneous internal state, an alarm is raised accordingly.

For all types of initiators – software, dependent hardware and hardware – there is the state session_state which can take on the following values:

ISCSI_STATE_FREE
ISCSI_STATE_LOGGED_IN
ISCSI_STATE_FAILED
ISCSI_STATE_TERMINATE
ISCSI_STATE_IN_RECOVERY
ISCSI_STATE_RECOVERY_FAILED
ISCSI_STATE_LOGGING_OUT

An alarm is raised if the session is in any state other than ISCSI_STATE_LOGGED_IN. For software and dependent hardware initiators there are two additional states – connection_state and internal_state. The state connection_state can take on the values:

FREE
TRANSPORT WAIT
IN LOGIN
LOGGED IN
IN LOGOUT
LOGOUT REQUESTED
CLEANUP WAIT

and internal_state can take on the values:

NO CHANGE
CLEANUP       
REOPEN
REDIRECT

In addition to the above session_state, an alarm is raised if the connection_state is in any other state than LOGGED IN and internal_state is in any other state than NO CHANGE.

No performance data is currently reported by this check.

Open-iSCSI Hosts Statistics

The check plugin open-iscsi_host_stats is responsible for the monitoring of the global statistics on a iSOE host. Upon inventory this check plugin creates a service check for each pair of MAC address and iSCSI network interface name. During normal check execution, an extensive list of statistics – see the above example output of the Check_MK agent plugin – is determined for each inventorized item. If the rate of one of the statistics values is above the configured warning and critical threshold values, an alarm is raised accordingly. For all statistics, performance data is reported by the check.

With the additional WATO plugin open-iscsi_host_stats.py it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values. The default values for all statistics are a rate of zero (0) units per second for both warning and critical thresholds. The configuration options for the iSOE host statistics levels can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Storage, Filesystems and Files
         -> Open-iSCSI Host Statistics
            -> Create Rule in Folder ...
               -> The levels for the Open-iSCSI host statistics values
                  [x] The levels for the number of transmitted MAC/Layer2 frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 bytes on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 bytes on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 multicast frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 multicast frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 broadcast frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 broadcast frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 pause frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 pause frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 control frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 control frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 dropped frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 dropped frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 deferral frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 deferral frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 abort frames on an iSOE host.
                  [x] The levels for the number of transmitted MAC/Layer2 jumbo frames on an iSOE host.
                  [x] The levels for the number of MAC/Layer2 late transmit collisions on an iSOE host.
                  [x] The levels for the number of MAC/Layer2 single transmit collisions on an iSOE host.
                  [x] The levels for the number of MAC/Layer2 multiple transmit collisions on an iSOE host.
                  [x] The levels for the number of MAC/Layer2 collisions on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 control frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 dribble on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 frame length errors on an iSOE host.
                  [x] The levels for the number of discarded received MAC/Layer2 frames on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 jabber on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 carrier sense errors on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 CRC errors on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 encoding errors on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 length too large errors on an iSOE host.
                  [x] The levels for the number of received MAC/Layer2 length too small errors on an iSOE host.
                  [x] The levels for the number of transmitted IP packets on an iSOE host.
                  [x] The levels for the number of received IP packets on an iSOE host.
                  [x] The levels for the number of transmitted IP bytes on an iSOE host.
                  [x] The levels for the number of received IP bytes on an iSOE host.
                  [x] The levels for the number of transmitted IP fragments on an iSOE host.
                  [x] The levels for the number of received IP fragments on an iSOE host.
                  [x] The levels for the number of IP datagram reassemblies on an iSOE host.
                  [x] The levels for the number of IP invalid address errors on an iSOE host.
                  [x] The levels for the number of IP packet errors on an iSOE host.
                  [x] The levels for the number of IP fragmentation overlaps on an iSOE host.
                  [x] The levels for the number of IP fragmentation out-of-order on an iSOE host.
                  [x] The levels for the number of IP datagram reassembly timeouts on an iSOE host.
                  [x] The levels for the number of transmitted IPv6 packets on an iSOE host.
                  [x] The levels for the number of received IPv6 packets on an iSOE host.
                  [x] The levels for the number of transmitted IPv6 bytes on an iSOE host.
                  [x] The levels for the number of received IPv6 bytes on an iSOE host.
                  [x] The levels for the number of transmitted IPv6 fragments on an iSOE host.
                  [x] The levels for the number of received IPv6 fragments on an iSOE host.
                  [x] The levels for the number of IPv6 datagram reassemblies on an iSOE host.
                  [x] The levels for the number of IPv6 invalid address errors on an iSOE host.
                  [x] The levels for the number of IPv6 packet errors on an iSOE host.
                  [x] The levels for the number of IPv6 fragmentation overlaps on an iSOE host.
                  [x] The levels for the number of IPv6 fragmentation out-of-order on an iSOE host.
                  [x] The levels for the number of IPv6 datagram reassembly timeouts on an iSOE host.
                  [x] The levels for the number of transmitted TCP segments on an iSOE host.
                  [x] The levels for the number of received TCP segments on an iSOE host.
                  [x] The levels for the number of transmitted TCP bytes on an iSOE host.
                  [x] The levels for the number of received TCP bytes on an iSOE host.
                  [x] The levels for the number of duplicate TCP ACK retransmits on an iSOE host.
                  [x] The levels for the number of received TCP retransmit timer expiries on an iSOE host.
                  [x] The levels for the number of received TCP duplicate ACKs on an iSOE host.
                  [x] The levels for the number of received TCP pure ACKs on an iSOE host.
                  [x] The levels for the number of transmitted TCP delayed ACKs on an iSOE host.
                  [x] The levels for the number of transmitted TCP pure ACKs on an iSOE host.
                  [x] The levels for the number of received TCP segment errors on an iSOE host.
                  [x] The levels for the number of received TCP segment out-of-order on an iSOE host.
                  [x] The levels for the number of received TCP window probe on an iSOE host.
                  [x] The levels for the number of received TCP window update on an iSOE host.
                  [x] The levels for the number of transmitted TCP window probe persist on an iSOE host.
                  [x] The levels for the number of transmitted iSCSI PDUs on an iSOE host.
                  [x] The levels for the number of received iSCSI PDUs on an iSOE host.
                  [x] The levels for the number of transmitted iSCSI Bytes on an iSOE host.
                  [x] The levels for the number of received iSCSI Bytes on an iSOE host.
                  [x] The levels for the number of iSCSI I/Os completed on an iSOE host.
                  [x] The levels for the number of iSCSI unexpected I/Os on an iSOE host.
                  [x] The levels for the number of iSCSI format errors on an iSOE host.
                  [x] The levels for the number of iSCSI header digest (CRC) errors on an iSOE host.
                  [x] The levels for the number of iSCSI data digest (CRC) errors on an iSOE host.
                  [x] The levels for the number of iSCSI sequence errors on an iSOE host.
                  [x] The levels for the number of ECC error corrections on an iSOE host.

The following image shows a status output example from the WATO WebUI with several open-iscsi_sessions (iSCSI Session Status) and open-iscsi_host_stats (iSCSI Host Stats) service checks over two QLogic 8200 Series hardware iSCSI initiators:

Status output example for open-iscsi_sessions and open-iscsi_host_stats service checks over QLogic 8200 Series hardware iSCSI initiators

This example shows six iSCSI Session Status service check items, which are pairs of iSCSI network interface names and – here anonymized – IQNs of the iSCSI target volumes. For each item the current session_state – in this example LOGGED_IN – is shown. There are also two iSCSI Host Stats service check items in the example, which are pairs of MAC addresses and iSCSI network interface names. For each of those items the current throughput rate on the MAC, IP/IPv6, TCP and iSCSI protocol layer is shown. The throughput rate on the MAC protocol layer is also visualized in the Perf-O-Meter, received traffic growing from the middle to the left, transmitted traffic growing from the middle to the right.

The following three images show examples of the PNP4Nagios graphs for the open-iscsi_host_stats (iSCSI Host Stats) service check.

Example PNP4Nagios graphs for a open-iscsi_host_stats service check (MAC Frames, Traffic, MAC Errors)

The middle graph shows a combined view of the throughput rate for received and transmitted traffic on the different MAC, IP/IPv6, TCP and iSCSI protocol layers. The upper graph shows the throughput rate for various frame types on the MAC protocol layer. The lower graph shows the rate for various error frame types on the MAC protocol layer.

Example PNP4Nagios graphs for a open-iscsi_host_stats service check (IP Packets and Fragments, IP Errors, TCP Segments)

The upper graph shows the throughput rate for received and transmitted traffic on the IP/IPv6 protocol layer. The middle graph shows the rate for various error packet types on the IP/IPv6 protocol layer. The lower graph shows the throughput rate for received and transmitted traffic on the TCP protocol layer.

Example PNP4Nagios graphs for a open-iscsi_host_stats service check (TCP Errors, ECC Error Correction, iSCSI PDUs, iSCSI Errors)

The first graph shows the rate for various protocol control and error segment types on the TCP protocol layer. The second graph shows the rate of ECC error corrections that occured on the QLogic 8200 Series hardware iSCSI initiator. The third graph shows the throughput rate for received and transmitted traffic on the iSCSI protocol layer. The fourth and last graph shows the rate for various control and error PDUs on the iSCSI protocol layer.

Open-iSCSI Session Statistics

The check plugin open-iscsi_session_stats is responsible for the monitoring of the statistics on individual iSCSI sessions. Upon inventory this check plugin creates a service check for each pair of MAC address of the network interface and IQN of the iSCSI target volume. During normal check execution, an extensive list of statistics – see the above example output of the Check_MK agent plugin – is collected for each inventorized item. If the rate of one of the statistics values is above the configured warning and critical threshold values, an alarm is raised accordingly. For all statistics, performance data is reported by the check.

With the additional WATO plugin open-iscsi_session_stats.py it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values. The default values for all statistics are a rate of zero (0) units per second for both warning and critical thresholds. The configuration options for the iSCSI session statistics levels can be found in the WATO WebUI under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Storage, Filesystems and Files
         -> Open-iSCSI Session Statistics
            -> Create Rule in Folder ...
               -> The levels for the Open-iSCSI session statistics values
                  [x] The levels for the number of transmitted bytes in an Open-iSCSI session
                  [x] The levels for the number of received bytes in an Open-iSCSI session
                  [x] The levels for the number of digest (CRC) errors in an Open-iSCSI session
                  [x] The levels for the number of timeout errors in an Open-iSCSI session
                  [x] The levels for the number of transmitted NOP commands in an Open-iSCSI session
                  [x] The levels for the number of received NOP commands in an Open-iSCSI session
                  [x] The levels for the number of transmitted SCSI command requests in an Open-iSCSI session
                  [x] The levels for the number of received SCSI command reponses in an Open-iSCSI session
                  [x] The levels for the number of transmitted task management function commands in an Open-iSCSI session
                  [x] The levels for the number of received task management function responses in an Open-iSCSI session
                  [x] The levels for the number of transmitted login requests in an Open-iSCSI session
                  [x] The levels for the number of transmitted logout requests in an Open-iSCSI session
                  [x] The levels for the number of received logout responses in an Open-iSCSI session
                  [x] The levels for the number of transmitted text PDUs in an Open-iSCSI session
                  [x] The levels for the number of received text PDUs in an Open-iSCSI session
                  [x] The levels for the number of transmitted data PDUs in an Open-iSCSI session
                  [x] The levels for the number of received data PDUs in an Open-iSCSI session
                  [x] The levels for the number of transmitted single negative ACKs in an Open-iSCSI session
                  [x] The levels for the number of received ready to transfer PDUs in an Open-iSCSI session
                  [x] The levels for the number of received reject PDUs in an Open-iSCSI session
                  [x] The levels for the number of received asynchronous messages in an Open-iSCSI session

The following image shows a status output example from the WATO WebUI with several open-iscsi_sessions (iSCSI Session Status) and open-iscsi_session_stats (iSCSI Session Stats) service checks over two BCM578xx dependent hardware iSCSI initiators:

Status output example for open-iscsi_sessions and open-iscsi_session_stats service checks over BCM578xx dependent hardware iSCSI initiators

This example shows six iSCSI Session Status service check items, which are pairs of iSCSI network interface names and – here anonymized – IQNs of the iSCSI target volumes. For each item the current session_state, connection_state and internal_state – in this example with the respective values LOGGED_IN, LOGGED_IN and NO_CHANGE – are shown. There are also an equivalent number of iSCSI Session Stats service check items in the example, which are also pairs of MAC addresses of the network interfaces and IQNs of the iSCSI target volumes. For each of those items the current throughput rate of the individual iSCSI session is shown. As long as the rate of the digest (CRC) and timeout error counters is zero, the string no protocol errors is displayed. Otherwise the name and throughput rate of any non-zero error counter is shown. The throughput rate of the iSCSI session is also visualized in the Perf-O-Meter, received traffic growing from the middle to the left, transmitted traffic growing from the middle to the right.

The following image shows an example of the three PNP4Nagios graphs for a single open-iscsi_session_stats (iSCSI Session Stats) service check.

Example PNP4Nagios graphs for a single open-iscsi_session_stats service check

The upper graph shows the throughput rate for received and transmitted traffic of the iSCSI session. The middle graph shows the rate for received and transmitted iSCSI PDUs, broken down by the different types of PDUs on the iSCSI protocol layer. The lower graph shows the rate for the digest (CRC) and timeout errors on the iSCSI protocol layer.

The described Check_MK service check to monitor the status of Open-iSCSI sessions, Open-iSCSI session metrics and iSCSI hardware initiator host metrics has been verified to work with version 2.0.874-2~bpo8+1 of the open-iscsi package from the backports repository of Debian stable (Jessie) on the client side and the Check_MK versions 1.2.6 and 1.2.8 on the server side.

I hope you find the provided new check useful and enjoyed reading this blog post. Please don't hesitate to drop me a note if you have any suggestions or run into any issues with the provided checks.

// Check_MK Monitoring - XFS Filesystem Quotas

XFS, like most other modern filesystems, offers the ability to configure and use disk quotas. Usually those quotas set limits with regard to the amount of allocatable disk space or number of filesystem objects. Those limits are in most filesystems bound to users or groups of users. XFS is one of the few filesystems which offer quota limits on a per directory basis. In the case of XFS directory quotas are implemented via projects. This article introduces a new Check_MK service check to monitor the use of XFS filesystem quotas with a strong focus on directory based quotas.

For the impatient and TL;DR here is the Check_MK package of the XFS filesystem quota monitoring checks:

XFS filesystem quota monitoring checks (Compatible with Check_MK versions 1.2.6 and earlier)
XFS filesystem quota monitoring checks (Compatible with Check_MK versions 1.2.8 and later)

The sources are to be found in my Check_MK repository on GitHub


The Check_MK service check to monitor XFS filesystem quotas consists of two major parts, an agent plugin and a check plugin.

The Check_MK agent plugin named xfs_quota is a simple Bash shell script. It calls the XFS quota administration tool xfs_quota in order to retrieve a report of the current quota usage. The exact call to xfs_quota is:

/usr/sbin/xfs_quota -x -c 'report -p -b -i -a'

which currently reports only block and inode quotas of projects or directories on all availables XFS filesystems. An example output of the above command is:

Project quota on /srv/xfs (/dev/mapper/vg00-xfs)
                               Blocks                                          Inodes                     
Project ID       Used       Soft       Hard    Warn/Grace           Used       Soft       Hard    Warn/ Grace     
---------- -------------------------------------------------- -------------------------------------------------- 
test1               0          0       1024     00 [--------]          1          0          0     00 [--------]
test2               0          0       2048     00 [--------]          1          0          0     00 [--------]
test3               0          0       3072     00 [--------]          1          0          0     00 [--------]

The output of the above command is parsed and reformated by the agent plugin for easier processing in the check plugin. The above example output would thus be transformed into the agent plugin output shown in the following example:

<<<xfs_quota>>>
/srv/xfs:/dev/mapper/vg00-xfs:test1:0:0:1024:1:0:0
/srv/xfs:/dev/mapper/vg00-xfs:test2:0:0:2048:1:0:0
/srv/xfs:/dev/mapper/vg00-xfs:test3:0:0:3072:1:0:0

Although a simple Bash shell script, the agent plugin xfs_quota has several dependencies which need to be installed in order for the agent plugin to work properly. Namely those are the commands xfs_quota, sed and egrep. On Debian based systems, the necessary packages can be installed with the following command:

root@host:~# apt-get install xfsprogs sed grep

The second part, a Check_MK check plugin also named xfs_quota, provides the necessary inventory and check functions. Upon inventory it creates a service check for each pair of XFS filesystem mountpoint and quota project ID. During normal check execution, the number of used XFS filesystem blocks and inodes are determined for each inventorized item (pair of XFS filesystem mountpoint and quota project ID). If the hard and soft, block and inode quotas for particular item are all set to zero, no further checks are carried out and only performance data is reported by the check. If either one of hard or soft, block or inode quota is set to a non-zero value, the number of remaining free XFS filesystem blocks or inodes is compared to warning and critical threshold values and an alarm is raised accordingly.

With the additional WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values. The default values for blocks_hard and blocks_soft are zero (0) free blocks for both warning and critical thresholds. The default values for inodes_hard and inodes_soft are zero (0) free inodes for both warning and critical thresholds. The configuration options for the free block or inode levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Storage, Filesystems and Files
         -> XFS Quota Utilization
            -> Create Rule in Folder ...
               -> The levels for the soft/hard block/inode quotas on XFS filesystems
                  [x] The levels for the hard block quotas on XFS filesystems
                  [x] The levels for the soft block quotas on XFS filesystems
                  [x] The levels for the hard inode quotas on XFS filesystems
                  [x] The levels for the soft inode quotas on XFS filesystems 

The following image shows a status output example for several xfs_quota service checks from the WATO WebUI:

Status output example for xfs_quota service checks

This example shows several service check items, which again are pairs of XFS filesystem mountpoints (here: /backup) and anonymized quota project IDs. For each item the number of blocks and inodes used are shown along with the appropriate hard and soft quota values. The number of blocks and inodes used are also visualized in the perf-o-meter, blocks on a logarithmic scale growing from the middle to the left, inodes on a logarithmic scale growing from the middle to the right.

The following image shows an example of the two PNP4Nagios graphs for a single service check:

Example PNP4Nagios graph for a single xfs_quota service check

The upper graph shows the number of used XFS filesystem blocks for a pair of XFS filesystem mountpoints (here: /backup) and a - again anonymized - quota project ID. The lower graph shows the number of used inodes for the same pair. Both graphs show warning and critical thresholds values, which are in this example at their default value of zero. If configured - like in the upper graph of the example - block or inode quotas are also shown as blue horizontal lines in the respective graphs.

The described Check_MK service check to monitor XFS filesystem quotas has been verified to work with version 3.2.1 of the xfsprogs package on Debian stable (Jessie) on the client side and the Check_MK versions 1.2.6 and 1.2.8 on the server side.

I hope you find the provided new check useful and enjoyed reading this blog post. Please don't hesitate to drop me a note if you have any suggestions or run into any issues with the provided checks.

// Check_MK Monitoring - Dell PowerConnect Switches

Dell PowerConnect and Dell PowerConnect M-Series switches can – with regard to their most important aspects like CPU, fans, PSU and temperature – already be monitored with the standard Check_MK distribution. This article introduces an enhanced version and additional Check_MK service checks to monitor additional aspects of Dell PowerConnect switches. It is targeted mainly towards the Dell PowerConnect M-Series switches used in Dell PowerEdge M1000e blade chassis, but can probably be used on standalone Dell PowerConnect switches as well.

For the impatient and TL;DR here is the Check_MK package of the enhanced version of the Dell PowerConnect monitoring checks:

Enhanced version of the Dell PowerConnect monitoring checks (Compatible with Check_MK versions 1.2.6 and earlier)
Enhanced version of the Dell PowerConnect monitoring checks (Compatible with Check_MK versions 1.2.8 and later)

The sources are to be found in my Check_MK repository on GitHub


The Dell PowerConnect M-Series switches to be used in Dell PowerEdge M1000e blade chassis – and possibly some newer Dell PowerConnect standalone switches too – are based on Broadcom FASTPATH silicon. While this hardware base introduces a plethora of other issues to be covered in detail in a separate article, it also introduces the possibility of breaking backwards compatibility with older Dell PowerConnect models from a monitoring point of view. Therefore, the new checks to cover the Broadcom FASTPATH based hardware were moved to a entirely new namespace. The file names of the new checks now use the prefix dell_powerconnect_bcm_ in contrast to the already existing stock Check_MK checks with their prefix dell_powerconnect_. Another difference to the stock Check_MK checks is the use of the FASTPATH Enterprise MIBs, which are specific to devices based on Broadcom silicon. The only exemptions are the checks dell_powerconnect_bcm_global_status and the dell_powerconnect_bcm_dnsstats, both monitor items which are not covered by the FASTPATH Enterprise MIBs.

All checks have been verified to work with the firmware versions 5.1.8.x and 5.1.9.x. For the newly introduced check dell_powerconnect_bcm_global_status the firmware version 5.1.9.4 or later is needed in order to avoid spurious error messages in the switch event log. See the section Additional Checks below for a more detailed explanation.

The discontinued, modified and additional checks are described in greater detail in the following three respective sections:

Discontinued Checks

The two service checks dell_powerconnect_fans and dell_powerconnect_psu provided by the standard Check_MK distribution have become redundant for the Dell PowerConnect M-Series switches. The items to be monitored by both are not present in those devices, since the Dell PowerEdge M1000e blade chassis provides both central cooling and power supply facilities. Accordingly, the cooling and power supply facilities should be monitored via the Dell Chassis Managment Controller.

Modified Checks

The two service checks dell_powerconnect_cpu and dell_powerconnect_temp have been renamed to dell_powerconnect_bcm_cpu and dell_powerconnect_bcm_temp respectively. They both have been modified to use the new Dictionary based parameters and factory settings for the CPU and temperature warning and critical levels. A SNMP example output for all OIDs used has been added to both service checks for documentation purposes. Manual pages, PNP4Nagios templates, WATO and Perf-O-Meter plugins have also been added for both service checks. With the added WATO plugins it is now possible to configure the CPU and temperature warning and critical levels through the WATO WebUI. The configuration options for the CPU levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Operating System Resources
         -> Dell PowerConnect CPU usage
            -> Create rule in folder ...
               [x] The levels for the overall CPU usage on Dell PowerConnect switches

The configuration options for the temperature levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Temperature, Humidity, Electrical Parameters, etc.
         -> Dell PowerConnect temperature
            -> Create rule in folder ...
               [x] Temperature levels for Dell PowerConnect switches

The following image shows a status output example for the dell_powerconnect_bcm_cpu service check from the WATO WebUI:

Status output example for the modified dell_powerconnect_bcm_cpu service check

Three average CPU utilization values for the time sample intervals 5, 60 and 300 seconds are checked. The Perf-O-Meter is split accordingly into three sections in order to be able to display all three average CPU utilization values at once.

The following image shows a status output example for the dell_powerconnect_bcm_temp service check from the WATO WebUI:

Status output example for the modified dell_powerconnect_bcm_temp service check

This example shows the status and the current values of the temperature sensors in a switch stack with two switch members.

The following two images show examples of PNP4Nagios graphs for both service checks:

Example PNP4Nagios graph for the modified dell_powerconnect_bcm_cpu service check
Example PNP4Nagios graph for the modified dell_powerconnect_bcm_temp service check

Additional Checks

Overview

The following table shows a condensed overview of the additional Check_MK service checks and their available components.

Service check name Description Alarm Manpage PNP4Nagios template Perf-O-Meter plugin WATO plugin
dell_powerconnect_bcm_arp_cache Checks the current number of entries in the ARP cache against default or configured warning and critical threshold values. yes yes yes yes yes
dell_powerconnect_bcm_cos_queue Determines the number of packets dropped at each CoS queue for the CPU. yes yes
dell_powerconnect_bcm_cpu_proc Monitors the CPU utilization on a per process level. yes yes yes yes
dell_powerconnect_bcm_dnsstats Determines the number of DNS queries (total and several error states defined by RFC 1035) of the systems resolver. yes yes
dell_powerconnect_bcm_global_status Determines the global status of the “product”, via a Dell-specific SNMP OID. yes yes
dell_powerconnect_bcm_ip_conflict Determines if an IP address conflict has been detected on the switch. yes yes
dell_powerconnect_bcm_logstats Determines the number of log messages (total, dropped, relayed to syslog hosts) generated on the system. yes yes
dell_powerconnect_bcm_mbuf Determines the number of memory/message buffer allocations – or failures thereof – for packets arriving at the systems CPU. yes yes
dell_powerconnect_bcm_memory Monitors the current memory usage. yes yes yes yes yes
dell_powerconnect_bcm_sntp Checks the current status of the SNTP client on the switch. yes yes yes yes
dell_powerconnect_bcm_ssh_sessions Checks the number of currently active SSH sessions against the default limit of five allowed SSH sessions. yes yes yes yes yes

The first two columns should be pretty self-explanatory.

The Alarm column shows which checks will generate alarms based on the particular parameters monitored. Checks without an entry in the Alarm column are designed purely for long-term trends via their respective PNP4Nagios templates. All checks with an entry in the Alarm column use the new Dictionary based parameters and factory settings for their respective warning and critical levels. Where reasonable, those warning and critical levels are configurable through the WATO WebUI via an appropriate WATO plugin. See the last column, titled WATO plugin for the checks this applies to.

Manual pages are provided for each service check for documentation purposes. A SNMP example output is provided as a comment within the check script for all the OIDs used in the service check.

For all checks with an entry in the PNP4Nagios template column, a PNP4Nagios templates is provided in order to properly display the performance data delivered by the service check. Perf-O-Meter plugins are provided where reasonable, in order to display selected performance metrics in the service check overview of a host.

The specifics of each additional Check_MK service check are described in greater detail in the following sections.

ARP Cache

The Check_MK service check dell_powerconnect_bcm_arp_cache monitors the current total number of entries in the ARP cache on Dell PowerConnect switches. This number is compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 3072; critical: 3584). The configuration options for the ARP cache levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Operating System Resources
         -> Dell PowerConnect ARP cache
            -> Create rule in folder ...
               [x] The levels for the number of ARP cache entries on Dell PowerConnect switches

The following image shows a status output example for the dell_powerconnect_bcm_arp_cache service check from the WATO WebUI:

Status output example for the new dell_powerconnect_bcm_arp_cache service check

This example shows the current number of entries in the ARP cache along with the warning and critical threshold values.

In addition to the already mentioned total number of entries in the ARP cache, several other metrics are also collected as performance data. These are the overall ARP cache size, the number of static ARP entries and the peak values for both the current and the static number of ARP entries. The following image shows an example of the PNP4Nagios graph for the service check:

Example PNP4Nagios graph for the new dell_powerconnect_bcm_arp_cache service check

CoS Queue

The Check_MK service check dell_powerconnect_bcm_cos_queue monitors the number of packets dropped at each CoS queue for the CPU (quoted from the FASTPATH Enterprise MIB). Unfortunately the only other description available in the FASTPATH Enterprise MIBs is almost as cryptic as the first one: Number of packets dropped at this CPU CoS queue because the queue was full. The metric probably relates to the switches Class of Service (CoS) feature in a Quality of Service (QoS) setup. Currently, the dell_powerconnect_bcm_cos_queue service check is used purely for long-term trends via its respective PNP4Nagios template and thus only gathers its metrics as performance data.

The following image shows a status output example for the dell_powerconnect_bcm_cos_queue service check from the WATO WebUI:

Status output example for the new dell_powerconnect_bcm_cos_queue service check

The following image shows an example of the PNP4Nagios graph for the service check:

Example PNP4Nagios graph for the new dell_powerconnect_bcm_cos_queue service check

Process CPU Usage

The Check_MK service check dell_powerconnect_bcm_cpu_proc monitors the same CPU utilization metrics as the previously described dell_powerconnect_bcm_cpu service check, but on a more detailed, per process level. The three average CPU utilization values for the time sample intervals 5, 60 and 300 seconds are for each process compared to either the default or configured warning and critical threshold values and an alarm is raised accordingly. There is currently the limitation in the checks logic that warning and critical threshold values apply globally to all processes. Individual warning and critical threshold values for each process are currently not supported. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 80%; critical: 90%) for the average CPU utilization. The configuration options for the per process CPU utilization levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Operating System Resources
         -> Dell PowerConnect CPU usage (per process)
            -> Create rule in folder ...
               [x] The levels for the per process CPU usage on Dell PowerConnect switches

The following image shows a status output example for the dell_powerconnect_bcm_cpu_proc service check from the WATO WebUI:

Status output example for the new dell_powerconnect_bcm_cpu_proc service check

The following image shows only four examples of PNP4Nagios graphs for the service checks:

Example PNP4Nagios graph for the new dell_powerconnect_bcm_cpu_proc service check

The selected example graphs show the average CPU utilization over the 5, 60 and 300 seconds time sample intervals for the processes SNMPTask, bcmRX, dot1s_timer_task and osapiTimer. Mind though that this is only a small selection of the various processes that can be found running on the Broadcom FASTPATH based Dell PowerConnect switches. Some processes are always to be found, others appear only after a specific feature – covered by appropriate processs – is enabled on the switch. Unfortunately i've not been able to find a complete list of the possible processes nor a good and comprehensive description of the purpose of each process. Sometimes – like in the case of the process SNMPTask – the purpose can be guessed from process name. So overall i'd say the per process CPU utilization metric is probably best used as a metric for long-term trends in conjunction with support from Broadcom or Dell, when dealing with a specific issue on the switch or an unusually high CPU utilization of a specific process.

DNS Statistics

The Check_MK service check dell_powerconnect_bcm_dnsstats monitors various aspects and metrics of the switches local DNS resolver. The metrics gathered can be grouped into three categories:

  • DNS Resolver: The number of DNS resolver queries and the number of DNS responses to those queries. For the DNS responses the number of responses in each response category. The response categories are: Non-auth Answers, Non-auth No-answer, Received Responses, Unparsable Responses, Martians Responses and Fallbacks.

  • DNS Resolver RCODE: The number of DNS resolver responses by resonse code. See 1035 for the details on DNS response codes.

  • DNS Cache: The number of DNS resouce records that have been successfully added or have failed to be added to the DNS resolver cache.

See 1612, 1035 and the service checks man page for a detailed description of the metrics covered by the dell_powerconnect_bcm_dnsstats service check. Currently, the dell_powerconnect_bcm_dnsstats service check is used purely for long-term trends via its respective PNP4Nagios template and thus only gathers its metrics as performance data.

The following image shows a status output example for the dell_powerconnect_bcm_dnsstats service check from the WATO WebUI:

Status output example for the new dell_powerconnect_bcm_dnsstats service check

The following image shows an example of the three PNP4Nagios graphs for the service check:

Example PNP4Nagios graph for the new dell_powerconnect_bcm_dnsstats service check

Global Status

The Check_MK service check dell_powerconnect_bcm_global_status monitors just one metric, the productStatusGlobalStatus from the Dell Vendor MIB for PowerConnect devices. As the name of the metric suggests, it represents an aggregated global status for a Dell PowerConnect device. The global status can assume one of the three values, shown in the following table:

Numeric Value Textual Value Description
3 OK “If fans and power supplies are functioning and the system did not reboot because of a HW watchdog failure or a SW fatal error condition.”
4 Non-critical “If at least one power supply is not functional or the system rebooted at least once because of a HW watchdog failure or a SW fatal error condition.”
5 Critical “If at least one fan is not functional, possibly causing a dangerous warming up of the device.”

While the information about the fan and PSU status is redundant for the Dell PowerConnect M-Series switches, the information about hard- and software error conditions might be quite valueable.

When we first implemented the enhanced version of the Dell PowerConnect monitoring checks, we noticed spurious error messages suddenly appearing in the switch event log and subsequently in our syslog servers. The messages showing up looked like the following example:

<189> OCT 14 12:48:50 <Management IP address>-1 MGMT_ACAL[251047504]: macal_api.c(873) 38462 %% macalRuleActionGet(): List  does not exist.

Disabling one check after another, we narrowed the source of this error message down to the dell_powerconnect_bcm_global_status service check. Logging a support case with Dell eventually lead to the following explaination from Dell PowerConnect engineering:

Hi Frank,

I got an update from our engineering team and they can see the problem
when snmpwalk is executed against switch but issue is not seen if snmpget
is executed on all OIDs.

They are working on a fix.

Once fix is available it will be included in the next FW patch release
for this switch. […]

At the time we were running the newest available firmware, which back then was version 5.1.9.3. After updating to the firmware version 5.1.9.4 which was released later on, the above error messages stopped showing up.

IP Address Conflict Detection

The Check_MK service check dell_powerconnect_bcm_ip_conflict monitors the status of the built-in IP address conflict detection feature of a Dell PowerConnect switch. If an IP address conflict is detected, an alarm with the status warning is raised. In addition to the alarm status, the service check will also report the conflicting IP, (if available) the MAC address of the device causing the conflict and the date and time the conflict was detected. The last bit of information is relative to the switches date and time settings. Needless to say, a properly configured date and time or a time synchronisation via NTP on the siwtch is quite helpful in such a case.

Once an IP address conflict is detected by a Dell PowerConnect switch, this status will not resolve itself automatically or time out in any way. The issue has to be acknowledged manually on the Dell PowerConnect switch. This can be achieved e.g. on the switchs' CLI with the following commands:

switch> enable
switch# clear ip address-conflict-detect

Log Statistics

The Check_MK service check dell_powerconnect_bcm_logstats monitors several metrics of the logging facility on Dell PowerConnect switches. These are the:

  • total number of log messages received by the log process, including dropped and ignored messages.

  • number of dropped log messages, which could not be processed by the log process due to an error or lack of resources.

  • number of relayed log messages. These are log messages which have been forwarded to a remote syslog host by the log process. If multiple remote syslog hosts are configured, each message is counted multiple times, once for each of the configured syslog hosts.

Currently, the dell_powerconnect_bcm_logstats service check is used purely for long-term trends via its respective PNP4Nagios template and thus only gathers its metrics as performance data.

The following image shows a status output example for the dell_powerconnect_bcm_logstats service check from the WATO WebUI:

Status output example for the new dell_powerconnect_bcm_logstats service check

The following image shows an example of the PNP4Nagios graph for the service check:

Example PNP4Nagios graph for the new dell_powerconnect_bcm_logstats service check

Unfortunately the information as to why log messages might have been erroneous or which resources (CPU cycles, free memory, etc.) were missing at the time of processsing the log message is scarce. The metrics about the logging facility are therefore – again – probably best used as metrics for long-term trends in conjunction with support from Broadcom or Dell, when dealing with a specific issue on the switch.

Memory Buffers

The Check_MK service check dell_powerconnect_bcm_mbuf monitors two groups of metrics regarding the memory or message buffers on Dell PowerConnect switches. The first group is the overall number of currently available memory or message buffers on the switch. This group consists of just one metric. The second group is the number of total and the number of failed memory or message buffer allocation attempts for packets arriving at the switches CPU. Those two metrics are gathered for each of memory or message buffer classes. The names of the currently available memory or message buffer classes are “Transmit”, “Rx High”, “Rx Mid0”, “Rx Mid1”, “Rx Mid2” and “Rx Normal”.

The dell_powerconnect_bcm_mbuf service check is currently used only for long-term trends via its respective PNP4Nagios template and thus only gathers its metrics as performance data.

The following image shows a status output example for the dell_powerconnect_bcm_mbuf service check from the WATO WebUI:

Status output example for the new dell_powerconnect_bcm_mbuf service check

The following image shows an example of the seven PNP4Nagios graphs for the service check. One graph for the overall available memory or message buffers and one graph for the allocation attempts on each of the six memory or message buffer classes:

Example PNP4Nagios graph for the new dell_powerconnect_bcm_mbuf service check

Similarly to the process names described in the previous section Process CPU Usage, i've also not been able to find a good and comprehensive description of the memory or message buffer classes defined on the Broadcom FASTPATH based Dell PowerConnect switches. Some meaning can again be derived from the name of the particular memory or message buffer class, but it is much more limited than in case of the process names. Beyond that, questions like the following – but not limited to – immediately come to mind:

  • which type of packets are forwarded to the CPU instead of being directly processed by the switching silicon of the device?

  • why are there several receive classes (“Rx …”) but only one transmit class?

  • what is the difference between the multiple receive classes and by what algorithm are packets assigned to a specific receive class?

  • what is likely the root cause of a failed memory or message buffer allocation attempt?

  • what are the effects of a failed memory or message buffer allocation attempt. Are packets going to be dropped due to this, or is the allocation attempt retried?

  • what design, implementation and configuration options should be taken into consideration in order to avoid failed memory or message buffer allocation attempts?

Unfortunately they remain unanswered due to the lack of comprehensive documentation. The metrics regarding the memory or message buffers are therefore – again – probably best used for long-term trends in conjunction with support from Broadcom or Dell, when dealing with a specific issue on the switch.

Memory Usage

The Check_MK service check dell_powerconnect_bcm_memory monitors the current memory (RAM) usage on Dell PowerConnect switches. The amount of currently free memory is compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 51200 KBytes; critical: 25600 KBytes of free memory). The configuration options for the free memory levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Operating System Resources
         -> Dell PowerConnect memory usage
            -> Create rule in folder ...
               [x] The levels for the amount of free memory on Dell PowerConnect switches

The following image shows a status output example for the dell_powerconnect_bcm_memory service check from the WATO WebUI:

Status output example for the new dell_powerconnect_bcm_memory service check

This example shows the current amount of free memory and the total memory size both measured in kilobytes.

The following image shows an example of the PNP4Nagios graph for the service check:

Example PNP4Nagios graph for the new dell_powerconnect_bcm_memory service check

SNTP Statistics

For this Check_MK service check to be useful and work properly, there needs to be at least one SNTP server configured on the monitored Dell PowerConnect devices. Setting up a SNTP configuration is generally a good idea for a number of reasons. Please refer to the Dell PowerConnect User's Configuration Guide or the Dell PowerConnect M8024-K Command Line Interface Guide on how to do this.

The Check_MK service check dell_powerconnect_bcm_sntp monitors the current status of the SNTP client on Dell PowerConnect switches. In order to achieve this, the check iterates over the list of SNTP servers configured as time references for the SNTP client on the switch. For each configured SNTP server, the status of the last connection attempt from the SNTP client on the switch to that particular SNTP server is evaluated. The overall number of SNTP servers with a connection status equal to success is counted and this number is compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 1 ; critical: 0 servers successfully connected). The configuration options for the levels of successful SNTP server connections can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Applications, Processes & Services
         -> Dell PowerConnect SNTP status
            -> Create rule in folder ...
               [x] Successful SNTP server connections on Dell PowerConnect switches

The following image shows a status output example for the dell_powerconnect_bcm_sntp service check from the WATO WebUI:

Status output example for the new dell_powerconnect_bcm_sntp service check

This example shows the current status of the SNTP client which has successfully connected the one configured SNTP server.

In addition to the aggregated current connection status of the SNTP client to all configured SNTP servers, two other metrics are – for each SNTP server – collected as performance data. These are the overall number of SNTP requests – including retries – and the number of failed SNTP requests the client made to a particular SNTP server. The following image shows an example of the PNP4Nagios graph for the service check:

Example PNP4Nagios graph for the new dell_powerconnect_bcm_sntp service check

SSH Sessions

The Check_MK service check dell_powerconnect_bcm_ssh_sessions monitors just one metric, the number of currently active SSH sessions on the Dell PowerConnect device. This number is compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 5; critical: 5 active SSH sessions). The configuration options for the number of active SSH sessions can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Applications, Processes & Services
         -> Dell PowerConnect SSH sessions
            -> Create rule in folder ...
               [x] Active SSH sessions on Dell PowerConnect switches

The following image shows a status output example for the dell_powerconnect_bcm_ssh_sessions service check from the WATO WebUI:

Status output example for the new dell_powerconnect_bcm_ssh_sessions service check

This example shows the current number of active SSH sessions along with the warning and critical threshold values.

The following image shows an example of the PNP4Nagios graph for the service check:

Example PNP4Nagios graph for the new dell_powerconnect_bcm_ssh_sessions service check

Conclusion

Adding the enhanced version of the Dell PowerConnect monitoring checks to your Check_MK server enables you to monitor various additional aspects of your Dell PowerConnect devices. New Dell PowerConnect devices should pick up the additional service checks immediately. Existing Dell PowerConnect devices might need a Check_MK inventory to be run explicitly on them in order to pick up the additional service checks.

Along with the built-in Check_MK monitoring of interfaces of network equipment, the monitoring of services (SSH, HTTP and HTTPS) and the status of certificates, as well as the previously described monitoring of RMON Interface Statistics, this enhanced version of the Dell PowerConnect monitoring checks enables you to create a complete monitoring solution for your Dell PowerConnect M-Series switches.

I hope you find the provided new and enhanced checks useful and enjoyed reading this blog post. Please don't hesitate to drop me a note if you have any suggestions or run into any issues with the provided checks.

// Check_MK Monitoring - RMON Interface Statistics

Check_MK provides the check rmon_stats to collect and monitor Remote Network MONitoring (RMON) statistics for interfaces of network equipment. While this stock check probably works fine with Cisco network devices, it does – out of the box – not work with network equipment from other vendors like e.g. Dell PowerConnect switches. This article shows the modifications necessary to make the rmon_stats check work with non-Cisco network equipment. Along the way, other shortcomings of the stock rmon_stats are discussed and the necessary modifications to address those issues are shown.

For the impatient and TL;DR here are the enhanced versions of the rmon_stats check here along with a slightly beautified version of the accompanying PNP4Nagios template:

Enhanced version of the rmon_stats check
Slightly beautified version of the rmon_stats PNP4Nagios template

The limitation of the rmon_stats check to Cisco network devices is due to the fact that the Check_MK inventory is being limited to vendor specific OIDs in the checks snmp_scan_function. The following code snippet shows the respective lines:

rmon_stats
check_info["rmon_stats"] = {
    'check_function'        : check_rmon_stats,
    'inventory_function'    : inventory_rmon_stats,
    'service_description'   : 'RMON Stats IF %s',
    'has_perfdata'          : True,
    'snmp_info'             : ('.1.3.6.1.2.1.16.1.1.1', [ #
                                        '1',    # etherStatsIndex = Item
                                        '6',    # etherStatsBroadcastPkts
                                        '7',    # etherStatsMulticastPkts
                                        '14',   # etherStatsPkts64Octets
                                        '15',   # etherStatsPkts65to127Octets
                                        '16',   # etherStatsPkts128to255Octets
                                        '17',   # etherStatsPkts256to511Octets
                                        '18',   # etherStatsPkts512to1023Octets
                                        '19',   # etherStatsPkts1024to1518Octets
                                        ]),
    # for the scan we need to check for any single object in the RMON tree,
    # we choose netDefaultGateway in the hope that it will always be present
    'snmp_scan_function'    : lambda oid: ( oid(".1.3.6.1.2.1.1.1.0").lower().startswith("cisco") \
                    or oid(".1.3.6.1.2.1.1.2.0") == ".1.3.6.1.4.1.11863.1.1.3" \
                    ) and oid(".1.3.6.1.2.1.16.19.12.0") != None,
}

By replacing the snmp_scan_function lines at the bottom of the code above with the following lines:

rmon_stats
    # if at least one interface within the RMON tree is present (the previous
    # "netDefaultGateway" is not present in every device implementing the RMON
    # MIB
    'snmp_scan_function'    : lambda oid: oid(".1.3.6.1.2.1.16.1.1.1.1.1") > 0,

the check will now be able to successfully inventorize the RMON representation of interfaces even on non-Cisco network equipment.

To actually execute such an inventory, there first needs to be a Check_MK configuration rule in place, which enables the appropriate code in the inventory_rmon_stats function of the check. The following code snippet shows the respective lines:

rmon_stats
def inventory_rmon_stats(info):
    settings = host_extra_conf_merged(g_hostname, inventory_if_rules)
    if settings.get("rmon"):
        [...]

I guess this is supposed to be sort of a safeguard, since the number of interfaces can grow quite large in RMON and querying them can put a lot of strain on the service processor of the network device.

The configuration option to enable RMON based checks is neatly tucked away in the WATO WebGUI at:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Inventory - automatic service detection
         -> Network Interface and Switch Port Discovery
            -> Create rule in folder ...
               -> Value
                  [x] Collect RMON statistics data
                      [x] Create extra service with RMON statistics data (if available for the device)

After creating a global or folder specific configuration rule, the next run of the Check_MK inventory should discover a – possibly large – number of new interfaces and create the individual service checks, one for each new interface. If not, the network devices probably has disabled the RMON statistics feature by default or has no such feature at all. Since the procedure to enable the collection and presentation of RMON statistics via SNMP on a network device is different for each vendor, one has to check with the corresponding vendor documentation.

Taking a look at the list of newly discovered interfaces and the resulting service checks reveals two other issues of the rmon_stats check.

One is, the check is rather simple in the way that it indiscriminately collects RMON interface statistics on all interfaces of a network device, regardless of the actual link state of each interface. While the RMON-MIB lacks direct information about the link state of an interface, it fortunately carries a reference to the IF-MIB for each interface. The IF-MIB in turn provides the information about the link state (ifOperStatus), which can be used to determine if RMON statistics should be collected for a particular interface.

The other issue is, the checks use of the RMON interface index (etherStatsIndex) as a base for the name of the associated service. E.g. RMON Stats IF <etherStatsIndex> in the following example output:

OK  RMON Stats IF 1     OK - bcast=0 mcast=0 0-63b=5 64-127b=124 128-255b=28 256-511b=8 512-1023b=2 1024-1518b=94 octets/sec
OK  RMON Stats IF 10    OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 100   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 101   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 102   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 103   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 104   OK - bcast=0 mcast=0 0-63b=0 64-127b=11 128-255b=3 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 105   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 106   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 107   OK - bcast=0 mcast=1 0-63b=1 64-127b=8 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 108   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 109   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 11    OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 110   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 111   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 112   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
[...]

Unfortunately in RMON the values of the interface index etherStatsIndex are not guaranteed to be consistent across reboots of the network device, much less the addition of new or the removal of existing interfaces. The interface numbering definition according to the original MIB-II on the other hand is much stricter in this regard. Although subsequent RFCs like the IF-MIB (see section “3.1.5. Interface Numbering”) have weakened the original definition to some degree.

Another drawback of the pure RMON interface index number as an interface identifier in the service check name is that it is less descriptive than e.g. the interface description ifDescr from the IF-MIB. This makes the visual mapping of the service check to the respective interface of the network device rather tedious and error-prone.

An obvious solution to both issues would be to use the already mentioned reference to the IF-MIB for each interface and – in the process of the Check_MK inventory – check for the interface link state (ifOperStatus). The Check_MK inventory process would return only those interfaces with a value of up(1) in the ifOperStatus variable. For those particular interfaces it would also return the interface description ifDescr to be used as a service name, instead of the previsously used RMON interface index.

As a side effect, this approach unfortunately breaks the necessary mapping from the service name – now based on the IF-MIB interface description (ifDescr) instead of the previously used RMON interface index (etherStatsIndex) – back to the RMON MIB where the interface statistic values are stored. This is mainly due to the fact that the IF-MIB itself does not provide a native reference to the RMON MIB. Without such a back reference, the statistics values collected from the RMON MIB can – during normal service check runs – not distinctly be assigned to the appropriate interface. It is therefore necessary to manually implement such a back reference within the service check. In this case the solution was to store the value of the RMON interface index etherStatsIndex in the parameters section (params) of each interface service checks inventory entry. The following excerpt shows an example of such a new inventory data structure for several RMON interface service checks:

[
  [...]
  ('rmon_stats', 'Link Aggregate 1', {'rmon_if_idx': 89}),
  ('rmon_stats', 'Link Aggregate 16', {'rmon_if_idx': 104}),
  ('rmon_stats', 'Link Aggregate 19', {'rmon_if_idx': 107}),
  ('rmon_stats', 'Link Aggregate 2', {'rmon_if_idx': 90}),
  ('rmon_stats', 'Link Aggregate 31', {'rmon_if_idx': 119}),
  ('rmon_stats', 'Link Aggregate 32', {'rmon_if_idx': 120}),
  ('rmon_stats', 'Link Aggregate 7', {'rmon_if_idx': 95}),
  ('rmon_stats', 'Link Aggregate 8', {'rmon_if_idx': 96}),
  ('rmon_stats', 'Unit: 1 Slot: 0 Port: 1 10G - Level', {'rmon_if_idx': 1}),
  ('rmon_stats', 'Unit: 1 Slot: 0 Port: 10 10G - Level', {'rmon_if_idx': 10}),
  ('rmon_stats', 'Unit: 1 Slot: 0 Port: 16 10G - Level', {'rmon_if_idx': 16}),
  ('rmon_stats', 'Unit: 1 Slot: 0 Port: 19 10G - Level', {'rmon_if_idx': 19}),
  ('rmon_stats', 'Unit: 1 Slot: 0 Port: 2 10G - Level', {'rmon_if_idx': 2}),
  ('rmon_stats', 'Unit: 1 Slot: 0 Port: 20 10G - Level', {'rmon_if_idx': 20}),
  ('rmon_stats', 'Unit: 1 Slot: 0 Port: 7 10G - Level', {'rmon_if_idx': 7}),
  ('rmon_stats', 'Unit: 1 Slot: 0 Port: 8 10G - Level', {'rmon_if_idx': 8}),
  ('rmon_stats', 'Unit: 1 Slot: 0 Port: 9 10G - Level', {'rmon_if_idx': 9}),
  ('rmon_stats', 'Unit: 1 Slot: 1 Port: 3 10G - Level', {'rmon_if_idx': 23}),
  ('rmon_stats', 'Unit: 1 Slot: 1 Port: 4 10G - Level', {'rmon_if_idx': 24}),
  ('rmon_stats', 'Unit: 2 Slot: 0 Port: 1 10G - Level', {'rmon_if_idx': 25}),
  ('rmon_stats', 'Unit: 2 Slot: 0 Port: 10 10G - Level', {'rmon_if_idx': 34}),
  ('rmon_stats', 'Unit: 2 Slot: 0 Port: 16 10G - Level', {'rmon_if_idx': 40}),
  ('rmon_stats', 'Unit: 2 Slot: 0 Port: 19 10G - Level', {'rmon_if_idx': 43}),
  ('rmon_stats', 'Unit: 2 Slot: 0 Port: 2 10G - Level', {'rmon_if_idx': 26}),
  ('rmon_stats', 'Unit: 2 Slot: 0 Port: 20 10G - Level', {'rmon_if_idx': 44}),
  ('rmon_stats', 'Unit: 2 Slot: 0 Port: 7 10G - Level', {'rmon_if_idx': 31}),
  ('rmon_stats', 'Unit: 2 Slot: 0 Port: 8 10G - Level', {'rmon_if_idx': 32}),
  ('rmon_stats', 'Unit: 2 Slot: 0 Port: 9 10G - Level', {'rmon_if_idx': 33}),
  ('rmon_stats', 'Unit: 2 Slot: 1 Port: 3 10G - Level', {'rmon_if_idx': 47}),
  ('rmon_stats', 'Unit: 2 Slot: 1 Port: 4 10G - Level', {'rmon_if_idx': 48}),
  ('rmon_stats', 'Unit: 3 Slot: 0 Port: 1 10G - Level', {'rmon_if_idx': 49}),
  ('rmon_stats', 'Unit: 3 Slot: 0 Port: 2 10G - Level', {'rmon_if_idx': 50}),
  ('rmon_stats', 'Unit: 3 Slot: 0 Port: 9 10G - Level', {'rmon_if_idx': 57}),
  ('rmon_stats', 'Unit: 4 Slot: 0 Port: 1 10G - Level', {'rmon_if_idx': 69}),
  ('rmon_stats', 'Unit: 4 Slot: 0 Port: 2 10G - Level', {'rmon_if_idx': 70}),
  ('rmon_stats', 'Unit: 4 Slot: 0 Port: 9 10G - Level', {'rmon_if_idx': 77}),
  [...]
]

The changes to the rmon_stats check, necessary to implement the features described above are shown in the following patch:

srancid.patch
--- rmon_stats.orig 2015-12-21 13:41:46.000000000 +0100
+++ rmon_stats  2016-02-02 11:18:56.789188249 +0100
@@ -34,20 +34,43 @@
     settings = host_extra_conf_merged(g_hostname, inventory_if_rules)
     if settings.get("rmon"):
         inventory = []
-        for line in info:
-            inventory.append((line[0], None))
+        for line in info[1]:
+            rmon_if_idx = int(re.sub('.1.3.6.1.2.1.2.2.1.1.','',line[1]))
+            for iface in info[0]:
+                if int(iface[0]) == rmon_if_idx and int(iface[2]) == 1:
+                    params = {}
+                    params["rmon_if_idx"] = int(line[0])
+                    inventory.append((iface[1], "%r" % params))
         return inventory
 
-def check_rmon_stats(item, _no_params, info):
-    bytes = { 1: 'bcast', 2: 'mcast', 3: '0-63b', 4: '64-127b', 5: '128-255b', 6: '256-511b', 7: '512-1023b', 8: '1024-1518b' }
-    perfdata = []
+def check_rmon_stats(item, params, info):
+    bytes = { 2: 'bcast', 3: 'mcast', 4: '0-63b', 5: '64-127b', 6: '128-255b', 7: '256-511b', 8: '512-1023b', 9: '1024-1518b' }
+    rmon_if_idx = str(params.get("rmon_if_idx"))
+    if_alias = ''
     infotext = ''
+    perfdata = []
     now = time.time()
-    for line in info:
-        if line[0] == item:
+
+    for line in info[0]:
+        ifIndex, ifDescr, ifOperStatus, ifAlias = line
+        if item == ifDescr:
+            if_alias = ifAlias
+            if_index = int(ifIndex)
+
+    for line in info[1]:
+        if line[0] == rmon_if_idx:
+            if_mib_idx = int(re.sub('.1.3.6.1.2.1.2.2.1.1.','',line[1]))
+            if if_mib_idx != if_index:
+                return (3, "RMON interface index mapping to IF interface index mapping changed. Re-run Check_MK inventory.")
+
+            if item != if_alias and if_alias != '':
+                infotext = "[%s, RMON: %s] " % (if_alias, rmon_if_idx)
+            else:
+                infotext = "[RMON: %s] " % rmon_if_idx
+
             for i, val in bytes.items():
                 octets = int(re.sub(' Packets','',line[i]))
-                rate = get_rate("%s-%s" % (item, val), now, octets)
+                rate = get_rate("%s-%s" % (rmon_if_idx, val), now, octets)
                 perfdata.append((val, rate, 0, 0, 0))
                 infotext += "%s=%.0f " % (val, rate)
             infotext += 'octets/sec'
@@ -58,10 +81,18 @@
 check_info["rmon_stats"] = {
     'check_function'        : check_rmon_stats,
     'inventory_function'    : inventory_rmon_stats,
-    'service_description'   : 'RMON Stats IF %s',
+    'service_description'   : 'RMON Interface %s',
     'has_perfdata'          : True,
-    'snmp_info'             : ('.1.3.6.1.2.1.16.1.1.1', [ #
+    'snmp_info'             : [
+        ( '.1.3.6.1.2.1', [ #
+                                        '2.2.1.1',      # ifNumber
+                                        '2.2.1.2',      # ifDescr
+                                        '2.2.1.8',      # ifOperStatus
+                                        '31.1.1.1.18',  # ifAlias
+                          ]),
+        ('.1.3.6.1.2.1.16.1.1.1', [ #
                                         '1',    # etherStatsIndex = Item
+                                        '2',    # etherStatsDataSource = Interface in the RFC1213-MIB
                                         '6',    # etherStatsBroadcastPkts
                                         '7',    # etherStatsMulticastPkts
                                         '14',   # etherStatsPkts64Octets
@@ -70,10 +101,11 @@
                                         '17',   # etherStatsPkts256to511Octets
                                         '18',   # etherStatsPkts512to1023Octets
                                         '19',   # etherStatsPkts1024to1518Octets
-                                        ]),
-    # for the scan we need to check for any single object in the RMON tree,
-    # we choose netDefaultGateway in the hope that it will always be present
-    'snmp_scan_function'    : lambda oid: ( oid(".1.3.6.1.2.1.1.1.0").lower().startswith("cisco") \
-                    or oid(".1.3.6.1.2.1.1.2.0") == ".1.3.6.1.4.1.11863.1.1.3" \
-                    ) and oid(".1.3.6.1.2.1.16.19.12.0") != None,
+                                  ]),
+        ],
+    # if at least one interface within the RMON tree is present (the previous
+    # "netDefaultGateway" is not present in every device implementing the RMON
+    # MIB
+    'snmp_scan_function'    : lambda oid: oid(".1.3.6.1.2.1.16.1.1.1.1.1") > 0,
+
 }

The following output shows examples of the new service check names. The ifAlias and the etherStatsIndex have been added as auxiliary information to the service check output. The ifAlias entries in the examples have – in order to protect the innocent – been redacted with xxxxxxxx though.

OK  RMON Interface Link Aggregate 1                        OK - [xxxxxxxx, RMON: 89] bcast=0 mcast=0 0-63b=85 64-127b=390 128-255b=136 256-511b=25 512-1023b=20 1024-1518b=407 octets/sec
OK  RMON Interface Link Aggregate 16                       OK - [xxxxxxxx, RMON: 104] bcast=0 mcast=0 0-63b=1 64-127b=11 128-255b=3 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
UNKN    RMON Interface Link Aggregate 19                       UNKNOWN - RMON interface index mapping to IF interface index mapping changed. Re-run Check_MK inventory.
OK  RMON Interface Link Aggregate 2                        OK - [xxxxxxxx, RMON: 90] bcast=0 mcast=0 0-63b=111 64-127b=480 128-255b=207 256-511b=44 512-1023b=41 1024-1518b=484 octets/sec
[...]
OK  RMON Interface Unit: 1 Slot: 0 Port: 1 10G - Level     OK - [xxxxxxxx, RMON: 1] bcast=0 mcast=0 0-63b=49 64-127b=232 128-255b=57 256-511b=14 512-1023b=4 1024-1518b=185 octets/sec
OK  RMON Interface Unit: 1 Slot: 0 Port: 10 10G - Level    OK - [xxxxxxxx, RMON: 10] bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 1 Slot: 0 Port: 16 10G - Level    OK - [xxxxxxxx, RMON: 16] bcast=0 mcast=0 0-63b=0 64-127b=2 128-255b=1 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 1 Slot: 0 Port: 19 10G - Level    OK - [xxxxxxxx, RMON: 19] bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 1 Slot: 0 Port: 2 10G - Level     OK - [xxxxxxxx, RMON: 2] bcast=0 mcast=0 0-63b=62 64-127b=293 128-255b=89 256-511b=28 512-1023b=9 1024-1518b=252 octets/sec
OK  RMON Interface Unit: 1 Slot: 0 Port: 20 10G - Level    OK - [xxxxxxxx, RMON: 20] bcast=0 mcast=0 0-63b=75 64-127b=1640 128-255b=217 256-511b=41 512-1023b=79 1024-1518b=2096 octets/sec
OK  RMON Interface Unit: 1 Slot: 0 Port: 7 10G - Level     OK - [xxxxxxxx, RMON: 7] bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 1 Slot: 0 Port: 8 10G - Level     OK - [xxxxxxxx, RMON: 8] bcast=0 mcast=0 0-63b=0 64-127b=5 128-255b=1 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 1 Slot: 0 Port: 9 10G - Level     OK - [xxxxxxxx, RMON: 9] bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 1 Slot: 1 Port: 3 10G - Level     OK - [xxxxxxxx, RMON: 23] bcast=0 mcast=0 0-63b=0 64-127b=3 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 1 Slot: 1 Port: 4 10G - Level     OK - [xxxxxxxx, RMON: 24] bcast=0 mcast=0 0-63b=0 64-127b=3 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 2 Slot: 0 Port: 1 10G - Level     OK - [xxxxxxxx, RMON: 25] bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 2 Slot: 0 Port: 10 10G - Level    OK - [xxxxxxxx, RMON: 34] bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 2 Slot: 0 Port: 16 10G - Level    OK - [xxxxxxxx, RMON: 40] bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 2 Slot: 0 Port: 19 10G - Level    OK - [xxxxxxxx, RMON: 43] bcast=0 mcast=1 0-63b=1 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
[...]
OK  RMON Interface Unit: 4 Slot: 0 Port: 9 10G - Level     OK - [xxxxxxxx, RMON: 77] bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec

The enhanced version of the rmon_stats check can be downloaded here along with a slightly beautified version of the accompanying PNP4Nagios template:

Enhanced version of the rmon_stats check
Patch to enhance the original verion of the rmon_stats check
Slightly beautified version of the rmon_stats PNP4Nagios template
Patch to beautify the rmon_stats PNP4Nagios template

Finally, the following two screenshots show examples of PNP4Nagios graphs using the modified version of the PNP4Nagios template:

Example RMON interface statistics graph using the modified version of the PNP4Nagios template
Example RMON interface statistics graph using the modified version of the PNP4Nagios template

The sources to both, the enhanced version of the rmon_stats check and the beautified version of the rmon_stats PNP4Nagios template, can be found on GitHub in my Check_MK Plugins repository.

// Nagios Monitoring - IBM SVC and Storwize (Update)

After upgrading our test IBM SAN Volume Controller (SVC) systems from version 7.3.0.7 to 7.3.0.8 or later, the previously described Nagios monitoring plugin (Nagios Monitoring - IBM SVC and Storwize) ceased to work. A quick check revealed that the “wbemcli” command line tool from the Standards Based Linux Instrumentation project, which is used in the Nagios plugin to query the CIMOM server on the SVC or Storwize systems, would fail with the following error message:

$ /opt/sblim-wbemcli/bin/wbemcli -dx -noverify ecn https://user:pass@svc-test:5989/root/ibm
*
* /opt/sblim-wbemcli/bin/wbemcli: Http Exception: SSL connect error
*

Re-checking the release notes, this sudden change in behaviour seemed to be explained by the fix:

SSL vulnerability CVE-2014-3566

Not really being that verbose a description, a quick look at the CVE-2014-3566 showed that this is a fix for the “POODLE” issue. So IBM probably switched off the support for the SSLv3 protocol in the SVC and Storwize code. But why would this cause the “wbemcli” command line tool to fail? Here are the steps taken in an analysis of the issue:

  1. First, i was trying to get the “wbemcli” command line tool to be a tad more verbose about what it is actually doing:

    $ /opt/sblim-wbemcli/bin/wbemcli -dx -noverify ecn https://user:pass@svc-test:5989/root/ibm
    To server: <?xml version="1.0" encoding="utf-8" ?>
    <CIM CIMVERSION="2.0" DTDVERSION="2.0">
    <MESSAGE ID="4711" PROTOCOLVERSION="1.0"><SIMPLEREQ><IMETHODCALL NAME="EnumerateClassNames"><LOCALNAMESPACEPATH><NAMESPACE NAME="root"></NAMESPACE><NAMESPACE NAME="ibm"></NAMESPACE></LOCALNAMESPACEPATH>
    <IPARAMVALUE NAME="DeepInheritance"><VALUE>TRUE</VALUE></IPARAMVALUE>
    </IMETHODCALL></SIMPLEREQ>
    </MESSAGE></CIM>
    *
    * /opt/sblim-wbemcli/bin/wbemcli: Http Exception: SSL connect error
    *
    

    Not really an abundance of information in here too.

  2. Getting the source code for and, while we're at it, updating the “wbemcli” command line tool from version 1.6.0 to 1.6.3. Spent some time looking through the source code and with the “gdb” debugger to get a feeling for the general program flow and functions/methods being called. While looking through the source code i noticed the cURL debugging options are being set if a environment variable named “CURLDEBUG” is set to “true”. Later also found this mentioned in the ChangeLog file:

    $ CURLDEBUG=true /opt/sblim-wbemcli/bin/wbemcli -dx -noverify ecn https://user:pass@svc-test:5989/root/ibm
    To server: <?xml version="1.0" encoding="utf-8" ?>
    <CIM CIMVERSION="2.0" DTDVERSION="2.0">
    <MESSAGE ID="4711" PROTOCOLVERSION="1.0"><SIMPLEREQ><IMETHODCALL NAME="EnumerateClassNames"><LOCALNAMESPACEPATH><NAMESPACE NAME="root"></NAMESPACE><NAMESPACE NAME="ibm"></NAMESPACE></LOCALNAMESPACEPATH>
    <IPARAMVALUE NAME="DeepInheritance"><VALUE>TRUE</VALUE></IPARAMVALUE>
    </IMETHODCALL></SIMPLEREQ>
    </MESSAGE></CIM>
    * About to connect() to svc-test port 5989 (#0)
    *   Trying 192.168.x.x...
    * connected
    * Connected to svc-test (192.168.x.x) port 5989 (#0)
    * successfully set certificate verify locations:
    *   CAfile: none
      CApath: /etc/ssl/certs
    * Unknown SSL protocol error in connection to svc-test:5989
    * Closing connection #0
    * SSL connect error
    *
    * /opt/sblim-wbemcli/bin/wbemcli: Http Exception: SSL connect error
    *
    

    Now we know that we're encountering a “Unknown SSL protocol error” – not that helpful either.

  3. Searched the web for further information on how to debug the cURL library and found the debug.c example source code, which was very helpful. Incorperated it into the file “CimCurl.cpp”:

    sblim-wbemcli-1.6.3_debug.patch
    --- sblim-wbemcli-1.6.3_orig/CimCurl.cpp    2013-09-21 01:26:32.000000000 +0200
    +++ sblim-wbemcli-1.6.3_new/CimCurl.cpp     2014-11-26 16:30:19.000000000 +0100
    @@ -37,6 +37,100 @@
     extern int waitTime;
     extern int expect100;
     
    +// Trace Begin
    +struct data {
    +  char trace_ascii; /* 1 or 0 */ 
    +};
    +
    +static
    +void dump(const char *text,
    +          FILE *stream, unsigned char *ptr, size_t size,
    +          char nohex)
    +{ 
    +  size_t i;
    +  size_t c;
    +  
    +  unsigned int width=0x10;
    +  
    +  if(nohex)
    +    /* without the hex output, we can fit more on screen */
    +    width = 0x40;
    +
    +  fprintf(stream, "%s, %010.10ld bytes (0x%08.8lx)\n",
    +          text, (long)size, (long)size);
    +
    +  for(i=0; i<size; i+= width) {
    +
    +    fprintf(stream, "%04.4lx: ", (long)i);
    +
    +    if(!nohex) {
    +      /* hex not disabled, show it */
    +      for(c = 0; c < width; c++)
    +        if(i+c < size)
    +          fprintf(stream, "%02x ", ptr[i+c]);
    +        else
    +          fputs("   ", stream);
    +    }
    +
    +    for(c = 0; (c < width) && (i+c < size); c++) {
    +      /* check for 0D0A; if found, skip past and start a new line of output */
    +      if (nohex && (i+c+1 < size) && ptr[i+c]==0x0D && ptr[i+c+1]==0x0A) {
    +        i+=(c+2-width);
    +        break;
    +      }
    +      fprintf(stream, "%c",
    +              (ptr[i+c]>=0x20) && (ptr[i+c]<0x80)?ptr[i+c]:'.');
    +      /* check again for 0D0A, to avoid an extra \n if it's at width */
    +      if (nohex && (i+c+2 < size) && ptr[i+c+1]==0x0D && ptr[i+c+2]==0x0A) {
    +        i+=(c+3-width);
    +        break;
    +      }
    +    }
    +    fputc('\n', stream); /* newline */
    +  }
    +  fflush(stream);
    +}
    +
    +static
    +int my_trace(CURL *handle, curl_infotype type,
    +             char *data, size_t size,
    +             void *userp)
    +{
    +  struct data *config = (struct data *)userp;
    +  const char *text;
    +  (void)handle; /* prevent compiler warning */
    +
    +  switch (type) {
    +  case CURLINFO_TEXT:
    +    fprintf(stderr, "== Info: %s", data);
    +  default: /* in case a new one is introduced to shock us */
    +    return 0;
    +
    +  case CURLINFO_HEADER_OUT:
    +    text = "=> Send header";
    +    break;
    +  case CURLINFO_DATA_OUT:
    +    text = "=> Send data";
    +    break;
    +  case CURLINFO_SSL_DATA_OUT:
    +    text = "=> Send SSL data";
    +    break;
    +  case CURLINFO_HEADER_IN:
    +    text = "<= Recv header";
    +    break;
    +  case CURLINFO_DATA_IN:
    +    text = "<= Recv data";
    +    break;
    +  case CURLINFO_SSL_DATA_IN:
    +    text = "<= Recv SSL data";
    +    break;
    +  }
    +
    +  dump(text, stderr, (unsigned char *)data, size, config->trace_ascii);
    +  return 0;
    +}
    +// Trace End
    +
     // These are the constant headers added to all requests
     static const char *headers[] = {
         "Content-Type: application/xml; charset=\"utf-8\"",
    @@ -152,6 +246,11 @@
         CURLcode rv;
         string sb;
     
    +// Trace Begin
    +struct data config;
    +config.trace_ascii = 1;
    +// Trace End
    +
         mUri = url.scheme + "://" + url.host + ":" + url.port + "/cimom";
         url.ns.toStringBuffer(sb,"%2F");
     
    @@ -248,6 +347,11 @@
     
         rv = curl_easy_setopt(mHandle, CURLOPT_WRITEHEADER, &mErrorData);
         rv = curl_easy_setopt(mHandle, CURLOPT_HEADERFUNCTION, headerCb);
    +
    +// Trace Begin
    +    rv = curl_easy_setopt(mHandle, CURLOPT_DEBUGFUNCTION, my_trace);
    +    rv = curl_easy_setopt(mHandle, CURLOPT_DEBUGDATA, &config);
    +// Trace End
     }
     
     static string getErrorMessage(CURLcode err)

    rebuild the “wbemcli” command line tool and tried again:

    $ CURLDEBUG=true /opt/sblim-wbemcli/bin/wbemcli -dx -noverify ecn https://user:pass@svc-test:5989/root/ibm
    To server: <?xml version="1.0" encoding="utf-8" ?>
    <CIM CIMVERSION="2.0" DTDVERSION="2.0">
    <MESSAGE ID="4711" PROTOCOLVERSION="1.0"><SIMPLEREQ><IMETHODCALL NAME="EnumerateClassNames"><LOCALNAMESPACEPATH><NAMESPACE NAME="root"></NAMESPACE><NAMESPACE NAME="ibm"></NAMESPACE></LOCALNAMESPACEPATH>
    <IPARAMVALUE NAME="DeepInheritance"><VALUE>TRUE</VALUE></IPARAMVALUE>
    </IMETHODCALL></SIMPLEREQ>
    </MESSAGE></CIM>
    == Info: About to connect() to svc-test port 5989 (#0)
    == Info:   Trying 192.168.x.x...
    == Info: connected
    == Info: Connected to svc-test (192.168.x.x) port 5989 (#0)
    == Info: found 172 certificates in /etc/ssl/certs/ca-certificates.crt
    == Info: gnutls_handshake() failed: A TLS packet with unexpected length was received.
    == Info: Closing connection #0
    == Info: SSL connect error
    *
    * /opt/sblim-wbemcli/bin/wbemcli: Http Exception: SSL connect error
    *
    

    Now we know there is an issue in the way the GnuTLS library used by libcURL interacts with the CIMOM server on the SVC or Storwize systems.

  4. Again, searching the web for similar issues with the error message “gnutls_handshake() failed: A TLS packet with unexpected length was received.”, we find a rebuild against the OpenSSL library instead of the GnuTLS library could solve this issue:

    $ dpkg -l | grep curl
    ii  curl                        7.26.0-1+wheezy11   powerpc     command line tool for transferring data with URL syntax
    ii  libcurl3:powerpc            7.26.0-1+wheezy11   powerpc     easy-to-use client-side URL transfer library (OpenSSL flavour)
    ii  libcurl3-gnutls:powerpc     7.26.0-1+wheezy11   powerpc     easy-to-use client-side URL transfer library (GnuTLS flavour)
    ii  libcurl4-gnutls-dev         7.26.0-1+wheezy11   powerpc     development files and documentation for libcurl (GnuTLS flavour)
    
    $ apt-get install libcurl4-openssl-dev
    Reading package lists... Done
    Building dependency tree
    Reading state information... Done
    Suggested packages:
      libcurl3-dbg
    The following packages will be REMOVED:
      libcurl4-gnutls-dev
    The following NEW packages will be installed:
      libcurl4-openssl-dev
    0 upgraded, 1 newly installed, 1 to remove and 0 not upgraded.
    Need to get 0 B/1,259 kB of archives.
    After this operation, 28.7 kB of additional disk space will be used.
    Do you want to continue [Y/n]? y
    (Reading database ... 65717 files and directories currently installed.)
    Removing libcurl4-gnutls-dev ...
    Processing triggers for man-db ...
    Selecting previously unselected package libcurl4-openssl-dev.
    (Reading database ... 65463 files and directories currently installed.)
    Unpacking libcurl4-openssl-dev (from .../libcurl4-openssl-dev_7.26.0-1+wheezy11_powerpc.deb) ...
    Processing triggers for man-db ...
    Setting up libcurl4-openssl-dev (7.26.0-1+wheezy11) ...
    
    $ dpkg -l | grep curl
    ii  curl                        7.26.0-1+wheezy11   powerpc     command line tool for transferring data with URL syntax
    ii  libcurl3:powerpc            7.26.0-1+wheezy11   powerpc     easy-to-use client-side URL transfer library (OpenSSL flavour)
    ii  libcurl3-gnutls:powerpc     7.26.0-1+wheezy11   powerpc     easy-to-use client-side URL transfer library (GnuTLS flavour)
    ii  libcurl4-openssl-dev        7.26.0-1+wheezy11   powerpc     development files and documentation for libcurl (OpenSSL flavour)
    

    Rebuild the “wbemcli” command line tool and tried again:

    $ CURLDEBUG=true /opt/sblim-wbemcli/bin/wbemcli -dx -noverify ecn https://user:pass@svc-test:5989/root/ibm
    To server: <?xml version="1.0" encoding="utf-8" ?>
    <CIM CIMVERSION="2.0" DTDVERSION="2.0">
    <MESSAGE ID="4711" PROTOCOLVERSION="1.0"><SIMPLEREQ><IMETHODCALL NAME="EnumerateClassNames"><LOCALNAMESPACEPATH><NAMESPACE NAME="root"></NAMESPACE><NAMESPACE NAME="ibm"></NAMESPACE></LOCALNAMESPACEPATH>
    <IPARAMVALUE NAME="DeepInheritance"><VALUE>TRUE</VALUE></IPARAMVALUE>
    </IMETHODCALL></SIMPLEREQ>
    </MESSAGE></CIM>
    == Info: About to connect() to svc-test port 5989 (#0)
    == Info:   Trying 192.168.x.x...
    == Info: connected
    == Info: Connected to svc-test (192.168.x.x) port 5989 (#0)
    == Info: successfully set certificate verify locations:
    == Info:   CAfile: none
      CApath: /etc/ssl/certs
    == Info: SSLv3, TLS handshake, Client hello (1):
    => Send SSL data, 0000000134 bytes (0x00000086)
    0000: ......T.."\.~....`...K4.......p..7"&4...Z.....9.8.........5.....
    0040: ................3.2.....E.D...../...A...........................
    0080: ......
    == Info: Unknown SSL protocol error in connection to svc-test:5989
    == Info: Closing connection #0
    == Info: SSL connect error
    *
    * /opt/sblim-wbemcli/bin/wbemcli: Http Exception: SSL connect error
    *
    

    Now we know that the “wbemcli” command line tool is actually – as already suspected – trying to initiate a SSLv3 connection.

  5. In order to confirm we're on the right track, try to first verify manually that we're unable to connct with a SSLv3 secured connection:

    $ openssl s_client -host svc-test -port 5989 -ssl3
    CONNECTED(00000003)
    write:errno=104
    ---
    no peer certificate available
    ---
    No client certificate CA names sent
    ---
    SSL handshake has read 0 bytes and written 0 bytes
    ---
    New, (NONE), Cipher is (NONE)
    Secure Renegotiation IS NOT supported
    Compression: NONE
    Expansion: NONE
    SSL-Session:
        Protocol  : SSLv3
        Cipher    : 0000
        Session-ID:
        Session-ID-ctx:
        Master-Key:
        Key-Arg   : None
        PSK identity: None
        PSK identity hint: None
        SRP username: None
        Start Time: 1418397594
        Timeout   : 7200 (sec)
        Verify return code: 0 (ok)
    ---
    quit
    

    And that we're instead able to connect with a TLS secured connection:

    $ openssl s_client -host svc-test -port 5989
    CONNECTED(00000003)
    depth=0 C = GB, L = Hursley, O = IBM, OU = SSG, CN = 2145, emailAddress = support@ibm.com
    verify error:num=18:self signed certificate
    verify return:1
    depth=0 C = GB, L = Hursley, O = IBM, OU = SSG, CN = 2145, emailAddress = support@ibm.com
    verify return:1
    ---
    Certificate chain
     0 s:/C=GB/L=Hursley/O=IBM/OU=SSG/CN=2145/emailAddress=support@ibm.com
       i:/C=GB/L=Hursley/O=IBM/OU=SSG/CN=2145/emailAddress=support@ibm.com
    ---
    Server certificate
    -----BEGIN CERTIFICATE-----
    MIICyDCCAjGgAwIBAgIEUAPzmTANBgkqhkiG9w0BAQUFADBqMQswCQYDVQQGEwJH
    QjEQMA4GA1UEBxMHSHVyc2xleTEMMAoGA1UEChMDSUJNMQwwCgYDVQQLEwNTU0cx
    DTALBgNVBAMTBDIxNDUxHjAcBgkqhkiG9w0BCQEWD3N1cHBvcnRAaWJtLmNvbTAe
    Fw0xMjA3MTYxMDU3MjlaFw0yNzA3MTMxMDU3MjlaMGoxCzAJBgNVBAYTAkdCMRAw
    DgYDVQQHEwdIdXJzbGV5MQwwCgYDVQQKEwNJQk0xDDAKBgNVBAsTA1NTRzENMAsG
    A1UEAxMEMjE0NTEeMBwGCSqGSIb3DQEJARYPc3VwcG9ydEBpYm0uY29tMIGfMA0G
    CSqGSIb3DQEBAQUAA4GNADCBiQKBgQC3E7+7mE2GAID/35o5/s7cnzoqu9PQdOGB
    ryGMa8adD4Wd9hpmTkrsgyNvkUB6sPIifbFstGooOkQtK9ZNgP5OHOorZmqINSxM
    9goCkSCQG9xRKAvNt2tA8gujaV+p42oVEhIH6naJUul96qZI31y3GffUu2CRrJL7
    4wG/8cv0BQIDAQABo3sweTAJBgNVHRMEAjAAMCwGCWCGSAGG+EIBDQQfFh1PcGVu
    U1NMIEdlbmVyYXRlZCBDZXJ0aWZpY2F0ZTAdBgNVHQ4EFgQUkPMkXUjn0YHlfQW8
    TJiRC5jWQO4wHwYDVR0jBBgwFoAUkPMkXUjn0YHlfQW8TJiRC5jWQO4wDQYJKoZI
    hvcNAQEFBQADgYEAKqu7KpVxnOXonQE3unC1O7qUHKoyQUEWqcKsM/4tPI+lsBMZ
    jvoPwn8yQRWiLehFmVc8VSZfdFPLzshNabXp5qbZo/EFberXrgI2CbtPiULYyyyH
    DUhWF+vhwb6uqwfBbGncvTvI2ewU8+0oTXsuTkSjumJ7+chpaHFWWyj2cJA=
    -----END CERTIFICATE-----
    subject=/C=GB/L=Hursley/O=IBM/OU=SSG/CN=2145/emailAddress=support@ibm.com
    issuer=/C=GB/L=Hursley/O=IBM/OU=SSG/CN=2145/emailAddress=support@ibm.com
    ---
    No client certificate CA names sent
    ---
    SSL handshake has read 1029 bytes and written 498 bytes
    ---
    New, TLSv1/SSLv3, Cipher is AES256-GCM-SHA384
    Server public key is 1024 bit
    Secure Renegotiation IS supported
    Compression: NONE
    Expansion: NONE
    SSL-Session:
        Protocol  : TLSv1.2
        Cipher    : AES256-GCM-SHA384
        Session-ID: 48291D368E0A8584A8DFA00A9881B8979BDE370FC6C9439294C670695D031239
        Session-ID-ctx:
        Master-Key: 71BD12A161FC595CD056DA8E6D6E27420F37468E47498B7591A403A86844C55F61FF02B2FEC7739FAAEDCE3DFEA0F217
        Key-Arg   : None
        PSK identity: None
        PSK identity hint: None
        SRP username: None
        TLS session ticket lifetime hint: 300 (seconds)
        TLS session ticket:
        0000 - a0 e0 b1 9b c2 37 9a ca-49 1c 54 f5 26 4b d6 24   .....7..I.T.&K.$
        0010 - af 6a 7d cc 5e 4a 97 a8-b3 6d b7 66 0b b7 0a 65   .j}.^J...m.f...e
        0020 - 47 af ef 47 76 fc c7 e9-38 ff 84 28 ca 8e 73 25   G..Gv...8..(..s%
        0030 - 47 25 f6 0d 36 01 04 f1-f9 f7 0c b6 42 ef cf 09   G%..6.......B...
        0040 - 64 8f df ff 89 38 ed 7c-ae 1d 0e 25 d1 c1 77 86   d....8.|...%..w.
        0050 - b6 61 88 15 cf fe 9f 20-86 0d 17 74 18 da ea c0   .a..... ...t....
        0060 - 33 3a 47 f5 f9 51 24 ae-48 37 8a 3f 19 dd c6 04   3:G..Q$.H7.?....
        0070 - 7e d1 20 78 35 99 0b 9f-3b 1f ce 7c bc 11 93 e4   ~. x5...;..|....
        0080 - 0f 94 de 94 f1 0d 0c da-64 ca 0d f6 10 2a c8 fa   ........d....*..
        0090 - dc 3e e4 1a 97 d1 34 7a-9c f5 c3 00 e8 1b 10 d7   .>....4z........
    
        Start Time: 1418397614
        Timeout   : 300 (sec)
        Verify return code: 18 (self signed certificate)
    ---
    quit
    

    The secured connection negotiated to TLS succeeded, so we're on the right track!

  6. Now we need to find the spot in the source code, where the “wbemcli” command line tool is forced to initiate a SSLv3 connection. We know this is probably done in a cURL related function call, since libcURL is used for the network connection. So lets first look into the cURL code for all the lines showing any sign of SSL related operations:

    $ grep -n curl CimCurl.cpp | grep -i ssl
    185:    // Assume we support SSL if we don't have the curl_version_info API
    276:    rv = curl_easy_setopt(mHandle, CURLOPT_SSL_VERIFYHOST, 0);
    277:    //    rv = curl_easy_setopt(mHandle, CURLOPT_SSL_VERIFYPEER, 0);
    280:    rv = curl_easy_setopt(mHandle, CURLOPT_SSLVERSION, 3);
    441:     if ((rv=curl_easy_setopt(mHandle,CURLOPT_SSL_VERIFYPEER,0))) {
    448:       if ((rv=curl_easy_setopt(mHandle,CURLOPT_SSL_VERIFYPEER,1))) {
    466:     if ((rv=curl_easy_setopt(mHandle,CURLOPT_SSLCERT,certificate))) {
    470:     if ((rv=curl_easy_setopt(mHandle,CURLOPT_SSLKEY,key))) {
    

    Looks like a perfect match in line 280 of the file “CimCurl.cpp”. The line numbers are a bit off from the original source code, since the file “CimCurl.cpp” was patched with our above debugging code. The code in the original, unpatched source file “CimCurl.cpp”, within the function “CimomCurl::genRequest” looks like this:

    CimCurl.cpp
    175     [...]
    176     /* Disable SSL host verification */
    177     rv = curl_easy_setopt(mHandle, CURLOPT_SSL_VERIFYHOST, 0);
    178     //    rv = curl_easy_setopt(mHandle, CURLOPT_SSL_VERIFYPEER, 0);
    179
    180     /* Force using SSL V3 */
    181     rv = curl_easy_setopt(mHandle, CURLOPT_SSLVERSION, 3);
    182
    183     /* Set username and password */
    184     if (url.user.length() > 0 && url.password.length() > 0) {
    185         mUserPass = url.user + ":" + url.password;
    186         rv = curl_easy_setopt(mHandle, CURLOPT_USERPWD, mUserPass.c_str());
    187     }
    188     [...]

    In line 181 the cURL option “CURLOPT_SSLVERSION” is indiscriminately set to use SSLv3 and nothing else, which we know from the above deduction is bound to fail on systems adressing the “POODLE” issues.

  7. With the knowledge where the issue is actually caused, an easy quick'n'dirty fix can be implemented:

    sblim-wbemcli-1.6.3_debug.patch
    --- sblim-wbemcli-1.6.3_orig/CimCurl.cpp    2013-09-21 01:26:32.000000000 +0200
    +++ sblim-wbemcli-1.6.3/CimCurl.cpp         2014-11-26 16:46:09.000000000 +0100
    @@ -178,7 +178,7 @@
         //    rv = curl_easy_setopt(mHandle, CURLOPT_SSL_VERIFYPEER, 0);
     
         /* Force using SSL V3 */
    -    rv = curl_easy_setopt(mHandle, CURLOPT_SSLVERSION, 3);
    +    //rv = curl_easy_setopt(mHandle, CURLOPT_SSLVERSION, 3);
     
         /* Set username and password */
         if (url.user.length() > 0 && url.password.length() > 0) {

    Inserting the comment at the line where the cURL option “CURLOPT_SSLVERSION” is forced to SSLv3 causes libcURL to fall back to its default value, which is now TLS.

    Rebuild the “wbemcli” command line tool and tried again:

    $ CURLDEBUG=true /opt/sblim-wbemcli/bin/wbemcli -dx -noverify ecn https://user:pass@svc-test:5989/root/ibm
    To server: <?xml version="1.0" encoding="utf-8" ?>
    <CIM CIMVERSION="2.0" DTDVERSION="2.0">
    <MESSAGE ID="4711" PROTOCOLVERSION="1.0"><SIMPLEREQ><IMETHODCALL NAME="EnumerateClassNames"><LOCALNAMESPACEPATH><NAMESPACE NAME="root"></NAMESPACE><NAMESPACE NAME="ibm"></NAMESPACE></LOCALNAMESPACEPATH>
    <IPARAMVALUE NAME="DeepInheritance"><VALUE>TRUE</VALUE></IPARAMVALUE>
    </IMETHODCALL></SIMPLEREQ>
    </MESSAGE></CIM>
    * About to connect() to svc-test port 5989 (#0)
    *   Trying 192.168.x.x...
    * connected
    * Connected to svc-test (192.168.x.x) port 5989 (#0)
    * found 172 certificates in /etc/ssl/certs/ca-certificates.crt
    *    server certificate verification SKIPPED
    *    common name: 2145 (does not match 'svc-test')
    *    server certificate expiration date OK
    *    server certificate activation date OK
    *    certificate public key: RSA
    *    certificate version: #3
    *    subject: C=GB,L=Hursley,O=IBM,OU=SSG,CN=2145,EMAIL=support@ibm.com
    *    start date: Mon, 16 Jul 2012 10:57:29 GMT
    
    *    expire date: Tue, 13 Jul 2027 10:57:29 GMT
    
    *    issuer: C=GB,L=Hursley,O=IBM,OU=SSG,CN=2145,EMAIL=support@ibm.com
    *    compression: NULL
    *    cipher: AES-128-CBC
    *    MAC: SHA1
    * Server auth using Basic with user 'user'
    > POST /cimom HTTP/1.1
    Authorization: Basic enp6bmFnaW9zOm5hZ2lvcw==
    Host: svc-test:5989
    Content-Type: application/xml; charset="utf-8"
    Connection: Keep-Alive, TE
    CIMProtocolVersion: 1.0
    CIMOperation: MethodCall
    CIMMethod: EnumerateClassNames
    CIMObject: root%2Fibm
    Content-Length: 396
    
    * upload completely sent off: 396 out of 396 bytes
    * additional stuff not fine transfer.c:1037: 0 0
    * HTTP 1.1 or later with persistent connection, pipelining supported
    < HTTP/1.1 200 OK
    < Content-Type: application/xml; charset="utf-8"
    From server: Content-Type: application/xml; charset="utf-8"
    < content-length: 0000072284
    From server: content-length: 0000072284
    < CIMOperation: MethodResponse
    From server: CIMOperation: MethodResponse
    <
    * Connection #0 to host svc-test left intact
    From server: <?xml version="1.0" encoding="utf-8" ?>
    <CIM CIMVERSION="2.0" DTDVERSION="2.0">
    <MESSAGE ID="4711" PROTOCOLVERSION="1.0">
    <SIMPLERSP>
    <IMETHODRESPONSE NAME="EnumerateClassNames">
    <IRETURNVALUE>
    <CLASSNAME NAME="CIM_ConcreteIdentity"/>
    <CLASSNAME NAME="CIM_NetworkPacketAction"/>
    <CLASSNAME NAME="CIM_CollectionInSystem"/>
    [...]
    
    $ /opt/sblim-wbemcli/bin/wbemcli -noverify ecn https://user:pass@svc-test:5989/root/ibm
    svc-test:5989/root/ibm:CIM_ConcreteIdentity
    svc-test:5989/root/ibm:CIM_NetworkPacketAction
    svc-test:5989/root/ibm:CIM_CollectionInSystem
    svc-test:5989/root/ibm:CIM_DeviceSAPImplementation
    svc-test:5989/root/ibm:CIM_ProtocolControllerAccessesUnit
    svc-test:5989/root/ibm:CIM_ControlledBy
    [...]
    

    Great, now we're finally able to query and monitor the IBM SVC or Storwize systems again with the Nagios monitoring plugin (Nagios Monitoring - IBM SVC and Storwize)!

Between first noticing and researching the issue and creating the quick'n'dirty fix shown above, an official bug report has been filed on the issue and a patch has already been submitted to the source code repository in order to adress the issue more thoroughly. Hopefully an updated official source code package will be released soon.

Nonetheless, the process of researching and debugging this issue was an excellent hands on exercise for me, which i enjoyed very much. Hopefully the steps taken and described here, will turn out to be of use for others as well.

This website uses cookies for visitor traffic analysis. By using the website, you agree with storing the cookies on your computer.More information