bityard Blog

// Debugging Segfaults in Open-iSCSI's iscsiuio on Intel Broadwell

Open-iSCSI's tool iscsiuio, which is used to configure and manage Broadcom BCM577xx and BCM578xx iSCSI offload engines (iSOE), currently crashes with a segmentation fault upon target login on Intel Broadwell-based systems. Comparable system setups based on the older Intel Haswell chips do not show this issue.

In the past I've been using QLogic 4000 and QLogic 8200 Series network adapters and iSCSI HBAs, which provide a full iSCSI offload engine (iSOE) implementation in the adapter's firmware. Unfortunately the QLogic 8200 Series network adapters are no longer available for Dell M-Series blade servers. The alternatives offered by Dell are the Intel X520 and Intel X540 series adapters, or the Broadcom BCM57810S series adapters. Instead of using the Intel X520/X540, which provide no iSOE at all, I decided to go with the Broadcom BCM57810S, which at least provides some kind of iSOE. In VMware terminology the Broadcom BCM57810S is a dependent hardware iSCSI initiator. Dependent in this context means that the iSOE does not implement all the necessary features and thus cannot perform all the tasks (e.g. TCP/IP stack, configuration and session management, authentication, etc.) necessary for target handling by itself. Rather, some of these tasks are provided by a third party on which this kind of iSOE depends. In case of the Broadcom BCM57810S this third party is the iscsiuio daemon, which has for some time been part of the Open-iSCSI project. Simply put, the iscsiuio daemon acts as an intermediary between iscsid on the one side and the QLogic1) NetXtreme II driver (kernel module bnx2 or bnx2x) and the QLogic2) CNIC driver (kernel module cnic) on the other side, facilitating the creation and overall management of offloaded iSCSI sessions. Very much simplified, the flow of information is as follows:

iscsiadm ←→ iscsid ←→ iscsiuio ←→ bnx2/bnx2x ←→ cnic ←→ Broadcom BCM57810S adapter ←→ Network ←→ Target

In my environment the Broadcom BCM57810S adapters are installed and used on six hosts (host1 and host{5,6,7,8,9}). They all connect to the same Dell EqualLogic storage systems in the backend, using the same network dedicated to iSCSI traffic. All hosts are Dell M630 blade servers with exactly the same firmware, operating system (Debian 8) and software versions. I'm using a backported version of Open-iSCSI, which is based on Git commit 0fa43f29, but excluding the commits 76832662 and c6d1117b which just implement the externalization of the Open-iSNS code. The systems originate from the same install image, so their configuration is – to a large extent – exactly the same. The only difference between the hosts is that host1 has Intel E5 v3 (aka Haswell) CPUs, while host{5,6,7,8,9} have Intel E5 v4 (aka Broadwell) CPUs.

On host1 everything works fine, iscsiuio runs as expected and access to targets via the Broadcom BCM57810S iSOEs works flawlessly. On host{5,6,7,8,9} on the other hand, I was getting segmentation faults like the one in the example shown below while trying to log in to any target.

host5:~# gdb /sbin/iscsiuio
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /sbin/iscsiuio...(no debugging symbols found)...done.


(gdb) # run -d 4 -f
Starting program: /sbin/iscsiuio -d 4 -f
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
INFO  [Wed Jul 27 10:01:45 2016]Initialize logger using log file: /var/log/iscsiuio.log
INFO  [Wed Jul 27 10:01:45 2016]Started iSCSI uio stack: Ver 0.7.8.2
INFO  [Wed Jul 27 10:01:45 2016]Build date: Fri Jul 22 15:40:04 CEST 2016
INFO  [Wed Jul 27 10:01:45 2016]Debug mode enabled
INFO  [Wed Jul 27 10:01:45 2016]Running on sysname: 'Linux', release: '3.16.0-4-amd64', version '#1 SMP Debian 3.16.7-ckt
25-2+deb8u3 (2016-07-02)' machine: 'x86_64'
DBG   [Wed Jul 27 10:01:45 2016]Loaded nic library 'bnx2' Version: '0.7.8.2' build on Fri Jul 22 15:40:04 CEST 2016'
DBG   [Wed Jul 27 10:01:45 2016]Added 'bnx2' nic library
DBG   [Wed Jul 27 10:01:45 2016]Loaded nic library 'bnx2x' Version: '0.7.8.2' build on Fri Jul 22 15:40:04 CEST 2016'
DBG   [Wed Jul 27 10:01:45 2016]Added 'bnx2x' nic library
[New Thread 0x7ffff760f700 (LWP 4942)]
INFO  [Wed Jul 27 10:01:45 2016]signal handling thread ready
INFO  [Wed Jul 27 10:01:45 2016]nic_utils Found host[11]: host11
INFO  [Wed Jul 27 10:01:45 2016]Done capturing /sys/class/iscsi_host/host11/netdev
INFO  [Wed Jul 27 10:01:45 2016]Done capturing /sys/class/iscsi_host/host11/netdev
INFO  [Wed Jul 27 10:01:45 2016]nic_utils looking for uio device for eth3
WARN  [Wed Jul 27 10:01:45 2016]Could not find assoicate uio device with eth3
ERR   [Wed Jul 27 10:01:45 2016]nic_utils Could not determine UIO name for eth3
INFO  [Wed Jul 27 10:01:45 2016]nic_utils Found host[12]: host12
INFO  [Wed Jul 27 10:01:45 2016]Done capturing /sys/class/iscsi_host/host12/netdev
INFO  [Wed Jul 27 10:01:45 2016]Done capturing /sys/class/iscsi_host/host12/netdev
INFO  [Wed Jul 27 10:01:45 2016]nic_utils looking for uio device for eth2
INFO  [Wed Jul 27 10:01:45 2016]nic_utils eth2 associated with uio0
INFO  [Wed Jul 27 10:01:45 2016]nic_utils NIC not found creating an instance for host_no: 12 eth2
DBG   [Wed Jul 27 10:01:45 2016]Could not increase process priority: Success
[New Thread 0x7ffff6e0e700 (LWP 4943)]
DBG   [Wed Jul 27 10:01:45 2016]iscsi_ipc Started iscsid listening thread
DBG   [Wed Jul 27 10:01:45 2016]iscsi_ipc Waiting for iscsid command
INFO  [Wed Jul 27 10:01:45 2016]NIC_NL Netlink to CNIC on pid 4938 is ready
DBG   [Wed Jul 27 10:01:57 2016]iscsi_ipc recv iscsid request: cmd: 1, payload_len: 11720
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc Received request for 'eth2' to set IP address: '10.0.1.62' VLAN: '0'
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc Using netmask: 0.0.0.0
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc  eth2, using existing NIC
INFO  [Wed Jul 27 10:01:57 2016]nic_utils looking for uio device for eth2
INFO  [Wed Jul 27 10:01:57 2016]nic_utils eth2 associated with uio0
INFO  [Wed Jul 27 10:01:57 2016]Done capturing /sys/class/uio/uio0/name
INFO  [Wed Jul 27 10:01:57 2016]nic_utils eth2: Verified uio name bnx2x_cnic with library bnx2x
INFO  [Wed Jul 27 10:01:57 2016]eth2: found NIC with library 'bnx2x'
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc eth2 library set using transport_name bnx2i
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc eth2: requesting configuration using static IP address
DBG   [Wed Jul 27 10:01:57 2016]iscsi_ipc eth2 couldn't find interface with ip_type: 0x2 creating it
INFO  [Wed Jul 27 10:01:57 2016]nic eth2: Added nic interface for VLAN: 0, protocol: 2
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc eth2: created network interface
[New Thread 0x7ffff660d700 (LWP 4947)]
WARN  [Wed Jul 27 10:01:57 2016]nic_utils eth2: device already disabled: flag: 0x1088 state: 0x1
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc eth2: configuring using static IP IPv4 address :10.0.1.62
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc  netmask: 255.255.255.0
[New Thread 0x7ffff5e0c700 (LWP 4948)]
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc ISCSID_UIP_IPC_GET_IFACE: command: 1 name: bnx2i.d0:43:1e:51:98:53, netdev: eth2 ipaddr: 10.0.1.62 vlan: 0 transport_name:bnx2i
INFO  [Wed Jul 27 10:01:57 2016]nic_utils eth2: spinning up thread for nic
DBG   [Wed Jul 27 10:01:57 2016]iscsi_ipc Waiting for iscsid command
[New Thread 0x7ffff560b700 (LWP 4949)]
DBG   [Wed Jul 27 10:01:57 2016]nic eth2: Waiting to be enabled
INFO  [Wed Jul 27 10:01:57 2016]Created nic thread: eth2
INFO  [Wed Jul 27 10:01:57 2016]iscsi_ipc eth2: started NIC enable thread state: 0x1
DBG   [Wed Jul 27 10:01:57 2016]nic eth2: is now enabled
INFO  [Wed Jul 27 10:01:57 2016]bnx2x eth2: bnx2x driver using version 1.78.19
ERR   [Wed Jul 27 10:01:58 2016]bnx2x /dev/uio0: uio device has been brought up via pid: 4938 on fd: 7
INFO  [Wed Jul 27 10:01:58 2016]Done capturing /sys/class/uio/uio0/name
INFO  [Wed Jul 27 10:01:58 2016]bnx2x eth2: Verified is a cnic_uio device
DBG   [Wed Jul 27 10:01:58 2016]bnx2x eth2: using rx ring size: 15, rx buffer size: 1024
INFO  [Wed Jul 27 10:01:58 2016]Done capturing /sys/class/uio/uio0/event
DBG   [Wed Jul 27 10:01:58 2016]bnx2x Chip ID: 168e1000
INFO  [Wed Jul 27 10:01:58 2016]nic_id eth2: is found at 03:00.00
INFO  [Wed Jul 27 10:01:58 2016]bnx2x eth2: func 0x0, pfid 0x0, client_id 0x88, cid 0x1
DBG   [Wed Jul 27 10:01:58 2016]bnx2x eth2: mode = 0x100
INFO  [Wed Jul 27 10:01:58 2016]bnx2x eth2:  Using mac address: d0:43:1e:51:98:53
INFO  [Wed Jul 27 10:01:58 2016]eth2: bnx2x initialized
INFO  [Wed Jul 27 10:01:58 2016]nic eth2: Initialized ip stack: VLAN: 0
INFO  [Wed Jul 27 10:01:58 2016]nic eth2: mac: d0:43:1e:51:98:53
INFO  [Wed Jul 27 10:01:58 2016]nic eth2: Using IP address: 10.0.1.62
INFO  [Wed Jul 27 10:01:58 2016]nic eth2: Using netmask: 255.255.255.0
INFO  [Wed Jul 27 10:01:58 2016]nic eth2: enabled vlan 0 protocol: 2
INFO  [Wed Jul 27 10:01:58 2016]nic eth2: entering main nic loop
DBG   [Wed Jul 27 10:01:58 2016]nic_utils eth2: device enabled
[Thread 0x7ffff5e0c700 (LWP 4948) exited]
DBG   [Wed Jul 27 10:01:59 2016]iscsi_ipc recv iscsid request: cmd: 1, payload_len: 11720
INFO  [Wed Jul 27 10:01:59 2016]iscsi_ipc Received request for 'eth2' to set IP address: '10.0.1.62' VLAN: '0'
INFO  [Wed Jul 27 10:01:59 2016]iscsi_ipc Using netmask: 0.0.0.0
INFO  [Wed Jul 27 10:01:59 2016]iscsi_ipc  eth2, using existing NIC
INFO  [Wed Jul 27 10:01:59 2016]nic_utils looking for uio device for eth2
INFO  [Wed Jul 27 10:01:59 2016]nic_utils eth2 associated with uio0
INFO  [Wed Jul 27 10:01:59 2016]eth2: Have NIC library 'bnx2x'
INFO  [Wed Jul 27 10:01:59 2016]Done capturing /sys/class/uio/uio0/name
INFO  [Wed Jul 27 10:01:59 2016]nic_utils eth2: Verified uio name bnx2x_cnic with library bnx2x
INFO  [Wed Jul 27 10:01:59 2016]eth2: found NIC with library 'bnx2x'
INFO  [Wed Jul 27 10:01:59 2016]iscsi_ipc eth2 library set using transport_name bnx2i
INFO  [Wed Jul 27 10:01:59 2016]iscsi_ipc eth2: requesting configuration using static IP address
INFO  [Wed Jul 27 10:01:59 2016]iscsi_ipc eth2: using existing network interface
INFO  [Wed Jul 27 10:01:59 2016]iscsi_ipc eth2: IP configuration didn't change using 0x2
INFO  [Wed Jul 27 10:01:59 2016]iscsi_ipc eth2: NIC already enabled flags: 0x1084 state: 0x4

INFO  [Wed Jul 27 10:01:59 2016]iscsi_ipc ISCSID_UIP_IPC_GET_IFACE: command: 1 name: bnx2i.d0:43:1e:51:98:53, netdev: eth2 ipaddr: 10.0.1.62 vlan: 0 transport_name:bnx2i
DBG   [Wed Jul 27 10:01:59 2016]iscsi_ipc Waiting for iscsid command
INFO  [Wed Jul 27 10:02:00 2016]NIC_NL Received path_req for host 12
INFO  [Wed Jul 27 10:02:00 2016]Done capturing /sys/class/iscsi_host/host12/netdev
DBG   [Wed Jul 27 10:02:00 2016]NIC_NL Pulled nl event
INFO  [Wed Jul 27 10:02:00 2016]NIC_NL eth2: Processing 'path_req'
DBG   [Wed Jul 27 10:02:00 2016]NIC_NL eth2: PATH_REQ with iface_num -1 VLAN 32768
DBG   [Wed Jul 27 10:02:00 2016]CNIC eth2: Netlink message with VLAN ID: 0, path MTU: 9000 minor: 0 ip_addr_len: 4
DBG   [Wed Jul 27 10:02:00 2016]CNIC eth2: src=10.0.1.62
DBG   [Wed Jul 27 10:02:00 2016]CNIC eth2: dst=10.0.1.2
DBG   [Wed Jul 27 10:02:00 2016]CNIC eth2: nm=255.255.255.0
INFO  [Wed Jul 27 10:02:00 2016]CNIC eth2: Didn't find IPv4: '10.0.1.2' in ARP table
DBG   [Wed Jul 27 10:02:00 2016]CNIC eth2: Sent cnic arp request for IP: 10.0.1.2
INFO  [Wed Jul 27 10:02:00 2016]Found 10.0.1.2 at b0:83:fe:cc:57:bb
DBG   [Wed Jul 27 10:02:00 2016]CNIC neighbor reply sent back to kernel 10.0.1.62 at b0:83:fe:cc:57:bb with vlan 0
INFO  [Wed Jul 27 10:02:00 2016]NIC_NL eth2: 'path_req' operation finished

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff660d700 (LWP 4947)]
__lll_unlock_elision (lock=0x55555577fd40, private=0) at ../nptl/sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
29      ../nptl/sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or directory.


(gdb) # info threads
  Id   Target Id         Frame
  6    Thread 0x7ffff560b700 (LWP 4949) "iscsiuio" 0x00007ffff76f1ae3 in select () at ../sysdeps/unix/syscall-template.S:81
* 4    Thread 0x7ffff660d700 (LWP 4947) "iscsiuio" __lll_unlock_elision (lock=0x55555577fd40, private=0) at ../nptl/sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
  3    Thread 0x7ffff6e0e700 (LWP 4943) "iscsiuio" 0x00007ffff79c9ccd in accept () at ../sysdeps/unix/syscall-template.S:81
  2    Thread 0x7ffff760f700 (LWP 4942) "iscsiuio" do_sigwait (set=<optimized out>, sig=0x7ffff760eeac) at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:63
  1    Thread 0x7ffff7fea700 (LWP 4938) "iscsiuio" 0x00007ffff79c9e9d in recvmsg () at ../sysdeps/unix/syscall-template.S:81


(gdb) # thread apply all bt

Thread 6 (Thread 0x7ffff560b700 (LWP 4949)):
#0  0x00007ffff76f1ae3 in select () at ../sysdeps/unix/syscall-template.S:81
#1  0x000055555555ac06 in ?? ()
#2  0x000055555555d39e in nic_loop ()
#3  0x00007ffff79c30a4 in start_thread (arg=0x7ffff560b700) at pthread_create.c:309
#4  0x00007ffff76f887d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 4 (Thread 0x7ffff660d700 (LWP 4947)):
#0  __lll_unlock_elision (lock=0x55555577fd40, private=0) at ../nptl/sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
#1  0x00007ffff79c7007 in pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:94
#2  0x000055555555e803 in nl_process_handle_thread ()
#3  0x00007ffff79c30a4 in start_thread (arg=0x7ffff660d700) at pthread_create.c:309
#4  0x00007ffff76f887d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 3 (Thread 0x7ffff6e0e700 (LWP 4943)):
#0  0x00007ffff79c9ccd in accept () at ../sysdeps/unix/syscall-template.S:81
#1  0x00005555555641b0 in ?? ()
#2  0x00007ffff79c30a4 in start_thread (arg=0x7ffff6e0e700) at pthread_create.c:309
#3  0x00007ffff76f887d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7ffff760f700 (LWP 4942)):
#0  do_sigwait (set=<optimized out>, sig=0x7ffff760eeac) at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:63
#1  0x00007ffff79ca693 in __sigwait (set=0x7ffff760eeb0, sig=0x0) at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:97
#2  0x000055555555a49c in _start ()

Thread 1 (Thread 0x7ffff7fea700 (LWP 4938)):
#0  0x00007ffff79c9e9d in recvmsg () at ../sysdeps/unix/syscall-template.S:81
#1  0x000055555555e5e9 in ?? ()
#2  0x000055555555eea8 in nic_nl_open ()
#3  0x000055555555a1b8 in main ()

The Open-iSCSI command which led to this segmentation fault was a simple login to a previously defined target node:

host5:~# iscsiadm -m node -T <target iqn> -I <interface> --login

Double-checking the configuration, the firmware and software versions as well as the general hardware setup didn't yield any usable indication as to where the root cause of this issue might lie. Searching the web for __lll_unlock_elision in conjunction with the pthread_* function calls led me to the following resources:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=800574
https://lwn.net/Articles/534758/

These point towards a problem specific to certain CPUs (Broadwell and Skylake) which surfaces when mutexes are not used carefully. The general opinion there, and in other related bug reports, seems to be that the source of such issues is almost always an improper use of mutex locking, which – up to now – has either been tolerated or has just by chance not led to a failure. More recent CPUs and software implementations (e.g. the GNU libc) appear to be less forgiving in this regard. The advice is thus to change the application behaviour towards a proper use of mutex locking in order to address such issues.

The article “Intel's Broadwell Xeon E5-2600 v4 chips: So what's in it for you, smartie-pants coders” offers a rather nice write-up of the new features introduced in the Intel Broadwell CPUs.

Tracking this issue further down in the Open-iSCSI sources, I ended up in the function nl_process_handle_thread() in iscsiuio/src/unix/nic_nl.c, specifically in the following code section:

iscsiuio/src/unix/nic_nl.c
474 /* NIC specific nl processing thread */
475 void *nl_process_handle_thread(void *arg)
476 {
[...]
483         while (!event_loop_stop) {
484                 char *data = NULL;
485
486                 rc = pthread_cond_wait(&nic->nl_process_cond,
487                                        &nic->nl_process_mutex);
488                 if (rc != 0) {
489                         LOG_ERR("Fatal error in NL processing thread "
490                                 "during wait[%s]", strerror(rc));
491                         break;
492                 }
[...]
499                 pthread_mutex_unlock(&nic->nl_process_mutex);
[...]

Debugging this revealed that the call to pthread_cond_wait() shown in the above GDB backtrace of thread number 4 is the one from line 486 in the above code snippet.

Looking at the pthread_cond_wait() manpage showed the following constraint for its proper use:

[…]
The pthread_cond_timedwait() and pthread_cond_wait() functions shall
block on a condition variable. They shall be called with mutex locked
by the calling thread or undefined behavior results.
[…]

Although not shown in the above GDB output, this would on occasion – and again, probably just by chance – work on the first pass of the loop. At the end of the loop, at line 499 in the above code snippet, the mutex is then unlocked. Thus the cited constraint from the pthread_cond_wait() manpage is no longer met on subsequent passes of the loop. On Intel E5 v3 (aka Haswell) CPUs this seemed to be tolerated and without any impact. But on Intel E5 v4 (aka Broadwell) CPUs – and probably other CPUs implementing hardware lock elision (HLE) and restricted transactional memory (RTM) – this causes the observed segmentation fault.
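For illustration, the same contract exists in other threading APIs as well. The following minimal sketch – deliberately in Python rather than C, and of course not taken from the iscsiuio sources – shows the canonical pattern of locking the mutex before waiting on the condition variable. Python's threading.Condition enforces at runtime what POSIX merely declares to be undefined behaviour:

cond_wait_example.py
import threading

cond = threading.Condition()       # bundles a mutex and a condition variable
events = []

def consumer():
    # The lock MUST be held before calling wait(). Python raises a
    # RuntimeError if it is not, whereas POSIX pthread_cond_wait()
    # silently invokes undefined behaviour - which is exactly what
    # iscsiuio was relying on before the fix.
    with cond:                     # pthread_mutex_lock()
        while not events:          # re-check the predicate after each wakeup
            cond.wait()            # atomically unlocks, sleeps and re-locks
        print("consumed:", events.pop(0))
                                   # leaving the block: pthread_mutex_unlock()

def producer():
    with cond:                     # the same lock guards the shared state
        events.append("path_req")
        cond.notify()              # pthread_cond_signal()

t = threading.Thread(target=consumer)
t.start()
producer()
t.join()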

In order to verify my analysis and test this theory, I added a call to pthread_mutex_lock() right before the call to pthread_cond_wait() in line 486. The resulting change is available in the Git commit 9f770f9e of my Open-iSCSI fork on GitHub and is also shown in the following patch:

nic_nl.c.patch
diff --git a/iscsiuio/src/unix/nic_nl.c b/iscsiuio/src/unix/nic_nl.c
index 391003f..581ddb0 100644
--- a/iscsiuio/src/unix/nic_nl.c
+++ b/iscsiuio/src/unix/nic_nl.c
@@ -483,6 +483,7 @@ void *nl_process_handle_thread(void *arg)
        while (!event_loop_stop) {
                char *data = NULL;
 
+               pthread_mutex_lock(&nic->nl_process_mutex);
                rc = pthread_cond_wait(&nic->nl_process_cond,
                                       &nic->nl_process_mutex);
                if (rc != 0) {

Posting this on the Open-iSCSI mailing list led to this discussion. The suggestion from there was an additional change to the error handling code of nl_process_handle_thread(), starting at line 488 in the above code snippet. This adds proper handling of the locked mutex in case the loop is left due to an error returned from the call to pthread_cond_wait(). The resulting additional change is available in the Git commit 4191ca6b of my Open-iSCSI fork on GitHub. The following patch shows the summarised code changes:

nic_nl.c.patch
diff --git a/iscsiuio/src/unix/nic_nl.c b/iscsiuio/src/unix/nic_nl.c
index 391003f..1a920c7 100644
--- a/iscsiuio/src/unix/nic_nl.c
+++ b/iscsiuio/src/unix/nic_nl.c
@@ -483,9 +483,11 @@ void *nl_process_handle_thread(void *arg)
        while (!event_loop_stop) {
                char *data = NULL;
 
+               pthread_mutex_lock(&nic->nl_process_mutex);
                rc = pthread_cond_wait(&nic->nl_process_cond,
                                       &nic->nl_process_mutex);
                if (rc != 0) {
+                       pthread_mutex_unlock(&nic->nl_process_mutex);
                        LOG_ERR("Fatal error in NL processing thread "
                                "during wait[%s]", strerror(rc));
                        break;

With those two small changes to the sources of Open-iSCSI's iscsiuio, the iSCSI connections via the Broadcom BCM57810S iSOE now work flawlessly even on newer Intel E5 v4 (aka Broadwell) based systems. Hopefully the original authors of the iscsiuio code at Broadcom/QLogic will also take part in the discussion and provide their feedback on the proposed code changes.

1), 2) formerly Broadcom

// Compatibility of DokuWiki "Detritus" and the Taratasy Template

The DokuWiki Taratasy template offers a rather nice design, which I wanted to use for my new Wiki. This Wiki will be based upon the currently stable DokuWiki 2015-08-10a (aka “Detritus”) release. The Taratasy template's site currently shows its compatibility with the DokuWiki “Detritus” release as “unknown”, see the following screenshot:

Taratasy template compatibility status

Installing the Taratasy template in an otherwise working DokuWiki 2015-08-10a instance led to the rather broken appearance shown in the following example screenshot:

Screenshot of the Taratasy template in a DokuWiki 2015-08-10a (aka "Detritus") release

The colors and fonts are way off, and the navigation bars at the top and bottom of the page as well as the tools menu are broken. Thus, my immediate suspicion was that the root cause of the issues lay somewhere in the CSS definitions provided by the Taratasy template.

Digging through the webserver and PHP logs as well as double-checking the webserver's rewrite rules showed no noticeable indication of a problem with the DokuWiki instance itself. Switching to another template, e.g. Arctic, also showed that the DokuWiki 2015-08-10a instance itself worked flawlessly. So this was not a general issue with my setup of the current DokuWiki release. The other way around, using the Taratasy template with an older DokuWiki 2014-09-29d (aka “Hrun”) release also worked as expected and without any issues. So the problem was narrowed down to some change introduced between those two DokuWiki releases.

With the above suspicion about the CSS definitions in mind, I tried making some changes in the Taratasy template's style.ini file. Those changes immediately affected the layout of the re-rendered page. On the other hand, making some changes in the Taratasy template's style.local.ini file had no effect on the subsequently re-rendered page.

Browsing the commit history of the DokuWiki GitHub repository for the 2015-08-10a release and searching for commits with a reference to style.local.ini, I stumbled upon the commit 656e584. The associated changes finally removed a feature which had already been marked as deprecated in previous DokuWiki releases. This feature was to include user-defined CSS formatting definitions through the file style.local.ini in a template's main directory and merge these definitions with the ones from the template's style.ini file into the final CSS definitions. For several DokuWiki releases now, the user-defined CSS formatting definitions are instead supposed to be placed in the file DOKU_CONF.“tpl/$tpl/style.ini”.

With the source of the issue now known, my quick'n'dirty solution was to merge the CSS definitions from the template's style.local.ini into the template's style.ini and afterwards drop the now redundant style.local.ini. This was just easier to accomplish given my particular DokuWiki directory setup. Another solution would have been to create a directory DOKU_CONF/tpl/taratasy/ and move the style.local.ini into it as DOKU_CONF/tpl/taratasy/style.ini.
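For reference, the latter solution boils down to two shell commands. Note that the paths below are an assumption based on a default DokuWiki installation under /var/www/dokuwiki, where templates live in lib/tpl/ and DOKU_CONF resolves to the conf/ directory; adjust them to your actual setup:

www:~# mkdir -p /var/www/dokuwiki/conf/tpl/taratasy
www:~# mv /var/www/dokuwiki/lib/tpl/taratasy/style.local.ini /var/www/dokuwiki/conf/tpl/taratasy/style.ini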

The modified template sources can be found in my Taratasy repository on GitHub.

In conclusion, the following image shows an example screenshot of the same DokuWiki 2015-08-10a instance as above after the patch to the Taratasy template has been applied:

Screenshot of the patched Taratasy template in a DokuWiki 2015-08-10a (aka "Detritus") release

// Check_MK Monitoring - Dell PowerConnect Switches

Dell PowerConnect and Dell PowerConnect M-Series switches can – with regard to their most important aspects like CPU, fans, PSU and temperature – already be monitored with the standard Check_MK distribution. This article introduces an enhanced version and additional Check_MK service checks to monitor additional aspects of Dell PowerConnect switches. It is targeted mainly towards the Dell PowerConnect M-Series switches used in Dell PowerEdge M1000e blade chassis, but can probably be used on standalone Dell PowerConnect switches as well.

For the impatient and TL;DR crowd, here are the Check_MK packages of the enhanced version of the Dell PowerConnect monitoring checks:

Enhanced version of the Dell PowerConnect monitoring checks (Compatible with Check_MK versions 1.2.6 and earlier)
Enhanced version of the Dell PowerConnect monitoring checks (Compatible with Check_MK versions 1.2.8 and later)

The sources are to be found in my Check_MK repository on GitHub.


The Dell PowerConnect M-Series switches used in Dell PowerEdge M1000e blade chassis – and possibly some newer Dell PowerConnect standalone switches too – are based on Broadcom FASTPATH silicon. While this hardware base introduces a plethora of other issues, to be covered in detail in a separate article, it also introduces the possibility of breaking backwards compatibility with older Dell PowerConnect models from a monitoring point of view. Therefore, the new checks covering the Broadcom FASTPATH based hardware were moved to an entirely new namespace. The file names of the new checks now use the prefix dell_powerconnect_bcm_, in contrast to the already existing stock Check_MK checks with their prefix dell_powerconnect_. Another difference to the stock Check_MK checks is the use of the FASTPATH Enterprise MIBs, which are specific to devices based on Broadcom silicon. The only exceptions are the checks dell_powerconnect_bcm_global_status and dell_powerconnect_bcm_dnsstats, which both monitor items not covered by the FASTPATH Enterprise MIBs.

All checks have been verified to work with the firmware versions 5.1.8.x and 5.1.9.x. For the newly introduced check dell_powerconnect_bcm_global_status, firmware version 5.1.9.4 or later is needed in order to avoid spurious error messages in the switch event log. See the section Additional Checks below for a more detailed explanation.

The discontinued, modified and additional checks are described in greater detail in the following three sections:

Discontinued Checks

The two service checks dell_powerconnect_fans and dell_powerconnect_psu provided by the standard Check_MK distribution have become redundant for the Dell PowerConnect M-Series switches. The items to be monitored by both are not present in those devices, since the Dell PowerEdge M1000e blade chassis provides both central cooling and power supply facilities. Accordingly, the cooling and power supply facilities should be monitored via the Dell Chassis Management Controller.

Modified Checks

The two service checks dell_powerconnect_cpu and dell_powerconnect_temp have been renamed to dell_powerconnect_bcm_cpu and dell_powerconnect_bcm_temp respectively. Both have been modified to use the new dictionary-based parameters and factory settings for the CPU and temperature warning and critical levels. An SNMP example output for all OIDs used has been added to both service checks for documentation purposes. Manual pages, PNP4Nagios templates, WATO and Perf-O-Meter plugins have also been added for both service checks. With the added WATO plugins it is now possible to configure the CPU and temperature warning and critical levels through the WATO WebUI. The configuration options for the CPU levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Operating System Resources
         -> Dell PowerConnect CPU usage
            -> Create rule in folder ...
               [x] The levels for the overall CPU usage on Dell PowerConnect switches

The configuration options for the temperature levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Temperature, Humidity, Electrical Parameters, etc.
         -> Dell PowerConnect temperature
            -> Create rule in folder ...
               [x] Temperature levels for Dell PowerConnect switches

The following image shows a status output example for the dell_powerconnect_bcm_cpu service check from the WATO WebUI:

Status output example for the modified dell_powerconnect_bcm_cpu service check

Three average CPU utilization values for the time sample intervals 5, 60 and 300 seconds are checked. The Perf-O-Meter is split accordingly into three sections in order to be able to display all three average CPU utilization values at once.
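To illustrate how the new dictionary-based parameters and factory settings fit together, the following – strongly simplified and purely hypothetical – sketch shows a check function in the style of the Check_MK 1.2.x check API. The variable names and the parsing of the three sample intervals are assumptions made for illustration; this is not the actual check source:

dell_powerconnect_bcm_cpu (simplified sketch)
# Factory settings; these can be overridden through the WATO rule shown above.
factory_settings["dell_powerconnect_bcm_cpu_default_levels"] = {
    "levels": (80.0, 90.0),    # warning / critical levels in percent
}

def check_dell_powerconnect_bcm_cpu(item, params, info):
    warn, crit = params["levels"]
    # Assumption: one SNMP row holding the 5, 60 and 300 second averages.
    util_5s, util_60s, util_300s = map(float, info[0])
    perfdata = [("util5s", util_5s, warn, crit),
                ("util60s", util_60s, warn, crit),
                ("util300s", util_300s, warn, crit)]
    worst = max(util_5s, util_60s, util_300s)
    infotext = "CPU utilization: %.1f%% (5s), %.1f%% (60s), %.1f%% (300s)" % \
               (util_5s, util_60s, util_300s)
    if worst >= crit:
        return (2, infotext, perfdata)
    elif worst >= warn:
        return (1, infotext, perfdata)
    return (0, infotext, perfdata)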

The following image shows a status output example for the dell_powerconnect_bcm_temp service check from the WATO WebUI:

Status output example for the modified dell_powerconnect_bcm_temp service check

This example shows the status and the current values of the temperature sensors in a switch stack with two switch members.

The following two images show examples of PNP4Nagios graphs for both service checks:

Example PNP4Nagios graph for the modified dell_powerconnect_bcm_cpu service check
Example PNP4Nagios graph for the modified dell_powerconnect_bcm_temp service check

Additional Checks

Overview

The following table shows a condensed overview of the additional Check_MK service checks and their available components.

Service check name | Description | Alarm | Manpage | PNP4Nagios template | Perf-O-Meter plugin | WATO plugin
dell_powerconnect_bcm_arp_cache | Checks the current number of entries in the ARP cache against default or configured warning and critical threshold values. | yes | yes | yes | yes | yes
dell_powerconnect_bcm_cos_queue | Determines the number of packets dropped at each CoS queue for the CPU. | – | yes | yes | – | –
dell_powerconnect_bcm_cpu_proc | Monitors the CPU utilization on a per process level. | yes | yes | yes | – | yes
dell_powerconnect_bcm_dnsstats | Determines the number of DNS queries (total and several error states defined by RFC 1035) of the system's resolver. | – | yes | yes | – | –
dell_powerconnect_bcm_global_status | Determines the global status of the “product”, via a Dell-specific SNMP OID. | yes | yes | – | – | –
dell_powerconnect_bcm_ip_conflict | Determines if an IP address conflict has been detected on the switch. | yes | yes | – | – | –
dell_powerconnect_bcm_logstats | Determines the number of log messages (total, dropped, relayed to syslog hosts) generated on the system. | – | yes | yes | – | –
dell_powerconnect_bcm_mbuf | Determines the number of memory/message buffer allocations – or failures thereof – for packets arriving at the system's CPU. | – | yes | yes | – | –
dell_powerconnect_bcm_memory | Monitors the current memory usage. | yes | yes | yes | yes | yes
dell_powerconnect_bcm_sntp | Checks the current status of the SNTP client on the switch. | yes | yes | yes | – | yes
dell_powerconnect_bcm_ssh_sessions | Checks the number of currently active SSH sessions against the default limit of five allowed SSH sessions. | yes | yes | yes | yes | yes

The first two columns should be pretty self-explanatory.

The Alarm column shows which checks will generate alarms based on the particular parameters monitored. Checks without an entry in the Alarm column are designed purely for long-term trends via their respective PNP4Nagios templates. All checks with an entry in the Alarm column use the new dictionary-based parameters and factory settings for their respective warning and critical levels. Where reasonable, those warning and critical levels are configurable through the WATO WebUI via an appropriate WATO plugin. See the last column, titled WATO plugin, for the checks this applies to.

Manual pages are provided for each service check for documentation purposes. An SNMP example output is provided as a comment within the check script for all the OIDs used in the service check.

For all checks with an entry in the PNP4Nagios template column, a PNP4Nagios template is provided in order to properly display the performance data delivered by the service check. Perf-O-Meter plugins are provided where reasonable, in order to display selected performance metrics in the service check overview of a host.

The specifics of each additional Check_MK service check are described in greater detail in the following sections.

ARP Cache

The Check_MK service check dell_powerconnect_bcm_arp_cache monitors the current total number of entries in the ARP cache on Dell PowerConnect switches. This number is compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 3072; critical: 3584). The configuration options for the ARP cache levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Operating System Resources
         -> Dell PowerConnect ARP cache
            -> Create rule in folder ...
               [x] The levels for the number of ARP cache entries on Dell PowerConnect switches

The following image shows a status output example for the dell_powerconnect_bcm_arp_cache service check from the WATO WebUI:

Status output example for the new dell_powerconnect_bcm_arp_cache service check

This example shows the current number of entries in the ARP cache along with the warning and critical threshold values.

In addition to the already mentioned total number of entries in the ARP cache, several other metrics are also collected as performance data. These are the overall ARP cache size, the number of static ARP entries and the peak values for both the current and the static number of ARP entries. The following image shows an example of the PNP4Nagios graph for the service check:

Example PNP4Nagios graph for the new dell_powerconnect_bcm_arp_cache service check

CoS Queue

The Check_MK service check dell_powerconnect_bcm_cos_queue monitors the “number of packets dropped at each CoS queue for the CPU” (quoted from the FASTPATH Enterprise MIB). Unfortunately the only other description available in the FASTPATH Enterprise MIBs is almost as cryptic as the first one: “Number of packets dropped at this CPU CoS queue because the queue was full.” The metric probably relates to the switch's Class of Service (CoS) feature in a Quality of Service (QoS) setup. Currently, the dell_powerconnect_bcm_cos_queue service check is used purely for long-term trends via its respective PNP4Nagios template and thus only gathers its metrics as performance data.

The following image shows a status output example for the dell_powerconnect_bcm_cos_queue service check from the WATO WebUI:

Status output example for the new dell_powerconnect_bcm_cos_queue service check

The following image shows an example of the PNP4Nagios graph for the service check:

Example PNP4Nagios graph for the new dell_powerconnect_bcm_cos_queue service check

Process CPU Usage

The Check_MK service check dell_powerconnect_bcm_cpu_proc monitors the same CPU utilization metrics as the previously described dell_powerconnect_bcm_cpu service check, but on a more detailed, per process level. The three average CPU utilization values for the time sample intervals 5, 60 and 300 seconds are, for each process, compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. There is currently a limitation in the check's logic in that the warning and critical threshold values apply globally to all processes; individual threshold values for each process are currently not supported. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 80%; critical: 90%) for the average CPU utilization. The configuration options for the per process CPU utilization levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Operating System Resources
         -> Dell PowerConnect CPU usage (per process)
            -> Create rule in folder ...
               [x] The levels for the per process CPU usage on Dell PowerConnect switches

The following image shows a status output example for the dell_powerconnect_bcm_cpu_proc service check from the WATO WebUI:

Status output example for the new dell_powerconnect_bcm_cpu_proc service check

The following image shows just four examples of the PNP4Nagios graphs for the service check:

Example PNP4Nagios graph for the new dell_powerconnect_bcm_cpu_proc service check

The selected example graphs show the average CPU utilization over the 5, 60 and 300 second time sample intervals for the processes SNMPTask, bcmRX, dot1s_timer_task and osapiTimer. Mind though that this is only a small selection of the various processes that can be found running on the Broadcom FASTPATH based Dell PowerConnect switches. Some processes are always present, others appear only after a specific feature – covered by the appropriate processes – is enabled on the switch. Unfortunately I've not been able to find a complete list of the possible processes, nor a good and comprehensive description of the purpose of each process. Sometimes – like in the case of the process SNMPTask – the purpose can be guessed from the process name. So overall I'd say the per process CPU utilization metric is probably best used for long-term trends in conjunction with support from Broadcom or Dell, when dealing with a specific issue on the switch or an unusually high CPU utilization of a specific process.

DNS Statistics

The Check_MK service check dell_powerconnect_bcm_dnsstats monitors various aspects and metrics of the switch's local DNS resolver. The metrics gathered can be grouped into three categories:

  • DNS Resolver: The number of DNS resolver queries and the number of DNS responses to those queries. For the DNS responses, the number of responses in each response category is gathered. The response categories are: Non-auth Answers, Non-auth No-answer, Received Responses, Unparsable Responses, Martians Responses and Fallbacks.

  • DNS Resolver RCODE: The number of DNS resolver responses by response code. See RFC 1035 for the details on DNS response codes.

  • DNS Cache: The number of DNS resource records that have been successfully added or have failed to be added to the DNS resolver cache.

See RFC 1612, RFC 1035 and the service check's man page for a detailed description of the metrics covered by the dell_powerconnect_bcm_dnsstats service check. Currently, the dell_powerconnect_bcm_dnsstats service check is used purely for long-term trends via its respective PNP4Nagios template and thus only gathers its metrics as performance data.

The following image shows a status output example for the dell_powerconnect_bcm_dnsstats service check from the WATO WebUI:

Status output example for the new dell_powerconnect_bcm_dnsstats service check

The following image shows an example of the three PNP4Nagios graphs for the service check:

Example PNP4Nagios graph for the new dell_powerconnect_bcm_dnsstats service check

Global Status

The Check_MK service check dell_powerconnect_bcm_global_status monitors just one metric: productStatusGlobalStatus from the Dell Vendor MIB for PowerConnect devices. As the name of the metric suggests, it represents an aggregated global status for a Dell PowerConnect device. The global status can assume one of the three values shown in the following table:

Numeric Value | Textual Value | Description
3 | OK | “If fans and power supplies are functioning and the system did not reboot because of a HW watchdog failure or a SW fatal error condition.”
4 | Non-critical | “If at least one power supply is not functional or the system rebooted at least once because of a HW watchdog failure or a SW fatal error condition.”
5 | Critical | “If at least one fan is not functional, possibly causing a dangerous warming up of the device.”

While the information about the fan and PSU status is redundant for the Dell PowerConnect M-Series switches, the information about hard- and software error conditions might be quite valuable.
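The mapping of these three numeric values to Nagios states is straightforward. The following minimal sketch shows one plausible implementation in the style of the Check_MK 1.2.x check API; it is an illustration, not the actual check source:

dell_powerconnect_bcm_global_status (simplified sketch)
def check_dell_powerconnect_bcm_global_status(item, params, info):
    # Map productStatusGlobalStatus to Nagios states.
    status_map = {
        "3": (0, "OK"),
        "4": (1, "Non-critical"),
        "5": (2, "Critical"),
    }
    value = info[0][0]
    state, name = status_map.get(value, (3, "unknown (%s)" % value))
    return (state, "Global device status is: %s" % name)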

When we first implemented the enhanced version of the Dell PowerConnect monitoring checks, we noticed spurious error messages suddenly appearing in the switch event log and subsequently in our syslog servers. The messages showing up looked like the following example:

<189> OCT 14 12:48:50 <Management IP address>-1 MGMT_ACAL[251047504]: macal_api.c(873) 38462 %% macalRuleActionGet(): List  does not exist.

Disabling one check after another, we narrowed the source of this error message down to the dell_powerconnect_bcm_global_status service check. Logging a support case with Dell eventually led to the following explanation from Dell PowerConnect engineering:

Hi Frank,

I got an update from our engineering team and they can see the problem
when snmpwalk is executed against switch but issue is not seen if snmpget
is executed on all OIDs.

They are working on a fix.

Once fix is available it will be included in the next FW patch release
for this switch. […]

At the time we were running the newest available firmware, which back then was version 5.1.9.3. After updating to the firmware version 5.1.9.4 which was released later on, the above error messages stopped showing up.

IP Address Conflict Detection

The Check_MK service check dell_powerconnect_bcm_ip_conflict monitors the status of the built-in IP address conflict detection feature of a Dell PowerConnect switch. If an IP address conflict is detected, an alarm with the status warning is raised. In addition to the alarm status, the service check will also report the conflicting IP address, the MAC address of the device causing the conflict (if available) and the date and time the conflict was detected. The last bit of information is relative to the switch's date and time settings. Needless to say, a properly configured date and time or a time synchronisation via NTP on the switch is quite helpful in such a case.

Once an IP address conflict is detected by a Dell PowerConnect switch, this status will not resolve itself automatically or time out in any way. The issue has to be acknowledged manually on the Dell PowerConnect switch. This can be achieved e.g. on the switch's CLI with the following commands:

switch> enable
switch# clear ip address-conflict-detect

Log Statistics

The Check_MK service check dell_powerconnect_bcm_logstats monitors several metrics of the logging facility on Dell PowerConnect switches. These are the:

  • total number of log messages received by the log process, including dropped and ignored messages.

  • number of dropped log messages, which could not be processed by the log process due to an error or lack of resources.

  • number of relayed log messages. These are log messages which have been forwarded to a remote syslog host by the log process. If multiple remote syslog hosts are configured, each message is counted multiple times, once for each of the configured syslog hosts.

Currently, the dell_powerconnect_bcm_logstats service check is used purely for long-term trends via its respective PNP4Nagios template and thus only gathers its metrics as performance data.

The following image shows a status output example for the dell_powerconnect_bcm_logstats service check from the WATO WebUI:

Status output example for the new dell_powerconnect_bcm_logstats service check

The following image shows an example of the PNP4Nagios graph for the service check:

Example PNP4Nagios graph for the new dell_powerconnect_bcm_logstats service check

Unfortunately the information as to why log messages might have been erroneous, or which resources (CPU cycles, free memory, etc.) were missing at the time of processing a log message, is scarce. The metrics about the logging facility are therefore – again – probably best used as metrics for long-term trends in conjunction with support from Broadcom or Dell, when dealing with a specific issue on the switch.

Memory Buffers

The Check_MK service check dell_powerconnect_bcm_mbuf monitors two groups of metrics regarding the memory or message buffers on Dell PowerConnect switches. The first group is the overall number of currently available memory or message buffers on the switch; it consists of just one metric. The second group is the number of total and the number of failed memory or message buffer allocation attempts for packets arriving at the switch's CPU. Those two metrics are gathered for each of the memory or message buffer classes. The names of the currently available memory or message buffer classes are “Transmit”, “Rx High”, “Rx Mid0”, “Rx Mid1”, “Rx Mid2” and “Rx Normal”.

The dell_powerconnect_bcm_mbuf service check is currently used only for long-term trends via its respective PNP4Nagios template and thus only gathers its metrics as performance data.

The following image shows a status output example for the dell_powerconnect_bcm_mbuf service check from the WATO WebUI:

Status output example for the new dell_powerconnect_bcm_mbuf service check

The following image shows an example of the seven PNP4Nagios graphs for the service check – one graph for the overall available memory or message buffers and one for the allocation attempts on each of the six memory or message buffer classes:

Example PNP4Nagios graph for the new dell_powerconnect_bcm_mbuf service check

Similarly to the process names described in the previous section Process CPU Usage, I've also not been able to find a good and comprehensive description of the memory or message buffer classes defined on the Broadcom FASTPATH based Dell PowerConnect switches. Some meaning can again be derived from the name of a particular memory or message buffer class, but it is much more limited than in the case of the process names. Beyond that, questions like the following immediately come to mind:

  • which type of packets are forwarded to the CPU instead of being directly processed by the switching silicon of the device?

  • why are there several receive classes (“Rx …”) but only one transmit class?

  • what is the difference between the multiple receive classes and by what algorithm are packets assigned to a specific receive class?

  • what is likely the root cause of a failed memory or message buffer allocation attempt?

  • what are the effects of a failed memory or message buffer allocation attempt? Are packets dropped due to this, or is the allocation attempt retried?

  • what design, implementation and configuration options should be taken into consideration in order to avoid failed memory or message buffer allocation attempts?

Unfortunately they remain unanswered due to the lack of comprehensive documentation. The metrics regarding the memory or message buffers are therefore – again – probably best used for long-term trends in conjunction with support from Broadcom or Dell, when dealing with a specific issue on the switch.

Memory Usage

The Check_MK service check dell_powerconnect_bcm_memory monitors the current memory (RAM) usage on Dell PowerConnect switches. The amount of currently free memory is compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 51200 KBytes; critical: 25600 KBytes of free memory). The configuration options for the free memory levels can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Operating System Resources
         -> Dell PowerConnect memory usage
            -> Create rule in folder ...
               [x] The levels for the amount of free memory on Dell PowerConnect switches

The following image shows a status output example for the dell_powerconnect_bcm_memory service check from the WATO WebUI:

Status output example for the new dell_powerconnect_bcm_memory service check

This example shows the current amount of free memory and the total memory size both measured in kilobytes.

The following image shows an example of the PNP4Nagios graph for the service check:

Example PNP4Nagios graph for the new dell_powerconnect_bcm_memory service check

SNTP Statistics

For this Check_MK service check to be useful and work properly, there needs to be at least one SNTP server configured on the monitored Dell PowerConnect devices. Setting up an SNTP configuration is generally a good idea for a number of reasons. Please refer to the Dell PowerConnect User's Configuration Guide or the Dell PowerConnect M8024-K Command Line Interface Guide on how to do this.

The Check_MK service check dell_powerconnect_bcm_sntp monitors the current status of the SNTP client on Dell PowerConnect switches. In order to achieve this, the check iterates over the list of SNTP servers configured as time references for the SNTP client on the switch. For each configured SNTP server, the status of the last connection attempt from the SNTP client on the switch to that particular SNTP server is evaluated. The overall number of SNTP servers with a connection status equal to success is counted, and this number is compared to either the default or configured warning and critical threshold values; an alarm is raised accordingly. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 1; critical: 0 servers successfully connected). The configuration options for the levels of successful SNTP server connections can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Applications, Processes & Services
         -> Dell PowerConnect SNTP status
            -> Create rule in folder ...
               [x] Successful SNTP server connections on Dell PowerConnect switches

The following image shows a status output example for the dell_powerconnect_bcm_sntp service check from the WATO WebUI:

Status output example for the new dell_powerconnect_bcm_sntp service check

This example shows the current status of the SNTP client, which has successfully connected to the one configured SNTP server.

In addition to the aggregated current connection status of the SNTP client to all configured SNTP servers, two other metrics are – for each SNTP server – collected as performance data. These are the overall number of SNTP requests – including retries – and the number of failed SNTP requests the client made to a particular SNTP server. The following image shows an example of the PNP4Nagios graph for the service check:

Example PNP4Nagios graph for the new dell_powerconnect_bcm_sntp service check

SSH Sessions

The Check_MK service check dell_powerconnect_bcm_ssh_sessions monitors just one metric, the number of currently active SSH sessions on the Dell PowerConnect device. This number is compared to either the default or configured warning and critical threshold values, and an alarm is raised accordingly. With the added WATO plugin it is possible to configure the warning and critical levels through the WATO WebUI and thus override the default values (warning: 5; critical: 5 active SSH sessions). The configuration options for the number of active SSH sessions can be found under:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Applications, Processes & Services
         -> Dell PowerConnect SSH sessions
            -> Create rule in folder ...
               [x] Active SSH sessions on Dell PowerConnect switches

The following image shows a status output example for the dell_powerconnect_bcm_ssh_sessions service check from the WATO WebUI:

Status output example for the new dell_powerconnect_bcm_ssh_sessions service check

This example shows the current number of active SSH sessions along with the warning and critical threshold values.

The following image shows an example of the PNP4Nagios graph for the service check:

Example PNP4Nagios graph for the new dell_powerconnect_bcm_ssh_sessions service check

Conclusion

Adding the enhanced version of the Dell PowerConnect monitoring checks to your Check_MK server enables you to monitor various additional aspects of your Dell PowerConnect devices. New Dell PowerConnect devices should pick up the additional service checks immediately. Existing Dell PowerConnect devices might need a Check_MK inventory to be run explicitly on them in order to pick up the additional service checks.

Along with the built-in Check_MK monitoring of interfaces of network equipment, the monitoring of services (SSH, HTTP and HTTPS) and the status of certificates, as well as the previously described monitoring of RMON Interface Statistics, this enhanced version of the Dell PowerConnect monitoring checks enables you to create a complete monitoring solution for your Dell PowerConnect M-Series switches.

I hope you find the provided new and enhanced checks useful and enjoyed reading this blog post. Please don't hesitate to drop me a note if you have any suggestions or run into any issues with the provided checks.

// Check_MK Monitoring - RMON Interface Statistics

Check_MK provides the check rmon_stats to collect and monitor Remote Network MONitoring (RMON) statistics for interfaces of network equipment. While this stock check probably works fine with Cisco network devices, it does not – out of the box – work with network equipment from other vendors, e.g. Dell PowerConnect switches. This article shows the modifications necessary to make the rmon_stats check work with non-Cisco network equipment. Along the way, other shortcomings of the stock rmon_stats check are discussed and the necessary modifications to address those issues are shown.

For the impatient and TL;DR crowd, here is the enhanced version of the rmon_stats check along with a slightly beautified version of the accompanying PNP4Nagios template:

Enhanced version of the rmon_stats check
Slightly beautified version of the rmon_stats PNP4Nagios template

The limitation of the rmon_stats check to Cisco network devices is due to the fact that the Check_MK inventory is limited to vendor-specific OIDs in the check's snmp_scan_function. The following code snippet shows the respective lines:

rmon_stats
check_info["rmon_stats"] = {
    'check_function'        : check_rmon_stats,
    'inventory_function'    : inventory_rmon_stats,
    'service_description'   : 'RMON Stats IF %s',
    'has_perfdata'          : True,
    'snmp_info'             : ('.1.3.6.1.2.1.16.1.1.1', [ #
                                        '1',    # etherStatsIndex = Item
                                        '6',    # etherStatsBroadcastPkts
                                        '7',    # etherStatsMulticastPkts
                                        '14',   # etherStatsPkts64Octets
                                        '15',   # etherStatsPkts65to127Octets
                                        '16',   # etherStatsPkts128to255Octets
                                        '17',   # etherStatsPkts256to511Octets
                                        '18',   # etherStatsPkts512to1023Octets
                                        '19',   # etherStatsPkts1024to1518Octets
                                        ]),
    # for the scan we need to check for any single object in the RMON tree,
    # we choose netDefaultGateway in the hope that it will always be present
    'snmp_scan_function'    : lambda oid: ( oid(".1.3.6.1.2.1.1.1.0").lower().startswith("cisco") \
                    or oid(".1.3.6.1.2.1.1.2.0") == ".1.3.6.1.4.1.11863.1.1.3" \
                    ) and oid(".1.3.6.1.2.1.16.19.12.0") != None,
}

By replacing the snmp_scan_function lines at the bottom of the code above with the following lines:

rmon_stats
    # if at least one interface within the RMON tree is present (the previous
    # "netDefaultGateway" is not present in every device implementing the RMON
    # MIB
    'snmp_scan_function'    : lambda oid: oid(".1.3.6.1.2.1.16.1.1.1.1.1") > 0,

the check will now be able to successfully inventorize the RMON representation of interfaces even on non-Cisco network equipment.

To actually execute such an inventory, there first needs to be a Check_MK configuration rule in place, which enables the appropriate code in the inventory_rmon_stats function of the check. The following code snippet shows the respective lines:

rmon_stats
def inventory_rmon_stats(info):
    settings = host_extra_conf_merged(g_hostname, inventory_if_rules)
    if settings.get("rmon"):
        [...]

I guess this is supposed to be sort of a safeguard, since the number of interfaces can grow quite large in RMON and querying them can put a lot of strain on the service processor of the network device.

The configuration option to enable RMON based checks is neatly tucked away in the WATO WebGUI at:

-> Host & Service Parameters
   -> Parameters for discovered services
      -> Inventory - automatic service detection
         -> Network Interface and Switch Port Discovery
            -> Create rule in folder ...
               -> Value
                  [x] Collect RMON statistics data
                      [x] Create extra service with RMON statistics data (if available for the device)
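
Alternatively, for configurations maintained in plain text instead of via WATO, an equivalent rule can presumably be defined directly in main.mk. The following sketch is an assumption derived from the settings.get("rmon") lookup shown above; the "switches" host tag is purely an example:

# Hypothetical main.mk rule enabling RMON statistics collection for all
# hosts tagged "switches". The dict value is merged via
# host_extra_conf_merged() and queried with settings.get("rmon") in the
# inventory function shown above.
inventory_if_rules = [
    ({"rmon": True}, ["switches"], ALL_HOSTS),
]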

After creating a global or folder-specific configuration rule, the next run of the Check_MK inventory should discover a – possibly large – number of new interfaces and create the individual service checks, one for each new interface. If it does not, the network device has probably disabled the RMON statistics feature by default or does not implement such a feature at all. A quick way to verify this is to query the device for the first column of the RMON etherStatsTable, e.g. with snmpwalk -v2c -c <community> <device> .1.3.6.1.2.1.16.1.1.1.1, which should return at least one etherStatsIndex entry. Since the procedure to enable the collection and presentation of RMON statistics via SNMP differs between vendors, one has to check the corresponding vendor documentation.

Taking a look at the list of newly discovered interfaces and the resulting service checks reveals two other issues of the rmon_stats check.

The first is that the check is rather simplistic in that it indiscriminately collects RMON interface statistics on all interfaces of a network device, regardless of the actual link state of each interface. While the RMON-MIB lacks direct information about the link state of an interface, it fortunately carries a reference to the IF-MIB for each interface (etherStatsDataSource). The IF-MIB in turn provides the information about the link state (ifOperStatus), which can be used to determine whether RMON statistics should be collected for a particular interface.
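
To illustrate this cross-MIB mapping – with hypothetical sample rows, since the actual table contents depend on the device – the following minimal Python sketch extracts the IF-MIB interface index from the etherStatsDataSource OID and keeps only those interfaces whose ifOperStatus is up(1). This is essentially what the inventory part of the patch further below does:

import re

# (etherStatsIndex, etherStatsDataSource) rows from the RMON etherStatsTable;
# the data source OID points at the corresponding ifIndex entry in the IF-MIB
rmon_rows = [('89', '.1.3.6.1.2.1.2.2.1.1.89'),
             ('104', '.1.3.6.1.2.1.2.2.1.1.104')]
# (ifIndex, ifDescr, ifOperStatus) rows from the IF-MIB; 1 means up
if_rows = [('89', 'Link Aggregate 1', '1'),
           ('104', 'Link Aggregate 16', '2')]

for ether_stats_idx, data_source in rmon_rows:
    # strip the ifIndex OID prefix to obtain the numeric IF-MIB index
    if_idx = int(re.sub('^.1.3.6.1.2.1.2.2.1.1.', '', data_source))
    for if_index, if_descr, if_oper_status in if_rows:
        if int(if_index) == if_idx and int(if_oper_status) == 1:
            print("collecting RMON row %s for interface %s" % (ether_stats_idx, if_descr))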

The other issue is the check's use of the RMON interface index (etherStatsIndex) as the base for the name of the associated service, i.e. RMON Stats IF <etherStatsIndex> as in the following example output:

OK  RMON Stats IF 1     OK - bcast=0 mcast=0 0-63b=5 64-127b=124 128-255b=28 256-511b=8 512-1023b=2 1024-1518b=94 octets/sec
OK  RMON Stats IF 10    OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 100   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 101   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 102   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 103   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 104   OK - bcast=0 mcast=0 0-63b=0 64-127b=11 128-255b=3 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 105   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 106   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 107   OK - bcast=0 mcast=1 0-63b=1 64-127b=8 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 108   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 109   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 11    OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 110   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 111   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Stats IF 112   OK - bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
[...]

Unfortunately, in RMON the values of the interface index etherStatsIndex are not guaranteed to be consistent across reboots of the network device, much less across the addition of new or the removal of existing interfaces. The interface numbering definition according to the original MIB-II, on the other hand, is much stricter in this regard, although subsequent RFCs like the IF-MIB (see section “3.1.5. Interface Numbering”) have weakened the original definition to some degree.

Another drawback of the bare RMON interface index as an interface identifier in the service check name is that it is less descriptive than e.g. the interface description (ifDescr) from the IF-MIB. This makes the visual mapping of a service check to the respective interface of the network device rather tedious and error-prone.

An obvious solution to both issues is to use the already mentioned reference from the RMON MIB to the IF-MIB and – in the process of the Check_MK inventory – check the interface link state (ifOperStatus). The Check_MK inventory would then return only those interfaces with a value of up(1) in the ifOperStatus variable. For those interfaces it would also return the interface description ifDescr, to be used as the service name instead of the previously used RMON interface index.

As a side effect, this approach unfortunately breaks the necessary mapping from the service name – now based on the IF-MIB interface description (ifDescr) instead of the previously used RMON interface index (etherStatsIndex) – back to the RMON MIB, where the interface statistics values are stored. This is mainly due to the fact that the IF-MIB itself does not provide a native reference to the RMON MIB. Without such a back reference, the statistics values collected from the RMON MIB cannot – during normal service check runs – be unambiguously assigned to the appropriate interface. It is therefore necessary to implement such a back reference manually within the service check. In this case the solution was to store the value of the RMON interface index etherStatsIndex in the parameters section (params) of each interface service check's inventory entry. The following excerpt shows an example of such a new inventory data structure for several RMON interface service checks:

[
  [...]
  ('rmon_stats', 'Link Aggregate 1', {'rmon_if_idx': 89}),
  ('rmon_stats', 'Link Aggregate 16', {'rmon_if_idx': 104}),
  ('rmon_stats', 'Link Aggregate 19', {'rmon_if_idx': 107}),
  ('rmon_stats', 'Link Aggregate 2', {'rmon_if_idx': 90}),
  ('rmon_stats', 'Link Aggregate 31', {'rmon_if_idx': 119}),
  ('rmon_stats', 'Link Aggregate 32', {'rmon_if_idx': 120}),
  ('rmon_stats', 'Link Aggregate 7', {'rmon_if_idx': 95}),
  ('rmon_stats', 'Link Aggregate 8', {'rmon_if_idx': 96}),
  ('rmon_stats', 'Unit: 1 Slot: 0 Port: 1 10G - Level', {'rmon_if_idx': 1}),
  ('rmon_stats', 'Unit: 1 Slot: 0 Port: 10 10G - Level', {'rmon_if_idx': 10}),
  ('rmon_stats', 'Unit: 1 Slot: 0 Port: 16 10G - Level', {'rmon_if_idx': 16}),
  ('rmon_stats', 'Unit: 1 Slot: 0 Port: 19 10G - Level', {'rmon_if_idx': 19}),
  ('rmon_stats', 'Unit: 1 Slot: 0 Port: 2 10G - Level', {'rmon_if_idx': 2}),
  ('rmon_stats', 'Unit: 1 Slot: 0 Port: 20 10G - Level', {'rmon_if_idx': 20}),
  ('rmon_stats', 'Unit: 1 Slot: 0 Port: 7 10G - Level', {'rmon_if_idx': 7}),
  ('rmon_stats', 'Unit: 1 Slot: 0 Port: 8 10G - Level', {'rmon_if_idx': 8}),
  ('rmon_stats', 'Unit: 1 Slot: 0 Port: 9 10G - Level', {'rmon_if_idx': 9}),
  ('rmon_stats', 'Unit: 1 Slot: 1 Port: 3 10G - Level', {'rmon_if_idx': 23}),
  ('rmon_stats', 'Unit: 1 Slot: 1 Port: 4 10G - Level', {'rmon_if_idx': 24}),
  ('rmon_stats', 'Unit: 2 Slot: 0 Port: 1 10G - Level', {'rmon_if_idx': 25}),
  ('rmon_stats', 'Unit: 2 Slot: 0 Port: 10 10G - Level', {'rmon_if_idx': 34}),
  ('rmon_stats', 'Unit: 2 Slot: 0 Port: 16 10G - Level', {'rmon_if_idx': 40}),
  ('rmon_stats', 'Unit: 2 Slot: 0 Port: 19 10G - Level', {'rmon_if_idx': 43}),
  ('rmon_stats', 'Unit: 2 Slot: 0 Port: 2 10G - Level', {'rmon_if_idx': 26}),
  ('rmon_stats', 'Unit: 2 Slot: 0 Port: 20 10G - Level', {'rmon_if_idx': 44}),
  ('rmon_stats', 'Unit: 2 Slot: 0 Port: 7 10G - Level', {'rmon_if_idx': 31}),
  ('rmon_stats', 'Unit: 2 Slot: 0 Port: 8 10G - Level', {'rmon_if_idx': 32}),
  ('rmon_stats', 'Unit: 2 Slot: 0 Port: 9 10G - Level', {'rmon_if_idx': 33}),
  ('rmon_stats', 'Unit: 2 Slot: 1 Port: 3 10G - Level', {'rmon_if_idx': 47}),
  ('rmon_stats', 'Unit: 2 Slot: 1 Port: 4 10G - Level', {'rmon_if_idx': 48}),
  ('rmon_stats', 'Unit: 3 Slot: 0 Port: 1 10G - Level', {'rmon_if_idx': 49}),
  ('rmon_stats', 'Unit: 3 Slot: 0 Port: 2 10G - Level', {'rmon_if_idx': 50}),
  ('rmon_stats', 'Unit: 3 Slot: 0 Port: 9 10G - Level', {'rmon_if_idx': 57}),
  ('rmon_stats', 'Unit: 4 Slot: 0 Port: 1 10G - Level', {'rmon_if_idx': 69}),
  ('rmon_stats', 'Unit: 4 Slot: 0 Port: 2 10G - Level', {'rmon_if_idx': 70}),
  ('rmon_stats', 'Unit: 4 Slot: 0 Port: 9 10G - Level', {'rmon_if_idx': 77}),
  [...]
]
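
At check time, Check_MK passes the stored params dict back to the check function together with the service item, so the correct etherStatsTable row can still be selected although the service is now named after the ifDescr. A minimal sketch with hypothetical sample data:

# The params dict stored in the inventory entry travels with the service,
# allowing the check function to locate the matching etherStatsTable row
# even though the service item is now the IF-MIB interface description.
inventory_entry = ('rmon_stats', 'Link Aggregate 1', {'rmon_if_idx': 89})
check_type, item, params = inventory_entry

# etherStatsTable statistics keyed by etherStatsIndex
rmon_table = {89: {'bcast': 0, 'mcast': 0, '1024-1518b': 407}}
row = rmon_table[params['rmon_if_idx']]
print("service 'RMON Interface %s' maps to RMON row %s" % (item, row))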

The changes to the rmon_stats check that are necessary to implement the features described above are shown in the following patch:

rmon_stats.patch
--- rmon_stats.orig 2015-12-21 13:41:46.000000000 +0100
+++ rmon_stats  2016-02-02 11:18:56.789188249 +0100
@@ -34,20 +34,43 @@
     settings = host_extra_conf_merged(g_hostname, inventory_if_rules)
     if settings.get("rmon"):
         inventory = []
-        for line in info:
-            inventory.append((line[0], None))
+        for line in info[1]:
+            rmon_if_idx = int(re.sub('.1.3.6.1.2.1.2.2.1.1.','',line[1]))
+            for iface in info[0]:
+                if int(iface[0]) == rmon_if_idx and int(iface[2]) == 1:
+                    params = {}
+                    params["rmon_if_idx"] = int(line[0])
+                    inventory.append((iface[1], "%r" % params))
         return inventory
 
-def check_rmon_stats(item, _no_params, info):
-    bytes = { 1: 'bcast', 2: 'mcast', 3: '0-63b', 4: '64-127b', 5: '128-255b', 6: '256-511b', 7: '512-1023b', 8: '1024-1518b' }
-    perfdata = []
+def check_rmon_stats(item, params, info):
+    bytes = { 2: 'bcast', 3: 'mcast', 4: '0-63b', 5: '64-127b', 6: '128-255b', 7: '256-511b', 8: '512-1023b', 9: '1024-1518b' }
+    rmon_if_idx = str(params.get("rmon_if_idx"))
+    if_alias = ''
     infotext = ''
+    perfdata = []
     now = time.time()
-    for line in info:
-        if line[0] == item:
+
+    for line in info[0]:
+        ifIndex, ifDescr, ifOperStatus, ifAlias = line
+        if item == ifDescr:
+            if_alias = ifAlias
+            if_index = int(ifIndex)
+
+    for line in info[1]:
+        if line[0] == rmon_if_idx:
+            if_mib_idx = int(re.sub('.1.3.6.1.2.1.2.2.1.1.','',line[1]))
+            if if_mib_idx != if_index:
+                return (3, "RMON interface index mapping to IF interface index mapping changed. Re-run Check_MK inventory.")
+
+            if item != if_alias and if_alias != '':
+                infotext = "[%s, RMON: %s] " % (if_alias, rmon_if_idx)
+            else:
+                infotext = "[RMON: %s] " % rmon_if_idx
+
             for i, val in bytes.items():
                 octets = int(re.sub(' Packets','',line[i]))
-                rate = get_rate("%s-%s" % (item, val), now, octets)
+                rate = get_rate("%s-%s" % (rmon_if_idx, val), now, octets)
                 perfdata.append((val, rate, 0, 0, 0))
                 infotext += "%s=%.0f " % (val, rate)
             infotext += 'octets/sec'
@@ -58,10 +81,18 @@
 check_info["rmon_stats"] = {
     'check_function'        : check_rmon_stats,
     'inventory_function'    : inventory_rmon_stats,
-    'service_description'   : 'RMON Stats IF %s',
+    'service_description'   : 'RMON Interface %s',
     'has_perfdata'          : True,
-    'snmp_info'             : ('.1.3.6.1.2.1.16.1.1.1', [ #
+    'snmp_info'             : [
+        ( '.1.3.6.1.2.1', [ #
+                                        '2.2.1.1',      # ifIndex
+                                        '2.2.1.2',      # ifDescr
+                                        '2.2.1.8',      # ifOperStatus
+                                        '31.1.1.1.18',  # ifAlias
+                          ]),
+        ('.1.3.6.1.2.1.16.1.1.1', [ #
                                         '1',    # etherStatsIndex = Item
+                                        '2',    # etherStatsDataSource = Interface in the RFC1213-MIB
                                         '6',    # etherStatsBroadcastPkts
                                         '7',    # etherStatsMulticastPkts
                                         '14',   # etherStatsPkts64Octets
@@ -70,10 +101,11 @@
                                         '17',   # etherStatsPkts256to511Octets
                                         '18',   # etherStatsPkts512to1023Octets
                                         '19',   # etherStatsPkts1024to1518Octets
-                                        ]),
-    # for the scan we need to check for any single object in the RMON tree,
-    # we choose netDefaultGateway in the hope that it will always be present
-    'snmp_scan_function'    : lambda oid: ( oid(".1.3.6.1.2.1.1.1.0").lower().startswith("cisco") \
-                    or oid(".1.3.6.1.2.1.1.2.0") == ".1.3.6.1.4.1.11863.1.1.3" \
-                    ) and oid(".1.3.6.1.2.1.16.19.12.0") != None,
+                                  ]),
+        ],
+    # if at least one interface within the RMON tree is present (the previously
+    # used "netDefaultGateway" is not present in every device implementing the
+    # RMON MIB)
+    'snmp_scan_function'    : lambda oid: oid(".1.3.6.1.2.1.16.1.1.1.1.1") > 0,
+
 }

The following output shows examples of the new service check names. The ifAlias and the etherStatsIndex have been added as auxiliary information to the service check output. The ifAlias entries in the examples have – in order to protect the innocent – been redacted with xxxxxxxx. Note that, since the service names change with this patch, the affected hosts need to be re-inventorized (e.g. with cmk -II <host>) for the new service checks to appear.

OK  RMON Interface Link Aggregate 1                        OK - [xxxxxxxx, RMON: 89] bcast=0 mcast=0 0-63b=85 64-127b=390 128-255b=136 256-511b=25 512-1023b=20 1024-1518b=407 octets/sec
OK  RMON Interface Link Aggregate 16                       OK - [xxxxxxxx, RMON: 104] bcast=0 mcast=0 0-63b=1 64-127b=11 128-255b=3 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
UNKN    RMON Interface Link Aggregate 19                       UNKNOWN - RMON interface index mapping to IF interface index mapping changed. Re-run Check_MK inventory.
OK  RMON Interface Link Aggregate 2                        OK - [xxxxxxxx, RMON: 90] bcast=0 mcast=0 0-63b=111 64-127b=480 128-255b=207 256-511b=44 512-1023b=41 1024-1518b=484 octets/sec
[...]
OK  RMON Interface Unit: 1 Slot: 0 Port: 1 10G - Level     OK - [xxxxxxxx, RMON: 1] bcast=0 mcast=0 0-63b=49 64-127b=232 128-255b=57 256-511b=14 512-1023b=4 1024-1518b=185 octets/sec
OK  RMON Interface Unit: 1 Slot: 0 Port: 10 10G - Level    OK - [xxxxxxxx, RMON: 10] bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 1 Slot: 0 Port: 16 10G - Level    OK - [xxxxxxxx, RMON: 16] bcast=0 mcast=0 0-63b=0 64-127b=2 128-255b=1 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 1 Slot: 0 Port: 19 10G - Level    OK - [xxxxxxxx, RMON: 19] bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 1 Slot: 0 Port: 2 10G - Level     OK - [xxxxxxxx, RMON: 2] bcast=0 mcast=0 0-63b=62 64-127b=293 128-255b=89 256-511b=28 512-1023b=9 1024-1518b=252 octets/sec
OK  RMON Interface Unit: 1 Slot: 0 Port: 20 10G - Level    OK - [xxxxxxxx, RMON: 20] bcast=0 mcast=0 0-63b=75 64-127b=1640 128-255b=217 256-511b=41 512-1023b=79 1024-1518b=2096 octets/sec
OK  RMON Interface Unit: 1 Slot: 0 Port: 7 10G - Level     OK - [xxxxxxxx, RMON: 7] bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 1 Slot: 0 Port: 8 10G - Level     OK - [xxxxxxxx, RMON: 8] bcast=0 mcast=0 0-63b=0 64-127b=5 128-255b=1 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 1 Slot: 0 Port: 9 10G - Level     OK - [xxxxxxxx, RMON: 9] bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 1 Slot: 1 Port: 3 10G - Level     OK - [xxxxxxxx, RMON: 23] bcast=0 mcast=0 0-63b=0 64-127b=3 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 1 Slot: 1 Port: 4 10G - Level     OK - [xxxxxxxx, RMON: 24] bcast=0 mcast=0 0-63b=0 64-127b=3 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 2 Slot: 0 Port: 1 10G - Level     OK - [xxxxxxxx, RMON: 25] bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 2 Slot: 0 Port: 10 10G - Level    OK - [xxxxxxxx, RMON: 34] bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 2 Slot: 0 Port: 16 10G - Level    OK - [xxxxxxxx, RMON: 40] bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
OK  RMON Interface Unit: 2 Slot: 0 Port: 19 10G - Level    OK - [xxxxxxxx, RMON: 43] bcast=0 mcast=1 0-63b=1 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec
[...]
OK  RMON Interface Unit: 4 Slot: 0 Port: 9 10G - Level     OK - [xxxxxxxx, RMON: 77] bcast=0 mcast=0 0-63b=0 64-127b=0 128-255b=0 256-511b=0 512-1023b=0 1024-1518b=0 octets/sec

The enhanced version of the rmon_stats check can be downloaded here along with a slightly beautified version of the accompanying PNP4Nagios template:

Enhanced version of the rmon_stats check
Patch to enhance the original version of the rmon_stats check
Slightly beautified version of the rmon_stats PNP4Nagios template
Patch to beautify the rmon_stats PNP4Nagios template

Finally, the following two screenshots show examples of PNP4Nagios graphs using the modified version of the PNP4Nagios template:

Example RMON interface statistics graph using the modified version of the PNP4Nagios template
Example RMON interface statistics graph using the modified version of the PNP4Nagios template

The sources to both, the enhanced version of the rmon_stats check and the beautified version of the rmon_stats PNP4Nagios template, can be found on GitHub in my Check_MK Plugins repository.

// World + dog on GitHub - me too now

World + dog seems to be on GitHub, me too now. Fork me on GitHub.

Fair warning though – it currently looks a bit deserted there, because i'm still getting to grips with Git. Over time i think i'll be moving my source code from subversion (SVN) to Git. It'll probably take some time, but make sure to stay tuned for the repositories emerging there.
