If you're maintaining moderately complex SAP landscapes with network connections traversing firewalls or other devices with access lists, you're bound to experience network connection issues sooner or later. Usually this is due to the dynamic nature of modern firewall ACLs, which are built up and torn down on demand based on the actual packet flow. SAP RFC and other network connections, on the other hand, are set up once and can remain inactive for an indefinite amount of time until the next data exchange becomes necessary. The order of events in which these two behaviours collide is roughly this:
1. The SAP processes initiate a network connection to a remote system. With TCP based connections this causes the OS to send a SYN packet to the remote system.
2. A firewall along the way recognises the SYN packet as an attempt to build up a connection. It compares the parameters of the connection request (usually source and destination IP addresses and port numbers) to a set of defined rules, finds an entry allowing the connection to be made and inserts a temporary rule into its state table. Along with the original connection request, a rule is also added to allow the corresponding traffic in the reverse direction. Both state table entries are associated with predefined timeout values.
3. At some point the SAP processes finish their data exchange, but the connection is not torn down. It's usually kept up for future communication and to avoid the overhead introduced by the connection buildup.
4. As soon as the SAP processes stop exchanging data, the timeout counters of the firewall's state table entries start ticking down. Once the period of inactivity has reached the predefined timeout values, the state table entries are removed. The firewall will now block future communication attempts unless they are connection initiations starting with a SYN packet.
5. At some point the SAP processes want to exchange data over the connection again. From their perspective the connection is still established, so there seems to be no need to initiate it again with a SYN packet. The non-SYN packets arrive at the firewall, which either drops them silently or sends a RST packet back. Either way, this causes a connection breakdown within the SAP system.
The actual issue here is that the firewall or network device has no knowledge of how the SAP system intends to use the network connection. This is a consequence of the independent layers of the OSI model and not actually a SAP specific problem. The issue described above is usually addressed by sending empty keepalive packets at regular intervals once the actual data transfer has ceased for a configurable amount of time. This simulates ongoing network traffic over the connection and in effect keeps the state table entries from timing out. The transmission of keepalive packets is handled by the network stack of the OS, and the application has to request them via a socket option set through an OS system call when the network connection is set up. SAP has an instance profile parameter to request keepalives at the start of the SAP system:
gw/so_keepalive = 1
(see SAP Note 743888). The parameter can also be queried and changed dynamically via the SAP transaction SMGW → Goto → Parameters → Display/Change. A change performed this way is only valid until the next start of the Gateway process and only for connections established after the parameter has been changed.
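To see what a system will request at its next start, the profile entry can be read straight from the instance profile file. The following is only a minimal sketch: the function name `so_keepalive_setting` is made up for illustration, the profile path has to be supplied by you, and taking the last uncommented assignment is an assumption about how duplicate entries should be treated. It does not reflect dynamic changes made in SMGW.

```shell
#!/bin/sh
# Sketch: print the value assigned to gw/so_keepalive in a profile file,
# or "unset" if there is no uncommented assignment.
# so_keepalive_setting is a hypothetical helper, not an SAP tool.
so_keepalive_setting() {
    profile="$1"
    # Match only lines that start with the parameter name, so lines
    # commented out with '#' are ignored. Assumption: if the parameter
    # appears more than once, the last assignment is the relevant one.
    value=$(sed -n 's|^[[:space:]]*gw/so_keepalive[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*$|\1|p' "$profile" | tail -n 1)
    echo "${value:-unset}"
}
```

Called as `so_keepalive_setting <path to profile>`, it prints `1` when keepalives are requested in the profile.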
Another problem arises if the timeout values for sending keepalives (e.g. 2 hours) and the timeout values for state table entries (e.g. 10 min.) are not properly matched. Obviously the timeout value for sending keepalives needs to be lower than or equal to the timeout value for state table entries. Otherwise the system might start sending keepalives well after the state table entries have already been removed. Since the handling of keepalive packets is done by the OS, the timeout values need to be set there (see SAP Note 1410736). They are OS specific and they apply globally to the OS network stack. On IBM AIX both tunables are given in half-seconds, so a 10 min. timeout for keepalives and a resend interval of 10 min. is set by:
no -p -o tcp_keepidle=1200
no -p -o tcp_keepintvl=1200
Since those are global values you need to choose the lowest required value of all applications running on the system and of all the network devices involved.
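Picking the lowest required value is simple arithmetic, but it is easy to get wrong when several firewalls and applications are involved. A small sketch of the comparison follows. The function names are hypothetical, and all values are plain seconds for illustration only, so they still need to be converted to whatever unit the OS tunable expects.

```shell
#!/bin/sh
# Sketch: print the smallest of the given idle timeouts (all in seconds).
# lowest_timeout and keepalive_fits are hypothetical helper names.
lowest_timeout() {
    min=""
    for t in "$@"; do
        if [ -z "$min" ] || [ "$t" -lt "$min" ]; then
            min="$t"
        fi
    done
    echo "$min"
}

# Succeed if the keepalive idle time (first argument) does not exceed
# any of the state table timeouts given as the remaining arguments.
keepalive_fits() {
    idle="$1"
    shift
    [ "$idle" -le "$(lowest_timeout "$@")" ]
}
```

With state table timeouts of 600 and 3600 seconds, `keepalive_fits 600 600 3600` succeeds, while `keepalive_fits 7200 600 3600` fails because the first keepalive would arrive after the 600 second entry has already expired.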
In order to determine if an existing network connection has keepalives enabled, you can use tcpdump to sniff the actual network traffic. On busy systems this produces copious amounts of data, and the keepalive packets might be difficult to catch because you have to wait for the keepalive mechanism to kick in. With AIX there's another way to determine the socket options of existing network connections:
1. Run the netstat -Aan command and find the network connection in question. Gather the PCB/ADDR value for that connection from the left-most column:
$ netstat -Aan | egrep "PCB|tcp"
PCB/ADDR         Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
...
f1000e0003a7d3b8 tcp4       0      0  220.127.116.11.3311    18.104.22.168.53487    ESTABLISHED
...
2. Start the kernel debugger kdb with root privileges. At the (0)> prompt run the sockinfo command on the PCB/ADDR value gathered during the previous step. Since this prints all the socket option values, filter the output with the grep command for KEEP:
$ kdb
(0)> sockinfo f1000e0003a7d3b8 tcpcb | grep KEEP
    t_timer....... 0000021B (TCPT_KEEP)
    opts........ 000C (REUSEADDR|KEEPALIVE)
(0)>
The filtered output shows:
- the KEEPALIVE flag in the opts line, indicating that the socket option SO_KEEPALIVE is set for this connection and keepalive packets will be sent.
- a hex value of 0000021B in the t_timer line, representing the time in half-seconds that is left before the next keepalive packet is sent. In this example the next keepalive packet will be sent in 269.5 seconds (21B hex half-seconds == 539 decimal half-seconds == 269.5 decimal seconds).
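The conversion from the hex timer value to seconds can be scripted rather than done by hand. This is a minimal sketch: `keep_timer_seconds` is a made-up helper name, and it assumes the value is passed exactly as printed by kdb, without a 0x prefix.

```shell
#!/bin/sh
# Sketch: convert a t_timer value (hex, in half-seconds) to seconds.
# keep_timer_seconds is a hypothetical helper name.
keep_timer_seconds() {
    # printf understands the 0x prefix and prints the decimal value.
    halfsecs=$(printf '%d' "0x$1")
    # Two half-seconds make one second; keep one decimal place.
    awk -v n="$halfsecs" 'BEGIN { printf "%.1f\n", n / 2 }'
}
```

For the value from the example, `keep_timer_seconds 0000021B` prints `269.5`.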
The SAP saposcol process sometimes refuses to start, claiming there's already a saposcol process running even if there really isn't. Most of the time I found that this is due to a leftover shared memory segment. This usually happens when saposcol was not shut down properly or was otherwise killed. Normally a:
$ saposcol -c pf=<path to profile>
takes care of this, but sometimes even that doesn't seem to do the trick. In this case SAP Note 548699, especially item 7, comes in handy. Basically, you look for the leftover shared memory segment identified by saposcol's key and remove it manually with the OS tools:
$ ipcs -ma | grep '4dbe'
T        ID     KEY        MODE       OWNER    GROUP  CREATOR   CGROUP NATTCH     SEGSZ    CPID    LPID ...
m   1048576 0x00004dbe --rw-rw-rw-     root   system     root   system      1   1839766 4915206 5046426 ...
$ ipcrm -m 1048576
Once the shared memory segment has been removed, saposcol can be started again. Be careful though: in recent versions, or if you installed the SMD agent, saposcol is started by the SAP host agent!
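A plain grep for '4dbe' also matches lines where that string shows up in another column, for example as part of a different key or a process ID. The sketch below matches on the KEY column only. It assumes the column layout of the AIX ipcs -ma output shown above (type, ID, key as the first three columns), and `shm_id_for_key` is a made-up helper name.

```shell
#!/bin/sh
# Sketch: read "ipcs -ma" output on stdin and print the ID of the
# shared memory segment whose KEY column equals the given key.
# shm_id_for_key is a hypothetical helper name.
shm_id_for_key() {
    key="$1"
    # Column 1 is the type (m = shared memory), column 2 the ID and
    # column 3 the key. Appending "" forces a string comparison.
    awk -v key="$key" '$1 == "m" && ($3 "") == (key "") { print $2 }'
}
```

Used as `ipcs -ma | shm_id_for_key 0x00004dbe`, it prints only the segment ID, which can be checked before it is handed to ipcrm -m.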
Related or otherwise interesting SAP Notes: 710975.