bityard Blog

// Broken or ill-behaved Web Crawlers - Bezeq International

In principle, web crawlers are a good thing if you want your website and content to be found by others. Search engine operators have been using them for years and have put a considerable amount of time and knowledge into making them useful, yet considerate tools. But like any other tool, they can be used for good and evil, and Bezeq is IMHO a case of the latter.

Like in numerous other reports on the web, my sites have recently been hit by the crawlers operated by Bezeq International, an Israeli internet service provider. Judging from an analysis of the webserver's access logs, they've been doing it in a fashion that shows a complete disregard for the targeted site's health and the bandwidth consumed in the process (i.e. a very high request rate and multiple full site crawls). Since I've worked for an ISP myself, I understand the motivation of offering the best possible service to your customers. Mirroring or caching entire sites to a location near your customers may be a good idea, depending on how you do it and how much collateral damage you're causing in the process. Needless to say, I don't condone the way Bezeq does this.

The first countermeasure was to block the source IP address of the crawler with an iptables rule on the webserver. Assuming this simple measure wouldn't last very long, I decided to block the whole network the source IP address resided in. Sadly this wasn't enough. A few days later another crawl hit my sites, this time from an entirely different network also operated by Bezeq. Alright, now you've got my attention! Searching the BGP routing tables and cross-checking against the RIPE and other LIR databases, I looked up all prefixes operated by Bezeq as of 2012/08/20 and put them in the iptables configuration to be blocked. I'm sharing this list for anyone interested, and in the hope that Bezeq will learn if enough sites are unreachable to their customer base:

iptables.conf
# Drop ill-behaved crawlers - Bezeq International
-A INPUT -s 31.168.0.0/16 -j DROP
-A INPUT -s 62.219.0.0/16 -j DROP
-A INPUT -s 79.176.0.0/13 -j DROP
-A INPUT -s 81.218.0.0/16 -j DROP
-A INPUT -s 82.80.0.0/15 -j DROP
-A INPUT -s 84.108.0.0/14 -j DROP
-A INPUT -s 85.130.128.0/17 -j DROP
-A INPUT -s 109.64.0.0/14 -j DROP
-A INPUT -s 147.235.0.0/16 -j DROP
-A INPUT -s 192.114.0.0/15 -j DROP
-A INPUT -s 192.116.0.0/15 -j DROP
-A INPUT -s 192.118.0.0/16 -j DROP
-A INPUT -s 195.213.229.0/24 -j DROP
-A INPUT -s 206.82.140.0/24 -j DROP
-A INPUT -s 207.117.93.0/24 -j DROP
-A INPUT -s 212.25.64.0/19 -j DROP
-A INPUT -s 212.25.65.0/24 -j DROP
-A INPUT -s 212.25.96.0/19 -j DROP
-A INPUT -s 217.194.200.0/24 -j DROP
-A INPUT -s 217.194.201.0/24 -j DROP
-A INPUT -s 212.179.0.0/16 -j DROP
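
In case someone wants to recreate or update this list, the announced route objects can be pulled from the RIPE database with an inverse query on the origin AS (a sketch; AS8551 is Bezeq International's AS number to the best of my knowledge, so double-check before use):

whois -h whois.ripe.net -- '-i origin AS8551' | grep '^route:'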

There are probably better and more sophisticated countermeasures like tarpitting or rate limiting, but I didn't want to waste too much energy on the matter.
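
Just as a pointer, a per-source rate limit with the iptables hashlimit match could look something like this (an untested sketch; the thresholds are arbitrary and the exact option names vary with the iptables version):

# Drop new HTTP connections from hosts exceeding 10 connections/s (burst 20)
-A INPUT -p tcp --dport 80 -m state --state NEW -m hashlimit --hashlimit-name crawlers --hashlimit-mode srcip --hashlimit-above 10/sec --hashlimit-burst 20 -j DROP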

// Apache Logfile Analysis with AWStats and X-Forwarded-For Headers

When running a webserver behind a (reverse) proxy or load-balancer, it's often necessary to enable the logging of the original client's IP address by looking at a possible X-Forwarded-For header in the HTTP request. Otherwise one will only see the IP address of the proxy or load-balancer in the webserver's access logs. With the Apache webserver this is usually done by changing the LogFormat directive in the httpd.conf from:

LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined

to something like this:

LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined

Unfortunately this solution fails if you're so “lucky” to have a setup where clients can access the webserver both directly and through the proxy/load-balancer. In this particular case the HTTP request of the direct access to the webserver has no X-Forwarded-For header and will be logged with a “-” (dash). There are some solutions to this out on the net, utilizing the Apache SetEnvIf configuration directive.
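
Such a SetEnvIf based solution typically defines a second LogFormat and switches on the presence of the header, roughly like this (a sketch; the proxy-combined format name and the log path are made up):

LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" proxy-combined
# Set the "forwarded" environment variable if an X-Forwarded-For header is present
SetEnvIf X-Forwarded-For ".+" forwarded
# Log direct requests with the regular format and proxied ones with the proxy format
CustomLog logs/access_log combined env=!forwarded
CustomLog logs/access_log proxy-combined env=forwarded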

This wasn't really satisfying for me, because it prevents you from knowing which requests were made directly and which were made through the proxy/load-balancer. So I just inserted the X-Forwarded-For field right after the regular remote hostname field (%h):

LogFormat "%h %{X-Forwarded-For}i %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined

Two example access log entries look like this:

10.0.0.1 - - - [21/Jul/2012:18:32:45 +0200] "GET / HTTP/1.1" 200 22043 "-" "User-Agent: Lynx/2.8.4 ..."
192.168.0.1 10.0.0.1 - - [21/Jul/2012:18:32:45 +0200] "GET / HTTP/1.1" 200 22860 "-" "User-Agent: Lynx/2.8.4 ..."

The first one shows a direct access from the client 10.0.0.1 and the second one shows a proxied/load-balanced access from the same client through the proxy/load-balancer with the IP address 192.168.0.1.

Now if you're also using AWStats to analyse the webserver's access logs, things get a bit tricky. AWStats has support for X-Forwarded-For entries in access logs, but only for the example in the second code block above as well as for the mentioned SetEnvIf solution. This is due to the fact that the corresponding AWStats LogFormat tags %host and %host_proxy are mutually exclusive. An AWStats configuration based on the Type 1 LogFormat like:

LogFormat = "%host %host_proxy %other %logname %time1 %methodurl %code %bytesd %refererquot %uaquot"

will not work, since %host_proxy will overwrite %host entries even if the %host_proxy field does not contain an IP address, like in the direct access example above. The result is a lot of hosts named “-” (dash) in your AWStats statistics. This can be fixed by the following simple, quick'n'dirty patch to the AWStats sources:

awstats.pl
diff -u wwwroot/cgi-bin/awstats.pl.orig wwwroot/cgi-bin/awstats.pl
--- wwwroot/cgi-bin/awstats.pl.orig     2012-07-18 19:28:59.000000000 +0200
+++ wwwroot/cgi-bin/awstats.pl  2012-07-18 19:30:52.000000000 +0200
@@ -17685,6 +17685,18 @@
                        next;
                }
 
+               my $pos_proxy = $pos_host - 1;
+               if ( $field[$pos_host] =~ /^-$/ ) {
+                       if ($Debug) {
+                               debug(
+                                       " Empty field host_proxy, using value "
+                                       . $field[$pos_proxy] . " from first host field instead",
+                                       4
+                               );
+                       }
+                       $field[$pos_host] = $field[$pos_proxy];
+               }
+ 
                if ($Debug) {
                        my $string = '';
                        foreach ( 0 .. @field - 1 ) {

Please bear in mind that this is a “works for me”™ kind of solution, which might break AWStats in all kinds of other ways.

// HMC Update to 7.7.5.0

Today I did an update of our two IBM HMC appliances from v7.7.4.0 SP2 to v7.7.5.0 (MH01311 and MH01312). Like in other cases with HMC, AIX, TSM, SVC, storage and Power systems microcode over the last year or so, IBM failed to impress with the product they released. I guess this is what happens when, as a vendor, you skip quality control altogether and start doing your beta testing in the field.

First off was the update process, which was a huge pain in the rear. Who in this day and age dreams up an update process that is only described as purely media-based? I guess I didn't read the system requirement that states you have to be located geographically near your HMC equipment. After digging around for a while, I found the nice TechDoc 635220142 over at the i5 folks, which describes in detail how to do a network-based update. The question remains why this really nice description didn't make it into the release notes.

Next up was the backup of the managed system profiles and the HMC upgrade data, which could only be saved locally or on removable media (DVD or USB). Again, network (NFS, SCP, FTP, etc.) anyone?

The update itself went fine, except for the usual error output on the KVM console, which makes you wonder if anyone is ever going to fix the messages that have been popping up there for ages. Apparently it is a considerable challenge to check for the existence of a symlink before trying to create one!
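
For the record, guarding a symlink creation like that is a shell one-liner (a generic sketch, the paths are obviously placeholders):

# Create the symlink only if it doesn't already exist
[ -L /path/to/link ] || ln -s /path/to/target /path/to/link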

The first things I noticed after the update were that the “OS Version” column now actually displays the ioslevel instead of the oslevel for VIOS partitions (nice!), and that our Nagios monitoring wasn't working anymore for the HMC CPU check (WTF?). Manually checking on the HMC CLI showed this:

hscroot@hmc:~> monhmc -n 0 -r proc
/opt/hsc/bin/MONHmc: line 80: awk: command not found

That's a nice one, isn't it? Could probably be easily fixed like this:

diff -u MONHmc.orig MONHmc 
--- /opt/hsc/bin/MONHmc.orig 2012-06-19 19:34:15.000000000 +0200
+++ /opt/hsc/bin/MONHmc      2012-06-19 19:13:41.000000000 +0200
@@ -77,7 +77,7 @@
    if [ ! -f ${HOME}/.toprc ];then
       /bin/cp /opt/hsc/data/toprc ${HOME}/.toprc
    fi
-   /usr/bin/top -b -n 2 -p $PPID | awk '/^top/{i++}i==2' | /usr/bin/grep -i cpu[0-9,\(s]
+   /usr/bin/top -b -n 2 -p $PPID | /usr/bin/awk '/^top/{i++}i==2' | /usr/bin/grep -i cpu[0-9,\(s]
 }
 
 showMem()

if one only had unrestricted access to the Linux system running within the appliance.

I'm pretty sure those were only the first roadblocks to be hit. Stay tuned …

// Webserver - Windows vs. Unix

Recently at work, I was given the task of evaluating alternatives for the current OS platform running the company homepage. Sounds trivial enough, doesn't it? But every subject in a moderately complex corporate environment has some history, lots of pitfalls and a considerable amount of politics attached to it, so why should this particular one be an exception.

The current environment was running a WAMP (Windows, Apache, MySQL, PHP) stack with a PHP-based CMS and was not performing well at all. The systems would cave under even minimal connection load, not to mention user rushes during campaign launches. The situation dragged on for over a year and a half, while expert consultants were brought in, measurements were made, fingers were pointed and even new hardware was purchased. Nothing helped; the new hardware brought the system down even faster, because it could serve more initial user requests, thus effectively overrunning the system. IT management drew a lot of fire for the situation, but nonetheless stuck with the “Microsoft, our strategic platform” mantra. I guess at some point the pressure got too high even for those guys.

This is where I, the Unix guy with almost no M$ knowledge, got the task of evaluating whether or not an “alternative OS platform” could do the job. Hot potato, anyone?

So I went on and set up four different environments that were at least somewhere within the scope of our IT department's supported systems (so no *BSD, no Solaris, etc.):

  1. Linux on the newly purchased x86 hardware mentioned above

  2. Linux on our VMware ESX cluster

  3. Linux as an LPAR on our IBM Power systems

  4. AIX as an LPAR on our IBM Power systems

Apache, MySQL and PHP were all the same versions as in the Windows environment. The CMS and the content were direct copies from the Windows production systems. Without any further special tweaking, I ran some load tests with siege.
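
The test runs were along the lines of the following invocation (a sketch; the URL is a placeholder and the concurrency was stepped up to the limit of 1000 parallel clients mentioned below):

# 300 concurrent clients, benchmark mode (no think time), run for 5 minutes
siege -c 300 -b -t 5M http://www.example.com/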

[Graph: Webserver performance comparison - Transactions per second]
[Graph: Webserver performance comparison - Response time]

Compared to the Windows environment (gray line), scenario 1 (dark blue line) was giving about 5 times the performance on the exact same hardware. The virtualized scenarios 2, 3 and 4 did not perform as well in absolute values. But since their CPU resources were only about half of the ones available in scenario 1, their relative performance isn't too bad after all. Also notable is the fact that all scenarios served requests up to the test limit of a thousand parallel clients, while Windows started dropping requests after about 300 parallel clients.

Presented with those numbers, management decided the company webserver environment should be migrated to an “alternative OS platform”. AIX on Power systems was chosen for operational reasons, even though it didn't have the highest performance of the tested scenarios. The go-live of the new webserver environment was Wednesday last week at noon, with the switchover of the load-balancing groups. Take a look at what happened to the response time measurements around that time:

[Graph: Webserver performance - Daily after migration]

Also very interesting is the weekly graph a few days after the migration:

[Graph: Webserver performance - Weekly after migration]

Note the largely reduced jitter in the response time!

// Cacti Monitoring Templates and Nagios Plugin for TMS RamSan-630

Some time ago we got two TMS RamSan-630 SAN-based flash storage arrays at work. They are integrated into our overall SAN storage architecture and thus provide their LUNs to the storage virtualization layer, which is based on a four-node IBM SVC cluster. The TMS LUNs are used in two different ways. Some are used as dedicated flash-backed MDiskGroups for applications with moderate space, but very high I/O and very low latency requirements. Some are used in existing disk-based MDiskGroups as an additional SSD tier, using the SVC's “Easy Tier” feature to do a dynamic relocation of “hot” extents to the flash and “cold” extents from the flash. With the two different use cases we try to get an optimal use out of the TMS arrays, while simultaneously reducing the I/O load on the existing disk-based storage.
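
As a side note, whether Easy Tier is actually active on a given MDiskGroup can be checked on the SVC CLI, roughly like this (a sketch; the MDiskGroup name is a placeholder and the easy_tier fields assume a reasonably recent SVC release):

# Detailed view of an MDiskGroup, including its easy_tier status fields
svcinfo lsmdiskgrp flash_mdiskgrp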

So far the TMS boxes work very well, and the documentation is nothing but excellent. Unlike other classic storage arrays (e.g. IBM DS/DCS, EMC Clariion, HDS AMS, etc.), the TMS arrays are conveniently self-contained. All management operations are available via a telnet/SSH interface or an embedded WebGUI; no OS-dependent management software is necessary. All functionality is already available; no additional licenses for this and that are necessary. Monitoring could be improved a bit, especially the long-term storage of performance metrics. Unfortunately only the most important performance metrics are exposed via SNMP, so you can't really fill that particular gap yourself with a third-party monitoring application.

With the metrics that are available via SNMP, I created a Nagios plugin for availability and health monitoring and a Cacti template for performance trends. The Nagios configuration for the TMS arrays monitors the following generic services:

  • ICMP ping.

  • Check for the availability of the SNMP daemon.

  • Check for SNMP traps submitted to snmptrapd and processed by SNMPTT.

In addition to those, the Nagios plugin for the TMS arrays monitors the following more specific services:

  • Check for the overall status (OID: .1.3.6.1.4.1.8378.10.1.3.0).

  • Check for the fan status (OID: .1.3.6.1.4.1.8378.10.1.6.0.1.6).

  • Check for the temperature status (OID: .1.3.6.1.4.1.8378.10.1.6.1.1.6).

  • Check for the power status (OID: .1.3.6.1.4.1.8378.10.1.6.2.1.6).

  • Check for the FC connectivity status (OID: .1.3.6.1.4.1.8378.10.2.1.5).
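
For reference, one of those OIDs can be queried manually with the Net-SNMP tools like this (hostname and community string are placeholders):

# Query the overall status of the array
snmpget -v 2c -c public ramsan630.example.com .1.3.6.1.4.1.8378.10.1.3.0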

The Cacti templates graph the following metrics:

  • FC port bandwidth usage.

    [Graph: Example Read/Write Bandwidth on Port fc-1a]

    [Graph: Example Read/Write Bandwidth on Port fc-1b]

  • FC port cache values (although they seem to remain at zero all the time).

  • FC port error values.

  • FC port received and transmitted frames.

  • FC port I/O operations.

    [Graph: Example Read/Write IOPS on Port fc-1a]

    [Graph: Example Read/Write IOPS on Port fc-1b]

  • Fan speed values.

  • Voltage and current values.

  • Temperature values.

The Nagios plugin and the Cacti templates can be downloaded here: Nagios Plugin and Cacti Templates. Beware that they should be considered quick'n'dirty hacks, which should generally work but don't come with any warranty of any kind ;-)
