2012-08-25 // Broken or ill-behaved Web Crawlers - Bezeq International
In principle, web crawlers are a good thing if you want your website and content to be found by others. Search engine operators have been using them for years and have put a considerable amount of time and knowledge into making them useful yet considerate tools. Like any other tool, they can be used for good or evil; Bezeq is IMHO a case of the latter.
As in numerous other reports on the web, my sites have recently been hit by crawlers operated by Bezeq International, an Israeli internet service provider. Judging from the web server's access logs, they've been doing it in a fashion that shows a complete disregard for the targeted site's health and the bandwidth consumed in the process (i.e. a very high request rate and multiple full-site crawls). Since I've worked for an ISP myself, I understand the motivation of offering the best possible service to your customers. Mirroring or caching entire sites at a location near your customers may be a good idea, depending on how you do it and how much collateral damage you're causing in the process. Needless to say, I don't condone the way Bezeq does this.
The first countermeasure was to block the source IP address of the crawler with an iptables rule on the web server. Assuming this simple measure wouldn't last very long, I decided to block the whole network the source IP address resided in. Sadly, this wasn't enough. A few days later another crawl hit my sites, this time from an entirely different network also operated by Bezeq. Alright, now you've got my attention! Searching the BGP routing tables and cross-checking against the RIPE and other RIR databases, I looked up all prefixes operated by Bezeq as of 2012/08/20 and put them in the iptables configuration to be blocked. I'm sharing this list for anyone interested, and in the hope that Bezeq will learn if enough sites become unreachable for their customer base:
- iptables.conf
# Drop ill behaved crawlers - Bezeq International
-A INPUT -s 31.168.0.0/16 -j DROP
-A INPUT -s 62.219.0.0/16 -j DROP
-A INPUT -s 79.176.0.0/13 -j DROP
-A INPUT -s 81.218.0.0/16 -j DROP
-A INPUT -s 82.80.0.0/15 -j DROP
-A INPUT -s 84.108.0.0/14 -j DROP
-A INPUT -s 85.130.128.0/17 -j DROP
-A INPUT -s 109.64.0.0/14 -j DROP
-A INPUT -s 147.235.0.0/16 -j DROP
-A INPUT -s 192.114.0.0/15 -j DROP
-A INPUT -s 192.116.0.0/15 -j DROP
-A INPUT -s 192.118.0.0/16 -j DROP
-A INPUT -s 195.213.229.0/24 -j DROP
-A INPUT -s 206.82.140.0/24 -j DROP
-A INPUT -s 207.117.93.0/24 -j DROP
-A INPUT -s 212.25.64.0/19 -j DROP
-A INPUT -s 212.25.65.0/24 -j DROP
-A INPUT -s 212.25.96.0/19 -j DROP
-A INPUT -s 217.194.200.0/24 -j DROP
-A INPUT -s 217.194.201.0/24 -j DROP
-A INPUT -s 212.179.0.0/16 -j DROP
There are probably better and more sophisticated countermeasures, like tarpitting or rate limiting, but I didn't want to waste too much energy on the matter.
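For anyone who would rather throttle than block outright, the rate-limiting idea could look roughly like the rule below. This is only a sketch in the same iptables.conf format, using the hashlimit match to drop new HTTP connections from any single source address that exceeds a threshold. The port (80), the bucket name `http_crawl`, and all the numbers are assumptions picked for illustration; none of them come from the setup described above and they would need tuning against real traffic.

```
# Sketch: per-source rate limit on new HTTP connections (illustrative values)
# Drops NEW connections from one source IP once it exceeds 10/second,
# after an initial burst allowance of 20. Idle hash entries expire after 60 s.
-A INPUT -p tcp --dport 80 -m conntrack --ctstate NEW -m hashlimit --hashlimit-name http_crawl --hashlimit-mode srcip --hashlimit-above 10/second --hashlimit-burst 20 --hashlimit-htable-expire 60000 -j DROP
```

Note that this limits new connections, not individual requests, so a crawler that reuses keep-alive connections would be slowed less than the numbers suggest. It also applies to every client rather than to one provider, which may or may not be what you want.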