Python script to detect bad bots/people faking as Googlebot
Sat, Mar 28, 2009
A script for analyzing my web server's access.log is long overdue; here is a small start. Just recently I noticed a bad bot attempting to scrape my whole site using Googlebot's user agent. Since I'm learning Python, I thought it would be a nice exercise to write a simple script that helps me detect these fakers.
The script scans the access log for records matching "Googlebot", then validates each one using the technique described in "How to verify Googlebot" on the Google Webmaster Central Blog. It may also be useful, or even fun, to catch other SEOs trying to see your site through Googlebot's eyes.
The logic is simple. A reverse DNS lookup on the IP the request came from should return a hostname under googlebot.com, and a forward lookup on that hostname should in turn resolve back to the same IP. The first part can be faked by a smart faker, since whoever controls an IP block also controls its reverse DNS records, but the second cannot (unless they break into Google's DNS servers ;) ). This two-step validation is a sure-shot method.
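As a rough illustration of the two-step check described above, here is a minimal Python 3 sketch. It is not the code from logazier.py; the function name `verify_googlebot`, its return shape, and the injectable `reverse_lookup` / `forward_lookup` parameters are my own assumptions, added so the logic can be tried without hitting real DNS.

```python
import socket


def verify_googlebot(ip, reverse_lookup=socket.gethostbyaddr,
                     forward_lookup=socket.gethostbyname_ex):
    """Return (is_genuine, hostname, resolved_ips) for a claimed Googlebot IP.

    The lookup parameters are hypothetical hooks; by default they are the
    standard library resolvers, which both return
    (hostname, aliaslist, ipaddrlist).
    """
    # Step 1: reverse lookup. The IP must map to a name under googlebot.com.
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False, "err", []
    if not hostname.lower().rstrip(".").endswith(".googlebot.com"):
        return False, hostname, []

    # Step 2: forward lookup. The name must resolve back to the same IP.
    try:
        resolved = forward_lookup(hostname)[2]
    except OSError:
        return False, hostname, []
    return ip in resolved, hostname, resolved
```

With the default resolvers, `verify_googlebot(ip)` performs live DNS queries, so it can be slow on a large log; caching the result per IP would be a sensible next step.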
For a Genuine Googlebot request :-
Server log entry :-
66.249.71.202 - - [28/Mar/2009:08:59:14 -0500] GET / HTTP/1.1 "200" 17892 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
IP : 66.249.71.202
# host 66.249.71.202
202.71.249.66.in-addr.arpa domain name pointer crawl-66-249-71-202.googlebot.com.
# host crawl-66-249-71-202.googlebot.com.
crawl-66-249-71-202.googlebot.com has address 66.249.71.202
For now the script outputs the number of hits, the IP, its hostname, the verdict, and the IP that the hostname resolves to:
92 - 18.104.22.168 - adsl-99-190-96-157.dsl.pltn13.sbcglobal.net - FAKE - 22.214.171.124
36 - 126.96.36.199 - dynamic-acs-24-154-150-217.zoominternet.net - FAKE - 188.8.131.52
4 - 184.108.40.206 - 5352BFB9.cable.casema.nl - FAKE - 220.127.116.11
4 - 18.104.22.168 - 69-64-69-150.dedicated.abac.net - FAKE - 22.214.171.124
3 - 126.96.36.199 - venus.surfwebhost.com - FAKE - 188.8.131.52
3 - 184.108.40.206 - err - FAKE - err
2 - 220.127.116.11 - cm202.omega12.maxonline.com.sg - FAKE - 18.104.22.168
1 - 22.214.171.124 - pool-96-254-203-143.tampfl.fios.verizon.net - FAKE - 126.96.36.199
1 - 188.8.131.52 - mail.appianllc.com - FAKE - 184.108.40.206
1 - 220.127.116.11 - 18.104.22.168.static-hyd.vsnl.net.in - FAKE - err
1 - 22.214.171.124 - err - FAKE - err
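A report in the format above could be produced along these lines. This is only a sketch, not the actual logazier.py: the function names are made up, the client IP is assumed to be the first whitespace-separated field of each log line, `verify` is a hypothetical callable returning `(is_genuine, hostname, resolved_ip)`, and the "OK" label for genuine hits is my own guess, since the sample output only shows FAKE rows.

```python
from collections import Counter


def count_googlebot_hits(lines):
    """Count hits per client IP for log lines that mention Googlebot."""
    hits = Counter()
    for line in lines:
        fields = line.split()
        # Assumption: the client IP is the first whitespace-separated field.
        if fields and "Googlebot" in line:
            hits[fields[0]] += 1
    return hits


def format_report(hits, verify):
    """Build one report line per IP, busiest first.

    `verify` is a hypothetical callable that returns
    (is_genuine, hostname, resolved_ip) for a given IP.
    """
    report = []
    for ip, count in hits.most_common():
        genuine, hostname, resolved = verify(ip)
        verdict = "OK" if genuine else "FAKE"  # "OK" is an assumed label
        report.append(f"{count} - {ip} - {hostname} - {verdict} - {resolved}")
    return report
```

Typical use would be to open the log file, pass the file object to `count_googlebot_hits`, and print each line returned by `format_report`.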
The script can be downloaded at : /logazier/0.0.1/logazier.py
Still to do:
- Detect other major bots as well: Yahoo, MSN, Alexa, etc.
- Analyze the access.log for bad bot activity even when the bots use regular browser user agents (much more complex than I thought :) )