Google works in mysterious ways – Stops Crawling

January 5th, 2008 | by Sajal Kayan

I run the website thaindian.com, which in September 2007 finally (after 3 attempts) got approved as a Google News source. Since then we have seen a steady increase in traffic at the portal. We also partnered with a news agency for the latest Indian news, and they have been providing good content. Since content is still king (some might argue), things were going well for us.

Notice the low traffic at the beginning of December.

Sometime in November, we suddenly increased the rate of new content (from approx. 30 articles/week to about 200 articles/day), which brought a HUGE surge in traffic. My WAMP setup couldn’t handle it; Apache would restart for no apparent reason. I set things in motion to get myself a LAMP server.

On about November 26th or 27th, our news stories stopped being picked up by Google News. Even the main pages, which used to be re-cached often, were now outdated in Google’s cache. At that point I assumed that Google was penalizing me for running a bad server. On about December 1st I moved the site over to the LAMP setup. The site loaded fast with no issues, but the almighty Bot still wasn’t happy. I redid some HTML and optimized even further; a couple of days passed and the situation still got worse. By then I had a new far-fetched theory: perhaps G was penalizing me for adding too many (200/day) stories without getting more inbound links.

A couple more days went by with my eyes glued to the terminal window, where I was religiously watching the access.log for any Googlebot activity in real time. I posted my issues on multiple forums and mailing lists, but still got no help. To take things further, I learnt a new command, ngrep, a tcpdump-like tool used to monitor network activity in real time.
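Watching the log by eye works, but a crawl rate is easier to compare as a number. Here is a minimal sketch of counting Googlebot requests per minute. It assumes the log is in Apache's "combined" format and that matching the string "Googlebot" in the line is good enough; the sample lines, paths, and IP addresses are made up for illustration.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format, where the timestamp looks like
# [05/Dec/2007:10:15:02 +0700]. Group 1 captures it down to the minute.
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} [+-]\d{4}\]")


def googlebot_hits_per_minute(lines):
    """Count log lines mentioning Googlebot, grouped by minute."""
    hits = Counter()
    for line in lines:
        if "Googlebot" not in line:
            continue  # not a Googlebot request
        match = TIMESTAMP.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits


# Hypothetical sample lines (documentation-range IPs, invented paths).
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE_LOG = [
    f'192.0.2.10 - - [05/Dec/2007:10:15:02 +0700] "GET /a.html HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'192.0.2.10 - - [05/Dec/2007:10:15:40 +0700] "GET /b.html HTTP/1.1" 200 4096 "-" "{GOOGLEBOT_UA}"',
    '198.51.100.7 - - [05/Dec/2007:10:16:11 +0700] "GET /c.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    f'192.0.2.10 - - [05/Dec/2007:10:19:05 +0700] "GET /d.html HTTP/1.1" 200 3072 "-" "{GOOGLEBOT_UA}"',
]

if __name__ == "__main__":
    for minute, count in sorted(googlebot_hits_per_minute(SAMPLE_LOG).items()):
        print(minute, count)
```

To run it against a real log, pass an open file object instead of `SAMPLE_LOG`; the function only needs an iterable of lines.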

After ngrep-ing the Googlebot IP (yes, at that point only one IP was trying to access my site) for a couple of days, I came to the conclusion that there were networking issues between Googlebot and my server. The GET from Google came in, and then the series of response packets was sent over the next 2 to 5 minutes. The first packet was sent instantly, but the following packets took time. I could confirm that this wasn’t a PHP/Apache issue: since the content was gzipped, the first packet is sent only once processing of the whole page is complete. On closer observation, I noticed that on many occasions the same packet was being resent over and over again. This finally gave me some insight into what was happening: packet loss!
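The "same packet resent over and over" observation can be checked mechanically. This is only a sketch: it assumes the outgoing TCP segments of one connection have already been extracted from a capture as (timestamp, sequence number, payload length) tuples, and it treats any repeated (sequence, length) pair as a retransmission. The extraction step and the sample numbers are hypothetical.

```python
def find_retransmissions(segments):
    """Return the segments whose (seq, length) was already sent earlier.

    `segments` is an iterable of (timestamp, seq, length) tuples for the
    outgoing direction of a single TCP connection.
    """
    seen = set()
    repeats = []
    for timestamp, seq, length in segments:
        if length == 0:
            continue  # pure ACKs carry no data and legitimately repeat
        key = (seq, length)
        if key in seen:
            repeats.append((timestamp, seq, length))
        else:
            seen.add(key)
    return repeats


# Hypothetical capture: the first segment goes out once, the second
# segment (seq 1461) is resent twice before the transfer moves on.
SAMPLE_SEGMENTS = [
    (0.00, 1, 1460),
    (0.01, 1461, 1460),
    (3.00, 1461, 1460),
    (9.00, 1461, 1460),
    (9.20, 2921, 800),
]

if __name__ == "__main__":
    for timestamp, seq, length in find_retransmissions(SAMPLE_SEGMENTS):
        print(f"t={timestamp:.2f}s resent seq={seq} len={length}")
```

A long list of repeats with growing gaps between timestamps is consistent with packets being dropped somewhere on the path and TCP backing off, which matches the slow multi-minute responses described above.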

Googling around, I found some forums where packet loss was being discussed, and used ping flooding to find the optimum MTU, which was 528; anything higher resulted in packet loss. I changed the MTU of eth0 to 528 and Googlebot immediately started showing the love it once used to. My 2-week-long ordeal was over. It was as if I had just flipped a magic switch. The crawl rate went from 1 request every 3 or 4 minutes to 4 to 6 requests a minute. That same day’s news articles were immediately shown on Google News.
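The search for the largest working MTU can be sketched as a binary search. This is not the exact procedure used above, just an illustration under two assumptions: that a `probe(mtu)` function exists which reports whether packets of that size get through (for example by wrapping ping with fragmentation disabled), and that the path behaves monotonically, so every size below a working one also works. Real packet loss is noisy, so a real probe would want to send several packets per size. The probe below is simulated.

```python
IP_ICMP_HEADER_BYTES = 28  # 20-byte IPv4 header + 8-byte ICMP header


def largest_working_mtu(probe, low=68, high=1500):
    """Binary-search the largest MTU in [low, high] for which probe(mtu) is True.

    Returns None if even `low` fails. Assumes that if one size passes,
    every smaller size passes as well.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always shrinks
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


def icmp_payload_size(mtu):
    """ICMP echo payload size that produces an IPv4 packet of exactly `mtu` bytes."""
    return mtu - IP_ICMP_HEADER_BYTES


def simulated_probe(mtu):
    """Stand-in for a real ping test: pretend the path drops anything above 528."""
    return mtu <= 528


if __name__ == "__main__":
    best = largest_working_mtu(simulated_probe)
    print(best)                     # 528
    print(icmp_payload_size(best))  # 500
```

The payload helper matters because ping takes a payload size, not an MTU: the packet on the wire is 28 bytes larger than the payload requested.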

The weird part is that the WAMP and LAMP setups are 2 different physical servers hosted by 2 different ISPs in 2 different datacenters, but using the same backbone, and both ISPs say that none of their other customers were having any issues.

Recently I solved another issue with Google News not picking up images accompanying the News items, but I’ll blog about that another day. Right now my caffeine levels are low and I must hit the sack.

EDIT 1st Feb 2008: A couple of weeks ago I changed the MTU back to 1500 and it’s working fine… I guess it was a temporary bug… The data center guys have no clue.

EDIT 16th Feb 2008: Returned home late Saturday night and saw no news stories on news.google.com for the last 9 hours. Did some testing; reducing the MTU to 500 again seemed to be the only solution.

  • http://www.seometer.com Peter

    Cool, glad you got the Google’s love back. :p It’s strange that the MTU is set so small though. I thought a typical MTU for ethernet is 1500?

    Cheers,

  • http://www.sajalkayan.com Sajal Kayan

    Hi Peter, yes the default for ethernet and most servers is 1500. It is one part of the configuration I think no one even thinks about.

  • http://www.sajalkayan.com/seo-and-newsgooglecom-run-your-own-news-website.html Sajal Kayan » SEO and news.google.com – Run your own news website

    [...] ways to rank better in the Google News Search. When I first started, I faced many problems like the MTU issue, but since there was(and still is) a lack of online resources and blogs discussing how it works, I [...]

  • http://port80.syndk8.com/2008/03/08/000111-the-secret-to-massive-googlebot-crawls.html Port80 ThreadMonitor » The Secret To MASSIVE Googlebot Crawls

    [...] READ>>> Google works in mysterious ways – Stops Crawling [...]

  • http://www.sajalkayan.com/secrets-of-google-newswhat-i-learnt-the-hard-way.html Sajal Kayan » Secrets of Google News…What I learnt the hard way!

    [...] the Google’s blog post says, the web master tools does an excellent job in finding potential issues with the crawling your site. Publishing a sitemap helps my rankings: [...]

  • http://www.unlimitedwebdesigns.com Edwin

    If you are having MTU size problems, it may be due to a misconfigured router; possibly it’s your hosting provider. Lowering the MTU size increases the overhead, thus increasing the bandwidth used. The solution is not bandwidth related; it’s a misconfiguration somewhere in between.
