I am running the website thaindian.com which on September of 2007 finally(after 3 attempts) got approved as a Google news search source. Since then we have seen a steady increase in traffic at the portal. We partnered with a news agency for latest Indian news also and they have been providing good content. Since content is still the king(some might argue), it was going good for us.
Sometime in November, we suddenly increased the rate of putting new content (from approx 30 articles/week to about 200 articles/day) this got a HUGE surge in traffic. My WAMP setup couldn’t handle it, Apache would restart for no apparent reason. I sent in motion to get myself a LAMP server.
On about November 26th or 27th, our news stories stopped being picked up by Google News. Even the main pages which were re-cached often were now outdated in Google’s cache. At this moment I had assumed that Google is penalizing me for running a bad server. On about December 1st I moved the site over to the LAMP setup. The site loaded fast, no issues, but the almighty Bot wasn’t still happy. I re-did some HTML, optimized even further, a couple of days passed, still situation got worse. At this time I had a new far fetched theory; Perhaps G was penalizing me for adding too many(200/day) stories without getting more inbound links.
A couple of days went by, my eyes were glued to the terminal window where i was religiously watching the access.log for any Googlebot activity real time. Posted my issues on multiple forums and mailing lists, still no help. to take steps further, I learnt a new command ngrep, a tcpdump like tool which is used to monitor network activity in realtime.
Ngrep-ing the Googlebot IP(yes at that point only one IP was trying to access my site) for a couple of days, I got to the conclusion that there was some networking issues between Googlebot and my server. The GET from google came in and then the series of packets were sent during the duration of next 2 to 5 minutes. The first packet sent instantly but the following packets were taking time. I could confirm that this wasnt a php/apache issue as the content was gziped, the first packet will be sent only if the processing of the whole page is complete. On closer observation, I figured that on many occasions the same packet was being resent over and over again. Now this finally gave me some insight into what was happening; Packet loss!
Googling around, I got to some forums where packet loss was being discussed, used ping flooding to find the optimum MTU which was 528, anything higher resulted in packet loss. Changed MTU of eth0 to 528 and immediately Googlebot started showing the love it once used to, my 2 week long ordeal was over. It was something like I had just turned on the magic switch. The crawl rate went from 1 request every 3 or 4 minutes to 4 to 6 requests a minute. The same days news articles was immediately shown on Google News.
The weird part is that both the WAMP and LAMP setups are 2 different physical servers hosted by 2 different ISPs in 2 different datacenters but using the same backbone and both these ISPs say that none of their other customers were having any issues.
Recently I solved another issue with Google News not picking up images accompanying the News items, but I’ll blog about that another day. Right now my caffeine levels are low and I must hit the sack.
EDIT 1st Feb 2008 : Couple of weeks ago I changed the MTU back to 1500 and its working fine… I guess it was a temporary bug… The data center guys have no clue.
EDIT 16th Feb 2008 : Returned home late at night Saturday night and saw no news stories on news.google.com for last 9 hours. Did some testings, reduction of MTU to 500 again seemed to be the only solution.