Google News ranking factors, 2003 patent revealed

August 20th, 2009

Via WebProNews by Chris Crum
Google’s patent application “Systems and methods for improving the ranking of news articles” was granted on August 18, 2009. It was originally filed about 6 years ago, on September 16, 2003. There is an interesting analysis in human-readable language at SEO by the Sea by Bill Slawski.

Before continuing, it is best to read Bill’s and Chris’ posts first.

In spite of this filing being 6 years old, I personally believe some of the theory is still valid today. It is important to know what Google was doing in 2003 to better understand what it may be doing today.

Abstract of the patent :-

A system ranks results. The system may receive a list of links. The system may identify a source with which each of the links is associated and rank the list of links based at least in part on a quality of the identified sources.
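The abstract boils down to "identify the source of each link, then order the links by source quality". Here is a minimal sketch of that idea, assuming the source is simply the hostname of the link and that quality scores already exist. The hostnames, the numbers, and the function names are all made up for illustration; the patent does not publish any of them.

```python
from urllib.parse import urlparse

# Hypothetical quality scores per news source (made-up numbers).
SOURCE_QUALITY = {
    "bignews.example.com": 0.9,
    "localblog.example.org": 0.4,
}

def identify_source(link):
    """Treat the hostname of a link as its source (an assumption)."""
    return urlparse(link).netloc.lower()

def rank_links(links, source_quality, default_quality=0.0):
    """Return the links ordered by the quality of their source, best first.

    Links from unknown sources get default_quality. sorted() is stable,
    so links with equal quality keep their incoming order.
    """
    return sorted(
        links,
        key=lambda link: source_quality.get(identify_source(link), default_quality),
        reverse=True,
    )

if __name__ == "__main__":
    links = [
        "http://localblog.example.org/story",
        "http://unknown.example.net/story",
        "http://bignews.example.com/story",
    ]
    for link in rank_links(links, SOURCE_QUALITY):
        print(link)
```

In a real system the source quality would only be one signal among others ("based at least in part", as the abstract puts it), so treat this as the simplest possible case.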

I will first discuss the points already established and then try to draw my own conclusions.

Source Rank : This is a rank given to different news sources. An article from a source having higher “Source Rank” would be more likely to rank higher than others. According to the patent, the following metrics go into determining the “Source Rank”.

Number of articles produced by the news source during a given time period : Presumably the more the better, or rather, the more original articles (as opposed to newswire stories) the better.

Average length of an article from the news source : Presumably, a news source with longer articles would get a better Source Rank.

Breaking news score : The most interesting aspect. I had a rough feeling this was an important factor, and the patent agrees. I’ll discuss it in my conclusions below, citing examples. Basically, as per the patent, a news source which publishes news about events that have just occurred gets a higher Source Rank.

Usage pattern : Tracking click-throughs from Google News search and analyzing that data. All links on Google News are redirected through their forwarder. They have been tracking this data for as long as I can remember.

Human opinion of the news source : Quite obvious :)

Circulation statistics of the news source : Circulation stats from various media monitoring agencies.

The size of the staff associated with the news source : Google recently started showing (where possible) author names in news search. These are detected automagically using some algorithm. I’m quite sure that they have been tracking these internally for quite some time.
Google News search result showing Author Name

The number of news bureaus associated with the news source : To favour bigger, more established news outlets.

Original named entities appearing in articles produced by the news source : A named entity is a specific person, place, organization, or thing. The more unique named entities the better. This probably indicates a more in-depth news source.

Number of topics on which the source produces content : To determine the niche the news source participates in. A news source like TechCrunch almost exclusively writes about Tech related articles, Google may then determine that TechCrunch is an authority on Tech related topics.

International diversity of the news source : Checking which countries people visit these sites from via Google News Search, based on IP.

The writing style used by the news source : Grammar, spelling, readability. Writing style may also help Google determine the target audience (e.g. British vs American English).

Conclusion

Although this is not at all related to what Google has told us about ranking on Google News, it does provide some nice insight.

Now, what I believe is that Google News also implements what I’d like to call a Source Rank per Topic. The Breaking news score as explained above is applicable on a per-topic basis too. For example, my site had a few stories about an incident just after major news broke. It got some traffic, then got crowded out by the regular big sources, which presumably have a much higher Source Rank. But from a couple of days later, any follow-ups I did ranked well on Google News. My assumption is that Google sees which sources were the ones to break a particular story and assigns them a temporary (or permanent) authority on the topic.

I have no views on the content length point, but I do agree that more original sentences do result in a higher Source Rank.

Another point which I don’t see mentioned, but strongly believe to be an important factor for the Source Rank, is the performance of the website. It’s basic common sense that if Google is sending a lot of traffic, they don’t want these people to wait for ages while the overloaded servers of the news site churn out the pages. Google would rather prefer faster sites. I observed this personally after I implemented a new caching mechanism which made the average random page generation time drop from ~1s to 50-100ms. Within days my traffic from Google doubled. So even if you are running a small site like mine, it is best to keep your random page load delay as small as possible.

Google also considers (IMHO) regular SEO practices when determining the Story Rank for a news source: internal linkage, external linkage, etc.
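As a rough illustration of the general idea only (this is not the caching mechanism mentioned above, just a hypothetical sketch), a tiny time-based page cache could look like this. The `PageCache` class, its parameters, and the `render` callback are all invented for the example.

```python
import time

class PageCache:
    """Keep rendered pages in memory and reuse them until they expire."""

    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so the cache can be tested
        self._store = {}        # key -> (time stored, rendered page)

    def get_or_render(self, key, render):
        """Return the cached page for key, calling render() only on a miss."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]     # fresh enough: skip the expensive render
        page = render()
        self._store[key] = (now, page)
        return page

if __name__ == "__main__":
    cache = PageCache(ttl_seconds=30)

    def slow_render():
        time.sleep(0.2)         # stand-in for database queries and templating
        return "<html>front page</html>"

    for attempt in range(2):
        start = time.perf_counter()
        cache.get_or_render("/", slow_render)
        print("request %d took %.3fs" % (attempt + 1, time.perf_counter() - start))
```

The second request should come back almost instantly because the page is served from memory instead of being generated again.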

Prospective search using python

July 22nd, 2009

Prospective search, or persistent search, is a relatively uncommon method of implementing search where the list of keywords is defined in advance, and when given a single document it determines which of those keywords apply to it.

This is different from traditional (or “retrospective”) search, where many documents are stored in an index, and when provided with a search term, the search engine returns the list of documents which best match the query.

The best real-world example would be how Google News Alerts (or IMHO categorization/clustering in Google News) works. When a new news story is found by Google, it makes more sense to run a prospective search on the news story to find which alert subscriptions (or news category) it belongs to, rather than searching for all the alerts repeatedly on their entire index.

Lucene has a MemoryIndex class for just this purpose, and I’ve made a simple implementation in Python using PyLucene. MemoryIndex is a special class in Lucene for on-the-fly searching. It can contain only one document, which may have more than one field. This is ideal for prospective search.
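To make the idea concrete without any search library, here is a toy pure-Python sketch. It assumes each stored subscription is just a list of words that must all appear in the document. The subscription names and the helper functions are made up for illustration, and there is no stemming, phrase matching, or boolean query syntax.

```python
import re

def tokenize(text):
    """Lowercase the text and return the set of words it contains."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def prospective_match(document, subscriptions):
    """Return the names of the stored subscriptions that the document matches.

    subscriptions maps a subscription name to the keywords that must all
    be present in the document.
    """
    words = tokenize(document)
    return [
        name
        for name, keywords in subscriptions.items()
        if all(keyword.lower() in words for keyword in keywords)
    ]

if __name__ == "__main__":
    subscriptions = {
        "python news": ["python"],
        "lucene alerts": ["lucene", "search"],
        "cricket": ["cricket"],
    }
    story = "New Python bindings for Lucene make search easy"
    print(prospective_match(story, subscriptions))
```

Note the direction of the lookup: the queries are the stored data and the document is the input, which is the reverse of a normal search engine.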

Installation and setup of PyLucene is out of the scope of this post… RTFM! (Do note you need to edit the Makefile.)

    import lucene

    def ProspectiveSearch(body, terms):
        """Return the queries in terms that match the single document body."""
        # Start the JVM that PyLucene runs on.
        lucene.initVM(lucene.CLASSPATH)
        analyzer = lucene.StandardAnalyzer()
        # MemoryIndex holds exactly one document, entirely in memory.
        index = lucene.MemoryIndex()
        index.addField("content", body, analyzer)
        parser = lucene.QueryParser("content", analyzer)
        matches = []
        for term in terms:
            # search() returns a relevance score; 0 means no match.
            score = index.search(parser.parse(term))
            if score > 0:
                matches.append(term)
        return matches

sample usage :-

    body = "hi my name is sajal kayan"
    terms = ["sajal", "good", "boy", "name", "sajal AND NOT kayan", "sajal AND kayan"]
    matches = ProspectiveSearch(body, terms)

In this case returns ['sajal', 'name', 'sajal AND kayan']

Note: initVM() is giving problems under mod_wsgi.

On my computer, this is the benchmark I observed for a 244-word document.

  • 1,492 queries : 0.79 seconds (for the whole script; only 248ms for the search loop)
  • 14,920 queries : 1.519 seconds
  • 74,600 queries : 3.425 seconds
  • 149,200 queries : 5.552 seconds
  • 298,400 queries : 10.328 seconds
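Numbers like these can be collected with a small timing helper. The sketch below is only an illustration of how the search loop could be timed: `benchmark` and `naive_search` are hypothetical names, and `naive_search` is a plain word-lookup stand-in so the example runs without PyLucene. Any function taking `(body, terms)` can be passed in its place.

```python
import time

def benchmark(search_fn, body, terms, repeat=1):
    """Run search_fn over the term list repeated `repeat` times.

    Returns (number of queries, elapsed seconds, matches).
    """
    queries = list(terms) * repeat
    start = time.perf_counter()
    matches = search_fn(body, queries)
    elapsed = time.perf_counter() - start
    return len(queries), elapsed, matches

def naive_search(body, terms):
    """Stand-in matcher: a term matches if it is a word in the body."""
    words = set(body.lower().split())
    return [term for term in terms if term.lower() in words]

if __name__ == "__main__":
    body = "hi my name is sajal kayan"
    terms = ["sajal", "good", "boy", "name"]
    for repeat in (1, 10, 100):
        count, elapsed, _ = benchmark(naive_search, body, terms, repeat)
        print("%d queries : %.6f seconds" % (count, elapsed))
```

This times only the search loop, not interpreter or JVM start-up, so it corresponds to the smaller "search loop" figure rather than the whole-script time.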

If you know a better method to achieve prospective search in python do let me know. Would also be interested to know if any RPC based search software does this.

BarCampBKK3 - my experience!

May 25th, 2009

Last weekend (23rd and 24th May) I attended BarCamp Bangkok 3. It was an awesome experience… In this blog post I intend to outline some of its interesting aspects from my viewpoint.

Barcampbkk3 sign board

(Photo Credit new_davich on flickr)

Firstly, over 700 people registered on the BarCamp website. At least 550 people showed up at the actual event; that is, 550 people registered at the registration desks on Day 1. There may have been more people turning up throughout the day who didn’t register, and I don’t yet have the figure for Day 2. This IMHO would make BarCampBKK3 the biggest BarCamp in ASEAN. There were many people who flew in to Bangkok from overseas exclusively for the BarCamp, from countries including Malaysia, Singapore, Cambodia, Vietnam and Hong Kong. Many were in Bangkok for the first time.

Many thanks to Sripatum University (SPU) for agreeing to be the venue. They were very helpful and even provided us with 20 to 30 volunteers to help with the arrangements.

BarCampbkk3 Opening Ceremony

Opening Ceremony! - Don’t be scared, BarCamp isn’t anything formal.. this is the exception ;) (Photo Credit new_davich on flickr)

I collected the following schwag :-

BarCampbkk3 Shirt

BarCamp Bangkok black T-Shirt (Thanks Luke for the awesome design) - Photo Credit Virak

Cloth Bag from SPU

An eco friendly cloth Bag from SPU (Photo credit Preetam Rai)
ATIZ white T-Shirt (if you can find photo ping me)
Yahoo Car hanging thingy. (if you can find photo ping me)

Tech start-ups in Thailand

Among the interesting topics covered were some presentations and a discussion relating to start-ups in Thailand. There were talks focused on financing and other issues faced by start-ups. The most common factor discouraging Thais and foreigners from setting up a start-up in Thailand is (IMHO) the procedure and red tape involved in setting up and managing a Thai company. John mentioned a friend who flew to Singapore in the morning and by afternoon had his company set up and ready for business. So that’s about 10,000 Baht for the airfare and about S$200 to S$300 (about 4,700 to 7,100 Thai Baht) for formalities, etc. Here in Thailand, even if you know exactly what to do, it would take weeks.

Ben from Proteus Tech gave an interesting talk titled “How to Create a Successful Technical Startup”. Proteus Tech is also interested in encouraging potential Thai entrepreneurs. Proteus Tech said in a statement:-

“We hope to organize a startup event to help people understand how to write a business plan and define a business strategy. Then we’ll have a follow up “startup gauntlet” where we give them a chance to present their biz plan and get evaluated + win some seed capital to start.”

Ben’s Presentation - Why didn’t I see this a few years ago, I learned some of the points the hard way.

Overnight Activities

This was the first BarCamp in Thailand where we stayed at the venue overnight. The evening started with drinks at a nearby pub, after which we returned to the venue. I tried in vain to help people get started with Linux, but it looks like nobody was interested… We played a couple of rounds of the Werewolf game, which was interesting; the foreigners always got nominated as werewolves and kicked out first… @murz (tried to) introduce us to a board game, “Adel Verpflichtet”. The rules were so complex that she had to draw a flowchart to explain them :)

Along with Jan, I did an “SEO site clinic” where we analyzed volunteers’ websites from an SEO viewpoint. Unlike last BarCamp, this was attended by very few people, probably due to a clash in timing with other more popular topics.

Overall it was very exciting to be a part of BarCampBKK3. Looking forward to BarCampBKK4!

Links:-

BarCamp Bangkok Website : http://www.barcampbangkok.org
Pics : http://www.flickr.com/search/?q=barcampbkk3&w=all
Slides : http://www.slideshare.net/search/slideshow?lang=**&submit=post&q=+barcampbkk3&commit=search

Blogs : http://blogsearch.google.com/blogsearch?q=barcampbkk3

Typical interaction of Windows Vista

April 27th, 2009

Vista : Are you sure?
User : Yes
Vista : Are you sure about being sure?
User : Yes
Vista : Are you sure about being sure about being sure?
User : Yes
Vista : Are you sure about being sure about being sure about being sure?
User : Yes
Vista : Are you sure about being sure about being sure about being sure about being sure?
User : Yes
Vista : Are you sure about being sure about being sure about being sure about being sure about being sure?
User : Yes
Vista : Are you sure about being sure about being sure about being sure about being sure about being sure about being sure?
User : Grrr…. Screw you Microsoft!!!!
Vista : Are you sure you want to screw Microsoft?

Python script to detect bad bots/people faking as Googlebot

March 28th, 2009

A script for analyzing my webserver’s access.log is long overdue; here is a small start. Just recently I noticed a bad bot attempting to scrape the whole of my site using Googlebot’s user agent. Since I’m learning Python, I thought it might be a nice exercise to write a simple script which can help me detect these fakers.

The script reads the access log, looks for records matching “Googlebot”, then validates them using the technique described in “How to verify Googlebot” on the Google Webmaster Central Blog. It may also be useful, or even fun, to catch other SEOs trying to see your site through Googlebot’s eyes.

The logic is simple. The IP from which the request comes in should reverse-resolve to a *.googlebot.com hostname, and in turn that hostname should resolve back to the same IP. The first part can be faked by a smart faker (anyone who controls the reverse DNS for their own IP can make it point to any name), but the second cannot (unless they break into Google’s DNS servers ;) ). This 2-step validation is a sure-shot method.
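A minimal sketch of this two-step check in Python, using only the standard `socket` module. This is not the logazier.py script itself; the function names and the injectable `reverse`/`forward` parameters are made up for illustration (they make the logic testable without touching real DNS).

```python
import socket

def reverse_dns(ip):
    """Return the hostname the IP points to, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None

def forward_dns(hostname):
    """Return the list of IPv4 addresses the hostname resolves to."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []

def is_real_googlebot(ip, reverse=reverse_dns, forward=forward_dns):
    """Step 1: the IP must point to *.googlebot.com.
    Step 2: that hostname must resolve back to the same IP."""
    hostname = reverse(ip)
    if not hostname:
        return False
    hostname = hostname.rstrip(".").lower()
    if not hostname.endswith(".googlebot.com"):
        return False
    return ip in forward(hostname)

if __name__ == "__main__":
    # Needs working DNS; results depend on the current records.
    for ip in ("66.249.71.202", "99.190.96.157"):
        print(ip, "GENUINE" if is_real_googlebot(ip) else "FAKE")
```

The suffix check uses ".googlebot.com" with the leading dot so that a hostname such as "notgooglebot.com" does not slip through.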

For a Genuine Googlebot request :-

Server log entry :-
66.249.71.202 - - [28/Mar/2009:08:59:14 -0500] GET / HTTP/1.1 "200" 17892 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
IP : 66.249.71.202

Thus :-
# host 66.249.71.202
202.71.249.66.in-addr.arpa domain name pointer crawl-66-249-71-202.googlebot.com.
# host crawl-66-249-71-202.googlebot.com.
crawl-66-249-71-202.googlebot.com has address 66.249.71.202
#

For now this script outputs: the number of hits, the IP, the hostname, and the IP the hostname resolves to…
# ./logazier.py
92 - 99.190.96.157 - adsl-99-190-96-157.dsl.pltn13.sbcglobal.net - FAKE - 99.190.96.157
36 - 24.154.150.217 - dynamic-acs-24-154-150-217.zoominternet.net - FAKE - 24.154.150.217
4 - 83.82.191.185 - 5352BFB9.cable.casema.nl - FAKE - 83.82.191.185
4 - 69.64.69.150 - 69-64-69-150.dedicated.abac.net - FAKE - 69.64.69.150
3 - 64.191.54.85 - venus.surfwebhost.com - FAKE - 64.191.54.85
3 - 117.47.205.13 - err - FAKE - err
2 - 218.186.12.202 - cm202.omega12.maxonline.com.sg - FAKE - 218.186.12.202
1 - 96.254.203.143 - pool-96-254-203-143.tampfl.fios.verizon.net - FAKE - 96.254.203.143
1 - 76.160.175.238 - mail.appianllc.com - FAKE - 76.160.175.238
1 - 121.246.166.247 - 121.246.166.247.static-hyd.vsnl.net.in - FAKE - err
1 - 117.196.235.141 - err - FAKE - err
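The hit counts in the first column of the output above can be gathered with a few lines of Python. This is only a sketch of one way to do it, not the actual logazier.py code: it assumes each log line starts with the client's IPv4 address, and the function name and regular expression are made up for illustration.

```python
import re
from collections import Counter

# Leading IPv4 address of a common/combined-format access log line.
IP_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s")

def count_googlebot_hits(lines):
    """Count requests per IP for lines whose text mentions Googlebot.

    Returns a list of (ip, hits) pairs, busiest IP first.
    """
    hits = Counter()
    for line in lines:
        if "Googlebot" not in line:
            continue
        match = IP_RE.match(line)
        if match:
            hits[match.group(1)] += 1
    return hits.most_common()

if __name__ == "__main__":
    import sys
    # Usage: python count_hits.py /path/to/access.log
    with open(sys.argv[1]) as log:
        for ip, count in count_googlebot_hits(log):
            print("%d - %s" % (count, ip))
```

Each IP in the result would then go through the DNS validation to decide whether it gets the FAKE label.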

The script can be downloaded at : http://www.sajalkayan.com/logazier/0.0.1/logazier.py

Upcoming features.

  1. Detect other major bots as well - yahoo, msn, alexa, etc…
  2. Analyze the access.log for bad bot activity even when the bots use regular browser user agents - much more complex than I thought :)