Thursday, December 02, 2004

The magic that makes Google tick

ZDNet UK Insight

Google's vice-president of engineering, Urs Hölzle, was in London this week to talk to potential recruits about just what lies behind that search page. ZDNet UK snuck in to listen:

"It is one of the largest computing projects on the planet, arguably employing more computers than any other single, fully managed system (we're not counting distributed computing projects here), some 200 computer science PhDs, and 600 other computer scientists...

"Over four billion Web pages, each an average of 10KB, all fully indexed
Up to 2,000 PCs in a cluster
Over 30 clusters
104 interface languages including Klingon and Tagalog
One petabyte of data in a cluster -- so much that hard disk error rates of 10^-15 begin to be a real issue (see the quick arithmetic after this list)
Sustained transfer rates of 2Gbps in a cluster
An expectation that two machines will fail every day in each of the larger clusters
No complete system failure since February 2000
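
To see why that error rate bites, here is a rough back-of-envelope calculation in Python. It assumes the quoted 10^-15 is an undetected-error probability per bit read, which is a common way the rate is stated; the talk itself does not specify:

    # Rough arithmetic, assuming 10^-15 undetected errors per bit read.
    PETABYTE_BITS = 10**15 * 8      # one petabyte of data, in bits
    BIT_ERROR_RATE = 1e-15          # assumed per-bit undetected error rate
    expected_errors = PETABYTE_BITS * BIT_ERROR_RATE
    print(f"Expected undetected errors per full 1PB read: {expected_errors:.0f}")
    # -> about 8: every full pass over a cluster's data likely hits several
    #    silent errors, so the software must checksum and recover on its own.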


The problem

Google indexes over four billion Web pages, using an average of 10KB per page, which comes to about 40TB. Google is asked to search this data over 1,000 times every second of every day, and typically comes back with sub-second response rates. If anything goes wrong, said Hölzle, "you can't just switch the system off and switch it back on again."
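
The headline figures follow directly from that arithmetic; a quick check in Python:

    # Reproducing the sizing quoted above.
    pages = 4 * 10**9               # over four billion Web pages
    bytes_per_page = 10 * 1024      # an average of 10KB each
    corpus_bytes = pages * bytes_per_page
    print(corpus_bytes / 10**12, "TB")    # ~41 TB, the "about 40TB" figure
    print(1000 * 86400, "searches/day")   # over 1,000 per second sustained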

The job is not helped by the nature of the Web. "In academia," said Hölzle, "the information retrieval field has been around for years, but that is for books in libraries. On the Web, content is not nicely written -- there are many different grades of quality."

Some, he noted, may not even have text. "You may think we don't need to know about those, but that's not true -- it may be the home page of a very large company where the Webmaster decided to have everything graphical. The company name may not even appear on the page."

Google deals with such pages by regarding the Web not as a collection of text documents, but as a collection of linked text documents, with each link carrying valuable information.
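
The article does not spell out the mechanics, but one well-known consequence of treating the Web this way is indexing the anchor text of inbound links alongside a page's own words, which is how an all-graphics home page can still be found by the company name. A minimal sketch of that idea in Python (the toy corpus and data layout are invented for illustration):

    from collections import defaultdict

    # Hypothetical corpus: url -> (page's own text, [(link target, anchor text)])
    pages = {
        "bigcorp.com":  ("", [("news.example", "coverage")]),  # all-graphics, no text
        "news.example": ("BigCorp posts record profits",
                         [("bigcorp.com", "BigCorp home page")]),
    }

    index = defaultdict(set)  # term -> set of urls
    for url, (text, links) in pages.items():
        for term in text.lower().split():
            index[term].add(url)             # index a page's own words
        for target, anchor in links:
            for term in anchor.lower().split():
                index[term].add(target)      # credit anchor words to the *target*

    print(index["bigcorp"])  # bigcorp.com is findable despite having no text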

The rest of the article covers:

The process
Obviously it would be impractical to run the algorithm once for every page for every query, so Google splits the problem down.
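
How the split works is covered later in the talk, but the commonly described shape of the solution is to partition the index into shards, search them in parallel, and merge each shard's best hits. A simplified sketch (the shard layout and scoring function here are stand-ins, not Google's):

    import heapq

    def search_shard(shard, query, k):
        # Score one shard's documents; substring counting stands in
        # for real relevance scoring.
        return heapq.nlargest(k, ((doc.count(query), doc) for doc in shard))

    def search(shards, query, k=10):
        # Each shard would live on its own machine and be searched in
        # parallel; here we simply loop, then merge the per-shard top-k.
        partials = [hit for shard in shards for hit in search_shard(shard, query, k)]
        return heapq.nlargest(k, partials)

    shards = [["google search engine", "the web index"],
              ["search quality", "cheap hardware"]]
    print(search(shards, "search", k=2))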

The hardware
"Even though it is a big problem", said Hölzle, "it is tractable, and not just technically but economically too. You can use very cheap hardware, but to do this you have to have the right software."

The scalability
Google has two crucial factors in its favour. First, the whole problem is what Hölzle calls embarrassingly parallel: if you double the amount of hardware, you can double performance (or capacity, if you prefer). The important point is that there are no diminishing returns, as there would be with less parallelisable problems.
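
Concretely, a toy model of that claim: because shards are searched independently, capacity is simply additive, and each machine's share of the corpus stays manageable:

    # Toy model of embarrassingly parallel scaling (no coordination
    # overhead is assumed, which is the defining property).
    def capacity_qps(machines, qps_per_machine=10):
        return machines * qps_per_machine

    def per_machine_docs(total_docs, machines):
        return total_docs / machines

    assert capacity_qps(60) == 2 * capacity_qps(30)   # double hardware, double capacity
    assert per_machine_docs(4e9, 2000) == 2e6         # a 2,000-PC cluster: 2M pages each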

The second factor in Google's favour is the falling cost of hardware.

Other technical challenges

Quality of results: One big source of complaints about Google is the growing prominence of commercial search results -- in particular price-comparison engines and e-commerce sites. Hölzle is quick to defend Google's performance "on every metric", but admits there is a problem with the Web getting, as he puts it, "more commercial". Even three years ago, he said, the Web had much more of a grass-roots feeling to it. "We have thought of having a button saying 'give me less commercial results'," he said, but the company has so far shied away from implementing it.

This work is licensed under a Creative Commons License.