[Paleopsych] ZDNet: The magic that makes Google tick

Premise Checker checker at panix.com
Sat Jan 22 15:39:25 UTC 2005

The magic that makes Google tick
Matt Loney, ZDNet UK
December 02, 2004

The numbers alone are enough to make your eyes water.

# Over four billion Web pages, each an average of 10KB, all fully indexed.
# Up to 2,000 PCs in a cluster.
# Over 30 clusters.
# 104 interface languages including Klingon and Tagalog.
# One petabyte of data in a cluster -- so much that hard disk error rates of
# 10^-15 begin to be a real issue.
# Sustained transfer rates of 2Gbps in a cluster.
# An expectation that two machines will fail every day in each of the larger
# clusters.
# No complete system failure since February 2000.

It is one of the largest computing projects on the planet, arguably employing
more computers than any other single, fully managed system (we're not
counting distributed computing projects here), some 200 computer science
PhDs, and 600 other computer scientists.

And it is all hidden behind a deceptively simple, white, Web page that
contains a single one-line text box and a button that says Google Search.

When Arthur C. Clarke said that any sufficiently advanced technology is
indistinguishable from magic, he was alluding to the trick of hiding the
complexity of the job from the audience, or the user. Nobody hides the
complexity of the job better than Google does; so long as we have a
connection to the Internet, the Google search page is there day and night,
every day of the year, and it is not just there, but it returns results.
Google recognises that the returns are not always perfect, and there are
still issues there -- more on those later -- but when you understand the
complexity of the system behind that Web page you may be able to forgive the
imperfections. You may even agree that what Google achieves is nothing short
of sorcery.

On Thursday evening, Google's vice-president of engineering, Urs Hölzle, who
has been with the company since 1999 and who is now a Google fellow, gave an
would-be Google employees an insight into just what it takes to run an
operation on such a scale, with such reliability. ZDNet UK snuck in the back
to glean some of the secrets of Google's magic.

Google's vision is broader than most people imagine, said Hölzle: "Most
people say Google is a search engine but our mission is to organise
information to make it accessible."

Behind that, he said, comes a vast scale of computing power based on cheap,
no-name hardware that is prone to failure. There are hardware malfunctions
not just once, but time and time again, many times a day.

Yes, that's right, Google is built on imperfect hardware. The magic is
writing software that accepts that hardware will fail, and expeditiously
deals with that reality, says Hölzle.

Google indexes over four billion Web pages, using an average of 10KB per
page, which comes to about 40TB. Google is asked to search this data over
1,000 times every second of every day, and typically comes back with
sub-second response rates. If anything goes wrong, said Hölzle, "you can't
just switch the system off and switch it back on again."

How to slam spam
The job is not helped by the nature of the Web. "In academia," said Hölzle,
"the information retrieval field has been around for years, but that is for
books in libraries. On the Web, content is not nicely written -- there are
many different grades of quality."

Some, he noted, may not even have text. "You may think we don't need to know
about those but that's not true -- it may be the home page of a very large
company where the Webmaster decided to have everything graphical. The company
name may not even appear on the page."

Google deals with such pages by regarding the Web not as a collection of text
documents, but a collection of linked text documents, with each link
containing valuable information.

"Take a link pointing to the Stanford university home page," said Hölzle.
"This tells us several things: First, that someone must think pointing to
Stanford is important. The text in the link also gives us some idea of what
is on the page being pointed to. And if we know something about the page that
contains the link we can tell something about the quality of the page being
linked to."

This knowledge is encapsulated in Google's famous PageRank algorithm, which
looks not just at the number of links to a page but at the quality or weight
of those links, to help determine which page is most likely to be of use, and
so which is presented at the top of the list when the search results are
returned to the user. Hölzle believes the PageRank algorithm is 'relatively'
spam resistant; those interested in exactly how it works can find more
detail in the original PageRank paper.
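The idea can be sketched with the textbook power-iteration form of PageRank: a page's rank is the chance a random surfer lands there, fed by the ranks of the pages linking to it. This is the published algorithm in miniature, not Google's production implementation, and the example graph is invented.

```python
# Textbook PageRank by power iteration -- a sketch of the published
# algorithm, not Google's production code.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:          # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

# Hypothetical three-page web: two pages link to "stanford".
web = {"stanford": ["papers"], "papers": ["stanford"], "blog": ["stanford"]}
ranks = pagerank(web)
# "stanford" ends up ranked highest: it receives the most inbound weight.
```

Note that what matters is not the raw link count but where the links come from: "blog" has an inbound count of zero, so the rank it passes on is small, while "papers" inherits weight from "stanford" and passes it back.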


Obviously it would be impractical to run the algorithm once for every page
for every query, so Google breaks the problem down.

When a query comes in to the system it is sent off to index servers, which
contain an index of the Web. This index is a mapping of each word to each
page that contains that word. For instance, the word 'Imperial' will point to
a list of documents containing that word, and similarly for 'College'. For a
search on 'Imperial College' Google does a Boolean 'AND' operation on the two
words to get a list of what Hölzle calls 'word pages'.

"We also consider additional data, such as where in the page does the word
occur: in the title, the footnote, is it in bold or not, and so on.
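The index structure described above can be sketched in a few lines: map each word to the set of documents containing it, then intersect the sets for a multi-word query. This is only an illustration of the idea, not Google's data structures, and the documents are invented.

```python
# Minimal inverted index with a Boolean AND query -- an illustration
# of the idea described above, not Google's actual index format.

def build_index(docs):
    """docs: dict of doc_id -> text. Returns word -> set of doc_ids."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def search_and(index, query):
    """Intersect the posting sets of every query word (Boolean AND)."""
    postings = [index.get(w.lower(), set()) for w in query.split()]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "Imperial College London",
    2: "Imperial measurements",
    3: "College admissions",
}
index = build_index(docs)
print(search_and(index, "Imperial College"))   # {1}
```

Only document 1 contains both words, so only it survives the intersection; a real index would also carry the positional and formatting data Hölzle mentions.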

Each index server indexes only part of the Web, as the whole Web will not fit
on a single machine -- certainly not the type of machines that Google uses.
Google's index of the Web is distributed across many machines, and the query
gets sent to many of them -- Google calls each one a shard (of the Web). Each
one works on its part of the problem.

Google computes the top 1000 or so results, and those come back as document
IDs rather than text. The next step is to use document servers, which contain
a copy of the Web as crawled by Google's spiders. Again the Web is
essentially chopped up so that each machine contains one part of the Web.
When a match is found, it is sent to the ad server which matches the ads and
produces the familiar results page.
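The fan-out-and-merge pattern described above can be sketched as follows. The shard layout and scoring function here are hypothetical, invented purely to show the shape of the computation: each shard scores its own slice, and the coordinator merges the partial lists into one ranked result.

```python
# Sketch of fanning a query out to index shards and merging the partial
# results. Shard format and scoring are invented for illustration.

def query_shard(shard, query_words):
    """shard: dict of doc_id -> {word: score}. Returns (score, doc_id)
    pairs for documents containing every query word."""
    return [(sum(words[w] for w in query_words), doc_id)
            for doc_id, words in shard.items()
            if all(w in words for w in query_words)]

def search(shards, query, top_k=10):
    query_words = query.lower().split()
    hits = []
    for shard in shards:          # in production this fan-out is parallel
        hits.extend(query_shard(shard, query_words))
    hits.sort(reverse=True)       # highest score first
    return [doc_id for score, doc_id in hits[:top_k]]

shards = [
    {"doc1": {"imperial": 2, "college": 1}},
    {"doc2": {"imperial": 1}, "doc3": {"imperial": 3, "college": 2}},
]
print(search(shards, "Imperial College"))   # ['doc3', 'doc1']
```

The coordinator only ever handles a short list of (score, document ID) pairs from each shard, which is why the result set can come back as IDs and be fleshed out by the document servers afterwards.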

Google's business model works because all this is done on cheap hardware,
which allows it to run the service free-of-charge to users, and charge only
for advertising.

The hardware
"Even though it is a big problem", said Hölzle, "it is tractable, and not
just technically but economically too. You can use very cheap hardware, but
to do this you have to have the right software."

Google runs its systems on cheap, no-name 1U and 2U servers -- so cheap that
Google refers to them as PCs. After all each one has a standard x86 PC
processor, standard IDE hard disk, and standard PC reliability -- which means
it is expected to fail once in three years.

On a PC at home, that is acceptable for many people (if only because they're
used to it), but on the scale that Google works at it becomes a real issue;
in a cluster of 1,000 PCs you would expect, on average, one to fail every
day. "On our scale you cannot deal with this failure by hand," said Hölzle.
"We wrote our software to assume that the components will fail and we can
just work around it. This software is what makes it work."
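The expected failure rate quoted above falls out of simple arithmetic:

```python
# The arithmetic behind "one failure a day per thousand machines":
# if each PC fails roughly once in three years, a 1,000-node cluster
# sees about one failure per day in expectation.

machines = 1000
mtbf_days = 3 * 365                  # one failure per machine every ~3 years
failures_per_day = machines / mtbf_days
print(round(failures_per_day, 2))    # 0.91
```

The same calculation with 2,000 machines gives the roughly two failures a day mentioned for Google's larger clusters.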

One key idea is replication. "This server that contains this shard of the
Web, let's have two, or 10," said Hölzle. "This sounds expensive, but if you
have a high-volume service you need that replication anyway. So you have
replication and redundancy for free. If one fails you have a 10 percent
reduction in capacity but no visible failures, so long as the load balancer
works. So failure becomes a manageable event.
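The load-balancing behaviour described here can be sketched as a front end that round-robins across the replicas of a shard and silently skips dead ones. The class and backend interface are hypothetical, invented to show the failover logic only.

```python
# Sketch of replica failover: round-robin across copies of a shard,
# skipping replicas that have failed. Interfaces are invented.
import itertools

class Backend:
    def __init__(self, alive=True):
        self.alive = alive
    def query(self, q):
        if not self.alive:
            raise ConnectionError("machine down")
        return f"results for {q!r}"

class ReplicatedShard:
    def __init__(self, replicas):
        self.replicas = replicas
        self._cycle = itertools.cycle(replicas)

    def query(self, q):
        for _ in range(len(self.replicas)):   # try each replica at most once
            backend = next(self._cycle)
            try:
                return backend.query(q)
            except ConnectionError:           # dead machine: cut it out
                continue
        raise RuntimeError("all replicas down")

# One of three replicas has failed; queries still succeed.
shard = ReplicatedShard([Backend(alive=False), Backend(), Backend()])
print(shard.query("imperial college"))
```

From the user's point of view the failed replica is invisible; the only effect is that the surviving copies each carry a little more load.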

In reality, he said, Google probably has "50 copies of every server". Google
replicates servers, sets of servers and entire data centres, added Hölzle,
and has not had a complete system failure since February 2000. Back then it
had a single data centre, and the main switch failed, shutting the search
engine down for an hour. Today the company mirrors everything across multiple
independent data centres, and the fault tolerance works across sites, "so if
we lose a data centre we can continue elsewhere -- and it happens more often
than you would think. Stuff happens and you have to deal with it."

A new data centre can be up and running in under three days. "Our data centre
now is like an iMac," said Hölzle. "You have two cables, power and data. All
you need is a truck to bring the servers in and the whole burning in,
operating system install and configuration is automated."

Working around failure of cheap hardware, said Hölzle, is fairly simple. If a
connection breaks it means that machine has crashed so no more queries are
sent to it. If there is no response to a query then again that signals a
problem, and it can cut it out of the loop.

That is redundancy taken care of, but what about scaling? The Web grows every
year, as do the number of people using it, and that means more strain on
Google's servers.


Google has two crucial factors in its favour. First, the whole problem is
what Hölzle refers to as embarrassingly parallel, which means that if you
double the amount of hardware, you can double performance (or capacity if you
prefer -- the important point is that there are no diminishing returns as
there would be with less parallel problems).

The second factor in Google's favour is the falling cost of hardware. If the
index size doubles, the embarrassingly parallel nature of the problem means
Google can simply double the number of machines. So every year, as the Web
gets bigger and requires more hardware to index, search and return pages,
hardware gets cheaper, and it "more or less evens out", to use Hölzle's words.

As the scale of the operation increases, it introduces some particular
problems that would not be an issue on smaller systems. For instance, Google
uses IDE drives for all its storage. They are fast and cheap, but not highly
reliable. To help deal with this, Google developed its own file system --
called the Google File System, or GFS -- which assumes an individual unit of
storage can go away at any time either because of a crash, a lost disk or
just because someone stepped on a cable.

The power of three
There are no disk arrays within individual PCs; instead Google stores every
bit of data in triplicate on three machines on three racks on three data
switches to make sure there is no single point of failure between you and the
data. "We use this for hundreds of terabytes of data," said Hölzle.

Don't expect to see GFS on a desktop near you any time soon -- it is not a
general-purpose file system. For instance, a GFS block size is 64MB, compared
with the more usual 2KB on a desktop file system. Hölzle said Google has 30
plus clusters running GFS, some as large as 2,000 machines with petabytes of
storage. These large clusters can sustain read/write speeds of 2Gbps -- a
feat made possible because each PC manages 2Mbps.

Once, said Hölzle, "someone disconnected an 80-machine rack from a GFS
cluster, and the computation slowed down as the system began to re-replicate
and we lost some bandwidth, but it continued to work. This is really
important if you have 2,000 machines in a cluster." With 2,000 machines you
can expect to see two failures a day.

Running thousands of cheap servers with relatively high failure rates is not
an easy job. Standard tools don't work at this scale, so Google has had to
develop them in-house. Some of the other challenges the company continues to
face include:

Debugging: "You see things on the real site you never saw in testing because
some special set of circumstances that create a bug," said Hölzle. "This can
create non-trivial but fun problems to work on."

Data errors: A regular IDE hard disk will have an error rate in the order of
10^-15 -- that is one millionth of one billionth of the data written to it may
get corrupted and the hard-disk's own error checking will not pick it up.
"But when you have a petabyte of data you need to start worrying about these
failures," said Hölzle. "You must expect that you will have undetected bit
errors on your disk several times a month, even with hardware checking
built-in, so GFS does have an extra level of checksumming. Again this is
something we didn't expect, but things happen."
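The principle of an extra checksumming layer can be sketched as follows: store a checksum alongside each block on write, and verify it on every read, so a silent bit flip the disk's own error checking missed is still caught. This is just the principle using CRC32; it is not GFS code, and the block size is illustrative.

```python
# Sketch of an application-level checksum layer over unreliable disks,
# in the spirit of GFS's extra checksumming. Not GFS code.
import zlib

BLOCK = 64 * 1024   # checksum granularity; illustrative, not GFS's value

def write_blocks(data):
    """Split data into blocks, storing a CRC32 alongside each."""
    return [(zlib.crc32(data[i:i + BLOCK]), data[i:i + BLOCK])
            for i in range(0, len(data), BLOCK)]

def read_blocks(stored):
    """Reassemble the data, verifying every block against its checksum."""
    out = bytearray()
    for crc, block in stored:
        if zlib.crc32(block) != crc:
            raise IOError("silent corruption detected; re-read a replica")
        out += block
    return bytes(out)

stored = write_blocks(b"x" * 200_000)
assert read_blocks(stored) == b"x" * 200_000

# Flipping a single bit simulates an undetected disk error:
crc, block = stored[0]
stored[0] = (crc, bytes([block[0] ^ 1]) + block[1:])
# read_blocks(stored) would now raise IOError
```

Because the data is replicated in triplicate, a failed verification is recoverable: the reader simply fetches the block from another copy.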

Spelling: Google wrote its own spell checker, and maintains that nobody knows
as many spelling errors as it does. The amount of computing power available
at the company means it can afford to begin teaching the system which words
are related -- for instance "Imperial", "College" and "London". It's a job
that took many CPU years, and which would not have been possible without these
thousands of machines. "When you have tons of data and tons of computation
you can make things work that don't work on smaller systems," said Hölzle.
One goal of the company now is to develop a better conceptual understanding
of text, to get from the text string to a concept.
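A data-driven spell checker of the kind described can be sketched in miniature: generate every string one edit away from the misspelling, keep those that appear in the corpus, and prefer the most frequent. This toy is nothing like Google's system in scale or sophistication; corpus and examples are invented.

```python
# Toy data-driven spell corrector: candidates at edit distance 1,
# ranked by corpus frequency. Nothing like Google's actual system.
from collections import Counter

def edits1(word):
    """All strings one insert, delete, replace or transpose away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    return set(deletes + replaces + inserts + transposes)

def correct(word, word_counts):
    if word in word_counts:                      # already a known word
        return word
    candidates = edits1(word) & word_counts.keys()
    return max(candidates, key=word_counts.get) if candidates else word

# Invented corpus frequencies:
counts = Counter(["imperial"] * 50 + ["college"] * 40 + ["london"] * 30)
print(correct("collge", counts))    # 'college'
print(correct("imperal", counts))   # 'imperial'
```

The key point matches Hölzle's: the quality of such a corrector comes almost entirely from the size of the corpus of observed text, which is where the thousands of machines earn their keep.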

Power density: "There is an interesting problem when you use PCs," said
Hölzle. "If you go to a commercial data centre and look at what they can
support, you'll see a typical design allowing for 50W to 100W per square
foot. At 200W per square foot you notice the sales person still wants to sell
it but their international tech guy starts sweating. At 300W per square foot
they cry out in pain."

Eighty mid-range PCs in a rack, of which you will find many dozens in a
Google data centre, produce over 500W per square foot. "So we're not going to
blade technology," said Hölzle. "We're already too dense. Finally Intel has
realised this is a problem and is now focusing more on power efficiency, but
it took some time to get the message across."

Quality of search results: One big area of complaints for Google is connected
to the growing prominence of commercial search results -- in particular price
comparison engines and e-commerce sites. Hölzle is quick to defend Google's
performance "on every metric", but admits there is a problem with the Web
getting, as he puts it, "more commercial". Even three years ago, he said, the
Web had much more of a grass-roots feeling to it. "We have thought of having
a button saying 'give me less commercial results'," he said, but the company
has so far shied away from implementing it.

ZDNet UK's Matt Loney reported from London.
