Searching for fun
The sport of Googlewhacking involves typing in two words that cause Google to return exactly one result, i.e. a single web page on which both words appear, known as a “pure whack”. If you achieve this, and it is quite difficult, you can submit it to http://www.googlewhack.com/. Of course, as soon as the pure whack is listed on another web site it ceases to be one, because Google indexes the new page and returns more than one result.
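The “exactly one result” condition, and the way a whack destroys itself once it is published, can be sketched with a toy index. This is only an illustration: the `pages` dictionary, the example URLs and the function names are made up for the example, and matching whole words in a handful of strings is an assumption standing in for a real search engine.

```python
def pages_matching(pages, word_a, word_b):
    """Return the URLs of pages that contain both words (whole-word match)."""
    wanted = {word_a.lower(), word_b.lower()}
    return [url for url, text in pages.items()
            if wanted <= set(text.lower().split())]


def is_pure_whack(pages, word_a, word_b):
    """A pure whack is a two-word query that matches exactly one page."""
    return len(pages_matching(pages, word_a, word_b)) == 1


# Hypothetical toy "web": URL -> page text.
pages = {
    "one.example": "ambidextrous scallywags sail at dawn",
    "two.example": "scallywags everywhere",
    "three.example": "ambidextrous jugglers",
}

whack_before = is_pure_whack(pages, "ambidextrous", "scallywags")

# Publishing the whack on a second page adds another match.
pages["whacklist.example"] = "latest find ambidextrous scallywags"
whack_after = is_pure_whack(pages, "ambidextrous", "scallywags")
```

In this toy setup `whack_before` is true and `whack_after` is false, which is the self-defeating behaviour described above.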
Google isn't just software
Google invented PageRank, and it has also implemented an amazing system of hardware and software to make it all work.
Currently Google indexes over 4 million million web pages, each averaging 10K in size. Indexing them, and responding to users in a split second, needs a distributed architecture of 30 clusters with up to 2000 machines in each. Each cluster can store around 1 petabyte (i.e. a million gigabytes) of data and uses network connectivity that provides sustained transfer rates of 2Gbps. The machines are all low-cost servers, and redundancy ensures that when a machine fails the system carries on working.
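To get a feel for the scale, the quoted figures can be multiplied out. This is a back-of-the-envelope sketch using only the numbers in the paragraph above. Taking 1K as 1000 bytes and assuming every cluster has the full 2000 machines are assumptions made for the arithmetic, and the variable names are purely illustrative.

```python
# Figures quoted in the text; 1K is taken as 1000 bytes (an assumption).
PAGES = 4 * 10**12            # "4 million million" pages
AVG_PAGE_BYTES = 10 * 10**3   # 10K per page
CLUSTERS = 30
MACHINES_PER_CLUSTER = 2000   # the quoted upper limit, assumed for every cluster
CLUSTER_BYTES = 10**15        # 1 petabyte per cluster

PETABYTE = 10**15
GIGABYTE = 10**9

raw_page_bytes = PAGES * AVG_PAGE_BYTES            # size of all pages, uncompressed
total_capacity_bytes = CLUSTERS * CLUSTER_BYTES    # storage across all clusters
total_machines = CLUSTERS * MACHINES_PER_CLUSTER
bytes_per_machine = CLUSTER_BYTES // MACHINES_PER_CLUSTER

print(raw_page_bytes // PETABYTE, "PB of raw page data")
print(total_capacity_bytes // PETABYTE, "PB of cluster storage")
print(total_machines, "machines at most")
print(bytes_per_machine // GIGABYTE, "GB of storage per machine")
```

On these assumptions the raw pages come to about 40 petabytes against roughly 30 petabytes of quoted cluster storage, with around 500 gigabytes per machine, so the figures are best read as orders of magnitude and not as an exact storage budget.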
The entire web index is stored on multiple machines each holding a “shard” of the web. When a user’s query arrives it is sent to each shard. The top 1000 or so results come back as document ID numbers. The documents in question are then retrieved from Google’s document servers which store a copy of the web as retrieved by the Google web bots.
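The flow just described (send the query to every shard, collect the best document IDs, then fetch the documents) can be sketched in a few lines. This is a minimal illustration and not Google's implementation: the in-memory shards, the `doc_store` dictionary, the function names and the scoring rule (a count of matching query words) are all assumptions made for the example.

```python
import heapq


def search_shard(shard, query_words, limit):
    """Score one shard's documents and return its best (score, doc_id) pairs.

    `shard` maps doc ID -> set of words on that page (hypothetical layout).
    The score is just the number of query words present, an assumption
    standing in for a real ranking function.
    """
    hits = []
    for doc_id, words in shard.items():
        score = sum(1 for word in query_words if word in words)
        if score:
            hits.append((score, doc_id))
    return heapq.nlargest(limit, hits)


def search(shards, doc_store, query, limit=1000):
    """Send the query to every shard, merge the results, fetch the documents."""
    query_words = query.lower().split()
    partial = []
    for shard in shards:  # a real system would query the shards in parallel
        partial.extend(search_shard(shard, query_words, limit))
    best = heapq.nlargest(limit, partial)  # overall top results, as doc IDs
    return [doc_store[doc_id] for _, doc_id in best]


# Toy data: two shards of the index plus a separate document store.
shards = [
    {1: {"google", "whack"}, 2: {"google"}},
    {3: {"whack", "google", "fun"}},
]
doc_store = {1: "page one", 2: "page two", 3: "page three"}

results = search(shards, doc_store, "google whack", limit=2)
```

Each shard only returns document IDs and scores, so the bulky page contents are fetched once, and only for the results that survive the merge.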
The page that the user sees is then assembled by an ad server which drops in appropriate advertising and sponsored links so that Google makes money and you don’t have to pay for a search. All of the software that Google uses has been built from scratch on top of Red Hat Linux. It even uses its own file system, the Google File System or GFS, to make storage more suitable for the task.
The exact details of Google's current operation are kept secret, as are the locations of its data processing centers. From a hardware point of view Google's system is also interesting, but only because of the choice to use off-the-shelf hardware. Instead of using high-cost servers, Google takes standard machines, makes some hardware modifications such as adding battery backup, and then uses them in its own grid computing system.
In many ways the success of Google is as much about its hardware and implementation as about the use of the PageRank algorithm.
The search engine war
As the web became commercialised, being returned by a search engine became a matter of economic importance.
Once you know how a search engine works there are things that you can do to improve your web site’s ranking. There are people who will do the job for you - Search Engine Optimisers or SEOs. Some SEOs use legitimate techniques to promote a site and are just making sure it gets the rank it deserves. There are, however, SEOs who use dirty tricks – sometimes called search engine spamming. These are designed to manipulate the search engine into returning the web site as often as possible even when it is unwarranted.
One such technique is “cloaking”, which shows one version of a page to a search engine and another to a human visitor. For example, a site could include the entire contents of another high-ranking site but hide it from view by using the same text colour as the background. There are even “link farms”, which create thousands of links to your website on demand. The result is that search engines that use the number of incoming links are fooled into giving your site a higher rating than it deserves.
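Why a link farm works against a naive ranking is easy to demonstrate. The sketch below assumes a search engine that ranks pages purely by how many distinct pages link to them, which is a deliberate simplification and not how Google ranks pages. The site names and the `rank_by_incoming_links` function are invented for the example.

```python
from collections import Counter


def rank_by_incoming_links(links):
    """Rank target pages by their number of distinct incoming links.

    `links` is an iterable of (source_page, target_page) pairs.
    Duplicate links and self-links are ignored; ties are broken by name.
    """
    counts = Counter(target for source, target in set(links)
                     if source != target)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [page for page, _ in ordered]


# A small, honest web: three sites recommend good.example.
honest_web = [
    ("a.example", "good.example"),
    ("b.example", "good.example"),
    ("c.example", "good.example"),
    ("a.example", "spam.example"),
]

# A link farm: a thousand generated pages that all point at spam.example.
link_farm = [(f"farm{i}.example", "spam.example") for i in range(1000)]

ranking_before = rank_by_incoming_links(honest_web)
ranking_after = rank_by_incoming_links(honest_web + link_farm)
```

Before the farm is added `good.example` comes first; afterwards `spam.example` overtakes it, even though nothing about either site's content has changed.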
The war for search engine ranking isn’t static. It didn’t take long for ranking algorithms to be adjusted to detect link farms, and in turn for link farms to respond to the countermeasures by generating link pages dynamically, as fast as the bots could download them. It’s a very secretive war because information is the main weapon.
Google’s patent is United States Patent Application 20050071741; a search on that number finds just the patent.