If you have ever thought that you could do a better job than Google but were intimidated by the hardware needed to build a web index, then the Common Crawl Foundation has a solution for you.
Google’s stranglehold on search information is seen by many as being contrary to the Web’s ethos of freely available information and openness. Of course there is nothing stopping you from setting up your own search facility in opposition to Google, Bing or any other search engine for that matter, but the hardware investment would be huge. Google, for example, has custom-built data centers that do nothing but index the web by reading each page and processing the information it contains - this is generally called crawling the web.
Now we have a way to access an index created to make the web more open. The new index has been announced by the Common Crawl Foundation.
It currently consists of an index of 5 billion web pages, their page rank, their link graphs and other metadata, all hosted on Amazon's cloud infrastructure. The index is open and freely accessible to any user via EC2.
Lisa Green, director of the Common Crawl Foundation, says on the Foundation's blog that Gil Elbaz started the Common Crawl Foundation to act on the belief that it is crucial to our information-based society that web crawl data be open and accessible to anyone who wants to use it.
The Common Crawl Foundation aims to make use of cheaper crawling and storage costs for the common benefit. The Foundation says:
"Common Crawl is a Web Scale crawl, and as such, each version of our crawl contains billions of documents from the various sites that we are successfully able to crawl. This dataset can be tens of terabytes in size, making transfer of the crawl to interested third parties costly and impractical. In addition to this, performing data processing operations on a dataset this large requires parallel processing techniques, and a potentially large computer cluster. Luckily for us, Amazon's EC2/S3 cloud computing infrastructure provides us with both a theoretically unlimited storage capacity coupled with localized access to an elastic compute cloud."
The crawl data is stored in Amazon S3, which means that you can access it from an Amazon EC2 instance without even having to pay a data transfer charge. You will probably have to pay for the EC2 instance, but your startup costs are likely to be a lot less than building a data center to do the job. At the very least you should be able to get a proof of concept up and running for very little investment.
The architecture of the crawl service is itself a testament to open source software, being based on Hadoop, HDFS and a custom web crawler. The crawl is collated using a MapReduce process and compressed into 100MB ARC files, which are then uploaded to S3 storage buckets for you to access. Currently there are between 40,000 and 50,000 filled buckets waiting for you to search.
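To get a feel for the scale, the figures above can be turned into a rough size estimate. The following is only a back-of-the-envelope sketch: it assumes that each of the 40,000 to 50,000 stored units is a single compressed ARC file of about 100MB, which the announcement does not state precisely, and the function name is made up for illustration.

```python
# Back-of-the-envelope estimate of the compressed crawl size,
# using only the figures quoted in the article.
ARC_FILE_MB = 100                  # assumed size of one compressed ARC file
FILE_COUNTS = (40_000, 50_000)     # quoted lower and upper bounds

def total_terabytes(file_count, file_mb=ARC_FILE_MB):
    """Return the total size in terabytes, taking 1 TB = 1,000,000 MB."""
    return file_count * file_mb / 1_000_000

low, high = (total_terabytes(n) for n in FILE_COUNTS)
print(f"Roughly {low:.0f} to {high:.0f} TB of compressed ARC data")
```

Under these assumptions the compressed data comes to a few terabytes. The uncompressed size would be larger, which is consistent with the Foundation's remark that a crawl can run to tens of terabytes.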
To access the data you have to run a Hadoop cluster on the EC2 service to create a map-reduce job that processes the S3 data. You also need to use some custom glue code that allows access to the ARC files. What all this means is that you still do have to work quite hard to get something up and running and you will need to budget approximately $100 per complete map-reduce job on the index.
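If the map-reduce idea is unfamiliar, the pattern can be illustrated without a cluster. The sketch below is not the Foundation's glue code and does not use Hadoop or read ARC files. It simply runs the map, group and reduce phases in memory over a few made-up (URL, content) records to count pages per host. All function names and sample records are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

def map_page(url, content):
    """Map step: emit a (host, 1) pair for each crawled page."""
    host = urlparse(url).netloc.lower()
    yield host, 1

def reduce_counts(host, values):
    """Reduce step: sum the counts emitted for one host."""
    return host, sum(values)

def run_job(records):
    """Run map, shuffle (group by key) and reduce in memory."""
    grouped = defaultdict(list)
    for url, content in records:
        for key, value in map_page(url, content):
            grouped[key].append(value)
    return dict(reduce_counts(k, v) for k, v in grouped.items())

# Hypothetical records standing in for pages read from ARC files.
sample = [
    ("http://example.com/a", "<html>a</html>"),
    ("http://example.com/b", "<html>b</html>"),
    ("http://example.org/", "<html>c</html>"),
]
print(run_job(sample))
```

On a real cluster the map and reduce functions would run in parallel across many machines, with the framework handling the grouping step, but the division of labour is the same.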
The Foundation is working to build up a GitHub repository of code that has been and can be used to work with Common Crawl data. They also want to hear from any developers with apps they’d like to see built on Common Crawl data, or anyone who has Hadoop scripts that could be adapted to find useful information in the crawl data.
This is an interesting opportunity, but how well it all works depends on the quality of the data and the continued building of the index. Google, for example, claimed to have indexed 1 trillion URLs in 2008, so 5 billion pages, roughly half a percent of that figure, is a good start, but there is obviously room for improvement.
Common Crawl Foundation