You may have missed ScraperWiki - it's one of those really good ideas that tend to remain hidden in obscurity. The latest feature is a PDF to HTML converter which makes it even more worth knowing about.
ScraperWiki is a really great idea. Scraping is the technique of retrieving data from HTML pages. Data embedded in an HTML page is usually formatted for human consumption, and this generally means that it isn't the best format for other applications to process. A scraper is a program designed to download the HTML page, extract that data and then present it in a format another program can use - usually XML or JSON.
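To make the idea concrete, here is a minimal sketch of a scraper in Python, using only the standard library. The HTML fragment and the town/population data in it are invented for illustration; a real scraper would first download the page, but the extract-and-republish step looks much like this:

```python
import json
from html.parser import HTMLParser

# A fragment of the kind of human-oriented HTML a scraper has to deal with
# (the towns and figures are made up for this example).
PAGE = """
<table>
  <tr><td>Anytown</td><td>12500</td></tr>
  <tr><td>Otherville</td><td>8200</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(PAGE)

# Re-present the embedded data as JSON for other programs to consume.
records = [{"town": town, "population": int(pop)} for town, pop in scraper.rows]
print(json.dumps(records))
```

The output is the same data freed from its presentation markup - exactly the transformation a scraper exists to perform.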
What is surprising is that there is so much data on the web which is only available embedded in HTML. Often a government department will make data available in a web page but then lack either the resources or the inclination to make it available for further processing - but scraping can deliver it in a usable form.
The problem with scraping is that HTML is not easy to process to extract data - it is often not regular enough, and it sometimes even changes form when the web site is updated. So what you need is an easy way to create a scraper and, after that, a way to share the data it retrieves for everyone to use. This is the idea behind ScraperWiki.
It provides a number of online templates in PHP and Ruby to give you a head start on creating a scraper. The approach taken is to construct a DOM tree and then extract the data by navigating and manipulating the DOM. This really is the only sensible way to create a scraper, and once you have seen an example it is fairly easy. For non-programmers there is a "request a scraper" facility where members of the wiki will spend a few minutes building a custom scraper. You can also volunteer to fix a broken scraper or document an existing one. At the time of writing there are 58 suggested datasets needing scrapers.
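The DOM-tree approach can be sketched in a few lines. This example uses Python's xml.etree.ElementTree rather than ScraperWiki's own PHP or Ruby templates, and the page fragment is invented, but the pattern - parse into a tree, then walk it with path expressions instead of regular expressions - is the same:

```python
import xml.etree.ElementTree as ET

# A well-formed snippet standing in for a downloaded page
# (the names and scores are made up for this example).
HTML = """<html><body>
<ul id="results">
  <li><span class="name">Alpha</span> <span class="score">3</span></li>
  <li><span class="name">Beta</span> <span class="score">7</span></li>
</ul>
</body></html>"""

root = ET.fromstring(HTML)

# Navigate the DOM: find every list item, then pull out the cells by class.
data = []
for li in root.iter("li"):
    name = li.find("span[@class='name']").text
    score = int(li.find("span[@class='score']").text)
    data.append((name, score))

print(data)
```

Because the extraction is anchored to the document's structure rather than to character positions, small cosmetic changes to the page are less likely to break the scraper - though, as noted above, structural redesigns still will.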
The data obtained by ScraperWiki can be downloaded as a CSV file and shared with other users. The whole thing is open source and so are any scrapers you create. The idea is to free up data that is otherwise locked into HTML. Scrapers can be run on a schedule and you get an email if your scraper fails. There is also an API that allows clients to download from the datastore in JSON, YAML, XML, PHP objects or CSV format.
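From the client's side, consuming such a download is straightforward. This sketch uses an invented CSV payload in place of a real datastore response, and shows the kind of CSV-to-JSON conversion a client might do locally when it wants a different format than the one it fetched:

```python
import csv
import io
import json

# A stand-in for a CSV download from the datastore (the rows are invented).
CSV_DOWNLOAD = "town,population\nAnytown,12500\nOtherville,8200\n"

# Parse the CSV into records, then re-serialise as JSON.
rows = list(csv.DictReader(io.StringIO(CSV_DOWNLOAD)))
print(json.dumps(rows))
```

In practice the API does this conversion server-side - the client simply asks for the format it wants - but the example shows why offering several serialisations of the same datastore is cheap.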
The whole system has been up and running for about a year and is now in beta testing - although, in common with many open source projects, it may well stay in beta for longer than strictly necessary. It all seems to work perfectly well.
The latest feature is a PDF to HTML converter which opens up the possibility of PDF scraping. To quote the ScraperWiki blog:
Scraping PDFs is a bit like cleaning drains with your teeth. It’s slow, unpleasant, and you can’t help but feel you’re using the wrong tools for the job.
Once converted to HTML, the same scraper tools can be used to extract data from what is often called the largest component of the "dark web", i.e. data hidden from search engines by being locked inside a PDF.