Web Scraping with Python (2e)
Author: Ryan Mitchell
Web scraping is a strange activity, but Python is a good choice of language for it.
Web scraping is the name usually given to the activity of programmatically downloading a web page and then extracting the data it contains. If you have never tried to do it, you might think that this is easy. After all, a web page is highly structured, with lots of tags that help you find what you are looking for. In practice it isn't so easy. The problem is that HTML tags aren't there to indicate what and where the data is. They are also often generated by a program, and this usually means lots of additional and mostly redundant <div>s and the like. For example, suppose we have a list of earnings figures for companies. On the page they look like a table that is easy to read, but the tags give no idea of this structure as they are a set of undistinguished <p> tags. If you want to scrape such a site then you have to use a range of techniques to pinpoint the data you need, which often involves using tag positions (i.e. tags nested within tags), tag attributes, and even the natural language content of the page.
This is difficult, and the temptation is to fall into a disorganized way of working that treats everything as a unique problem with a highly specific structure. For example, you might rely on the rule that the data for the US is in the third <p> tag after the <h2> tag. This might work for a while, but when the site designer changes the <h2> to an <h3>, your program stops working.
In short, web scraping is hard to do right.
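To make the point concrete, here is a minimal sketch that finds a figure by its label rather than by its position on the page. It uses only the standard library; the page snippets, the "Country: figure" layout and the names `ParagraphCollector` and `earnings_for` are assumptions made up for illustration, not anything taken from the book.

```python
import re
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element, in document order."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data


def earnings_for(html, country):
    """Return the figure from the paragraph labelled `country`, or None."""
    collector = ParagraphCollector()
    collector.feed(html)
    # Match on the visible label, not on where the paragraph sits.
    pattern = re.compile(rf"^\s*{re.escape(country)}\s*:\s*([\d.,]+)")
    for text in collector.paragraphs:
        match = pattern.match(text)
        if match:
            return match.group(1)
    return None


# Hypothetical page before and after a redesign: the heading level changes,
# a note is added and the paragraphs are reordered.
OLD_PAGE = "<h2>Earnings</h2><p>UK: 1.2</p><p>France: 0.9</p><p>US: 3.4</p>"
NEW_PAGE = "<h3>Earnings</h3><p>Note: figures in $bn</p><p>US: 3.4</p><p>UK: 1.2</p>"

us_before = earnings_for(OLD_PAGE, "US")
us_after = earnings_for(NEW_PAGE, "US")
```

A "third <p> after the <h2>" rule would break on the redesigned page, whereas the label-based lookup still finds the US figure. It is of course only as robust as the label itself.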
This particular book is written by someone who has had plenty of experience and recommends using Python and its associated tools. The first chapter introduces Beautiful Soup which, despite its name, is a very serious web scraper. You can use it to download a page and process it using a range of parsing functions. Chapter 2 goes over its HTML parsing functions, including regular expressions. This is where most of the clever things involved in web scraping are discussed. You need to figure out ways of picking out the data that are as robust as possible. You need to choose some condition that is unlikely to change when the page is reorganized or redesigned, and this is still difficult even if you have all of the sophisticated parsing provided by Beautiful Soup. This second edition uses the latest versions of all of the software it describes, and so it will be some months before it starts to be out of date.
The next two chapters deal with web crawling. This is another aspect of web scraping. Sometimes you don't need to master this, as you only need to get data from a fixed number of known pages, but if you are trying to gather data from the web in general then you need a web crawler. The idea of a web crawler isn't a difficult one, but getting the way it works right can be hard if you are to avoid circular paths and dead ends. Chapter 5 introduces Scrapy, another web scraping tool that includes web crawling.
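As a rough illustration of the kind of selection this involves, the sketch below uses Beautiful Soup (the third-party `bs4` package) to pick out data by attribute and by regular expression instead of by position. The HTML, the `id` and `class` names and the image paths are all invented for the example and are not taken from the book.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical page fragment; every id, class and path here is made up.
HTML = """
<html><body>
  <div id="nav"><a href="/about">About</a></div>
  <table id="earnings">
    <tr class="row"><td class="company">Acme</td><td class="figure">$1,200</td></tr>
    <tr class="row"><td class="company">Globex</td><td class="figure">$950</td></tr>
  </table>
  <img src="/img/logo.png"><img src="/img/chart1.jpg"><img src="/img/chart2.jpg">
</body></html>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(HTML, "html.parser")

# Select by attribute: the table's id is more likely to survive a redesign
# than its position among the page's other elements.
table = soup.find("table", id="earnings")
earnings = {
    row.find("td", class_="company").get_text(strip=True):
        row.find("td", class_="figure").get_text(strip=True)
    for row in table.find_all("tr", class_="row")
}

# A compiled regular expression can be used as an attribute filter.
charts = [
    img["src"]
    for img in soup.find_all("img", src=re.compile(r"chart\d+\.jpg$"))
]
```

Whether an `id` or a filename pattern is the more stable hook depends entirely on the site; the point is only that the condition is stated explicitly and can be changed in one place.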
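The circular-path and dead-end problem can be sketched without touching the network. The following is a minimal breadth-first crawler, assuming a caller-supplied `get_links` function; that function, the `SITE` dictionary and the `max_pages` limit are hypothetical stand-ins, and a real crawler (or Scrapy) would do a good deal more.

```python
from collections import deque


def crawl(start, get_links, max_pages=100):
    """Breadth-first crawl from `start`, visiting each URL at most once.

    `get_links(url)` is supplied by the caller and returns the URLs linked
    from `url`; it may raise LookupError for a page that cannot be fetched.
    """
    seen = {start}          # every URL ever queued, so loops are not followed twice
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            links = get_links(url)
        except LookupError:
            continue        # dead end: skip this page but keep crawling
        visited.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical site: "/" and "/a" link to each other (a cycle) and
# "/a" links to a page that does not exist (a dead end).
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/", "/b", "/missing"],
    "/b": ["/a"],
}


def fake_get_links(url):
    return SITE[url]  # raises KeyError, a LookupError, for "/missing"


pages = crawl("/", fake_get_links)
```

The `seen` set is what stops the crawler going round in circles, and the `max_pages` cap is a crude guard against sites that generate links without end.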
The final chapter in the section is all about storing data using CSV files and MySQL. This is a mini-tutorial on MySQL, and helpful if you really need one.
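For the CSV half of this, a minimal sketch using Python's standard `csv` module might look like the following. The rows and column names are invented, and the text is written to an in-memory buffer rather than a file so that the example has no side effects.

```python
import csv
import io

# Hypothetical scraped rows; note the comma inside one company name.
rows = [
    {"company": "Acme", "earnings": "1200"},
    {"company": "Globex, Inc.", "earnings": "950"},
]


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")  # let the csv module control line endings
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, ["company", "earnings"])

# Reading the text back shows that the quoting round-trips.
round_trip = list(csv.DictReader(io.StringIO(csv_text)))
```

Letting the `csv` module do the quoting avoids the classic mistake of joining fields with commas by hand, which breaks as soon as a scraped value contains a comma.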
Part 2 is about advanced scraping. The first two chapters cover very general material - document encoding, CSV files again, PDF and Microsoft Office formats. Chapter 8 covers data cleaning, recognizing that scraped data is often dirty.
Chapter 14 is on the strange topic of avoiding scraping traps - this is one for those wanting to make their websites unscrapable as well as for the scraper. On the same footing is Chapter 15, on testing your website with scrapers. Chapter 16 introduces the idea of parallelism in scraping, and Chapter 17 deals with using remote machines to scrape, thereby hiding your location.
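What "dirty" means in practice is usually stray whitespace, odd Unicode characters and numbers wrapped in formatting. The sketch below shows one plausible cleaning pass using the standard library; the function names `clean_text` and `parse_amount` and the sample strings are assumptions for illustration, not the book's own code.

```python
import re
import unicodedata


def clean_text(raw):
    """Normalize Unicode forms and collapse runs of whitespace."""
    # NFKC folds compatibility characters, e.g. a non-breaking space
    # becomes an ordinary space.
    text = unicodedata.normalize("NFKC", raw)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def parse_amount(raw):
    """Extract the first number from a string such as ' $1,200.50 '."""
    cleaned = clean_text(raw).replace(",", "")  # assumes ',' is a thousands separator
    match = re.search(r"-?\d+(?:\.\d+)?", cleaned)
    return float(match.group()) if match else None


name = clean_text("  Acme\u00a0Corp \n\t Ltd ")
amount = parse_amount(" $1,200.50 ")
missing = parse_amount("n/a")
```

Treating the comma as a thousands separator is itself a guess about the source; figures from a site that uses a decimal comma would need a different rule.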
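As a rough idea of what parallelism in scraping can look like, the sketch below fans a list of URLs out over a thread pool using the standard `concurrent.futures` module. This is one common approach and not necessarily the one the book takes; `fetch_all` and `fake_fetch` are hypothetical names, and the fetch function is a stand-in that does no network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Call `fetch(url)` for each URL on a thread pool; return {url: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `urls`.
        return dict(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url):
    # Stand-in for a real download; a real version would make an HTTP request.
    return f"<html>{url}</html>"


results = fetch_all(["/a", "/b", "/c"], fake_fetch)
```

Threads suit this kind of work because a scraper spends most of its time waiting on the network, although hitting a single site with many workers at once is a quick way to get blocked.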
The final chapter deals with the legal issues raised by scraping. Scraping itself isn't illegal but you still have to obey copyright and trademark laws and similar restrictions.
This is a good book and it's easy to read. It does have some problems for some readers. Scraping is a topic that isn't particularly deep, but it does demand that you are on top of the technology in general. As a result there are places where the book appears to be off topic or teaching you things that you already know. In other words, there are places where you might find it a bit oversimplified. The real problem is that in these areas where it tries to simplify things, it doesn't really give you enough to cope, only enough to get started.
I would have liked more insight into how scraping problems are solved. This is a tall order, because they are usually highly specific to the task in hand. Even so, there are general ways of thinking about the problems and it would be nice to see an attempt to catalog or classify these.
Overall, though, this book is recommended if you want a quick course in web scraping in all its forms. Its main use is to point you in the direction of the right tools for the job.
Last Updated (Saturday, 28 July 2018)