Web Scraping with Python (2e)

Author: Ryan Mitchell 
Publisher: O'Reilly
Pages: 308
ISBN: 978-1491985571
Print: 1491985577
Kindle: B07BMGBYSK
Audience: Budding web scrapers
Rating: 4
Reviewer: Alex Armstrong

Web scraping is a strange activity, but Python is a good choice of language for it.

Web scraping is the name usually given to the activity of programmatically downloading a web page and then extracting the data it contains. If you have never tried to do it, you might think that it is easy. After all, a web page is highly structured with lots of tags that help you find what you are looking for. In practice it isn't so easy. The problem is that HTML tags aren't there to indicate what and where the data is. They are also often generated by a program and this usually means lots of additional and mostly redundant <div>s etc. For example, suppose we have a list of earnings figures for companies. On the page they look like a table that is easy to read, but the tags give no idea of this structure as they are a set of undistinguished <p> tags. If you want to scrape such a site then you have to use a range of techniques to pinpoint the data you need, which often involves using tag positions, i.e. tags nested within tags, tag attributes, and even the natural language content of the page.

This is difficult and the temptation is to fall into a disorganized way of working that treats everything as a unique problem with a highly specific structure. For example, you might use the rule that the data for the US is in the third <p> tag after the <h2> tag. This might work for a while, but when the site designer changes the <h2> to an <h3> your program stops working.
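To make the fragility concrete, here is a minimal sketch of the two kinds of rule. It is not taken from the book. It uses the standard library's html.parser rather than Beautiful Soup so that it runs without extra installs, and the page markup, the figures, and the function names are all invented for illustration.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Record (tag, text) for each heading and paragraph, in document order."""

    TRACKED = {"h2", "h3", "p"}

    def __init__(self):
        super().__init__()
        self.elements = []   # list of [tag, text] pairs
        self._open = False   # True while inside a tracked element

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self.elements.append([tag, ""])
            self._open = True

    def handle_endtag(self, tag):
        if tag in self.TRACKED:
            self._open = False

    def handle_data(self, data):
        if self._open:
            self.elements[-1][1] += data


def elements(html):
    parser = ElementCollector()
    parser.feed(html)
    parser.close()
    return [(tag, text.strip()) for tag, text in parser.elements]


# Hypothetical page: the "table" is really a run of undistinguished <p> tags.
PAGE = "<h2>Earnings</h2><p>UK: 10</p><p>France: 12</p><p>US: 15</p>"
# The same page after a redesign that swaps the <h2> for an <h3>.
REDESIGNED = PAGE.replace("h2", "h3")


def us_by_position(html):
    """Brittle rule: the third <p> after the first <h2>."""
    items = elements(html)
    tags = [tag for tag, _ in items]
    if "h2" not in tags:
        return None  # the anchor the rule depended on has gone
    after = [text for tag, text in items[tags.index("h2") + 1:] if tag == "p"]
    return after[2] if len(after) >= 3 else None


def us_by_content(html):
    """Sturdier rule: find the paragraph whose label is 'US'."""
    for tag, text in elements(html):
        label, _, value = text.partition(":")
        if tag == "p" and label.strip() == "US":
            return value.strip()
    return None
```

With these definitions, us_by_position(PAGE) returns "US: 15" but us_by_position(REDESIGNED) returns None, while us_by_content gives "15" for both versions of the page. The content-based rule is not immune to change either, for example if the label were reworded, but it survives this particular redesign.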

In short, web scraping is hard to do right.

This particular book is written by someone who has had plenty of experience and recommends using Python and its associated tools. The first chapter introduces Beautiful Soup which, despite its name, is a very serious scraping library. Once a page has been downloaded, you can use it to process the page with a range of parsing functions. Chapter 2 goes over its HTML parsing functions, along with regular expressions. This is where most of the clever things involved in web scraping are discussed. You need to figure out ways of picking out the data that are as robust as possible. You need to choose some condition that is unlikely to change when the page is reorganized or redesigned and this is still difficult, even if you have all of the sophisticated parsing provided by Beautiful Soup. This second edition uses the latest versions of all of the software it describes and so it will be some months before it starts to be out of date.

The next two chapters deal with web crawling, another aspect of web scraping. You don't always need to master this, as sometimes you only need to get data from a fixed number of known pages, but if you are trying to gather data from the web in general then you need a web crawler. The idea of a web crawler isn't a difficult one, but getting it to work correctly can be hard if you are to avoid circular paths and dead ends. Chapter 5 introduces Scrapy, another web scraping tool that includes web crawling.
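The usual way to avoid circular paths is to remember which pages have already been queued. The following is a minimal sketch of that idea, not the book's code and not how Scrapy does it. To keep it self-contained, the "site" is an invented in-memory dictionary and get_links is a hypothetical stand-in for downloading a page and extracting its links.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/"],         # links back to the start: a circular path
    "/b": ["/c", "/missing"],  # "/missing" is a dead end (no such page)
    "/c": ["/a"],
}


def get_links(url):
    """Stand-in for fetching a page and parsing out its links."""
    return SITE.get(url, [])


def crawl(start):
    """Breadth-first crawl that visits each URL at most once."""
    visited = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in visited:  # skip anything already queued
                visited.add(link)
                queue.append(link)
    return order
```

Calling crawl("/") returns ["/", "/a", "/b", "/c", "/missing"]. Each page appears once despite the loops, and the dead end simply contributes no further links. A real crawler would also need to normalize URLs, handle download errors, and limit its request rate.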

The final chapter in the section is all about storing data using CSV files and MySQL. This is a mini-tutorial on MySQL, and helpful if you really need one.
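For the CSV half of this, Python's standard csv module does most of the work. The sketch below is an illustration rather than an excerpt from the book. The records and field names are invented, and it writes to an in-memory buffer instead of a file so that it can be run as is.

```python
import csv
import io

# Hypothetical scraped records.
ROWS = [
    {"company": "Acme", "country": "US", "earnings": "15"},
    {"company": "Widget, Inc.", "country": "UK", "earnings": "10"},
]


def to_csv(records):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["company", "country", "earnings"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_csv(text):
    """Read CSV text back into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]
```

The point of using the module rather than joining strings with commas is that values such as "Widget, Inc." are quoted automatically, so from_csv(to_csv(ROWS)) gives back the original records.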

 

 

Part 2 is about advanced scraping. The first two chapters cover very general material - document encoding, CSV files again, PDF and Microsoft Office formats. Chapter 8 covers data cleaning, recognizing that scraped data is often dirty.

The next few chapters deal with more esoteric topics in various forms. Chapter 9 introduces a simple approach to natural language processing - Markov models and the NLTK toolkit. Next we deal with forms, and login forms in particular. JavaScript scraping is next and this includes using Selenium to execute JavaScript in a tame browser so you can find out what it does. Chapter 12 deals with APIs - too big a topic to expect everything to be covered. Chapter 13 takes us into the area of image processing so that you can extract data from images. This explains some libraries - Pillow, Tesseract and NumPy. Don't expect too much; this sort of problem needs lots of processing power and it's still at the edge of research.

Chapter 14 is on the strange topic of avoiding scraping traps - this is one for those wanting to make their websites unscrapable as well as for scrapers. On the same footing is Chapter 15 on testing your website with scrapers. Chapter 16 introduces the idea of parallelism in scraping and Chapter 17 deals with using remote machines to scrape, so hiding your location.

The final chapter deals with the legal issues raised by scraping. Scraping itself isn't illegal but you still have to obey copyright and trademark laws and similar restrictions.

Conclusion 

This is a good book and it's easy to read. It does have some problems for some readers. Scraping is a topic that isn't particularly deep, but it does demand that you are on top of the technology in general. As a result there are places where the book appears to be off topic or teaching you things that you already know. In other words, there are places where you might find it a bit oversimplified. The real problem is that in these areas where it tries to simplify things, it doesn't really give you enough to cope, only enough to get started.

I would have liked more insight into how scraping problems are solved. This, however, is a tall order because they are usually highly specific to the task in hand. Even so, there are general ways of thinking about the problems and it would be nice to see an attempt to catalog or classify these.

Overall, though, this book is recommended if you want a quick course in web scraping in all its forms. Its main use is to point you in the direction of the right tools for the job.

 




 

Last Updated ( Saturday, 28 July 2018 )