Headless Chrome and the Puppeteer Library for Scraping and Testing the Web
Written by Nikos Vaggalis   
Wednesday, 29 November 2017

With the advent of Single Page Applications, scraping pages for information and running automated user interaction tests have become much harder because of the highly dynamic nature of such pages. The solution? Headless Chrome and the Puppeteer library.

While Selenium, PhantomJS and others have been around for years, and headless Chrome and Puppeteer arrived late to the party, the newcomers are valuable additions to the set of web test automation tools that let developers simulate how real users interact with a web site or application.

Headless Chrome is able to run without Puppeteer, as it can be programmatically controlled through the Chrome DevTools Protocol (CDP), typically by attaching to a Chrome instance that was started with a remote debugging port:

chrome --headless --disable-gpu --remote-debugging-port=9222

Then, by loading the protocol's sidekick module 'chrome-remote-interface', which provides a simple abstraction of the protocol's commands and notifications through a straightforward JavaScript API, one can run JavaScript scripts against that instance from a local Node.js installation.
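
As a rough illustration of what working at this level looks like, the following sketch attaches to a Chrome instance listening on port 9222 and saves a screenshot. It assumes 'chrome-remote-interface' has been installed with npm; the file name and the helper names saveScreenshot and decodeScreenshot are made up for this illustration.

```javascript
// cri-screenshot.js (hypothetical file name)
const fs = require('fs');

// Page.captureScreenshot returns the image as a base64 string;
// turn it into bytes that can be written to disk.
function decodeScreenshot(base64Data) {
  return Buffer.from(base64Data, 'base64');
}

async function saveScreenshot(url, file, port = 9222) {
  // Loaded lazily so the helper above works without the package installed.
  const CDP = require('chrome-remote-interface');
  const client = await CDP({ port });
  try {
    const { Page } = client;
    await Page.enable();
    // Subscribe to the load event before navigating so it cannot be missed.
    const loaded = Page.loadEventFired();
    await Page.navigate({ url });
    await loaded;
    const { data } = await Page.captureScreenshot();
    fs.writeFileSync(file, decodeScreenshot(data));
  } finally {
    // Detach from the browser even if one of the steps above failed.
    await client.close();
  }
}
```

Calling saveScreenshot('https://example.com', 'example.png') from an async function would then write the image to the current directory, provided the browser is reachable on that port.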

Puppeteer drives the browser through the same protocol. From its official documentation, here is an example that navigates to https://example.com and saves a screenshot as example.png:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});

  await browser.close();
})();

But since there's 'chrome-remote-interface' already, what does Puppeteer do differently? Puppeteer offers a higher-level API on top of the CDP than the one made available by 'chrome-remote-interface'.

As Paul Irish explains:

(with CDP) it's kind of horrifying that it takes ~75 LOC to take a fullpage screenshot with the protocol. I felt uncomfortable asking all developers to engage with the browser at that layer.

Instead of those 75 lines, with Puppeteer that should be more like:

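const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.google.com/');
  // fullPage captures the whole scrollable page, not just the viewport
  await page.screenshot({path: 'output.png', fullPage: true});
  await browser.close();
})();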

This is more reasonable for all developers who want to write an automation script.

It's even more developer friendly than that, as it also bundles its own Chromium build, downloaded when the package is installed and launched in headless mode by default, so you don't have to start the browser yourself as we did above.
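
If the bundled browser is not the one you want, puppeteer.launch accepts options that override these defaults. The sketch below builds such an options object; 'headless', 'executablePath' and 'args' are genuine launch options, whereas the CHROME_PATH and SHOW_BROWSER environment variables and the helper names are assumptions made purely for this illustration.

```javascript
// Build the options object for puppeteer.launch from environment settings.
// CHROME_PATH and SHOW_BROWSER are hypothetical variable names.
function buildLaunchOptions(env) {
  const options = {
    // Headless is Puppeteer's default; a visible window helps when debugging.
    headless: env.SHOW_BROWSER !== '1',
    args: ['--disable-gpu'],
  };
  if (env.CHROME_PATH) {
    // Use a locally installed Chrome instead of the bundled Chromium.
    options.executablePath = env.CHROME_PATH;
  }
  return options;
}

async function launchBrowser(env = process.env) {
  // Loaded only when a browser is actually needed.
  const puppeteer = require('puppeteer');
  return puppeteer.launch(buildLaunchOptions(env));
}
```

Keeping the option building separate from the launch call means the configuration can be checked without starting a browser.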

So let's see Puppeteer in action in scraping all available image elements from a web page:

// smadeseek1.js
const puppeteer = require("puppeteer");

const url = "http://smadeseek.com/index.html";
const imgSelector = "#about > div > div.row > div img";

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Resolve once the page's load event has fired
    await page.goto(url, { waitUntil: "load" });

    // Runs inside the page: collect the src of every matching img element
    const imgAll = await page.$$eval(imgSelector,
        images => images.map(image => image.src));

    console.log(imgAll);

    await browser.close();
})();

 

 

Results in:

[ 'http://smadeseek.com/Freeze/img/icons/smartphone.png',
  'http://smadeseek.com/Freeze/img/icons/smartwatch.png',
  'http://smadeseek.com/Freeze/img/icons/tablet.png' ]

 

What happens when running 'node smadeseek1.js' is:

  • A headless Chrome instance is launched and a new page is opened in it

  • page.goto navigates to the URL and waits until the page's load event has fired

  • page.$$eval, which wraps document.querySelectorAll, is called with the CSS selector for the elements we are after, that is "#about > div > div.row > div img"

  • The callback iterates over the array we named 'images', which contains all 'img' elements matching the CSS selector, and retrieves their 'src' properties.
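
To make the last two steps more concrete, here is a rough sketch of what page.$$eval amounts to, written with page.evaluate. It is an approximation for illustration rather than Puppeteer's actual implementation, and the names toSrcList and collectSrcs are made up.

```javascript
// The callback from smadeseek1.js, pulled out so it can be tried on plain objects.
const toSrcList = images => images.map(image => image.src);

// Approximately what page.$$eval(selector, toSrcList) does: run
// document.querySelectorAll inside the page and map over the matches.
// The function given to page.evaluate is serialized and executed in the
// browser, so it cannot see Node.js variables; anything it needs has to be
// passed in as an argument, as the selector is here.
function collectSrcs(page, selector) {
  return page.evaluate(
    sel => Array.from(document.querySelectorAll(sel)).map(image => image.src),
    selector
  );
}

// Trying the callback on stand-in objects instead of real img elements:
console.log(toSrcList([{ src: 'a.png' }, { src: 'b.png' }]));
// prints [ 'a.png', 'b.png' ]
```

Only serializable values, such as the array of strings returned here, can travel back from the page to the Node.js script.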

That's just the tip of the iceberg though, as besides web scraping, Puppeteer can also run automated User Interface interaction tests, just like the Espresso testing framework does on Android. (See the UI testing part of the Insider's Guide To Udacity Android Developer Nanodegree Part 3 - Making the Baking App guide for more on Espresso.)

For example, let's go to www.smadeseek.com and load the list of all available smartphones. Then we programmatically click on the img element of the second displayed device to bring up its detailed specifications page. From there we can access the innerHTML of the first table element:

// smadeseek2.js
const puppeteer = require("puppeteer");

const url = "http://smadeseek.com/smartphones";
const imgSelector = "#contentx > div > div img";
const tableSelector = "#masthead";

// A quoted string literal cannot span lines, so the long selector is concatenated
const tableSelector2 = "#masthead > div > " +
    "div:nth-child(2) > div > div > " +
    "div.col-md-6.col-sm-6 > table:nth-child(2)";

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: ["load", "domcontentloaded"] });

    // Wait for the device images to be rendered before reading them
    await page.waitForSelector(imgSelector);
    const imgAll = await page.$$eval(imgSelector,
        images => images.map(image => image.src));

    console.log(imgAll);

    // Click the image of the second device to open its specifications page
    await page.click("#contentx > div > " +
        "div:nth-child(2) > a > img");

    // Wait for the content of the details page to appear
    await page.waitForSelector(tableSelector);

    const tableAll =
        await page.$eval(tableSelector2,
            table => table.innerHTML);

    console.log(tableAll);

    await browser.close();
})();

 

Results in:

[ 'http://smadeseek.com/ModelImages/885-1.jpg',
  'http://smadeseek.com/ModelImages/937-1.jpg',
  'http://smadeseek.com/ModelImages/846-1.jpg',
  'http://smadeseek.com/ModelImages/858-1.jpg',
  'http://smadeseek.com/ModelImages/862-1.jpg' ]

 

 


and:

<tbody><tr><td>Brand</td><td>Meizu</td>
</tr><tr><td>Model</td><td></td></tr><tr>

<td>Version</td><td></td></tr><tr><td>Colors</td><td></td></tr></tbody>


 

There's just one caveat: since CDP only works with Chromium, Chrome and other Blink-based browsers, so does Puppeteer. If you need to cover other browsers as well, sticking to Selenium and its WebDriver API remains the best option.

 

More Information

Chrome DevTools Protocol

Puppeteer on GitHub

smadeseek1.js-Puppeteer - Scrape smadeseek.com 1 level deep

smadeseek2.js-Puppeteer - Scrape smadeseek.com 2 levels deep

Related Articles

Insider's Guide To Udacity Android Developer Nanodegree Part 3 - Making the Baking App (Espresso UI testing)

Insight.io - Code Intelligence on GitHub

Grimoire Lab-GitHub - Stats On Steroids

 


 

Last Updated ( Wednesday, 29 November 2017 )
 
 

   
Copyright © 2017 i-programmer.info. All Rights Reserved.