Scraping web pages with ScrapingBee
PhantomJS is dead. Here is how to scrape web pages through ScrapingBee.

Web scraping is legal for now, and one excellent service to emerge since the death of PhantomJS is ScrapingBee. It is a reliable substitute for developers familiar with PhantomJS, CasperJS, and NickJS. In the demo below, I wrap each ScrapingBee API request in a queue.
Why ScrapingBee?
If your team wants to harvest content at scale without stressing about the obstacles that site owners (LinkedIn and Craigslist, for example) place around their data, then a managed web-scraping service is a good fit.
Suppose you decide to go DIY and build your own scraper on PhantomJS. Your team must then consider two things: 1) how to mimic an actual web browser and 2) how to simulate a real person using that browser.
Mimic an Established Browser
Aside from managing your servers, you will need to 1) configure your bot's "User-Agent" to resemble a real browser, 2) handle JavaScript obfuscation from Single Page Apps (SPAs), 3) present a plausible browser fingerprint, since content providers use fingerprints to identify unique users and track online behavior, and 4) scale your service to manage 20+ simultaneous instances of headless Chrome.
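As a rough illustration of the first item, here is a minimal sketch of request options that send browser-like headers through Node's built-in https module. The helper name browserLikeOptions and the specific header values are my own assumptions for illustration, and headers alone will not get past fingerprinting or JavaScript checks.

```javascript
// Hypothetical helper: build https.request options that carry browser-like headers.
// The header values below are example values, not a recommended set.
const browserLikeOptions = (targetUrl) => {
  const { hostname, pathname, search } = new URL(targetUrl);
  return {
    hostname,
    port: 443,
    path: pathname + search,
    method: 'GET',
    headers: {
      'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'en-US,en;q=0.9'
    }
  };
};

// The result can be passed straight to https.request(options, callback).
const exampleOptions = browserLikeOptions('https://example.com/a/b?x=1');
console.log(exampleOptions.path, exampleOptions.headers['User-Agent']);
```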
Mimic a Real Human
Not only will content providers conduct programmatic due diligence on your browser, but they will also run a second set of tests to confirm a real user is behind it. The four most common tactics for verifying human behavior are 1) IP checking, 2) Captcha brain teasers, 3) requiring a username/password or a session ID, and 4) identifying strange patterns, such as downloading 1000 documents in sequential order (000, 001, 002, 003, 004...).
If your team wants to build a web scraper, you'll need to consider these eight issues and more.
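To make the last tactic concrete, here is a minimal sketch of how a DIY scraper might avoid a strictly sequential, evenly paced download pattern by shuffling the order and adding random pauses. The document IDs, the 500 to 3000 ms range, and the helper names are assumptions for illustration only, and this does nothing about IP checks or Captchas.

```javascript
// Fisher-Yates shuffle: returns a new array with the items in random order.
const shuffle = (items) => {
  const copy = items.slice();
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy;
};

// Random pause between minMs and maxMs (inclusive), in milliseconds.
const randomDelay = (minMs, maxMs) =>
  minMs + Math.floor(Math.random() * (maxMs - minMs + 1));

// Hypothetical document IDs: 000, 001, 002, 003, 004
const ids = Array.from({ length: 5 }, (_, i) => String(i).padStart(3, '0'));

// Each entry says which document to fetch next and how long to wait first.
const plan = shuffle(ids).map(id => ({ id, waitMs: randomDelay(500, 3000) }));
console.log(plan);
```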
Simple Demo
In my example below, I am scraping seven pages from my website. The code is written for Node.js and uses the better-queue library to simplify orchestration.
const https = require('https');
const fs = require('fs');
const util = require("util");
const Queue = require('better-queue');
const data = [
  "https://www.chrisjmendez.com/2020/05/01/mba-glossary/",
  "https://www.chrisjmendez.com/2020/04/13/nuxtjs/",
  "https://www.chrisjmendez.com/2020/04/12/how-to-simultaneously-unrar-multiple-files-at-once-into-individual-folders/",
  "https://www.chrisjmendez.com/2020/03/10/installing-octave-on-macos-for/",
  "https://www.chrisjmendez.com/2019/12/25/find-files-on-your-mac-using-command-line/",
  "https://www.chrisjmendez.com/2019/12/02/deploying-rails-on-elastic-beanstalk/",
  "https://www.chrisjmendez.com/2019/10/30/managing-virtual-environments-for-python/"
];
const API_KEY = "REGISTER HERE => https://www.scrapingbee.com?fpr=chris-m37";
const config = (url) => {
  return {
    hostname: 'app.scrapingbee.com',
    port: '443',
    url: url,
    path: util.format('/api/v1?api_key=%s&url=%s', API_KEY, encodeURIComponent(url)),
    method: 'GET'
  }
};
const save = (fileName, html) => {
  fs.writeFile(fileName, html, function(err){
    if(err) return console.log(err);
    console.log("Document Saved:", fileName)
  })
};
const q = new Queue( (url, cb) => {
  let options = config(url);
  let req = https.request(options, res => {
    console.log(`\nStatusCode: ${ res.statusCode }`);
    // The URLs end with "/", so drop empty segments before taking the last one
    let fileName = options.url.split('/').filter(Boolean).pop();
    let body = [];
    res
      .on('data', chunk => {
        body.push(chunk);
      })
      .on('end', () => {
        body = Buffer.concat(body).toString();
        save(`${fileName}.html`, body);
      })
      .on('close', () => {
        // Go to Next Item in Queue
        // (the first argument is the error, so pass null on success)
        cb(null, body);
      })
  })
  req.on('error', err => {
    console.error(err.message);
    process.exit(1);
  });
  req.end();
});
// /////////////////////////
// Task-Level Events
// /////////////////////////
q.on('task_started', (taskId, obj) => {
  console.log('task_started', taskId, obj);
});
q.on('task_finish', (taskId, result, stats) => {
  console.log('task_finish', taskId, stats);
});
q.on('task_failed', (taskId, err, stats) => {
  console.log('task_failed', taskId, stats);
});
// /////////////////////////
// Queue-Level Events
// /////////////////////////
// All tasks have been pulled off of the queue
// (there may still be tasks running!)
q.on('empty', () => {
  console.log('empty');
});
// There are no more tasks on the queue and no tasks running
q.on('drain', () => {
  console.log('drain');
});
// /////////////////////////
// Start Queue
// /////////////////////////
data.forEach(function (item) {
  q.push(item);
});
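Two small pieces of the demo are easy to get wrong: building the query string for the API call and deriving a file name from a URL that ends in a trailing slash. Below is a minimal sketch of both as pure functions, using only Node's built-in URL and URLSearchParams. The helper names buildApiPath and fileNameFor, and the 'index' fallback, are my own assumptions rather than anything ScrapingBee prescribes.

```javascript
// Hypothetical helper: build the request path used in the demo,
// letting URLSearchParams handle the percent-encoding of the target URL.
const buildApiPath = (apiKey, targetUrl) => {
  const params = new URLSearchParams({ api_key: apiKey, url: targetUrl });
  return `/api/v1?${params.toString()}`;
};

// Hypothetical helper: turn a page URL into a file name.
// Empty path segments (from a trailing "/") are dropped before taking the last one.
const fileNameFor = (targetUrl) => {
  const segments = new URL(targetUrl).pathname.split('/').filter(Boolean);
  const slug = segments.pop() || 'index'; // assumed fallback for a bare domain
  return `${slug}.html`;
};

console.log(buildApiPath('MY_KEY', 'https://www.chrisjmendez.com/2020/04/13/nuxtjs/'));
console.log(fileNameFor('https://www.chrisjmendez.com/2020/04/13/nuxtjs/'));
```

Keeping these as separate functions makes them easy to check without spending API credits on a real request.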
Questions, comments, and feedback welcome.
Resources
- ScrapingBee Web Scraping Handbook PDF
- Mirror your Header
- Opensource Web Scraping through NickJS
- Understanding PhantomJS
- Concurrency in Javascript
- Web Scraping using Headless Chrome