Curated list of web scraping tools for NodeJS developers.
Web scraping is the act of fetching data from a third-party website by downloading its HTML and parsing it to extract the information you need. The process consists of three steps: 1) downloading the HTML content of a page, 2) parsing it and extracting the data, and 3) saving the data into a database for further analysis or use.
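The three steps above can be sketched with nothing but Node.js built-ins. This is a minimal illustration, not a recommended production setup: it assumes Node 18 or later (for the global `fetch`), uses a regular expression in place of a real HTML parser, and writes a JSON file as a stand-in for a database. The function names (`downloadHtml`, `extractLinks`, `saveRecords`, `scrape`) are made up for this example.

```javascript
// Minimal sketch of the three scraping steps using only Node.js built-ins.
// Assumes Node 18+ so that `fetch` is available globally.
const fs = require('node:fs/promises');

// Step 1: download the HTML content of a page.
async function downloadHtml(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}

// Step 2: parse the HTML and extract data (here: every link and its text).
// A regex keeps the sketch dependency-free; a real HTML parser is more robust.
function extractLinks(html) {
  const links = [];
  const anchorPattern = /<a\s[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/gis;
  for (const match of html.matchAll(anchorPattern)) {
    links.push({
      href: match[1],
      // Strip any nested tags from the link text.
      text: match[2].replace(/<[^>]+>/g, '').trim(),
    });
  }
  return links;
}

// Step 3: save the extracted records. A JSON file stands in for a database.
async function saveRecords(records, filePath) {
  await fs.writeFile(filePath, JSON.stringify(records, null, 2));
}

// Run the three steps in order and return the extracted records.
async function scrape(url, filePath) {
  const html = await downloadHtml(url);
  const links = extractLinks(html);
  await saveRecords(links, filePath);
  return links;
}
```

Calling `scrape('https://example.com', 'links.json')` would fetch the page, collect its links, and write them to `links.json`. Keeping the parsing step as a pure function makes it easy to test against saved HTML without touching the network.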
Web Scraping Use Cases
Below are a few industries where web scraping is often used:
News - You can synthesize information from articles on different news sources using extractive or abstractive summaries.
News Aggregators - You can aggregate articles from different sources such as Reddit, Twitter, The New Yorker, etc.
Real Estate - You can scrape Redfin or Zillow to track housing prices in near real time.
Search Engines - Suppose you want to build an internal search engine for your organization. As a supplement to your internal documents, you might also want to scrape third-party data such as eBay listings or user profiles from LinkedIn.
Travel - You can scrape flight and hotel prices for comparison.
Online Shopping - Scrape product pages to monitor competitor prices.
Retail Banking - Aggregate information from various sources such as Mint.com, credit unions, and E-Trade.
Data Journalism - LA-based Crosstown is a non-profit that uses data scraping to aggregate real-time data across different data sources to generate unique, data-driven stories.
Search Engine Optimization - Marketers can optimize SEO content by comparing similar articles found on Google Search results.
Market Research - Scrape social media sites to gauge public opinion (sentiment analysis), for example to inform trading decisions.
Lead Generation - Discover new prospects and partners by scraping content online.
Content Audits - Webmasters can create a sitemap or content audit by scraping pages.
DIY Scraping Libraries
If you are a developer who prefers to build and manage things yourself, then here are a few good libraries for NodeJS.
[NickJS](https://nickjs.org) - Headless browser automation library. This is a good substitute for CasperJS. It works with Headless Chrome, PhantomJS and CasperJS.
Google's Puppeteer - Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol.
ZombieJS - Insanely fast, full-stack, headless browser testing using NodeJS.
SlimerJS - A scriptable browser like PhantomJS, based on Firefox.
CasperJS - CasperJS is a navigation scripting & testing utility for PhantomJS and SlimerJS. Deprecated.
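To give a feel for what headless browser automation looks like in practice, here is a small sketch using Puppeteer from the list above. It assumes Puppeteer has been installed with `npm install puppeteer`; the function name `scrapeHeadings` and the `h1, h2` selector are illustrative choices, and the optional second parameter exists only so the module can be swapped for a stub when testing.

```javascript
// Sketch: collect heading text from a JavaScript-rendered page with Puppeteer.
// Assumes `npm install puppeteer`; function name and selector are illustrative.
async function scrapeHeadings(url, puppeteer = require('puppeteer')) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so client-side content has rendered.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // $$eval runs the callback inside the page against all matching elements.
    return await page.$$eval('h1, h2', (nodes) =>
      nodes.map((node) => node.textContent.trim())
    );
  } finally {
    // Always close the browser, even if navigation or extraction fails.
    await browser.close();
  }
}
```

A call such as `scrapeHeadings('https://example.com')` resolves to an array of heading strings. Unlike a plain HTTP download, this approach executes the page's scripts first, which is the main reason to reach for a headless browser.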