Data scraping development is a technical art that starts with big, blunt instruments (to write, capture, parse, and store data) and later leads to surgical slicing and dicing.

Below are a few handy tools and techniques I've gathered. This is a working document that will change over time.

Narrow Google Searches

I use this trick all the time for my blog. Let's say you are looking for an article about Ruby Rake tasks on my blog. To narrow the search, type this into Google:

rake task site:chrisjmendez.com

This returns only pages from https://www.chrisjmendez.com/ that match the terms "rake" and "task".
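
If you run this kind of narrowed search often, the query can be turned into a URL programmatically. The sketch below is a minimal illustration using only Python's standard library. The function name `site_search_url` is hypothetical, and it assumes Google's usual `https://www.google.com/search?q=` URL pattern.

```python
from urllib.parse import urlencode

def site_search_url(terms, site):
    """Build a Google search URL restricted to one site (hypothetical helper)."""
    # The site: operator goes into the query string like any other term.
    query = f"{terms} site:{site}"
    # urlencode escapes spaces and the colon so the URL is safe to paste or open.
    return "https://www.google.com/search?" + urlencode({"q": query})

print(site_search_url("rake task", "chrisjmendez.com"))
```

Opening the printed URL in a browser should be equivalent to typing the query by hand.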

Explicit Google Searches

This example is really useful while scraping Twitter. Suppose your social media manager uses bit.ly to shorten links on Twitter. Let's say that during a Downtown LA campaign, she and her team posted a handful of tweets with the hashtag "#dtla2017". Weeks later, you're trying to audit the tweets and don't want to bug everyone with your inquiry. Here's how to search within Twitter for anything with a bit.ly URL and a reference to the keyword "#dtla2017".

Step 1

Suppose you want to track a Twitter contest promoted by a radio station that used bit.ly or goo.gl shortened links. You can run a query like this in Google Search:

Example 1

site:twitter.com intext:bit.ly "#dtla2017 *"

Example 2

site:twitter.com intext:bit.ly "classicalkusc *"

Source
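
The two examples above follow the same pattern, so the query string can be generated for any shortener and keyword. This is a small sketch, not an official tool. The helper name `twitter_link_query` is an assumption made for illustration.

```python
def twitter_link_query(shortener, phrase):
    """Compose a Google query for tweets that mention a shortener and a phrase."""
    # site: limits results to twitter.com, intext: requires the shortener
    # domain in the page text, and the quoted phrase ends with the * wildcard.
    return f'site:twitter.com intext:{shortener} "{phrase} *"'

# Generate one query per link shortener mentioned in this step.
for shortener in ("bit.ly", "goo.gl"):
    print(twitter_link_query(shortener, "#dtla2017"))
```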

Step 2

Once you have a query that works, the next step is to create a Google Custom Search engine that converts your search results into an RSS feed.

Step 3

You can time-box your search (and feed) by adjusting the dateRestrict= parameter. For example, dateRestrict=d7 limits results to the past seven days.

ATOM Example

Source
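
To make Steps 2 and 3 concrete, here is a rough sketch that builds a request URL for Google's Custom Search JSON API with a `dateRestrict` value. It only constructs the URL and does not call the API. The function name `custom_search_url` and the placeholder key and engine ID are assumptions, so you would substitute your own credentials.

```python
import re
from urllib.parse import urlencode

# Endpoint of Google's Custom Search JSON API.
CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def custom_search_url(api_key, engine_id, query, date_restrict=None):
    """Build a Custom Search request URL, optionally limited to a recent period."""
    params = {"key": api_key, "cx": engine_id, "q": query}
    if date_restrict is not None:
        # dateRestrict takes a unit letter (d, w, m, y) followed by a count,
        # e.g. "d7" for the past seven days or "w2" for the past two weeks.
        if not re.fullmatch(r"[dwmy]\d+", date_restrict):
            raise ValueError(f"invalid dateRestrict value: {date_restrict!r}")
        params["dateRestrict"] = date_restrict
    return CSE_ENDPOINT + "?" + urlencode(params)

# Placeholder credentials; replace with your own key and search engine ID.
print(custom_search_url("YOUR_API_KEY", "YOUR_ENGINE_ID",
                        'intext:bit.ly "#dtla2017 *"', date_restrict="w2"))
```

Validating the value up front is a design choice: a malformed `dateRestrict` fails in your own code rather than as an error from the API.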


Google Alerts

Google Alerts is still a great way to get notifications based on keywords you specify. This is especially useful when monitoring a business competitor's moves or tracking your name online.


LiveAPI

LiveAPI is pretty new, but it shows a lot of promise. It's a tool designed to help turn any public data into an API. Read this article by the brilliant @melissjs.


Yahoo Pipes Clones

Yahoo Pipes was an incredible piece of software, and although it's no longer in production, there are a few clones worth looking into.


IFTTT

You can use If This Then That (IFTTT) for eBay.


Feedity

Feedity provides a service to scrape web pages into feeds.
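
To give a feel for what "scraping a web page into a feed" involves, here is a minimal sketch using only Python's standard library. It is not how Feedity works internally. It simply collects the links on a page and emits them as RSS 2.0 items, and the names `LinkCollector` and `links_to_rss` are made up for this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a href=...> in a page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

def links_to_rss(html, feed_title, feed_link):
    """Return an RSS 2.0 document (as a string) with one item per link."""
    collector = LinkCollector()
    collector.feed(html)
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = f"Links scraped from {feed_link}"
    for text, href in collector.links:
        item = ET.SubElement(channel, "item")
        # Fall back to the URL when the link has no visible text.
        ET.SubElement(item, "title").text = text or href
        ET.SubElement(item, "link").text = href
    return ET.tostring(rss, encoding="unicode")

sample = '<ul><li><a href="https://example.com/a">First post</a></li></ul>'
print(links_to_rss(sample, "Example feed", "https://example.com/"))
```

A real page would need a more selective rule than "every link", which is the kind of tuning a hosted service handles for you.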


Google API RSS

The Google API RSS tool helps you create RSS feeds for Google search results.


Google Spreadsheets

You can go one step beyond the Google API and start screen scraping with Google Spreadsheets, for example with built-in functions such as IMPORTXML and IMPORTHTML.