Ways to scrape data

There are many situations where you may need to scrape data. Data scraping is an art form of its own, and the complexity of a project can range from a giant aggregator (written to capture, parse and store data) to something really small, like a single function connected to a timer (aka a cron job) that yanks data from Craigslist.

This article is for new developers looking for "quick wins" from a few noteworthy tools. I will continue to update this with more info over time.


Google Search

Narrow Google Searches

I use this trick all the time for my own blog. Let's say you are looking for an article about Ruby Rake tasks on my blog. If you really want to narrow the search, simply type this into Google.

rake task site:chrisjmendez.com

This returns only articles from my website, chrisjmendez.com, that mention "rake" or "task".
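
If you want to generate these narrowed searches from a script, here's a minimal sketch in Python. It only builds the search URL; it doesn't fetch or parse any results. The helper name site_search_url is my own invention for this example, and I'm assuming the usual https://www.google.com/search?q= URL format.

```python
from urllib.parse import urlencode


def site_search_url(terms: str, site: str) -> str:
    """Build a Google search URL restricted to a single site.

    Hypothetical helper for illustration: it appends the site: operator
    to the search terms and URL-encodes the whole query.
    """
    query = f"{terms} site:{site}"
    return "https://www.google.com/search?" + urlencode({"q": query})


# Same search as above: rake task site:chrisjmendez.com
print(site_search_url("rake task", "chrisjmendez.com"))
```

Paste the printed URL into a browser and you should land on the same narrowed search.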

Explicit Google Searches

This example is really useful while scraping Twitter. Suppose your social media manager uses bit.ly to shorten links on Twitter. Let's say that during a Downtown LA campaign, she and her team posted a handful of Tweets with the hashtag "#dtla2017". Weeks later, you're trying to do an audit of the tweets and you don't want to bug everyone with your inquiry. Here's how to search within Twitter for anything with a bit.ly URL and a reference to the keyword "#dtla2017".

Step 1

Suppose you want to track a Twitter contest promoted by a radio station that used bit.ly or goo.gl shortened links. You can always run this in Google Search:

Example 1

site:twitter.com intext:bit.ly "#dtla2017 *"

Example 2

site:twitter.com intext:bit.ly "classicalkusc *"

Source
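
If you run a lot of these audits, it may help to compose the operators in code. This is a rough sketch, not a definitive tool: twitter_audit_query and google_url are hypothetical helper names, and the sketch only assembles the query string and a search URL from the operators used in the two examples above.

```python
from urllib.parse import urlencode


def twitter_audit_query(shortener: str, phrase: str) -> str:
    """Combine site:, intext: and a quoted phrase ending in a wildcard."""
    return f'site:twitter.com intext:{shortener} "{phrase} *"'


def google_url(query: str) -> str:
    """URL-encode the query so characters like # and quotes survive."""
    return "https://www.google.com/search?" + urlencode({"q": query})


for phrase in ("#dtla2017", "classicalkusc"):
    query = twitter_audit_query("bit.ly", phrase)
    print(query)
    print(google_url(query))
```

Swapping "bit.ly" for "goo.gl" in the call gives you the equivalent search for the other shortener.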

Step 2

Once you have something working, the next step is to create a Google Custom Search that converts your search results into an RSS feed.

Step 3

You can time-box your search (and feed) by adjusting the date parameter, dateRestrict=.

ATOM Example

Source
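
To give a feel for how dateRestrict= fits into a request, here's a hedged sketch that builds a URL for Google's Custom Search JSON API. Some assumptions to flag: I'm using the JSON endpoint rather than a feed URL, the key and engine ID are placeholders you'd replace with your own, and custom_search_url is a made-up helper name. As I understand the API, dateRestrict takes values like d7 (past 7 days), w1 (past week), m1 (past month) or y1 (past year). The sketch only builds the URL; it makes no network call.

```python
from urllib.parse import urlencode

# Endpoint of the Custom Search JSON API (assumed here; check your own setup).
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def custom_search_url(query: str, api_key: str, engine_id: str,
                      date_restrict: str = "w1") -> str:
    """Build a time-boxed Custom Search request URL (hypothetical helper)."""
    params = {
        "key": api_key,                 # placeholder: your API key
        "cx": engine_id,                # placeholder: your search engine ID
        "q": query,
        "dateRestrict": date_restrict,  # e.g. d7, w1, m1, y1
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = custom_search_url(
    'site:twitter.com intext:bit.ly "#dtla2017 *"',
    api_key="YOUR_API_KEY",
    engine_id="YOUR_ENGINE_ID",
    date_restrict="d7",
)
print(url)
```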


Google Alerts

Google Alerts is still a great way to get notifications based on keywords you specify. This is especially useful when you want to monitor a business competitor's moves or track your own name online.


LiveAPI

LiveAPI is pretty new but it shows a lot of promise. It's a tool designed to help turn any public data into an API. Read this article by the brilliant @melissjs.


Yahoo Pipes Clones

Yahoo Pipes was an incredible piece of software, and although it's no longer in production, there are a few clones worth looking into.


IFTTT

You can use If This Then That (IFTTT) for eBay.


Feedity

Feedity provides a service to scrape web pages into feeds.
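
To show the general idea of turning scraped items into a feed, here's a small sketch using only Python's standard library. To be clear, this is not how Feedity works internally (I don't know its implementation). The build_rss helper and the sample items are made up for illustration, and in a real project the items would come from whatever scraper you write.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, items):
    """Wrap a list of {'title', 'link'} dicts in a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
    return ET.tostring(rss, encoding="unicode")


# Hypothetical scraped results; replace with real data from your scraper.
scraped = [
    {"title": "First headline", "link": "https://example.com/first"},
    {"title": "Second headline", "link": "https://example.com/second"},
]

feed = build_rss("Example feed", "https://example.com", "Scraped headlines", scraped)
print(feed)
```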


Google API RSS

The Google API RSS tool helps you create RSS feeds for Google Search results.


Google Spreadsheets

You can go one step beyond the Google API and start screen scraping using Google Spreadsheets.
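
In a spreadsheet, this kind of scraping is typically done with an import function such as IMPORTXML, which takes a URL and an XPath query. Here's the same idea sketched in Python so you can see what's happening under the hood. The import_xml helper and SAMPLE_PAGE are hypothetical, and this assumes well-formed markup; real-world pages usually need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Stand-in for a downloaded page (hypothetical, well-formed markup).
SAMPLE_PAGE = """<html><body>
<ul>
  <li><a href="https://example.com/a">Post A</a></li>
  <li><a href="https://example.com/b">Post B</a></li>
</ul>
</body></html>"""


def import_xml(markup: str, xpath: str) -> list:
    """Return the text of every node matching a (limited) XPath query."""
    root = ET.fromstring(markup)
    return [(node.text or "").strip() for node in root.findall(xpath)]


titles = import_xml(SAMPLE_PAGE, ".//li/a")
print(titles)
```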