Ways to scrape data
There are many situations where you may need to scrape data. Data scraping is an art form of its own, and the complexity of a project can range from a giant aggregator (written to capture, parse and store data) to something really small, like a single function connected to a timer (aka a cron job) that yanks data from Craigslist.
This article is for new developers looking for "quick wins" from a few noteworthy tools. I will continue to update this with more info over time.
Google Search
Narrow Google Searches
I use this trick all the time for my own blog. Let's say you are looking for an article about Ruby Rake tasks on my blog. If you really want to narrow the search, simply type this into Google.
rake task site:chrisjmendez.com
This will only return articles from my website, chrisjmendez.com, that mention "rake" or "task".
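If you run this kind of search often, you can build the search URL in code instead of typing it. Here is a minimal sketch in Python, assuming the usual https://www.google.com/search?q= URL form. The function name site_search_url is made up for illustration.

```python
from urllib.parse import urlencode


def site_search_url(terms: str, domain: str) -> str:
    """Build a Google search URL restricted to a single domain."""
    # The site: operator limits results to pages on the given domain.
    query = f"{terms} site:{domain}"
    # urlencode escapes the spaces and the colon so the URL is safe to paste.
    return "https://www.google.com/search?" + urlencode({"q": query})


print(site_search_url("rake task", "chrisjmendez.com"))
# prints https://www.google.com/search?q=rake+task+site%3Achrisjmendez.com
```

Opening the printed URL in a browser should give the same results as typing the query by hand.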
Explicit Google Searches
This example is really useful when scraping Twitter. Suppose your social media manager uses bit.ly to shorten links on Twitter. Let's say that during a Downtown LA campaign, she and her team posted a handful of tweets with the hashtag "#dtla2017". Weeks later, you're trying to audit the tweets and you don't want to bug everyone with your inquiry. Here's how to search within Twitter for anything with a bit.ly URL and a reference to the keyword "#dtla2017".
Step 1
Suppose you want to track a Twitter contest promoted by a radio station that used bit.ly or goo.gl shortened links. You can always run this in Google Search:
Example 1
site:twitter.com intext:bit.ly "#dtla2017 *"
Example 2
site:twitter.com intext:bit.ly "classicalkusc *"
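Both examples follow the same pattern, so the query can be assembled from its parts. This is only a sketch: the helper name twitter_link_query is hypothetical, and it just reproduces the two query strings above.

```python
def twitter_link_query(shortener: str, keyword: str) -> str:
    """Compose a query for tweets containing a shortened link and a keyword."""
    parts = [
        "site:twitter.com",     # only search pages on Twitter
        f"intext:{shortener}",  # page text must mention the link shortener
        f'"{keyword} *"',       # quoted phrase, with * standing in for any words
    ]
    return " ".join(parts)


for keyword in ("#dtla2017", "classicalkusc"):
    print(twitter_link_query("bit.ly", keyword))
```

Swapping "bit.ly" for "goo.gl" gives the equivalent search for the other shortener.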
Step 2
Once you have something working, the next step is to create a Google Custom Search that will convert your search results into an RSS feed.
Step 3
You can time-box your search (and feed) by adjusting the date parameter dateRestrict=.
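To show where such a parameter might go, here is a sketch that builds a request URL for Google's Custom Search JSON API, where dateRestrict accepts values like d7 (past 7 days), w2 (past 2 weeks) or m1 (past month). This assumes the JSON API rather than the feed URL your custom search produces, which may differ. API_KEY and ENGINE_ID are placeholders and the function name is made up.

```python
from urllib.parse import urlencode

# Endpoint of the Custom Search JSON API (assumed here, see the note above).
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def custom_search_url(query, api_key, engine_id, date_restrict=None):
    """Build a Custom Search request URL, optionally limited to recent results."""
    params = {"key": api_key, "cx": engine_id, "q": query}
    if date_restrict:
        # Examples: "d7" = past 7 days, "w2" = past 2 weeks, "m1" = past month.
        params["dateRestrict"] = date_restrict
    return API_ENDPOINT + "?" + urlencode(params)


url = custom_search_url(
    'site:twitter.com intext:bit.ly "#dtla2017 *"',
    api_key="API_KEY",      # placeholder, not a real key
    engine_id="ENGINE_ID",  # placeholder for your search engine ID
    date_restrict="w2",
)
print(url)
```

The sketch only builds the URL. Sending the request requires a real API key and search engine ID.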
Google Alerts
Google Alerts is still a great way to get notifications based on keywords you specify. This is especially useful when you want to monitor a business competitor's moves or track your own name online.
LiveAPI
LiveAPI is pretty new but it shows a lot of promise. It's a tool designed to help turn any public data into an API. Read this article by the brilliant @melissjs.
Yahoo Pipes Clones
Yahoo Pipes was an incredible piece of software, and although it's no longer in production, there are a few clones worth looking into.
- Pipes Digital is promising.
- Superfeedr is another good alternative.
IFTTT
You can use If This Then That to scrape eBay and a number of other services.
- Data scrape eBay
- Data scrape Craigslist
- Data scrape Twitter
- Data scrape SongKick
- Data scrape stock quotes
- Data scrape the Scoop.it feed focused on Artist Opportunities and publish it to Pocket through IFTTT.
Feedity
Feedity provides a service to scrape web pages into feeds.
Google API RSS
Google API RSS tool helps you create RSS feeds for Google Search Results.
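Once one of these tools gives you an RSS feed, reading it takes only a few lines. Below is a minimal sketch using Python's standard library. The sample feed is invented for illustration, and the function name parse_items is hypothetical. A real feed would be downloaded first, for example with urllib.request.

```python
import xml.etree.ElementTree as ET

# A tiny made-up RSS 2.0 document standing in for a real feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example search feed</title>
    <item>
      <title>First result</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second result</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return a list of (title, link) tuples, one per <item> in the feed."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items


for title, link in parse_items(SAMPLE_FEED):
    print(title, link)
```

From here the tuples can be written to a file or database on whatever schedule your timer runs.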
Google Spreadsheets
You can go one step beyond the Google API and start screen scraping using Google Spreadsheets.