On Scrapers

Update: This post was published in 2012. As of 2017,
  • The dream of this post is finally achievable with puppeteer, Google's official API for controlling Chrome programmatically.
  • Greasemonkey is no longer supported, and neither is userscripts.org. Tampermonkey is the successor to those projects.

tl;dr: use Greasemonkey, happen, and jQuery to write a scraper that runs in a real browser.

A ‘Web Scraper’ is a program that attempts to make a webpage or website accessible, in full, to other programs. For instance, governments often spend energy publishing their databases as ‘data portals’ - showing one page of police or election data at a time. This is perfectly fine for the majority of people visiting the site, but if you’re trying to do analysis like aggregation or visualization, or if you’re trying to download your own data in bulk, it’s counterproductive and frustrating.

And so web scrapers do the task of a very industrious user - they navigate websites, parse pages, and save information in bulk.

In the course of being a so-called data-centered programmer and a somewhat crazy person, I’ve written a lot of these scripts. They scrape things like Yelp, GitHub, Twitter, Garmin Connect, and weirder stuff like Afghanistan Election sites, public imagery sources, and more.

There are a few things that have stood out in all of this experience.

Scrapers Try To Be Browsers

There are two things about browsers which are valuable and hard to replicate: strong, tolerant HTML parsing, and the sessions/cookies/user-agent string identification paradigm.

The web is filled with invalid HTML, which is in turn invalid XML. Browsers tolerate it, and users write it.

I support Marc completely in his decision to make Mosaic work as best it can when it is given invalid HTML. The maxim is that one should be

  • conservative in what one does
  • liberal in what one expects.

Tim Berners-Lee
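
Tolerant parsing is easy to see first-hand. As a small illustration - run in a modern browser console, with a made-up fragment - the HTML parser quietly repairs markup that any strict XML parser would reject:

// A made-up fragment with unclosed tags, parsed the way a browser parses pages
var doc = new DOMParser().parseFromString('<p>unclosed <b>tags', 'text/html');

// The parser repairs the markup rather than rejecting it
console.log(doc.body.innerHTML); // "<p>unclosed <b>tags</b></p>"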

The web is also filled with web pages that expect browsers - they use cookies, AJAX, Javascript links, etc.

So, advanced scrapers try to be browsers. The historical example is Mechanize, in Ruby and Perl, which implements such features as a ‘cookie jar’ to act like a browser. The new kid on the block is PhantomJS, a headless WebKit that’s inches away from being a node.js extension.
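
To make that concrete, here is a rough sketch of what driving PhantomJS looks like - the URL and the selector are placeholders, not taken from any real scraper:

// save as scrape.js and run with: phantomjs scrape.js
var page = require('webpage').create();

page.open('http://example.com/data?page=1', function (status) {
    if (status !== 'success') {
        console.log('failed to load the page');
        phantom.exit(1);
        return;
    }
    // evaluate() runs inside the page itself, so the real DOM -
    // including anything built by the page's own Javascript - is available
    var rowCount = page.evaluate(function () {
        return document.querySelectorAll('table tr').length;
    });
    console.log('rows found: ' + rowCount);
    phantom.exit();
});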

Everyone Wants Scrapers

Scrapers are weird but not narrow: they’re important for journalists, and they’re more and more a feature of ‘exporting’ data from morally wobbly sites.

What we think of as scrapers - Python or Ruby scripts - aren’t usable by most users, who will, at a minimum, be wary of the Terminal, and will mostly be running Windows.

Keep Data Simple

My main complaint with systems like ScraperWiki is that they take data out of one difficult system and put it into a new one, with its own API. The data that you get out of scrapers should be simple - that means either CSV or JSON, with the emphasis on JSON because it is a real standard with plenty of high-quality parsers.
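
For instance, the entire ‘output format’ of a scraper can be one plain JSON file. This is a hypothetical node.js snippet with made-up records, not real data:

var fs = require('fs');

// A hypothetical scrape result: just an array of plain objects
var laws = [
    { title: 'An Act Concerning Data Portals', year: 2011, status: 'passed' },
    { title: 'A Bill for Open Records', year: 2012, status: 'pending' }
];

// JSON.stringify is the whole export step; the third argument
// just pretty-prints with two-space indentation
fs.writeFileSync('laws.json', JSON.stringify(laws, null, 2));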

Write Latin

My choice of language and environment for web scraping has always been driven by the package - by what XML parser and HTTP request toolchain was available. Sometimes it’s Python, sometimes it’s Ruby, and recently it has been node.js with the cheerio module.

This is a problem: unless you have a really good reason, you should write code in the language most understandable to your audience. Javascript - and by Javascript I mean mostly jQuery - is the Latin of the web.

A New Approach for Scrapers

Here it goes: use a browser. Specifically Firefox, with Greasemonkey.

It’s not cute. Firefox is lagging behind Chrome most of the time nowadays, but Greasemonkey is the best augmented-browsing software around, and has a great community around that kind of code.

And so the approach is as follows:

  1. Get the necessary data from the page with jQuery
  2. Save data in localStorage
  3. Go to the next page with happen. Go to #1

That’s it. To review the finer points:

happen is a library I wrote for creating real events in Javascript, like click events. You might have used $('#foo').click() in jQuery, but mind that it’s not a real event - that call simply executes handlers that were bound with jQuery. If the handlers weren’t bound with jQuery, they can’t be triggered that way. happen, on the other hand, dispatches real DOM events, so every listener gets them.

A relevant example of happen in a scraper:

// Find the next link
var $next = $('a[title="Next Document"]');
// Click that link - [0] gives you the real DOM node
// from the jQuery wrapping. Check .length, because a
// jQuery object is truthy even when it matched nothing.
if ($next.length) happen.click($next[0]);

localStorage is a slow but steady storage API: it lets you save data in the browser itself through a super-simple interface. For instance,

var laws = localStorage.getItem('laws');
if (laws) {
    laws = JSON.parse(laws);
} else {
    laws = [];
}
laws.push({ title: 'New!' });
localStorage.setItem('laws', JSON.stringify(laws));

Then, when you’ve stored all of your data into one big array, just enter this in your Firebug console:

var laws = localStorage.getItem('laws');
document.location = 'data:application/octet-stream,' +
    encodeURIComponent(laws);

Which gets the current data dump and downloads it to your system as JSON - easy as that.
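
Putting the pieces together, a whole scraper in this style can be a single short userscript. Here is a minimal sketch - the @include pattern, selectors, and storage key are all hypothetical, and happen itself would need to be pasted in or @require'd alongside jQuery:

// ==UserScript==
// @name     data-portal-scraper
// @include  http://example.gov/laws/*
// @require  https://code.jquery.com/jquery-1.8.3.min.js
// ==/UserScript==

// 1. Get the necessary data from the page with jQuery
var records = $('table.results tr').map(function () {
    return {
        title: $(this).find('td.title').text(),
        date: $(this).find('td.date').text()
    };
}).get();

// 2. Save data in localStorage, appending to whatever
//    earlier pages have already stored
var laws = JSON.parse(localStorage.getItem('laws') || '[]');
localStorage.setItem('laws', JSON.stringify(laws.concat(records)));

// 3. Go to the next page with happen. The script runs again on
//    the new page, so this loops until there is no next link.
var $next = $('a[title="Next Document"]');
if ($next.length) happen.click($next[0]);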

Why

This technique has a few awesome benefits. You can prototype your scrapers in Firebug, getting really fast feedback on how to select, clean, and store stuff from the page. Other people who know jQuery can read the script and get what it does - it’s just jQuery.

It can handle crazy cases like sites that need login - which otherwise are a pain to script.

It’s not blazingly fast, but, of course, scraping is ‘fast’ when it’s ‘not a week of some intern’s time’.