get_them_ducats.pl: RSS parsing and screen-scraping, poorly.

Since I moved to New York, I’ve found myself facing an extra hour and a half a day on the train, with nothing to do but wish I had a seat or could afford a closer apartment.

It’s left me scrambling for things to listen to, or to read. So I’ve done two things. One, I started downloading and listening to podcasts - they require too much of my attention to listen to during work, and generally don’t hold up against the myriad distractions of home, but episodes of This American Life and Scientific American’s Science Talk podcasts are perfect for the train.

The other thing, which I’ll go into an absurd level of detail about in a minute here, has been to dust off the old Sony eReader that I thought was going to be so important, and which ended up sitting on my desk for months on end unless I was going to be on a plane.

The trouble with the Reader, and I don’t think I’m overstating things to say that this is the problem with all portable media devices, is that it lives and dies by the content that you can get on it. The iPod thrives off of the iTunes store and people’s MP3 libraries. Amazon is trying to do the same with the Kindle, though I honestly think it’s a losing plan, for a number of reasons.

The Sony reader, then, by that standard, failed horribly, at least out of the box. The store it connected to had a small selection of overpriced books, and RSS support, which was pretty much why I wanted the thing, sucked. Hard. Through a combination of programs that other people wrote, I managed to get RSS feeds converted to PDFs that were readable on the umm, reader.

This time, I decided to do things differently. Becki’s got me reading New York Magazine, and it occurred to me that what I really wanted wasn’t 30 snarky Gawker line-items to read on the train, but one or two longer, in-depth pieces.

So what I did was throw together a perl script (get_them_ducats.pl, because I should never be allowed to name anything, ever) to make this a little easier.

I’ll warn you now, this is long and exceptionally dry, so I’m hiding it behind the jump.


…a perl script with the goal of doing two things.

First, to pull RSS feeds for sites I like, and convert them to text files, because sometimes I do actually want to read 30 snarky Gawker posts on my way to work. This is easy, and very much something the publisher had in mind. No problem there - Perl’s XML::Simple library is a fine enough parser for our purposes. All you need to do is, well, this:

Note that I’ve cropped out “use” statements and variable declarations - this is long enough already. Note, too, that write_to_file is just a wrapper I wrote around the standard open/print/close calls, with error handling and a couple of other things to generate pretty file names, and that strip_html is just a couple of regular expressions to strip out the HTML, because the Sony reader doesn’t deal so well with it. strip_html will also compress whitespace, because page turns are slow and a bit jarring on this thing, and I don’t really want to render the 50 line breaks that might be sitting in someone’s HTML between every paragraph.
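For reference, since it’s cropped above, the preamble amounts to more or less this (LWP::Simple providing the bare get calls is an assumption on my part, and %feeds is just a plain hash):

    # Roughly the cropped preamble: the two modules these snippets rely on,
    # plus the hash that maps feed names to feed URLs.
    use strict;
    use warnings;
    use XML::Simple;
    use LWP::Simple qw(get);

    my %feeds;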

    # Just as a quick and dirty example, we'll use this set:
    # I love you, Gawker media, and I love your URLs.
    foreach(qw{gizmodo gawker idolator io9 kotaku deadspin consumerist
      lifehacker jezebel}){
        $feeds{ucfirst($_)} = "http://feeds.gawker.com/$_/full";
    };

    foreach my $name (keys %feeds){
        print "Processing $name RSS feed...";
        write_to_file($name.'.txt', process_rss_file(get $feeds{$name}));
    }

sub process_rss_file {
    my $raw = shift;
    # ForceArray keeps {item} an arrayref even for single-item feeds.
    my $xml = XML::Simple->new(ForceArray => ['item']);
    my $data;
    my $rv = '';

    # XML::Simple dies outright on malformed XML, so wrap the parse.
    eval { $data = $xml->XMLin($raw); };
    die $@ if $@;
    die "FAILGET: Nothing returned" if !$data;

    foreach(@{$data->{channel}->{item}}){
        $rv .= "\n\n".uc($_->{title})."\n\n";

        # Prefer the full-content element when the feed provides one.
        if($_->{'content:encoded'}){
            $rv .= strip_html($_->{'content:encoded'});
        } else {
            $rv .= strip_html($_->{description});
        }
    }

    return $rv;
}

This deals with “content:encoded” versus “description”, and the eval block around the XMLin call is there because XML::Simple dies a horrible gurgling death if it hits a malformed XML file. I know I call die if that happens anyway, but at least this way, if I decide to handle it more gracefully in the future, the groundwork is already there.
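And since strip_html got cropped along with everything else, here’s a minimal sketch of what it might look like - a few crude tag-stripping regexes plus the whitespace compression mentioned above, nothing resembling a real HTML parser:

sub strip_html {
    my $html = shift;

    $html =~ s/<script.*?<\/script>//sgi;  # drop scripts wholesale
    $html =~ s/<style.*?<\/style>//sgi;    # and inline stylesheets
    $html =~ s/<[^>]+>//sg;                # strip any remaining tags
    $html =~ s/&nbsp;/ /g;                 # decode a few common entities
    $html =~ s/&amp;/&/g;
    $html =~ s/&quot;/"/g;
    $html =~ s/[ \t]+/ /g;                 # squash runs of spaces and tabs
    $html =~ s/\n{3,}/\n\n/g;              # and runs of blank lines

    return $html;
}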

The second goal is the slightly less kosher one, and thus I’m not posting too much code for that part of the project. I wanted to go on magazine websites and yank all the features, then separate the content from the markup and (from this perspective) cruft, and again, convert that into a text file. This is the interesting bit, from a programming standpoint, and it might be something of an ethical gray area, as well (I’m not very well going to copy the banner ads, am I? And it’s not like the script is generating ad impressions).

It’s essentially non-consensual syndication, though we’re not republishing anything (At least I’m not, and hopefully you aren’t either), which may or may not work out to theft, if you’re feeling alarmist, or intellectual property infringement if you’re being a little more accurate. I’m still not totally sure how socially or legally reprehensible the practice is, but I can pretty much guarantee you that it’s against the terms and conditions of the sites we’re talking about, and that’s enough reason not to do it. The more I think about it, in fact, the more I think that I’ll just stick to nice, legit, RSS, and really only develop the screen-scraper as a curiosity and to prove that I can. It’s less about ripping 50 sites to text, and more about seeing how elegantly one can rip 2 or 3.

OK, that’s the end of my lecture on why you and I should be finding other hobbies.

Still, it’s an interesting problem to solve, because it comes down largely to observation and a bit of social engineering rather than brute force, and much of it has to be done on a site-by-site basis. Unless you find a bunch of sites running the same CMS, you’re probably going to have to rewrite chunks of code for each individual site, or do something dastardly and clever. More on that later.

Well, it comes down to that, and regular expressions, which is largely why I used perl and not PHP (Zing!).

It’s worth noting that the reliance on regexes rather than HTML parsers here has less to do with regexes being easier or more efficient, and more to do with the fact that HTML parsers don’t always work - I’ve seen too many of them puke on invalid files, or return oddball structures that just aren’t easy to walk. Plus, regular expressions have gone, in just the last couple of years, from a largely unknown construct that I was terrible with to one of my go-to solutions.

So, first things first. Look at the site you’re planning to scrape. Look, in particular, at the URLs. Odds are, there’s a page with a table of contents. This probably has a list of recent features, book reviews, some junk we don’t want, whatever. Suppose then, that this is located here:

http://example.com/magazine/table-of-contents/2008

The other thing is that your starting point can actually be an RSS feed. Sites that don’t offer full-content feeds generally still have one with short teasers and links back to the actual article pages, which is exactly what we were going to grab from the table of contents in the first place.

So, we’ve got our starting point, and this is where things get interesting. What we have to do now is figure out where in that page the URLs of the articles we want to be reading on the train live.

Suppose want this: http://example.com/magazine/features/12345

And this: http://example.com/magazine/features/5678

But not this: http://example.com/magazine/news/21046

Or this: http://example.com/magazine/boring/to/you

So our pattern (escaped), would be:

http:\/\/example\.com\/magazine\/features\/(\d+)

…and that’s how we know what to grab. View source on this, by the way - you’ll have to pattern match based on whether the links are relative or absolute, and there’s no way outside of the raw source to tell which they are for sure, since the browser bangs the domain name onto relative links anyway. And don’t get greedy - I know people who have gotten IP-blocked from Slashdot for trying to use a scraper on an entire domain (though I can assure you that theirs were better-written and geared towards more useful ends). The whole point of this exercise is to stay under the radar and not be a colossal prick about leeching content you aren’t paying for.
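If you do end up with relative links, the URI module will resolve them against the page they came from for you - this bit is my own aside, not part of the script above, and the URLs are the hypothetical ones from earlier:

    use URI;

    # Resolve a possibly-relative href against the page it was found on.
    my $base     = 'http://example.com/magazine/table-of-contents/2008';
    my $absolute = URI->new_abs('/magazine/features/12345', $base)->as_string;
    # $absolute is now 'http://example.com/magazine/features/12345'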

It’s worth pointing out that you can do the same thing if a site uses titles or slugs in its links instead of numerical IDs; the regex is obviously just going to be different.

The next obvious step is to find each of those example.com/features/blah links and download the source. But there’s a trick here that I wish worked more often. Remember, we’re turning full-page HTML into text, so a lot of the bytes you pull down are just going to get binned.

Try the print version first.

More often than not, the print version is a JavaScript link, and it uses the same HTML, with a print CSS file that your script doesn’t care about, and can’t see (because we’re dealing with the source code, not the view). But sometimes you get lucky, and it’s something obvious:

http://example.com/magazine/features/12345/print

or http://example.com/magazine/features/12345?view=print

Whatever it is, check a page for the print view, and maybe you’ll get lucky. The other nice thing about printable views is that the good CMSes will concatenate all the pages of the article, because it makes things a bit easier on the person doing the printing - which also makes things a bit easier on us. Again, you’re using good web design principles to your own nefarious ends.
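Here’s a rough sketch of that probe - the two print-URL shapes are just the hypothetical ones from above, and it leans on the fact that LWP::Simple’s get returns undef when a fetch fails:

sub fetch_article_html {
    my $article_url = shift;

    # Try a couple of common print-view URL shapes first, falling back to
    # the regular page. Check your target site's "print" link to see what
    # shape, if any, it actually uses.
    foreach my $candidate ("$article_url/print", "$article_url?view=print", $article_url){
        my $html = get $candidate;
        return $html if defined $html;
    }

    return;
}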

Odds are you aren’t going to be able to do that, though, at least in my limited experience messing with this. So what you have to do is look for the pagination links. And this is tricky, more so than the previous bits. There are a couple of ways to handle pagination, and (again) you won’t know how your target site does it until you look.

If there are just “next page” and “previous page” links, you can use those: fetch and process a page, and while there’s a “next” link, grab and process that one too. If there isn’t one, or you don’t have a solid way of finding it via regex (maybe it’s not classed separately, or something), you’ll have to fall back on the more generalized approach below.
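When the next-link approach does work, it’s only a handful of lines - roughly this, where the class="next" marker is a guess that will vary from site to site:

sub follow_next_links {
    my $start_url = shift;
    my $body      = '';
    my %seen;

    my $url = $start_url;
    while ($url && !$seen{$url}++) {
        my $html = get $url;
        last unless defined $html;
        $body .= strip_html($html);

        # Pull the href out of something like <a class="next" href="...">;
        # if there's no such link, $url goes undef and the loop ends.
        ($url) = $html =~ /<a[^>]+class="next"[^>]+href="([^"]+)"/i;
    }

    return $body;
}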

The generalized solution, then, is to find something like this:

http://example.com/magazine/features/5678/1

Or maybe it’s:

http://example.com/magazine/features/5678?page=2

Whatever it is, you’ll have to do the legwork to find it, and then you’re set: you can iterate over those pages until you hit one that isn’t found. This isn’t terribly efficient, by the way. Every story you pull, on every site, may have to wait for a failed request or timeout after the last page before the script knows to stop and move on to the next article - and if the site redirects to a 404 page instead, you have to test against that. But it’s more generalized than coding for site-by-site pagination links. The final decision on that I leave to the reader, but I’m perfectly OK with putting in the up-front work to avoid constantly trying to open URLs that don’t exist. Remember, it’s not your bandwidth, so try to tread lightly.
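A rough sketch of that loop, again leaning on get returning undef on a failed fetch (the ?page=N shape is the hypothetical one from above):

sub fetch_numbered_pages {
    my $article_url = shift;
    my $body        = '';

    # Walk ?page=1, ?page=2, ... until a fetch comes back empty-handed.
    # If the site redirects missing pages to a 404 page instead of failing
    # outright, you'd have to sniff the returned HTML for that as well.
    for (my $page = 1; ; $page++) {
        my $html = get "$article_url?page=$page";
        last unless defined $html;
        $body .= strip_html($html) . "\n\n";
    }

    return $body;
}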

Keep track of whether a URL has already been fetched, as well - in the case of the same link being on a table of contents twice, multiple next/previous links, article subnavs, etc.

The actual page processing - separating the content we want from the things we don’t (navigation links, headers, footers, images) - is again done by regular expression, and I feel that this is really where the regex-based approach shows its weakness. You have to look for clues in the HTML that say where the main page content begins and ends. There might be telling HTML comments (’CONTENT BEGINS/ENDS HERE’) if you’re lucky, but otherwise you want a real parser, one that holds up even against malformed or wildly invalid documents. In my single evening or so of messing with this, I haven’t really found an ideal solution yet.

This is pretty much what it looks like:

sub scrape_site {
    my %fetched;
    my $toc_html = get 'http://example.com/mag/toc/2008/';

    # Find each issue's table of contents, then each feature inside it.
    while ($toc_html =~ m{http://example\.com/mag/toc/(\d{8})/}sg){
        my $issue_html = get "http://example.com/mag/toc/$1/";

        while ($issue_html =~ m{http://example\.com/news/features/(\d+)}sg){
            my $feature_id = $1;

            if(!$fetched{$feature_id}){
                print "Getting MagazineName feature $feature_id...\n";
                process_article("http://example.com/news/features/$feature_id");
                $fetched{$feature_id} = 1;
            }
        }
    }
}

sub process_article {
    my $article_url = shift;
    my $article_html = '';
    my $raw_html = get $article_url;
    my $title = $article_url;       # fall back to the URL if there's no <title>
    $article_html .= $title;        # lead the output file with where it came from
    my %pages;

    # Use the page <title> (trimmed) for the output filename.
    if($raw_html =~ /<title>(.*?)<\/title>/is){
        $title = substr($1, 0, 40).'...';
    }

    # Find each /indexN.html pagination link, fetch every page once, and
    # keep whatever sits between the site-specific story markers.
    while($raw_html =~ m{\Q$article_url\E/index(\d+)\.html}sg){
        my $page_url = "$article_url/index$1.html";

        if(!$pages{$page_url}){
            my $temp_html = get $page_url;
            $temp_html =~ /id="story">(.*?)<!-- end #story/s;
            $pages{$page_url} = strip_html($1);
        }
    }

    foreach(sort keys %pages){
        $article_html .= $pages{$_}."\n\n";
    }

    write_to_file("$title.txt", $article_html);
}

Not the prettiest code, but hopefully it gets the point across. It didn’t help that WordPress munged the bejesus out of this when I posted it.

Now, one of the things to note here is that while the nuts and bolts of the matching are different site-by-site, the basic process is the same: fetch the table of contents, look through the HTML and find the links for the articles you want, then go through each page of those features, grabbing the content and ripping out the HTML, and finally save it all out to a text file.

So, in theory, you could have one generalized set of functions to do all this, and just store your patterns in some kind of data structure - say, a nested hash. That is left as an exercise for the reader; it shouldn’t be that hard.
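Purely as a sketch of what that data structure might look like (the site names and patterns here are hypothetical, not lifted from anywhere real):

    # One entry per site; a generalized scrape_site() would loop over this,
    # fetch each TOC, and apply the patterns in turn.
    my %sites = (
        'MagazineName' => {
            toc_url         => 'http://example.com/mag/toc/2008/',
            feature_pattern => qr{http://example\.com/news/features/(\d+)},
            page_pattern    => qr{/index(\d+)\.html},
            story_pattern   => qr{id="story">(.*?)<!-- end #story}s,
        },
        'OtherMagazine' => {
            toc_url         => 'http://example.com/magazine/table-of-contents/2008',
            feature_pattern => qr{/magazine/features/(\d+)},
            page_pattern    => qr{\?page=(\d+)},
            story_pattern   => qr{CONTENT BEGINS(.*?)CONTENT ENDS}s,
        },
    );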

The end result, for the time being, is that I’m able to pull quite a bit of content with very little code, in a bare-bones format that takes up almost no disk space and is perfectly readable on the reader - assuming I remember to take it with me and keep the battery charged.

An obvious problem here is that this outputs plain text, and thus no images, as opposed to the more fully realized PDF solution I was using before. As a counterpoint to that, or to any other criticism, I’d like to offer the excuse that I spent more time typing this article than I did writing the script it’s about. So no, it’s not exactly polished software.

I hope this has been informative - even if I’m not breaking any new ground here, it’s a fun little project that I enjoyed messing with, and ideally the thought process around it was slightly interesting to read. Further, I hope that you aren’t a media executive, and are not phoning your lawyer right now.
