Scraping an ASP.Net site with Mechanize

gmarik 2 min

One day I needed to automate downloads from an ASP.Net-powered site.

I’ve never done any ASP before, so I don’t really know how it works. While digging through the site I found that it’s just a pile of crap (BTW, if you need scraping protection, use ASP.Net :). It also seems that ASP.Net tries to make simple things complex: even a basic link is actually a form submission. ASP.Net also abuses JavaScript by requiring it even for simple pages.
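
For context, an ASP.Net WebForms link typically carries an href such as `javascript:__doPostBack('target','argument')`. In a browser, that helper copies the two values into the hidden `__EVENTTARGET` and `__EVENTARGUMENT` fields and submits the page's form. The sketch below uses only Ruby's standard library to build the POST body such a "click" would produce. The control name, the `__VIEWSTATE` value, and the helper name `postback_body` are made up for illustration.

```ruby
require 'uri'

# Hypothetical helper: build the form-encoded body that a __doPostBack
# "click" would submit. hidden_fields stands in for the hidden inputs
# (such as __VIEWSTATE) that are already present in the page's form.
def postback_body(event_target, event_argument, hidden_fields = {})
  fields = hidden_fields.merge(
    '__EVENTTARGET'   => event_target,
    '__EVENTARGUMENT' => event_argument
  )
  URI.encode_www_form(fields)
end

# Illustrative values only; a real page supplies its own hidden fields.
body = postback_body('ctl00$Main$lnkDownloads', '', '__VIEWSTATE' => 'dDwtMTIz')
puts body
```

This is why a plain GET of the link's href gets nowhere: the server expects the whole form, event fields included, to be posted back.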

It turned out that it’s still possible to scrape the site with Mechanize, given some extra ASP.Net magic.

In my case I had to:

  1. extract the arguments from the links’ JavaScript handlers

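    # Returns the arguments of the link's javascript: handler as strings,
    # e.g. __doPostBack('target','arg') gives ["target", "arg"].
    def asp_link_args
      href = self.attributes['href']
      # Take the text inside the innermost parentheses, split it on commas,
      # and strip the surrounding quotes from each argument.
      href =~ /\(([^()]+)\)/ && $1.split(/\W?\s*,\s*\W?/).map(&:strip).map {|i| i.gsub(/^['"]|['"]$/,'')}
    end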
    
  2. asp-click the asp-links (which is a form submission)

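    def asp_click(action_arg = nil)
      args = asp_link_args
      etarget, earg = args.values_at(0, 1)

      f = self.page.form_with(:name => 'aspnetForm')
      # Some links post to a different URL, passed as one of the handler arguments.
      f.action = args[action_arg] if action_arg
      # Fill the hidden fields that __doPostBack would set in a browser.
      f['__EVENTTARGET'] = etarget
      f['__EVENTARGUMENT'] = earg
      f.submit
    end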
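
To see what the argument extraction yields, here is a standalone sketch of the same regex and split that can be run without Mechanize. The function name `parse_asp_link_args` and both sample hrefs are hypothetical. The second href uses the `WebForm_PostBackOptions` form, whose fifth argument (index 4) is, as far as I know, the action URL, which would explain an `action_arg` of 4.

```ruby
# Standalone version of the argument extraction (hypothetical name).
def parse_asp_link_args(href)
  inner = href[/\(([^()]+)\)/, 1] # contents of the innermost (...) pair
  return nil unless inner
  inner.split(/\W?\s*,\s*\W?/).map { |arg| arg.strip.gsub(/^['"]|['"]$/, '') }
end

# Made-up sample hrefs in the two shapes WebForms commonly emits.
simple = %q{javascript:__doPostBack('ctl00$Main$lnkJournal','')}
p parse_asp_link_args(simple)
# => ["ctl00$Main$lnkJournal", ""]

with_options = 'javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(' \
               '"ctl00$Main$lnkAll", "", true, "", "Downloads.aspx", false, true))'
args = parse_asp_link_args(with_options)
p args[0] # => "ctl00$Main$lnkAll"
p args[4] # => "Downloads.aspx"
```

Note that every argument comes back as a string, so `true` and `false` arrive as `"true"` and `"false"`.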
    

Both methods are defined in the Mechanize::Page::Link context, so we can express our clicking trail like this:

    file = page.link_with(:text => 'Journall').asp_click.
      link_with(:text => 'All Downloads').asp_click(4).
      link_with(:text => 'Download').asp_click

Which is pretty nice!
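
One way to put both methods "in the Mechanize::Page::Link context" is to collect them in a module and include it into the link class. This is a minimal sketch under that assumption, not necessarily how the original code was wired up. The module name `AspLinkHelpers` is hypothetical, and the form name `aspnetForm` is taken from the example site above.

```ruby
begin
  require 'mechanize' # third-party gem
rescue LoadError
  # Without the gem the module below still loads; it is just not attached.
end

# Hypothetical module name; the two methods mirror the steps above.
module AspLinkHelpers
  # Arguments of the javascript: handler in href, as an array of strings.
  def asp_link_args
    inner = attributes['href'].to_s[/\(([^()]+)\)/, 1]
    return [] unless inner
    inner.split(/\W?\s*,\s*\W?/).map { |a| a.strip.gsub(/^['"]|['"]$/, '') }
  end

  # Emulate the postback: fill the hidden event fields and submit the form.
  def asp_click(action_arg = nil)
    args = asp_link_args
    form = page.form_with(:name => 'aspnetForm')
    form.action = args[action_arg] if action_arg
    form['__EVENTTARGET']   = args[0]
    form['__EVENTARGUMENT'] = args[1]
    form.submit
  end
end

# Attach the helpers to every link Mechanize hands back.
Mechanize::Page::Link.include(AspLinkHelpers) if defined?(Mechanize::Page::Link)
```

Since `form.submit` returns the resulting page, each `asp_click` can be followed directly by another `link_with`, which is what makes the chained trail possible.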

The above solution isn’t a generic approach for scraping ASP.Net sites, but rather an example of how great Ruby and Mechanize are and how they can help make a complex solution easy.
