Scraping an ASP.NET site with Mechanize

One day I needed to automate downloads from an ASP.NET-powered site.

I’ve never done any ASP before, so I don’t really know how it works. While digging through the site I found that it’s just a pile of crap (BTW, if you need scraping protection, use ASP.NET :) It also seems like ASP.NET tries to make simple things complex: even a basic link is actually a form submission. It also abuses JavaScript by making it required even for simple pages.

It turned out that it’s still possible to scrape the site with Mechanize, given some extra ASP.NET magic.

In my case I had to:

extract the arguments from the JavaScript handlers in the links' hrefs

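def asp_link_args
  href = self.attributes['href']
  # Grab the innermost "(...)" of the JavaScript call, split it on commas,
  # and strip the surrounding quotes from each argument.
  href =~ /\(([^()]+)\)/ && $1.split(/\W?\s*,\s*\W?/).map(&:strip).map {|i| i.gsub(/^['"]|['"]$/,'')}
end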
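
To see what this returns, here is a small standalone sketch that applies the same regexes to a made-up `__doPostBack` href. The helper name `js_call_args` and the control ID are hypothetical and only there for illustration; the pattern itself is the one from the method above.

```ruby
# Standalone version of the extraction logic, so it can be tried without Mechanize.
# `js_call_args` is a hypothetical helper name; the regexes are the ones used above.
def js_call_args(href)
  return nil unless href =~ /\(([^()]+)\)/
  $1.split(/\W?\s*,\s*\W?/).map(&:strip).map { |i| i.gsub(/^['"]|['"]$/, '') }
end

# A made-up href of the kind ASP.NET WebForms renders for a link.
href = "javascript:__doPostBack('ctl00$Main$lnkJournal','')"
args = js_call_args(href)
p args
```

The first element is the event target (the control name) and the second is the event argument, which is often empty.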

asp-click the ASP links (which is really a form submission)

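def asp_click(action_arg = nil)
  etarget,earg = asp_link_args.values_at(0, 1)

  f = self.page.form_with(:name => 'aspnetForm')
  # Optionally take the form action from one of the handler arguments.
  # Index directly: values_at would return an Array, not a String.
  f.action = asp_link_args[action_arg] if action_arg
  # These two hidden fields tell the server which control was "clicked".
  f['__EVENTTARGET'] = etarget
  f['__EVENTARGUMENT'] = earg
  f.submit
end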
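
For context, this mirrors what the page's own `__doPostBack` JavaScript does: it writes the control name and argument into the `__EVENTTARGET` and `__EVENTARGUMENT` hidden fields and posts the form together with its other hidden fields, such as `__VIEWSTATE`. A rough sketch of the resulting request body, using only the standard library, might look like this. The helper name `postback_body`, the control ID, and the `__VIEWSTATE` value are all made up for illustration.

```ruby
require 'uri'

# Build the URL-encoded POST body of a WebForms-style postback.
# `postback_body` is a hypothetical helper, not part of Mechanize.
def postback_body(hidden_fields, event_target, event_argument = '')
  fields = hidden_fields.merge(
    '__EVENTTARGET'   => event_target,
    '__EVENTARGUMENT' => event_argument
  )
  URI.encode_www_form(fields)
end

# Made-up hidden field value; a real page supplies its own.
hidden = { '__VIEWSTATE' => 'dDwtMTIzNDU2Nzg5Ow==' }
body = postback_body(hidden, 'ctl00$Main$lnkDownload')
puts body
```

Mechanize takes care of the encoding and the remaining hidden fields when `f.submit` is called, which is why `asp_click` only needs to set those two values.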

Both methods are defined in the Mechanize::Page::Link context, so we can express our clicking trail like this:

file = page.link_with(:text => 'Journall').asp_click.
  link_with(:text => 'All Downloads').asp_click(4).
  link_with(:text => 'Download').asp_click

Which is pretty nice!
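
The `asp_click(4)` call deserves a note. One plausible reading, and it is only an assumption about the site, is that the link used `WebForm_DoPostBackWithOptions`, whose `WebForm_PostBackOptions` arguments carry the action URL in the fifth position (index 4). The sketch below runs the same extraction regexes on a made-up href of that shape; `js_call_args`, the control ID, and `Downloads.aspx` are hypothetical.

```ruby
# Same extraction logic as asp_link_args, as a standalone hypothetical helper.
def js_call_args(href)
  return nil unless href =~ /\(([^()]+)\)/
  $1.split(/\W?\s*,\s*\W?/).map(&:strip).map { |i| i.gsub(/^['"]|['"]$/, '') }
end

# Made-up href; the argument order follows WebForm_PostBackOptions:
# eventTarget, eventArgument, validation, validationGroup, actionUrl, ...
href = 'javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(' \
       '"ctl00$Main$lnkAll", "", true, "", "Downloads.aspx", false, true))'

args = js_call_args(href)
action_url = args[4] # the value asp_click(4) would assign to the form action
puts action_url
```

Because the regex only matches a parenthesised group with no nested parentheses, it picks up the inner `WebForm_PostBackOptions(...)` argument list rather than the outer call.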

The above solution isn’t a generic approach to scraping ASP.NET sites, just an example of how great Ruby and Mechanize are and how they can make a complex solution easy.
