One day I needed to automate downloads from an ASP.NET-powered site.
I'd never done any ASP before, so I don't really know how it works. While digging through my target site I found that it's just a pile of crap (BTW, if you need scraping protection - use ASP.NET :) ASP.NET also seems to make simple things complex: even a basic link is actually a form submission, and it abuses JavaScript by requiring it even for simple pages.
It turned out that it's still possible to scrape the site with Mechanize, plus some extra ASP.NET magic.
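Mechanize does most of the heavy lifting here: it keeps cookies, parses each response, and carries the WebForms hidden fields (`__VIEWSTATE` and friends) along automatically, since we resubmit the page's real form instead of forging requests. A minimal setup sketch; the URL is hypothetical and the user-agent tweak is my assumption, not something from the original session:

```ruby
require 'mechanize'

agent = Mechanize.new
# Pose as a browser; some WebForms sites misbehave for unknown
# user agents (an assumption, tune per site).
agent.user_agent_alias = 'Windows Mozilla'

# Hypothetical entry point; the post never names the site.
page = agent.get('http://example.com/Default.aspx')
```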
In my case I had to:

1. Extract the JavaScript arguments from the links' JavaScript handlers (there's a quick demo of what this digs out right after this list):

```ruby
# Pull the arguments out of a javascript:__doPostBack('target', 'argument') href.
def asp_link_args
  href = self.attributes['href']
  href =~ /\(([^()]+)\)/ &&
    $1.split(/\W?\s*,\s*\W?/).map(&:strip).map { |i| i.gsub(/^['"]|['"]$/, '') }
end
```

2. "asp-click" the asp-links (which is really just a form submission):

```ruby
# Emulate an ASP.NET postback: fill in the hidden event fields
# and submit the page's single WebForms form.
def asp_click(action_arg = nil)
  etarget, earg = asp_link_args.values_at(0, 1)
  f = self.page.form_with(:name => 'aspnetForm')
  # Optionally point the form action at one of the link's other arguments.
  f.action = asp_link_args[action_arg] if action_arg
  f['__EVENTTARGET']   = etarget
  f['__EVENTARGUMENT'] = earg
  f.submit
end
```
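To see what `asp_link_args` extracts, take a typical WebForms link. The markup below is illustrative (the post never shows the real site's HTML), but the shape is standard `__doPostBack` boilerplate:

```ruby
# A WebForms "link" usually looks something like:
#   <a href="javascript:__doPostBack('ctl00$grid','Page$4')">4</a>
link = page.link_with(:href => /__doPostBack/)
link.asp_link_args  # => ["ctl00$grid", "Page$4"]
```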
Both methods are defined in the Mechanize::Page::Link context, so we can express our clicking trail like:
```ruby
file = page.link_with(:text => 'Journall').asp_click.
            link_with(:text => 'All Downloads').asp_click(4).
            link_with(:text => 'Download').asp_click
```
Which is pretty nice!
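One footnote: depending on the response's Content-Type, that last `asp_click` hands back a `Mechanize::Page` or a `Mechanize::File`; if it's a file, saving it is one more call (the filename is made up):

```ruby
file.save('journal.pdf')  # hypothetical filename
```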
The above solution isn't a generic approach to scraping ASP.NET sites; it's just an example of how great Ruby and Mechanize are, and how they can help you make a complex solution easy.