Web scraping with Ruby/Mechanize
Intro
The Mechanize library automates interaction with websites. Besides Ruby, Mechanize is also available for Perl and Python. It automatically stores and sends cookies, follows redirects, and can follow links as well as populate and submit forms. Mechanize also keeps a history of the pages you have visited.
Mechanize uses Nokogiri to parse HTML. What does this mean for you? You can treat a Mechanize page like a Nokogiri object: once you have used Mechanize to navigate to the page you need to scrape, you can use Nokogiri methods to search the DOM via XPath or CSS3 selectors.
Example of nokogiri
To get all the result titles from a Google search for "rails":
Example of mechanize
Navigate to Google, fill out the form and submit it:
Real world example
For my studies I frequently need a timetable that is hosted in a Joomla CMS, where it is only accessible to registered users after login.
This is the code of the login form (note the last hidden field).
To get the timetable, I have to go to the login page, extract the secret hash (the last hidden field), fill out the login form, submit it, and then navigate to the timetable page.
Whenever you need an XPath, you can use the Firefox extension Firebug: go to the element, click on "Inspect Element" and then on "copy XPath".
Installation
On Ubuntu you can install mechanize with apt(itude):
$ sudo aptitude install libwww-mechanize-ruby
or via gem:
$ sudo gem install mechanize
$ sudo gem install nokogiri