An updated way to spider the Web with Ruby

I’ve released version 0.2.0 of Spider. Everything has changed:

  • Use RSpec to ensure that it mostly works.
  • Use WEBrick to create a small test server for additional testing.
  • Completely re-do the API to prepare for future expansion.
  • Add the ability to apply each URL to a series of custom allowed?-like matchers.
  • BSD license.

The new API is kinda cool. From the README:

Spider.start_at('http://mike-burns.com/') do |s|
  # Limit the pages to just this domain.
  s.add_url_check do |a_url|
    a_url =~ %r{^http://mike-burns.com.*}
  end

  # Handle 404s.
  s.on 404 do |a_url, err_code|
    puts "URL not found: #{a_url}"
  end

  # Handle 2xx.
  s.on :success do |a_url, code, headers, body|
    puts "body: #{body}"
  end

  # Handle everything.
  s.on :any do |a_url, resp|
    puts "URL returned anything: #{a_url} with this code #{resp.code}"
  end
end
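
Those allowed?-like matchers from the changelog stack: you can register more than one url check on the same spider. Here's a minimal sketch, assuming every registered check has to return a truthy value before a URL gets crawled; the image-skipping check is my own illustration, not something from the README:

require 'spider'

Spider.start_at('http://mike-burns.com/') do |s|
  # Check 1: stay on this domain.
  s.add_url_check do |a_url|
    a_url =~ %r{^http://mike-burns.com.*}
  end

  # Check 2 (illustrative): skip URLs that look like images.
  s.add_url_check do |a_url|
    a_url !~ /\.(gif|jpe?g|png)$/i
  end

  # Print everything that survives both checks.
  s.on :any do |a_url, resp|
    puts "crawled: #{a_url} (#{resp.code})"
  end
end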

I just uploaded it to RubyForge, so give it a minute and then run gem update spider.


2 Comments

  1. Posted November 20, 2008 at 8:03 am

    Hi Mike

    You need to update this example: the "Handle 2xx" handler should be:

    s.on :success do |a_url, resp, prior_url|
      puts "body: #{resp.body}"
    end

    Although, for the record, I disapprove of contracting ‘response’ to ‘resp’; I just put it that way for consistency with the rest of your example.

    Thanks for the gem!

  2. Posted October 5, 2009 at 12:54 pm

    Hey Mike –

    Any way to specify how many levels of depth the spider should go? Say it starts at mike-burns.com and I don’t want to limit it by domain name, but by levels of links to follow. The first level might take you to Digg, and then a Digg link takes you to Flickr, etc. Say I only want to crawl the first level (Digg) and not go any further. Any easy way to do that currently?

    Thanks again for the bot. It’s really helped me with my Ruby project so far. I’m actually using it alongside a Rails project that I’m working on.
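
Spider 0.2.0 doesn’t appear to expose a depth option, so the closest approximation is rolling the breadth-first bookkeeping yourself. Here’s a rough sketch outside of Spider entirely, using only net/http from the standard library; the link-extraction regex is deliberately naive, and MAX_DEPTH is a name invented for the example:

require 'net/http'
require 'uri'

MAX_DEPTH = 1  # 0 = only the start page; 1 = the start page plus what it links to

def crawl(start_url, max_depth)
  seen  = {}
  queue = [[start_url, 0]]  # breadth-first queue of [url, depth] pairs

  until queue.empty?
    url, depth = queue.shift
    next if seen[url] || depth > max_depth
    seen[url] = true

    resp = Net::HTTP.get_response(URI.parse(url)) rescue next
    next unless resp.is_a?(Net::HTTPSuccess)
    puts "depth #{depth}: #{url}"

    # Naive href scraping; a real crawler would use an HTML parser.
    resp.body.scan(/href="(http[^"]+)"/) do |(link)|
      queue << [link, depth + 1]
    end
  end
end

crawl('http://mike-burns.com/', MAX_DEPTH)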


3 Trackbacks/Pingbacks

  1. […] New version with a totally different API. This entry was written by Mike Burns and posted on March 31, 2007 at 3:03 am and filed under […]

  2. […] dizzy, but managed to release some code and blog about your desire to employ […]

  3. […] next step was running this test against a local development server (http://localhost:3000 for a Rails webapp). Spiking around, I started looking for existing spider engines written in Ruby, so that I could easily integrate one into our test codebase. It took me one hour to evaluate tools: the winner was ruby-spider, forked from Spider. […]
