Spider the Web with Ruby

I wrote a Ruby library for crawling the Web. Use it to take down The Man, like so:

require 'rubygems'
require 'spider'
include Spider
spider(['http://del.icio.us/mikeburns']) do |a_url, a_web_page|
  puts "I am taking down The Man by knowing this URL: #{a_url}"
end

I used it to get people’s addresses from around the Web. I plan to put them on a map. I like putting things on maps.

It once took obscene amounts of memory, until I discovered that Ruby does not optimize tail calls. Now it only takes large amounts of memory instead.
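
The tail-call issue is easy to reproduce outside the library. The sketch below is not the gem's actual code; it assumes a hypothetical in-memory `LINKS` hash standing in for fetched pages, and contrasts a crawl that recurses once per page with one that loops over an explicit work list. Because MRI does not eliminate tail calls by default, the recursive form keeps a stack frame alive for every page visited, while the loop does not.

```ruby
require 'set'

# Hypothetical link graph standing in for real HTTP fetches:
# each key is a URL, each value the URLs that page links to.
LINKS = {
  'a' => ['b', 'c'],
  'b' => ['c'],
  'c' => ['a']
}.freeze

# Recursive form: the call to itself is in tail position, but MRI
# does not optimize it by default, so every page adds a stack frame.
def crawl_recursive(links, pending, seen = Set.new, visited = [])
  return visited if pending.empty?

  url, *rest = pending
  if seen.add?(url)
    visited << url
    rest += links.fetch(url, [])
  end
  crawl_recursive(links, rest, seen, visited)
end

# Iterative form: same traversal order, constant stack depth.
def crawl_iterative(links, start_urls)
  pending = start_urls.dup
  seen = Set.new
  visited = []
  until pending.empty?
    url = pending.shift
    next unless seen.add?(url)

    visited << url
    pending.concat(links.fetch(url, []))
  end
  visited
end

puts crawl_recursive(LINKS, ['a']).inspect
puts crawl_iterative(LINKS, ['a']).inspect
```

Both methods visit the same pages in the same order on a small graph; the difference only shows up on long crawls, where the recursive version's stack keeps growing.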

Want it? Install it (GNU GPL):

gem install spider

And then read up on it:

ri Spider#spider

If you do something amazing with it let me know. I like amazing things more than I like putting things on maps.

Update: New version with a totally different API.

20 Comments

  1. Posted April 12, 2007 at 10:41 pm | Permalink

    Thanks, this is exactly what I was looking for. I don’t know if what I’m doing with it is __amazing__, but this (in conjunction with hpricot) will save me a lot of work.

  2. Robert Dempsey
    Posted June 23, 2007 at 9:45 pm | Permalink

    Mike,

    Thanks for the great gem. Where do I get the ‘Spider’ from? I am getting an uninitialized constant error for Spider. Thanks.

    – Robert

  3. Posted June 23, 2007 at 11:59 pm | Permalink

    The Spider module should be loaded into your program with:

    require 'rubygems'

    require 'spider'

    If it isn’t, maybe you need to install it: gem install spider

    If that still doesn’t work let me know; I can try installing it from scratch to see if there are any errors there.

    • Eric
      Posted April 29, 2010 at 2:44 pm | Permalink

      This solution will work on Windows/Ruby:

      require 'rubygems'
      gem 'spider'
      require 'spider'
      #include spider

      Spider.start_at(…

  4. Posted July 2, 2007 at 5:03 pm | Permalink

    I did the gem install:

    C:\Documents and Settings\lgrey>gem install spider
    Bulk updating Gem source index for: http://gems.rubyforge.org
    Successfully installed spider-0.1.0
    Installing ri documentation for spider-0.1.0…
    Installing RDoc documentation for spider-0.1.0…

    C:\Documents and Settings\lgrey>

    Then I ran this:

    require 'rubygems'
    require 'spider'
    include Spider

    puts "starting"
    spider(['http://del.icio.us/mikeburns']) do |a_url, a_web_page|
      puts "in block"
      puts "I am taking down The Man by knowing this URL: #{a_url}"
    end

    And got this result:

    C:\Documents and Settings\lgrey\user\blogxfer/spider.rb:3: uninitialized constant Spider (NameError)
    from C:/ruby/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:27:in `require'
    from C:/Documents and Settings/lgrey/user/blogxfer/spider.rb:2

    I’m too new to Ruby to be sure what’s going on. Any advice? Thanks!

  5. Posted July 2, 2007 at 6:51 pm | Permalink

    Lee: The commenter just before you posted the same problem. I’ll play with the module today to try to reproduce it.

    Which version of Ruby is this? This is on Windows, I assume.

  6. Posted July 3, 2007 at 1:15 pm | Permalink

    Mike,

    I saw the previous post, and I wanted to give you more details. :-)

    Yes, ruby 1.8.4 on Windows:

    C:\Documents and Settings\lgrey>ruby -v
    ruby 1.8.4 (2006-04-14) [i386-mswin32]

    Thanks,
    Lee

  7. Posted July 7, 2007 at 3:59 pm | Permalink

    I’m checking back daily for your advice on how to get past this problem. Have you had a chance to check it out?

  8. David Jameson
    Posted July 9, 2007 at 4:01 pm | Permalink

    Just what I was looking for as I get up to speed with Ruby.

    I did run into an error, however: as the site was spidered, I occasionally saw messages such as the following:

    #

  9. David Jameson
    Posted July 9, 2007 at 4:02 pm | Permalink

    Hmmm, looks like I couldn’t post the error message because it was enclosed in angle brackets (sigh)
    Here it is again with the angle brackets removed.

    #RangeError: 0x29cb754 is recycled object

  10. Posted July 16, 2007 at 12:06 am | Permalink

    I finally resolved the NameError problem by adding require_gem:
    require 'rubygems'
    require_gem 'spider'
    require 'spider'
    include Spider

    It also seems that the robot rules for del.icio.us are a problem. Even once I resolved the NameError problem, I was getting no results. When I finally came to the conclusion that my sample code couldn’t NOT work, I tried changing the URL to be spidered to my own domain, and, lo and behold, it worked!

    For everyone’s sanity, I highly recommend that you change the URL in your example code to something that can actually be spidered. ;-)

    Thanks for the gem!

  11. Posted July 18, 2007 at 2:50 pm | Permalink

    Lee: Thanks for solving that. Gems constantly cause similar annoyances for me.

  12. Posted July 18, 2007 at 2:51 pm | Permalink

    David: I’ve no idea what that error could mean. Perhaps it’s from your code? I’ll look through my code and see if I use a range anywhere.

  13. spidernewb
    Posted October 23, 2007 at 12:44 pm | Permalink

    How do you use this through a proxy server?

    • Eric
      Posted April 29, 2010 at 2:35 pm | Permalink

      From windows cmd, do:

      set http_proxy=http://yourproxy.com:port

  14. Posted October 23, 2007 at 1:00 pm | Permalink

    I’ve no idea how to use this through a proxy server. It uses Net::HTTP; do you know how to use a proxy server with that?

    It’s likely that the current version (0.2.0) cannot go through a proxy, but I can add that feature to the next version if we figure this out.
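
On the Net::HTTP question: the standard library does accept proxy settings. `Net::HTTP.new` takes the proxy host and port as its third and fourth arguments, and newer Ruby versions (2.0 onward, to my knowledge) also honor the `http_proxy` environment variable by default, which is what the `set http_proxy=...` suggestion above relies on. The sketch below only builds the connection object and makes no request. The host names and port are placeholders, `proxied_http` is a hypothetical helper, and whether the Spider gem offers a way to hand it such an object is not established anywhere in this thread.

```ruby
require 'net/http'

# Hypothetical helper: build a Net::HTTP object that will route its
# requests through the given proxy. No connection is opened until a
# request is made or #start is called.
def proxied_http(target_host, proxy_host, proxy_port)
  Net::HTTP.new(target_host, 80, proxy_host, proxy_port)
end

# Placeholder hosts and port, not real services.
http = proxied_http('example.com', 'proxy.example.com', 8080)

puts http.proxy?         # whether a proxy is configured
puts http.proxy_address  # the proxy host passed in
puts http.proxy_port     # the proxy port passed in
```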

  15. kostia
    Posted November 6, 2008 at 9:46 am | Permalink

    thanks for great tool!
    is there any possibility to make it crawl breadth-first with a defined maximal depth?

  16. Posted November 6, 2008 at 10:11 am | Permalink

    @kostia: Nope. You can ask John Nargo.
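
For readers who want the behavior kostia describes, a breadth-first crawl with a depth cap can be sketched independently of the gem. This is not a feature of the library; `SITE` is a hypothetical link graph standing in for fetched pages, and `crawl_breadth_first` is an illustrative name. The idea is a first-in, first-out queue of `[url, depth]` pairs, where links are only followed from pages that are still under the limit.

```ruby
require 'set'

# Hypothetical link graph standing in for fetched pages.
SITE = {
  '/'       => ['/about', '/blog'],
  '/about'  => ['/team'],
  '/blog'   => ['/blog/1', '/'],
  '/blog/1' => ['/blog/2'],
  '/team'   => []
}.freeze

# Visit pages level by level, never following links from a page
# that is already max_depth hops from the start.
# Returns [url, depth] pairs in visiting order.
def crawl_breadth_first(links, start_url, max_depth)
  queue = [[start_url, 0]]
  seen = Set.new([start_url])
  visited = []
  until queue.empty?
    url, depth = queue.shift
    visited << [url, depth]
    next if depth >= max_depth

    links.fetch(url, []).each do |child|
      queue << [child, depth + 1] if seen.add?(child)
    end
  end
  visited
end

crawl_breadth_first(SITE, '/', 2).each do |url, depth|
  puts "#{depth}: #{url}"
end
```

With a limit of 2, `/blog/2` is never reached because it sits three hops from the start page.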

  17. Carl
    Posted January 15, 2009 at 4:00 pm | Permalink

    Sorry, but it doesn’t work for me.
    I get no errors or warnings during installation, but
    if I try your example code I get this:

    TypeError: wrong argument type Class (expected Module)
    Line 3

    ruby 1.8.6 (2008-03-03 patchlevel 114) [universal-darwin9.0]
    Mac OS X Leopard 10.5.6

    thx for answers

    • Michael
      Posted August 24, 2010 at 3:40 am | Permalink

      @Carl

      You shouldn’t need to do:
      include Spider

      Spider should already be filled with things :)
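
The TypeError Carl reports is what Ruby raises whenever `include` is given a class rather than a module, which is consistent with a release of the gem in which `Spider` is a class. The sketch below reproduces the message with hypothetical stand-in names (`OldStyleSpider`, `NewStyleSpider`, and a `start_at` that only returns a string); it does not load the gem.

```ruby
# Hypothetical stand-ins; neither is the gem's real definition.
module OldStyleSpider
  def spider_greeting
    'included from a module'
  end
end

class NewStyleSpider
  def self.start_at(url)
    "would start crawling at #{url}"
  end
end

# Including a module works and adds its methods.
class WorksFine
  include OldStyleSpider
end

# Including a class raises TypeError; return the message for display.
def include_error_message
  Class.new { include NewStyleSpider }
  nil
rescue TypeError => e
  e.message
end

puts WorksFine.new.spider_greeting
puts include_error_message
puts NewStyleSpider.start_at('http://example.com/')
```

When the constant is a class, calling its class methods directly, as in the last line, is the form that works.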


2 Trackbacks/Pingbacks

  1. […] Spider the Web with Ruby « Mike Burns, Coder (tags: Programming Ruby Spider Crawler Library) […]

  2. […] to have the crawler in Ruby. After some Googling around, I found out this very neat crawler, Spider written in Ruby by Mike Burns. You can find the source code here. The code is currently maintained […]
