The problem with Spider has been that it can use all your memory. The reason is that the Web is a graph, and to avoid cycles Spider stores each URL it encounters. Since the Web is a really, really, really gigantic graph, you eventually run out of memory.
Now, with memcached, Spider can draw on not only all the memory on one computer, but all the memory on many computers!
require 'spider'
require 'spider/included_in_memcached'

SERVERS = ['10.0.10.2:11211','10.0.10.3:11211','10.0.10.4:11211']

Spider.start_at('http://mike-burns.com/') do |s|
  s.check_already_seen_with IncludedInMemcached.new(SERVERS)
end
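If I understand the hook correctly, `check_already_seen_with` takes any object that responds to `<<` (record a URL) and `include?` (membership test), which is the duck type `IncludedInMemcached` fills. Here's a minimal sketch of a drop-in replacement backed by a plain in-process Set; the class name `IncludedInSet` is my own invention, not part of the gem:

```ruby
require 'set'

# A hypothetical "already seen" store, assuming the duck type
# check_already_seen_with expects: << and include?. Stores URLs
# in local memory, so it has the old one-machine limits -- it's
# only here to show the interface.
class IncludedInSet
  def initialize
    @seen = Set.new
  end

  # Record a URL as visited.
  def <<(url)
    @seen << url.to_s
  end

  # Has this URL been seen before?
  def include?(url)
    @seen.include?(url.to_s)
  end
end
```

You'd plug it in the same way as the memcached version: `s.check_already_seen_with IncludedInSet.new`.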
Also new in this version is a tutorial on the main Spider Web site. Yeah!