Aha: if you need to proxy your Spider calls, look no further than the HTTP Configuration gem.
I didn’t write this, and have yet to use it, but I think it goes like this:
http_conf = Net::HTTP::Configuration.new(:proxy_host => 'localhost', :proxy_port => 8881)
http_conf.apply do
  Spider.start_at('http://example.com/')
end
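For comparison, here is a minimal sketch of the same idea using only Ruby's standard library, which accepts a proxy host and port directly in `Net::HTTP.new`. This is not the HTTP Configuration gem's API, and it does not show how Spider itself makes requests; the helper name `proxied_http` is made up for illustration, and the host and port values are just the ones from the snippet above.

```ruby
require 'net/http'

# Hypothetical helper (not part of Spider or the HTTP Configuration gem):
# builds a Net::HTTP object that will route its requests through the proxy.
def proxied_http(host, port, proxy_host, proxy_port)
  Net::HTTP.new(host, port, proxy_host, proxy_port)
end

http = proxied_http('example.com', 80, 'localhost', 8881)

# No connection is opened yet; we are only inspecting the settings.
puts http.proxy?         # whether a proxy is configured
puts http.proxy_address  # the proxy host
puts http.proxy_port     # the proxy port
```

The object only stores the proxy settings until a request is made, so this is safe to run without a proxy actually listening on port 8881.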
So next up will be a tutorial covering proxying and other tricks like it, plus a way to use memcached with Spider.
One Comment
It seems the spider should handle ports. I believe this is a little different than a proxy. Here is my construct_complete_url(); it allows you to spider sites at ports other than 80.
def construct_complete_url(base_url, additional_url, parsed_additional_url = nil) #:nodoc:
  parsed_additional_url ||= URI.parse(additional_url)
  case parsed_additional_url.scheme
  when nil
    u = base_url.is_a?(URI) ? base_url : URI.parse(base_url)
    if additional_url[0].chr == '/'
      "#{u.scheme}://#{u.host}:#{u.port}#{additional_url}"
    elsif u.path.nil? || u.path == ''
      "#{u.scheme}://#{u.host}:#{u.port}/#{additional_url}"
    elsif u.path[0].chr == '/'
      "#{u.scheme}://#{u.host}:#{u.port}#{u.path}/#{additional_url}"
    else
      "#{u.scheme}://#{u.host}:#{u.port}/#{u.path}/#{additional_url}"
    end
  else
    additional_url
  end
end
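As a point of comparison, Ruby's standard `URI.join` also keeps a non-default port when resolving a link against a base URL. The sketch below is a different technique from the method above, not a drop-in replacement: it follows the usual relative-reference rules, so a relative link replaces the last path segment of the base rather than being appended to it. The helper name `resolve_link` and the example URLs are made up for illustration.

```ruby
require 'uri'

# Hypothetical helper: resolve a link found on a page against that page's URL.
# Absolute links come back unchanged; relative ones inherit the scheme, host,
# and port of the base URL.
def resolve_link(base_url, link)
  URI.join(base_url, link).to_s
end

puts resolve_link('http://example.com:8080/docs/', 'page.html')
puts resolve_link('http://example.com:8080/docs/', '/about')
puts resolve_link('http://example.com:8080/docs/', 'http://other.example/')
```

One visible difference: `URI.join` leaves the port out of the result when it is the scheme's default, whereas the method above always writes it.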