Proxied Spider

Aha: if you need to proxy your Spider calls, look no further than the HTTP Configuration gem.

I didn’t write this, and have yet to use it, but I think it goes like this:

http_conf = Net::HTTP::Configuration.new(:proxy_host => 'localhost', :proxy_port => 8881)
http_conf.apply do
  Spider.start_at('http://example.com/')
end
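
For a sense of what "proxied" means one level down, here is a minimal sketch using only Ruby's standard library Net::HTTP rather than the HTTP Configuration gem. It assumes a proxy is listening on localhost:8881 (the same values as above); `proxied_http` is a hypothetical helper name, not part of Spider or the gem. Nothing here opens a connection, it only shows where the proxy settings end up.

```ruby
require 'net/http'

# Hypothetical helper (not part of Spider or the HTTP Configuration gem):
# builds a Net::HTTP object whose requests would be routed through a proxy.
def proxied_http(host, port, proxy_host, proxy_port)
  # Net::HTTP.new takes the proxy address and port as its 3rd and 4th arguments.
  Net::HTTP.new(host, port, proxy_host, proxy_port)
end

# Assumed values: a proxy on localhost:8881, as in the snippet above.
http = proxied_http('example.com', 80, 'localhost', 8881)

puts http.proxy?         # true
puts http.proxy_address  # localhost
puts http.proxy_port     # 8881
```

The appeal of wrapping calls in a configuration block, as the gem appears to do, is that code you don't control never has to pass these arguments itself.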

So next up will be a tutorial with stuff like this and other cool stuff, plus a way to use memcached with Spider.


One Comment

  1. Tom White
    Posted January 13, 2009 at 5:01 pm | Permalink

    It seems the spider should handle ports. I believe this is a little different than a proxy. Here is my construct_complete_url(); it lets you spider sites on ports other than 80.

    def construct_complete_url(base_url, additional_url, parsed_additional_url = nil) #:nodoc:
      parsed_additional_url ||= URI.parse(additional_url)
      case parsed_additional_url.scheme
      when nil
        u = base_url.is_a?(URI) ? base_url : URI.parse(base_url)
        if additional_url[0].chr == '/'
          "#{u.scheme}://#{u.host}:#{u.port}#{additional_url}"
        elsif u.path.nil? || u.path == ''
          "#{u.scheme}://#{u.host}:#{u.port}/#{additional_url}"
        elsif u.path[0].chr == '/'
          "#{u.scheme}://#{u.host}:#{u.port}#{u.path}/#{additional_url}"
        else
          "#{u.scheme}://#{u.host}:#{u.port}/#{u.path}/#{additional_url}"
        end
      else
        additional_url
      end
    end
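
For comparison, Ruby's standard library can also resolve a link against a base URL while keeping a non-default port. The sketch below uses `URI.join`; `resolve_link` is a hypothetical helper name, and this is not how Spider itself does it. Note that `URI.join` follows standard relative-reference resolution, so a relative link replaces the last path segment of the base instead of being appended to the whole path as in construct_complete_url above.

```ruby
require 'uri'

# Hypothetical helper: resolves a link found on a page against that page's URL.
# URI.join keeps the base's port, so sites on ports other than 80 still work.
def resolve_link(base_url, link)
  URI.join(base_url, link).to_s
end

base = 'http://example.com:8080/docs/index.html'

puts resolve_link(base, 'page.html')              # http://example.com:8080/docs/page.html
puts resolve_link(base, '/about')                 # http://example.com:8080/about
puts resolve_link(base, 'http://other.example/x') # http://other.example/x
```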

