Monthly Archives: November 2007

Spider bugfix

There were two issues with version 0.4.0 of Spider, both caught by Henri Cook. These are now fixed in 0.4.1:

As documented, you use IncludedInMemcached like this: require 'spider/included_in_memcached'.
Sometimes HTTP redirects assume a base URL; this is now handled.
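
To illustrate the redirect issue: a server may answer with a relative Location header, which only makes sense when resolved against the URL that was requested. The following is a minimal sketch using only Ruby's standard library. It is not Spider's actual code, and the method name resolve_redirect is made up for illustration.

```ruby
require 'uri'

# Hypothetical helper, not part of Spider: resolve a redirect's Location
# header against the URL that was originally requested.
def resolve_redirect(request_url, location)
  # URI.join leaves absolute locations alone and resolves relative
  # ones against the request URL.
  URI.join(request_url, location).to_s
end

puts resolve_redirect('http://example.com/a/b', '/login')
puts resolve_redirect('http://example.com/a/b', 'c')
puts resolve_redirect('http://example.com/a/b', 'http://other.example/x')
```

An absolute Location passes through unchanged, so the same call covers both kinds of redirect.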

Spider with memcached

The problem with Spider has been that it can use all your memory. The reason is that the Web is a graph, and to avoid cycles Spider stores each URL it encounters. Since the Web is a really, really, really gigantic graph, you eventually run out of memory.
Now you can use memcached to use not [...]
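
A rough sketch of why the visited-URL store grows: the crawler has to remember every URL it has seen, or a cycle in the link graph keeps it going forever. The tiny graph and the crawl method below are made up for illustration and are not Spider's implementation. The seen store is passed in, so in principle anything that responds to add? the way Set does could stand in for it, including a store kept outside the Ruby process.

```ruby
require 'set'

# Hypothetical link graph standing in for the Web: page => pages it links to.
# Note the cycle between 'a' and 'b'.
LINKS = {
  'a' => ['b', 'c'],
  'b' => ['a'],
  'c' => ['b', 'd'],
  'd' => []
}

# Breadth-first crawl that skips any URL already recorded in `seen`.
def crawl(start, links, seen = Set.new)
  queue = [start]
  order = []
  until queue.empty?
    url = queue.shift
    # Set#add? returns nil when the URL is already present.
    next unless seen.add?(url)
    order << url
    queue.concat(links.fetch(url, []))
  end
  order
end

p crawl('a', LINKS)
```

Every URL visited ends up in seen, which is exactly the part that grows without bound on a graph the size of the Web.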

Proxied Spider

Aha: if you need to proxy your Spider calls, look no further than the HTTP Configuration gem.
I didn’t write this, and have yet to use it, but I think it goes like this:

http_conf = Net::HTTP::Configuration.new(:proxy_host => 'localhost', :proxy_port => 8881)
http_conf.apply do
  Spider.start_at('http://example.com/')
end

So next up will be a tutorial with stuff like this and other [...]

Spider: API changes, setup and teardown, HTTP headers

The newest version of Spider, 0.3.0, is hitting your gem tree Real Soon Now. This release features:

Set the headers of an HTTP request.
This can be used to set the cookies, user agent, and many other fine things.
setup and teardown handlers.
Seems like a good place to set the headers if the headers are conditional on the [...]
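
For a feel of what setting request headers amounts to underneath, here is a small sketch with Ruby's standard Net::HTTP. It does not use Spider's own API, and the build_request helper is made up for illustration. It only builds the request; nothing is sent.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper, not part of Spider: build a GET request for `url`
# carrying the given headers.
def build_request(url, headers = {})
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  # Header names are case-insensitive; later assignments replace defaults.
  headers.each { |name, value| request[name] = value }
  request
end

req = build_request('http://example.com/path?q=1',
                    'User-Agent' => 'MySpider/0.1',
                    'Cookie'     => 'session=abc123')
puts req['User-Agent']
puts req['Cookie']
```

A setup handler that runs before each request would be a natural place to build such a hash when the headers depend on the URL.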