There were two issues with version 0.4.0 of Spider, both caught by Henri Cook. These are now fixed in 0.4.1:
- As documented, you use IncludedInMemcached like this: require 'spider/included_in_memcached'.
- Sometimes HTTP redirects assume a base URL; this is now handled.
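To illustrate the redirect case, here is a minimal sketch using only Ruby's standard URI library. It is not Spider's actual code: `resolve_redirect` is a hypothetical helper, and the example.com URLs are made up. It shows how a relative Location header can be resolved against the URL that issued the redirect.

```ruby
require 'uri'

# Hypothetical helper (not part of Spider): resolve the Location header
# of a redirect against the URL that was originally requested.
def resolve_redirect(base_url, location)
  URI.join(base_url, location).to_s
end

base = 'http://example.com/a/b.html'

puts resolve_redirect(base, '/login')                  # path on the same host
puts resolve_redirect(base, 'c.html')                  # relative to /a/
puts resolve_redirect(base, 'http://other.example/x')  # already absolute
```

An absolute Location passes through unchanged, so the same helper can be applied to every redirect.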
3 Comments
Hey Mike,
Thanks for the spider. I am planning to use this for one of my projects. I was looking at the code and had two questions:
(I am using the latest version 0.4.3)
1. File: spider_instance.rb, Line: 209 Function: start!
generate_next_urls(a_url, response).each do |a_next_url|
@next_urls.push a_url => a_next_url
end
I was wondering why we have this instead of:
@next_urls.push a_url => generate_next_urls(a_url, response)
The former generates multiple array elements (one per next URL) for a single URL. The latter generates one element, and looking at the code suggests that we want the latter?
2. File: spider_instance.rb, Line: 284 Function: generate_next_urls
parsed_link = URI.parse(link)
if parsed_link.fragment == '#'
nil
I did not understand what this conditional is checking here.
Thanks once again for the crawler. It's been very useful.
Balpreet,
I'm not the current maintainer of Spider, so these are only guesses.
Your point about SpiderInstance#start! seems to be a bug in the code; you should email John Nagro about it.
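To make the difference concrete, here is a small sketch of the two queue shapes. It runs outside Spider: `generate_next_urls` is stubbed with made-up data, and `queue_per_link` and `queue_grouped` are hypothetical names used only for illustration.

```ruby
# Stub standing in for SpiderInstance#generate_next_urls (made-up data).
def generate_next_urls(a_url, _response)
  ["#{a_url}/a", "#{a_url}/b"]
end

# Shape produced by the loop in start!: one single-pair hash per link.
def queue_per_link(a_url)
  next_urls = []
  generate_next_urls(a_url, nil).each do |a_next_url|
    next_urls.push a_url => a_next_url
  end
  next_urls
end

# Shape produced by the suggested one-liner: one hash for the whole page.
def queue_grouped(a_url)
  next_urls = []
  next_urls.push a_url => generate_next_urls(a_url, nil)
  next_urls
end

p queue_per_link('http://example.com')
p queue_grouped('http://example.com')
```

Which shape is right depends on how the rest of start! consumes @next_urls, which this sketch does not cover.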
parsed_link.fragment == '#' seems to be checking whether the URL points to an anchor (fragment) within a page. However, if it does, it should recur on the non-fragment URL (I think). Again, you should email John about this.
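As a rough sketch of that idea, assuming only Ruby's standard URI library: `strip_fragment` below is a hypothetical helper, not something in Spider. It drops the fragment so the crawler could continue with the underlying page URL.

```ruby
require 'uri'

# Hypothetical helper (not part of Spider): return the link without its
# fragment, so "page#section" and "page" are treated as the same URL.
def strip_fragment(link)
  uri = URI.parse(link)
  uri.fragment = nil
  uri.to_s
end

link = 'http://example.com/page#section'

# URI#fragment returns the text after the '#', without the '#' itself.
puts URI.parse(link).fragment
puts strip_fragment(link)
```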
Glad to hear the crawler has been useful though! Happy hacking.
Thanks, Mike, for the response. I also think these might be minor bugs. I have emailed John about it.