Spider bugfix

There were two issues with version 0.4.0 of Spider, both caught by Henri Cook. These are now fixed in 0.4.1:

  • As documented, you use IncludedInMemcached like this: require 'spider/included_in_memcached'.
  • Sometimes HTTP redirects assume a base URL; this is now handled.
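
To illustrate the redirect fix, here is a minimal sketch of resolving a relative Location header against the URL that was requested. The helper name resolve_redirect is hypothetical and this is not necessarily how Spider 0.4.1 does it; it only shows the general technique using Ruby's standard URI.join.

```ruby
require 'uri'

# Hypothetical helper (not taken from Spider's source): resolve the value of
# a redirect's Location header against the URL that produced the redirect.
def resolve_redirect(request_url, location)
  # URI.join leaves absolute URLs alone and resolves relative ones
  # against the base URL.
  URI.join(request_url, location).to_s
end

puts resolve_redirect('http://example.com/a/b.html', '/login')
puts resolve_redirect('http://example.com/a/b.html', 'c.html')
puts resolve_redirect('http://example.com/a/b.html', 'http://other.example/x')
```

An absolute Location value passes through unchanged, so the same call can handle both kinds of redirect.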

3 Comments

  1. Posted October 27, 2008 at 10:06 pm | Permalink

    Hey Mike,
    Thanks for the spider. I am planning to use this for one of my projects. I was looking at the code and had two questions:
    (I am using the latest version 0.4.3)

    1. File: spider_instance.rb, Line: 209, Function: start!
    generate_next_urls(a_url, response).each do |a_next_url|
      @next_urls.push a_url => a_next_url
    end
    I was wondering why we have this, instead of:
    @next_urls.push a_url => generate_next_urls(a_url, response)
    The former would generate multiple array elements (one per next URL) for one URL. The latter would generate a single element, and looking at the code suggests that we want the latter?
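
For readers following along, the difference between the two forms can be sketched outside of Spider. The variable names below (found, per_link, grouped) are made up for illustration, and found merely stands in for whatever generate_next_urls returns.

```ruby
a_url = 'http://example.com/'
# Stand-in for the result of generate_next_urls(a_url, response).
found = ['http://example.com/a', 'http://example.com/b']

# First form: one single-pair hash is pushed per discovered link.
per_link = []
found.each do |a_next_url|
  per_link.push a_url => a_next_url
end

# Second form: one hash is pushed, mapping the URL to the whole list.
grouped = []
grouped.push a_url => found

p per_link
p grouped
```

Which shape is correct depends on how the rest of the crawler consumes @next_urls, which this sketch does not attempt to answer.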

    2. File: spider_instance.rb, Line: 284, Function: generate_next_urls
    parsed_link = URI.parse(link)
    if parsed_link.fragment == '#'
      nil
    I did not understand what this conditional is checking here?

    Thanks once again for the crawler. It's been very useful.

  2. Posted October 27, 2008 at 10:27 pm | Permalink

    Balpreet,

    I’m not the current maintainer of Spider, so I’ll take guesses.

    Your point about SpiderInstance#start! seems to be a bug in the code; you should email John Nagro about it.

    parsed_link.fragment == '#' seems to be checking whether the URL is for an anchor or fragment URL. However, if it is, it should recur on the non-fragment URL (I think). Again, you should email John about this.
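
As a rough sketch of what crawling the non-fragment URL could look like, the snippet below drops the fragment with Ruby's standard URI library. The helper name strip_fragment is hypothetical and is not part of Spider. Note that URI#fragment returns the text after the '#', without the '#' itself.

```ruby
require 'uri'

# Hypothetical helper (not part of Spider): return the link with any
# fragment removed, so "page#section" and "page" map to the same URL.
def strip_fragment(link)
  uri = URI.parse(link)
  uri.fragment = nil # drop everything after '#'
  uri.to_s
end

p URI.parse('http://example.com/page#section').fragment
p strip_fragment('http://example.com/page#section')
p strip_fragment('http://example.com/page')
```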

    Glad to hear the crawler has been useful though! Happy hacking.

  3. Posted October 28, 2008 at 12:18 am | Permalink

    Thanks Mike for the response. Even I feel that these might be minor bugs. I have mailed John about it.

