Uniquify an array of hashes in Ruby

Array#uniq is a handy Ruby method that produces an array of distinct elements. So, for example, [1,2,3,3,3].uniq produces [1,2,3].

However, this fails on an array of hashes. For example, [{}, {}].uniq actually produces [{},{}], which is not what I wanted recently.

It turns out that #uniq uses the methods #hash and #eql? to determine uniqueness. Hash#hash produces the object_id, and {}.eql?({}) produces false, so two hashes with the same contents never count as duplicates.

To solve this, I redefined #hash and #eql? for the hash instances before I put them in the array:


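h = {}
# Override #hash and #eql? on this one instance via its eigenclass.
class << h
  # Equal contents must give equal hash codes; summing the value hashes does that.
  def hash
    values.inject(0) { |acc, value| acc + value.hash }
  end

  # Compare by contents rather than identity.
  def eql?(a_hash)
    self == a_hash
  end
end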

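It should be noted that calling #dup on this hash will then screw things up, because #dup does not copy singleton methods and the copy falls back to the default #hash and #eql? (#clone does keep them). That’s common to all Ruby eigenclass fun.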
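Here is a small sketch of that #dup caveat, mostly as a sanity check. The variable names are made up for illustration. Newer Rubies already compare hashes by content, so this only shows what happens to the singleton methods, not to #uniq itself.

```ruby
h = { :a => 1 }
class << h
  def hash
    values.inject(0) { |acc, value| acc + value.hash }
  end

  def eql?(a_hash)
    self == a_hash
  end
end

# #dup copies the contents but not the eigenclass.
dup_methods = h.dup.singleton_methods.map { |m| m.to_s }.sort

# #clone copies the eigenclass too, so the overrides survive.
clone_methods = h.clone.singleton_methods.map { |m| m.to_s }.sort
```

So if you must copy one of these hashes, #clone is the safer bet.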

However, you could just use a proper class instead of a hash, ’cause this isn’t Perl.
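A minimal sketch of the proper-class route, assuming a made-up Store class with two fields (neither the class nor the data comes from any real code of mine). The point is only that #uniq behaves once #hash and #eql? agree with each other.

```ruby
# Hypothetical value class; the name and fields are invented for illustration.
class Store
  attr_reader :name, :zip

  def initialize(name, zip)
    @name = name
    @zip = zip
  end

  # Two stores are the same if their fields match.
  def ==(other)
    other.is_a?(Store) && name == other.name && zip == other.zip
  end
  alias_method :eql?, :==

  # Equal objects must return equal hash codes for #uniq to merge them.
  def hash
    [name, zip].hash
  end
end

stores = [
  Store.new('Shaws', '02215'),
  Store.new('Shaws', '02215'),
  Store.new('Shaws', '02134'),
]
unique = stores.uniq
```

Here unique holds two stores, one per distinct name and zip pair, and #dup works fine because the methods live on the class.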

Spider bugfix

There were two issues with version 0.4.0 of Spider, both caught by Henri Cook. These are now fixed in 0.4.1:

  • As documented, you use IncludedInMemcached like this: require 'spider/included_in_memcached'.
  • Sometimes HTTP redirects assume a base URL; this is now handled.

Spider with memcached

The problem with Spider has been that it can use all your memory. The reason is that the Web is a graph, and to avoid cycles Spider stores each URL it encounters. Since the Web is a really, really, really gigantic graph, you eventually run out of memory.

Now you can use memcached to use not only all the memory on one computer, but all the memory on many computers!


require 'spider'
require 'spider/included_in_memcached'
SERVERS = ['10.0.10.2:11211','10.0.10.3:11211','10.0.10.4:11211']
Spider.start_at('http://mike-burns.com/') do |s|
  s.check_already_seen_with IncludedInMemcached.new(SERVERS)
end
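To see why the store of seen URLs matters, here is a toy sketch of cycle avoidance over a tiny made-up link graph. This is not Spider's actual code; LINKS and crawl are hypothetical names, and a Set stands in for whatever remembers the URLs (memory, memcached, and so on).

```ruby
require 'set'

# Hypothetical link graph standing in for the Web: URL => outgoing links.
LINKS = {
  'http://example.com/'  => ['http://example.com/a', 'http://example.com/b'],
  'http://example.com/a' => ['http://example.com/'],  # cycle back to the start
  'http://example.com/b' => ['http://example.com/a'],
}

# Breadth-first walk that skips any URL it has already visited.
def crawl(start, links)
  seen = Set.new
  queue = [start]
  order = []
  until queue.empty?
    url = queue.shift
    next if seen.include?(url)  # without this check the cycle never ends
    seen << url
    order << url
    queue.concat(links.fetch(url, []))
  end
  order
end

visited = crawl('http://example.com/', LINKS)
```

The seen set grows by one entry per distinct URL, which is exactly the thing that eats your memory on the real Web.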

Also new in this version is a tutorial on the main Spider Web site. Yeah!

Proxied Spider

Aha: if you need to proxy your Spider calls, look no further than the HTTP Configuration gem.

I didn’t write this, and have yet to use it, but I think it goes like this:


http_conf = Net::HTTP::Configuration.new(:proxy_host => 'localhost', :proxy_port => 8881)
http_conf.apply do
  Spider.start_at('http://example.com/')
end

So next up will be a tutorial with stuff like this and other cool stuff, plus a way to use memcached with Spider.

Spider: API changes, setup and teardown, HTTP headers

The newest version of Spider, 0.3.0, is hitting your gem tree Real Soon Now. This release features:

  • Set the headers of an HTTP request. This can be used to set cookies, the user agent, and many other fine things.
  • setup and teardown handlers. Seems like a good place to set the headers if the headers are conditional on the URL.
  • Say :every, not :any. Makes more sense this way, I claim.
  • All the handlers take the same three arguments: the URL, the response, and (new) the calling URL.

Next on my list: proxies, a better way to store whether a URL has been seen, then a tutorial.

Get it the usual way: gem install spider

Spider bug fix release

John Nagro immediately reported errors with the Spider Ruby gem, so I’ve fixed them in 0.2.1. You should upgrade, especially if you want support for:

  • URLs without any path component (e.g. http://example.com?s=1).
  • HTTP redirects.
  • HTTPS.
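To illustrate the first bullet: Ruby's URI gives an empty path for such a URL, and an HTTP request needs a path that starts with a slash. The sketch below shows one way to cope, using only the standard library. It is an illustration of the problem, not necessarily how Spider handles it.

```ruby
require 'uri'

uri = URI.parse('http://example.com?s=1')
raw_path = uri.path  # empty string: the URL has no path component

# Substitute "/" when the path is missing, then re-attach the query.
path = raw_path.empty? ? '/' : raw_path
request_target = uri.query ? "#{path}?#{uri.query}" : path
```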

John also had some good ideas, so here is what is in the works:

  • The ability to construct a complete graph of every node found.
  • Defeat cycles using memcached instead of memory (closely related to the above bullet).
  • A Net::HTTP abstraction that makes it easier to e.g. use a proxy or replace HTTP.

Coder for hire

Ah, the thrill of working for a Web startup: one day you think you might get an obscene raise; the next you discover that the company is out of business.

And yet, I’m addicted to them.

If you or someone you know needs a programmer for their Web startup (or any other job, really), I would be delighted to fill that position.

Unsure? Check out why you should not hire Mike Burns, my résumé, and the rest of this blog.

Edit: This is in and around Boston, MA.

An updated way to spider the Web with Ruby

I’ve released version 0.2.0 of Spider. Everything has changed:

  • Use RSpec to ensure that it mostly works.
  • Use WEBrick to create a small test server for additional testing.
  • Completely re-do the API to prepare for future expansion.
  • Add the ability to apply each URL to a series of custom allowed?-like matchers.
  • BSD license.

The new API is kinda cool. From the README:


Spider.start_at('http://mike-burns.com/') do |s|
  # Limit the pages to just this domain.
  s.add_url_check do |a_url|
    a_url =~ %r{^http://mike-burns\.com.*}
  end

  # Handle 404s.
  s.on 404 do |a_url, err_code|
    puts "URL not found: #{a_url}"
  end

  # Handle 2xx.
  s.on :success do |a_url, code, headers, body|
    puts "body: #{body}"
  end

  # Handle everything.
  s.on :any do |a_url, resp|
    puts "URL returned anything: #{a_url} with this code #{resp.code}"
  end
end

I just uploaded it to Rubyforge, so give it a minute then gem update spider.

Uberman sleep minder

I’ve been working on the uberman sleep schedule for the past week-and-a-half. Here is the crontab I use to help remember to follow the schedule:


DISPLAY=:0.0
0 2,6,10,14,18,22 * * * /usr/local/bin/dbus-launch /usr/local/bin/notify-send -u low -t 5000 -i stock_timer 'Sleep in 30 minutes' 'Your sleep is coming up in a half hour'
25 2,6,10,14,18,22 * * * /usr/local/bin/dbus-launch /usr/local/bin/notify-send -u normal -t 5000 -i stock_timer 'Sleep in 5 minutes' 'You should head to bed now'
30 2,6,10,14,18,22 * * * /usr/local/bin/dbus-launch /usr/local/bin/notify-send -u critical -t 5000 -i stock_timer 'Sleep now' 'You should not be awake to read this'

This uses notify-send to display a DBus alert for about five seconds. Three such alerts exist: one 30 minutes before I sleep, one 5 minutes before, and one when I should be asleep.

Adjust for your own schedule as you see fit.

Yahoo! Local vs. Yahoo! Local’s API

Yahoo! Local is pretty awesome; it can tell you all the stores named “Shaw’s” near the zip code 02215, for example. Yahoo! Local’s API, however, gives different results depending on its mood: sometimes the Shaw’s in 02215 is returned second, sometimes fifth, sometimes not at all.

Try this URL instead: http://api.maps.yahoo.com/ajax/locsrch

It takes these parameters:

  • appid
  • ll
  • t
  • q
  • r
  • n

appid is your application ID. If you use the Yahoo! APIs, you have one.

ll is the latitude and longitude, with a | (pipe) between them. Encoded, the | is %7C; remember this if you are using Ruby’s URI.parse, for example.

t is 1; 1 means local search.

q is the query. It may only contain these characters, so delete the rest: [a-zA-Z0-9_.-]

r is the radius in miles.

n is the number of items to return. There’s probably an upper bound, somewhere around 20.
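Putting the parameters together, a URL builder might look roughly like this. The helper name, the app ID placeholder, and the coordinates are all made up; only the parameter names and the base URL come from the list above.

```ruby
require 'uri'

BASE_URL = 'http://api.maps.yahoo.com/ajax/locsrch'

# Hypothetical helper: builds the search URL from the six parameters.
def local_search_url(appid, lat, lon, query, radius, count)
  # Delete every character the q parameter does not allow.
  clean_query = query.gsub(/[^a-zA-Z0-9_.-]/, '')
  params = [
    ['appid', appid],
    ['ll', "#{lat}|#{lon}"],  # encode_www_form turns the | into %7C
    ['t', 1],                 # 1 means local search
    ['q', clean_query],
    ['r', radius],
    ['n', count],
  ]
  "#{BASE_URL}?#{URI.encode_www_form(params)}"
end

# Placeholder app ID and made-up coordinates.
url = local_search_url('YOUR_APP_ID', '42.35', '-71.10', "Shaw's", 5, 10)
```

Building the query string with URI.encode_www_form means the pipe is already encoded by the time the URL reaches URI.parse.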

This produces JavaScript which can be evaled. In Ruby, you can do this instead (let javascript be the result of calling this URL):


require 'json'

if javascript =~ /\((.*),.*,.*\);/
  json = $1
  results = JSON.parse(json)
  items = results['ITEMS']
end

Now items is an array of hashes with at least these keys as strings: CITY, LATITUDE, LONGITUDE, STATE, ADDRESS, TITLE. No ZIP, though.