Quick and Dirty URL Validation

I’ve come across a few different ways to validate URLs in my day, but they all seem a bit more complicated than necessary. Perhaps I’ll see the wisdom of these techniques soon, but for now it seems like there’s an easy solution to the problem:

require 'net/http'

class Link < ActiveRecord::Base

  attr_accessible :url
  validate :validate_url

private

  def validate_url
    errors.add(:url) unless %w(200 301 302).include?(Link.status_code(self.url))
  end

  # Issues a HEAD request and returns the response code as a string
  # (e.g. "200"), or nil if anything at all goes wrong.
  def self.status_code(url)
    match = url.match(%r{https?://([^/]+)(.*)})
    path = match[2].blank? ? '/' : match[2]
    Net::HTTP.start(match[1]) { |http| http.head(path).code }
  rescue
    nil
  end

end
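
In the console, that works out to something like this (a rough sketch assuming the Link model above, with example.com and sdf38830.com standing in for a reachable URL and a bogus one; exact console output varies by Rails version):

link = Link.new(:url => 'http://example.com/')
link.valid?   # => true, as long as example.com answers the HEAD request with 200/301/302

bogus = Link.new(:url => 'http://sdf38830.com')
bogus.valid?  # => false -- the HEAD request fails, so status_code returns nil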

Et Voilà.

Published by Trevor Turk


22 thoughts on “Quick and Dirty URL Validation”

  1. Abusable? Sure, HEAD isn't *supposed* to do anything, but I'll bet money there are sites out there with URLs you shouldn't be blindly hitting. I guess a simple timeout wouldn't be the end of the world, but it could be moderately annoying. Also, I could make your servers show connections to TERRORISTS or kiddie porn servers, if I knew where to connect to them (which I don't).

    All in all, I'm not sure you want to leave your server blindly connecting to another one for a check.

  2. My last attempt got swallowed by your spam filter, I think.

    Not a problem of your server getting hacked, but more along these lines:

    http:/alqueda.com
    http:/www.kiddiepr0n.com

    http:/www.wellsfargo (thanks for the loan of your fat pipes for my DDOS)

    http:/www.vulnerableserver.com/troublesome_url (at least they got your IP as the one that brought them down).

    Basically, while I think HEADs will be mostly harmless, this still does leave you as an anonymous proxy in at least one way, for people who may know what to actually exploit (unlike me).

  3. Yeah, I think the HEAD request is harmless. Maybe I'm wrong, though.

    This technique I'm talking about isn't a method for preventing spam or anything – it's just a quick way to validate that URLs are accessible (e.g. not http:/sdf38830.com or something nonsensical like that).

  4. Trevor, you may be entirely right, I do not know. My only point is that counting on everyone else on the internet to honor "this MUST NOT have side effects" (or whatever the language is) causes things like the Google Accelerator fiasco, where it deleted lots of people's info.

  5. This does allow for a really simple denial-of-service attack on any site using it. Basically, you write a very simple server (you can do it in Sinatra if you like) that sleeps for a really long time on a HEAD request to a specific URL (there's a sketch of such a server after the comments). Then you just submit the URL multiple times to the form validating this URL. Repeat until you tie up every Mongrel or cause Passenger to spawn enough Apache processes to thrash. Of course, you could add a timeout on the check, but I could also use Mechanize…😉

  6. I like tenderlove's refinement of sending back a very large response when you finally do respond, to help chew up memory, too (is there somewhere you can cram base64-encoded video into an HTTP response header, maybe, to add in the illegal/copyrighted-content problem as well?).

  7. Hmm… yeah, I guess you could use Timeout to help, but it does seem like making requests of arbitrary websites may have unintended consequences.

    I'm not sure what you could do about receiving large header responses. I suppose a timeout could help there, too (there's a rough sketch along those lines after the comments).

    Here's where I would start looking in terms of doing timeout stuff:

    http://www.ruby-doc.org/core/classes/Timeout.html
    http://www.slashdotdash.net/2008/02/15/ruby-tidbi

    Although I came across these not too long ago:

    http://blog.segment7.net/articles/2006/04/11/care
    http://blog.headius.com/2008/02/rubys-threadraise

    So, I'm not sure if timeout is safe to use or not🙂

  8. This is the same basic concept as the older http_url_validation_improved plugin. It's not on github, so perhaps that is why you missed it.

    We use it in Kete (http://kete.net.nz) to achieve what you are after. It does look like it could benefit from having that URI.regexp refactoring added to it.

    One nice thing is that it can be configured to check for allowed content types. It also gives more fine-grained feedback when validation fails.

  9. If I wanted to break this, I think I'd write a custom "webserver" that responded to a HEAD request with an unending stream of headers. Server never closes the connection, just keeps sending you headers until the client finally gives up, assuming it ever does.
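
For the curious, the "unending stream of headers" server from the last comment really is only a few lines of Ruby. A rough sketch (the port number and header names here are made up):

require 'socket'

# Hypothetical sketch of a server that never stops sending headers:
# answer any request with a valid status line, then emit header lines
# until the client finally gives up and disconnects.
server = TCPServer.new(9393)
loop do
  Thread.new(server.accept) do |client|
    begin
      client.gets                      # read the request line, ignore the rest
      client.write "HTTP/1.1 200 OK\r\n"
      i = 0
      loop { client.write "X-Padding-#{i += 1}: #{'a' * 1024}\r\n" }
    rescue
      client.close                     # client hung up; clean up the socket
    end
  end
end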
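
Likewise, the slow responder described in comment 5 is about as small as a Sinatra app gets. A hypothetical sketch (the /troublesome path is made up; a Sinatra get route also answers HEAD requests):

require 'sinatra'

# Hypothetical slow server: any HEAD (or GET) to /troublesome hangs for
# five minutes before replying, so every validation attempt against this
# URL ties up one of the target app's workers for the duration.
get '/troublesome' do
  sleep 300
  'zzz'
end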
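
And on the timeout question in comment 7: rather than wrapping the whole call in Timeout, Net::HTTP has its own open_timeout and read_timeout settings. A sketch of a hardened status_code along those lines (the five-second limits are arbitrary, and this is just one way to do it, not what the post above uses):

require 'net/http'

class Link < ActiveRecord::Base

  # Hypothetical variant of status_code that bounds how long we wait to
  # connect and how long we wait for each read of the response, so a slow
  # (or deliberately hostile) server can't hold a worker hostage for long.
  # Returns the response code as a string, or nil on any failure.
  def self.status_code(url)
    match = url.match(%r{https?://([^/]+)(.*)})
    path = match[2].blank? ? '/' : match[2]
    http = Net::HTTP.new(match[1])
    http.open_timeout = 5   # seconds allowed to establish the connection
    http.read_timeout = 5   # seconds allowed for each read of the response
    http.start { |conn| conn.head(path).code }
  rescue
    nil
  end

end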
