Quick and Dirty URL Validation
by Trevor Turk
I’ve come across a few different ways to validate URLs in my day, but they all seem a bit more complicated than necessary. Perhaps I’ll see the wisdom of these techniques soon, but for now it seems like there’s an easy solution to the problem:
require 'net/http'

class Link < ActiveRecord::Base
  attr_accessible :url
  validate :validate_url

  private

  # Valid only if a HEAD request returns a success or redirect status.
  def validate_url
    errors.add(:url) unless %w(200 301 302).include?(Link.status_code(self.url))
  end

  # Returns the HEAD status code as a string, or nil if the URL is malformed or unreachable.
  def self.status_code(url)
    regexp = url.match(/https?:\/\/([^\/]+)(.*)/)
    path = regexp[2].blank? ? '/' : regexp[2]
    Net::HTTP.start(regexp[1]) {|http| http.head(path).code}
  rescue
    nil
  end
end
Et voilà.
Abusable? Sure HEAD isn't *supposed* to do anything, but I'll bet money there are sites out there that have URLs you shouldn't be blindly hitting. I guess a simple timeout wouldn't be the end of the world, but could be moderately annoying. Also, I could make your servers show connections to TERRORISTS or kiddie porn servers, if I knew where to connect to them (which I don't).
All in all, I'm not sure you want to leave your server connecting to another one for a blind check.
Err… I'm not sure how exactly this could be a security problem. Perhaps an example would help?
My last attempt got swallowed by your spam filter, I think.
Not a problem for your server getting hacked, but more in this line
http:/alqueda.com
http:/www.kiddiepr0n.com
http:/www.wellsfargo (thanks for the loan of your fat pipes for my DDOS)
http:/www.vulnerableserver.com/troublesome_url (at least they got your IP as the one that brought them down).
Basically, while I think HEADs will be mostly harmless, this still does leave you as an anonymous proxy in at least one way, for people who may know what to actually exploit (unlike me).
Umm, I didn't mean for those to actually be turned into links, sorry.
Well that was completely bizarre. Try adding more foil Tim. You need more foil.
Yeah, I think the HEAD request is harmless. Maybe I'm wrong, though.
This technique I'm talking about isn't a method for preventing spam or anything – it's just a quick way to validate URLs are accessible (e.g. not http:/sdf38830.com or something nonsensical like that).
require 'uri'
URI.parse(url).host
Ah I forgot to add the merely annoying one
http:/serverthattimesout.com
Trevor, you may be entirely right, I do not know. My only point is that counting on everyone else on the internet to honor "this MUST NOT have side effects", or whatever the language is, causes things like the fiasco where the Google Accelerator deleted lots of people's info.
This does allow for a really simple denial-of-service attack on any site using it. Basically, you write a very simple server (you can do it in Sinatra if you like) that sleeps for a really long time on a HEAD request to a specific URL. Then you just submit the URL multiple times to the form validating this URL. Repeat until you preoccupy every Mongrel or cause Passenger to spawn enough Apache processes to thrash. Of course, you could add a timeout on the check, but I could also use Mechanize…
I like tenderlove's refinement of sending back a very large response when you finally do respond, to help chew up memory, too. (Is there somewhere you can cram base64 encoded video into an HTTP response header, maybe, to additionally add in the illegal/copyrighted content problems?)
Hmm… yeah, I guess you could use Timeout to help, but it does seem like making requests of arbitrary websites may have unintended consequences.
I'm not sure what you could do about receiving large header responses. I suppose a timeout could help there, too.
Here's where I would start looking in terms of doing timeout stuff:
http://www.ruby-doc.org/core/classes/Timeout.html http://www.slashdotdash.net/2008/02/15/ruby-tidbi…
Although I came across these not too long ago:
http://blog.segment7.net/articles/2006/04/11/care… http://blog.headius.com/2008/02/rubys-threadraise…
So, I'm not sure if Timeout is safe to use or not.
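One way to bound the check without the Timeout module is to use the connection and read timeouts that Net::HTTP itself provides. The following is only a sketch under assumptions: head_status is a hypothetical helper name, not part of the original post, and the two-second defaults are arbitrary. Note that read_timeout limits each individual read rather than the total time, so it would not stop a server that keeps trickling data, and Net::HTTP may retry a timed-out HEAD once.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not from the post): send a HEAD request with bounded waits.
# Returns the status code as a String, or nil on any failure or timeout.
def head_status(url, open_timeout: 2, read_timeout: 2)
  uri = URI.parse(url)
  # Only http/https URLs with a host are worth connecting to.
  return nil unless uri.is_a?(URI::HTTP) && uri.host

  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == 'https',
                  open_timeout: open_timeout,   # seconds to wait for the connection
                  read_timeout: read_timeout) do |http| # seconds to wait for each read
    http.head(uri.request_uri).code
  end
rescue StandardError
  # Covers malformed URLs, DNS failures, refused connections and timeouts.
  nil
end
```

In the validation above, this could stand in for Link.status_code, for example `%w(200 301 302).include?(head_status(url))`.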
As an aside, found this easy way to do basic URL format validation:
http://www.ruby-doc.org/core/classes/URI.html#M00…
require 'uri'
validates_format_of :uri, :with => URI.regexp
Very nice – no need for a custom regexp
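For a format-only check that makes no network request, another option is to parse the string and inspect the result instead of matching a regexp. This is a rough sketch, assuming only http and https URLs should pass; http_url? is a hypothetical helper name, not something from the post or from the URI library. Newer Ruby versions also mark URI.regexp as obsolete, which is one more reason to consider parsing.

```ruby
require 'uri'

# Hypothetical helper: true only for absolute http/https URLs that have a host.
def http_url?(value)
  uri = URI.parse(value.to_s)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
  uri.is_a?(URI::HTTP) && !uri.host.nil? && !uri.host.empty?
rescue URI::InvalidURIError
  # Raised for strings that cannot be parsed as a URI at all.
  false
end
```

In a model this could be called from a custom validation method, in the same way validate_url is wired up in the post.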
This is the same basic concept of the older http_url_validation_improved plugin. It's not on github, so perhaps that is why you missed it.
We use it in Kete (http://kete.net.nz) to achieve what you are after. It does look like it could use that URI.regexp refactoring added to it.
One nice thing is that it can be configured to check for allowed content types. It also gives more fine-grained feedback when validation fails.
Sorry, meant to add link to the source. It can be found here:
https://modzer0.cs.uaf.edu/repos/hank/code/http_u…
Thanks, Walter. That looks like a good plugin to consider if you're worried about overkill but need more than quick and dirty.
Btw, regexp is wrong. You should escape slashes:
regexp = url.match(/https?:\/\/([^\/]+)(.*)/)
Ah yes. Fixed, thanks!
This is my favorite URL format validator right now:
http://github.com/henrik/validates_url_format_of/…
URI.regexp didn't catch a lot of invalid stuff I tried to throw at it.
There is still one slash left unescaped.
Hmm… yes… I think I got it now? I pasted it from running code, so I hope so
If I wanted to break this, I think I'd write a custom "webserver" that responded to a HEAD request with an unending stream of headers. Server never closes the connection, just keeps sending you headers until the client finally gives up, assuming it ever does.