Ruby and XML not-so-simple?

Posted by Nick Sieger Thu, 02 Nov 2006 02:12:00 GMT

Update: Koz already fixed the issue in trunk, and the changes are also going into the 1.2 release as well. Thanks!

Man, I think I’ve been reading too much Sam Ruby lately (ok, that was a year ago, but not much has changed). You have to admit, though, that XML handling in Ruby is one of those things that just doesn’t feel quite right. REXML is pretty much the standard API for Ruby, yet it suffers from two showstoppers in my opinion:

  • In Ruby 1.8.4 it still has the glaring hole Sam mentioned last year with well-formedness. (No exception raised below!)

    irb(main):001:0> require 'rexml/document'
    => true
    irb(main):002:0> d = '<div>at&t'
    => <UNDEFINED> ... </>
    irb(main):003:0> d.root
    => <div> ... </>
    irb(main):004:0> d.root.text
    => "at&t"
  • The REXML::Text#to_s method violates the principle of least surprise. In just about every other XML parser written, when you ask a text node for its contents, it returns you the value with entities resolved. Not so Text#to_s. You have to call Text#value instead. Unfortunately, this would be difficult to reverse in future versions of REXML without breaking existing apps.

    irb(main):001:0> require 'rexml/document'
    => true
    irb(main):002:0> t ='at&t')
    => "at&t"
    irb(main):003:0> t.to_s
    => "at&amp;t"
    irb(main):004:0> t.value
    => "at&t"

This second problem manifests itself in subtle ways. If you’re calling Element#text (which is probably the most common way), you’re fine, because it implicitly does self.texts.first.value under the hood. But if you want to make sure you’re grabbing all the text content, you might be inclined to write element.texts.join('') to concatenate them together. But this method bypasses the value method and instead uses to_s, leaving you with unresolved entities.

It turns out this problem is exhibited in the version of XmlSimple now included with Edge Rails as of rev 4453. So if you’re living on the edge using the newly minted ActiveResource fetching XML from remote resources like a champion, you just got benched as soon as you tried to fetch XML that had normalized entities inside.

XmlSimple version 1.0.9 has a partial fix for this issue, but I submitted another patch to Maik Schmidt for review that he subsequently released as 1.0.10. I’ve attached the 1.0.10 version to ticket 6532 in hopes that it will be patched in Rails soon.

Posted in ,  | Tags ,  | no comments | no trackbacks



Use the following link to trackback from your own site: