Nick Sieger: Ruby and XML not-so-simple?
<p>Nick Sieger, 2006-11-02 (updated 2007-08-31)</p>
<p><em>Update: <a href="http://www.koziarski.net/">Koz</a> already <a href="http://dev.rubyonrails.org/changeset/5414">fixed the issue in trunk</a>, and the changes are also <a href="http://dev.rubyonrails.org/changeset/5415">going into the 1.2 release</a>. Thanks!</em></p>
<p>Man, I think I’ve been reading too much <a href="http://intertwingly.net/blog/2005/09/14/Last-Line-of-Defense/">Sam Ruby</a> lately (ok, that was a year ago, but not much has changed). You have to admit, though, that XML handling in Ruby is one of those things that just doesn’t feel quite right. REXML is pretty much the standard XML API for Ruby, yet in my opinion it suffers from two showstoppers:</p>
<ul>
<li><p>In Ruby 1.8.4 it still has the glaring hole Sam mentioned last year with well-formedness. (No exception raised below!)</p>
<pre><code>irb(main):001:0> require 'rexml/document'
=> true
irb(main):002:0> d = REXML::Document.new '<div>at&t'
=> <UNDEFINED> ... </>
irb(main):003:0> d.root
=> <div> ... </>
irb(main):004:0> d.root.text
=> "at&t"
</code></pre></li>
<li><p>The <code>REXML::Text#to_s</code> method violates the principle of least surprise. In just about every other XML parser written, when you ask a text node for its contents, it returns the value with entities resolved. Not so <code>Text#to_s</code>. You have to call <code>Text#value</code> instead. Unfortunately, this would be difficult to reverse in future versions of REXML without breaking existing apps.</p>
<pre><code>irb(main):001:0> require 'rexml/document'
=> true
irb(main):002:0> t = REXML::Text.new('at&t')
=> "at&t"
irb(main):003:0> t.to_s
=> "at&amp;t"
irb(main):004:0> t.value
=> "at&t"
</code></pre></li>
</ul>
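<p>For the first point, a minimal sketch of a well-formedness guard, assuming a REXML release that raises <code>REXML::ParseException</code> on malformed input. The helper name <code>well_formed?</code> is hypothetical and not part of REXML. On a REXML that behaves like the 1.8.4 session above, the second call would report the broken document as fine, so the result depends on the installed version.</p>

```ruby
require 'rexml/document'

# Hypothetical helper (not part of REXML): returns true when REXML parses
# the string without complaint, false when it raises a ParseException.
def well_formed?(xml)
  REXML::Document.new(xml)
  true
rescue REXML::ParseException
  false
end

puts well_formed?('<div>at&amp;t</div>')  # closed tag, escaped ampersand
puts well_formed?('<div>at&t')            # unclosed tag, bare ampersand
```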
<p>This second problem manifests itself in subtle ways. If you’re calling <code>Element#text</code> (which is probably the most common way), you’re fine, because it implicitly does <code>self.texts.first.value</code> under the hood. But if you want to make sure you’re grabbing all the text content, you might be inclined to write <code>element.texts.join('')</code> to concatenate the text nodes. The catch is that <code>join</code> converts each node with <code>to_s</code> rather than <code>value</code>, leaving you with unresolved entities.</p>
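<p>A small sketch of the difference, using a made-up sample document. Mapping each text node through <code>value</code> before joining is one way to get resolved entities. Note that <code>texts</code> only returns the element's direct text children, not text nested in child elements.</p>

```ruby
require 'rexml/document'

# Sample document invented for illustration: two text nodes split by <br/>.
doc = REXML::Document.new('<div>at&amp;t<br/>R&amp;D</div>')
texts = doc.root.texts

# Array#join converts each REXML::Text with to_s, so entities stay escaped.
escaped = texts.join('')

# Calling value on each node first resolves the entities before joining.
resolved = texts.map(&:value).join('')

puts escaped   # => at&amp;tR&amp;D
puts resolved  # => at&tR&D
```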
<p>It turns out this problem is exhibited in the version of <a href="http://xml-simple.rubyforge.org/">XmlSimple</a> now included with <a href="http://dev.rubyonrails.org/browser/trunk/activesupport/lib/active_support/vendor/xml_simple.rb?rev=4453&format=txt">Edge Rails as of rev 4453</a>. So if you’re living on the edge, using the newly minted ActiveResource to fetch XML from remote resources like a champion, you got benched as soon as you tried to fetch XML that had escaped entities inside.</p>
<p><a href="http://rubyforge.org/frs/download.php/11388/xml-simple-1.0.9.gem">XmlSimple version 1.0.9</a> has a partial fix for this issue, but I submitted <a href="/files/test_xmlsimple.rb">another patch</a> to <a href="http://www.maik-schmidt.de/">Maik Schmidt</a> for review that he subsequently released as 1.0.10. I’ve attached the 1.0.10 version to ticket <a href="http://dev.rubyonrails.org/ticket/6532">6532</a> in hopes that it will be patched in Rails soon.</p>