Nick Sieger: Ruby and XML not-so-simple? http://blog.nicksieger.com/articles/2006/11/02/ruby-and-xml-not-so-simple <p><em>Update: <a href="http://www.koziarski.net/">Koz</a> already <a href="http://dev.rubyonrails.org/changeset/5414">fixed the issue in trunk</a>, and the changes are also <a href="http://dev.rubyonrails.org/changeset/5415">going into the 1.2 release</a>. Thanks!</em></p> <p>Man, I think I&#8217;ve been reading too much <a href="http://intertwingly.net/blog/2005/09/14/Last-Line-of-Defense/">Sam Ruby</a> lately (OK, that was a year ago, but not much has changed). You have to admit, though, that XML handling in Ruby is one of those things that just doesn&#8217;t feel quite right. REXML is pretty much the standard XML API for Ruby, yet in my opinion it suffers from two showstoppers:</p> <ul> <li><p>In Ruby 1.8.4 it still has the glaring well-formedness hole Sam mentioned last year. (No exception is raised below, even though the input has a bare ampersand and an unclosed tag!)</p> <pre><code>irb(main):001:0&gt; require 'rexml/document'
=&gt; true
irb(main):002:0&gt; d = REXML::Document.new '&lt;div&gt;at&amp;t'
=&gt; &lt;UNDEFINED&gt; ... &lt;/&gt;
irb(main):003:0&gt; d.root
=&gt; &lt;div&gt; ... &lt;/&gt;
irb(main):004:0&gt; d.root.text
=&gt; "at&amp;t"
</code></pre></li> <li><p>The <code>REXML::Text#to_s</code> method violates the principle of least surprise. In just about every other XML parser written, when you ask a text node for its contents, it returns the value with entities resolved. Not so <code>Text#to_s</code>. You have to call <code>Text#value</code> instead. 
Unfortunately, this would be difficult to reverse in future versions of REXML without breaking existing apps.</p> <pre><code>irb(main):001:0&gt; require 'rexml/document'
=&gt; true
irb(main):002:0&gt; t = REXML::Text.new('at&amp;t')
=&gt; "at&amp;t"
irb(main):003:0&gt; t.to_s
=&gt; "at&amp;amp;t"
irb(main):004:0&gt; t.value
=&gt; "at&amp;t"
</code></pre></li> </ul> <p>This second problem manifests itself in subtle ways. If you&#8217;re calling <code>Element#text</code> (probably the most common way), you&#8217;re fine, because it implicitly does <code>self.texts.first.value</code> under the hood. But if you want to make sure you&#8217;re grabbing all the text content, you might be inclined to write <code>element.texts.join('')</code> to concatenate the nodes. That bypasses <code>value</code>: <code>join</code> converts each text node with <code>to_s</code>, leaving you with unresolved entities.</p> <p>It turns out this problem is exhibited in the version of <a href="http://xml-simple.rubyforge.org/">XmlSimple</a> now included with <a href="http://dev.rubyonrails.org/browser/trunk/activesupport/lib/active_support/vendor/xml_simple.rb?rev=4453&amp;format=txt">Edge Rails as of rev 4453</a>. So if you&#8217;re living on the edge, using the newly minted ActiveResource to fetch XML from remote resources like a champion, you got benched as soon as you tried to fetch XML that had entities inside.</p> <p><a href="http://rubyforge.org/frs/download.php/11388/xml-simple-1.0.9.gem">XmlSimple version 1.0.9</a> has a partial fix for this issue, but I submitted <a href="/files/test_xmlsimple.rb">another patch</a> to <a href="http://www.maik-schmidt.de/">Maik Schmidt</a> for review, which he subsequently released as 1.0.10. 
I&#8217;ve attached the 1.0.10 version to ticket <a href="http://dev.rubyonrails.org/ticket/6532">6532</a> in hopes that it will be patched in Rails soon.</p> Thu, 02 Nov 2006 02:12:00 +0000 urn:uuid:d060e313-e46a-49dc-9093-fcb0f2ae7783 Nick Sieger http://blog.nicksieger.com/articles/2006/11/02/ruby-and-xml-not-so-simple ruby rails ruby xml http://blog.nicksieger.com/articles/trackback/142
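<p>For anyone who wants to see the <code>texts.join</code> pitfall from the post in isolation, here is a minimal sketch. The sample markup and the <code>resolved_text</code> helper are my own illustrations, not part of REXML or XmlSimple, and this is not necessarily how the XmlSimple patch does it; it only assumes that <code>Element#texts</code>, <code>Text#to_s</code>, and <code>Text#value</code> behave as described above.</p>

```ruby
require 'rexml/document'

# Hypothetical helper (not part of REXML): concatenate every direct text
# child of an element, resolving entities via Text#value on each node.
def resolved_text(element)
  element.texts.map(&:value).join
end

# Illustrative input with two text nodes separated by a child element.
doc  = REXML::Document.new('<p>AT&amp;T <b>and</b> R&amp;D</p>')
para = doc.root

# Array#join converts each Text node with to_s, so entities stay escaped.
joined_raw = para.texts.join('')

# Mapping through Text#value first resolves the entities before joining.
joined_resolved = resolved_text(para)

puts joined_raw
puts joined_resolved
```

<p>Note that both variants only collect the element&#8217;s direct text children; text inside nested elements (the <code>&lt;b&gt;</code> here) is not included either way.</p>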