<?xml version="1.0" encoding="UTF-8"?>
<feed xml:lang="en-US" xmlns="http://www.w3.org/2005/Atom">
  <title>Nick Sieger: Ruby and XML not-so-simple?</title>
  <id>tag:blog.nicksieger.com,2005:Typo</id>
  <generator uri="http://www.typosphere.org" version="4.0">Typo</generator>
  <link rel="self" type="application/atom+xml" href="http://blog.nicksieger.com/xml/atom10/article/142/feed.xml"/>
  <link rel="alternate" type="text/html" href="http://blog.nicksieger.com/articles/2006/11/02/ruby-and-xml-not-so-simple"/>
  <updated>2007-08-31T16:41:36+00:00</updated>
  <entry>
    <author>
      <name>Nick Sieger</name>
    </author>
    <id>urn:uuid:d060e313-e46a-49dc-9093-fcb0f2ae7783</id>
    <published>2006-11-02T02:12:00+00:00</published>
    <updated>2007-08-31T16:41:36+00:00</updated>
    <title>Ruby and XML not-so-simple?</title>
    <link rel="alternate" type="text/html" href="http://blog.nicksieger.com/articles/2006/11/02/ruby-and-xml-not-so-simple"/>
    <category term="ruby" label="ruby" scheme="http://blog.nicksieger.com/articles/category/ruby"/>
    <category term="rails" label="rails" scheme="http://blog.nicksieger.com/articles/category/rails"/>
    <category term="ruby" scheme="http://blog.nicksieger.com/articles/tag/ruby"/>
    <category term="xml" scheme="http://blog.nicksieger.com/articles/tag/xml"/>
    <content type="html">&lt;p&gt;&lt;em&gt;Update: &lt;a href="http://www.koziarski.net/"&gt;Koz&lt;/a&gt; already &lt;a href="http://dev.rubyonrails.org/changeset/5414"&gt;fixed the issue in trunk&lt;/a&gt;, and the changes are also &lt;a href="http://dev.rubyonrails.org/changeset/5415"&gt;going into the 1.2 release&lt;/a&gt;.  Thanks!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Man, I think I&amp;#8217;ve been reading too much &lt;a href="http://intertwingly.net/blog/2005/09/14/Last-Line-of-Defense/"&gt;Sam Ruby&lt;/a&gt; lately (ok, that was a year ago, but not much has changed). You have to admit, though, that XML handling in Ruby is one of those things that just doesn&amp;#8217;t feel quite right.  REXML is pretty much the standard API for Ruby, yet it suffers from two showstoppers in my opinion:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;In Ruby 1.8.4 it still has the glaring well-formedness hole Sam mentioned last year: input with an unclosed tag and a bare ampersand parses without complaint.  (No exception raised below!)&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;irb(main):001:0&amp;gt; require 'rexml/document'
=&amp;gt; true
irb(main):002:0&amp;gt; d = REXML::Document.new '&amp;lt;div&amp;gt;at&amp;amp;t'
=&amp;gt; &amp;lt;UNDEFINED&amp;gt; ... &amp;lt;/&amp;gt;
irb(main):003:0&amp;gt; d.root
=&amp;gt; &amp;lt;div&amp;gt; ... &amp;lt;/&amp;gt;
irb(main):004:0&amp;gt; d.root.text
=&amp;gt; "at&amp;amp;t"
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;REXML::Text#to_s&lt;/code&gt; method violates the principle of least surprise. In just about every other XML parser, when you ask a
text node for its contents, you get back the value with entities resolved. Not so with &lt;code&gt;Text#to_s&lt;/code&gt;; you have to call &lt;code&gt;Text#value&lt;/code&gt;
instead.  Unfortunately, this would be difficult to change in future versions of REXML without breaking existing apps.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;irb(main):001:0&amp;gt; require 'rexml/document'
=&amp;gt; true
irb(main):002:0&amp;gt; t = REXML::Text.new('at&amp;amp;t')
=&amp;gt; "at&amp;amp;t"
irb(main):003:0&amp;gt; t.to_s
=&amp;gt; "at&amp;amp;amp;t"
irb(main):004:0&amp;gt; t.value
=&amp;gt; "at&amp;amp;t"
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This second problem manifests itself in subtle ways.  If you&amp;#8217;re calling &lt;code&gt;Element#text&lt;/code&gt; (probably the most common way to read text content), you&amp;#8217;re fine, because it implicitly does &lt;code&gt;self.texts.first.value&lt;/code&gt; under the hood.  But if you want to make sure you&amp;#8217;re grabbing all the text content, you might be inclined to write &lt;code&gt;element.texts.join('')&lt;/code&gt; to concatenate the text nodes.  That approach bypasses the &lt;code&gt;value&lt;/code&gt; method: &lt;code&gt;join&lt;/code&gt; calls &lt;code&gt;to_s&lt;/code&gt; on each node, leaving you with unresolved entities.&lt;/p&gt;
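
&lt;p&gt;To make the difference concrete, here&amp;#8217;s a minimal sketch of my own (it is not code from XmlSimple or Rails). The &lt;code&gt;company&lt;/code&gt; element and its contents are made up for illustration, the ampersand is written as &lt;code&gt;"\u0026"&lt;/code&gt;, which assumes Ruby 1.9 or later, and on recent Rubies it assumes the &lt;code&gt;rexml&lt;/code&gt; gem is installed:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;require 'rexml/document'

# Hypothetical element, built by hand as mixed content:
# a text node, an empty child element, then another text node.
# "\u0026" is the ampersand character.
company = REXML::Element.new('company')
company.add_text("AT\u0026T")
company.add_element('br')
company.add_text(' Wireless')

# Array#join calls to_s on each text node, so the entity stays escaped.
joined_raw = company.texts.join('')

# Going through Text#value first resolves the entity in every node.
joined_value = company.texts.map { |t| t.value }.join('')

puts joined_raw
puts joined_value
# Element#text only looks at the first text node, via value.
puts company.text
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run as a script, the first line printed should still contain the escaped entity, while the second and third should show a plain ampersand.  Mapping over &lt;code&gt;value&lt;/code&gt; before joining is one way to sidestep the problem in your own code; it is not necessarily how the XmlSimple patch does it.&lt;/p&gt;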

&lt;p&gt;It turns out this problem shows up in the version of &lt;a href="http://xml-simple.rubyforge.org/"&gt;XmlSimple&lt;/a&gt; now included with &lt;a href="http://dev.rubyonrails.org/browser/trunk/activesupport/lib/active_support/vendor/xml_simple.rb?rev=4453&amp;amp;format=txt"&gt;Edge Rails as of rev 4453&lt;/a&gt;. So if
you&amp;#8217;re living on the edge, using the newly minted ActiveResource to fetch XML from remote resources like a champion, you get benched
as soon as you try to fetch XML that has escaped entities inside.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://rubyforge.org/frs/download.php/11388/xml-simple-1.0.9.gem"&gt;XmlSimple version 1.0.9&lt;/a&gt; has a partial fix for this issue, so I submitted &lt;a href="/files/test_xmlsimple.rb"&gt;another patch&lt;/a&gt; to &lt;a href="http://www.maik-schmidt.de/"&gt;Maik Schmidt&lt;/a&gt; for review, which he subsequently released as 1.0.10.  I&amp;#8217;ve attached the 1.0.10 version to ticket &lt;a href="http://dev.rubyonrails.org/ticket/6532"&gt;6532&lt;/a&gt; in the hope that it will make it into Rails soon.&lt;/p&gt;</content>
  </entry>
</feed>
