<?xml version="1.0" encoding="UTF-8"?>
<feed xml:lang="en-US" xmlns="http://www.w3.org/2005/Atom">
  <title>Nick Sieger: Tag xml</title>
  <id>tag:blog.nicksieger.com,2005:Typo</id>
  <generator uri="http://www.typosphere.org" version="4.0">Typo</generator>
  <link rel="self" type="application/atom+xml" href="http://blog.nicksieger.com/xml/atom10/tag/xml/feed.xml"/>
  <link rel="alternate" type="text/html" href="http://blog.nicksieger.com/articles/tag/xml?tag=xml"/>
  <updated>2007-08-31T16:41:36+00:00</updated>
  <entry>
    <author>
      <name>Nick Sieger</name>
    </author>
    <id>urn:uuid:d060e313-e46a-49dc-9093-fcb0f2ae7783</id>
    <published>2006-11-02T02:12:00+00:00</published>
    <updated>2007-08-31T16:41:36+00:00</updated>
    <title>Ruby and XML not-so-simple?</title>
    <link rel="alternate" type="text/html" href="http://blog.nicksieger.com/articles/2006/11/02/ruby-and-xml-not-so-simple"/>
    <category term="ruby" label="ruby" scheme="http://blog.nicksieger.com/articles/category/ruby"/>
    <category term="rails" label="rails" scheme="http://blog.nicksieger.com/articles/category/rails"/>
    <category term="ruby" scheme="http://blog.nicksieger.com/articles/tag/ruby"/>
    <category term="xml" scheme="http://blog.nicksieger.com/articles/tag/xml"/>
    <content type="html">&lt;p&gt;&lt;em&gt;Update: &lt;a href="http://www.koziarski.net/"&gt;Koz&lt;/a&gt; already &lt;a href="http://dev.rubyonrails.org/changeset/5414"&gt;fixed the issue in trunk&lt;/a&gt;, and the changes are also &lt;a href="http://dev.rubyonrails.org/changeset/5415"&gt;going into the 1.2 release&lt;/a&gt; as well.  Thanks!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Man, I think I&amp;#8217;ve been reading too much &lt;a href="http://intertwingly.net/blog/2005/09/14/Last-Line-of-Defense/"&gt;Sam Ruby&lt;/a&gt; lately (OK, that was a year ago, but not much has changed). You have to admit, though, that XML handling in Ruby is one of those things that just doesn&amp;#8217;t feel quite right. REXML is pretty much the standard XML API for Ruby, yet in my opinion it suffers from two showstoppers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;In Ruby 1.8.4 it still has the glaring hole Sam mentioned last year with well-formedness.  (No exception raised below!)&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;irb(main):001:0&amp;gt; require 'rexml/document'
=&amp;gt; true
irb(main):002:0&amp;gt; d = REXML::Document.new '&amp;lt;div&amp;gt;at&amp;amp;t'
=&amp;gt; &amp;lt;UNDEFINED&amp;gt; ... &amp;lt;/&amp;gt;
irb(main):003:0&amp;gt; d.root
=&amp;gt; &amp;lt;div&amp;gt; ... &amp;lt;/&amp;gt;
irb(main):004:0&amp;gt; d.root.text
=&amp;gt; "at&amp;amp;t"
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;REXML::Text#to_s&lt;/code&gt; method violates the principle of least surprise. In just about every other XML parser, when you ask a
text node for its contents, it returns the value with entities resolved. Not so &lt;code&gt;Text#to_s&lt;/code&gt;; you have to call &lt;code&gt;Text#value&lt;/code&gt;
instead. Unfortunately, this would be difficult to change in future versions of REXML without breaking existing apps.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;irb(main):001:0&amp;gt; require 'rexml/document'
=&amp;gt; true
irb(main):002:0&amp;gt; t = REXML::Text.new('at&amp;amp;t')
=&amp;gt; "at&amp;amp;t"
irb(main):003:0&amp;gt; t.to_s
=&amp;gt; "at&amp;amp;amp;t"
irb(main):004:0&amp;gt; t.value
=&amp;gt; "at&amp;amp;t"
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This second problem manifests itself in subtle ways. If you&amp;#8217;re calling &lt;code&gt;Element#text&lt;/code&gt; (probably the most common way to get at text), you&amp;#8217;re fine, because it implicitly does &lt;code&gt;self.texts.first.value&lt;/code&gt; under the hood. But if you want to make sure you&amp;#8217;re grabbing all the text content, you might be inclined to write &lt;code&gt;element.texts.join('')&lt;/code&gt; to concatenate the text nodes. That bypasses the &lt;code&gt;value&lt;/code&gt; method: &lt;code&gt;join&lt;/code&gt; calls &lt;code&gt;to_s&lt;/code&gt; on each text node instead, leaving you with unresolved entities.&lt;/p&gt;
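
&lt;p&gt;Here is a minimal sketch of that difference, assuming REXML behaves as described above. The hand-built element and the &lt;code&gt;AMP&lt;/code&gt; constant are made up for illustration; the ampersand is spelled as a character code so the snippet contains no markup-significant characters.&lt;/p&gt;

```ruby
require 'rexml/document'

AMP = 38.chr # the ampersand character, spelled by its code (illustrative helper)

# Build a small element by hand; the child element splits the text in two nodes.
div = REXML::Element.new('div')
div.add_text("at#{AMP}t")
div.add_element('br')
div.add_text("r#{AMP}d")

# Array#join calls to_s on each text node, so entities stay escaped.
joined_raw = div.texts.join('')

# Asking each node for its value first gives the resolved text.
joined_value = div.texts.map { |t| t.value }.join('')

puts joined_raw
puts joined_value
```

&lt;p&gt;The first line printed still contains the escaped form of the ampersand; the second contains the plain character.&lt;/p&gt;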

&lt;p&gt;It turns out this problem is exhibited in the version of &lt;a href="http://xml-simple.rubyforge.org/"&gt;XmlSimple&lt;/a&gt; now included with &lt;a href="http://dev.rubyonrails.org/browser/trunk/activesupport/lib/active_support/vendor/xml_simple.rb?rev=4453&amp;amp;format=txt"&gt;Edge Rails as of rev 4453&lt;/a&gt;. So if
you&amp;#8217;re living on the edge, using the newly minted ActiveResource to fetch XML from remote resources like a champion, you get benched
as soon as you fetch XML that has normalized entities inside.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://rubyforge.org/frs/download.php/11388/xml-simple-1.0.9.gem"&gt;XmlSimple version 1.0.9&lt;/a&gt; has a partial fix for this issue, but I submitted &lt;a href="/files/test_xmlsimple.rb"&gt;another patch&lt;/a&gt; to &lt;a href="http://www.maik-schmidt.de/"&gt;Maik Schmidt&lt;/a&gt; for review that he subsequently released as 1.0.10.  I&amp;#8217;ve attached the 1.0.10 version to ticket &lt;a href="http://dev.rubyonrails.org/ticket/6532"&gt;6532&lt;/a&gt; in hopes that it will be patched in Rails soon.&lt;/p&gt;</content>
  </entry>
  <entry>
    <author>
      <name>Nick Sieger</name>
    </author>
    <id>urn:uuid:0d137df7-5bca-451d-97f6-34af4d5a36b7</id>
    <published>2008-01-17T04:07:00+00:00</published>
    <updated>2008-01-17T17:08:14+00:00</updated>
    <title>REXML a Drag...Again</title>
    <link rel="alternate" type="text/html" href="http://blog.nicksieger.com/articles/2008/01/17/rexml-a-drag-again"/>
    <category term="ruby" scheme="http://blog.nicksieger.com/articles/tag/ruby"/>
    <category term="xml" scheme="http://blog.nicksieger.com/articles/tag/xml"/>
    <category term="jruby" scheme="http://blog.nicksieger.com/articles/tag/jruby"/>
    <content type="html">&lt;p&gt;&lt;a href="http://blog.nicksieger.com/articles/2006/11/02/ruby-and-xml-not-so-simple"&gt;We&amp;#8217;ve been here before.&lt;/a&gt; So here&amp;#8217;s the scenario: You&amp;#8217;re feeding medium-to-large chunks of XML out of one Rails app, to be consumed by another via &lt;a href="http://ryandaigle.com/articles/2006/06/30/whats-new-in-edge-rails-activeresource-is-here" title="Ryan's Scraps: What's New in Edge Rails: The Ins and Outs of ActiveResource"&gt;ActiveResource&lt;/a&gt;. Maybe those chunks have embedded HTML, or maybe they&amp;#8217;re an Atom feed containing several pieces of HTML with all the entities escaped. Maybe &lt;a href="http://blog.nicksieger.com/files/page.xml"&gt;they contain entire Wikipedia pages in them&lt;/a&gt;. Lots of entities that need expansion when the file is parsed.&lt;/p&gt;

&lt;p&gt;So what does ActiveResource do with this? &lt;code&gt;Hash.from_xml&lt;/code&gt;. Which uses &lt;a href="http://xml-simple.rubyforge.org/" title="XmlSimple - XML made easy"&gt;xml-simple&lt;/a&gt;. Which constructs a &lt;code&gt;REXML::Document&lt;/code&gt;, and proceeds to navigate the entire DOM, scraping the text nodes out of it so they can be stuffed in a hash to be handed back to ActiveResource. And how does REXML expand all the entities it runs across? With this little lovely:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="comment"&gt;# Unescapes all possible entities&lt;/span&gt;
&lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;Text::unnormalize&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt; &lt;span class="ident"&gt;string&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;doctype&lt;/span&gt;&lt;span class="punct"&gt;=&lt;/span&gt;&lt;span class="constant"&gt;nil&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;filter&lt;/span&gt;&lt;span class="punct"&gt;=&lt;/span&gt;&lt;span class="constant"&gt;nil&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;illegal&lt;/span&gt;&lt;span class="punct"&gt;=&lt;/span&gt;&lt;span class="constant"&gt;nil&lt;/span&gt; &lt;span class="punct"&gt;)&lt;/span&gt;
  &lt;span class="ident"&gt;rv&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;string&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;clone&lt;/span&gt;
  &lt;span class="ident"&gt;rv&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;gsub!&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt; &lt;span class="punct"&gt;/&lt;/span&gt;&lt;span class="regex"&gt;&lt;span class="escape"&gt;\r\n&lt;/span&gt;?&lt;/span&gt;&lt;span class="punct"&gt;/,&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;&lt;span class="escape"&gt;\n&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; &lt;span class="punct"&gt;)&lt;/span&gt;
  &lt;span class="ident"&gt;matches&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;rv&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;scan&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt; &lt;span class="constant"&gt;REFERENCE&lt;/span&gt; &lt;span class="punct"&gt;)&lt;/span&gt;
  &lt;span class="keyword"&gt;return&lt;/span&gt; &lt;span class="ident"&gt;rv&lt;/span&gt; &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;matches&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;size&lt;/span&gt; &lt;span class="punct"&gt;==&lt;/span&gt; &lt;span class="number"&gt;0&lt;/span&gt;
  &lt;span class="ident"&gt;rv&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;gsub!&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt; &lt;span class="constant"&gt;NUMERICENTITY&lt;/span&gt; &lt;span class="punct"&gt;)&lt;/span&gt; &lt;span class="punct"&gt;{|&lt;/span&gt;&lt;span class="ident"&gt;m&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
    &lt;span class="ident"&gt;m&lt;/span&gt;&lt;span class="punct"&gt;=&lt;/span&gt;&lt;span class="global"&gt;$1&lt;/span&gt;
    &lt;span class="ident"&gt;m&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;0&lt;span class="expr"&gt;#{m}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;m&lt;/span&gt;&lt;span class="punct"&gt;[&lt;/span&gt;&lt;span class="number"&gt;0&lt;/span&gt;&lt;span class="punct"&gt;]&lt;/span&gt; &lt;span class="punct"&gt;==&lt;/span&gt; &lt;span class="char"&gt;?x&lt;/span&gt;
    &lt;span class="punct"&gt;[&lt;/span&gt;&lt;span class="constant"&gt;Integer&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;m&lt;/span&gt;&lt;span class="punct"&gt;)].&lt;/span&gt;&lt;span class="ident"&gt;pack&lt;/span&gt;&lt;span class="punct"&gt;('&lt;/span&gt;&lt;span class="string"&gt;U*&lt;/span&gt;&lt;span class="punct"&gt;')&lt;/span&gt;
  &lt;span class="punct"&gt;}&lt;/span&gt;
  &lt;span class="ident"&gt;matches&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;collect!&lt;/span&gt;&lt;span class="punct"&gt;{|&lt;/span&gt;&lt;span class="ident"&gt;x&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;x&lt;/span&gt;&lt;span class="punct"&gt;[&lt;/span&gt;&lt;span class="number"&gt;0&lt;/span&gt;&lt;span class="punct"&gt;]}.&lt;/span&gt;&lt;span class="ident"&gt;compact!&lt;/span&gt;
  &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;matches&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;size&lt;/span&gt; &lt;span class="punct"&gt;&amp;gt;&lt;/span&gt; &lt;span class="number"&gt;0&lt;/span&gt;
    &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;doctype&lt;/span&gt;
      &lt;span class="ident"&gt;matches&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;each&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;entity_reference&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
        &lt;span class="keyword"&gt;unless&lt;/span&gt; &lt;span class="ident"&gt;filter&lt;/span&gt; &lt;span class="keyword"&gt;and&lt;/span&gt; &lt;span class="ident"&gt;filter&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;include?&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;entity_reference&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
          &lt;span class="ident"&gt;entity_value&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;doctype&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;entity&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt; &lt;span class="ident"&gt;entity_reference&lt;/span&gt; &lt;span class="punct"&gt;)&lt;/span&gt;
          &lt;span class="ident"&gt;re&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;/&lt;/span&gt;&lt;span class="regex"&gt;&amp;amp;&lt;span class="expr"&gt;#{entity_reference}&lt;/span&gt;;&lt;/span&gt;&lt;span class="punct"&gt;/&lt;/span&gt;
          &lt;span class="ident"&gt;rv&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;gsub!&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt; &lt;span class="ident"&gt;re&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;entity_value&lt;/span&gt; &lt;span class="punct"&gt;)&lt;/span&gt; &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;entity_value&lt;/span&gt;
        &lt;span class="keyword"&gt;end&lt;/span&gt;
      &lt;span class="keyword"&gt;end&lt;/span&gt;
    &lt;span class="keyword"&gt;else&lt;/span&gt;
      &lt;span class="ident"&gt;matches&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;each&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;entity_reference&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
        &lt;span class="keyword"&gt;unless&lt;/span&gt; &lt;span class="ident"&gt;filter&lt;/span&gt; &lt;span class="keyword"&gt;and&lt;/span&gt; &lt;span class="ident"&gt;filter&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;include?&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;entity_reference&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
          &lt;span class="ident"&gt;entity_value&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;DocType&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;DEFAULT_ENTITIES&lt;/span&gt;&lt;span class="punct"&gt;[&lt;/span&gt; &lt;span class="ident"&gt;entity_reference&lt;/span&gt; &lt;span class="punct"&gt;]&lt;/span&gt;
          &lt;span class="ident"&gt;re&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;/&lt;/span&gt;&lt;span class="regex"&gt;&amp;amp;&lt;span class="expr"&gt;#{entity_reference}&lt;/span&gt;;&lt;/span&gt;&lt;span class="punct"&gt;/&lt;/span&gt;
          &lt;span class="ident"&gt;rv&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;gsub!&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt; &lt;span class="ident"&gt;re&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;entity_value&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;value&lt;/span&gt; &lt;span class="punct"&gt;)&lt;/span&gt; &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;entity_value&lt;/span&gt;
        &lt;span class="keyword"&gt;end&lt;/span&gt;
      &lt;span class="keyword"&gt;end&lt;/span&gt;
    &lt;span class="keyword"&gt;end&lt;/span&gt;
    &lt;span class="ident"&gt;rv&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;gsub!&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt; &lt;span class="punct"&gt;/&lt;/span&gt;&lt;span class="regex"&gt;&amp;amp;amp;&lt;/span&gt;&lt;span class="punct"&gt;/,&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;&amp;amp;&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt; &lt;span class="punct"&gt;)&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
  &lt;span class="ident"&gt;rv&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now, when you look at this, your first impression is that it just &lt;em&gt;screams&lt;/em&gt; fast, right? Let&amp;#8217;s run &lt;code&gt;Hash.from_xml&lt;/code&gt; on the &lt;a href="http://blog.nicksieger.com/files/page.xml"&gt;file I mentioned above&lt;/a&gt;.&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="comment"&gt;# unnormalize.rb&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rubygems&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;gem&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;activesupport&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;active_support&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="constant"&gt;File&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;open&lt;/span&gt;&lt;span class="punct"&gt;(&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;page.xml&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;)&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;f&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
  &lt;span class="constant"&gt;Hash&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;from_xml&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;f&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;read&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;pre&gt;&lt;code&gt;$ time ruby unnormalize.rb

real    0m16.221s
user    0m14.447s
sys     0m0.346s
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Whoa! Knock me over with a feather! It blows chunks! You mean calling &lt;code&gt;#gsub!&lt;/code&gt; repeatedly in a loop with dregexps (regexp literals with interpolated strings) doesn&amp;#8217;t go fast? It&amp;#8217;s about twice as slow on JRuby, too:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ time jruby unnormalize.rb

real    0m33.637s
user    0m32.897s
sys     0m0.573s
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;All this on a paltry 393K XML file. Makes me wonder how anyone ever does any serious XML processing in Ruby.&lt;/p&gt;

&lt;p&gt;I know, this is open source, I should be whipping up a patch for this and submitting it. Well, I did cook up a solution, but it unfortunately is only available for JRuby at the moment. (I also have much more faith in &lt;a href="http://www.intertwingly.net/blog/2007/12/31/Porting-REXML-to-Ruby-1-9"&gt;Sam Ruby&lt;/a&gt; than myself to get the semantics of a rewritten &lt;code&gt;REXML::Text::unnormalize&lt;/code&gt; correct.)&lt;/p&gt;
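
&lt;p&gt;For what it is worth, here is a rough sketch of what a single-pass expansion might look like in plain Ruby. It is not REXML code and not a drop-in replacement for &lt;code&gt;unnormalize&lt;/code&gt;: it ignores doctype-declared entities and filters, and the names &lt;code&gt;AMP&lt;/code&gt;, &lt;code&gt;PREDEFINED&lt;/code&gt;, &lt;code&gt;ENTITY_REF&lt;/code&gt; and &lt;code&gt;expand_entities&lt;/code&gt; are hypothetical. The ampersand is written as a character code or as the escape \x26 throughout.&lt;/p&gt;

```ruby
AMP = 38.chr # the ampersand character, spelled by its code

# The five entities predefined by XML, keyed by name (hypothetical table).
PREDEFINED = {
  'amp'  => AMP,
  'lt'   => 60.chr,
  'gt'   => 62.chr,
  'quot' => 34.chr,
  'apos' => 39.chr
}.freeze

# One precompiled pattern: a decimal reference, a hex reference, or a name.
# \x26 matches the ampersand.
ENTITY_REF = /\x26(?:#(\d+)|#x(\h+)|(\w+));/

# Expand every reference in one gsub pass instead of one gsub! per entity.
def expand_entities(string)
  string.gsub(ENTITY_REF) do |match|
    decimal, hex, name = $1, $2, $3
    if decimal
      [decimal.to_i].pack('U')
    elsif hex
      [hex.to_i(16)].pack('U')
    else
      PREDEFINED.fetch(name, match) # unknown names are left untouched
    end
  end
end

puts expand_entities("at\x26amp;t")
```

&lt;p&gt;Because each reference is replaced exactly once and the result is never rescanned, an escaped ampersand cannot be expanded a second time, which is the job the final &lt;code&gt;gsub!&lt;/code&gt; ordering does in the original.&lt;/p&gt;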

&lt;p&gt;&lt;a href="http://www.nabble.com/-ANN--JREXML-0.5-t4234591.html"&gt;A while back&lt;/a&gt; I cooked up &lt;a href="http://caldersphere.rubyforge.org/jrexml"&gt;JREXML&lt;/a&gt; because Regexp processing in JRuby was slow at the time, and the guts of REXML is driven by a Regexp-based parser. JREXML swaps out that regexp parser with a Java pull parser library, and at the time it provided a modest speedup.&lt;/p&gt;

&lt;p&gt;So, in the context of JREXML, the solution now becomes simple, by taking advantage of the fact that Java XML parsers typically expand entities for you. The just-released JREXML 0.5.3 circumvents &lt;code&gt;REXML::Text::unnormalize&lt;/code&gt; when constructing a document from the Java-based parser. And the results again don&amp;#8217;t disappoint:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ time jruby unnormalize_jrexml.rb

real    0m5.802s
user    0m5.315s
sys     0m0.345s
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;em&gt;Update:&lt;/em&gt; At Sam&amp;#8217;s request, I ran the numbers again with REXML trunk, which &lt;a href="http://www.germane-software.com/projects/rexml/changeset/1289"&gt;condenses entity expansion into a single &lt;code&gt;gsub&lt;/code&gt;&lt;/a&gt;. Speed is much more reasonable on MRI, but it didn&amp;#8217;t move much on JRuby (probably more a datapoint for JRuby developers than REXML developers).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ time ruby -I~/Projects/ruby/rexml/src unnormalize.rb 

real    0m6.592s
user    0m0.845s
sys     0m0.345s

$ time jruby -I~/Projects/ruby/rexml/src unnormalize.rb

real    0m34.353s
user    0m33.023s
sys     0m0.714s
&lt;/code&gt;&lt;/pre&gt;</content>
  </entry>
</feed>
