<?xml version="1.0" encoding="UTF-8"?>
<feed xml:lang="en-US" xmlns="http://www.w3.org/2005/Atom">
  <title>Nick Sieger: REXML a Drag...Again</title>
  <id>tag:blog.nicksieger.com,2005:Typo</id>
  <generator uri="http://www.typosphere.org" version="4.0">Typo</generator>
  <link href="http://blog.nicksieger.com/xml/atom10/article/361/feed.xml" rel="self" type="application/atom+xml"/>
  <link href="http://blog.nicksieger.com/articles/2008/01/17/rexml-a-drag-again" rel="alternate" type="text/html"/>
  <updated>2008-01-19T10:30:27+00:00</updated>
  <entry>
    <author>
      <name>Dieta</name>
    </author>
    <id>urn:uuid:207fd193-6f52-42bf-a45a-77259896efa7</id>
    <published>2008-01-19T10:30:27+00:00</published>
    <updated>2008-01-19T10:30:27+00:00</updated>
    <title>Comment on REXML a Drag...Again by Dieta</title>
    <link href="http://blog.nicksieger.com/articles/2008/01/17/rexml-a-drag-again#comment-370" rel="alternate" type="text/html"/>
    <content type="html">&lt;p&gt;Yep. As Phil said I use libxml2 and it&amp;#8217;s ok.&lt;/p&gt;</content>
  </entry>
  <entry>
    <author>
      <name>Jeff Shadap</name>
    </author>
    <id>urn:uuid:776411c1-5a55-4179-96bb-688c55a19cdb</id>
    <published>2008-01-18T07:22:57+00:00</published>
    <updated>2008-01-18T07:22:57+00:00</updated>
    <title>Comment on REXML a Drag...Again by Jeff Shadap</title>
    <link href="http://blog.nicksieger.com/articles/2008/01/17/rexml-a-drag-again#comment-369" rel="alternate" type="text/html"/>
    <content type="html">&lt;p&gt;If you want speed, use libxml; REXML is too slow for parsing large files.&lt;/p&gt;</content>
  </entry>
  <entry>
    <author>
      <name>Sam Ruby</name>
    </author>
    <id>urn:uuid:9ff80076-b5af-4ebe-a045-0f576a0c0afc</id>
    <published>2008-01-17T20:20:02+00:00</published>
    <updated>2008-01-17T20:20:02+00:00</updated>
    <title>Comment on REXML a Drag...Again by Sam Ruby</title>
    <link href="http://blog.nicksieger.com/articles/2008/01/17/rexml-a-drag-again#comment-365" rel="alternate" type="text/html"/>
    <content type="html">&lt;p&gt;Sweet!  And I would bet that Ruby 1.9 is even better.  And yes, I am looking into making the parser pluggable.&lt;/p&gt;</content>
  </entry>
  <entry>
    <author>
      <name>Phil</name>
    </author>
    <id>urn:uuid:b5bb3a71-1f06-4efc-b789-74b50753aeaa</id>
    <published>2008-01-17T18:36:16+00:00</published>
    <updated>2008-01-17T18:36:16+00:00</updated>
    <title>Comment on REXML a Drag...Again by Phil</title>
    <link href="http://blog.nicksieger.com/articles/2008/01/17/rexml-a-drag-again#comment-364" rel="alternate" type="text/html"/>
    <content type="html">&lt;p&gt;I think everyone who cares just uses libxml2 or Hpricot. As far as I can tell REXML is only used by people who have to stick to the standard library. Of course, the standard library badly needs Hpricot, so here&amp;#8217;s hoping.&lt;/p&gt;</content>
  </entry>
  <entry>
    <author>
      <name>Nick</name>
    </author>
    <id>urn:uuid:01fdc687-f081-43f2-af7f-9099727f6d5b</id>
    <published>2008-01-17T17:07:06+00:00</published>
    <updated>2008-01-17T17:07:07+00:00</updated>
    <title>Comment on REXML a Drag...Again by Nick</title>
    <link href="http://blog.nicksieger.com/articles/2008/01/17/rexml-a-drag-again#comment-363" rel="alternate" type="text/html"/>
    <content type="html">&lt;p&gt;Thanks for your prompt reply and stewardship of REXML, Sam. Updated numbers posted. Looking much better &amp;#8211; but using a Java parser for JRuby still looks like the best option.&lt;/p&gt;

&lt;p&gt;Any plans to replace REXML&amp;#8217;s base parser with expat or similar to speed it up further?&lt;/p&gt;</content>
  </entry>
  <entry>
    <author>
      <name>Sam Ruby</name>
    </author>
    <id>urn:uuid:c9e59832-9df1-4cb6-8860-8bbc9dfc3d72</id>
    <published>2008-01-17T09:45:00+00:00</published>
    <updated>2008-01-17T09:45:01+00:00</updated>
    <title>Comment on REXML a Drag...Again by Sam Ruby</title>
    <link href="http://blog.nicksieger.com/articles/2008/01/17/rexml-a-drag-again#comment-362" rel="alternate" type="text/html"/>
    <content type="html">&lt;p&gt;I agree that plugging in a real parser is the way to go.  But can I get you to try the latest REXML from SVN to see if there is a difference?  In particular, I&amp;#8217;m interested in whether or not the following change makes things better or worse in a real life scenario:&lt;/p&gt;

&lt;p&gt;&lt;a href='http://www.germane-software.com/projects/rexml/changeset/1289' rel="nofollow"&gt;http://www.germane-software.com/projects/rexml/changeset/1289&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The essence of the change is to do exactly &lt;em&gt;ONE&lt;/em&gt; gsub across the entire string for all entities.&lt;/p&gt;

&lt;p&gt;svn co &lt;a href='http://www.germane-software.com/repos/rexml/trunk/' rel="nofollow"&gt;http://www.germane-software.com/repos/rexml/trunk/&lt;/a&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <author>
      <name>Nick Sieger</name>
    </author>
    <id>urn:uuid:0d137df7-5bca-451d-97f6-34af4d5a36b7</id>
    <published>2008-01-17T04:07:00+00:00</published>
    <updated>2008-01-17T17:08:14+00:00</updated>
    <title>REXML a Drag...Again</title>
    <link href="http://blog.nicksieger.com/articles/2008/01/17/rexml-a-drag-again" rel="alternate" type="text/html"/>
    <category term="ruby" scheme="http://blog.nicksieger.com/articles/tag/ruby"/>
    <category term="xml" scheme="http://blog.nicksieger.com/articles/tag/xml"/>
    <category term="jruby" scheme="http://blog.nicksieger.com/articles/tag/jruby"/>
    <content type="html">&lt;p&gt;&lt;a href="http://blog.nicksieger.com/articles/2006/11/02/ruby-and-xml-not-so-simple"&gt;We&amp;#8217;ve been here before.&lt;/a&gt; So here&amp;#8217;s the scenario: You&amp;#8217;re feeding medium-to-large chunks of XML out of one Rails app, to be consumed by another via &lt;a href="http://ryandaigle.com/articles/2006/06/30/whats-new-in-edge-rails-activeresource-is-here" title="Ryan's Scraps: What's New in Edge Rails: The Ins and Outs of ActiveResource"&gt;ActiveResource&lt;/a&gt;. Maybe those chunks have embedded HTML, or maybe they&amp;#8217;re an Atom feed containing several pieces of HTML with all the entities escaped. Maybe &lt;a href="http://blog.nicksieger.com/files/page.xml"&gt;they contain entire Wikipedia pages&lt;/a&gt;. Either way, there are lots of entities that need expansion when the file is parsed.&lt;/p&gt;

&lt;p&gt;So what does ActiveResource do with this? &lt;code&gt;Hash.from_xml&lt;/code&gt;. Which uses &lt;a href="http://xml-simple.rubyforge.org/" title="XmlSimple - XML made easy"&gt;xml-simple&lt;/a&gt;. Which constructs a &lt;code&gt;REXML::Document&lt;/code&gt;, and proceeds to navigate the entire DOM, scraping the text nodes out of it so they can be stuffed in a hash to be handed back to ActiveResource. And how does REXML expand all the entities it runs across? With this little lovely:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="comment"&gt;# Unescapes all possible entities&lt;/span&gt;
&lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;Text::unnormalize&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt; &lt;span class="ident"&gt;string&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;doctype&lt;/span&gt;&lt;span class="punct"&gt;=&lt;/span&gt;&lt;span class="constant"&gt;nil&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;filter&lt;/span&gt;&lt;span class="punct"&gt;=&lt;/span&gt;&lt;span class="constant"&gt;nil&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;illegal&lt;/span&gt;&lt;span class="punct"&gt;=&lt;/span&gt;&lt;span class="constant"&gt;nil&lt;/span&gt; &lt;span class="punct"&gt;)&lt;/span&gt;
  &lt;span class="ident"&gt;rv&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;string&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;clone&lt;/span&gt;
  &lt;span class="ident"&gt;rv&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;gsub!&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt; &lt;span class="punct"&gt;/&lt;/span&gt;&lt;span class="regex"&gt;&lt;span class="escape"&gt;\r\n&lt;/span&gt;?&lt;/span&gt;&lt;span class="punct"&gt;/,&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;&lt;span class="escape"&gt;\n&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; &lt;span class="punct"&gt;)&lt;/span&gt;
  &lt;span class="ident"&gt;matches&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;rv&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;scan&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt; &lt;span class="constant"&gt;REFERENCE&lt;/span&gt; &lt;span class="punct"&gt;)&lt;/span&gt;
  &lt;span class="keyword"&gt;return&lt;/span&gt; &lt;span class="ident"&gt;rv&lt;/span&gt; &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;matches&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;size&lt;/span&gt; &lt;span class="punct"&gt;==&lt;/span&gt; &lt;span class="number"&gt;0&lt;/span&gt;
  &lt;span class="ident"&gt;rv&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;gsub!&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt; &lt;span class="constant"&gt;NUMERICENTITY&lt;/span&gt; &lt;span class="punct"&gt;)&lt;/span&gt; &lt;span class="punct"&gt;{|&lt;/span&gt;&lt;span class="ident"&gt;m&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
    &lt;span class="ident"&gt;m&lt;/span&gt;&lt;span class="punct"&gt;=&lt;/span&gt;&lt;span class="global"&gt;$1&lt;/span&gt;
    &lt;span class="ident"&gt;m&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;0&lt;span class="expr"&gt;#{m}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt; &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;m&lt;/span&gt;&lt;span class="punct"&gt;[&lt;/span&gt;&lt;span class="number"&gt;0&lt;/span&gt;&lt;span class="punct"&gt;]&lt;/span&gt; &lt;span class="punct"&gt;==&lt;/span&gt; &lt;span class="char"&gt;?x&lt;/span&gt;
    &lt;span class="punct"&gt;[&lt;/span&gt;&lt;span class="constant"&gt;Integer&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;m&lt;/span&gt;&lt;span class="punct"&gt;)].&lt;/span&gt;&lt;span class="ident"&gt;pack&lt;/span&gt;&lt;span class="punct"&gt;('&lt;/span&gt;&lt;span class="string"&gt;U*&lt;/span&gt;&lt;span class="punct"&gt;')&lt;/span&gt;
  &lt;span class="punct"&gt;}&lt;/span&gt;
  &lt;span class="ident"&gt;matches&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;collect!&lt;/span&gt;&lt;span class="punct"&gt;{|&lt;/span&gt;&lt;span class="ident"&gt;x&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;x&lt;/span&gt;&lt;span class="punct"&gt;[&lt;/span&gt;&lt;span class="number"&gt;0&lt;/span&gt;&lt;span class="punct"&gt;]}.&lt;/span&gt;&lt;span class="ident"&gt;compact!&lt;/span&gt;
  &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;matches&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;size&lt;/span&gt; &lt;span class="punct"&gt;&amp;gt;&lt;/span&gt; &lt;span class="number"&gt;0&lt;/span&gt;
    &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;doctype&lt;/span&gt;
      &lt;span class="ident"&gt;matches&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;each&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;entity_reference&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
        &lt;span class="keyword"&gt;unless&lt;/span&gt; &lt;span class="ident"&gt;filter&lt;/span&gt; &lt;span class="keyword"&gt;and&lt;/span&gt; &lt;span class="ident"&gt;filter&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;include?&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;entity_reference&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
          &lt;span class="ident"&gt;entity_value&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;doctype&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;entity&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt; &lt;span class="ident"&gt;entity_reference&lt;/span&gt; &lt;span class="punct"&gt;)&lt;/span&gt;
          &lt;span class="ident"&gt;re&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;/&lt;/span&gt;&lt;span class="regex"&gt;&amp;amp;&lt;span class="expr"&gt;#{entity_reference}&lt;/span&gt;;&lt;/span&gt;&lt;span class="punct"&gt;/&lt;/span&gt;
          &lt;span class="ident"&gt;rv&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;gsub!&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt; &lt;span class="ident"&gt;re&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;entity_value&lt;/span&gt; &lt;span class="punct"&gt;)&lt;/span&gt; &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;entity_value&lt;/span&gt;
        &lt;span class="keyword"&gt;end&lt;/span&gt;
      &lt;span class="keyword"&gt;end&lt;/span&gt;
    &lt;span class="keyword"&gt;else&lt;/span&gt;
      &lt;span class="ident"&gt;matches&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;each&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;entity_reference&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
        &lt;span class="keyword"&gt;unless&lt;/span&gt; &lt;span class="ident"&gt;filter&lt;/span&gt; &lt;span class="keyword"&gt;and&lt;/span&gt; &lt;span class="ident"&gt;filter&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;include?&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;entity_reference&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
          &lt;span class="ident"&gt;entity_value&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;DocType&lt;/span&gt;&lt;span class="punct"&gt;::&lt;/span&gt;&lt;span class="constant"&gt;DEFAULT_ENTITIES&lt;/span&gt;&lt;span class="punct"&gt;[&lt;/span&gt; &lt;span class="ident"&gt;entity_reference&lt;/span&gt; &lt;span class="punct"&gt;]&lt;/span&gt;
          &lt;span class="ident"&gt;re&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;/&lt;/span&gt;&lt;span class="regex"&gt;&amp;amp;&lt;span class="expr"&gt;#{entity_reference}&lt;/span&gt;;&lt;/span&gt;&lt;span class="punct"&gt;/&lt;/span&gt;
          &lt;span class="ident"&gt;rv&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;gsub!&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt; &lt;span class="ident"&gt;re&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;entity_value&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;value&lt;/span&gt; &lt;span class="punct"&gt;)&lt;/span&gt; &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;entity_value&lt;/span&gt;
        &lt;span class="keyword"&gt;end&lt;/span&gt;
      &lt;span class="keyword"&gt;end&lt;/span&gt;
    &lt;span class="keyword"&gt;end&lt;/span&gt;
    &lt;span class="ident"&gt;rv&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;gsub!&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt; &lt;span class="punct"&gt;/&lt;/span&gt;&lt;span class="regex"&gt;&amp;amp;amp;&lt;/span&gt;&lt;span class="punct"&gt;/,&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;&amp;amp;&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt; &lt;span class="punct"&gt;)&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
  &lt;span class="ident"&gt;rv&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now, when you look at this, your first impression is that it just &lt;em&gt;screams&lt;/em&gt; fast, right? Let&amp;#8217;s run &lt;code&gt;Hash.from_xml&lt;/code&gt; on the &lt;a href="http://blog.nicksieger.com/files/page.xml"&gt;file I mentioned above&lt;/a&gt;.&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="comment"&gt;# unnormalize.rb&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rubygems&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;gem&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;activesupport&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;active_support&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="constant"&gt;File&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;open&lt;/span&gt;&lt;span class="punct"&gt;(&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;page.xml&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;)&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;f&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
  &lt;span class="constant"&gt;Hash&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;from_xml&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;f&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;read&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;pre&gt;&lt;code&gt;$ time ruby unnormalize.rb

real    0m16.221s
user    0m14.447s
sys     0m0.346s
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Whoa! Knock me over with a feather! It blows chunks! You mean calling &lt;code&gt;#gsub!&lt;/code&gt; repeatedly in a loop with dregexps (regexp literals with interpolated strings) doesn&amp;#8217;t go fast? It&amp;#8217;s twice as slow on JRuby, too:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ time jruby unnormalize.rb

real    0m33.637s
user    0m32.897s
sys     0m0.573s
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;All this on a paltry 393K XML file. Makes me wonder how anyone ever does any serious XML processing in Ruby.&lt;/p&gt;
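
&lt;p&gt;For contrast with the loop above, here is a minimal sketch of what single-pass expansion could look like: one precompiled regexp and one &lt;code&gt;gsub&lt;/code&gt; with a block, instead of a fresh interpolated regexp per entity. It is an illustration only, not REXML&amp;#8217;s code. The names &lt;code&gt;AMP&lt;/code&gt;, &lt;code&gt;LT&lt;/code&gt;, &lt;code&gt;ENTITY_VALUES&lt;/code&gt;, &lt;code&gt;ENTITY_REFERENCE&lt;/code&gt; and &lt;code&gt;expand_entities&lt;/code&gt; are made up for this example, and it assumes only the five predefined XML entities plus numeric references (no doctype-defined entities, no filter).&lt;/p&gt;

```ruby
# Illustrative sketch only; every name below is hypothetical, not REXML API.
# The ampersand and less-than sign are spelled by code point so the snippet
# can be embedded in HTML or XML without any escaping.
AMP = "\u0026" # ampersand
LT  = "\u003C" # less-than sign

# The five predefined XML entities, keyed by entity name.
ENTITY_VALUES = {
  "lt"   => LT,
  "gt"   => ">",
  "quot" => '"',
  "apos" => "'",
  "amp"  => AMP
}.freeze

# One regexp, built once: decimal, hexadecimal, or named reference.
ENTITY_REFERENCE = /#{AMP}(?:#(\d+)|#x(\h+)|(\w+));/

# Expand every reference in a single gsub pass.
# References with unknown names are left exactly as they were.
def expand_entities(string)
  string.gsub(ENTITY_REFERENCE) do |match|
    if $1
      [$1.to_i].pack("U")            # decimal character reference
    elsif $2
      [$2.hex].pack("U")             # hexadecimal character reference
    else
      ENTITY_VALUES.fetch($3, match) # named entity, or the original text
    end
  end
end

sample = "it#{AMP}#8217;s #{AMP}lt;em#{AMP}gt;fast#{AMP}lt;/em#{AMP}gt;"
puts expand_entities(sample)
```

&lt;p&gt;Because every reference is replaced in the same pass, the output of one substitution is never rescanned. An escaped ampersand followed by &lt;code&gt;lt;&lt;/code&gt; therefore stays literal text, which is the job the trailing ampersand &lt;code&gt;gsub!&lt;/code&gt; does in the original.&lt;/p&gt;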

&lt;p&gt;I know, this is open source, I should be whipping up a patch for this and submitting it. Well, I did cook up a solution, but it unfortunately is only available for JRuby at the moment. (I also have much more faith in &lt;a href="http://www.intertwingly.net/blog/2007/12/31/Porting-REXML-to-Ruby-1-9"&gt;Sam Ruby&lt;/a&gt; than myself to get the semantics of a rewritten &lt;code&gt;REXML::Text::unnormalize&lt;/code&gt; correct.)&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.nabble.com/-ANN--JREXML-0.5-t4234591.html"&gt;A while back&lt;/a&gt; I cooked up &lt;a href="http://caldersphere.rubyforge.org/jrexml"&gt;JREXML&lt;/a&gt; because Regexp processing in JRuby was slow at the time, and the guts of REXML are driven by a Regexp-based parser. JREXML swaps out that regexp parser for a Java pull parser library, and at the time it provided a modest speedup.&lt;/p&gt;

&lt;p&gt;So, in the context of JREXML, the solution now becomes simple, by taking advantage of the fact that Java XML parsers typically expand entities for you. The just-released JREXML 0.5.3 circumvents &lt;code&gt;REXML::Text::unnormalize&lt;/code&gt; when constructing a document from the Java-based parser. And the results again don&amp;#8217;t disappoint:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ time jruby unnormalize_jrexml.rb

real    0m5.802s
user    0m5.315s
sys     0m0.345s
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;em&gt;Update:&lt;/em&gt; At Sam&amp;#8217;s request, I ran the numbers again with REXML trunk, which &lt;a href="http://www.germane-software.com/projects/rexml/changeset/1289"&gt;condenses entity expansion into a single &lt;code&gt;gsub&lt;/code&gt;&lt;/a&gt;. The time is much more reasonable on MRI, but it didn&amp;#8217;t move much on JRuby (probably more a datapoint for JRuby developers than REXML developers).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ time ruby -I~/Projects/ruby/rexml/src unnormalize.rb

real    0m6.592s
user    0m0.845s
sys     0m0.345s

$ time jruby -I~/Projects/ruby/rexml/src unnormalize.rb

real    0m34.353s
user    0m33.023s
sys     0m0.714s
&lt;/code&gt;&lt;/pre&gt;</content>
  </entry>
</feed>
