<p>Nick Sieger: REXML a Drag...Again (feed updated 2010-11-22T19:00:54+00:00)</p>
<p>Comment on REXML a Drag...Again by Dieta (2008-01-19T10:30:27+00:00):</p>
<p>Yep. As Phil said I use libxml2 and it’s ok.</p>
<p>Comment on REXML a Drag...Again by Jeff Shadap (2008-01-18T07:22:57+00:00):</p>
<p>If you want speed use libxml, REXML is too slow for parsing large files</p>
<p>Comment on REXML a Drag...Again by Sam Ruby (2008-01-17T20:20:02+00:00):</p>
<p>Sweet! And I would bet that Ruby 1.9 is even better. And yes, I am looking into making the parser pluggable.</p>
<p>Comment on REXML a Drag...Again by Phil (2008-01-17T18:36:16+00:00):</p>
<p>I think everyone who cares just uses libxml2 or Hpricot. As far as I can tell REXML is only used by people who have to stick to the standard library. Of course, the standard library badly needs Hpricot, so here’s hoping.</p>
<p>Comment on REXML a Drag...Again by Nick (2008-01-17T17:07:06+00:00):</p>
<p>Thanks for your prompt reply and stewardship for REXML, Sam. Updated numbers posted. Looking much better -- but using a Java parser for JRuby still looks like the best option.</p>
<p>Any plans to replace REXML’s base parser with expat or similar to speed it up further?</p>
<p>Comment on REXML a Drag...Again by Sam Ruby (2008-01-17T09:45:00+00:00):</p>
<p>I agree that plugging in a real parser is the way to go. But can I get you to try the latest REXML from SVN to see if there is a difference? In particular, I’m interested in whether or not the following change makes things better or worse in a real life scenario:</p>
<p><a href='http://www.germane-software.com/projects/rexml/changeset/1289' rel="nofollow">http://www.germane-software.com/projects/rexml/changeset/1289</a></p>
<p>The essence of the change is to do exactly <em>ONE</em> gsub across the entire string for all entities.</p>
<p>svn co <a href='http://www.germane-software.com/repos/rexml/trunk/' rel="nofollow">http://www.germane-software.com/repos/rexml/trunk/</a></p>
<p>REXML a Drag...Again, by Nick Sieger (2008-01-17T04:07:00+00:00)</p>
<p><a href="http://blog.nicksieger.com/articles/2006/11/02/ruby-and-xml-not-so-simple">We’ve been here before.</a> So here’s the scenario: You’re feeding medium-to-large chunks of XML out of one Rails app, to be consumed by another via <a href="http://ryandaigle.com/articles/2006/06/30/whats-new-in-edge-rails-activeresource-is-here" title="Ryan's Scraps: What's New in Edge Rails: The Ins and Outs of ActiveResource">ActiveResource</a>. Maybe those chunks have embedded HTML, or maybe they’re an Atom feed containing several pieces of HTML with all the entities escaped. Maybe <a href="http://blog.nicksieger.com/files/page.xml">they contain entire Wikipedia pages in them</a>. Lots of entities that need expansion when the file is parsed.</p>
<p>So what does ActiveResource do with this? <code>Hash.from_xml</code>. Which uses <a href="http://xml-simple.rubyforge.org/" title="XmlSimple - XML made easy">xml-simple</a>. Which constructs a <code>REXML::Document</code>, and proceeds to navigate the entire DOM, scraping the text nodes out of it so they can be stuffed in a hash to be handed back to ActiveResource. And how does REXML expand all the entities it runs across? With this little lovely:</p>
<div class="typocode"><pre><code class="typocode_ruby "><span class="comment"># Unescapes all possible entities</span>
<span class="keyword">def </span><span class="method">Text::unnormalize</span><span class="punct">(</span> <span class="ident">string</span><span class="punct">,</span> <span class="ident">doctype</span><span class="punct">=</span><span class="constant">nil</span><span class="punct">,</span> <span class="ident">filter</span><span class="punct">=</span><span class="constant">nil</span><span class="punct">,</span> <span class="ident">illegal</span><span class="punct">=</span><span class="constant">nil</span> <span class="punct">)</span>
<span class="ident">rv</span> <span class="punct">=</span> <span class="ident">string</span><span class="punct">.</span><span class="ident">clone</span>
<span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span class="punct">/</span><span class="regex"><span class="escape">\r\n</span>?</span><span class="punct">/,</span> <span class="punct">"</span><span class="string"><span class="escape">\n</span></span><span class="punct">"</span> <span class="punct">)</span>
<span class="ident">matches</span> <span class="punct">=</span> <span class="ident">rv</span><span class="punct">.</span><span class="ident">scan</span><span class="punct">(</span> <span class="constant">REFERENCE</span> <span class="punct">)</span>
<span class="keyword">return</span> <span class="ident">rv</span> <span class="keyword">if</span> <span class="ident">matches</span><span class="punct">.</span><span class="ident">size</span> <span class="punct">==</span> <span class="number">0</span>
<span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span class="constant">NUMERICENTITY</span> <span class="punct">)</span> <span class="punct">{|</span><span class="ident">m</span><span class="punct">|</span>
<span class="ident">m</span><span class="punct">=</span><span class="global">$1</span>
<span class="ident">m</span> <span class="punct">=</span> <span class="punct">"</span><span class="string">0<span class="expr">#{m}</span></span><span class="punct">"</span> <span class="keyword">if</span> <span class="ident">m</span><span class="punct">[</span><span class="number">0</span><span class="punct">]</span> <span class="punct">==</span> <span class="char">?x</span>
<span class="punct">[</span><span class="constant">Integer</span><span class="punct">(</span><span class="ident">m</span><span class="punct">)].</span><span class="ident">pack</span><span class="punct">('</span><span class="string">U*</span><span class="punct">')</span>
<span class="punct">}</span>
<span class="ident">matches</span><span class="punct">.</span><span class="ident">collect!</span><span class="punct">{|</span><span class="ident">x</span><span class="punct">|</span><span class="ident">x</span><span class="punct">[</span><span class="number">0</span><span class="punct">]}.</span><span class="ident">compact!</span>
<span class="keyword">if</span> <span class="ident">matches</span><span class="punct">.</span><span class="ident">size</span> <span class="punct">></span> <span class="number">0</span>
<span class="keyword">if</span> <span class="ident">doctype</span>
<span class="ident">matches</span><span class="punct">.</span><span class="ident">each</span> <span class="keyword">do</span> <span class="punct">|</span><span class="ident">entity_reference</span><span class="punct">|</span>
<span class="keyword">unless</span> <span class="ident">filter</span> <span class="keyword">and</span> <span class="ident">filter</span><span class="punct">.</span><span class="ident">include?</span><span class="punct">(</span><span class="ident">entity_reference</span><span class="punct">)</span>
<span class="ident">entity_value</span> <span class="punct">=</span> <span class="ident">doctype</span><span class="punct">.</span><span class="ident">entity</span><span class="punct">(</span> <span class="ident">entity_reference</span> <span class="punct">)</span>
<span class="ident">re</span> <span class="punct">=</span> <span class="punct">/</span><span class="regex">&<span class="expr">#{entity_reference}</span>;</span><span class="punct">/</span>
<span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span class="ident">re</span><span class="punct">,</span> <span class="ident">entity_value</span> <span class="punct">)</span> <span class="keyword">if</span> <span class="ident">entity_value</span>
<span class="keyword">end</span>
<span class="keyword">end</span>
<span class="keyword">else</span>
<span class="ident">matches</span><span class="punct">.</span><span class="ident">each</span> <span class="keyword">do</span> <span class="punct">|</span><span class="ident">entity_reference</span><span class="punct">|</span>
<span class="keyword">unless</span> <span class="ident">filter</span> <span class="keyword">and</span> <span class="ident">filter</span><span class="punct">.</span><span class="ident">include?</span><span class="punct">(</span><span class="ident">entity_reference</span><span class="punct">)</span>
<span class="ident">entity_value</span> <span class="punct">=</span> <span class="constant">DocType</span><span class="punct">::</span><span class="constant">DEFAULT_ENTITIES</span><span class="punct">[</span> <span class="ident">entity_reference</span> <span class="punct">]</span>
<span class="ident">re</span> <span class="punct">=</span> <span class="punct">/</span><span class="regex">&<span class="expr">#{entity_reference}</span>;</span><span class="punct">/</span>
<span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span class="ident">re</span><span class="punct">,</span> <span class="ident">entity_value</span><span class="punct">.</span><span class="ident">value</span> <span class="punct">)</span> <span class="keyword">if</span> <span class="ident">entity_value</span>
<span class="keyword">end</span>
<span class="keyword">end</span>
<span class="keyword">end</span>
<span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span class="punct">/</span><span class="regex">&amp;</span><span class="punct">/,</span> <span class="punct">'</span><span class="string">&</span><span class="punct">'</span> <span class="punct">)</span>
<span class="keyword">end</span>
<span class="ident">rv</span>
<span class="keyword">end</span></code></pre></div>
<p>Now, when you look at this, your first impression is that it just <em>screams</em> fast, right? Let’s run <code>Hash.from_xml</code> on the <a href="http://blog.nicksieger.com/files/page.xml">file I mentioned above</a>.</p>
<div class="typocode"><pre><code class="typocode_ruby "><span class="comment"># unnormalize.rb</span>
<span class="ident">require</span> <span class="punct">'</span><span class="string">rubygems</span><span class="punct">'</span>
<span class="ident">gem</span> <span class="punct">'</span><span class="string">activesupport</span><span class="punct">'</span>
<span class="ident">require</span> <span class="punct">'</span><span class="string">active_support</span><span class="punct">'</span>
<span class="constant">File</span><span class="punct">.</span><span class="ident">open</span><span class="punct">("</span><span class="string">page.xml</span><span class="punct">")</span> <span class="keyword">do</span> <span class="punct">|</span><span class="ident">f</span><span class="punct">|</span>
<span class="constant">Hash</span><span class="punct">.</span><span class="ident">from_xml</span><span class="punct">(</span><span class="ident">f</span><span class="punct">.</span><span class="ident">read</span><span class="punct">)</span>
<span class="keyword">end</span></code></pre></div>
<pre><code>$ time ruby unnormalize.rb
real 0m16.221s
user 0m14.447s
sys 0m0.346s
</code></pre>
<p>Whoa! Knock me over with a feather! It blows chunks! You mean calling <code>#gsub!</code> repeatedly in a loop with dregexps (regexp literals with interpolated strings) doesn’t go fast? It’s about twice as slow on JRuby, too:</p>
<pre><code>$ time jruby unnormalize.rb
real 0m33.637s
user 0m32.897s
sys 0m0.573s
</code></pre>
<p>All this on a paltry 393K XML file. Makes me wonder how anyone ever does any serious XML processing in Ruby.</p>
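<p>To make the cost of that loop concrete, here is a minimal sketch of its shape, not REXML's actual code. The names <code>SKETCH_ENTITIES</code>, <code>SKETCH_REFERENCE</code>, and <code>unnormalize_looped</code> are hypothetical, and the entity table and reference pattern are simplified stand-ins. It counts how many full-string <code>gsub!</code> passes are made: one per entity reference found, so the work grows with both the length of the string and the number of references in it.</p>

```ruby
# Simplified stand-ins for REXML's constants (assumptions, not the real definitions).
SKETCH_ENTITIES = { 'lt' => '<', 'gt' => '>', 'quot' => '"', 'apos' => "'" }.freeze
SKETCH_REFERENCE = /&([A-Za-z_][\w.-]*);/

# Mimics the shape of the loop in Text::unnormalize: one full-string gsub!
# per entity reference found, each with a freshly built regexp.
# Returns the unescaped string and the number of full-string passes made.
def unnormalize_looped(string)
  rv = string.dup
  passes = 0
  rv.scan(SKETCH_REFERENCE).flatten.each do |name|
    value = SKETCH_ENTITIES[name]
    next unless value
    rv.gsub!(/&#{name};/, value) # rescans the whole string every time
    passes += 1
  end
  rv.gsub!(/&amp;/, '&') # ampersands are handled last, as in the original
  [rv, passes]
end

text, passes = unnormalize_looped('&lt;p&gt;' * 1_000)
puts "#{text.length} chars after #{passes} full-string passes"
```

<p>Most of those passes find nothing left to replace, since the first pass for a given entity name already rewrote every occurrence, but each one still walks the entire string.</p>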
<p>I know, this is open source, I should be whipping up a patch for this and submitting it. Well, I did cook up a solution, but it unfortunately is only available for JRuby at the moment. (I also have much more faith in <a href="http://www.intertwingly.net/blog/2007/12/31/Porting-REXML-to-Ruby-1-9">Sam Ruby</a> than in myself to get the semantics of a rewritten <code>REXML::Text::unnormalize</code> correct.)</p>
<p><a href="http://www.nabble.com/-ANN--JREXML-0.5-t4234591.html">A while back</a> I cooked up <a href="http://caldersphere.rubyforge.org/jrexml">JREXML</a> because Regexp processing in JRuby was slow at the time, and the guts of REXML are driven by a Regexp-based parser. JREXML swaps out that regexp parser for a Java pull parser library, and at the time it provided a modest speedup.</p>
<p>So, in the context of JREXML, the solution now becomes simple, by taking advantage of the fact that Java XML parsers typically expand entities for you. The just-released JREXML 0.5.3 circumvents <code>REXML::Text::unnormalize</code> when constructing a document from the Java-based parser. And the results again don’t disappoint:</p>
<pre><code>$ time jruby unnormalize_jrexml.rb
real 0m5.802s
user 0m5.315s
sys 0m0.345s
</code></pre>
<p><em>Update:</em> At Sam’s request, I ran the numbers again with REXML trunk, which <a href="http://www.germane-software.com/projects/rexml/changeset/1289">condenses entity expansion into a single <code>gsub</code></a>. Speed is much better on MRI, but barely moved on JRuby (probably more a data point for JRuby developers than REXML developers).</p>
<pre><code>$ time ruby -I~/Projects/ruby/rexml/src unnormalize.rb
real 0m6.592s
user 0m0.845s
sys 0m0.345s
$ time jruby -I~/Projects/ruby/rexml/src unnormalize.rb
real 0m34.353s
user 0m33.023s
sys 0m0.714s
</code></pre>
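<p>For readers curious what "a single <code>gsub</code>" can look like, here is a minimal sketch of the idea. It is not the code from the REXML changeset: <code>ONE_PASS_ENTITIES</code>, <code>ONE_PASS_REFERENCE</code>, and <code>unnormalize_once</code> are hypothetical names, the pattern is a simplified assumption, and doctype-defined entities and filters are ignored. One regexp matches every reference, and the block decides the replacement for each match.</p>

```ruby
# Hypothetical, simplified entity table; real documents may define more via a doctype.
ONE_PASS_ENTITIES = {
  'lt' => '<', 'gt' => '>', 'quot' => '"', 'apos' => "'", 'amp' => '&'
}.freeze

# Matches numeric references (&#65; or &#x41;) and named references (&lt;).
ONE_PASS_REFERENCE = /&(?:#(x[0-9a-fA-F]+|\d+)|([A-Za-z_][\w.-]*));/

# Expands all references in one pass over the string.
# Unknown named entities are left untouched.
def unnormalize_once(string)
  string.gsub(ONE_PASS_REFERENCE) do |match|
    numeric = Regexp.last_match(1)
    name = Regexp.last_match(2)
    if numeric
      code = numeric.start_with?('x') ? numeric[1..-1].hex : numeric.to_i
      [code].pack('U')
    else
      ONE_PASS_ENTITIES.fetch(name, match)
    end
  end
end

puts unnormalize_once('&lt;b&gt; &amp;amp; &#65;&#x42; &unknown;')
```

<p>Because <code>gsub</code> never rescans text it has already replaced, <code>&amp;amp;amp;</code> comes out as <code>&amp;amp;</code> rather than being unescaped twice, which is why the ampersand no longer needs a separate final pass in this sketch.</p>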