Nick Sieger: REXML a Drag...Again http://blog.nicksieger.com/articles/2008/01/17/rexml-a-drag-again en-us 40 "REXML a Drag...Again" by Dieta <p>Yep. As Phil said I use libxml2 and it&#8217;s ok.</p> Sat, 19 Jan 2008 10:30:27 +0000 urn:uuid:207fd193-6f52-42bf-a45a-77259896efa7 http://blog.nicksieger.com/articles/2008/01/17/rexml-a-drag-again#comment-370 "REXML a Drag...Again" by Jeff Shadap <p>If you want speed use libxml, REXML is too slow for parsing large files</p> Fri, 18 Jan 2008 07:22:57 +0000 urn:uuid:776411c1-5a55-4179-96bb-688c55a19cdb http://blog.nicksieger.com/articles/2008/01/17/rexml-a-drag-again#comment-369 "REXML a Drag...Again" by Sam Ruby <p>Sweet! And I would bet that Ruby 1.9 is even better. And yes, I am looking into making the parser pluggable.</p> Thu, 17 Jan 2008 20:20:02 +0000 urn:uuid:9ff80076-b5af-4ebe-a045-0f576a0c0afc http://blog.nicksieger.com/articles/2008/01/17/rexml-a-drag-again#comment-365 "REXML a Drag...Again" by Phil <p>I think everyone who cares just uses libxml2 or Hpricot. As far as I can tell REXML is only used by people who have to stick to the standard library. Of course, the standard library badly needs Hpricot, so here&#8217;s hoping.</p> Thu, 17 Jan 2008 18:36:16 +0000 urn:uuid:b5bb3a71-1f06-4efc-b789-74b50753aeaa http://blog.nicksieger.com/articles/2008/01/17/rexml-a-drag-again#comment-364 "REXML a Drag...Again" by Nick <p>Thanks for your prompt reply and stewardship for REXML, Sam. Updated numbers posted. Looking much better &#8211; but using a java parser for JRuby still looks like the best option.</p> <p>Any plans to replace REXML&#8217;s base parser with expat or similar to speed it up further?</p> Thu, 17 Jan 2008 17:07:06 +0000 urn:uuid:01fdc687-f081-43f2-af7f-9099727f6d5b http://blog.nicksieger.com/articles/2008/01/17/rexml-a-drag-again#comment-363 "REXML a Drag...Again" by Sam Ruby <p>I agree that plugging in a real parser is the way to go. 
But can I get you to try the latest REXML from SVN to see if there is a difference? In particular, I&#8217;m interested in whether or not the following change makes things better or worse in a real life scenario:</p> <p><a href='http://www.germane-software.com/projects/rexml/changeset/1289' rel="nofollow">http://www.germane-software.com/projects/rexml/changeset/1289</a></p> <p>The essence of the change is to do exactly <em>ONE</em> gsub across the entire string for all entities.</p> <p>svn co <a href='http://www.germane-software.com/repos/rexml/trunk/' rel="nofollow">http://www.germane-software.com/repos/rexml/trunk/</a></p> Thu, 17 Jan 2008 09:45:00 +0000 urn:uuid:c9e59832-9df1-4cb6-8860-8bbc9dfc3d72 http://blog.nicksieger.com/articles/2008/01/17/rexml-a-drag-again#comment-362 REXML a Drag...Again <p><a href="http://blog.nicksieger.com/articles/2006/11/02/ruby-and-xml-not-so-simple">We&#8217;ve been here before.</a> So here&#8217;s the scenario: You&#8217;re feeding medium-to-large chunks of XML out of one Rails app, to be consumed by another via <a href="http://ryandaigle.com/articles/2006/06/30/whats-new-in-edge-rails-activeresource-is-here" title="Ryan's Scraps: What's New in Edge Rails: The Ins and Outs of ActiveResource">ActiveResource</a>. Maybe those chunks have embedded HTML, or maybe they&#8217;re an Atom feed containing several pieces of HTML with all the entities escaped. Maybe <a href="http://blog.nicksieger.com/files/page.xml">they contain entire Wikipedia pages in them</a>. Lots of entities that need expansion when the file is parsed.</p> <p>So what does ActiveResource do with this? <code>Hash.from_xml</code>. Which uses <a href="http://xml-simple.rubyforge.org/" title="XmlSimple - XML made easy">xml-simple</a>. Which constructs a <code>REXML::Document</code>, and proceeds to navigate the entire DOM, scraping the text nodes out of it so they can be stuffed in a hash to be handed back to ActiveResource. 
And how does REXML expand all the entities it runs across? With this little lovely:</p> <div class="typocode"><pre><code class="typocode_ruby "><span class="comment"># Unescapes all possible entities</span> <span class="keyword">def </span><span class="method">Text::unnormalize</span><span class="punct">(</span> <span class="ident">string</span><span class="punct">,</span> <span class="ident">doctype</span><span class="punct">=</span><span class="constant">nil</span><span class="punct">,</span> <span class="ident">filter</span><span class="punct">=</span><span class="constant">nil</span><span class="punct">,</span> <span class="ident">illegal</span><span class="punct">=</span><span class="constant">nil</span> <span class="punct">)</span> <span class="ident">rv</span> <span class="punct">=</span> <span class="ident">string</span><span class="punct">.</span><span class="ident">clone</span> <span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span class="punct">/</span><span class="regex"><span class="escape">\r\n</span>?</span><span class="punct">/,</span> <span class="punct">&quot;</span><span class="string"><span class="escape">\n</span></span><span class="punct">&quot;</span> <span class="punct">)</span> <span class="ident">matches</span> <span class="punct">=</span> <span class="ident">rv</span><span class="punct">.</span><span class="ident">scan</span><span class="punct">(</span> <span class="constant">REFERENCE</span> <span class="punct">)</span> <span class="keyword">return</span> <span class="ident">rv</span> <span class="keyword">if</span> <span class="ident">matches</span><span class="punct">.</span><span class="ident">size</span> <span class="punct">==</span> <span class="number">0</span> <span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span class="constant">NUMERICENTITY</span> <span class="punct">)</span> <span 
class="punct">{|</span><span class="ident">m</span><span class="punct">|</span> <span class="ident">m</span><span class="punct">=</span><span class="global">$1</span> <span class="ident">m</span> <span class="punct">=</span> <span class="punct">&quot;</span><span class="string">0<span class="expr">#{m}</span></span><span class="punct">&quot;</span> <span class="keyword">if</span> <span class="ident">m</span><span class="punct">[</span><span class="number">0</span><span class="punct">]</span> <span class="punct">==</span> <span class="char">?x</span> <span class="punct">[</span><span class="constant">Integer</span><span class="punct">(</span><span class="ident">m</span><span class="punct">)].</span><span class="ident">pack</span><span class="punct">('</span><span class="string">U*</span><span class="punct">')</span> <span class="punct">}</span> <span class="ident">matches</span><span class="punct">.</span><span class="ident">collect!</span><span class="punct">{|</span><span class="ident">x</span><span class="punct">|</span><span class="ident">x</span><span class="punct">[</span><span class="number">0</span><span class="punct">]}.</span><span class="ident">compact!</span> <span class="keyword">if</span> <span class="ident">matches</span><span class="punct">.</span><span class="ident">size</span> <span class="punct">&gt;</span> <span class="number">0</span> <span class="keyword">if</span> <span class="ident">doctype</span> <span class="ident">matches</span><span class="punct">.</span><span class="ident">each</span> <span class="keyword">do</span> <span class="punct">|</span><span class="ident">entity_reference</span><span class="punct">|</span> <span class="keyword">unless</span> <span class="ident">filter</span> <span class="keyword">and</span> <span class="ident">filter</span><span class="punct">.</span><span class="ident">include?</span><span class="punct">(</span><span class="ident">entity_reference</span><span class="punct">)</span> <span 
class="ident">entity_value</span> <span class="punct">=</span> <span class="ident">doctype</span><span class="punct">.</span><span class="ident">entity</span><span class="punct">(</span> <span class="ident">entity_reference</span> <span class="punct">)</span> <span class="ident">re</span> <span class="punct">=</span> <span class="punct">/</span><span class="regex">&amp;<span class="expr">#{entity_reference}</span>;</span><span class="punct">/</span> <span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span class="ident">re</span><span class="punct">,</span> <span class="ident">entity_value</span> <span class="punct">)</span> <span class="keyword">if</span> <span class="ident">entity_value</span> <span class="keyword">end</span> <span class="keyword">end</span> <span class="keyword">else</span> <span class="ident">matches</span><span class="punct">.</span><span class="ident">each</span> <span class="keyword">do</span> <span class="punct">|</span><span class="ident">entity_reference</span><span class="punct">|</span> <span class="keyword">unless</span> <span class="ident">filter</span> <span class="keyword">and</span> <span class="ident">filter</span><span class="punct">.</span><span class="ident">include?</span><span class="punct">(</span><span class="ident">entity_reference</span><span class="punct">)</span> <span class="ident">entity_value</span> <span class="punct">=</span> <span class="constant">DocType</span><span class="punct">::</span><span class="constant">DEFAULT_ENTITIES</span><span class="punct">[</span> <span class="ident">entity_reference</span> <span class="punct">]</span> <span class="ident">re</span> <span class="punct">=</span> <span class="punct">/</span><span class="regex">&amp;<span class="expr">#{entity_reference}</span>;</span><span class="punct">/</span> <span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span 
class="ident">re</span><span class="punct">,</span> <span class="ident">entity_value</span><span class="punct">.</span><span class="ident">value</span> <span class="punct">)</span> <span class="keyword">if</span> <span class="ident">entity_value</span> <span class="keyword">end</span> <span class="keyword">end</span> <span class="keyword">end</span> <span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span class="punct">/</span><span class="regex">&amp;amp;</span><span class="punct">/,</span> <span class="punct">'</span><span class="string">&amp;</span><span class="punct">'</span> <span class="punct">)</span> <span class="keyword">end</span> <span class="ident">rv</span> <span class="keyword">end</span></code></pre></div> <p>Now, when you look at this, your first impression is that it just <em>screams</em> fast, right? Let&#8217;s run <code>Hash.from_xml</code> on the <a href="http://blog.nicksieger.com/files/page.xml">file I mentioned above</a>.</p> <div class="typocode"><pre><code class="typocode_ruby "><span class="comment"># unnormalize.rb</span> <span class="ident">require</span> <span class="punct">'</span><span class="string">rubygems</span><span class="punct">'</span> <span class="ident">gem</span> <span class="punct">'</span><span class="string">activesupport</span><span class="punct">'</span> <span class="ident">require</span> <span class="punct">'</span><span class="string">active_support</span><span class="punct">'</span> <span class="constant">File</span><span class="punct">.</span><span class="ident">open</span><span class="punct">(&quot;</span><span class="string">page.xml</span><span class="punct">&quot;)</span> <span class="keyword">do</span> <span class="punct">|</span><span class="ident">f</span><span class="punct">|</span> <span class="constant">Hash</span><span class="punct">.</span><span class="ident">from_xml</span><span class="punct">(</span><span class="ident">f</span><span 
class="punct">.</span><span class="ident">read</span><span class="punct">)</span> <span class="keyword">end</span></code></pre></div> <pre><code>$ time ruby unnormalize.rb real 0m16.221s user 0m14.447s sys 0m0.346s </code></pre> <p>Whoa! Knock me over with a feather! It blows chunks! You mean calling <code>#gsub!</code> repeatedly in a loop with dregexps (regexp literals with interpolated strings) doesn&#8217;t go fast? It&#8217;s twice as slow on JRuby, too:</p> <pre><code>$ time jruby unnormalize.rb real 0m33.637s user 0m32.897s sys 0m0.573s </code></pre> <p>All this on a paltry 393K XML file. Makes me wonder how anyone ever does any serious XML processing in Ruby.</p> <p>I know, this is open source, I should be whipping up a patch for this and submitting it. Well, I did cook up a solution, but it unfortunately is only available for JRuby at the moment. (I also have much more faith in <a href="http://www.intertwingly.net/blog/2007/12/31/Porting-REXML-to-Ruby-1-9">Sam Ruby</a> than in myself to get the semantics of a rewritten <code>REXML::Text::unnormalize</code> correct.)</p> <p><a href="http://www.nabble.com/-ANN--JREXML-0.5-t4234591.html">A while back</a> I cooked up <a href="http://caldersphere.rubyforge.org/jrexml">JREXML</a> because Regexp processing in JRuby was slow at the time, and the guts of REXML are driven by a Regexp-based parser. JREXML swaps out that regexp parser for a Java pull parser library, and at the time it provided a modest speedup.</p> <p>So, in the context of JREXML, the solution becomes simple: take advantage of the fact that Java XML parsers typically expand entities for you. The just-released JREXML 0.5.3 circumvents <code>REXML::Text::unnormalize</code> when constructing a document from the Java-based parser. 
And the results again don&#8217;t disappoint:</p> <pre><code>$ time jruby unnormalize_jrexml.rb real 0m5.802s user 0m5.315s sys 0m0.345s </code></pre> <p><em>Update:</em> At Sam&#8217;s request, I ran the numbers again with REXML trunk, which <a href="http://www.germane-software.com/projects/rexml/changeset/1289">condenses entity expansion into a single <code>gsub</code></a>. Speed is more in line for MRI, but didn&#8217;t move much for JRuby (probably more a datapoint for JRuby developers than REXML developers).</p> <pre><code>$ time ruby -I~/Projects/ruby/rexml/src unnormalize.rb real 0m6.592s user 0m0.845s sys 0m0.345s $ time jruby -I~/Projects/ruby/rexml/src unnormalize.rb real 0m34.353s user 0m33.023s sys 0m0.714s </code></pre> Thu, 17 Jan 2008 04:07:00 +0000 urn:uuid:0d137df7-5bca-451d-97f6-34af4d5a36b7 Nick Sieger http://blog.nicksieger.com/articles/2008/01/17/rexml-a-drag-again ruby xml jruby http://blog.nicksieger.com/articles/trackback/361
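For readers wondering what "exactly one gsub across the entire string for all entities" might look like, here is a minimal sketch. It is not the code from REXML changeset 1289. The method name `unnormalize_once`, the `REFERENCE` pattern, and the `DEFAULT_ENTITIES` table below are all made up for illustration, and doctype-declared entities and filters are ignored entirely.

```ruby
# Hypothetical sketch of single-pass entity expansion.
# This is NOT REXML's implementation; all names here are assumptions.

# The five entities predefined by XML.
DEFAULT_ENTITIES = {
  'lt'   => '<',
  'gt'   => '>',
  'quot' => '"',
  'apos' => "'",
  'amp'  => '&'
}.freeze

# One pattern covering hex (&#x26;), decimal (&#38;) and named (&amp;) references.
REFERENCE = /&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z_][\w.-]*));/

def unnormalize_once(string, entities = DEFAULT_ENTITIES)
  # Normalize line endings first, then expand every reference in one scan.
  string.gsub(/\r\n?/, "\n").gsub(REFERENCE) do |ref|
    if $1
      [$1.hex].pack('U')        # hexadecimal character reference
    elsif $2
      [$2.to_i].pack('U')       # decimal character reference
    else
      entities.fetch($3, ref)   # named entity; unknown names are left as-is
    end
  end
end

puts unnormalize_once("&lt;p&gt;Tom &amp;amp; Jerry&lt;/p&gt;")
```

The point of the design is that the string is scanned once no matter how many distinct entities it contains, instead of compiling and running a fresh interpolated regexp per entity. Because replaced text is never rescanned, `&amp;amp;` comes out as `&amp;` rather than being expanded twice.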