<p>Nick Sieger: Tag xml</p>
<p>Ruby and XML not-so-simple?</p>
<p>Nick Sieger, 2006-11-02T02:12:00+00:00 (updated 2010-11-22T18:55:37+00:00)</p>
<p><em>Update: <a href="http://www.koziarski.net/">Koz</a> already <a href="http://dev.rubyonrails.org/changeset/5414">fixed the issue in trunk</a>, and the changes are also <a href="http://dev.rubyonrails.org/changeset/5415">going into the 1.2 release</a>. Thanks!</em></p>
<p>Man, I think I’ve been reading too much <a href="http://intertwingly.net/blog/2005/09/14/Last-Line-of-Defense/">Sam Ruby</a> lately (ok, that was a year ago, but not much has changed). You have to admit, though, that XML handling in Ruby is one of those things that just doesn’t feel quite right. REXML is pretty much the standard XML API for Ruby, yet in my opinion it suffers from two showstoppers:</p>
<ul>
<li><p>In Ruby 1.8.4 it still has the glaring well-formedness hole Sam mentioned last year. (No exception raised below!)</p>
<pre><code>irb(main):001:0> require 'rexml/document'
=> true
irb(main):002:0> d = REXML::Document.new '<div>at&t'
=> <UNDEFINED> ... </>
irb(main):003:0> d.root
=> <div> ... </>
irb(main):004:0> d.root.text
=> "at&t"
</code></pre></li>
<li><p>The <code>REXML::Text#to_s</code> method violates the principle of least surprise. In just about every other XML parser written, when you ask a
text node for its contents, it returns the value with entities resolved. Not so <code>Text#to_s</code>. You have to call <code>Text#value</code>
instead. Unfortunately, this would be difficult to change in future versions of REXML without breaking existing apps.</p>
<pre><code>irb(main):001:0> require 'rexml/document'
=> true
irb(main):002:0> t = REXML::Text.new('at&t')
=> "at&t"
irb(main):003:0> t.to_s
=> "at&amp;t"
irb(main):004:0> t.value
=> "at&t"
</code></pre></li>
</ul>
<p>This second problem manifests itself in subtle ways. If you’re calling <code>Element#text</code> (probably the most common way to read text), you’re fine, because it implicitly does <code>self.texts.first.value</code> under the hood. But if you want to make sure you’re grabbing all the text content, you might be inclined to write <code>element.texts.join('')</code> to concatenate the nodes. That bypasses <code>value</code>: <code>Array#join</code> stringifies each node with <code>to_s</code>, leaving you with unresolved entities.</p>
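<p>Here is a minimal sketch of that pitfall. The sample document and the variable names are made up for illustration, and the behavior shown assumes a REXML version where <code>Text#to_s</code> still returns the escaped form; the only REXML calls used are <code>Element#texts</code>, <code>Text#to_s</code> (via <code>Array#join</code>) and <code>Text#value</code>.</p>

```ruby
require 'rexml/document'

# Hypothetical sample: two text nodes around a child element, both with entities.
doc = REXML::Document.new('<p>at&amp;t <b>wireless</b> &amp; more</p>')
element = doc.root

# Array#join stringifies each REXML::Text with to_s, so entities stay escaped.
joined_with_to_s = element.texts.join('')

# Mapping through Text#value first resolves the entities before joining.
joined_with_value = element.texts.map(&:value).join('')

puts joined_with_to_s
puts joined_with_value
```

<p>Mapping through <code>value</code> before joining is one way to get the resolved text for every text node, not just the first one.</p>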
<p>It turns out this problem is exhibited in the version of <a href="http://xml-simple.rubyforge.org/">XmlSimple</a> now included with <a href="http://dev.rubyonrails.org/browser/trunk/activesupport/lib/active_support/vendor/xml_simple.rb?rev=4453&format=txt">Edge Rails as of rev 4453</a>. So if
you’re living on the edge using the newly minted ActiveResource fetching XML from remote resources like a champion, you just got benched
as soon as you tried to fetch XML that had normalized entities inside.</p>
<p><a href="http://rubyforge.org/frs/download.php/11388/xml-simple-1.0.9.gem">XmlSimple version 1.0.9</a> has a partial fix for this issue, but I submitted <a href="/files/test_xmlsimple.rb">another patch</a> to <a href="http://www.maik-schmidt.de/">Maik Schmidt</a> for review that he subsequently released as 1.0.10. I’ve attached the 1.0.10 version to ticket <a href="http://dev.rubyonrails.org/ticket/6532">6532</a> in hopes that it will be patched in Rails soon.</p>
<p>REXML a Drag...Again</p>
<p>Nick Sieger, 2008-01-17T04:07:00+00:00 (updated 2010-11-22T19:00:53+00:00)</p>
<p><a href="http://blog.nicksieger.com/articles/2006/11/02/ruby-and-xml-not-so-simple">We’ve been here before.</a> So here’s the scenario: You’re feeding medium-to-large chunks of XML out of one Rails app, to be consumed by another via <a href="http://ryandaigle.com/articles/2006/06/30/whats-new-in-edge-rails-activeresource-is-here" title="Ryan's Scraps: What's New in Edge Rails: The Ins and Outs of ActiveResource">ActiveResource</a>. Maybe those chunks have embedded HTML, or maybe they’re an Atom feed containing several pieces of HTML with all the entities escaped. Maybe <a href="http://blog.nicksieger.com/files/page.xml">they contain entire Wikipedia pages in them</a>. Lots of entities that need expansion when the file is parsed.</p>
<p>So what does ActiveResource do with this? <code>Hash.from_xml</code>. Which uses <a href="http://xml-simple.rubyforge.org/" title="XmlSimple - XML made easy">xml-simple</a>. Which constructs a <code>REXML::Document</code>, and proceeds to navigate the entire DOM, scraping the text nodes out of it so they can be stuffed in a hash to be handed back to ActiveResource. And how does REXML expand all the entities it runs across? With this little lovely:</p>
<div class="typocode"><pre><code class="typocode_ruby "><span class="comment"># Unescapes all possible entities</span>
<span class="keyword">def </span><span class="method">Text::unnormalize</span><span class="punct">(</span> <span class="ident">string</span><span class="punct">,</span> <span class="ident">doctype</span><span class="punct">=</span><span class="constant">nil</span><span class="punct">,</span> <span class="ident">filter</span><span class="punct">=</span><span class="constant">nil</span><span class="punct">,</span> <span class="ident">illegal</span><span class="punct">=</span><span class="constant">nil</span> <span class="punct">)</span>
  <span class="ident">rv</span> <span class="punct">=</span> <span class="ident">string</span><span class="punct">.</span><span class="ident">clone</span>
  <span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span class="punct">/</span><span class="regex"><span class="escape">\r\n</span>?</span><span class="punct">/,</span> <span class="punct">"</span><span class="string"><span class="escape">\n</span></span><span class="punct">"</span> <span class="punct">)</span>
  <span class="ident">matches</span> <span class="punct">=</span> <span class="ident">rv</span><span class="punct">.</span><span class="ident">scan</span><span class="punct">(</span> <span class="constant">REFERENCE</span> <span class="punct">)</span>
  <span class="keyword">return</span> <span class="ident">rv</span> <span class="keyword">if</span> <span class="ident">matches</span><span class="punct">.</span><span class="ident">size</span> <span class="punct">==</span> <span class="number">0</span>
  <span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span class="constant">NUMERICENTITY</span> <span class="punct">)</span> <span class="punct">{|</span><span class="ident">m</span><span class="punct">|</span>
    <span class="ident">m</span><span class="punct">=</span><span class="global">$1</span>
    <span class="ident">m</span> <span class="punct">=</span> <span class="punct">"</span><span class="string">0<span class="expr">#{m}</span></span><span class="punct">"</span> <span class="keyword">if</span> <span class="ident">m</span><span class="punct">[</span><span class="number">0</span><span class="punct">]</span> <span class="punct">==</span> <span class="char">?x</span>
    <span class="punct">[</span><span class="constant">Integer</span><span class="punct">(</span><span class="ident">m</span><span class="punct">)].</span><span class="ident">pack</span><span class="punct">('</span><span class="string">U*</span><span class="punct">')</span>
  <span class="punct">}</span>
  <span class="ident">matches</span><span class="punct">.</span><span class="ident">collect!</span><span class="punct">{|</span><span class="ident">x</span><span class="punct">|</span><span class="ident">x</span><span class="punct">[</span><span class="number">0</span><span class="punct">]}.</span><span class="ident">compact!</span>
  <span class="keyword">if</span> <span class="ident">matches</span><span class="punct">.</span><span class="ident">size</span> <span class="punct">></span> <span class="number">0</span>
    <span class="keyword">if</span> <span class="ident">doctype</span>
      <span class="ident">matches</span><span class="punct">.</span><span class="ident">each</span> <span class="keyword">do</span> <span class="punct">|</span><span class="ident">entity_reference</span><span class="punct">|</span>
        <span class="keyword">unless</span> <span class="ident">filter</span> <span class="keyword">and</span> <span class="ident">filter</span><span class="punct">.</span><span class="ident">include?</span><span class="punct">(</span><span class="ident">entity_reference</span><span class="punct">)</span>
          <span class="ident">entity_value</span> <span class="punct">=</span> <span class="ident">doctype</span><span class="punct">.</span><span class="ident">entity</span><span class="punct">(</span> <span class="ident">entity_reference</span> <span class="punct">)</span>
          <span class="ident">re</span> <span class="punct">=</span> <span class="punct">/</span><span class="regex">&<span class="expr">#{entity_reference}</span>;</span><span class="punct">/</span>
          <span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span class="ident">re</span><span class="punct">,</span> <span class="ident">entity_value</span> <span class="punct">)</span> <span class="keyword">if</span> <span class="ident">entity_value</span>
        <span class="keyword">end</span>
      <span class="keyword">end</span>
    <span class="keyword">else</span>
      <span class="ident">matches</span><span class="punct">.</span><span class="ident">each</span> <span class="keyword">do</span> <span class="punct">|</span><span class="ident">entity_reference</span><span class="punct">|</span>
        <span class="keyword">unless</span> <span class="ident">filter</span> <span class="keyword">and</span> <span class="ident">filter</span><span class="punct">.</span><span class="ident">include?</span><span class="punct">(</span><span class="ident">entity_reference</span><span class="punct">)</span>
          <span class="ident">entity_value</span> <span class="punct">=</span> <span class="constant">DocType</span><span class="punct">::</span><span class="constant">DEFAULT_ENTITIES</span><span class="punct">[</span> <span class="ident">entity_reference</span> <span class="punct">]</span>
          <span class="ident">re</span> <span class="punct">=</span> <span class="punct">/</span><span class="regex">&<span class="expr">#{entity_reference}</span>;</span><span class="punct">/</span>
          <span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span class="ident">re</span><span class="punct">,</span> <span class="ident">entity_value</span><span class="punct">.</span><span class="ident">value</span> <span class="punct">)</span> <span class="keyword">if</span> <span class="ident">entity_value</span>
        <span class="keyword">end</span>
      <span class="keyword">end</span>
    <span class="keyword">end</span>
    <span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span class="punct">/</span><span class="regex">&amp;</span><span class="punct">/,</span> <span class="punct">'</span><span class="string">&</span><span class="punct">'</span> <span class="punct">)</span>
  <span class="keyword">end</span>
  <span class="ident">rv</span>
<span class="keyword">end</span></code></pre></div>
<p>Now, when you look at this, your first impression is that it just <em>screams</em> fast, right? Let’s run <code>Hash.from_xml</code> on the <a href="http://blog.nicksieger.com/files/page.xml">file I mentioned above</a>.</p>
<div class="typocode"><pre><code class="typocode_ruby "><span class="comment"># unnormalize.rb</span>
<span class="ident">require</span> <span class="punct">'</span><span class="string">rubygems</span><span class="punct">'</span>
<span class="ident">gem</span> <span class="punct">'</span><span class="string">activesupport</span><span class="punct">'</span>
<span class="ident">require</span> <span class="punct">'</span><span class="string">active_support</span><span class="punct">'</span>
<span class="constant">File</span><span class="punct">.</span><span class="ident">open</span><span class="punct">("</span><span class="string">page.xml</span><span class="punct">")</span> <span class="keyword">do</span> <span class="punct">|</span><span class="ident">f</span><span class="punct">|</span>
  <span class="constant">Hash</span><span class="punct">.</span><span class="ident">from_xml</span><span class="punct">(</span><span class="ident">f</span><span class="punct">.</span><span class="ident">read</span><span class="punct">)</span>
<span class="keyword">end</span></code></pre></div>
<pre><code>$ time ruby unnormalize.rb
real 0m16.221s
user 0m14.447s
sys 0m0.346s
</code></pre>
<p>Whoa! Knock me over with a feather! It blows chunks! You mean calling <code>#gsub!</code> repeatedly in a loop with dregexps (regexp literals with interpolated strings) doesn’t go fast? It’s twice as slow on JRuby, too:</p>
<pre><code>$ time jruby unnormalize.rb
real 0m33.637s
user 0m32.897s
sys 0m0.573s
</code></pre>
<p>All this on a paltry 393K XML file. Makes me wonder how anyone ever does any serious XML processing in Ruby.</p>
<p>I know, this is open source, I should be whipping up a patch for this and submitting it. Well, I did cook up a solution, but it unfortunately is only available for JRuby at the moment. (I also have much more faith in <a href="http://www.intertwingly.net/blog/2007/12/31/Porting-REXML-to-Ruby-1-9">Sam Ruby</a> than myself to get the semantics of a rewritten <code>REXML::Text::unnormalize</code> correct.)</p>
<p><a href="http://www.nabble.com/-ANN--JREXML-0.5-t4234591.html">A while back</a> I cooked up <a href="http://caldersphere.rubyforge.org/jrexml">JREXML</a> because Regexp processing in JRuby was slow at the time, and the guts of REXML are driven by a Regexp-based parser. JREXML swaps out that regexp parser for a Java pull parser library, and at the time it provided a modest speedup.</p>
<p>With JREXML in place, the solution becomes simple: take advantage of the fact that Java XML parsers typically expand entities for you. The just-released JREXML 0.5.3 circumvents <code>REXML::Text::unnormalize</code> when constructing a document from the Java-based parser. And the results again don’t disappoint:</p>
<pre><code>$ time jruby unnormalize_jrexml.rb
real 0m5.802s
user 0m5.315s
sys 0m0.345s
</code></pre>
<p><em>Update:</em> At Sam’s request, I ran the numbers again with REXML trunk, which <a href="http://www.germane-software.com/projects/rexml/changeset/1289">condenses entity expansion into a single <code>gsub</code></a>. The time on MRI is now much more reasonable, but it didn’t move much on JRuby (probably more a datapoint for JRuby developers than REXML developers).</p>
<pre><code>$ time ruby -I~/Projects/ruby/rexml/src unnormalize.rb
real 0m6.592s
user 0m0.845s
sys 0m0.345s
$ time jruby -I~/Projects/ruby/rexml/src unnormalize.rb
real 0m34.353s
user 0m33.023s
sys 0m0.714s
</code></pre>
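<p>To make the difference concrete, here is a rough sketch of the two strategies in plain Ruby. It is not REXML’s code: <code>ENTITIES</code>, <code>REFERENCE</code>, <code>expand_looped</code> and <code>expand_single_pass</code> are hypothetical names invented for this example, and the entity table covers only the five predefined XML entities. The looped version mimics the shape of the method quoted earlier (one <code>gsub!</code> with a freshly built regexp per reference found), while the single-pass version looks every reference up inside one <code>gsub</code> block.</p>

```ruby
# Hypothetical entity table for this sketch; REXML keeps its own defaults
# in DocType::DEFAULT_ENTITIES.
ENTITIES = { 'lt' => '<', 'gt' => '>', 'quot' => '"', 'apos' => "'", 'amp' => '&' }.freeze

# Matches decimal (&#169;), hex (&#xA9;), and named (&amp;) references.
REFERENCE = /&(?:#(\d+)|#x(\h+)|(\w+));/

# Looped strategy: rescans the whole string once per reference it found.
def expand_looped(string)
  rv = string.dup
  rv.gsub!(/&#(\d+);/) { [$1.to_i].pack('U') }
  rv.gsub!(/&#x(\h+);/) { [$1.hex].pack('U') }
  names = rv.scan(/&(\w+);/).flatten
  names.each do |name|
    next if name == 'amp'            # ampersands are expanded last
    value = ENTITIES[name]
    rv.gsub!(/&#{name};/, value) if value
  end
  rv.gsub!(/&amp;/, '&')
  rv
end

# Single-pass strategy: one scan, each reference resolved in the block.
def expand_single_pass(string)
  string.gsub(REFERENCE) do
    if $1
      [$1.to_i].pack('U')
    elsif $2
      [$2.hex].pack('U')
    else
      ENTITIES[$3] || $&             # leave unknown entities untouched
    end
  end
end

sample = '&lt;p&gt;AT&amp;T &#169; &quot;quoted&quot;&lt;/p&gt;'
puts expand_looped(sample)
puts expand_single_pass(sample)
```

<p>The two versions agree on ordinary input like the sample, but they are not guaranteed to agree on every edge case (for example, a numeric reference that expands to an ampersand directly in front of an entity name), which is exactly the kind of semantic detail that makes rewriting <code>unnormalize</code> delicate.</p>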