Nick Sieger: Tag xml http://blog.nicksieger.com/articles/tag/xml?tag=xml en-us 40 Ruby and XML not-so-simple?
<p><em>Update: <a href="http://www.koziarski.net/">Koz</a> already <a href="http://dev.rubyonrails.org/changeset/5414">fixed the issue in trunk</a>, and the changes are also <a href="http://dev.rubyonrails.org/changeset/5415">going into the 1.2 release</a>. Thanks!</em></p>
<p>Man, I think I&#8217;ve been reading too much <a href="http://intertwingly.net/blog/2005/09/14/Last-Line-of-Defense/">Sam Ruby</a> lately (OK, that was a year ago, but not much has changed). You have to admit, though, that XML handling in Ruby is one of those things that just doesn&#8217;t feel quite right. REXML is pretty much the standard XML API for Ruby, yet it suffers from two showstoppers in my opinion:</p>
<ul>
<li><p>In Ruby 1.8.4 it still has the glaring well-formedness hole Sam mentioned last year. (No exception raised below!)</p>
<pre><code>irb(main):001:0&gt; require 'rexml/document'
=&gt; true
irb(main):002:0&gt; d = REXML::Document.new '&lt;div&gt;at&amp;t'
=&gt; &lt;UNDEFINED&gt; ... &lt;/&gt;
irb(main):003:0&gt; d.root
=&gt; &lt;div&gt; ... &lt;/&gt;
irb(main):004:0&gt; d.root.text
=&gt; "at&amp;t"
</code></pre></li>
<li><p>The <code>REXML::Text#to_s</code> method violates the principle of least surprise. In just about every other XML parser written, when you ask a text node for its contents, it returns the value with entities resolved. Not so <code>Text#to_s</code>. You have to call <code>Text#value</code> instead. Unfortunately, this would be difficult to reverse in future versions of REXML without breaking existing apps.</p>
<pre><code>irb(main):001:0&gt; require 'rexml/document'
=&gt; true
irb(main):002:0&gt; t = REXML::Text.new('at&amp;t')
=&gt; "at&amp;t"
irb(main):003:0&gt; t.to_s
=&gt; "at&amp;amp;t"
irb(main):004:0&gt; t.value
=&gt; "at&amp;t"
</code></pre></li>
</ul>
<p>This second problem manifests itself in subtle ways. 
If you&#8217;re calling <code>Element#text</code> (probably the most common way to read text), you&#8217;re fine, because it implicitly does <code>self.texts.first.value</code> under the hood. But if you want to make sure you&#8217;re grabbing all the text content, you might be inclined to write <code>element.texts.join('')</code> to concatenate the text nodes. That approach bypasses <code>value</code>: <code>join</code> calls <code>to_s</code> on each node, leaving you with unresolved entities.</p>
<p>It turns out that the version of <a href="http://xml-simple.rubyforge.org/">XmlSimple</a> now included with <a href="http://dev.rubyonrails.org/browser/trunk/activesupport/lib/active_support/vendor/xml_simple.rb?rev=4453&amp;format=txt">Edge Rails as of rev 4453</a> exhibits this problem. So if you&#8217;re living on the edge, using the newly minted ActiveResource to fetch XML from remote resources like a champion, you got benched as soon as you tried to fetch XML that had normalized (escaped) entities inside.</p>
<p><a href="http://rubyforge.org/frs/download.php/11388/xml-simple-1.0.9.gem">XmlSimple version 1.0.9</a> has a partial fix for this issue, so I submitted <a href="/files/test_xmlsimple.rb">another patch</a> to <a href="http://www.maik-schmidt.de/">Maik Schmidt</a> for review, which he subsequently released as 1.0.10. 
I&#8217;ve attached the 1.0.10 version to ticket <a href="http://dev.rubyonrails.org/ticket/6532">6532</a> in hopes that it will be patched in Rails soon.</p> Thu, 02 Nov 2006 02:12:00 +0000 urn:uuid:d060e313-e46a-49dc-9093-fcb0f2ae7783 Nick Sieger http://blog.nicksieger.com/articles/2006/11/02/ruby-and-xml-not-so-simple ruby rails ruby xml http://blog.nicksieger.com/articles/trackback/142 REXML a Drag...Again <p><a href="http://blog.nicksieger.com/articles/2006/11/02/ruby-and-xml-not-so-simple">We&#8217;ve been here before.</a> So here&#8217;s the scenario: You&#8217;re feeding medium-to-large chunks of XML out of one Rails app, to be consumed by another via <a href="http://ryandaigle.com/articles/2006/06/30/whats-new-in-edge-rails-activeresource-is-here" title="Ryan's Scraps: What's New in Edge Rails: The Ins and Outs of ActiveResource">ActiveResource</a>. Maybe those chunks have embedded HTML, or maybe they&#8217;re an Atom feed containing several pieces of HTML with all the entities escaped. Maybe <a href="http://blog.nicksieger.com/files/page.xml">they contain entire Wikipedia pages in them</a>. Lots of entities that need expansion when the file is parsed.</p> <p>So what does ActiveResource do with this? <code>Hash.from_xml</code>. Which uses <a href="http://xml-simple.rubyforge.org/" title="XmlSimple - XML made easy">xml-simple</a>. Which constructs a <code>REXML::Document</code>, and proceeds to navigate the entire DOM, scraping the text nodes out of it so they can be stuffed in a hash to be handed back to ActiveResource. And how does REXML expand all the entities it runs across? 
With this little lovely:</p>
<div class="typocode"><pre><code class="typocode_ruby "><span class="comment"># Unescapes all possible entities</span>
<span class="keyword">def </span><span class="method">Text::unnormalize</span><span class="punct">(</span> <span class="ident">string</span><span class="punct">,</span> <span class="ident">doctype</span><span class="punct">=</span><span class="constant">nil</span><span class="punct">,</span> <span class="ident">filter</span><span class="punct">=</span><span class="constant">nil</span><span class="punct">,</span> <span class="ident">illegal</span><span class="punct">=</span><span class="constant">nil</span> <span class="punct">)</span>
  <span class="ident">rv</span> <span class="punct">=</span> <span class="ident">string</span><span class="punct">.</span><span class="ident">clone</span>
  <span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span class="punct">/</span><span class="regex"><span class="escape">\r\n</span>?</span><span class="punct">/,</span> <span class="punct">&quot;</span><span class="string"><span class="escape">\n</span></span><span class="punct">&quot;</span> <span class="punct">)</span>
  <span class="ident">matches</span> <span class="punct">=</span> <span class="ident">rv</span><span class="punct">.</span><span class="ident">scan</span><span class="punct">(</span> <span class="constant">REFERENCE</span> <span class="punct">)</span>
  <span class="keyword">return</span> <span class="ident">rv</span> <span class="keyword">if</span> <span class="ident">matches</span><span class="punct">.</span><span class="ident">size</span> <span class="punct">==</span> <span class="number">0</span>
  <span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span class="constant">NUMERICENTITY</span> <span class="punct">)</span> <span class="punct">{|</span><span class="ident">m</span><span class="punct">|</span>
    <span class="ident">m</span><span class="punct">=</span><span class="global">$1</span>
    <span class="ident">m</span> <span class="punct">=</span> <span class="punct">&quot;</span><span class="string">0<span class="expr">#{m}</span></span><span class="punct">&quot;</span> <span class="keyword">if</span> <span class="ident">m</span><span class="punct">[</span><span class="number">0</span><span class="punct">]</span> <span class="punct">==</span> <span class="char">?x</span>
    <span class="punct">[</span><span class="constant">Integer</span><span class="punct">(</span><span class="ident">m</span><span class="punct">)].</span><span class="ident">pack</span><span class="punct">('</span><span class="string">U*</span><span class="punct">')</span>
  <span class="punct">}</span>
  <span class="ident">matches</span><span class="punct">.</span><span class="ident">collect!</span><span class="punct">{|</span><span class="ident">x</span><span class="punct">|</span><span class="ident">x</span><span class="punct">[</span><span class="number">0</span><span class="punct">]}.</span><span class="ident">compact!</span>
  <span class="keyword">if</span> <span class="ident">matches</span><span class="punct">.</span><span class="ident">size</span> <span class="punct">&gt;</span> <span class="number">0</span>
    <span class="keyword">if</span> <span class="ident">doctype</span>
      <span class="ident">matches</span><span class="punct">.</span><span class="ident">each</span> <span class="keyword">do</span> <span class="punct">|</span><span class="ident">entity_reference</span><span class="punct">|</span>
        <span class="keyword">unless</span> <span class="ident">filter</span> <span class="keyword">and</span> <span class="ident">filter</span><span class="punct">.</span><span class="ident">include?</span><span class="punct">(</span><span class="ident">entity_reference</span><span class="punct">)</span>
          <span class="ident">entity_value</span> <span class="punct">=</span> <span class="ident">doctype</span><span class="punct">.</span><span class="ident">entity</span><span class="punct">(</span> <span class="ident">entity_reference</span> <span class="punct">)</span>
          <span class="ident">re</span> <span class="punct">=</span> <span class="punct">/</span><span class="regex">&amp;<span class="expr">#{entity_reference}</span>;</span><span class="punct">/</span>
          <span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span class="ident">re</span><span class="punct">,</span> <span class="ident">entity_value</span> <span class="punct">)</span> <span class="keyword">if</span> <span class="ident">entity_value</span>
        <span class="keyword">end</span>
      <span class="keyword">end</span>
    <span class="keyword">else</span>
      <span class="ident">matches</span><span class="punct">.</span><span class="ident">each</span> <span class="keyword">do</span> <span class="punct">|</span><span class="ident">entity_reference</span><span class="punct">|</span>
        <span class="keyword">unless</span> <span class="ident">filter</span> <span class="keyword">and</span> <span class="ident">filter</span><span class="punct">.</span><span class="ident">include?</span><span class="punct">(</span><span class="ident">entity_reference</span><span class="punct">)</span>
          <span class="ident">entity_value</span> <span class="punct">=</span> <span class="constant">DocType</span><span class="punct">::</span><span class="constant">DEFAULT_ENTITIES</span><span class="punct">[</span> <span class="ident">entity_reference</span> <span class="punct">]</span>
          <span class="ident">re</span> <span class="punct">=</span> <span class="punct">/</span><span class="regex">&amp;<span class="expr">#{entity_reference}</span>;</span><span class="punct">/</span>
          <span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span class="ident">re</span><span class="punct">,</span> <span class="ident">entity_value</span><span class="punct">.</span><span class="ident">value</span> <span class="punct">)</span> <span class="keyword">if</span> <span class="ident">entity_value</span>
        <span class="keyword">end</span>
      <span class="keyword">end</span>
    <span class="keyword">end</span>
    <span class="ident">rv</span><span class="punct">.</span><span class="ident">gsub!</span><span class="punct">(</span> <span class="punct">/</span><span class="regex">&amp;amp;</span><span class="punct">/,</span> <span class="punct">'</span><span class="string">&amp;</span><span class="punct">'</span> <span class="punct">)</span>
  <span class="keyword">end</span>
  <span class="ident">rv</span>
<span class="keyword">end</span></code></pre></div>
<p>Now, when you look at this, your first impression is that it just <em>screams</em> fast, right? Let&#8217;s run <code>Hash.from_xml</code> on the <a href="http://blog.nicksieger.com/files/page.xml">file I mentioned above</a>.</p>
<div class="typocode"><pre><code class="typocode_ruby "><span class="comment"># unnormalize.rb</span>
<span class="ident">require</span> <span class="punct">'</span><span class="string">rubygems</span><span class="punct">'</span>
<span class="ident">gem</span> <span class="punct">'</span><span class="string">activesupport</span><span class="punct">'</span>
<span class="ident">require</span> <span class="punct">'</span><span class="string">active_support</span><span class="punct">'</span>
<span class="constant">File</span><span class="punct">.</span><span class="ident">open</span><span class="punct">(&quot;</span><span class="string">page.xml</span><span class="punct">&quot;)</span> <span class="keyword">do</span> <span class="punct">|</span><span class="ident">f</span><span class="punct">|</span>
  <span class="constant">Hash</span><span class="punct">.</span><span class="ident">from_xml</span><span class="punct">(</span><span class="ident">f</span><span class="punct">.</span><span class="ident">read</span><span 
class="punct">)</span>
<span class="keyword">end</span></code></pre></div>
<pre><code>$ time ruby unnormalize.rb
real 0m16.221s
user 0m14.447s
sys 0m0.346s
</code></pre>
<p>Whoa! Knock me over with a feather! It blows chunks! You mean calling <code>#gsub!</code> repeatedly in a loop with dregexps (regexp literals with interpolated strings) doesn&#8217;t go fast? It&#8217;s twice as bad on JRuby, too:</p>
<pre><code>$ time jruby unnormalize.rb
real 0m33.637s
user 0m32.897s
sys 0m0.573s
</code></pre>
<p>All this on a paltry 393K XML file. Makes me wonder how anyone ever does any serious XML processing in Ruby.</p>
<p>I know, this is open source, I should be whipping up a patch for this and submitting it. Well, I did cook up a solution, but unfortunately it is only available for JRuby at the moment. (I also have much more faith in <a href="http://www.intertwingly.net/blog/2007/12/31/Porting-REXML-to-Ruby-1-9">Sam Ruby</a> than in myself to get the semantics of a rewritten <code>REXML::Text::unnormalize</code> correct.)</p>
<p><a href="http://www.nabble.com/-ANN--JREXML-0.5-t4234591.html">A while back</a> I cooked up <a href="http://caldersphere.rubyforge.org/jrexml">JREXML</a> because Regexp processing in JRuby was slow at the time, and the guts of REXML are driven by a Regexp-based parser. JREXML swaps out that regexp parser for a Java pull parser library, and at the time it provided a modest speedup.</p>
<p>So, in the context of JREXML, the solution becomes simple: take advantage of the fact that Java XML parsers typically expand entities for you. The just-released JREXML 0.5.3 circumvents <code>REXML::Text::unnormalize</code> when constructing a document from the Java-based parser. 
And the results again don&#8217;t disappoint:</p>
<pre><code>$ time jruby unnormalize_jrexml.rb
real 0m5.802s
user 0m5.315s
sys 0m0.345s
</code></pre>
<p><em>Update:</em> At Sam&#8217;s request, I ran the numbers again with REXML trunk, which <a href="http://www.germane-software.com/projects/rexml/changeset/1289">condenses entity expansion into a single <code>gsub</code></a>. Speed is more in line for MRI, but didn&#8217;t move much for JRuby (probably more a datapoint for JRuby developers than REXML developers).</p>
<pre><code>$ time ruby -I~/Projects/ruby/rexml/src unnormalize.rb
real 0m6.592s
user 0m0.845s
sys 0m0.345s

$ time jruby -I~/Projects/ruby/rexml/src unnormalize.rb
real 0m34.353s
user 0m33.023s
sys 0m0.714s
</code></pre>
Thu, 17 Jan 2008 04:07:00 +0000 urn:uuid:0d137df7-5bca-451d-97f6-34af4d5a36b7 Nick Sieger http://blog.nicksieger.com/articles/2008/01/17/rexml-a-drag-again ruby xml jruby http://blog.nicksieger.com/articles/trackback/361
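<p>For the curious, here is a rough sketch of what "a single <code>gsub</code>" can look like. To be clear, this is not the code from the REXML changeset, and it ignores doctype-declared entities and the <code>filter</code> argument; the names <code>SKETCH_ENTITIES</code>, <code>SKETCH_REFERENCE</code> and <code>unnormalize_once</code> are made up for illustration. The idea is simply to match every reference with one regexp and look the replacement up in a hash, instead of building a fresh regexp and rescanning the whole string for each entity.</p>

```ruby
# Hypothetical sketch, NOT REXML's actual implementation.
# Only the five predefined XML entities are handled; a real version
# would also consult the doctype for declared entities.
SKETCH_ENTITIES = {
  'lt' => '<', 'gt' => '>', 'amp' => '&', 'quot' => '"', 'apos' => "'"
}.freeze

# One pattern for all three reference forms:
# hex (&#x26;), decimal (&#38;) and named (&amp;).
SKETCH_REFERENCE = /&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z_:][\w.:-]*));/

# Expands every reference in a single pass over the string.
# Unknown named references are left untouched.
def unnormalize_once(string, entities = SKETCH_ENTITIES)
  string.gsub(SKETCH_REFERENCE) do |match|
    if $1
      [$1.hex].pack('U')        # hexadecimal character reference
    elsif $2
      [$2.to_i].pack('U')       # decimal character reference
    else
      entities.fetch($3, match) # named entity, or the original text
    end
  end
end

puts unnormalize_once('at&amp;t')    # at&t
puts unnormalize_once('&amp;lt;')    # &lt; (expanded once, not twice)
```

<p>A side effect of the single pass is that replaced text is never rescanned, so there is no need to save <code>&amp;amp;</code> for last the way the looped version does.</p>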