REXML a Drag...Again

Posted by Nick Sieger Thu, 17 Jan 2008 04:07:00 GMT

We’ve been here before. So here’s the scenario: You’re feeding medium-to-large chunks of XML out of one Rails app, to be consumed by another via ActiveResource. Maybe those chunks have embedded HTML, or maybe they’re an Atom feed containing several pieces of HTML with all the entities escaped. Maybe they contain entire Wikipedia pages. Lots of entities that need expanding when the file is parsed.
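Concretely, picture payloads shaped something like this (a made-up miniature, purely for illustration; the real files run to a few hundred KB of it):

xml = <<-XML
<entry>
  <title>Example &amp; friends</title>
  <content type="html">
    &lt;p&gt;An entire page of HTML, with every &amp;lt; and &amp;amp; escaped...&lt;/p&gt;
  </content>
</entry>
XML

# Count the entity references that will need expanding on the consuming end
puts xml.scan(/&[a-zA-Z]+;|&#\d+;/).size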

So what does ActiveResource do with this? Hash.from_xml. Which uses xml-simple. Which constructs a REXML::Document, and proceeds to navigate the entire DOM, scraping the text nodes out of it so they can be stuffed in a hash to be handed back to ActiveResource. And how does REXML expand all the entities it runs across? With this little lovely:

# Unescapes all possible entities
def Text::unnormalize( string, doctype=nil, filter=nil, illegal=nil )
  rv = string.clone
  rv.gsub!( /\r\n?/, "\n" )
  matches = rv.scan( REFERENCE )
  return rv if matches.size == 0
  rv.gsub!( NUMERICENTITY ) {|m|
    m=$1
    m = "0#{m}" if m[0] == ?x
    [Integer(m)].pack('U*')
  }
  matches.collect!{|x|x[0]}.compact!
  if matches.size > 0
    if doctype
      matches.each do |entity_reference|
        unless filter and filter.include?(entity_reference)
          entity_value = doctype.entity( entity_reference )
          re = /&#{entity_reference};/
          rv.gsub!( re, entity_value ) if entity_value
        end
      end
    else
      matches.each do |entity_reference|
        unless filter and filter.include?(entity_reference)
          entity_value = DocType::DEFAULT_ENTITIES[ entity_reference ]
          re = /&#{entity_reference};/
          rv.gsub!( re, entity_value.value ) if entity_value
        end
      end
    end
    rv.gsub!( /&amp;/, '&' )
  end
  rv
end

Now, when you look at this, your first impression is that it just screams fast, right? Let’s run Hash.from_xml on the file I mentioned above.

# unnormalize.rb
require 'rubygems'
gem 'activesupport'
require 'active_support'

# Parse the 393K page.xml (via REXML-backed xml-simple) into a hash
File.open("page.xml") do |f|
  Hash.from_xml(f.read)
end

$ time ruby unnormalize.rb

real    0m16.221s
user    0m14.447s
sys     0m0.346s

Whoa! Knock me over with a feather! It blows chunks! You mean calling #gsub! repeatedly in a loop with dregexps (regexp literals with interpolated strings) doesn’t go fast? It’s doubly worse on JRuby, too:

$ time jruby unnormalize.rb

real    0m33.637s
user    0m32.897s
sys     0m0.573s

All this on a paltry 393K XML file. Makes me wonder how anyone ever does any serious XML processing in Ruby.
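To put a finger on the pattern that hurts (building a fresh Regexp and making a full gsub! pass over the string, once per entity), here's a rough, self-contained sketch. The entity table and input are made up and this isn't REXML's code, but it contrasts that pattern with a single gsub over a lookup table, which is essentially the shape REXML trunk later moved to:

require 'benchmark'

entities = { 'amp' => '&', 'lt' => '<', 'gt' => '>', 'quot' => '"', 'apos' => "'" }
text = '&lt;p&gt;stuff &amp; more stuff&lt;/p&gt; ' * 10_000

Benchmark.bm(12) do |bm|
  bm.report('per-entity') do
    rv = text.dup
    entities.each do |name, value|
      rv.gsub!( /&#{name};/, value )   # new Regexp plus a full pass, per entity
    end
  end
  bm.report('single gsub') do
    text.gsub( /&(amp|lt|gt|quot|apos);/ ) { entities[$1] }
  end
end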

I know, this is open source, I should be whipping up a patch for this and submitting it. Well, I did cook up a solution, but unfortunately it’s only available for JRuby at the moment. (I also have much more faith in Sam Ruby than in myself to get the semantics of a rewritten REXML::Text::unnormalize correct.)

A while back I cooked up JREXML because Regexp processing in JRuby was slow at the time, and the guts of REXML are driven by a Regexp-based parser. JREXML swaps out that regexp parser for a Java pull parser library, and at the time it provided a modest speedup.

So, in the context of JREXML, the solution now becomes simple, by taking advantage of the fact that Java XML parsers typically expand entities for you. The just-released JREXML 0.5.3 circumvents REXML::Text::unnormalize when constructing a document from the Java-based parser. And the results again don’t disappoint:

$ time jruby unnormalize_jrexml.rb

real    0m5.802s
user    0m5.315s
sys     0m0.345s
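For the curious, the underlying idea looks roughly like this from JRuby. This is a sketch of the concept only, not JREXML's actual implementation (and it assumes the same page.xml as above): let a Java pull parser (StAX here) do the tokenizing, so character data comes back with entity references already resolved and REXML::Text::unnormalize never enters the picture.

require 'java'

factory = javax.xml.stream.XMLInputFactory.newInstance
reader  = factory.createXMLStreamReader(java.io.FileReader.new("page.xml"))

while reader.hasNext
  case reader.next
  when javax.xml.stream.XMLStreamConstants::CHARACTERS
    text = reader.getText   # entities already expanded by the Java parser
    # ...hand the text to whatever is building the document or hash
  end
end
reader.close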

Update: At Sam’s request, I ran the numbers again with REXML trunk, which condenses entity expansion into a single gsub. Speed is more in line for MRI, but didn’t move much for JRuby (probably more a datapoint for JRuby developers than REXML developers).

$ time ruby -I~/Projects/ruby/rexml/src unnormalize.rb 

real    0m6.592s
user    0m0.845s
sys     0m0.345s

$ time jruby -I~/Projects/ruby/rexml/src unnormalize.rb

real    0m34.353s
user    0m33.023s
sys     0m0.714s


Comments

  1. Sam Ruby said about 6 hours later:

    I agree that plugging in a real parser is the way to go. But can I get you to try the latest REXML from SVN to see if there is a difference? In particular, I’m interested in whether or not the following change makes things better or worse in a real life scenario:

    http://www.germane-software.com/projects/rexml/changeset/1289

    The essence of the change is to do exactly ONE gsub across the entire string for all entities.

    svn co http://www.germane-software.com/repos/rexml/trunk/

  2. Nick said about 13 hours later:

    Thanks for your prompt reply and your stewardship of REXML, Sam. Updated numbers posted. Looking much better -- but using a Java parser for JRuby still looks like the best option.

    Any plans to replace REXML’s base parser with expat or similar to speed it up further?

  3. Phil said about 14 hours later:

    I think everyone who cares just uses libxml2 or Hpricot. As far as I can tell REXML is only used by people who have to stick to the standard library. Of course, the standard library badly needs Hpricot, so here’s hoping.

  4. Sam Ruby said about 16 hours later:

    Sweet! And I would bet that Ruby 1.9 is even better. And yes, I am looking into making the parser pluggable.

  5. Jeff Shadap said 1 day later:

    If you want speed, use libxml; REXML is too slow for parsing large files.

  6. Dieta said 2 days later:

    Yep. As Phil said, I use libxml2 and it’s OK.