REXML a Drag...Again
Posted by Nick Sieger Thu, 17 Jan 2008 04:07:00 GMT
We’ve been here before. So here’s the scenario: You’re feeding medium-to-large chunks of XML out of one Rails app, to be consumed by another via ActiveResource. Maybe those chunks have embedded HTML, or maybe they’re an Atom feed containing several pieces of HTML with all the entities escaped. Maybe they contain entire Wikipedia pages. Either way, there are lots of entities that need expansion when the file is parsed.
So what does ActiveResource do with this? Hash.from_xml. Which uses xml-simple. Which constructs a REXML::Document, and proceeds to navigate the entire DOM, scraping the text nodes out of it so they can be stuffed in a hash to be handed back to ActiveResource. And how does REXML expand all the entities it runs across? With this little lovely:
# Unescapes all possible entities
def Text::unnormalize( string, doctype=nil, filter=nil, illegal=nil )
  rv = string.clone
  rv.gsub!( /\r\n?/, "\n" )
  matches = rv.scan( REFERENCE )
  return rv if matches.size == 0
  rv.gsub!( NUMERICENTITY ) {|m|
    m=$1
    m = "0#{m}" if m[0] == ?x
    [Integer(m)].pack('U*')
  }
  matches.collect!{|x|x[0]}.compact!
  if matches.size > 0
    if doctype
      matches.each do |entity_reference|
        unless filter and filter.include?(entity_reference)
          entity_value = doctype.entity( entity_reference )
          re = /&#{entity_reference};/
          rv.gsub!( re, entity_value ) if entity_value
        end
      end
    else
      matches.each do |entity_reference|
        unless filter and filter.include?(entity_reference)
          entity_value = DocType::DEFAULT_ENTITIES[ entity_reference ]
          re = /&#{entity_reference};/
          rv.gsub!( re, entity_value.value ) if entity_value
        end
      end
    end
    rv.gsub!( /&amp;/, '&' )
  end
  rv
end
Now, when you look at this, your first impression is that it just screams fast, right? Let’s run Hash.from_xml on the file I mentioned above.
# unnormalize.rb
require 'rubygems'
gem 'activesupport'
require 'active_support'
File.open("page.xml") do |f|
  Hash.from_xml(f.read)
end
$ time ruby unnormalize.rb
real 0m16.221s
user 0m14.447s
sys 0m0.346s
Whoa! Knock me over with a feather! It blows chunks! You mean calling #gsub! repeatedly in a loop with dregexps (regexp literals with interpolated strings) doesn’t go fast? It’s twice as slow on JRuby, too:
$ time jruby unnormalize.rb
real 0m33.637s
user 0m32.897s
sys 0m0.573s
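To see why the loop hurts, note that the list of matches is never de-duplicated, so every single entity reference in the text triggers its own full-string gsub!. Here is a rough sketch of that pattern that just counts the passes. The REFERENCE regexp and DEFAULT_ENTITIES table below are simplified stand-ins I made up for illustration, not REXML's real definitions, and looped_unescape is a hypothetical helper, not REXML code.

```ruby
# Simplified stand-ins for REXML's constants (assumptions, not the real definitions).
REFERENCE = /&([A-Za-z_][\w.-]*);/
DEFAULT_ENTITIES = { "lt" => "<", "gt" => ">", "quot" => '"', "apos" => "'" }

# Hypothetical helper mimicking the shape of the non-doctype branch above:
# one gsub! over the whole string per entity reference found.
def looped_unescape(string)
  rv = string.dup
  passes = 0
  # Duplicates are kept on purpose, as in the original method.
  names = rv.scan(REFERENCE).map { |m| m[0] }
  names.each do |name|
    value = DEFAULT_ENTITIES[name]
    next unless value
    rv.gsub!(/&#{name};/, value) # interpolated regexp, rescans the entire string
    passes += 1
  end
  rv.gsub!(/&amp;/, "&")
  [rv, passes]
end

text = "&lt;p&gt;hi&lt;/p&gt;" * 1000 # about 21 KB of escaped HTML
result, passes = looped_unescape(text)
puts passes # 4000 full-string passes, most of which find nothing left to replace
```

The work grows with (number of references) x (string length), which is why a few hundred KB of escaped HTML is enough to make it crawl.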
All this on a paltry 393K XML file. Makes me wonder how anyone ever does any serious XML processing in Ruby.
I know, this is open source, I should be whipping up a patch for this and submitting it. Well, I did cook up a solution, but it unfortunately is only available for JRuby at the moment. (I also have much more faith in Sam Ruby than myself to get the semantics of a rewritten REXML::Text::unnormalize correct.)
A while back I cooked up JREXML because Regexp processing in JRuby was slow at the time, and the guts of REXML are driven by a Regexp-based parser. JREXML swaps out that regexp parser with a Java pull parser library, and at the time it provided a modest speedup.
So, in the context of JREXML, the solution now becomes simple, by taking advantage of the fact that Java XML parsers typically expand entities for you. The just-released JREXML 0.5.3 circumvents REXML::Text::unnormalize when constructing a document from the Java-based parser. And the results again don’t disappoint:
$ time jruby unnormalize_jrexml.rb
real 0m5.802s
user 0m5.315s
sys 0m0.345s
Update: At Sam’s request, I ran the numbers again with REXML trunk, which condenses entity expansion into a single gsub. Speed is more in line for MRI, but didn’t move much for JRuby (probably more a datapoint for JRuby developers than REXML developers).
$ time ruby -I~/Projects/ruby/rexml/src unnormalize.rb
real 0m6.592s
user 0m0.845s
sys 0m0.345s
$ time jruby -I~/Projects/ruby/rexml/src unnormalize.rb
real 0m34.353s
user 0m33.023s
sys 0m0.714s
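For the curious, the single-gsub idea can be sketched roughly like this. This is my own illustration of the technique, not the code from REXML trunk; ENTITY_TABLE, ENTITY_OR_NUMERIC and single_pass_unescape are names invented for the example, and it ignores doctype-defined entities and filters entirely.

```ruby
# Hypothetical lookup table for the predefined XML entities.
ENTITY_TABLE = {
  "lt" => "<", "gt" => ">", "quot" => '"', "apos" => "'", "amp" => "&"
}.freeze

# One regexp covering hex references, decimal references, and named entities.
ENTITY_OR_NUMERIC = /&(?:#x([0-9a-fA-F]+)|#(\d+)|([A-Za-z_][\w.-]*));/

# Expands every reference in a single pass over the string.
def single_pass_unescape(string)
  string.gsub(ENTITY_OR_NUMERIC) do |whole|
    hex, dec, name = $1, $2, $3
    if hex
      [hex.to_i(16)].pack("U")
    elsif dec
      [dec.to_i].pack("U")
    else
      ENTITY_TABLE.fetch(name, whole) # unknown entities are left untouched
    end
  end
end

puts single_pass_unescape("&lt;a href=&quot;x&quot;&gt;&#65;&#x42;&lt;/a&gt;")
```

Because each replacement is never rescanned, text such as &amp;lt; comes out as &lt; rather than being unescaped twice, and the cost stays proportional to the length of the string.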
I agree that plugging in a real parser is the way to go. But can I get you to try the latest REXML from SVN to see if there is a difference? In particular, I’m interested in whether or not the following change makes things better or worse in a real life scenario:
http://www.germane-software.com/projects/rexml/changeset/1289
The essence of the change is to do exactly ONE gsub across the entire string for all entities.
svn co http://www.germane-software.com/repos/rexml/trunk/
Thanks for your prompt reply and stewardship for REXML, Sam. Updated numbers posted. Looking much better -- but using a java parser for JRuby still looks like the best option.
Any plans to replace REXML’s base parser with expat or similar to speed it up further?
I think everyone who cares just uses libxml2 or Hpricot. As far as I can tell REXML is only used by people who have to stick to the standard library. Of course, the standard library badly needs Hpricot, so here’s hoping.
Sweet! And I would bet that Ruby 1.9 is even better. And yes, I am looking into making the parser pluggable.
If you want speed, use libxml. REXML is too slow for parsing large files.
Yep. As Phil said I use libxml2 and it’s ok.