Next performance fix: Builder::XChar

Posted by Nick Sieger Thu, 17 Jan 2008 23:48:00 GMT

Next up in our performance series: Builder::XChar. (Another fine Sam Ruby production!) While this piece of code in the Builder library strikes me as perfectly fine, it also tends to slow down quite a bit with larger documents or chunks of text.

Our path to the bottleneck is as follows: ActiveRecord::Base#to_xml => Builder::XMLMarkup#text! => String#to_xs => Fixnum#xchr. Consider:

require 'rubygems'
gem 'activesupport'
require 'active_support'
require 'benchmark'

module Benchmark
  class << self
    def report(&block)
      n = 10
      times = (1..10).map do
        bm = measure(&block)
        puts bm
        bm
      end
      sum = times.inject(0) {|s,t| s + t.real}
      mean = sum / n
      sumsq = times.inject(0) {|s,t| s + t.real * t.real}
      sd = Math.sqrt((sumsq - (sum * sum / n)) / (n - 1))
      puts("Mean: %0.6f SDev: %0.6f" % [mean, sd])
    end
  end
end

# http://blog.nicksieger.com/files/page.xml
page = File.open("page.xml") {|f| f.read }

Benchmark.report do
  20.times { page.to_xs }
end

On Ruby and JRuby, this produces:

$ ruby to_xs.rb 
 21.430000   0.400000  21.830000 ( 22.022769)
 21.530000   0.360000  21.890000 ( 22.005737)
 21.540000   0.370000  21.910000 ( 22.065165)
 21.530000   0.370000  21.900000 ( 22.028591)
 21.500000   0.350000  21.850000 ( 21.990395)
 21.550000   0.370000  21.920000 ( 22.033164)
 21.520000   0.360000  21.880000 ( 21.984129)
 21.550000   0.370000  21.920000 ( 22.116802)
 21.550000   0.370000  21.920000 ( 22.051421)
 21.520000   0.380000  21.900000 ( 22.084736)
Mean: 22.038291 SDev: 0.041985

$ jruby -J-server to_xs.rb
 79.112000   0.000000  79.112000 ( 79.112000)
 81.480000   0.000000  81.480000 ( 81.481000)
 84.745000   0.000000  84.745000 ( 84.745000)
 84.384000   0.000000  84.384000 ( 84.384000)
121.933000   0.000000 121.933000 (121.933000)
 85.533000   0.000000  85.533000 ( 85.532000)
 82.762000   0.000000  82.762000 ( 82.763000)
 82.090000   0.000000  82.090000 ( 82.090000)
 81.298000   0.000000  81.298000 ( 81.299000)
 80.774000   0.000000  80.774000 ( 80.773000)
Mean: 86.411200 SDev: 12.635700

(Hmm, I must have accidentally swapped in some large program in the middle of that JRuby run. The perils of benchmarking on a desktop machine. I don’t claim that the numbers are scientific, just illustrative!)

Fortunately, the fix again is very simple, and has previously been acknowledged. The latest (unreleased?) Hpricot has a new native extension, fast_xs, which is an almost drop-in replacement for the pure-ruby String#to_xs. (Almost, because it creates the method String#fast_xs instead of String#to_xs. ActiveSupport 2.0.2 and later take care of aliasing it for you). Unbeknownst to me, I ported fast_xs recently as part of upgrading JRuby extensions that have Java code in them. And so it happens to come in handy at this time. The patch for that is here.

I have the latest Hpricot gems on my server, so you can install it yourself (for either Ruby or JRuby):

gem install hpricot --source http://caldersphere.net

or

jruby -S gem install hpricot --source http://caldersphere.net

With that installed, the script now produces these results:

$ ruby to_xs.rb
  0.460000   0.080000   0.540000 (  0.537793)
  0.420000   0.070000   0.490000 (  0.501965)
  0.430000   0.070000   0.500000 (  0.501359)
  0.400000   0.070000   0.470000 (  0.484495)
  0.400000   0.070000   0.470000 (  0.479995)
  0.400000   0.070000   0.470000 (  0.469118)
  0.390000   0.070000   0.460000 (  0.468864)
  0.390000   0.070000   0.460000 (  0.465009)
  0.390000   0.060000   0.450000 (  0.452902)
  0.390000   0.070000   0.460000 (  0.466881)
Mean: 0.482838 SDev: 0.024926

$ jruby -J-server to_xs.rb 
  0.882000   0.000000   0.882000 (  0.883000)
  0.832000   0.000000   0.832000 (  0.832000)
  0.851000   0.000000   0.851000 (  0.850000)
  0.837000   0.000000   0.837000 (  0.837000)
  0.846000   0.000000   0.846000 (  0.846000)
  0.843000   0.000000   0.843000 (  0.843000)
  0.835000   0.000000   0.835000 (  0.835000)
  0.825000   0.000000   0.825000 (  0.826000)
  0.830000   0.000000   0.830000 (  0.830000)
  0.834000   0.000000   0.834000 (  0.833000)
Mean: 0.841500 SDev: 0.016379

Tags , , ,  | 3 comments

REXML a Drag...Again

Posted by Nick Sieger Thu, 17 Jan 2008 04:07:00 GMT

We’ve been here before. So here’s the scenario: You’re feeding medium-to-large chunks of XML out of one Rails app, to be consumed by another via ActiveResource. Maybe those chunks have embedded HTML, or maybe they’re an Atom feed containing several pieces of HTML with all the entities escaped. Maybe they contain entire Wikipedia pages in them. Lots of entities that need expansion when the file is parsed.

So what does ActiveResource do with this? Hash.from_xml. Which uses xml-simple. Which constructs a REXML::Document, and proceeds to navigate the entire DOM, scraping the text nodes out of it so they can be stuffed in a hash to be handed back to ActiveResource. And how does REXML expand all the entities it runs across? With this little lovely:

# Unescapes all possible entities
def Text::unnormalize( string, doctype=nil, filter=nil, illegal=nil )
  rv = string.clone
  rv.gsub!( /\r\n?/, "\n" )
  matches = rv.scan( REFERENCE )
  return rv if matches.size == 0
  rv.gsub!( NUMERICENTITY ) {|m|
    m=$1
    m = "0#{m}" if m[0] == ?x
    [Integer(m)].pack('U*')
  }
  matches.collect!{|x|x[0]}.compact!
  if matches.size > 0
    if doctype
      matches.each do |entity_reference|
        unless filter and filter.include?(entity_reference)
          entity_value = doctype.entity( entity_reference )
          re = /&#{entity_reference};/
          rv.gsub!( re, entity_value ) if entity_value
        end
      end
    else
      matches.each do |entity_reference|
        unless filter and filter.include?(entity_reference)
          entity_value = DocType::DEFAULT_ENTITIES[ entity_reference ]
          re = /&#{entity_reference};/
          rv.gsub!( re, entity_value.value ) if entity_value
        end
      end
    end
    rv.gsub!( /&amp;/, '&' )
  end
  rv
end

Now, when you look at this, your first impression is that it just screams fast, right? Let’s run Hash.from_xml on the file I mentioned above.

# unnormalize.rb
require 'rubygems'
gem 'activesupport'
require 'active_support'

File.open("page.xml") do |f|
  Hash.from_xml(f.read)
end
$ time ruby unnormalize.rb

real    0m16.221s
user    0m14.447s
sys     0m0.346s

Whoa! Knock me over with a feather! It blows chunks! You mean calling #gsub! repeatedly in a loop with dregexps (regexp literals with interpolated strings) doesn’t go fast? It’s doubly worse on JRuby, too:

$ time jruby unnormalize.rb

real    0m33.637s
user    0m32.897s
sys     0m0.573s

All this on a paltry 393K xml file. Makes me wonder how anyone ever does any serious XML processing in Ruby.

I know, this is open source, I should be whipping up a patch for this and submitting it. Well, I did cook up a solution, but it unfortunately is only available for JRuby at the moment. (I also have much more faith in Sam Ruby than myself to get the semantics of a rewritten REXML::Text::unnormalize correct.)

A while back I cooked up JREXML because Regexp processing in JRuby was slow at the time, and the guts of REXML is driven by a Regexp-based parser. JREXML swaps out that regexp parser with a Java pull parser library, and at the time it provided a modest speedup.

So, in the context of JREXML, the solution now becomes simple, by taking advantage of the fact that Java XML parsers typically expand entities for you. The just-released JREXML 0.5.3 circumvents REXML::Text::unnormalize when constructing a document from the Java-based parser. And the results again don’t disappoint:

$ time jruby unnormalize_jrexml.rb

real    0m5.802s
user    0m5.315s
sys     0m0.345s

Update: At Sam’s request, I ran the numbers again with REXML trunk, which condenses entity expansion into a single gsub. Speed is more in line for MRI, but didn’t move much for JRuby (probably more a datapoint for JRuby developers than REXML developers).

$ time ruby -I~/Projects/ruby/rexml/src unnormalize.rb 

real    0m6.592s
user    0m0.845s
sys     0m0.345s

$ time jruby -I~/Projects/ruby/rexml/src unnormalize.rb

real    0m34.353s
user    0m33.023s
sys     0m0.714s

Tags , ,  | 6 comments