Ruby and XML not-so-simple?

2006-11-02T02:12:00+00:00

Update: Koz already fixed the issue in trunk, and the changes are also going into the 1.2 release as well. Thanks!

Man, I think I’ve been reading too much Sam Ruby lately (ok, that was a year ago, but not much has changed). You have to admit, though, that XML handling in Ruby is one of those things that just doesn’t feel quite right. REXML is pretty much the standard API for Ruby, yet it suffers from two showstoppers in my opinion:

In Ruby 1.8.4 it still has the glaring hole Sam mentioned last year with well-formedness. (No exception raised below!)

irb(main):001:0> require 'rexml/document'
=> true
irb(main):002:0> d = REXML::Document.new '<div>at&t'
=> <UNDEFINED> ... </>
irb(main):003:0> d.root
=> <div> ... </>
irb(main):004:0> d.root.text
=> "at&t"

The REXML::Text#to_s method violates the principle of least surprise. In just about every other XML parser written, when you ask a text node for its contents, it returns you the value with entities resolved. Not so Text#to_s. You have to call Text#value instead. Unfortunately, this would be difficult to reverse in future versions of REXML without breaking existing apps.
```
irb(main):001:0> require 'rexml/document'
=> true
irb(main):002:0> t = REXML::Text.new('at&t')
=> "at&t"
irb(main):003:0> t.to_s
=> "at&amp;t"
irb(main):004:0> t.value
=> "at&t"
```

This second problem manifests itself in subtle ways. If you’re calling Element#text (which is probably the most common way), you’re fine, because it implicitly does self.texts.first.value under the hood. But if you want to make sure you’re grabbing all the text content, you might be inclined to write element.texts.join('') to concatenate them together. But this method bypasses the value method and instead uses to_s, leaving you with unresolved entities.

It turns out this problem is exhibited in the version of XmlSimple now included with Edge Rails as of rev 4453. So if you’re living on the edge using the newly minted ActiveResource fetching XML from remote resources like a champion, you just got benched as soon as you tried to fetch XML that had normalized entities inside.

XmlSimple version 1.0.9 has a partial fix for this issue, but I submitted another patch to Maik Schmidt for review that he subsequently released as 1.0.10. I’ve attached the 1.0.10 version to ticket 6532 in hopes that it will be patched in Rails soon.

REXML a Drag...Again

2008-01-17T04:07:00+00:00

We’ve been here before. So here’s the scenario: You’re feeding medium-to-large chunks of XML out of one Rails app, to be consumed by another via ActiveResource. Maybe those chunks have embedded HTML, or maybe they’re an Atom feed containing several pieces of HTML with all the entities escaped. Maybe they contain entire Wikipedia pages in them. Lots of entities that need expansion when the file is parsed.

So what does ActiveResource do with this? Hash.from_xml. Which uses xml-simple. Which constructs a REXML::Document, and proceeds to navigate the entire DOM, scraping the text nodes out of it so they can be stuffed in a hash to be handed back to ActiveResource. And how does REXML expand all the entities it runs across? With this little lovely:

# Unescapes all possible entities
def Text::unnormalize( string, doctype=nil, filter=nil, illegal=nil )
  rv = string.clone
  rv.gsub!( /\r\n?/, "\n" )
  matches = rv.scan( REFERENCE )
  return rv if matches.size == 0
  rv.gsub!( NUMERICENTITY ) {|m|
    m=$1
    m = "0#{m}" if m[0] == ?x
    [Integer(m)].pack('U*')
  }
  matches.collect!{|x|x[0]}.compact!
  if matches.size > 0
    if doctype
      matches.each do |entity_reference|
        unless filter and filter.include?(entity_reference)
          entity_value = doctype.entity( entity_reference )
          re = /&#{entity_reference};/
          rv.gsub!( re, entity_value ) if entity_value
        end
      end
    else
      matches.each do |entity_reference|
        unless filter and filter.include?(entity_reference)
          entity_value = DocType::DEFAULT_ENTITIES[ entity_reference ]
          re = /&#{entity_reference};/
          rv.gsub!( re, entity_value.value ) if entity_value
        end
      end
    end
    rv.gsub!( /&amp;/, '&' )
  end
  rv
end

Now, when you look at this, your first impression is that it just screams fast, right? Let’s run Hash.from_xml on the file I mentioned above.

# unnormalize.rb
require 'rubygems'
gem 'activesupport'
require 'active_support'

File.open("page.xml") do |f|
  Hash.from_xml(f.read)
end

$ time ruby unnormalize.rb

real    0m16.221s
user    0m14.447s
sys     0m0.346s

Whoa! Knock me over with a feather! It blows chunks! You mean calling #gsub! repeatedly in a loop with dregexps (regexp literals with interpolated strings) doesn’t go fast? It’s doubly worse on JRuby, too:

$ time jruby unnormalize.rb

real    0m33.637s
user    0m32.897s
sys     0m0.573s

All this on a paltry 393K xml file. Makes me wonder how anyone ever does any serious XML processing in Ruby.

I know, this is open source, I should be whipping up a patch for this and submitting it. Well, I did cook up a solution, but it unfortunately is only available for JRuby at the moment. (I also have much more faith in Sam Ruby than myself to get the semantics of a rewritten REXML::Text::unnormalize correct.)

A while back I cooked up JREXML because Regexp processing in JRuby was slow at the time, and the guts of REXML is driven by a Regexp-based parser. JREXML swaps out that regexp parser with a Java pull parser library, and at the time it provided a modest speedup.

So, in the context of JREXML, the solution now becomes simple, by taking advantage of the fact that Java XML parsers typically expand entities for you. The just-released JREXML 0.5.3 circumvents REXML::Text::unnormalize when constructing a document from the Java-based parser. And the results again don’t disappoint:

$ time jruby unnormalize_jrexml.rb

real    0m5.802s
user    0m5.315s
sys     0m0.345s

Update: At Sam’s request, I ran the numbers again with REXML trunk, which condenses entity expansion into a single gsub. Speed is more in line for MRI, but didn’t move much for JRuby (probably more a datapoint for JRuby developers than REXML developers).

$ time ruby -I~/Projects/ruby/rexml/src unnormalize.rb 

real    0m6.592s
user    0m0.845s
sys     0m0.345s

$ time jruby -I~/Projects/ruby/rexml/src unnormalize.rb

real    0m34.353s
user    0m33.023s
sys     0m0.714s

Nick Sieger: Tag xml

Ruby and XML not-so-simple?

REXML a Drag...Again