RubyConf: I18n, M17n, Unicode, and all that

Nick Sieger — Sun, 22 Oct 2006 00:06:00 +0000

Tim Bray is going to talk about characters and strings. He will gladly talk about text until your arm falls off, and buy half the beer.

Introduction

English is no longer the majority language on the web. It’s nonsensical to ignore i18n issues with new apps.

This is probably a bug:

    /[a-zA-Z]+/
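A minimal sketch of why that pattern is suspect, assuming a Ruby with UTF-8 strings and Unicode property classes (1.9 or later, which postdates this talk). The helper names and the sample word are hypothetical, chosen only for illustration.

```ruby
# Hypothetical helpers for illustration only.

# Returns the first run of ASCII letters, as the pattern above would.
def first_ascii_word(text)
  text[/[a-zA-Z]+/]
end

# Returns the first run of letters from any script.
def first_unicode_word(text)
  text[/\p{L}+/]
end

word = "na\u00EFve"               # "naïve"
puts first_ascii_word(word)       # na
puts first_unicode_word(word)     # naïve
```

The ASCII class silently truncates the word at the first accented letter, which is the kind of bug that only shows up once non-English text arrives.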

Problems to solve to help us with i18n

Identifying characters
Byte-character mapping and storage
A good string API

References

The World's Writing Systems
Character model for the web (W3C)
The Unicode 5.0 Standard (forthcoming book) (same as ISO 10646)

Unicode

Characters identified by numbers called code points (room for more than 1,000,000)
17 planes, each with 64k code points
Original characters (available in any computer anywhere before Unicode was invented) in Basic Multilingual Plane (BMP, first plane)
Characters written as U+ followed by hex digits (four for the BMP, up to six beyond it)
Unicode character database
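The arithmetic behind the planes and the U+ notation can be sketched in Ruby. The constant and method names here are assumptions for illustration, not part of any library, and the sample characters are arbitrary.

```ruby
PLANE_SIZE  = 0x10000   # 65,536 code points per plane
PLANE_COUNT = 17

# Formats a character's code point in U+ notation (at least four hex digits).
def u_notation(char)
  format("U+%04X", char.ord)
end

# Returns the plane index of a character; plane 0 is the BMP.
def plane_of(char)
  char.ord / PLANE_SIZE
end

puts PLANE_SIZE * PLANE_COUNT     # 1114112
puts u_notation("\u00E9")         # U+00E9  (é, in the BMP)
puts u_notation("\u{1D11E}")      # U+1D11E (musical G clef)
puts plane_of("\u{1D11E}")        # 1
```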

Benefits

Repertoire
Room for growth -- lots of space left in the middle planes
Private use
Sane process --
Character database
Ubiquitous standards and tools

Difficulties

Combining forms -- need to normalize the characters for comparison (1/2 vs. ½) and this is not something you want to do in your String#== method
Awkward historical compromises with other encodings
Han unification (note: Tim considers the Wikipedia article to be biased) -- characters that might mean something different were given one code point by Asian linguistic specialists.
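To make the combining-forms problem concrete, here is a small sketch using `String#unicode_normalize`, which exists in Ruby 2.2 and later (so it postdates this talk). The `same_text?` helper is hypothetical; it shows normalization as an explicit step rather than something hidden inside `String#==`.

```ruby
composed   = "\u00E9"    # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Hypothetical helper: compare two strings after normalizing both to NFC.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

puts composed == decomposed             # false
puts same_text?(composed, decomposed)   # true
```

Both strings render identically, yet plain equality compares code points and reports them as different.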

Storage

Official: UTF-8, UTF-16, UTF-32
Practical: ASCII, EBCDIC, Shift-JIS, Big5, EUC-JP, EUC-KR, MS code pages, ISO-8859-*, etc.

But with video the largest bandwidth eater, does text size really matter?
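For a rough sense of the size trade-off between the official encodings, a sketch along these lines can help. The `encoded_sizes` helper is hypothetical, and the big-endian variants are used only so that no byte order mark is counted.

```ruby
# Hypothetical helper: byte size of the same text in each official encoding.
def encoded_sizes(text)
  %w[UTF-8 UTF-16BE UTF-32BE].map { |enc| [enc, text.encode(enc).bytesize] }.to_h
end

p encoded_sizes("hello")
# {"UTF-8"=>5, "UTF-16BE"=>10, "UTF-32BE"=>20}

p encoded_sizes("\u65E5\u672C\u8A9E")   # "日本語"
# {"UTF-8"=>9, "UTF-16BE"=>6, "UTF-32BE"=>12}
```

ASCII-heavy text is smallest in UTF-8, while this Japanese sample is smaller in UTF-16.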

Identification

How to identify what text is coming in over the wire?

Guess -- browsers, Python library
Charset headers, which are known to be wrong
Trust -- two partners agree in advance for a pattern of exchange
XML [this last one was quite obvious, right?]
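The "guess" option can be sketched very crudely in Ruby. This is not how browsers or the Python library do it; it only checks whether the bytes are well-formed UTF-8 and otherwise falls back to a default the caller chooses. The `guess_encoding` name and the ISO-8859-1 fallback are assumptions for illustration.

```ruby
# Hypothetical helper: label raw bytes as UTF-8 if they validate,
# otherwise return a caller-supplied fallback encoding.
def guess_encoding(bytes, fallback = Encoding::ISO_8859_1)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? Encoding::UTF_8 : fallback
end

utf8_bytes   = "caf\xC3\xA9".b   # é encoded as UTF-8
latin1_bytes = "caf\xE9".b       # é encoded as ISO-8859-1

puts guess_encoding(utf8_bytes)     # UTF-8
puts guess_encoding(latin1_bytes)   # ISO-8859-1
```

A guess like this can still be wrong: many legacy byte sequences are also valid UTF-8, which is why prior agreement or self-describing formats such as XML are safer.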

Language approaches

Java -- design flaw; characters are UTF-16, which is unfortunate. Implementation is sound and well tested, but Java is clunky.
Perl 5 has excellent support in theory, but practically speaking, it’s difficult to round-trip text through a DB without some breakage.
Python has byte arrays and strings, with some string-like methods on byte arrays for binary or text data, but it glosses over the issue of platform file encoding.

Ruby

Some core string methods have i18n problems due to counting, regexp, equality and whitespace concerns.
String#each_char seems to be a missing method; string class maybe should be aware of its encoding.
Behavior of String#[] seems ok for byte buffers but it probably doesn’t need to be efficient for characters. Most of the use-cases for String iteration should be for characters (exception: Expat)
Case-changing methods -- avoid them at all costs in a mixed language environment!
Regexps need Unicode properties for safer matching (\p{L} for letters, \p{N} for numbers)
Does Ruby need a Character class or a Charset class?
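For comparison, here is how the byte/character distinction looks in Ruby versions released after this talk (1.9 and later), where strings carry an encoding and `String#each_char` exists. The `char_report` helper and the sample text are hypothetical.

```ruby
# Hypothetical helper: summarize a string as bytes versus characters.
def char_report(text)
  {
    encoding: text.encoding.name,
    bytes:    text.bytesize,
    chars:    text.length,
    letters:  text.scan(/\p{L}/).length,   # letters from any script
    digits:   text.scan(/\p{N}/).length    # numeric characters
  }
end

text = "h\u00E9llo 42"   # "héllo 42"
p char_report(text)
# {:encoding=>"UTF-8", :bytes=>9, :chars=>8, :letters=>5, :digits=>2}

# Iterating by character rather than by byte:
text.each_char { |c| print c, " " }
puts
```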

What is next for Ruby? [Tune your divining rods toward ruby-talk for the rest of the story!] Matz has m17n; Julik, Manfred and crew now have ActiveSupport::Multibyte in Rails; JRuby is built on a platform that already has a Unicode string, so the discussion is heating up.

Update: slides available here.

Q & A

Q. What if I have a stream of bytes with no knowledge of encoding? Don’t try to impose an encoding lens above the level of a string, programmers want to treat a string as a string with associated methods.

Q. What if I need to change case of text? Get used to the fact that it won’t work reliably. What about characterizing the finite number of languages? Java does that, and it’s still not really possible. Shouldn’t the string class know the encoding? Couldn’t it optimize better? Couldn’t you raise an exception? Just don’t do it! Isn’t there a body of knowledge that could be acquired about case? Hmm, next question that isn’t about case!
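Two small illustrations of why case changes are unreliable, assuming Ruby 2.4 or later (where `upcase` and `downcase` handle non-ASCII text and accept a `:turkic` option). The `round_trips?` helper is hypothetical.

```ruby
# Hypothetical helper: does upcasing then downcasing give back the original?
def round_trips?(text)
  text.upcase.downcase == text
end

german = "stra\u00DFe"             # "straße"
puts german.upcase                 # STRASSE
puts round_trips?(german)          # false ("strasse" comes back)

# The right answer depends on the language, not just the character:
puts "i".upcase                    # I
puts "i".upcase(:turkic)           # İ (U+0130, dotted capital I)
```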

Q. Is there a resource for edge cases of processing text in XML? Search for “xml test cases”. The decision is between ignoring the metadata provided, and choosing not to process it.

Q. What is the Python library that can guess the charset? It’s in the feedvalidator suite.
