RubyConf: I18n, M17n, Unicode, and all that
Posted by Nick Sieger Sun, 22 Oct 2006 00:06:00 GMT
Tim Bray is going to talk about characters and strings. He will gladly talk about text until your arm falls off, and buy half the beer.
Introduction
English is no longer the majority language on the web. It’s nonsensical to ignore i18n issues with new apps.
This is probably a bug:
/[a-zA-Z]+/
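To see why that pattern is suspect, here is a minimal Ruby sketch. The constant and method names are made up for illustration, and it assumes a Ruby with UTF-8 strings and Unicode property support in regexps (1.9 or later).

```ruby
# Sample words: Zurich, Zürich, naïve, 東京 (written with escapes to stay ASCII-safe).
WORDS = ["Zurich", "Z\u00FCrich", "na\u00EFve", "\u6771\u4EAC"].freeze

ASCII_ONLY = /\A[a-zA-Z]+\z/ # the pattern flagged as a probable bug
ANY_LETTER = /\A\p{L}+\z/    # Unicode "Letter" property instead of an ASCII range

# True when the whole word consists only of characters the pattern accepts.
def whole_word?(pattern, word)
  !(pattern =~ word).nil?
end

# Words the ASCII-only pattern rejects even though they are ordinary words.
def rejected_by_ascii(words)
  words.reject { |w| whole_word?(ASCII_ONLY, w) }
end
```

With these samples, only "Zurich" passes the ASCII-only pattern, while all four words pass the letter-property pattern.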
Problems to solve to help us with i18n
- Identifying characters
- Byte-character mapping and storage
- A good string API
References
- The World's Writing Systems
- Character model for the web (W3C)
- The Unicode 5.0 Standard (forthcoming book) (same as ISO 10646)
Unicode
- Characters identified by numbers called code points (> 1,000,000 available)
- 17 planes, each with 64k code points
- Original characters (available in any computer anywhere before Unicode was invented) in Basic Multilingual Plane (BMP, first plane)
- Characters written as U+ followed by hex digits (4 for the BMP)
- Unicode character database
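The arithmetic behind code points and planes can be sketched in a few lines of Ruby. The `describe` helper and the constant names are hypothetical, and the sketch assumes UTF-8 strings so that `String#ord` returns the Unicode code point.

```ruby
PLANE_SIZE  = 0x10000 # 65,536 code points per plane
PLANE_COUNT = 17      # plane 0 is the Basic Multilingual Plane

# Total number of code points: 17 * 65,536 = 1,114,112.
CODE_SPACE = PLANE_SIZE * PLANE_COUNT

# Describe a single character: its U+ notation, its plane, and whether it is in the BMP.
def describe(char)
  cp = char.ord
  {
    code_point: format("U+%04X", cp),
    plane: cp / PLANE_SIZE,
    bmp: cp < PLANE_SIZE
  }
end
```

For example, `describe("A")` reports U+0041 in plane 0, while `describe("\u{1D11E}")` (the musical G clef) reports U+1D11E in plane 1, outside the BMP.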
Benefits
- Repertoire
- Room for growth -- lots of space left in the middle planes
- Private use
- Sane process
- Character database
- Ubiquitous standards and tools
Difficulties
- Combining forms -- need to normalize the characters for comparison (1/2 vs. ½), and this is not something you want to do in your String#== method
- Awkward historical compromises with other encodings
- Han unification (note: Tim considers the Wikipedia article to be biased) -- characters that might mean something different were given one code point by Asian linguistic specialists.
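The normalization problem with combining forms can be illustrated with a short Ruby sketch. It assumes Ruby 2.2 or later, where `String#unicode_normalize` exists; the constants and the `same_text?` helper are illustrative names, not anything from the talk.

```ruby
COMPOSED   = "\u00E9"  # "é" as a single precomposed code point
DECOMPOSED = "e\u0301" # "e" followed by a combining acute accent

# Compare two strings after bringing both to the same normalization form (NFC).
# Plain == compares the underlying code points and knows nothing about this.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

# Compatibility normalization (NFKC) goes further and rewrites characters such
# as the vulgar fraction one half into a digit, a fraction slash, and a digit.
def compatibility_form(text)
  text.unicode_normalize(:nfkc)
end
```

Here `COMPOSED == DECOMPOSED` is false even though both render as "é", while `same_text?(COMPOSED, DECOMPOSED)` is true. Note that `compatibility_form("\u00BD")` yields "1", U+2044 FRACTION SLASH, "2", which still differs from the ASCII string "1/2".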
Storage
- Official: UTF-8, UTF-16, UTF-32
- Practical: ASCII, EBCDIC, Shift-JIS, Big5, EUC-JP, EUC-KR, MS code pages, ISO-8859-*, etc.
But with video the largest bandwidth eater, does text size really matter?
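To make the storage trade-off concrete, here is a rough Ruby sketch that measures how many bytes the same text takes in each official encoding. The `byte_sizes` helper is a made-up name, and the sketch assumes Ruby 2.1 or later with its standard transcoders available.

```ruby
UTF_FORMS = %w[UTF-8 UTF-16LE UTF-32LE].freeze

# Return a hash of encoding name => number of bytes needed to store the text.
def byte_sizes(text, encodings = UTF_FORMS)
  encodings.map { |name| [name, text.encode(name).bytesize] }.to_h
end
```

For "hello" this gives 5, 10, and 20 bytes; for the three characters "\u65E5\u672C\u8A9E" (日本語) it gives 9, 6, and 12. A character outside the BMP such as "\u{1D11E}" takes 4 bytes in all three, because UTF-16 needs a surrogate pair for it.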
Identification
How to identify what text is coming in over the wire?
- Guess -- browsers, python lib
- Charset headers, which are known to be wrong
- Trust -- two partners agree in advance for a pattern of exchange
- XML [this last one was quite obvious, right?]
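The "guess" option can be sketched very crudely in Ruby: try candidate encodings in order and take the first one under which the bytes are valid. This is only an illustration of the idea, not how browsers or the Python library mentioned above work; the candidate list and the `guess_encoding` name are assumptions, and it needs Ruby 1.9 or later.

```ruby
# Candidates in order of preference. ISO-8859-1 accepts any byte sequence,
# so it acts as a last-resort fallback rather than a real detection.
CANDIDATES = [Encoding::UTF_8, Encoding::ISO_8859_1].freeze

# Return the first candidate encoding under which the raw bytes are well formed.
def guess_encoding(bytes, candidates = CANDIDATES)
  candidates.find { |enc| bytes.dup.force_encoding(enc).valid_encoding? }
end
```

Validity is a weak signal: many byte sequences are well formed in several encodings at once, which is one reason guesses and even declared charsets turn out to be wrong.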
Language approaches
- Java -- design flaw; characters are UTF-16, which is unfortunate. Implementation is sound and well tested, but Java is clunky.
- Perl 5 has excellent support in theory, but practically speaking, it’s difficult to round-trip text through a DB without some breakage.
- Python has byte arrays and strings, some string-like methods on byte arrays, binary or text data, glosses over issue of platform file encoding
Ruby
- Some core string methods have i18n problems due to counting, regexp, equality and whitespace concerns. String#each_char seems to be a missing method; string class maybe should be aware of its encoding.
- Behavior of String#[] seems ok for byte buffers but it probably doesn’t need to be efficient for characters. Most of the use-cases for String iteration should be for characters (exception: Expat)
- Case-changing methods -- avoid them at all costs in a mixed language environment!
- Regexps need unicode properties for safer matching (\p{L} for letters, \p{N} for numbers)
- Does Ruby need a Character class or a Charset class?
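The difference between counting bytes and counting characters is easy to show in Ruby releases later than the one discussed here. The sketch below assumes Ruby 1.9 or newer, where strings carry an encoding and String#each_char exists; `string_views` is a hypothetical helper name.

```ruby
# Summarize a string both as a byte buffer and as a sequence of characters.
def string_views(text)
  {
    bytes: text.bytesize,          # size of the underlying byte buffer
    chars: text.length,            # number of characters in the string's encoding
    first_three: text[0, 3],       # String#[] indexes by character, not by byte
    each_char: text.each_char.to_a # character-wise iteration
  }
end
```

For "na\u00EFve" (naïve) in UTF-8 this reports 6 bytes but 5 characters, and the first three characters come back intact rather than cut through the middle of the two-byte "ï".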
What is next for Ruby? [Tune your divining rods toward ruby-talk for the rest of the story!] Matz has m17n; Julik, Manfred and crew now have ActiveSupport::MultiByte in Rails; JRuby is built on a platform that already has a Unicode string, so the discussion is heating up.
Update: slides available here.
Q & A
Q. What if I have a stream of bytes with no knowledge of encoding? Don’t try to impose an encoding lens above the level of a string, programmers want to treat a string as a string with associated methods.
Q. What if I need to change case of text? Get used to the fact that it won’t work reliably. What about characterizing the finite number of languages? Java does that, and it’s still not really possible. Shouldn’t the string class know the encoding? Couldn’t it optimize better? Couldn’t you raise an exception? Just don’t do it! Isn’t there a body of knowledge that could be acquired about case? Hmm, next question that isn’t about case!
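Two well-known edge cases show why case changes are unreliable. The sketch assumes Ruby 2.4 or later, which applies full Unicode case mapping and accepts a :turkic option; the helper names are made up for illustration.

```ruby
# Does upcasing and then downcasing give back the original text?
def round_trips?(text)
  text.upcase.downcase == text
end

# Lowercase a string under default rules and under Turkic rules,
# where capital I maps to dotless i (U+0131) instead of "i".
def lowercase_variants(text)
  { default: text.downcase, turkic: text.downcase(:turkic) }
end
```

German "stra\u00DFe" (straße) upcases to "STRASSE" and comes back as "strasse", so it does not round-trip. `lowercase_variants("I")` gives "i" by default but "\u0131" under Turkic rules, so the right answer depends on a language the string itself does not record.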
Q. Is there a resource for edge cases of processing text in XML? Search for “xml test cases”. The decision is between ignoring the metadata provided, and choosing not to process it.
Q. What is the python library that can guess the charset? It’s in the feedvalidator suite.