<p>Nick Sieger: RubyConf: I18n, M17n, Unicode, and all that</p>
<p>Comment on RubyConf: I18n, M17n, Unicode, and all that by Nick (2006-10-24)</p>
<p>Thanks for the corrections. I’ll be upgrading Typo shortly, hopefully that will fix trackbacks.</p>
<p>Comment on RubyConf: I18n, M17n, Unicode, and all that by iain (2006-10-24)</p>
<p>The link to unicode should be: <a href='http://www.unicode.org/Public/UNIDATA/' rel="nofollow">http://www.unicode.org/Public/UNIDATA/</a></p>
<p>Comment on RubyConf: I18n, M17n, Unicode, and all that by David N. Welton (2006-10-22)</p>
<p>Your trackbacks are screwed up, or at least this one doesn’t work:</p>
<p><a href='http://blog.nicksieger.com/articles/trackback/91' rel="nofollow">http://blog.nicksieger.com/articles/trackback/91</a></p>
<p>RubyConf: I18n, M17n, Unicode, and all that, by Nick Sieger (2006-10-22)</p>
<p>Tim Bray is going to talk about characters and strings. He will gladly talk about text until your arm falls off, and buy half the beer.</p>
<h2>Introduction</h2>
<p>English is no longer the majority language on the web. It’s nonsensical to ignore i18n issues with new apps.</p>
<p>This is probably a bug:</p>
<div class="typocode"><pre><code class="typocode_ruby "> <span class="punct">/</span><span class="regex">[a-zA-Z]+</span><span class="punct">/</span></code></pre></div>
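<p>A minimal sketch of why that pattern is suspect, assuming a modern Ruby (1.9 or later) where strings carry an encoding and regexps understand Unicode properties. The sample name and the constant names are illustrative only, not from the talk.</p>

```ruby
# An ASCII-only character class versus a Unicode letter property.
ASCII_WORD   = /[a-zA-Z]+/
UNICODE_WORD = /\p{L}+/          # any Unicode letter

name = "Bj\u00f6rk"              # "Björk", written with an escape for portability

ascii_matches   = name.scan(ASCII_WORD)    # the o-umlaut is not in a-z, so the word splits
unicode_matches = name.scan(UNICODE_WORD)  # the word stays whole

p ascii_matches
p unicode_matches
```

<p>The ASCII class silently cuts the word in two at the first non-English letter, which is the kind of bug that only shows up once real users arrive.</p>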
<h2>Problems to solve to help us with i18n</h2>
<ul>
<li>Identifying characters</li>
<li>Byte-character mapping and storage</li>
<li>A good string API</li>
</ul>
<h2>References</h2>
<ul>
<li>The World’s Writing Systems</li>
<li>Character Model for the World Wide Web (W3C)</li>
<li>The Unicode Standard, Version 5.0 (forthcoming book) (same as ISO 10646)</li>
</ul>
<h2>Unicode</h2>
<ul>
<li>Characters identified by numbers called code points (&gt; 1,000,000 available)</li>
<li>17 planes, each with 64k (65,536) code points</li>
<li>Original characters (available on any computer anywhere before Unicode was invented) are in the Basic Multilingual Plane (BMP, the first plane)</li>
<li>Characters written as U+ followed by hex digits (four in the BMP, up to six in the other planes)</li>
<li><a href="http://www.unicode.org/Public/UNIDATA">Unicode character database</a></li>
</ul>
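<p>The code point and plane arithmetic above can be sketched in a few lines of Ruby. This assumes Ruby 1.9 or later; <code>describe</code> is a hypothetical helper written for this example.</p>

```ruby
# Hypothetical helper: show a character's U+ notation and the plane it lives in.
# A plane holds 0x10000 (65,536) code points, so the plane is the code point
# shifted right by 16 bits.
def describe(char)
  cp = char.ord
  { notation: format("U+%04X", cp), plane: cp >> 16 }
end

euro = describe("\u20ac")      # EURO SIGN, inside the BMP
clef = describe("\u{1D11E}")   # MUSICAL SYMBOL G CLEF, outside the BMP

total_code_points = 17 * 0x10000

p euro
p clef
p total_code_points
```

<p>Seventeen planes of 65,536 code points each is where the "more than a million" figure comes from.</p>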
<h2>Benefits</h2>
<ul>
<li>Repertoire</li>
<li>Room for growth – lots of space left in the middle planes</li>
<li>Private use</li>
<li>Sane process</li>
<li>Character database</li>
<li>Ubiquitous standards and tools</li>
</ul>
<h2>Difficulties</h2>
<ul>
<li>Combining forms – need to normalize the characters for comparison (1/2 vs. ½) and this is not something you want to do in your <code>String#==</code> method</li>
<li>Awkward historical compromises with other encodings</li>
<li><a href="http://en.wikipedia.org/wiki/Han_unification">Han unification</a> (note: Tim considers the Wikipedia article to be biased) – characters that might mean something different were given one code point by Asian linguistic specialists.</li>
</ul>
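<p>The combining-forms problem can be illustrated with a short sketch. It assumes Ruby 2.2 or later, which ships <code>String#unicode_normalize</code>; that method did not exist when the talk was given, and the variable names are illustrative.</p>

```ruby
composed   = "\u00e9"     # e-acute as a single precomposed code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Plain equality compares code points, so the two spellings differ.
same_raw = (composed == decomposed)

# Normalizing both sides to the same form (NFC here) makes them comparable.
same_nfc = (composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc))

# Compatibility normalization rewrites the one-half sign as
# "1", FRACTION SLASH (U+2044), "2", which is still not the ASCII string "1/2".
half = "\u00bd".unicode_normalize(:nfkc)

p same_raw
p same_nfc
p half.codepoints.map { |cp| format("U+%04X", cp) }
```

<p>Normalization is an explicit, separate step here, which matches the point that it is not something to hide inside <code>String#==</code>.</p>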
<h2>Storage</h2>
<ul>
<li>Official: UTF-8, UTF-16, UTF-32</li>
<li>Practical: ASCII, EBCDIC, Shift-JIS, Big5, EUC-JP, EUC-KR, MS code pages, ISO-8859-*, etc.</li>
</ul>
<p>But with video the largest bandwidth eater, does text size really matter?</p>
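<p>For a sense of how much the official encodings differ in size, here is a rough sketch assuming Ruby 1.9 or later. The sample strings are illustrative only.</p>

```ruby
ENCODINGS = %w[UTF-8 UTF-16LE UTF-32LE]

# Hypothetical helper: byte size of the same text in each encoding.
def sizes_for(text)
  ENCODINGS.map { |enc| [enc, text.encode(enc).bytesize] }.to_h
end

english  = sizes_for("hello")                  # five ASCII characters
japanese = sizes_for("\u65e5\u672c\u8a9e")     # three BMP characters ("Japanese" in Japanese)

p english
p japanese
```

<p>UTF-8 is smallest for ASCII-heavy text, UTF-16 is smallest for this Japanese sample, and UTF-32 is always four bytes per character.</p>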
<h2>Identification</h2>
<p>How to identify what text is coming in over the wire?</p>
<ul>
<li>Guess – browsers, Python library</li>
<li>Charset headers, which are known to be wrong</li>
<li>Trust – two partners agree in advance for a pattern of exchange</li>
<li>XML, where the document declares its own encoding [this last one was quite obvious, right?]</li>
</ul>
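<p>The "guess" strategy can be sketched very crudely as follows. This is not how browsers or the Python library mentioned in the talk work; it is a simplified assumption that tries UTF-8 first and falls back to ISO-8859-1. It assumes Ruby 2.0 or later, and <code>guess_utf8</code> is a hypothetical helper.</p>

```ruby
# Hypothetical helper: treat the bytes as UTF-8 if they validate,
# otherwise assume the fallback encoding and transcode to UTF-8.
def guess_utf8(bytes, fallback = Encoding::ISO_8859_1)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?
  bytes.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

utf8_bytes   = "caf\xC3\xA9".b   # "café" as UTF-8 bytes, with no encoding label
latin1_bytes = "caf\xE9".b       # "café" as ISO-8859-1 bytes, with no encoding label

from_utf8   = guess_utf8(utf8_bytes)
from_latin1 = guess_utf8(latin1_bytes)

p from_utf8
p from_latin1
```

<p>A guess like this can be wrong, since many byte sequences are valid in several encodings, which is why the trust and XML options exist.</p>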
<h2>Language approaches</h2>
<ul>
<li>Java – design flaw; characters are UTF-16, which is unfortunate. Implementation is sound and well tested, but Java is clunky.</li>
<li>Perl 5 has excellent support in theory, but practically speaking, it’s difficult to round-trip text through a DB without some breakage.</li>
<li>Python has byte arrays and strings, some string-like methods on byte arrays, binary or text data, glosses over issue of platform file encoding</li>
</ul>
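<p>The UTF-16 complaint is easier to see with a concrete character. The sketch below uses Ruby (1.9 or later) rather than Java, purely to show the 16-bit units involved; the choice of character is illustrative.</p>

```ruby
clef = "\u{1D11E}"   # MUSICAL SYMBOL G CLEF, a code point outside the BMP

# Encode as UTF-16 and read the result back as big-endian 16-bit units.
utf16_units = clef.encode("UTF-16BE").unpack("n*")

p clef.length                                   # one character
p utf16_units.map { |u| format("0x%04X", u) }   # a surrogate pair
```

<p>One character becomes two 16-bit units, so any API that indexes by 16-bit unit can land in the middle of a character.</p>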
<h2>Ruby</h2>
<ul>
<li>Some core string methods have i18n problems due to counting, regexp, equality and whitespace concerns.</li>
<li><code>String#each_char</code> seems to be a missing method; string class maybe should be aware of its encoding.</li>
<li>Behavior of <code>String#[]</code> seems ok for byte buffers but it probably doesn’t need to be efficient for characters. Most of the use-cases for String iteration should be for characters (exception: Expat)</li>
<li>Case-changing methods – avoid them at all costs in a mixed language environment!</li>
<li>Regexps need Unicode properties for safer matching (<code>\p{L}</code> for letters, <code>\p{N}</code> for numbers)</li>
<li>Does Ruby need a Character class or a Charset class?</li>
</ul>
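<p>For comparison, here is a sketch of how several of these points look in later Ruby versions. It assumes Ruby 2.4 or later (for the <code>:turkic</code> case-mapping option); none of this was available in the Ruby discussed in the talk, and the sample strings are illustrative.</p>

```ruby
word = "na\u00efve"   # "naïve": 5 characters, 6 bytes in UTF-8

char_count = word.length          # counts characters
byte_count = word.bytesize        # counts bytes
chars      = word.each_char.to_a  # iterates by character
third      = word[2]              # indexes by character, not byte

# Unicode properties in regexps: letters and numbers.
letters = "abc123\u00e9".scan(/\p{L}+/)
numbers = "abc123\u00e9".scan(/\p{N}+/)

# Case mapping depends on language: Turkish upcases "i" to a dotted capital I.
default_upper = "i".upcase
turkic_upper  = "i".upcase(:turkic)

p [char_count, byte_count, chars, third]
p [letters, numbers]
p [default_upper, turkic_upper]
```

<p>The last two lines show why case changing is risky in a mixed-language environment: the right answer depends on a language the string itself does not record.</p>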
<p>What is next for Ruby? [Tune your divining rods toward ruby-talk for the rest of the story!] Matz has m17n; Julik, Manfred and crew now have ActiveSupport::MultiByte in Rails; JRuby is built on a platform that already has a Unicode string, so the discussion is heating up.</p>
<p><em>Update: <a href="http://www.tbray.org/talks/rubyconf2006.pdf">slides available here</a>.</em></p>
<h2>Q &amp; A</h2>
<p><em>Q. What if I have a stream of bytes with no knowledge of encoding?</em> Don’t try to impose an encoding lens above the level of a string; programmers want to treat a string as a string with associated methods.</p>
<p><em>Q. What if I need to change case of text?</em> Get used to the fact that it won’t work reliably. <em>What about characterizing the finite number of languages?</em> Java does that, and it’s still not really possible. <em>Shouldn’t the string class know the encoding? Couldn’t it optimize better? Couldn’t you raise an exception?</em> Just don’t do it! <em>Isn’t there a body of knowledge that could be acquired about case?</em> Hmm, next question that isn’t about case!</p>
<p><em>Q. Is there a resource for edge cases of processing text in XML?</em> Search for “xml test cases”. The decision is between ignoring the metadata provided, and choosing not to process it.</p>
<p><em>Q. What is the Python library that can guess the charset?</em> It’s in the <a href="http://feedvalidator.org">feedvalidator</a> suite.</p>