Nick Sieger: RubyConf: I18n, M17n, Unicode, and all that http://blog.nicksieger.com/articles/2006/10/22/rubyconf-i18n-m17n-unicode-and-all-that en-us 40 "RubyConf: I18n, M17n, Unicode, and all that" by Nick <p>Thanks for the corrections. I&#8217;ll be upgrading Typo shortly, hopefully that will fix trackbacks.</p> Tue, 24 Oct 2006 14:43:19 +0000 urn:uuid:47a3e279-e8d7-4e0c-a98e-585185deca3f http://blog.nicksieger.com/articles/2006/10/22/rubyconf-i18n-m17n-unicode-and-all-that#comment-105 "RubyConf: I18n, M17n, Unicode, and all that" by iain <p>The link to unicode should be: <a href='http://www.unicode.org/Public/UNIDATA/' rel="nofollow">http://www.unicode.org/Public/UNIDATA/</a></p> Tue, 24 Oct 2006 08:29:10 +0000 urn:uuid:639dfea4-1d90-4c0b-bc5d-926282849135 http://blog.nicksieger.com/articles/2006/10/22/rubyconf-i18n-m17n-unicode-and-all-that#comment-102 "RubyConf: I18n, M17n, Unicode, and all that" by David N. Welton <p>Your trackbacks are screwed up, or at least this one doesn&#8217;t work:</p> <p><a href='http://blog.nicksieger.com/articles/trackback/91' rel="nofollow">http://blog.nicksieger.com/articles/trackback/91</a></p> Sun, 22 Oct 2006 13:55:44 +0000 urn:uuid:9ea75370-acf0-41e7-b099-ddde05ba65a9 http://blog.nicksieger.com/articles/2006/10/22/rubyconf-i18n-m17n-unicode-and-all-that#comment-94 RubyConf: I18n, M17n, Unicode, and all that <p>Tim Bray is going to talk about characters and strings. He will gladly talk about text until your arm falls off, and buy half the beer.</p> <h2>Introduction</h2> <p>English is no longer the majority language on the web. 
It&#8217;s nonsensical to ignore i18n issues with new apps.</p> <p>A pattern like this is probably a bug, since it matches only unaccented ASCII letters:</p> <div class="typocode"><pre><code class="typocode_ruby "> <span class="punct">/</span><span class="regex">[a-zA-Z]+</span><span class="punct">/</span></code></pre></div> <h2>Problems to solve to help us with i18n</h2> <ul> <li>Identifying characters</li> <li>Byte-character mapping and storage</li> <li>A good string API</li> </ul> <h2>References</h2> <ul> <li>The World&#8217;s Writing Systems</li> <li>Character Model for the World Wide Web (W3C)</li> <li>The Unicode Standard, Version 5.0 (forthcoming book; same as ISO 10646)</li> </ul> <h2>Unicode</h2> <ul> <li>Characters identified by code points (&gt; 1,000,000 available)</li> <li>17 planes, each with 64k code points</li> <li>Original characters (available in any computer anywhere before Unicode was invented) in the Basic Multilingual Plane (BMP, first plane)</li> <li>Characters written as U+ followed by four to six hex digits (four within the BMP)</li> <li><a href="http://www.unicode.org/Public/UNIDATA">Unicode character database</a></li> </ul> <h2>Benefits</h2> <ul> <li>Repertoire</li> <li>Room for growth &#8211; lots of space left in the middle planes</li> <li>Private use</li> <li>Sane process</li> <li>Character database</li> <li>Ubiquitous standards and tools</li> </ul> <h2>Difficulties</h2> <ul> <li>Combining forms &#8211; need to normalize the characters for comparison (1/2 vs. 
&frac12;), and this is not something you want to do in your <code>String#==</code> method</li> <li>Awkward historical compromises with other encodings</li> <li><a href="http://en.wikipedia.org/wiki/Han_unification">Han unification</a> (note: Tim considers the Wikipedia article to be biased) &#8211; characters that might mean something different were given one code point by Asian linguistic specialists.</li> </ul> <h2>Storage</h2> <ul> <li>Official: UTF-8, UTF-16, UTF-32</li> <li>Practical: ASCII, EBCDIC, Shift-JIS, Big5, EUC-JP, EUC-KR, MS code pages, ISO-8859-*, etc.</li> </ul> <p>But with video now the largest consumer of bandwidth, does text size really matter?</p> <h2>Identification</h2> <p>How to identify what text is coming in over the wire?</p> <ul> <li>Guess &#8211; browsers, Python library</li> <li>Charset headers, which are known to be unreliable</li> <li>Trust &#8211; two partners agree in advance on a pattern of exchange</li> <li>XML [this last one was quite obvious, right?]</li> </ul> <h2>Language approaches</h2> <ul> <li>Java &#8211; design flaw; characters are UTF-16, which is unfortunate. Implementation is sound and well tested, but Java is clunky.</li> <li>Perl 5 has excellent support in theory, but practically speaking, it&#8217;s difficult to round-trip text through a DB without some breakage.</li> <li>Python has byte arrays and strings for binary and text data, with some string-like methods on byte arrays; it glosses over the issue of platform file encoding.</li> </ul> <h2>Ruby</h2> <ul> <li>Some core string methods have i18n problems due to counting, regexp, equality and whitespace concerns.</li> <li><code>String#each_char</code> seems to be a missing method; the string class should perhaps be aware of its encoding.</li> <li>Behavior of <code>String#[]</code> seems OK for byte buffers, but it probably doesn&#8217;t need to be efficient for characters. 
Most of the use cases for String iteration should be for characters (exception: Expat)</li> <li>Case-changing methods &#8211; avoid them at all costs in a mixed-language environment!</li> <li>Regexps need Unicode properties for safer matching (<code>\p{L}</code> for letters, <code>\p{N}</code> for numbers)</li> <li>Does Ruby need a Character class or a Charset class?</li> </ul> <p>What is next for Ruby? [Tune your divining rods toward ruby-talk for the rest of the story!] Matz has m17n; Julik, Manfred and crew now have ActiveSupport::Multibyte in Rails; JRuby is built on a platform that already has a Unicode string, so the discussion is heating up.</p> <p><em>Update: <a href="http://www.tbray.org/talks/rubyconf2006.pdf">slides available here</a>.</em></p> <h2>Q &amp; A</h2> <p><em>Q. What if I have a stream of bytes with no knowledge of encoding?</em> Don&#8217;t try to impose an encoding lens above the level of a string; programmers want to treat a string as a string with associated methods.</p> <p><em>Q. What if I need to change case of text?</em> Get used to the fact that it won&#8217;t work reliably. <em>What about characterizing the finite number of languages?</em> Java does that, and it&#8217;s still not really possible. <em>Shouldn&#8217;t the string class know the encoding? Couldn&#8217;t it optimize better? Couldn&#8217;t you raise an exception?</em> Just don&#8217;t do it! <em>Isn&#8217;t there a body of knowledge that could be acquired about case?</em> Hmm, next question that isn&#8217;t about case!</p> <p><em>Q. Is there a resource for edge cases of processing text in XML?</em> Search for &#8220;xml test cases&#8221;. The decision is between ignoring the metadata provided and choosing not to process it.</p> <p><em>Q. 
What is the Python library that can guess the charset?</em> It&#8217;s in the <a href="http://feedvalidator.org">feedvalidator</a> suite.</p> Sun, 22 Oct 2006 00:06:00 +0000 urn:uuid:9ec73cc7-e1d5-4ce6-a78e-1caaac24b597 Nick Sieger http://blog.nicksieger.com/articles/2006/10/22/rubyconf-i18n-m17n-unicode-and-all-that ruby rubyconf rubyconf2006 unicode http://blog.nicksieger.com/articles/trackback/91