Nick Sieger: Tag unicode http://blog.nicksieger.com/articles/tag/unicode?tag=unicode en-us 40 RubyConf: I18n, M17n, Unicode, and all that <p>Tim Bray is going to talk about characters and strings. He will gladly talk about text until your arm falls off, and buy half the beer.</p> <h2>Introduction</h2> <p>English is no longer the majority language on the web. It&#8217;s nonsensical to ignore i18n issues with new apps.</p> <p>This is probably a bug, since it matches only unaccented ASCII letters:</p> <div class="typocode"><pre><code class="typocode_ruby "> <span class="punct">/</span><span class="regex">[a-zA-Z]+</span><span class="punct">/</span></code></pre></div> <h2>Problems to solve to help us with i18n</h2> <ul> <li>Identifying characters</li> <li>Byte-character mapping and storage</li> <li>A good string API</li> </ul> <h2>References</h2> <ul> <li>The World&#8217;s Writing Systems</li> <li>Character Model for the World Wide Web (W3C)</li> <li>The Unicode 5.0 Standard (forthcoming book) (same as ISO 10646)</li> </ul> <h2>Unicode</h2> <ul> <li>Characters identified by numbers called code points (&gt; 1,000,000 available)</li> <li>17 planes, each with 64k code points</li> <li>Original characters (those available on any computer anywhere before Unicode was invented) are in the Basic Multilingual Plane (BMP, the first plane)</li> <li>Characters written as U+ followed by hex digits (4 in the BMP)</li> <li><a href="http://www.unicode.org/Public/UNIDATA">Unicode character database</a></li> </ul> <h2>Benefits</h2> <ul> <li>Repertoire</li> <li>Room for growth -- lots of space left in the middle planes</li> <li>Private use</li> <li>Sane process</li> <li>Character database</li> <li>Ubiquitous standards and tools</li> </ul> <h2>Difficulties</h2> <ul> <li>Combining forms -- characters need to be normalized before comparison (1/2 vs. &frac12;), and this is not something you want to do in your <code>String#==</code> method</li> <li>Awkward historical compromises with other encodings</li> <li><a href="http://en.wikipedia.org/wiki/Han_unification">Han unification</a> (note:
Tim considers the Wikipedia article to be biased) -- characters that might mean something different were given one code point by Asian linguistic specialists.</li> </ul> <h2>Storage</h2> <ul> <li>Official: UTF-8, UTF-16, UTF-32</li> <li>Practical: ASCII, EBCDIC, Shift-JIS, Big5, EUC-JP, EUC-KR, MS code pages, ISO-8859-*, etc.</li> </ul> <p>But with video the largest bandwidth eater, does text size really matter?</p> <h2>Identification</h2> <p>How do you identify what text is coming in over the wire?</p> <ul> <li>Guess -- browsers, Python lib</li> <li>Charset headers, which are known to be wrong</li> <li>Trust -- two partners agree in advance on a pattern of exchange</li> <li>XML [this last one was quite obvious, right?]</li> </ul> <h2>Language approaches</h2> <ul> <li>Java -- design flaw; characters are UTF-16, which is unfortunate. Implementation is sound and well tested, but Java is clunky.</li> <li>Perl 5 has excellent support in theory, but practically speaking, it&#8217;s difficult to round-trip text through a DB without some breakage.</li> <li>Python has byte arrays and strings (binary or text data), with some string-like methods on byte arrays; it glosses over the issue of platform file encoding</li> </ul> <h2>Ruby</h2> <ul> <li>Some core string methods have i18n problems due to counting, regexp, equality and whitespace concerns.</li> <li><code>String#each_char</code> seems to be a missing method; the String class should perhaps be aware of its encoding.</li> <li>Behavior of <code>String#[]</code> seems ok for byte buffers, but it probably doesn&#8217;t need to be efficient for characters. Most of the use-cases for String iteration should be for characters (exception: Expat)</li> <li>Case-changing methods -- avoid them at all costs in a mixed language environment!</li> <li>Regexps need Unicode properties for safer matching (<code>\p{L}</code> for
letters, <code>\p{N}</code> for numbers)</li> <li>Does Ruby need a Character class or a Charset class?</li> </ul> <p>What is next for Ruby? [Tune your divining rods toward ruby-talk for the rest of the story!] Matz has m17n; Julik, Manfred and crew now have ActiveSupport::MultiByte in Rails; JRuby is built on a platform that already has a Unicode string, so the discussion is heating up.</p> <p><em>Update: <a href="http://www.tbray.org/talks/rubyconf2006.pdf">slides available here</a>.</em></p> <h2>Q &amp; A</h2> <p><em>Q. What if I have a stream of bytes with no knowledge of encoding?</em> Don&#8217;t try to impose an encoding lens above the level of a string; programmers want to treat a string as a string with associated methods.</p> <p><em>Q. What if I need to change case of text?</em> Get used to the fact that it won&#8217;t work reliably. <em>What about characterizing the finite number of languages?</em> Java does that, and it&#8217;s still not really possible. <em>Shouldn&#8217;t the string class know the encoding? Couldn&#8217;t it optimize better? Couldn&#8217;t you raise an exception?</em> Just don&#8217;t do it! <em>Isn&#8217;t there a body of knowledge that could be acquired about case?</em> Hmm, next question that isn&#8217;t about case!</p> <p><em>Q. Is there a resource for edge cases of processing text in XML?</em> Search for &#8220;xml test cases&#8221;. The decision is between ignoring the metadata provided and choosing not to process it.</p> <p><em>Q. What is the Python library that can guess the charset?</em> It&#8217;s in the <a href="http://feedvalidator.org">feedvalidator</a> suite.</p> Sun, 22 Oct 2006 00:06:00 +0000 urn:uuid:9ec73cc7-e1d5-4ce6-a78e-1caaac24b597 Nick Sieger http://blog.nicksieger.com/articles/2006/10/22/rubyconf-i18n-m17n-unicode-and-all-that ruby rubyconf rubyconf2006 unicode http://blog.nicksieger.com/articles/trackback/91
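<p>The points in the notes about the ASCII-only regexp, normalizing before comparison, and counting characters rather than bytes can be sketched roughly as follows. This is only an illustration, not something from the talk: it assumes Ruby 2.2 or later (for <code>String#unicode_normalize</code>), which postdates these notes, and the method names <code>ascii_words</code>, <code>unicode_words</code> and <code>same_text?</code> are hypothetical.</p>

```ruby
# Sketch only: assumes Ruby 2.2 or later (String#unicode_normalize), which
# postdates the talk. Method names below are hypothetical, for illustration.

# Words as seen by the ASCII-only pattern the notes flag as a probable bug.
def ascii_words(text)
  text.scan(/[a-zA-Z]+/)
end

# Words as seen by a Unicode property class: \p{L} matches any letter.
def unicode_words(text)
  text.scan(/\p{L}+/)
end

# Compare two strings after normalizing both to NFC, so that a precomposed
# character and its base-plus-combining-mark spelling compare equal.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

sample     = "na\u00EFve caf\u00E9"   # "naive cafe" with precomposed accents
decomposed = "cafe\u0301"             # "cafe" with e + combining acute accent

p ascii_words(sample)                  # accented letters split the words
p unicode_words(sample)                # whole words survive
p "caf\u00E9" == decomposed            # raw comparison of the two spellings
p same_text?("caf\u00E9", decomposed)  # comparison after normalization
p "caf\u00E9".bytesize                 # number of bytes in UTF-8
p "caf\u00E9".each_char.to_a.length    # number of characters
```

<p>Keeping normalization in a separate helper rather than inside <code>==</code> follows the note that normalizing is not something you want to do in <code>String#==</code> itself.</p>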