<?xml version="1.0" encoding="UTF-8"?>
<feed xml:lang="en-US" xmlns="http://www.w3.org/2005/Atom">
  <title>Nick Sieger: Tag unicode</title>
  <id>tag:blog.nicksieger.com,2005:Typo</id>
  <generator uri="http://www.typosphere.org" version="4.0">Typo</generator>
  <link rel="self" type="application/atom+xml" href="http://blog.nicksieger.com/xml/atom10/tag/unicode/feed.xml"/>
  <link rel="alternate" type="text/html" href="http://blog.nicksieger.com/articles/tag/unicode?tag=unicode"/>
  <updated>2007-08-31T16:54:24+00:00</updated>
  <entry>
    <author>
      <name>Nick Sieger</name>
    </author>
    <id>urn:uuid:9ec73cc7-e1d5-4ce6-a78e-1caaac24b597</id>
    <published>2006-10-22T00:06:00+00:00</published>
    <updated>2007-08-31T16:54:24+00:00</updated>
    <title>RubyConf: I18n, M17n, Unicode, and all that</title>
    <link rel="alternate" type="text/html" href="http://blog.nicksieger.com/articles/2006/10/22/rubyconf-i18n-m17n-unicode-and-all-that"/>
    <category term="ruby" label="ruby" scheme="http://blog.nicksieger.com/articles/category/ruby"/>
    <category term="rubyconf" scheme="http://blog.nicksieger.com/articles/tag/rubyconf"/>
    <category term="rubyconf2006" scheme="http://blog.nicksieger.com/articles/tag/rubyconf2006"/>
    <category term="unicode" scheme="http://blog.nicksieger.com/articles/tag/unicode"/>
    <content type="html">&lt;p&gt;Tim Bray is going to talk about characters and strings.  He will gladly talk about text until your arm falls off, and buy half the beer.&lt;/p&gt;

&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;English is no longer the majority language on the web.  It&amp;#8217;s nonsensical to ignore i18n issues with new apps.&lt;/p&gt;

&lt;p&gt;This is probably a bug:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;    &lt;span class="punct"&gt;/&lt;/span&gt;&lt;span class="regex"&gt;[a-zA-Z]+&lt;/span&gt;&lt;span class="punct"&gt;/&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
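
&lt;p&gt;A minimal sketch of why that pattern is suspect, assuming a modern Ruby (1.9 or later) where strings carry an encoding, which postdates this talk. The sample word is only an illustration.&lt;/p&gt;

```ruby
# "cafe" with an acute accent on the e, written with an escape
# so the source stays ASCII-only.
word = "caf\u00e9"

# The ASCII-only character class stops at the first non-ASCII letter.
ascii_match = word[/[a-zA-Z]+/]

# The Unicode property \p{L} matches any letter.
letter_match = word[/\p{L}+/]

puts ascii_match.length    # 3
puts letter_match.length   # 4
```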

&lt;h2&gt;Problems to solve to help us with i18n&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Identifying characters&lt;/li&gt;
&lt;li&gt;Byte-character mapping and storage&lt;/li&gt;
&lt;li&gt;A good string API&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The World&amp;#8217;s Writing Systems&lt;/li&gt;
&lt;li&gt;Character Model for the World Wide Web (W3C)&lt;/li&gt;
&lt;li&gt;The Unicode 5.0 Standard (forthcoming book) (same as ISO 10646)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Unicode&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Characters identified by numbers called code points (room for &amp;gt; 1,000,000)&lt;/li&gt;
&lt;li&gt;17 planes, each with 64k code points&lt;/li&gt;
&lt;li&gt;Original characters (available in any computer anywhere before Unicode was invented) are in the Basic Multilingual Plane (BMP, the first plane)&lt;/li&gt;
&lt;li&gt;Characters written as U+ followed by 4 hex digits (more outside the BMP)&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.unicode.org/Public/UNIDATA"&gt;Unicode character database&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
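
&lt;p&gt;The numbering above can be sketched in a few lines, assuming a modern Ruby with UTF-8 strings. &lt;code&gt;code_point_label&lt;/code&gt; is a hypothetical helper name, and the sample characters are arbitrary.&lt;/p&gt;

```ruby
# code_point_label is a hypothetical helper for this sketch: it reports
# a character's code point in U+ notation along with its plane.
def code_point_label(char)
  code_point = char.ord
  plane = code_point / 0x10000   # each plane holds 65,536 code points
  format("U+%04X plane %d", code_point, plane)
end

["A", "\u00e9", "\u4e2d", "\u{1F600}"].each do |char|
  puts code_point_label(char)
end

# Size of the whole code space: 17 planes of 65,536 code points.
puts 17 * 0x10000
```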

&lt;h2&gt;Benefits&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Repertoire&lt;/li&gt;
&lt;li&gt;Room for growth &amp;#8211; lots of space left in the middle planes&lt;/li&gt;
&lt;li&gt;Private use&lt;/li&gt;
&lt;li&gt;Sane process&lt;/li&gt;
&lt;li&gt;Character database&lt;/li&gt;
&lt;li&gt;Ubiquitous standards and tools&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Difficulties&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Combining forms &amp;#8211; need to normalize the characters for comparison (1/2 vs. &amp;frac12;) and this is not something you want to do in your &lt;code&gt;String#==&lt;/code&gt; method&lt;/li&gt;
&lt;li&gt;Awkward historical compromises with other encodings&lt;/li&gt;
&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Han_unification"&gt;Han unification&lt;/a&gt; (note: Tim considers the Wikipedia article to be biased) &amp;#8211; characters that might mean something different were given one code point by Asian linguistic specialists.&lt;/li&gt;
&lt;/ul&gt;
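
&lt;p&gt;A small sketch of the combining-forms problem, assuming Ruby 2.2 or later, where &lt;code&gt;String#unicode_normalize&lt;/code&gt; is available. It uses an accented letter rather than the fraction above, since the two spellings of the letter are equivalent under NFC normalization.&lt;/p&gt;

```ruby
composed   = "\u00e9"     # one code point: e with acute accent
decomposed = "e\u0301"    # letter e followed by a combining acute accent

# Both render the same, but a plain comparison sees different code points.
puts composed == decomposed      # false
puts composed.length             # 1
puts decomposed.length           # 2

# Normalizing both sides to the same form (NFC here) makes them equal.
same = composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)
puts same                        # true
```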

&lt;h2&gt;Storage&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Official: UTF-8, UTF-16, UTF-32&lt;/li&gt;
&lt;li&gt;Practical: ASCII, EBCDIC, Shift-JIS, Big5, EUC-JP, EUC-KR, MS code pages, ISO-8859-*, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But with video the largest bandwidth eater, does text size really matter?&lt;/p&gt;
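
&lt;p&gt;For a rough sense of the size differences between the three official encodings, here is a sketch assuming a modern Ruby. &lt;code&gt;encoded_sizes&lt;/code&gt; is a hypothetical helper, and the big-endian variants are used so that no byte order mark is counted.&lt;/p&gt;

```ruby
# encoded_sizes is a hypothetical helper for this sketch: it returns
# the number of bytes a string needs in each encoding form.
def encoded_sizes(text)
  ["UTF-8", "UTF-16BE", "UTF-32BE"].each_with_object({}) do |name, sizes|
    sizes[name] = text.encode(name).bytesize
  end
end

samples = ["hello", "\u4e2d\u6587"]   # ASCII text, then two CJK characters

samples.each do |sample|
  encoded_sizes(sample).each do |name, size|
    puts "#{sample.length} characters in #{name}: #{size} bytes"
  end
end
```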

&lt;h2&gt;Identification&lt;/h2&gt;

&lt;p&gt;How to identify what text is coming in over the wire?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Guess &amp;#8211; browsers, Python lib&lt;/li&gt;
&lt;li&gt;Charset headers, which are known to be wrong&lt;/li&gt;
&lt;li&gt;Trust &amp;#8211; two partners agree in advance for a pattern of exchange&lt;/li&gt;
&lt;li&gt;XML [this last one was quite obvious, right?]&lt;/li&gt;
&lt;/ul&gt;
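
&lt;p&gt;The crudest form of guessing is to check whether incoming bytes are even valid in the encoding you hope for. The sketch below assumes a modern Ruby; &lt;code&gt;plausible_utf8?&lt;/code&gt; is a hypothetical helper, and it is only a sanity check, not real charset detection.&lt;/p&gt;

```ruby
# plausible_utf8? is a hypothetical helper for this sketch: it reports
# whether a byte string is well-formed UTF-8. Passing the check does not
# prove the sender meant UTF-8; failing it proves they did not.
def plausible_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

utf8_bytes   = "caf\u00e9".b                        # raw bytes of UTF-8 text
latin1_bytes = "caf\u00e9".encode("ISO-8859-1").b   # same word in Latin-1

puts plausible_utf8?(utf8_bytes)     # true
puts plausible_utf8?(latin1_bytes)   # false
```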

&lt;h2&gt;Language approaches&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Java &amp;#8211; design flaw; characters are UTF-16, which is unfortunate.  Implementation is sound and well tested, but Java is clunky.&lt;/li&gt;
&lt;li&gt;Perl 5 has excellent support in theory, but practically speaking, it&amp;#8217;s difficult to round-trip text through a DB without some breakage.&lt;/li&gt;
&lt;li&gt;Python has byte arrays and strings, some string-like methods on byte arrays, binary or text data; glosses over the issue of platform file encoding&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Ruby&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Some core string methods have i18n problems due to counting, regexp, equality and whitespace concerns.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;String#each_char&lt;/code&gt; seems to be a missing method; string class maybe should be aware of its encoding.&lt;/li&gt;
&lt;li&gt;Behavior of &lt;code&gt;String#[]&lt;/code&gt; seems ok for byte buffers but it probably doesn&amp;#8217;t need to be efficient for characters.  Most of the use-cases for String iteration should be for characters (exception: Expat)&lt;/li&gt;
&lt;li&gt;Case-changing methods &amp;#8211; avoid them at all costs in a mixed language environment!&lt;/li&gt;
&lt;li&gt;Regexps need Unicode properties for safer matching (&lt;code&gt;\p{L}&lt;/code&gt; for letters, &lt;code&gt;\p{N}&lt;/code&gt; for numbers)&lt;/li&gt;
&lt;li&gt;Does Ruby need a Character class or a Charset class?&lt;/li&gt;
&lt;/ul&gt;
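
&lt;p&gt;For comparison, here is how the counting and iteration concerns look in later Ruby versions (1.9 and up), where each string knows its encoding. This is a sketch of current behavior, not of the Ruby being discussed at the talk.&lt;/p&gt;

```ruby
text = "na\u00efve"    # five characters; the i with diaeresis takes two bytes

puts text.encoding                 # UTF-8
puts text.length                   # 5 characters
puts text.bytesize                 # 6 bytes

# Indexing and each_char work on characters, each_byte on raw bytes.
puts text[2] == "\u00ef"           # true
puts text.each_char.to_a.length    # 5
puts text.each_byte.to_a.length    # 6
```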

&lt;p&gt;What is next for Ruby?  [Tune your divining rods toward ruby-talk for the rest of the story!]  Matz has m17n; Julik, Manfred and crew now have ActiveSupport::Multibyte in Rails; JRuby is built on a platform that already has a Unicode string, so the discussion is heating up.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Update: &lt;a href="http://www.tbray.org/talks/rubyconf2006.pdf"&gt;slides available here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Q &amp;amp; A&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Q. What if I have a stream of bytes with no knowledge of encoding?&lt;/em&gt;  Don&amp;#8217;t try to impose an encoding lens above the level of a string; programmers want to treat a string as a string with associated methods.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Q. What if I need to change case of text?&lt;/em&gt;  Get used to the fact that it won&amp;#8217;t work reliably.  &lt;em&gt;What about characterizing the finite number of languages?&lt;/em&gt;  Java does that, and it&amp;#8217;s still not really possible.  &lt;em&gt;Shouldn&amp;#8217;t the string class know the encoding?  Couldn&amp;#8217;t it optimize better?  Couldn&amp;#8217;t you raise an exception?&lt;/em&gt;  Just don&amp;#8217;t do it!  &lt;em&gt;Isn&amp;#8217;t there a body of knowledge that could be acquired about case?&lt;/em&gt;  Hmm, next question that isn&amp;#8217;t about case!&lt;/p&gt;
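
&lt;p&gt;Two well-known cases show why case changes are unreliable. The sketch assumes Ruby 2.4 or later, which applies Unicode case mapping and accepts a &lt;code&gt;:turkic&lt;/code&gt; option; the sample strings are illustrations only.&lt;/p&gt;

```ruby
german = "stra\u00dfe"    # German word containing a sharp s

# Upcasing turns the sharp s into two letters, so the length changes
# and downcasing again does not restore the original.
shouted = german.upcase
puts shouted                         # STRASSE
puts shouted.downcase == german      # false

# Turkish has dotted and dotless i, so the right answer depends on
# the language of the text, which the string itself does not record.
default_upper = "i".upcase
turkic_upper  = "i".upcase(:turkic)
puts default_upper == turkic_upper   # false
```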

&lt;p&gt;&lt;em&gt;Q. Is there a resource for edge cases of processing text in XML?&lt;/em&gt;  Search for &amp;#8220;xml test cases&amp;#8221;.  The decision is between ignoring the metadata provided, and choosing not to process it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Q. What is the Python library that can guess the charset?&lt;/em&gt;  It&amp;#8217;s in the &lt;a href="http://feedvalidator.org"&gt;feedvalidator&lt;/a&gt; suite.&lt;/p&gt;</content>
  </entry>
</feed>
