Lessons from three weeks of intensive I18N

When we launched LineBuzz on May 10, we had no idea that most of our press coverage was going to be Japanese. A site called 100Shiki.com put us up as dot-com of the day. All of a sudden we had lots of Japanese users. A few days later, a very popular blogger in China gave us a mention and we had lots of Chinese users too. Within a week we had over 15 languages on the site.

Three intense weeks later we launched an I18N version of the site.

Here’s a brief summary of the key issues we had to deal with when i18n’ing an app whose code is split roughly 50/50 between client and server, with lots of communication between the two.

LineBuzz is very text-intensive by nature. We provide inline comments without a browser plugin. One of the unique things about LineBuzz is that it doesn’t matter which page you post an inline comment on: the comment will appear anywhere on the website where the text and its surrounding paragraph appear.

So as you can imagine, we rely heavily on regular expressions, character code conversions and text-length calculations.

Safari – not the world’s best browser

The first thing that broke was Safari. Safari’s JavaScript regex engine is seriously busted: it doesn’t support Unicode characters at all. IIRC it simply returns true for any regex that contains Unicode. So their claim that it’s the world’s best browser really irks me. I had to write a fix-Safari layer for anything that involved processing Unicode.
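The post doesn't show what the fix-Safari layer looked like, so here is only a minimal sketch of one way such a layer could work: use a regex when everything is plain ASCII, and fall back to ordinary string operations as soon as non-ASCII text is involved. The helper names `isAscii`, `escapeRegex` and `replaceAllLiteral` are made up for illustration and are not from the LineBuzz code.

```javascript
// Hypothetical helpers, not the actual LineBuzz layer.

// True when every UTF-16 code unit in the string is in the ASCII range.
function isAscii(text) {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 127) {
      return false;
    }
  }
  return true;
}

// Escape regex metacharacters so a literal string can be used as a pattern.
function escapeRegex(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace every occurrence of a literal substring.
// ASCII-only input goes through the regex engine; anything else avoids
// regexes entirely and uses split/join, which does not depend on the
// engine's Unicode support.
function replaceAllLiteral(text, needle, replacement) {
  if (needle === "") {
    return text;
  }
  if (isAscii(text) && isAscii(needle)) {
    const pattern = new RegExp(escapeRegex(needle), "g");
    // A function replacement keeps "$" in the replacement literal.
    return text.replace(pattern, function () {
      return replacement;
    });
  }
  return text.split(needle).join(replacement);
}
```

The point of the design is that callers never touch the regex engine directly, so the workaround lives in one place.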

No round-trip for jp charsets

The next thing that bit me was Japanese character set support. The Japanese use two main character sets: EUC-JP and Shift_JIS. The latter is a product of Windows and the former comes from Unix. Both caused a major headache because they don’t round-trip convert to Unicode. Translated, that means you can’t convert text in these character sets to a Unicode encoding like UTF-8, convert it back to its native character set, and expect the result to equal the original. The solution: store the raw character data for all character sets as binary and only convert to Unicode when I absolutely must. I use UTF-8 on linebuzz.com, so that’s one scenario where I convert from binary to UTF-8.
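A rough sketch of the "store raw, convert late" idea follows. The record shape and the function names `makeRecord`, `sameRawText` and `toDisplayString` are assumptions for illustration, not the LineBuzz storage code. It uses the standard `TextDecoder` API, which did not exist when this was written; decoding labels such as `shift_jis` or `euc-jp` with it in Node.js needs a build with full ICU data.

```javascript
// Hypothetical record: the bytes exactly as received, plus the charset
// label the source page declared.
function makeRecord(bytes, charset) {
  return { bytes: Uint8Array.from(bytes), charset: charset };
}

// Compare two records on their raw bytes only. No conversion to Unicode
// happens here, so mapping differences between converters cannot make
// identical source text look different.
function sameRawText(a, b) {
  if (a.charset !== b.charset || a.bytes.length !== b.bytes.length) {
    return false;
  }
  for (let i = 0; i < a.bytes.length; i++) {
    if (a.bytes[i] !== b.bytes[i]) {
      return false;
    }
  }
  return true;
}

// Convert to a JavaScript string only at the point of display.
function toDisplayString(record) {
  return new TextDecoder(record.charset).decode(record.bytes);
}

// Example record holding Shift_JIS bytes; it is stored and compared
// without ever being decoded.
const stored = makeRecord([0x93, 0xfa, 0x96, 0x7b], "shift_jis");
```

Matching and equality checks work on `bytes`; `toDisplayString` is called only when the text has to be shown on a UTF-8 page.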

When is a space not a space?

Another thing that bit me was space character codes and spaces in regex. Unicode has about 20 different space characters. Some regex engines are smart and recognize them all; others only recognize the traditional ASCII space character. So routines that, for example, removed spaces had to be hand-tailored to deal with every Unicode space.

String.charCodeAt() == lies lies lies!!

Character codes differ across operating systems. Some character sets contain characters that have a different character code on OS X than on Unix. Yes, even in the same browser using the same JavaScript engine (Firefox, for example), the character codes are different. So any routine that relies on consistent character codes across platforms has to deal with this little nightmare.
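The post doesn't say which characters differed, so the following is a sketch under an assumption: that the mismatch comes from platforms mapping the same legacy bytes to different Unicode code points. A well-known case is Shift_JIS, where Microsoft's mapping and the JIS-based mapping disagree on a handful of symbols (wave dash versus fullwidth tilde, for instance). The table below is an illustrative subset, and `canonicalize` and `sameAcrossPlatforms` are hypothetical names.

```javascript
// Illustrative subset of code points that differ between the Microsoft
// and JIS-based mappings of Shift_JIS. Each variant is folded to one
// canonical form before any comparison.
const VARIANT_TO_CANONICAL = new Map([
  ["\uFF5E", "\u301C"], // FULLWIDTH TILDE        -> WAVE DASH
  ["\u2225", "\u2016"], // PARALLEL TO            -> DOUBLE VERTICAL LINE
  ["\uFF0D", "\u2212"], // FULLWIDTH HYPHEN-MINUS -> MINUS SIGN
  ["\u2015", "\u2014"]  // HORIZONTAL BAR         -> EM DASH
]);

// Fold known platform variants to their canonical code points.
function canonicalize(text) {
  let out = "";
  for (const ch of text) {
    out += VARIANT_TO_CANONICAL.has(ch) ? VARIANT_TO_CANONICAL.get(ch) : ch;
  }
  return out;
}

// Compare two strings that may have been produced on different platforms.
function sameAcrossPlatforms(a, b) {
  return canonicalize(a) === canonicalize(b);
}
```

Any routine that compares `charCodeAt()` values would run both sides through `canonicalize` first, so the comparison no longer depends on which platform produced the string.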

All this is behind us now, and the LineBuzz code handles any character set in any language beautifully.