The reason the argument struck me as overly simplistic is that it assumes a "character" is the same thing in every language. But this is just not the case. I've been knee-deep in localization for the last year, and have learned this the hard way, looking at Russian strings that flow like waterfalls down the screen while their Chinese equivalents barely take a line.
Since the argument is about localizing an application, I figured the best way to approach the subject would be to do some elementary-school statistics on a localized application. Fortunately, I have one. This is a real application, used by millions of people internationally. It's in production. Localization has been done by localization "experts" (well, not always) and tested by localization testers. I have reason to believe that the translations are good, so I figure this is as good a sample as any.
The sample is a little small. This application has 542 strings, translated into 19 languages (14 if you collapse the two variants each of English, French, Spanish, Chinese, and Portuguese). I have access to another application that has 1800 strings, and I get similar results from it, but the data isn't quite as clean, so I'm using this one.
So I took these strings and converted them to utf-8, utf-16, and utf-32. Here's what we have, sorted left to right, smallest files to biggest.
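Reproducing the measurement takes almost no code. Here's a minimal sketch in Python; the strings/ directory and per-locale .txt files are my own invented layout, not how the app actually stores its resources:

```python
# Total byte count of each locale's strings under each encoding.
# The -le codecs are used so a BOM doesn't get counted.
from pathlib import Path

ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def encoded_sizes(path: Path) -> dict[str, int]:
    """Read a locale's strings and return their size under each encoding."""
    text = path.read_text(encoding="utf-8")
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}

# e.g. strings/ru.txt, strings/zh-CN.txt, strings/ja.txt ...
for locale_file in sorted(Path("strings").glob("*.txt")):
    print(locale_file.stem, encoded_sizes(locale_file))
```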
The first thing that should stand out is that the Asian locales have very small strings. The two Chinese variants need about a third as many characters as the average Western language to express a particular thought. That right there is most of the story. Yes, a Chinese character in utf-8 takes three times the bytes of a Roman character, but a Chinese word takes just about the same number of bytes as its Western counterpart. The Chinese version of your application doesn't need meaningfully more memory; the encoding itself is a wash.
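To make that concrete, here's a toy comparison; the sample strings below are mine, not taken from the app:

```python
# Illustrative only: fewer characters offset the heavier per-character cost.
english = "Settings"   # 8 Latin characters
chinese = "设置"        # 2 Han characters, same meaning

print(len(english.encode("utf-8")))  # 8 bytes: 1 byte per character
print(len(chinese.encode("utf-8")))  # 6 bytes: 3 bytes per character, but far fewer characters
```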
But there's some other interesting stuff:
When you hear the arguments against utf-8 for Asian languages, you get the impression that the space required for these characters will greatly increase. But with actual strings, that is not the case. Instead, the increase is marginal. Not double, but more like 10-15%. Why?
Well, because the Asian localizations are full of Latin characters. Here's an actual string from the Chinese localization files: "%1年 12月, ". Roman characters are used for numbers, for some trademarked names and for links. The Chinese file has the word "Facebook" in it.
This isn't Western arrogance. The Japanese file is much the same, and since I work for a Japanese company, I know the Japanese text isn't translated, it's written from scratch.
So if you go to utf-16, the Chinese characters take less space, but all the Roman characters double in size, eating up a lot of the gains.
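You can see it with the string quoted above:

```python
# Byte counts for the mixed Chinese string shown earlier (no BOM counted).
s = "%1年 12月, "
print(len(s.encode("utf-8")))      # 13 bytes: 7 ASCII characters at 1 byte + 2 Han characters at 3 bytes
print(len(s.encode("utf-16-le")))  # 18 bytes: every character, ASCII included, takes 2 bytes
```

For this particular string utf-16 actually comes out bigger, because most of its characters are ASCII.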
The language where a move to utf-16 actually pays off most is not Chinese, but Japanese:
This is because Japanese uses the phonetic Hiragana syllabary along with Chinese characters, and unfortunately these end up being three bytes each in utf-8. Even so, going to utf-16 from utf-8 only reduces memory usage by a quarter.
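Here's what that looks like on a small sample of my own (not a string from the app):

```python
# Hiragana and kanji both land in the range that needs 3 bytes in utf-8 but 2 in utf-16.
jp = "保存しますか"  # roughly "Save?": 2 kanji + 4 hiragana
print(len(jp.encode("utf-8")))      # 18 bytes (6 characters x 3 bytes)
print(len(jp.encode("utf-16-le")))  # 12 bytes (6 characters x 2 bytes)
```

On a pure-Japanese string that's a one-third saving; across the whole Japanese file, with the Latin content mixed back in, it comes out closer to the quarter mentioned above.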
It turns out that the language that really blows file size out of the water in utf-8 is not an Asian language, but Russian:
This is because Cyrillic requires two bytes per character in utf-8, and because Russian is an alphabetic, long-winded language, it uses more raw characters than most other languages. Then multiply by two. And utf-16 doesn't rescue it: Cyrillic is still two bytes per character there, and because Roman characters again show up here and there, utf-16 just makes things worse.
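A quick sample of my own (again, not a string from the app) shows why switching encodings buys Russian nothing:

```python
# Cyrillic is 2 bytes per character in utf-8 and 2 bytes per character in utf-16.
ru = "Сохранить изменения"  # "Save changes": 18 Cyrillic letters + 1 ASCII space
print(len(ru.encode("utf-8")))      # 37 bytes: 18 x 2 + 1
print(len(ru.encode("utf-16-le")))  # 38 bytes: 19 x 2
```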
So TL;DR - use utf-8 for everything.